mrphobos Posted October 5, 2011

I'm attempting to parse (if that's how you say it) txt data into XML with PHP and have come across a problem I've been unable to solve over the past two days, so I'm coming here to ask the PHP gods for their assistance. The file I'm parsing contains line after line of real estate data. I think I understand what I'm doing. I've gotten the data into a more usable format by replacing the tabs with spaces and so on. I am creating a series of if/then statements which will select the address out of each line, even though each line can be different. At the end of the address on each line there is an "S" character which stands for some property I'm not concerned with. I'm simply using the single 'S' to find the end of the address. The lines look like so:

403089 RESIDENTIAL Residential 385000 7610 N Lakeshore Dr. Harbor Springs S 3 2 0 None 3 Litzenburger, Boo Schaffer Real Estate
399562 RESIDENTIAL Condominium 155000 4749 Pleasantview Road Harbor Springs S 2 2 0 One Hartwick, Bob Coldwell Banker Schmidt

Each line has a bunch of extra text following that, which I've trimmed off for our purposes here. See the 'S' after the town? I've created the following code to look for the 'S' in relation to the word order.
<?php
// Listings file
$listings = file('listingsTest.txt');
$i = 0;
$j = 0;
$_ENV['a'] = 0;

foreach($listings as $value) {
    // Replace whitespace of every kind with single spaces
    $listings[$i] = preg_replace("'\s+'", ' ', $listings[$i]);
    // Put all characters into an array corresponding to each line in $listings
    $_ENV['chars'.$i] = preg_split('//', $listings[$i]);
    // Place all words and uninterrupted numbers in array $words
    $_ENV['words'.$i] = preg_split('/ /', $listings[$i]);
    $i++;
}

//echo $_ENV['chars'.'1']['1'];

foreach($_ENV['words'.$_ENV['a']] as $char){
    $countedf = preg_split('//', $_ENV['words'.$_ENV['a']][$j]);
    $counted = count($countedf) - 2;
    $wordBeforef = preg_split('//', $_ENV['words'.$_ENV['a']][$j-1]);
    $wordBefore = count($wordBeforef) - 2;
    $wordAfterf = preg_split('//', $_ENV['words'.$_ENV['a']][$j+1]);
    $wordAfter = count($wordAfterf) - 2;

    if( ($counted == 1)
        && ($wordAfter == 1)
        && (is_numeric($_ENV['words'.$_ENV['a']][$j+1]))
        //&& ($wordBefore == 1)
        //&& (!is_numeric($_ENV['words'.$_ENV['a']][$j]))
        //&& (is_numeric($_ENV['words'.$_ENV['a']][$j+2]))
        //&& ($_ENV['words'.$_ENV['a']][$j+3] == ' ' )
    ){
        echo '*';
        echo $_ENV['words'.$_ENV['a']][$j];
        echo '*';
        $_ENV['a']++;
        $j = 0; //$j=1
    }
    //echo $_ENV['chars'.$_ENV['a']][$j];
    $j++;
}
?>

As you can see from the if/then statements, I've gotten to the point where it's finding the 'S' at the end of the address, thus telling me where the address ends. I am, however, having a problem that I believe is a server issue. The code works fine when applied to 12 lines like the ones above. When I apply it to more of those lines, it does not return the 'S' for them, even if I use the exact same line more than 12 times. The main file which I'd like to automate the parsing of has thousands of such lines in it. If I try to apply this code to the file with these thousands of lines, the browser returns "The website encountered an error while retrieving http://localhost. It may be down for maintenance or configured incorrectly".

I take this to mean the server is doing too much work for it to be completed. I think when it reaches its twelfth line, the temporary memory of my program/server, or something else, is exhausted. I'm applying these if/then statements to every single word in the file. Is this a processing issue on the server? I was applying this code to every character in the file and thought I could fix the problem by applying it to every word instead, given there are fewer words than characters. I have the processing time on the server set to 10000 and it's not taking a long time to return the error message. I would be very grateful for any help any of you could provide. Thank you for your time.
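One possible reason the output stops after about a dozen lines, independent of server load: PHP evaluates the array expression of a `foreach` once, so the second loop runs only as many times as there are words in the first line, even though `$_ENV['a']` advances inside it. The sketch below is one way to apply the same word-order test to each line separately while reading the file one line at a time, so memory use stays flat. The helper names `splitLine`, `findMarkerIndex`, and `printMarkers` are hypothetical, and the rule (a single letter, preceded by a non-numeric word and followed by a numeric word) is taken from the description in this thread, not from a full view of the data.

```php
<?php
// Hypothetical helper: collapse all whitespace (tabs included) and split into words.
function splitLine($line)
{
    return preg_split('/\s+/', trim($line), -1, PREG_SPLIT_NO_EMPTY);
}

// Hypothetical helper: index of the one-letter marker word (e.g. "S") that is
// preceded by a non-numeric word and followed by a numeric word, or -1 if none.
function findMarkerIndex(array $words)
{
    $last = count($words) - 1;
    for ($j = 1; $j < $last; $j++) {
        if (strlen($words[$j]) === 1
            && ctype_alpha($words[$j])
            && !is_numeric($words[$j - 1])
            && is_numeric($words[$j + 1])) {
            return $j;
        }
    }
    return -1;
}

// Read the file line by line instead of loading every line into arrays.
function printMarkers($path)
{
    $handle = fopen($path, 'r');
    while (($line = fgets($handle)) !== false) {
        $words = splitLine($line);
        $idx = findMarkerIndex($words);
        if ($idx !== -1) {
            echo '*', $words[$idx], '*', "\n";
        }
    }
    fclose($handle);
}

// File name taken from the post above; skipped if the file is not present.
if (is_readable('listingsTest.txt')) {
    printMarkers('listingsTest.txt');
}
```

Because every line gets its own loop, the number of lines handled no longer depends on the length of the first line, and nothing is stored in `$_ENV`.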
PFMaBiSmAd Posted October 5, 2011

A couple of questions -

1) Are there always 4 fields (i.e. 403089 RESIDENTIAL Residential 385000) before the start of the address?
2) Is the 'S x y z' pattern at the end of the address always an S followed by 3 numbers?
3) How about addresses that contain an 'S' for South (i.e. 7610 S Lakeshore Dr. Harbor Springs)?

With those answers, someone can probably come up with a preg_match that will get the address using one statement.
PFMaBiSmAd Posted October 5, 2011

Assuming that you can read the whole file into memory (what your code is doing now with the file() statement), the following will match the address portion of each line -

<?php
// 403089 RESIDENTIAL Residential 385000 7610 N Lakeshore Dr. Harbor Springs S 3 2 0 None ...
$file = 'listingsTest.txt';
$string = file_get_contents($file);
// Four leading fields, then the address (captured), then 'S' and three single digits
$pattern = "/\d+\s+\w+\s+\w+\s+\d+\s+(.*?)\s+S\s+\d{1}\s+\d{1}\s+\d{1}/";
preg_match_all($pattern,$string,$matches);
echo '<pre>',print_r($matches[1],true),'</pre>';
?>
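For readers new to PCRE, here is the same pattern spelled out piece by piece. It is a sketch that uses the `x` modifier, which lets a pattern contain whitespace and `#` comments; the field labels in the comments are guesses based on the sample lines in this thread, and the sample string stands in for the file contents.

```php
<?php
$sample = '403089 RESIDENTIAL Residential 385000 7610 N Lakeshore Dr. Harbor Springs S 3 2 0 None 3';

$pattern = '/
    \d+ \s+           # first field: a run of digits (403089)
    \w+ \s+           # second field: one word (RESIDENTIAL)
    \w+ \s+           # third field: one word (Residential)
    \d+ \s+           # fourth field: a run of digits (385000)
    (.*?)             # the address, captured, as short as possible
    \s+ S \s+         # a standalone capital S after the address
    \d \s+ \d \s+ \d  # three single digits (3 2 0)
/x';

preg_match_all($pattern, $sample, $matches);
echo $matches[1][0], "\n";
```

The parenthesized part is the only capture group, which is why the addresses end up in `$matches[1]`. The "S" in "Springs" is not mistaken for the marker because the pattern requires whitespace on both sides of the S.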
mrphobos Posted October 5, 2011

1) Are there always 4 fields (i.e. 403089 RESIDENTIAL Residential 385000) before the start of the address?

No, some lines have more than one word for the second and third fields.

2) Is the 'S x y z' pattern at the end of the address always an S followed by 3 numbers?

No, it's always followed by numbers, but sometimes the S is an R or something else.

3) How about addresses that contain an 'S' for South (i.e. 7610 S Lakeshore Dr. Harbor Springs)?

Yes, I've run into this problem. That's what the other if/then statements are for. The value after the 'S' is always numeric and the value before it is never numeric.

I'm unfamiliar with the code you've posted.

$pattern = "/\d+\s+\w+\s+\w+\s+\d+\s+(.*?)\s+S\s+\d{1}\s+\d{1}\s+\d{1}/";

Is this line finding all the 'S's?
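Given those answers, a pattern along the following lines might cope with the variations. This is a sketch under assumptions that go beyond what the thread confirms: every line starts with a numeric id, the type fields are followed by an all-digit price, the marker is a single capital letter, and at least one all-digit word follows it. The function name `parseListing` and the array keys are hypothetical.

```php
<?php
// Hypothetical helper, not from the thread. Assumed layout of a line:
// id, one or more type words, price, address, one capital-letter marker, numbers.
function parseListing($line)
{
    $pattern = '/^(\d+)\s+(.+?)\s+(\d+)\s+(.+?)\s+([A-Z])\s+\d+(?=\s|$)/';
    // Normalize tabs and repeated spaces before matching.
    $line = trim(preg_replace('/\s+/', ' ', $line));
    if (!preg_match($pattern, $line, $m)) {
        return null; // line does not fit the assumed layout
    }
    return array(
        'id'      => $m[1],
        'type'    => $m[2],
        'price'   => $m[3],
        'address' => $m[4],
        'marker'  => $m[5],
    );
}

$sample = '399562 RESIDENTIAL Condominium 155000 4749 Pleasantview Road Harbor Springs S 2 2 0 One Hartwick, Bob Coldwell Banker Schmidt';
print_r(parseListing($sample));
```

A directional letter such as the S in "7610 S Lakeshore Dr." is skipped because the word after it is not numeric. An address containing a single capital letter followed by a bare number would still fool this pattern, so it is worth checking the results against a sample of real lines.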
mrphobos Posted October 6, 2011

Upon further reading, is using PCRE conditions really the easiest way to do this? It would appear I'll have to spend days learning the PCRE system. Can anyone confirm the CPU workload is what's causing the script to stop? In which case, wouldn't it be easier to do this part in Python?