Parsing txt file. Max number of operations stopping script.


mrphobos

I'm attempting to parse (if that's how you say it) txt data into XML with PHP and have come across a problem I've been unable to solve over the past two days, so I'm coming here to ask the PHP gods for their assistance.

 

The file I'm parsing contains line after line of real estate data. I understand what I'm doing, I think. I've gotten the data into a more usable format, taking out the tabs and replacing them with spaces and such. I am creating a series of if/then statements which will select the address out of each line even though each line can be different. At the end of the address on each line there is an "S" character which stands for some property I'm not concerned with. I'm simply using the single 'S' to find the end of the address. The lines look like so:

                                                                                                                                                                        \/

403089 RESIDENTIAL Residential 385000 7610 N Lakeshore Dr. Harbor Springs S 3 2 0 None 3 Litzenburger, Boo Schaffer Real Estate

399562 RESIDENTIAL Condominium 155000 4749 Pleasantview Road Harbor Springs S 2 2 0 One Hartwick, Bob Coldwell Banker Schmidt

 

With a bunch of extra text following that I've trimmed off for our purposes here. See the 'S' after the town? I've created the following code to look for the 'S' in relation to the word order.

 

<?php
// Listings file
$listings= file('listingsTest.txt');
$i = 0;
$j = 0;
$_ENV['a'] = 0;

foreach($listings as  $value) {
// Replace every run of whitespace with a single space
$listings[$i] = preg_replace("'\s+'", ' ', $listings[$i]);
// Split each line into an array of characters, one array per line in $listings
$_ENV['chars'.$i] = preg_split('//', $listings[$i]);

// Split each line into words and uninterrupted numbers, one array per line
$_ENV['words'.$i] = preg_split('/ /', $listings[$i]); 

$i++;
}

//echo $_ENV['chars'.'1']['1'];

foreach($_ENV['words'.$_ENV['a']] as $char){
$countedf = preg_split('//', $_ENV['words'.$_ENV['a']][$j]);
$counted = count($countedf) - 2;

$wordBeforef = preg_split('//', $_ENV['words'.$_ENV['a']][$j-1]);
$wordBefore = count($wordBeforef) - 2;

$wordAfterf = preg_split('//', $_ENV['words'.$_ENV['a']][$j+1]);
$wordAfter = count($wordAfterf) - 2;

if( ($counted == 1) 
&& ($wordAfter == 1)
&& (is_numeric($_ENV['words'.$_ENV['a']][$j+1]))
//&& ($wordBefore == 1)
//&& (!is_numeric($_ENV['words'.$_ENV['a']][$j]))
//&& (is_numeric($_ENV['words'.$_ENV['a']][$j+2]))
//&& ($_ENV['words'.$_ENV['a']][$j+3] == ' ' )
){
	echo '*';
	echo $_ENV['words'.$_ENV['a']][$j];
	echo '*';
	$_ENV['a']++;
	$j =0;
//$j=1
}
//echo $_ENV['chars'.$_ENV['a']][$j];

$j++;
}

?>

 

 

As you can see from the if/then statements, I've gotten to the point where it's echoing the 'S' at the end of the address, thus telling me where the address ends. I am, however, having a problem I believe is a server issue. The code works fine when applied to 12 lines like the ones above. When I apply it to more of those lines, it does not return the 'S' for the extra ones, even if I use the exact same line more than 12 times. The main file I'd like to automate the parsing of has thousands of such lines in it. If I try to apply this code to that file, the browser returns "The website encountered an error while retrieving http://localhost. It may be down for maintenance or configured incorrectly".

 

I take this to mean the server is doing too much work for it to be completed. I think when it reaches its twelfth line, the temporary memory of my program/server, or something else, is exhausted. I'm applying these if/then statements to every single word in the file. Is this a processing issue on the server? I was applying this code to every character in the file and thought I could fix the problem by applying it to every word instead, given there are fewer words than characters. I have the processing time on the server set to 10000 and it's not taking a long time to return the error message.

 

I would be very grateful for any help any of you could provide. Thank you for your time.

Link to comment
Share on other sites

A couple of questions -

 

1) Are there always 4 fields (i.e. 403089  RESIDENTIAL  Residential  385000) before the start of the address?

 

2) Is the 'S  x  y  z' pattern at the end of the address always an S followed by 3 numbers?

 

3) How about addresses that contain an 'S' for South (i.e. 7610 S Lakeshore Dr.  Harbor Springs)?

 

With those answers, someone can probably come up with a preg_match that will get the address using one statement.

Link to comment
Share on other sites

Assuming that you can read the whole file into memory (what your code is doing now with the file() statement), the following will match the address portion of each line -

 

<?php
// 403089   RESIDENTIAL   Residential   385000   7610 N Lakeshore Dr.   Harbor Springs   S   3   2   0   None  ...

$file  = 'listingsTest.txt';
$string = file_get_contents($file);

$pattern = "/\d+\s+\w+\s+\w+\s+\d+\s+(.*?)\s+S\s+\d{1}\s+\d{1}\s+\d{1}/";

preg_match_all($pattern,$string,$matches);

echo '<pre>',print_r($matches[1],true),'</pre>';


?>
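
For anyone reading the pattern for the first time, here is a minimal sketch of the same idea applied to a single line, written with the `x` modifier so each piece can carry a comment. The sample line is copied from the first post; the variable names are only for illustration, and the sketch assumes the layout the pattern itself assumes (one word each for the second and third fields, a literal S, then three single digits).

```php
<?php
// Sample line copied from the first post (trailing text trimmed).
$line = '403089 RESIDENTIAL Residential 385000 7610 N Lakeshore Dr. Harbor Springs S 3 2 0 None 3';

// The x modifier lets the pattern contain whitespace and # comments.
$pattern = '/
    \d+ \s+          # listing id
    \w+ \s+          # second field, assumed to be one word
    \w+ \s+          # third field, assumed to be one word
    \d+ \s+          # price
    (.*?)            # address: captured, as few characters as possible
    \s+ S \s+        # the lone S that ends the address
    \d \s+ \d \s+ \d # three single-digit fields after the S
/x';

$found = preg_match($pattern, $line, $m);
$address = ($found === 1) ? $m[1] : null;

echo $address, PHP_EOL; // prints: 7610 N Lakeshore Dr. Harbor Springs
```

Because the S must be followed by three single digits, an S that means South inside an address (followed by a street name) does not end the capture early.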

 

Link to comment
Share on other sites

1) Are there always 4 fields (i.e. 403089  RESIDENTIAL  Residential  385000) before the start of the address?

No, some lines have more than one word for the second and third fields.

 

2) Is the 'S  x  y  z' pattern at the end of the address always an S followed by 3 numbers?

No, it's always followed by numbers, but sometimes the S is an R or something else.

 

3) How about addresses that contain an 'S' for South (i.e. 7610 S Lakeshore Dr.  Harbor Springs)?

Yes, I've run into this problem. That's what the other if/then statements are for. The value after the 'S' is always numeric and the value before is never numeric.
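
The rule in these three answers (a single-letter marker, never preceded by a number, always followed by one) can be sketched as a word-by-word scan of one line. This is only a rough sketch: `extractAddress` is a made-up helper name, and it additionally assumes that the price is the first all-digit word after the listing id, which the thread does not confirm. The second line in the usage note is invented to exercise multi-word fields and an R marker.

```php
<?php
// Hypothetical helper, not from the thread: returns the address or null.
// Assumptions: the price is the first all-digit word after the listing id;
// the marker is one uppercase letter whose previous word is not numeric
// and whose next word is numeric.
function extractAddress(string $line): ?string
{
    $tokens = preg_split('/\s+/', trim($line));
    $count = count($tokens);

    // Locate the price: first all-digit word after the id at index 0.
    $price = null;
    for ($i = 1; $i < $count; $i++) {
        if (preg_match('/^\d+$/', $tokens[$i]) === 1) {
            $price = $i;
            break;
        }
    }
    if ($price === null) {
        return null;
    }

    // Look for the marker; start two past the price so the address is never empty.
    for ($i = $price + 2; $i < $count - 1; $i++) {
        if (preg_match('/^[A-Z]$/', $tokens[$i]) === 1
            && !is_numeric($tokens[$i - 1])
            && is_numeric($tokens[$i + 1])) {
            // Everything between the price and the marker is the address.
            return implode(' ', array_slice($tokens, $price + 1, $i - $price - 1));
        }
    }
    return null;
}

echo extractAddress('403089 RESIDENTIAL Residential 385000 7610 N Lakeshore Dr. Harbor Springs S 3 2 0 None 3'), PHP_EOL;
```

On the first sample line this gives `7610 N Lakeshore Dr. Harbor Springs`. On an invented line such as `401234 VACANT LAND Vacant Land 95000 12 S Pine St. Petoskey R 0 0 0` it gives `12 S Pine St. Petoskey`: the S after `12` is skipped because the word before it is numeric, and the R is accepted as the marker.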

 

I'm unfamiliar with the code you've posted.

 

$pattern = "/\d+\s+\w+\s+\w+\s+\d+\s+(.*?)\s+S\s+\d{1}\s+\d{1}\s+\d{1}/";

 

Is this line finding all the 'S'es?

Link to comment
Share on other sites

Upon further reading, is using PCRE conditions really the easiest way to do this? It would appear I'll have to spend days learning the PCRE system. Can anyone confirm the CPU workload is what's causing the script to stop? In which case, wouldn't it be easier to do this part in Python?

Link to comment
Share on other sites
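
On the CPU question, one thing worth ruling out first is the shape of the second loop in the original script. `foreach` evaluates its array expression once, so `foreach($_ENV['words'.$_ENV['a']] as $char)` only ever runs as many times as there are words in the first line, no matter how often `$_ENV['a']` is incremented inside the body. That may be why the output stops after about a dozen lines, though it is only a possible explanation, not a confirmed one. A minimal sketch with made-up data (`$wordsPerLine` stands in for the `$_ENV['words'.N]` arrays):

```php
<?php
// Hypothetical stand-in for the per-line word arrays in the original script.
$wordsPerLine = [
    ['403089', 'Harbor', 'Springs', 'S', '3'],
    ['399562', 'Harbor', 'Springs', 'S', '2'],
    ['400001', 'Petoskey', 'R', '4'],
];

// Shape of the original loop: the array expression is read once, so bumping
// $a inside the body does not make foreach move on to another line.
$a = 0;
$outerIterations = 0;
foreach ($wordsPerLine[$a] as $unused) {
    $outerIterations++;
    $a++;
}
echo $outerIterations, PHP_EOL; // prints: 5 (the word count of line 0 only)

// Nested loops give every line its own full pass over its own words.
$visited = 0;
foreach ($wordsPerLine as $words) {
    foreach ($words as $word) {
        $visited++;
    }
}
echo $visited, PHP_EOL; // prints: 14 (5 + 5 + 4)
```

This would not by itself explain the error page on the full file; the PHP or web server error log is the place to look for the actual message behind that.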
