Drongo_III
Posted February 6, 2011

Hi,

I'm learning PHP and trying to write a script to extract registration information from a large text file. Sadly my meagre knowledge of PHP is letting me down a bit. It's a case of knowing what I want the script to do but not having the knowledge of how to 'say it'. So I was hoping that if I posted my code here, someone could either give me a few pointers on where I am going wrong or suggest a better way.

The text file data luckily has a recurring format as follows (for brevity I've only included one entry, which contains made-up information):

From: bella_done@yahoo.co.uk
Sent: 02 February 2011 22:50
To: Jonny tum, patsy fells, dingly bongo
Subject: Subject: Fun Run 2010
Categories: Fun Run
Name: Bella Donna
Address: 14 brondle avenue
Postcode: cd83 1rg
Phone: 0287343510
Email: bella_don@yahoo.co.uk
DOB: 15/11/1945
Half or Full: Full fun run
How did you hear: Took part in 2010

As you can see, the data has a convenient boundary at the 'From' field and at the colon (or so it occurred to me), so I created my script as follows:

// the string being analysed
$the_string = "
From: bella_done@yahoo.co.uk
Sent: 02 February 2011 22:50
To: Jonny tum, patsy fells, dingly bongo
Subject: Subject: Fun Run 2010
Categories: Fun Run
Name: Bella Donna
Address: 14 brondle avenue
Postcode: cd83 1rg
Phone: 0287343510
Email: bella_don@yahoo.co.uk
DOB: 15/11/1945
Half or Full: Full fun run
How did you hear: Took part in 2010";

// remove all formatting to work with a clean string
$clean_string = strip_tags($the_string);

// remove form field entries from the data and replace with commas and a ZZZ boundary
$remove_fields = array(
  "Categories:" => "",
  "Name:" => ",",
  "Address:" => ",",
  "Postcode:" => ",",
  "Phone:" => ",",
  "Email:" => ",",
  "DOB:" => ",",
  "Half or Full:" => ",",
  "How did you hear:" => ",",
  "From:" => "ZZZ",
  "Sent:" => ",",
  "To:" => ",",
);
$new_string = strtr("$clean_string", $remove_fields);

// split the data at the boundary ZZZ
$string_to_array = explode("ZZZ", $new_string);
$new_string2 = implode("</br>", $string_to_array);
echo $new_string2;

$myFile = "address_list.csv";
$fh = fopen($myFile, 'w') or die("can't open file");
$stringData = $new_string2;
fwrite($fh, $stringData);
fclose($fh);

One major problem is that when I write the new data to a CSV file, the CSV contains spacing that causes it to be laid out in a single column rather than as separate fields at each comma boundary.

So can anyone suggest either:

a) a better way of extracting the data from the text file (it doesn't need to be 100% clean and perfect), or
b) how I can stop the spaces in the CSV (I thought I would have fixed this when I stripped the tags from the string at the start?).

Any help would be gratefully received by a newbie PHPer. It's my first shot at anything moderately taxing, so if I've made some glaring oversights I would very much welcome your wisdom!

Thank you
Drongo
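On point (b): strip_tags() only removes HTML and PHP tags, so the line breaks between fields are still in the string when it is written out, and the records are joined with the literal text "</br>", which a CSV reader does not treat as a line break. A minimal sketch of flattening one record, assuming one field per line as in the sample above; the shortened $record and the variable names are made up for illustration:

```php
<?php
// Hypothetical sample: one shortened record with one field per line.
$record = "Name: Bella Donna\nAddress: 14 brondle avenue\nPostcode: cd83 1rg";

// strip_tags() leaves the newlines in place because they are not tags.
$stripped = strip_tags($record);
$still_has_newlines = (strpos($stripped, "\n") !== false);

// Collapse every run of whitespace (including newlines) into a single space.
$flattened = trim(preg_replace('/\s+/', ' ', $stripped));

var_dump($still_has_newlines);
echo $flattened, "\n";
```

After this step each record sits on one line, so joining records with "\n" rather than "</br>" should give one CSV row per registrant.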
ChemicalBliss
Posted February 6, 2011

I would highly recommend learning regex syntax and practising with preg_match_all. If you do learn regex, learn the PCRE (Perl-compatible) flavour, as that's what PHP's preg functions use.

For example:

preg_match_all("#([a-z 0-9]+)\:(.+)#i", $email, $matches);
print_r($matches);

This is a very simple regex that captures 2 sub-patterns. The first is the start of each line (1 or more of: space, a-z, 0-9), as long as it is followed by a colon (:). The second captures anything after the colon up to a new line (\n etc).

Hope this helps. Good luck.
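To see what the two sub-patterns hold, here is a small sketch, assuming the intended pattern is "label, colon, rest of line" as described above. The two-line $email sample and the $labels/$values names are made up for illustration:

```php
<?php
// Hypothetical two-line sample in the "Label: value" layout from the first post.
$email = "Name: Bella Donna\nPostcode: cd83 1rg";

// Group 1: the label (letters, digits, spaces). Group 2: the rest of the line.
preg_match_all('#([a-z 0-9]+):(.+)#i', $email, $matches);

// $matches[1] holds every label and $matches[2] every value, in matching order.
$labels = array_map('trim', $matches[1]);
$values = array_map('trim', $matches[2]);

print_r($labels);
print_r($values);
```

Because "." does not match a newline by default, the second group stops at the end of each line, which is what keeps the pairs separate.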
lastkarrde
Posted February 7, 2011

I've found the website txt2re extremely helpful in generating regular expressions. You should take a look at it.
Drongo_III (Author)
Posted February 7, 2011

Hi Chemical,

That's really useful to know. I've seen regular expression syntax in code before but never really knew what it was or how to use it. Now that I've read a tiny bit about them I can see I was missing out on a massively powerful tool! Though it looks like there is lots to learn about regular expressions... the old adage "the more you learn, the more you realise you know nothing at all" is leaping to mind!

However, to rein this back to more basic terms, I could use a little more help. The regular expression with preg_match_all will create a very clean array of registration data. But the reason I used explode() and split the string at the boundary "ZZZ" was to end up with a single full string of registration data for each array entry (i.e. each array element would contain one person's registration data). This was easy to implode() with a line break to create a CSV, i.e. one with each line representing one registrant (though my CSV didn't come out quite as expected).

The bit I'm now confused about is how to take the shiny preg_match_all array and turn it into a CSV with each line holding one registrant's data. It's complicated a little more by the fact that some of the registration data contains two address lines and some doesn't contain any. Could you insert a special marker (like my ZZZ) at the start and end of each registration entry and then implode the array only at that boundary?

Any suggestions on how to overcome this would be so, so appreciated!
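One way to get one chunk per registrant without inventing a marker is to split the text wherever a line starts with "From:". This is only a sketch under that assumption; the shortened $text sample and the $records name are hypothetical:

```php
<?php
// Hypothetical sample: two shortened records, each starting with a "From:" line.
$text = "From: a@example.com\nName: Ann\nFrom: b@example.com\nName: Bob";

// Split just before each "From:" at the start of a line.
// The lookahead (?=...) matches the position only, so the label stays in the record.
$records = preg_split('/^(?=From:)/m', $text, -1, PREG_SPLIT_NO_EMPTY);
$records = array_map('trim', $records);

print_r($records);
```

Each element of $records is then one person's full entry, much like the pieces explode("ZZZ", ...) was meant to produce.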
ChemicalBliss
Posted February 7, 2011

Give us an example of a few entries with as much variation as there will be. The more entries, the more accurate my regex will be.

Also, making a CSV is very simple and I can tell you how to do it after we jump this regex hurdle (as the data structure will change, and so will the method used to create the CSV file).

Hope this helps.
Drongo_III (Author)
Posted February 7, 2011

Hi Chem,

Thanks for sticking with this one! I'm learning lots as we go.

The format of the data runs as follows, with the exact spacing as you see it (all data here is made up):

From: lenny@gmail.com
Sent: 07 September 2010 21:58
To: bilbo bagins, sam gamgee, billy bob
Subject: Subject: fun run 2011 - Registered interest
Follow Up Flag: Follow up
Flag Status: Completed
Name: Lenny Davis
Address: Ground Floor 500 High Street
Postcode: xs34 4fg
Phone: 034343554335
Email: lenny@gmail.com
DOB: 11.03.86
Half or Full: full run
How did you hear: Took part in the Full fun run 2010

From: beth@aol.com
Sent: 07 September 2010 18:58
To: bilbo bagins, sam gamgee, billy bob
Subject: Subject: fun run 2011 - Registered interest
Follow Up Flag: Follow up
Flag Status: Completed
Name: Beth Collagin
Address: 76 Commercial place, merthyr road, Caerphilly
Postcode: Ce34 2vB
Phone: 0423433423424
Email: beth@aol.com
DOB:
Half or Full: full run
How did you hear: Took part in the Full run 2010

From: nick@googlemail.com
Sent: 07 September 2010 17:59
To: bilbo bagins, sam gamgee, billy bob
Subject: Subject: fun run 2011 - Registered interest
Follow Up Flag: Follow up
Flag Status: Completed
Name: nic jones
Address: 92 Drury grove
Postcode: cr3 2vu
Phone: 077434342354
Email: nick@googlemail.com
DOB: 13/12/1973
Half or Full: full run
How did you hear: Took part in the Full run 2010

Things to note. The fields that read:

Follow Up Flag: Follow up
Flag Status: Completed

only appear on some records. On other records these fields are absent, but the format remains exactly the same (only without those fields). I am happy to remove all instances of those fields as they have no value. The same goes for the "To:", "Sent:" and "Subject:" fields. I'm not sure if that will make the regex any easier. If removing those fields makes things harder then I don't mind if they are left in.

I hope this helps, and thanks again! I will hopefully learn lots from your example!
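Dropping the fields with no value does not have to happen in the regex; it can be done after parsing. A minimal sketch, assuming each record ends up as an array keyed by field label. The $record, $unwanted and $kept names are made up for illustration:

```php
<?php
// Hypothetical parsed record, keyed by field label.
$record = array(
    'From'        => 'lenny@gmail.com',
    'Sent'        => '07 September 2010 21:58',
    'Flag Status' => 'Completed',
    'Name'        => 'Lenny Davis',
);

// Labels named in the post as having no value for the export.
$unwanted = array('Sent', 'To', 'Subject', 'Follow Up Flag', 'Flag Status');

// array_flip() turns the labels into keys so array_diff_key() can drop them.
$kept = array_diff_key($record, array_flip($unwanted));

print_r($kept);
```

Records that never had those fields pass through unchanged, so the optional "Follow Up Flag" lines need no special handling.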
ChemicalBliss
Posted February 8, 2011

I'll admit I got bored.

<?php
$email = "From: lenny@gmail.com
Sent: 07 September 2010 21:58
To: bilbo bagins, sam gamgee, billy bob
Subject: Subject: fun run 2011 - Registered interest
Follow Up Flag: Follow up
Flag Status: Completed
Name: Lenny Davis
Address: Ground Floor 500 High Street
Address2: test address 2
Postcode: xs34 4fg
Phone: 034343554335
Email: lenny@gmail.com
DOB: 11.03.86
Half or Full: full run
How did you hear: Took part in the Full fun run 2010

From: beth@aol.com
Sent: 07 September 2010 18:58
To: bilbo bagins, sam gamgee, billy bob
Subject: Subject: fun run 2011 - Registered interest
Follow Up Flag: Follow up
Flag Status: Completed
Name: Beth Collagin
Address: 76 Commercial place, merthyr road, Caerphilly
Postcode: Ce34 2vB
Phone: 0423433423424
Email: beth@aol.com
DOB:
Half or Full: full run
How did you hear: Took part in the Full run 2010

From: nick@googlemail.com
Sent: 07 September 2010 17:59
To: bilbo bagins, sam gamgee, billy bob
Subject: Subject: fun run 2011 - Registered interest
Follow Up Flag: Follow up
Flag Status: Completed
Name: nic jones
Address: 92 Drury grove
Postcode: cr3 2vu
Phone: 077434342354
Email: nick@googlemail.com
DOB: 13/12/1973
Half or Full: full run
How did you hear: Took part in the Full run 2010";

// Capture "Label: value" pairs, one per line.
preg_match_all("#([a-z 0-9]+)\:(.+)#i", $email, $matches);

// First we sort it into different people/emails
$user_id = 0;
$user_emails = array(); // Will hold all the data
$fields = array(); // Used for starting the CSV file; holds all the different fields used in the text file.

// Iterate over each match
for ($i = 0; $i < count($matches[0]); $i++) {
    // Set some variables to make it easier to work out (trim() removes whitespace padding)
    $field = trim($matches[1][$i]);
    $data = trim($matches[2][$i]);

    // Check or add field to fields array
    if (!in_array($field, $fields)) {
        $fields[] = $field;
    }

    // Check if the current field is "From" (strtolower() lowercases the string so it can match any case)
    if (strtolower($field) == "from") {
        $user_id++;
    }

    // Add item to the user_emails array
    $user_emails[$user_id][$field] = $data;
}

// Because we are using a variable number of fields we have to create the CSV contents dynamically.
// We start with a template of all fields available:
$template = array_flip($fields); // Now keys have been switched with their values
for ($i = 0; $i < count($fields); $i++) {
    // Null the values so we don't get the indexes as values.
    $template[$fields[$i]] = null;
}

$csv_file = '"'.implode('","', $fields).'"'."\n"; // the CSV header line (the field names)

// Now we loop over each user
for ($i = 1; $i <= count($user_emails); $i++) {
    $content = $template; // a copy, so we don't overwrite our template
    $content = array_merge($content, $user_emails[$i]);
    $csv_file .= '"'.implode('","', $content).'"'."\n";
}

echo($csv_file);

// Save as CSV file:
$h = fopen("test.csv", "w");
fwrite($h, $csv_file);
fclose($h);
?>

Take note of what's going on.

By the way, it was easier in my opinion to use the same regex and some PHP code to separate the different users, rather than doing it in the regex itself, which I doubt is even possible in this scenario since you need multi-dimensional arrays. Added the CSV output as well.
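The implode() approach wraps every value in double quotes but does not escape a double quote that appears inside a value. PHP's built-in fputcsv() takes care of that quoting. A sketch of the alternative, not what the script above does; the $rows data (including the quoted word in the address) is made up to show the escaping, and php://memory stands in for a real file path:

```php
<?php
// Hypothetical rows: a header row plus one record whose address
// contains commas and a double quote.
$rows = array(
    array('Name', 'Address'),
    array('Beth Collagin', '76 Commercial place, "merthyr" road'),
);

// php://memory keeps the sketch self-contained; a real script would open a .csv path.
$h = fopen('php://memory', 'w+');
foreach ($rows as $row) {
    // Delimiter, enclosure and escape character are passed explicitly.
    fputcsv($h, $row, ',', '"', '\\');
}
rewind($h);
$csv = stream_get_contents($h);
fclose($h);

echo $csv;
```

In the script above, the same loop could write the $fields header row followed by each merged $content row.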
litebearer
Posted February 8, 2011

Grrrrr, typing with 2 fingers is sooooooooooooooo slow!

Perhaps an alternative (example in action: http://www.nstoia.com/sat/getfields/getfields00.php):

<?PHP
/* replace with your file's name */
$file = "getfields.txt";
$text = file_get_contents($file);

/* the pipe will act as the element separator */
$needle = "From: ";
$text = str_replace($needle, "|", $text);

/* the ~ will act as a field separator */
$needle_array = array("Subject: Subject:", "Sent: ", "To: ", "Name: ", " Address: ", " Postcode: ", " Phone: ", " Email: ", " DOB:", " Half or Full: ", " How did you hear: ");
$num = count($needle_array);
for ($i = 0; $i < $num; $i++) {
    $text = str_replace($needle_array[$i], "~", $text);
}
$text_array = explode("|", $text);

/* the first element is empty, so remove it */
array_shift($text_array);

echo "<PRE>";
print_r($text_array);
echo "</pre>";
?>
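Each element of $text_array is still one long string with "~" between the values. A possible next step, sketched under the assumption that no value itself contains a "~"; the $element sample and the $fields, $quoted and $line names are hypothetical:

```php
<?php
// Hypothetical element in the shape the script above would leave after its replacements.
$element = "lenny@gmail.com ~07 September 2010 21:58 ~Lenny Davis ~xs34 4fg";

// Split on the field separator and trim stray spaces/newlines from each value.
$fields = array_map('trim', explode('~', $element));

// Join as a quoted CSV line, doubling any embedded double quotes.
$quoted = array();
foreach ($fields as $value) {
    $quoted[] = '"' . str_replace('"', '""', $value) . '"';
}
$line = implode(',', $quoted);

echo $line, "\n";
```

Running this over every element of $text_array and joining the lines with "\n" would give one CSV row per registrant.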
Drongo_III (Author)
Posted February 8, 2011

Thank you Chem! It works like a dream! And I'm quite in awe of the skills. I won't pretend I fully understand everything you did, but your comments help a lot.

Thanks to you too, litebearer. I'm flooded with options.

Right, I'm going to sit down with this script and work out what each part is doing...
ChemicalBliss
Posted February 9, 2011

I'm glad it works for you.

If you really are interested in finding out exactly how it works, use print_r($array); ($array should be the name of any array variable, such as $matches). I would do print_r() on $matches, $user_emails (after the loop) and $fields (after the loop).

The first loop does a couple of things:

1. Creates an array of unique "field" names that are in the actual text file. This means that if a field is missing from every entry, that field will be missing from the CSV as well (whatever that field is, Address2 perhaps).
2. Adds each email from each person (a field at a time).

Hope this helps.
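The template step is what keeps the columns lined up when one person lacks a field that another person has. A small sketch of that idea, using array_fill_keys() in place of the flip-and-null loop; the $fields list and $record are shortened, made-up stand-ins for the real data:

```php
<?php
// Hypothetical field list and one record that lacks "Address2".
$fields = array('From', 'Name', 'Address', 'Address2');
$record = array(
    'From'    => 'beth@aol.com',
    'Name'    => 'Beth Collagin',
    'Address' => '76 Commercial place',
);

// Every known field starts out empty...
$template = array_fill_keys($fields, '');

// ...and the record's own values overwrite the blanks, keeping the column order.
$row = array_merge($template, $record);

print_r($row);
```

The missing Address2 comes out as an empty column rather than shifting the later values to the left.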