
Extracting data from a large text file...


Drongo_III


Hi

 

I'm learning PHP and trying to write a script to extract registration information from a large text file. Sadly, my meagre knowledge of PHP is letting me down a bit. It's a case of knowing what I want the script to do but not having the knowledge of how to 'say it'.

 

So I was hoping that if I posted my code here, someone could either give me a few pointers on where I am going wrong or suggest a better way.

 

Luckily, the text file data has a recurring format as follows (for brevity I've only included one entry, which contains made-up information):

 

 

 

From: bella_done@yahoo.co.uk

Sent: 02 February 2011 22:50

To: Jonny tum, patsy fells, dingly bongo

Subject: Subject: Fun Run 2010

 

Categories: Fun Run

 

Name: Bella Donna

Address: 14 brondle avenue

Postcode: cd83 1rg

Phone: 0287343510

Email: bella_don@yahoo.co.uk

DOB: 15/11/1945

Half or Full: Full fun run

How did you hear: Took part in 2010

 

As you can see, the data has a convenient boundary at the 'From' field and at each colon (or so it occurred to me), so I created my script as follows:

 


// the string being analysed
$the_string = "
From:	bella_done@yahoo.co.uk
Sent:	02 February 2011 22:50
To:	Jonny tum, patsy fells, dingly bongo
Subject:	Subject: Fun Run 2010

Categories:	Fun Run

Name: Bella Donna 
Address: 14 brondle avenue 
Postcode: cd83 1rg 
Phone: 0287343510 
Email: bella_don@yahoo.co.uk 
DOB: 15/11/1945 
Half or Full: Full fun run 
How did you hear: Took part in 2010";

// remove all formatting to work with a clean string

$clean_string = strip_tags($the_string);



// remove form field entries from the data and replace with commas and a ZZZ boundary

$remove_fields = array("Categories:" => "","Name:" => ",","Address:" => ",","Postcode:" => ",","Phone:" => 

",","Email:" => ",","DOB:" => ",","Half or Full:" => ",","How did you hear:" => ",","From:" => "ZZZ","Sent:" => 

",","To:" => ",",  );


$new_string = strtr("$clean_string",$remove_fields);


// split the data at the boundary ZZZ

$string_to_array = explode("ZZZ", $new_string);

$new_string2 = implode("</br>",$string_to_array);

echo $new_string2; 




$myFile = "address_list.csv";
$fh = fopen($myFile, 'w') or die("can't open file");
$stringData = $new_string2;
fwrite($fh, $stringData);
fclose($fh);


 

One major problem is that when I write the new data to a CSV file, the file contains spacing that causes the data to appear in column form rather than as separate fields at each comma boundary.

 

So can anyone suggest either

 

a) A better way of extracting the data from the text file (it doesn't need to be 100% clean and perfect)

 

b) How can I stop the spaces in the CSV? (I thought I would have fixed this when I stripped the tags from the string at the start.)
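
On question b), a note that may help: strip_tags() only removes HTML and PHP tags, so the newlines and trailing spaces in the string survive it, and the newlines are most likely what pushes each field onto its own row in the CSV. The implode() with "</br>" also writes HTML into the file, where a plain "\n" would be the usual record separator. The snippet below is only a minimal sketch of collapsing that whitespace, using a shortened made-up sample; the variable names are illustrative, not part of the original script.

```php
<?php
// Shortened, made-up sample in the same layout as the string in the question.
$raw = "Name: Bella Donna \nAddress: 14 brondle avenue \nPostcode: cd83 1rg";

// strip_tags() only removes HTML/PHP tags, so plain text passes through unchanged.
$stripped = strip_tags($raw);

// Swap the labels for commas, as the question's script does.
$line = strtr($stripped, array("Name:" => "", "Address:" => ",", "Postcode:" => ","));

// Collapse every run of whitespace (newlines included) into one space,
// then drop the spaces left on either side of each comma.
$line = trim(preg_replace('/\s+/', ' ', $line));
$line = preg_replace('/\s*,\s*/', ',', $line);

echo $line, "\n";
```

This still breaks if a value itself contains a comma (an address, for instance), which is one reason to quote the fields when writing the CSV.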

 

Any help would be gratefully received by a newbie PHPer.

 

It's my first shot at anything moderately taxing, so if I've made some glaring oversights I would very much welcome your wisdom!

 

Thank you

 

Drongo


I would highly recommend learning regex syntax and practising with preg_match_all. If you do learn regex, learn the PCRE (Perl-compatible) flavour, as that's what PHP's preg functions use.

 

For example:

 

preg_match_all("#([a-z 0-9]+):(.*)#i",$email,$matches);
print_r($matches);

 

This is a very simple regex that captures 2 sub-patterns:

The first sub-pattern is the first part of each line (1 or more of: [space], a-z, 0-9), as long as it is followed by a colon (:).

The second sub-pattern captures anything after the colon, up until a new line (\n etc).
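
To make the two sub-patterns concrete, here is a small sketch on a made-up three-line sample. It assumes the intended pattern is "label, colon, rest of line" and adds the ^ and $ anchors with the m modifier so that each match starts at the beginning of a line; that anchoring is my assumption, not part of the pattern above.

```php
<?php
// Three made-up lines in the "Label: value" layout from the thread.
$sample = "Name: Bella Donna\nDOB: \nHalf or Full: Full fun run";

// Group 1: the label (letters, digits, spaces). Group 2: the rest of the line.
// i = case-insensitive, m = ^ and $ match at every line boundary.
preg_match_all('#^([a-z 0-9]+):(.*)$#im', $sample, $matches);

// $matches[1] holds every label, $matches[2] every value, in file order.
$labels = array_map('trim', $matches[1]);
$values = array_map('trim', $matches[2]);

print_r(array_combine($labels, $values));
```

Using (.*) rather than (.+) for the value means a line such as "DOB:" with nothing after the colon still matches and simply yields an empty value.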

 

hope this helps

Good Luck


Hi chemical

 

That’s really useful to know. I’ve seen regular expression syntax in code before but never really knew what it was or how to use it. Now that I’ve read a tiny bit about them, I can see I was missing out on a massively powerful tool! Though it looks like there is lots to learn about regular expressions… the old adage “the more you learn, the more you realise you know nothing at all” is leaping to mind!

 

However, to rein this back to more basic terms, I could use a little more help.

 

You see, the regular expression with preg_match_all will create a very clean array of registration data. But the reason I used the explode() function and split the string at the boundary “ZZZ” was to end up with a single full string of registration data for each array entry (i.e. each array element would contain one person’s registration data). This was easy to implode() with a line break to create a CSV, i.e. a CSV with each line representing one registrant (though my CSV didn’t come out quite as expected).

 

The bit I’m now confused about is how you use the shiny preg_match array and turn this into a CSV with each line displaying registration data.

 

It’s complicated a little more by the fact that some of the registration data contains two address lines and some don’t contain any.

 

Could you input a special marker (like my ZZZ) at the start and end of each registration entry and then implode the array only at that boundary?

 

Any suggestions on how to overcome this would be so, so appreciated!

 

 


Give us an example of a few entries with as much variation as there will be.

The more entries you give, the more accurate my regex will be.

 

Also, making a CSV is very simple, and I can tell you how to make it after we jump this regex hurdle (as the data structure will change, and so will the method used to create the CSV file).

 

hope this helps


Hi Chem

 

Thanks for sticking with this one! I'm learning lots as we go.

 

The format of the data runs as follows with the exact spacing as you see it (all data here is made up.)

 

From: lenny@gmail.com

Sent: 07 September 2010 21:58

To: bilbo bagins, sam gamgee, billy bob

Subject: Subject: fun run 2011 - Registered interest

 

Follow Up Flag: Follow up

Flag Status: Completed

 

Name: Lenny Davis

Address: Ground Floor 500 High Street

Postcode: xs34 4fg

Phone: 034343554335

Email: lenny@gmail.com

DOB: 11.03.86

Half or Full: full run 

How did you hear: Took part in the Full fun run 2010

From: beth@aol.com

Sent: 07 September 2010 18:58

To: bilbo bagins, sam gamgee, billy bob

Subject: Subject: fun run 2011 - Registered interest

 

Follow Up Flag: Follow up

Flag Status: Completed

 

Name: Beth Collagin

Address: 76 Commercial place, merthyr road, Caerphilly

Postcode: Ce34 2vB

Phone: 0423433423424

Email: beth@aol.com

DOB: 

Half or Full: full run 

How did you hear: Took part in the Full run 2010

From: nick@googlemail.com

Sent: 07 September 2010 17:59

To: bilbo bagins, sam gamgee, billy bob

Subject: Subject: fun run 2011 - Registered interest

 

Follow Up Flag: Follow up

Flag Status: Completed

 

Name: nic jones

Address: 92 Drury grove

Postcode: cr3 2vu

Phone: 077434342354

Email: nick@googlemail.com

DOB: 13/12/1973

Half or Full: full run

How did you hear: Took part in the Full run 2010

 

 

Things to note:

 

The fields that read:

 

Follow Up Flag: Follow up

Flag Status: Completed

 

These only appear on some records. On other records these fields are absent but the format remains exactly the same (only without those fields). I am happy to remove all instances of those fields as they have no value. Same goes for the "To:" field, the "Sent:" field and the "Subject:" field. I'm not sure if that will make the REGEX any easier. If removing those fields makes things harder then I don't mind if they are left in.

 

I hope this helps and thanks again! I will hopefully learn lots from your example!

 

 

 

 


I'll admit I got bored :P

 

<?php

$email = "From:   lenny@gmail.com
Sent:   07 September 2010 21:58
To:   bilbo bagins, sam gamgee, billy bob
Subject:   Subject: fun run 2011 - Registered interest

Follow Up Flag:   Follow up
Flag Status:   Completed

Name: Lenny Davis
Address: Ground Floor 500 High Street
Address2: test address 2
Postcode: xs34 4fg
Phone: 034343554335
Email: lenny@gmail.com
DOB: 11.03.86
Half or Full: full run 
How did you hear: Took part in the Full fun run 2010
From:   beth@aol.com
Sent:   07 September 2010 18:58
To:   bilbo bagins, sam gamgee, billy bob
Subject:   Subject: fun run 2011 - Registered interest

Follow Up Flag:   Follow up
Flag Status:   Completed

Name: Beth Collagin
Address: 76 Commercial place, merthyr road, Caerphilly
Postcode: Ce34 2vB
Phone: 0423433423424
Email: beth@aol.com
DOB: 
Half or Full: full run 
How did you hear: Took part in the Full run 2010
From:   nick@googlemail.com
Sent:   07 September 2010 17:59
To:   bilbo bagins, sam gamgee, billy bob
Subject:   Subject: fun run 2011 - Registered interest

Follow Up Flag:   Follow up
Flag Status:   Completed

Name: nic jones
Address: 92 Drury grove
Postcode: cr3 2vu
Phone: 077434342354
Email: nick@googlemail.com
DOB: 13/12/1973
Half or Full: full run
How did you hear: Took part in the Full run 2010";

// Capture each "Label: value" line: group 1 is the label, group 2 is the rest of the line
preg_match_all("#([a-z 0-9]+):(.*)#i",$email,$matches);

// First we sort it into different people/emails
$user_id = 0;
$user_emails = array(); // Will hold all the data
$fields = array();	// We're going to use this for starting the CSV file; it will hold all the different fields used in the text file.

// Iterate each match - 
for($i=0;$i<count($matches[0]);$i++){

// Set some variables to make it easier to work out; (trim() removes whitespace padding)
$field = trim($matches[1][$i]);
$data = trim($matches[2][$i]);

// Check or add field to fields array
if(!in_array($field, $fields)){
	$fields[] = $field;
}

// We check if the current field is "From" (strtolower() turns all letters in a string lowercase. so it can match any case)
if(strtolower($field) == "from"){
	$user_id++;
}

// Add item to the user_emails array
$user_emails[$user_id][$field] = $data;
}

// because we are using a variable number of fields we have to make a dynamic way of creating the csv contents.

// We start with a template of all fields available:
$template = array_flip($fields); // Now keys have been switched with their values
for($i=0;$i<count($template);$i++){ // This is just nulling the values so we don't get the indexes as values (start at 0 so the first field is cleared too).
$template[$fields[$i]] = null;
}

$csv_file = '"'.implode('","', $fields).'"'."\n"; // define our csv start line (the field names)

// Now we loop each user
for($i=1;$i<=count($user_emails);$i++){

// Use the template
$content = $template; // basically a copy, so we dont overwrite our template

$content = array_merge($content,$user_emails[$i]);

$csv_file .= '"'.implode('","', $content).'"'."\n";


}


echo($csv_file);

// Save as CSV File:
$h = fopen("test.csv", "w");
fwrite($h, $csv_file);
fclose($h);

?>

 

Take note of what's going on :P.

 

Btw - it was easier, IMO, to use the same regex and just use some PHP code to separate the different users, rather than doing it in the regex itself, which I doubt is even possible in this scenario since you need multi-dimensional arrays.

 

Added CSV also
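
One possible refinement for the CSV step, offered as a sketch rather than a correction: PHP's built-in fputcsv() handles the quoting, so a value containing a double quote or a comma does not break the row the way a hand-built implode('","', ...) could. The $fields and $users arrays below are made-up stand-ins for the $fields and $user_emails built by the script, and php://memory is used only so the example runs without touching the disk.

```php
<?php
// Hypothetical stand-ins for $fields and $user_emails from the script above.
$fields = array('From', 'Name', 'Address', 'Address2');
$users = array(
    array('From' => 'lenny@gmail.com', 'Name' => 'Lenny Davis',
          'Address' => 'Ground Floor 500 High Street', 'Address2' => 'test address 2'),
    array('From' => 'beth@aol.com', 'Name' => 'Beth Collagin',
          'Address' => '76 Commercial place, merthyr road, Caerphilly'),
);

// Every known field mapped to an empty string, so missing fields leave an empty column.
$template = array_fill_keys($fields, '');

// Write to an in-memory stream; swap in fopen("test.csv", "w") to write a real file.
$h = fopen('php://memory', 'w+');
fputcsv($h, $fields, ',', '"', '\\');           // header row
foreach ($users as $user) {
    // array_merge() keeps the template's column order and fills in this user's values.
    fputcsv($h, array_merge($template, $user), ',', '"', '\\');
}

rewind($h);
$csv = stream_get_contents($h);
fclose($h);

echo $csv;
```

The second user has no Address2, so that row ends with an empty fourth column instead of shifting the other values.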


Grrrrr typing with 2 fingers is sooooooooooooooo slow! :)

 

Perhaps an alternative:

(example in action: http://www.nstoia.com/sat/getfields/getfields00.php)

<?PHP
/* replace with your file's name */
$file = "getfields.txt";
$text = file_get_contents($file);

/* the pipe will act as the element separator */
$needle = "From:   ";
$text = str_replace($needle, "|", $text);

/* the ~ will act as a field separator */
$needle_array = array("Subject:   Subject:", "Sent:   ", "To:   ", "Name: ", " Address: ", " Postcode: ", " Phone: ", " Email: ", " DOB:", " Half or Full: ", " How did you hear: ");
$num= count($needle_array);
for($i=0;$i<$num;$i++) {
$text = str_replace($needle_array[$i], "~", $text);
}
$text_array = explode("|", $text);

/* the first element is empty, so remove it */
array_shift($text_array);

echo "<PRE>";
print_r($text_array);
echo "</pre>";
?>
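
The script above stops at an array with one "~"-separated string per registrant. As a hedged sketch of a possible next step, each string could be split on "~" and trimmed to get one array of fields per person. The $text_array contents here are made up and shortened to three fields each; they are not real output from the script.

```php
<?php
// Made-up, shortened stand-in for the $text_array produced by the script above.
$text_array = array(
    "lenny@gmail.com\n~07 September 2010 21:58\n~bilbo bagins, sam gamgee, billy bob\n",
    "beth@aol.com\n~07 September 2010 18:58\n~bilbo bagins, sam gamgee, billy bob\n",
);

$rows = array();
foreach ($text_array as $entry) {
    // Split on the field separator, then trim the leftover newlines and spaces.
    $rows[] = array_map('trim', explode('~', $entry));
}

print_r($rows);
```

Because this approach is positional, a record with an extra or missing field would shift the later columns, which is worth checking against the real file.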


Thank you Chem!

 

It works like a dream! And I'm quite in awe of the skills.

 

I won't pretend I fully understand everything you did, but your comments help a lot.

 

Thanks to you too lightbearer. I'm flooded with options :)

 

Right, I'm gonna sit down with this script and work out what each part is doing...

 


I'm glad it works for you :).

 

If you really are interested in finding out exactly how it works, use print_r($array); ($array should be the name of any array variable, such as $matches).

 

I would do print_r() on $matches, $user_emails (after the loop) and $fields (after the loop).

 

The first loop does a couple things:

1. Creates an array of unique "Field" names that are in the actual text file. This means that if a field is missing from every entry, there will be no column for it at the end (whatever that field is - Address2 perhaps).

2. Adds each email from each person (a field at a time).
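
The two jobs of that first loop can be seen in isolation with a few made-up (field, value) pairs. This is a cut-down sketch: $pairs is a hypothetical stand-in for what $matches provides, and the loop body mirrors the one in the script.

```php
<?php
// Hypothetical (field, value) pairs in file order, standing in for $matches.
$pairs = array(
    array('From', 'lenny@gmail.com'),
    array('Name', 'Lenny Davis'),
    array('Address2', 'test address 2'),
    array('From', 'beth@aol.com'),
    array('Name', 'Beth Collagin'),
);

$fields = array();       // job 1: every distinct field name, in first-seen order
$user_emails = array();  // job 2: one sub-array of field => value per person
$user_id = 0;

foreach ($pairs as $pair) {
    list($field, $data) = $pair;

    if (!in_array($field, $fields)) {
        $fields[] = $field;
    }
    // Each "From" line marks the start of the next person's record.
    if (strtolower($field) == 'from') {
        $user_id++;
    }
    $user_emails[$user_id][$field] = $data;
}

print_r($fields);
print_r($user_emails);
```

Running print_r() on both arrays shows that Address2 appears once in $fields but only in the first person's record, which is why the CSV step needs a template of all fields.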

 

hope this helps
