Manipulating a text file - removing parts

Mcod · February 9, 2012

I am looking for some help with manipulating a text file.

My lines have the following format:

abc|some words here; some other words here

bcd|some words here; some other words here and even more words sometimes

I have about 150000 lines in my file. Sometimes there is a second meaning, separated by a ; and sometimes not. Sometimes there are even more than two meanings. What I am after is removing, line by line, everything from the ; to the end of the line, INCLUDING the ; itself. The cut should ALWAYS happen at the FIRST ; even if there are several (which can happen).

What I basically have here is a dictionary where the word is on the left, then we have the | character and then one or more "meanings".

Sample data:

cough|exhale abruptly, as when one has a chest cold or congestion; "The smoker coughs all day"

hack|cough spasmodically; "The patient with emphysema is hacking all day"

expectorate|discharge (phlegm or sputum) from the lungs and out of the mouth

snort|make a snorting sound by exhaling hard; "The critic snorted contemptuously"

wheeze|breathe with difficulty

What I am after is basically:

cough|exhale abruptly, as when one has a chest cold or congestion

hack|cough spasmodically

expectorate|discharge (phlegm or sputum) from the lungs and out of the mouth

snort|make a snorting sound by exhaling hard

wheeze|breathe with difficulty

It would be great if you had any sample code that could help me, as it would be no fun to remove all this by hand if there is a way to just do it with PHP (and there always is one).

Thanks for your time!

AyKay47 · February 9, 2012

Here is what I have come up with for you.

$str = "cough|exhale abruptly, as when one has a chest cold or congestion; \"The smoker coughs all day\"\n
hack|cough spasmodically; \"The patient with emphysema is hacking all day\"\n
expectorate|discharge (phlegm or sputum) from the lungs and out of the mouth\n
snort|make a snorting sound by exhaling hard; \"The critic snorted contemptuously\"\n
wheeze|breathe with difficulty\n";
$pattern = '~;(.+?)(\r|\n)~';
$replacement = '$2';
echo preg_replace($pattern,$replacement,$str);

Output:

cough|exhale abruptly, as when one has a chest cold or congestion

hack|cough spasmodically

expectorate|discharge (phlegm or sputum) from the lungs and out of the mouth

snort|make a snorting sound by exhaling hard

wheeze|breathe with difficulty

Note that the regex relies on the line endings and format of your txt file.

.josh · February 10, 2012

AyKay your solution is bad on so many levels I can't even begin...just, no. Just walk away.

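$sourceFile = fopen("sourceFile.txt", "r");
// "a" appends, so running the script twice doubles the output; use "w" to overwrite instead
$destFile = fopen("destFile.txt", "a");
if ($sourceFile && $destFile) {
  while (($line = fgets($sourceFile)) !== false) {
    // strip the line ending first, otherwise lines without a ; end up with two newlines
    $line = rtrim($line, "\r\n");
    // limit of 2: split only at the first ; ($parts[0] is the whole line if there is none)
    // (array_shift() takes its argument by reference, so it should not be fed explode() directly)
    $parts = explode(';', $line, 2);
    fwrite($destFile, $parts[0] . "\n");
  }
  fclose($sourceFile);
  fclose($destFile);
}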

AyKay47 · February 10, 2012

Perhaps you can explain why, so I can learn from it. I don't really ever run regex against text files, and I'm assuming that a plain PHP solution would normally be best.

.josh · February 10, 2012

Well, the main issues with your solution are:

1) Regex is powerful and sexy, but it should be avoided whenever a built-in string function can do the job, as it is usually less efficient than the built-in functions. There are lots of different ways to strip everything from the first semicolon onward without regex. One example is in my post above. Another is a combination of substr and strpos. You're welcome to do some benchmarking to see for yourself.

2) The OP asked for help with manipulating a file. Your solution gives an example string and does not address file manipulation at all. That in and of itself is missing the mark, but it becomes an even bigger issue considering the size of the file (150k rows) vs. your solution, which, based on the context of your $str, is to perform a regex replace on the file as a whole. Loading 150k rows of a file into memory may or may not throw an "Allowed memory size exhausted" fatal error, depending on his PHP memory_limit setting. Or it may or may not crash his server, depending on how shitty it is.

But even if he has a badass server with ample memory and limits maxed or turned off, it is still incredibly inefficient to load 150k rows of data into memory all at once like that. First off, that just amplifies the inefficiency of regex vs. built-in functions (my first point). If you did some benchmark testing on your regex pattern vs. the built-in functions on one row, or even just the several rows you currently have in your $str example, the difference is negligible, not gonna lie. Even if it's twice as slow, we're talking microseconds here... but 150k rows? And that's not even considering whether this is a one-time-only script or something that needs to be run regularly. Your example code implies using something like file_get_contents to grab a 150k line file and then, I *guess*, writing it back with file_put_contents. Regardless of stripping method, performing the operation one line at a time will almost always be safer and more efficient.

3) I will at least give you this: If the OP's situation is that somehow his only option is to work on the file as a whole (all in one variable) then for the most part, your regex is okay. You could make it more efficient by not having that first captured group since you aren't actually using $1 (IOW remove the first parens and use $1 instead of $2 in $replacement). But other than that...it's okay. But even still, if it came to having the whole file contents in a single variable, it would still probably be more efficient to first split at the EOL chars and do it one line at a time.

4) This point isn't really speaking towards your solution, but more towards mine. My solution involves reading the source file one line at a time and writing to a new file one line at a time. This is just speculation, but it might be more efficient to instead open the single source file for reading and writing (instead of just reading) and then doing the stripping directly in the file. Maybe. Would involve using a handful of other functions like fseek and ftell, strlen etc.. and I'm not entirely convinced that would really be more efficient (and it would certainly be more complex coding and readability-wise) than just writing results to a new file. But I am mentioning it in the event that writing to a new file somehow isn't an option.
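
Point 1 mentions substr plus strpos as another regex-free way to do the stripping. A minimal sketch of that idea follows. The helper name stripAfterFirstSemicolon is made up for illustration and does not appear in the thread, and the type declarations assume PHP 7 or later:

```php
<?php
// Hypothetical helper, not from the thread: returns everything before the
// first ';' in $line, or the line unchanged if it contains no ';'.
function stripAfterFirstSemicolon(string $line): string
{
    // Drop the line ending so every result can be written back with one "\n".
    $line = rtrim($line, "\r\n");
    $pos = strpos($line, ';');
    // strpos() returns false when nothing is found, and 0 is a valid
    // position, so the comparison has to be strict.
    return $pos === false ? $line : substr($line, 0, $pos);
}

// Two lines from the sample data in the first post.
echo stripAfterFirstSemicolon("hack|cough spasmodically; \"The patient with emphysema is hacking all day\"\n"), "\n";
echo stripAfterFirstSemicolon("wheeze|breathe with difficulty\n"), "\n";
```

Inside the fgets loop from the earlier post, the call would simply take the place of the explode step.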

AyKay47 · February 10, 2012

I appreciate the thought-out response, and I agree with you fully. I completely overlooked the fact that this file is 150k lines plus. Running a regex over the whole file in that situation would simply be a bad idea, as it could well bring the server to its knees. Thanks for catching my horrible mistake here; I would never have suggested this if I had read the OP thoroughly. The snippet of code you provided is very efficient and readable, I like it. I do question creating a new destination file, but we do not know enough about the OP's situation to justify it either way.

Thanks also for pointing out the back reference error I made, that was simply an error on my part.

.josh · February 10, 2012

oh and another thing about point #3 is you could also (maybe) make your regex more efficient by swapping the order of the EOL chars in your alternation:

$pattern = '~;.+?(\n|\r)~';

If the file is Windows-formatted then yeah, \r would be better first, but more often than not it will just be \n, so it is a safer bet to put that first.
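
If the whole file really does have to be handled as one string, one way to sidestep the \r versus \n ordering question is to not match the line ending at all. The sketch below swaps in a negated character class for the alternation. It is a different pattern from the one posted above, offered as an assumption about what would fit here rather than a tested drop-in for the OP's files:

```php
<?php
// Lines taken from the sample data in the first post; in practice $str
// would hold the file contents. The last line uses a Windows line ending.
$str = "cough|exhale abruptly, as when one has a chest cold or congestion; \"The smoker coughs all day\"\n"
     . "wheeze|breathe with difficulty\n"
     . "snort|make a snorting sound by exhaling hard; \"The critic snorted contemptuously\"\r\n";

// ';' followed by any run of characters that are not \r or \n.
// The match starts at the first ';' on a line and stops before the line
// ending, so nothing has to be captured and put back.
$cleaned = preg_replace('~;[^\r\n]*~', '', $str);

echo $cleaned;
```

Because the line ending is never consumed, the same pattern also covers a last line that has no trailing newline.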

.josh · February 10, 2012

My thought is that writing to a new file is simpler and easier to code and read than having to fuck with keeping track of row lengths and file positions.
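
If the only concern is that the original file name should end up holding the cleaned data, a possible middle ground is to write to a temporary file and rename it over the source once everything is written. This is a sketch under assumptions, not something proposed in the thread: the helper name cleanFileInPlace and the ".tmp" suffix are made up for illustration.

```php
<?php
// Hypothetical helper, not from the thread: cleans $path "in place" by
// writing to a temporary file next to it and renaming that over the original.
function cleanFileInPlace(string $path): bool
{
    $tmpPath = $path . '.tmp'; // assumed temp name; must not already be in use
    $in = fopen($path, 'r');
    if ($in === false) {
        return false;
    }
    $out = fopen($tmpPath, 'w');
    if ($out === false) {
        fclose($in);
        return false;
    }
    while (($line = fgets($in)) !== false) {
        $line = rtrim($line, "\r\n");
        $parts = explode(';', $line, 2); // $parts[0] is the text before the first ';'
        fwrite($out, $parts[0] . "\n");
    }
    fclose($in);
    fclose($out);
    // Replace the original only after the new file is completely written.
    return rename($tmpPath, $path);
}
```

It still reads and writes one line at a time, so there is no need for fseek or ftell, and the source file is left untouched if either file fails to open.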

AyKay47 · February 10, 2012

Putting \n first lets the regex engine find a match a little quicker. Nice, thanks.

I agree that writing to a new file is simpler, not to mention that we do not know the skill level of the OP, or whether he/she would understand messing with line lengths and file pointer positions etc. (no offense, OP). I'm not a fan of giving code that the user will not understand. But thanks for the pointers, I appreciate it.

Mcod · February 10, 2012

I'll give .josh's solution a try and see how it goes. Sadly, I have all the data in one large file, and my mission is to import it into a MySQL database after it has been cleaned. So it makes more sense to clean it with PHP and then insert it, instead of inserting it into the db and trying to clean it there.

Will report back once I'm done testing!

Thanks for all your help so far

AyKay47 · February 10, 2012

Yes, .josh's script will work for what you need. Never pollute the database.

Mcod · February 10, 2012

And here is the feedback:

It worked perfectly. I manipulated all four files and it took less than a second per file.

Thank you so much! This helped me a lot, as I was close to hiring someone to do it by hand.

Thanks again - this is really a great forum

Pikachu2000 · February 10, 2012

For future reference, something like this can be just as easy, if not easier, once the data is in the database table.

UPDATE `table` SET `field` = SUBSTRING_INDEX( `field`, ';', 1 )
