[SOLVED] making preg_match() more efficient when batch used; how to start from prev spot?


Hey, OK, so here is my issue:

I have an open file, roughly 80,000 characters long.

I am using preg_match() to find a unique piece of text in this file, about 300 times.

Is there a more efficient way to do this?

**There is a pattern that I have not taken advantage of but want to:

-Each term that I search for appears AFTER the spot where the previous term was found (somewhere after it, not at a constant distance).

So, is it possible to start each search at the place where the previous text was found (somehow move the cursor there), so that I do not waste time and resources checking, hundreds of times, a region where I know my text cannot be?

Thanks

Here is an example:

 

File contents:

ajkdkansdkUNIQ1382NAME=sally
idsdckdsjckUNIQ78a2NAME=bob
kdjalkdsklaUNIQ8912NAME=tom
osaijdoasksUNIQ8291NAME=charles
aksjdkaskssUNIQds89NAME=sandy
skdjsakjdskUNIQ8238NAME=rock
...

 

And I have an array of "accepted names"

accepted[1]="tom"
accepted[2]="rock"
...

 

Now, I am searching the text like so:

preg_match('~UNIQ([a-z0-9]{1,4})NAME=' . preg_quote($accepted[1], '~') . '~', $textfile, $match);

 

So, $match[1] would be 8912, and so on

 

The thing is, as you can see, I have LOTS of names and a LONG name list.

I don't want to search through the whole list every time, because I know the names are in the same order; only some are missing from the accepted array. There was also lots of garbage text in the file, which I have now edited out. The program still takes about 40 seconds to finish. I know it could be faster.

Kind of get it now? What do you think?

You can create a loop and use the second-to-last parameter of file_get_contents(), the offset (and the last parameter, the maximum length, if you know that the name will be within ~X characters), to grab only a specific part of the file.

 

Ex:

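for ($i = 0, $position = 0; $i < count($accepted); $i++)
{
    // Read the file from the current offset onward
    $content = file_get_contents('somefile.txt', false, null, $position);
    // Search the file contents; the pattern would be built from $accepted[$i]
    if (preg_match('...', $content, $matches, PREG_OFFSET_CAPTURE)) {
        // Offsets are relative to $content, so advance rather than overwrite
        $position += $matches[1][1] + strlen($matches[1][0]);
    }
}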

 

Where $matches[1][1] would be the starting position of the first captured group within the string that was searched.
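For what it's worth, preg_match() also takes a fifth argument, an offset into the subject string, so the file could be read once and each search started where the previous one ended. Here is a rough sketch of that idea. The sample text, the UNIQ<id>NAME=<name> layout, and the 1 to 4 character ID length are assumptions taken from the example in the first post, not something I have run against the real file.

```php
<?php
// Sample data mirroring the example in the first post; in practice $text
// would come from a single file_get_contents() call on the real file.
$text = "ajkdkansdkUNIQ1382NAME=sally\n"
      . "idsdckdsjckUNIQ78a2NAME=bob\n"
      . "kdjalkdsklaUNIQ8912NAME=tom\n"
      . "osaijdoasksUNIQ8291NAME=charles\n"
      . "aksjdkaskssUNIQds89NAME=sandy\n"
      . "skdjsakjdskUNIQ8238NAME=rock\n";
$accepted = ['tom', 'rock'];

$ids = [];
$offset = 0; // byte position where the next search starts
foreach ($accepted as $name) {
    // Assumed format: UNIQ<id>NAME=<name>; \b keeps "tom" from matching "tommy".
    $pattern = '~UNIQ([a-z0-9]{1,4})NAME=' . preg_quote($name, '~') . '\b~';
    if (preg_match($pattern, $text, $m, PREG_OFFSET_CAPTURE, $offset)) {
        $ids[$name] = $m[1][0];
        // Offsets are relative to the start of $text, so resume at the end of this match.
        $offset = $m[0][1] + strlen($m[0][0]);
    }
}
print_r($ids);
```

If a name is not found, $offset stays where it was, so the next name is still searched from the last successful match.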

Ok cool, thanks.

 

I've got another question: do you know how I could remove lines that do not contain a certain character?

I want to "process" my text file before I batch read it, and delete all the lines that lack the first required part of the preg_match() pattern.

So far, I have this:

<?php
$file=fopen($filename,"r");

while(!feof($file)){
$temparray = explode("	", fgets($file));
$hadrequired = false;
foreach($temparray as $result){
	if($result == $required){
		$hadrequired = true;
	}
}
if(!$hadrequired){
	//DELETE THIS LINE SOMEHOW
}
}
?>

 

I know I would probably have to change the mode to "rw" or whatnot. What do you think?

 

EDIT: I also had a new idea. Maybe the reason it is so slow is that the first part of the preg_match() pattern shows up in every entry. Is there any way for it NOT to search for the first part, since it is not unique, and search only for the second part, then take the text at that position minus 5 characters? That should be much faster.

Why not, instead of deleting the line, just not write it to the new cleaned-up file?

i.e. (untested)

<?php
$file = fopen($filename, "r");
$newFile = fopen('newfile.txt', 'w');
while (($line = fgets($file)) !== false) {
   // Split on tabs, ignoring the trailing line break
   $temparray = explode("\t", rtrim($line, "\r\n"));
   foreach ($temparray as $result) {
      if ($result == $required) {
         // Keep the line and stop checking so it is written only once
         fwrite($newFile, $line);
         break;
      }
   }
}
fclose($newFile);
fclose($file);
?>

Hm, I like it, thanks. I will definitely use that.

But also, would you have any idea about better searching?

I think my search is inefficient because the first part of the preg_match() pattern is on practically every line, whereas the ending is unique.

So preg_match() goes sniffing at every line where it detects the first part, then fails because the line doesn't have the second part.

Wouldn't it be better if I searched only for the second part, got the line number where it is, and then searched that line using the prefix?

Anyone know how I could do this?
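One way that idea might look, as a sketch only: use strpos() to find the unique "NAME=<name>" part first, then look at just the few characters in front of it for the "UNIQ" prefix and the ID. The helper name findIdBefore() is made up for illustration, and the 1 to 4 character ID length and the sample text are assumptions based on the example in the first post.

```php
<?php
/**
 * Hypothetical helper: locate "NAME=<name>" with strpos(), then read the ID
 * sitting directly before it. Returns [id, next offset], or null if not found.
 */
function findIdBefore(string $text, string $name, int $offset = 0): ?array
{
    $needle = 'NAME=' . $name;
    while (($pos = strpos($text, $needle, $offset)) !== false) {
        $end = $pos + strlen($needle);
        $offset = $end;
        // Skip longer names that merely start with $name (e.g. "tom" in "tommy").
        if ($end < strlen($text) && ctype_alnum($text[$end])) {
            continue;
        }
        // Inspect at most 8 characters before the needle: "UNIQ" plus a 1-4 character ID.
        $start = max(0, $pos - 8);
        $before = substr($text, $start, $pos - $start);
        if (preg_match('~UNIQ([a-z0-9]{1,4})$~D', $before, $m)) {
            return [$m[1], $end];
        }
    }
    return null;
}

// Sample data mirroring the example in the first post.
$text = "kdjalkdsklaUNIQ8912NAME=tom\n"
      . "osaijdoasksUNIQ8291NAME=charles\n"
      . "skdjsakjdskUNIQ8238NAME=rock\n";

$ids = [];
$offset = 0;
foreach (['tom', 'rock'] as $name) {
    $hit = findIdBefore($text, $name, $offset);
    if ($hit !== null) {
        $ids[$name] = $hit[0];
        $offset = $hit[1]; // continue after this hit, since the names are in order
    }
}
print_r($ids);
```

The regular expression here only ever runs on a string of at most 8 characters, so the plain substring search does nearly all the scanning.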

Wow!

I filtered the file down to include only potential lines.

This took the size down from 30 MB to 300 KB.

Then I searched this new file, and everything went super fast.

 

Thanks for all the help guys!!
