Here's a riddle for PHP coders out there...

 

I want to take a string, and find patterns. RegExp, you say?

 

Exactly, but it gets a little more complex than that.

 

You see, I don't know what I am looking for.

Neither will the code.

The only thing we know is that we want to find the largest repeated "pattern" in a file and replace each occurrence of it with a number.

 

Here's an example of what I am talking about, "pseudocode" style:

 

Text to be scanned:

 

God is great, God is good,

Let us thank him for our food.

 

Scan complete. Largest pattern: "God is g"

Found twice, replaced with "1", noted and removed.

Now we have:

 

1reat, 1ood,

Let us thank him for our food.

 

Second scan complete. Largest pattern: "ood"

Found twice, replaced with "2", noted and removed.

Now we have:

 

1reat, 12,

Let us thank him for our f2.

 

Third scan complete. Largest pattern: "r_" (Underscore is whitespace)

Found twice, replaced with "3", noted and removed.

 

Now we have:

 

1reat, 12,

Let us thank him fo3ou3f2.

 

No more patterns of two or more characters remain. Therefore, pattern matching is over.

 

Now, imagine this on a work like Hamlet. I think you see what I am saying.
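For anyone who wants to try the passes above, here is a minimal brute-force sketch of one way to read them. The function names `longestRepeat` and `compressByPatterns` are made up for illustration, ties go to the leftmost candidate, and the loop stops once the number would be no shorter than the pattern it replaces. It also assumes the input contains no digits of its own, and it does nothing to stop an earlier number from ending up inside a later pattern.

```php
<?php
// Hypothetical helper: longest substring of at least $minLen characters
// that occurs twice without overlapping, or null when nothing repeats.
function longestRepeat(string $text, int $minLen = 2): ?string
{
    $n = strlen($text);
    for ($len = intdiv($n, 2); $len >= $minLen; $len--) {
        for ($i = 0; $i + 2 * $len <= $n; $i++) {
            $candidate = substr($text, $i, $len);
            // Only look for a second copy that starts after the first one ends.
            if (strpos($text, $candidate, $i + $len) !== false) {
                return $candidate; // leftmost candidate wins a tie
            }
        }
    }
    return null;
}

// Hypothetical driver: swap the longest repeat for a running number until
// nothing repeats or the number would be no shorter than the pattern.
function compressByPatterns(string $text, int $minLen = 2): array
{
    $patterns = [];
    $next = 1;
    while (($pattern = longestRepeat($text, $minLen)) !== null
        && strlen((string) $next) < strlen($pattern)) {
        $patterns[$next] = $pattern;
        $text = str_replace($pattern, (string) $next, $text);
        $next++;
    }
    return [$text, $patterns];
}

[$packed, $patterns] = compressByPatterns("God is great, God is good,\nLet us thank him for our food.");
echo $packed, "\n";
print_r($patterns);
```

On the sample text this records "God is g", "ood" and " f". The third pass differs from the hand-worked one because " f" and "r " both occur twice and the sketch takes whichever comes first. The nested scans are roughly cubic in the text length, so this is fine for a couple of lines but far too slow for something the size of Hamlet.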

 

Thanks in advance!

https://forums.phpfreaks.com/topic/186183-patterns-but-not-the-norm/

I don't usually help unless you provide code, but this seemed like a fun challenge.

 

Here's what I got to work. It's ugly, and it won't do what you think it will. Remember: the script does not know actual words, only length.  You can fix it from there. If you need help, just post your questions here.

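<?php
print_r(getpatterns("God is great, God is good,
Let us thank him for our food.", 3));

/**
 * Counts how often fixed-size chunks of at least $minlen characters occur.
 * Note: str_split() only yields chunks aligned to multiples of the chunk
 * size, so many possible substrings are never examined.
 *
 * @param string $string
 * @param int $minlen
 * @return array chunk => occurrence count, highest count first
 */
function getpatterns($string, $minlen = 1) {
    $string = strtolower($string);
    $maxcount = strlen($string);
    $array = array();
    $count = $minlen;
    while ($count <= $maxcount) {
        // Cut the text into consecutive chunks of $count characters.
        $tmp_array = str_split($string, $count);
        foreach ($tmp_array as $val) {
            // Count how often this chunk appears anywhere in the text.
            $array[$val] = substr_count($string, $val);
        }
        $count++;
    }
    arsort($array);
    return $array;
}
?>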

It's quite simple, but you have to keep your wits about you.

Original: God is great, God is good, Let us thank him for our food.

Compressed: \0reat\1\0\2\1Let us thank him\3or our\3\2.

#0=>'God is g'

#1=>', '

#2=>'ood'

#3=>' f'

 

As shown, this was the output of the code I created, but this sounds more like a school project than a riddle.

 

To achieve this I use a sliding-window mechanism:

1. Grab the first portion of the text using the minimum window size.

2. Compare it against the rest of the string.

3. If a match is found, store the result, increase the window, and try the match again.

4. Increment the offset of the first portion and repeat.
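The compare-and-grow step can be looked at in isolation. This is only a sketch of that one step, not the code that produced the output above: `longestMatchAt` is a made-up name, and it assumes the two copies are not allowed to overlap.

```php
<?php
// Hypothetical helper: length of the longest substring that starts at $pos
// and also starts at some later offset (0 if nothing reaches $minWin).
function longestMatchAt(string $text, int $pos, int $minWin = 2): int
{
    $n = strlen($text);
    $best = 0;
    // Compare the window at $pos against every later window position.
    for ($pos2 = $pos + $minWin; $pos2 + $minWin <= $n; $pos2++) {
        $len = 0;
        // Grow the window while both copies agree and stay apart.
        while ($pos2 + $len < $n && $pos + $len < $pos2
            && $text[$pos + $len] === $text[$pos2 + $len]) {
            $len++;
        }
        if ($len >= $minWin && $len > $best) {
            $best = $len;
        }
    }
    return $best;
}

$text = "God is great, God is good, Let us thank him for our food.";
echo longestMatchAt($text, 0), "\n"; // 8, the length of "God is g"
```

Calling it for each offset in turn, and replacing whenever it returns something non-zero, gives the outer loop described in the last step.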

 

Unfortunately, at 38, no longer school worthy.  ;D

 

It was a little more of an idle curiosity that I had never been able to figure out. I would start, then over-complicate the code, get frustrated, and walk away from it for a year or so until I remembered it again.

 

It's not even like it's necessarily usable. I guess you could call it a poor man's text compression program.  ::)

 

But, hey, thanks again, jonsjava, for the point in the right direction.

I have another approach which is not exactly what you are looking for but might get you going in another direction.

 

1. Make arrays of 10-word sequences.

2. Then look for any place where a run of words in one array, say words 1, 2, 3, is the same as words 1, 2, 3 or 2, 3, 4 or 3, 4, 5, etc. in another array.

Does that make sense? I am more clued up with MySQL than PHP, so I would dump the whole thing into a database and work with the data from there.
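One possible reading of the word-based idea, sketched in PHP instead of MySQL: index every run of three consecutive words and keep the runs that show up more than once. The name `repeatedWordRuns`, the run length of three, and the sample sentence are all assumptions for illustration, and the 10-word arrays are skipped in favour of one flat word list.

```php
<?php
// Hypothetical helper: map each run of $n consecutive words to the word
// offsets where it starts, keeping only runs that occur more than once.
function repeatedWordRuns(string $text, int $n = 3): array
{
    $words = preg_split('/\s+/', strtolower(trim($text)), -1, PREG_SPLIT_NO_EMPTY);
    $seen = [];
    for ($i = 0; $i + $n <= count($words); $i++) {
        $key = implode(' ', array_slice($words, $i, $n));
        $seen[$key][] = $i; // remember every offset where this run starts
    }
    return array_filter($seen, fn(array $offsets): bool => count($offsets) > 1);
}

// Made-up sample sentence, lowercase and without punctuation.
$sample = "to be or not to be that is the question to be or not";
print_r(repeatedWordRuns($sample, 3));
```

For the sample this reports "to be or" at word offsets 0 and 10 and "be or not" at offsets 1 and 11. Punctuation is not stripped, so "be," and "be" would count as different words.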

38 is not too bad.

 

Here is what I used:

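<?php
$quote = $text = "God is great, God is good, Let us thank him for our food.";

$minwin  = 2;        // smallest pattern length worth replacing
$pos     = $cnt = 0; // current scan offset and number of patterns found
$matches = array();  // replacement index => pattern text
$tlen    = strlen($text);

while ($pos < ($tlen - $minwin)) {
    $pos2      = $pos + $minwin; // where the second window starts
    $winsize   = $minwin;
    $matchsize = 0;

    // Slide the second window across the rest of the text.
    while (($pos2 + $winsize) < $tlen) {
        // Grow the window for as long as both substrings still agree.
        // === avoids PHP's loose comparison of numeric-looking strings.
        while (substr($text, $pos, $winsize) === substr($text, $pos2, $winsize)) {
            if ($winsize > $matchsize) {
                $matchsize = $winsize;
            }
            $winsize++;
        }
        $pos2++;
    }

    if ($matchsize) {
        // Record the longest repeat starting at $pos and replace every copy.
        $match         = substr($text, $pos, $matchsize);
        $matches[$cnt] = $match;
        $text          = str_replace($match, "\\{$cnt}", $text);
        $pos++; // extra step so the scan resumes after the inserted marker
        $cnt++;
        $tlen = strlen($text);
    }
    $pos++;
}

echo "Original: {$quote}<br />\n";
echo "Compressed: {$text}<br />\n";
foreach ($matches as $key => $val) {
    echo "  #{$key}=>'{$val}'<br />\n";
}
?>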

Some code could still be added (returning the biggest match first is not incorporated yet), but it's almost there.

The reason I used \0 instead of just 1, 2, 3, 4 is so you can find the replacements quickly, but I'm sure you can use other delimiter marks.
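To show why a marker that is easy to find helps, here is a small sketch that reverses the substitution. `expandMarkers` is a made-up name, and the compressed string and pattern list are copied from the output posted earlier in the thread. Markers are expanded newest first, on the assumption that a later pattern may itself contain an earlier marker.

```php
<?php
// Hypothetical helper: undo the "\N" substitutions, newest marker first.
function expandMarkers(string $packed, array $matches): string
{
    krsort($matches); // highest index first, so "\10" is handled before "\1"
    foreach ($matches as $index => $pattern) {
        $packed = str_replace("\\{$index}", $pattern, $packed);
    }
    return $packed;
}

$packed  = '\0reat\1\0\2\1Let us thank him\3or our\3\2.';
$matches = ['God is g', ', ', 'ood', ' f'];
echo expandMarkers($packed, $matches), "\n";
```

This prints the original sentence again. It would still get confused if the source text already contained a backslash followed by a digit, or if a marker happened to be followed by a literal digit.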

 
