Match a chunk of text


MikeM-2468

I think I need to use a regex to match part of a string.  I've used preg_match before but my brain hasn't grasped the intricacies of adding regex.  For background, I'm grabbing input from a form then querying MySQL to see if there is a match in the database.  I'm looking for a match of 5 consecutive characters. 

 

$input = "test12345";

$found_match = "west1298";

or

$found_match = "test1278";

 

Is preg_match enough or do I need more?


For short search strings you can use

WHERE field LIKE "%test1%" OR field LIKE "%est12%" OR field LIKE "%st123%" OR field LIKE "%t1234%" OR field LIKE "%12345%"
(i.e., a bunch of LIKEs, one for each run of five consecutive characters in the input)
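
If you'd rather not type those LIKEs out by hand, something along these lines could generate them. This is only a sketch: the function name `buildLikeClause` and the use of `?` placeholders (for a prepared statement) are my own assumptions, not part of the suggestion above.

```php
<?php
// Hypothetical helper: build "col LIKE ? OR col LIKE ? ..." plus the values to bind.
function buildLikeClause(string $input, string $column = 'field', int $size = 5): array
{
    $conditions = [];
    $params = [];
    for ($i = 0; $i + $size <= strlen($input); $i++) {
        $chunk = substr($input, $i, $size);
        // Escape the LIKE wildcards (% and _) and the backslash inside the chunk itself.
        $params[] = '%' . addcslashes($chunk, '%_\\') . '%';
        $conditions[] = "`$column` LIKE ?";
    }
    // Both parts are empty when the input is shorter than $size.
    return [implode(' OR ', $conditions), $params];
}

list($where, $params) = buildLikeClause('test12345');
```

Here `$where` holds five `LIKE ?` conditions joined by OR and `$params` holds the five `%...%` values to bind. The caller still has to handle the empty case before putting `$where` into a query.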

 

A variation of that would be using a kind of index table with two columns, which you would query like this:

WHERE field = "test1" OR field = "est12" OR field = "st123" OR field = "t1234" OR field = "12345"

The difference is that this query would perform a lot faster (if you index that one field) and allow you to search on longer strings.
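
For what it's worth, here is a rough sketch of how the chunks for such an index table might be produced. The table name `chunk_index`, its columns `record_id` and `chunk`, and both helper functions are hypothetical; the post doesn't spell out the actual layout. `IN (...)` is used as shorthand for the chain of `=` comparisons.

```php
<?php
// Hypothetical helper: list the distinct five-character chunks of a value.
function chunksOf(string $value, int $size = 5): array
{
    $chunks = [];
    for ($i = 0; $i + $size <= strlen($value); $i++) {
        $chunks[] = substr($value, $i, $size);
    }
    // array_unique keeps the first occurrence; array_values reindexes from 0.
    return array_values(array_unique($chunks));
}

// Rows to store when a record is saved: one (record id, chunk) pair per chunk.
function indexRows(int $recordId, string $value): array
{
    return array_map(function ($chunk) use ($recordId) {
        return [$recordId, $chunk];
    }, chunksOf($value));
}

// Lookup side: one placeholder per chunk of the user's input.
$chunks = chunksOf('test12345');
$placeholders = implode(', ', array_fill(0, count($chunks), '?'));
$sql = "SELECT DISTINCT record_id FROM chunk_index WHERE chunk IN ($placeholders)";
```

The `$chunks` array is what you'd bind to the placeholders. An index on the `chunk` column is what makes the equality lookups cheap.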


If you can make sure that the input doesn't contain a certain character, either by validating it or removing any found, and pretty much any character will do, then you can do something like

/([^#]{5}).*?\#.*?\1/
It tries to find five characters on the left of a # then the same five on the right. You'd match it against the string $input . "#" . $found_match.
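
As a rough illustration of how that pattern might be used with preg_match: the wrapper name `sharesFiveChars` is made up, and stripping any `#` from both strings is just one way of guaranteeing the separator only appears once.

```php
<?php
// Hypothetical wrapper around the pattern above.
function sharesFiveChars(string $input, string $candidate): bool
{
    // Remove the separator character so it can only occur between the two strings.
    $input = str_replace('#', '', $input);
    $candidate = str_replace('#', '', $candidate);

    // Five non-# characters, then the separator, then the same five characters again.
    return preg_match('/([^#]{5}).*?\#.*?\1/', $input . '#' . $candidate) === 1;
}

var_dump(sharesFiveChars('test12345', 'west1298')); // shares "est12"
var_dump(sharesFiveChars('test12345', 'test1278')); // shares "test1"
```

Note that `.` does not match a newline by default, so if the strings can contain line breaks you would probably want the `s` modifier on the pattern.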

 

However the simplest way would be a couple nested loops - not so bad when you consider how few passes they would make.

$input = "test12345";
$found_matches = array("west1298", "test1278");

foreach ($found_matches as $match) {
	for ($i = 0, $ilen = strlen($input); $i + 5 <= $ilen; $i++) {
		if (strpos($match, substr($input, $i, 5)) !== false) {
			// found a match
		}
	}
}

Not entirely sure this can be done with Regular Expressions, to be honest. If it is possible, then you'd probably be looking at a recursive pattern with named references and lookaheads.

An extremely complex expression, in other words, which I suspect would require a lot of resources to compile.

 

A better approach in this case would be to make a very simple tokenizer, and have it step through the string one group of characters at a time. This is quite easily done by using mb_substr, mb_strpos and mb_strlen, plus a loop.

Use the MB functions so that it doesn't break on multi-byte characters.
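
Roughly what I have in mind, as a sketch only: the function name `mbSharesChunk` is made up, UTF-8 is an assumption about the encoding, and the mbstring extension has to be loaded.

```php
<?php
// Hypothetical helper: does $candidate contain any $size-character chunk of $input?
function mbSharesChunk(string $input, string $candidate, int $size = 5): bool
{
    $length = mb_strlen($input, 'UTF-8');
    // Slide a window of $size characters (not bytes) across the input.
    for ($i = 0; $i + $size <= $length; $i++) {
        $chunk = mb_substr($input, $i, $size, 'UTF-8');
        if (mb_strpos($candidate, $chunk, 0, 'UTF-8') !== false) {
            return true; // stop at the first shared chunk
        }
    }
    return false;
}

var_dump(mbSharesChunk('naïve café', 'xxïve cxx'));
```

The comparison against `false` matters, because mb_strpos returns 0 for a hit at the very start of the string.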

Edited by Christian F.

Speaking of tokenizing, this problem is a form of the LCS (longest common subsequence) problem with one key difference: if the two characters do not match then the new value is 0 (which turns it into the longest common substring problem). Oh, and you can immediately return success if you hit five matching characters.

    l   e   l   e   p   h   o   n   e
  +---+---+---+---+---+---+---+---+---+
l | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
  +---+---+---+---+---+---+---+---+---+
l | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
  +---+---+---+---+---+---+---+---+---+
e | 0 | 2 | 0 | 2 | 0 | 0 | 0 | 0 | 1 |
  +---+---+---+---+---+---+---+---+---+
p | 0 | 0 | 0 | 0 | 3 | 0 | 0 | 0 | 0 |
  +---+---+---+---+---+---+---+---+---+
h | 0 | 0 | 0 | 0 | 0 | 4 | 0 | 0 | 0 |
  +---+---+---+---+---+---+---+---+---+
o | 0 | 0 | 0 | 0 | 0 | 0 | 5*|   |   |
  +---+---+---+---+---+---+---+---+---+


    b   a   a   c   c   b
  +---+---+---+---+---+---+
a | 0 | 1 | 1 | 0 | 0 | 0 |
  +---+---+---+---+---+---+
a | 0 | 1 | 2 | 0 | 0 | 0 |
  +---+---+---+---+---+---+
a | 0 | 1 | 2 | 0 | 0 | 0 |
  +---+---+---+---+---+---+
c | 0 | 0 | 0 | 3 | 1 | 0 |
  +---+---+---+---+---+---+
c | 0 | 0 | 0 | 1 | 4 | 0 |
  +---+---+---+---+---+---+
c | 0 | 0 | 0 | 1 | 2 | 0 |
  +---+---+---+---+---+---+
c | 0 | 0 | 0 | 1 | 2 | 0 |
  +---+---+---+---+---+---+
c | 0 | 0 | 0 | 1 | 2 | 0 |
  +---+---+---+---+---+---+
Plenty of room for optimizations too.
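
In code, the tables above could be filled in something like this. It's a minimal, byte-based sketch (no multi-byte handling), and the function name `hasCommonRun` is my own invention. It keeps only the previous row, which is one of the easy optimizations.

```php
<?php
// Hypothetical helper: true if $a and $b share a run of at least $needed characters.
function hasCommonRun(string $a, string $b, int $needed = 5): bool
{
    $lenA = strlen($a);
    $lenB = strlen($b);
    // $prev[$j] is the length of the common run ending at the previous row, column $j.
    $prev = array_fill(0, $lenB + 1, 0);

    for ($i = 1; $i <= $lenA; $i++) {
        $curr = array_fill(0, $lenB + 1, 0); // a mismatch leaves the cell at 0
        for ($j = 1; $j <= $lenB; $j++) {
            if ($a[$i - 1] === $b[$j - 1]) {
                // Extend the run from the diagonal neighbour.
                $curr[$j] = $prev[$j - 1] + 1;
                if ($curr[$j] >= $needed) {
                    return true; // early exit, like the 5* cell in the first table
                }
            }
        }
        $prev = $curr;
    }
    return false;
}

var_dump(hasCommonRun('llepho', 'lelephone'));
var_dump(hasCommonRun('aaaccccc', 'baaccb'));
```

The two calls use the same strings as the two tables: the first reaches a run of five, while the second never gets past four.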

 

[edit] Better examples.

Edited by requinix

This seems to work for what I need:

 

$input = "xxxxssworxxxickle";
$wordlist = array("password", "pickle", "passwest", "wordgame", "swords", "orxxx");
$charactercheckcount = 5;
$charactercheckcountoffset = $charactercheckcount-1;
for($x = 0; $x < count($wordlist); $x++) {
	$wordlistitem = $wordlist[$x];
	$wordlistitemlength = strlen($wordlistitem);	
	$loop = 0;
	while ($loop < $wordlistitemlength-$charactercheckcountoffset) {
		$checkstring = substr($wordlistitem, $loop, $charactercheckcount);
		$match = strpos($input, $checkstring);
		// strpos returns 0 for a match at the very start of $input, so compare against false
		if ($match !== false) {
			echo "Match found";
			exit();
		}
		++$loop;
	}
}

  • 1 month later...

Well ... before you get snooty about it - exactly what is it about finding "a match of 5 consecutive characters" that have come from an HTML form that MySQL's REGEXP can't do?

 

// NOTE: make sure any data used to access a database is properly escaped - this example does not do this.

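$form_input = 'test1234';

$start = 0; // offset of the five-character chunk to test

$consecutive_chars = substr($form_input, $start, 5);

// REGEXP matches anywhere in the field, so no wildcards are needed around the chunk (a leading * is not a valid pattern)
$sql = "SELECT * FROM `table` WHERE `field` REGEXP '" . $consecutive_chars . "'";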

 

The algorithm can get more complex as more fields or tables are searched, but overall it's a simple search as I understand it.

Edited by rama schneider

Right. Now repeat that for every substring.

SELECT * FROM `table` WHERE field REGEXP 'test1' OR field REGEXP 'est12' OR field REGEXP 'st123' OR field REGEXP 't1234'
(And since all that does is check string contents a LIKE might be better.)

REGEXP (test1|est12| ....) - it has always worked for me. As you point out if one is going to check each possibility one at a time then LIKE would probably be quicker. But REGEXP would work well for what the original poster wants to do.

 

The main point being that one can offload this simple type of search to the MySQL server which is very efficient at doing just this thing.
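
To make the alternation concrete, here is one way the pattern might be assembled in PHP before handing it to MySQL. Treat it as a sketch: `buildRegexpPattern` is a made-up name, and using preg_quote to neutralise regex metacharacters is an assumption, since it targets PCRE rather than MySQL's regex dialect. The pattern should still be passed as a bound parameter rather than concatenated into the query.

```php
<?php
// Hypothetical helper: "test1|est12|..." for every five-character chunk of the input.
function buildRegexpPattern(string $input, int $size = 5): ?string
{
    $chunks = [];
    for ($i = 0; $i + $size <= strlen($input); $i++) {
        // Assumption: PCRE-style quoting is close enough for the characters expected here.
        $chunks[] = preg_quote(substr($input, $i, $size));
    }
    // An empty pattern is not usable with REGEXP, so signal "nothing to search for".
    return $chunks ? implode('|', $chunks) : null;
}

$pattern = buildRegexpPattern('test1234');
$sql = 'SELECT * FROM `table` WHERE `field` REGEXP ?'; // bind $pattern to the placeholder
```

For `'test1234'` this gives the same four alternatives as the OR'd version above, just in a single REGEXP.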
