
I think I need to use a regex to match part of a string.  I've used preg_match before but my brain hasn't grasped the intricacies of adding regex.  For background, I'm grabbing input from a form then querying MySQL to see if there is a match in the database.  I'm looking for a match of 5 consecutive characters. 

 

$input = "test12345";

$found_match = "west1298";

or

$found_match = "test1278";

 

Is preg_match enough or do I need more?

https://forums.phpfreaks.com/topic/275086-match-a-chunk-of-text/

You may be able to use similar_text() for this, but it doesn't match consecutive substrings. That said, there are some comments there that might be useful, not to mention a reference to a book which contains a lot of knowledge on stuff like this.

Edited by Christian F.

For short search strings you can use

WHERE field LIKE "%test1%" OR field LIKE "%est12%" OR field LIKE "%st123%" OR field LIKE "%t1234%" OR field LIKE "%12345%"
(ie, a bunch of LIKEs over each set of five consecutive characters)
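A minimal sketch of how that list of LIKEs could be generated from the form input rather than typed by hand. The helper names `five_char_chunks()` and `build_like_clause()` are made up for illustration, and the column name `field` is taken from the query above. The sketch uses `?` placeholders so the values can be bound through a prepared statement instead of being concatenated into the SQL.

```php
<?php
// Hypothetical helper: every run of $size consecutive characters in $input.
function five_char_chunks(string $input, int $size = 5): array
{
    $chunks = [];
    for ($i = 0, $len = strlen($input); $i + $size <= $len; $i++) {
        $chunks[] = substr($input, $i, $size);
    }
    // Drop repeated chunks so the query does not test the same value twice.
    return array_values(array_unique($chunks));
}

// Hypothetical helper: a WHERE clause with one LIKE placeholder per chunk,
// plus the values to bind to those placeholders.
function build_like_clause(string $column, array $chunks): array
{
    $conditions = [];
    $params = [];
    foreach ($chunks as $chunk) {
        $conditions[] = "$column LIKE ?";
        // Escape %, _ and \ so they are matched literally instead of as wildcards.
        $params[] = '%' . addcslashes($chunk, '%_\\') . '%';
    }
    return [implode(' OR ', $conditions), $params];
}

[$where, $params] = build_like_clause('field', five_char_chunks('test12345'));
// $where  is: field LIKE ? OR field LIKE ? OR field LIKE ? OR field LIKE ? OR field LIKE ?
// $params is: %test1%, %est12%, %st123%, %t1234%, %12345%
```

If the input is shorter than five characters there are no chunks and `$where` comes back empty, so the caller would need to check for that before running the query.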

 

A variation of that would be using a kind of index table with two columns, one of which holds every five-character chunk of the original value. The query then becomes

WHERE field = "test1" OR field = "est12" OR field = "st123" OR field = "t1234" OR field = "12345"

The difference is that this query would perform a lot faster (if you index that one field) and allow you to search on longer strings.

If you can make sure that the input doesn't contain a certain character (pretty much any character will do), either by validating it or by removing any occurrences, then you can do something like

/([^#]{5}).*?\#.*?\1/
It tries to find five characters on the left of a # then the same five on the right. You'd match it against the string $input . "#" . $found_match.
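As a rough sketch, the pattern above could be wrapped like this. The function name `shares_five_chars()` is hypothetical, and the sketch takes the "remove any found" route by stripping `#` from both strings before joining them. Without the `u` modifier the pattern counts bytes, so "five characters" here means five bytes.

```php
<?php
// Hypothetical helper wrapping the separator-based pattern.
function shares_five_chars(string $input, string $candidate): bool
{
    // Strip the separator from both strings so the only "#" is the one added below.
    $input = str_replace('#', '', $input);
    $candidate = str_replace('#', '', $candidate);

    // Group 1 captures five characters left of the "#"; \1 must reappear right of it.
    // The "s" modifier lets "." cross newlines in case the strings contain any.
    return preg_match('/([^#]{5}).*?\#.*?\1/s', $input . '#' . $candidate) === 1;
}

$a = shares_five_chars('test12345', 'west1298'); // true, both contain "est12"
$b = shares_five_chars('test12345', 'test1278'); // true, both contain "test1"
$c = shares_five_chars('test12345', 'best1999'); // false, longest shared run is "est1"
```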

 

However the simplest way would be a couple nested loops - not so bad when you consider how few passes they would make.

$input = "test12345";
$found_matches = array("west1298", "test1278");

foreach ($found_matches as $match) {
	for ($i = 0, $ilen = strlen($input); $i + 5 <= $ilen; $i++) {
		if (strpos($match, substr($input, $i, 5)) !== false) {
			// found a match
		}
	}
}

Not entirely sure this can be done with Regular Expressions, to be honest. If it is possible, then you'd probably be looking at a recursive pattern with named references and lookaheads.

An extremely complex expression, in other words, which I suspect would require a lot of resources to compile.

 

A better approach in this case would be to make a very simple tokenizer, and have it step through the string one group of characters at a time. This is quite easily done with mb_substr, mb_strpos and mb_strlen, plus a loop.

Using the MB functions ensures that it doesn't break on multi-byte characters.
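One way that loop might look, as a sketch only. The function name `mb_shares_chunk()` is made up, the strings are assumed to be UTF-8, and the mbstring extension has to be enabled.

```php
<?php
// Hypothetical helper: multi-byte-safe sliding window over $input.
function mb_shares_chunk(string $input, string $candidate, int $size = 5): bool
{
    $length = mb_strlen($input, 'UTF-8');
    for ($i = 0; $i + $size <= $length; $i++) {
        // Take $size characters (not bytes) starting at character offset $i.
        $chunk = mb_substr($input, $i, $size, 'UTF-8');
        if (mb_strpos($candidate, $chunk, 0, 'UTF-8') !== false) {
            return true;
        }
    }
    return false;
}

$hit  = mb_shares_chunk('naïve café', 'un café'); // true, both contain " café"
$miss = mb_shares_chunk('naïve café', 'cafés');   // false, only "café" (4 characters) is shared
```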

Edited by Christian F.

Speaking of tokenizing, this problem is a form of the LCS (longest common subsequence) problem with one key difference: if the two characters do not match then the new value is 0, which turns it into the longest common substring problem. Oh, and you can immediately return success if you hit five matching characters.

    l   e   l   e   p   h   o   n   e
  +---+---+---+---+---+---+---+---+---+
l | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
  +---+---+---+---+---+---+---+---+---+
l | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
  +---+---+---+---+---+---+---+---+---+
e | 0 | 2 | 0 | 2 | 0 | 0 | 0 | 0 | 1 |
  +---+---+---+---+---+---+---+---+---+
p | 0 | 0 | 0 | 0 | 3 | 0 | 0 | 0 | 0 |
  +---+---+---+---+---+---+---+---+---+
h | 0 | 0 | 0 | 0 | 0 | 4 | 0 | 0 | 0 |
  +---+---+---+---+---+---+---+---+---+
o | 0 | 0 | 0 | 0 | 0 | 0 | 5*|   |   |
  +---+---+---+---+---+---+---+---+---+


    b   a   a   c   c   b
  +---+---+---+---+---+---+
a | 0 | 1 | 1 | 0 | 0 | 0 |
  +---+---+---+---+---+---+
a | 0 | 1 | 2 | 0 | 0 | 0 |
  +---+---+---+---+---+---+
a | 0 | 1 | 2 | 0 | 0 | 0 |
  +---+---+---+---+---+---+
c | 0 | 0 | 0 | 3 | 1 | 0 |
  +---+---+---+---+---+---+
c | 0 | 0 | 0 | 1 | 4 | 0 |
  +---+---+---+---+---+---+
c | 0 | 0 | 0 | 1 | 2 | 0 |
  +---+---+---+---+---+---+
c | 0 | 0 | 0 | 1 | 2 | 0 |
  +---+---+---+---+---+---+
c | 0 | 0 | 0 | 1 | 2 | 0 |
  +---+---+---+---+---+---+
Plenty of room for optimizations too.
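A sketch of the table-filling idea shown above, keeping only the previous row in memory (one of the possible optimizations). The function name `has_common_run()` is hypothetical, and the comparison is byte-based like the other single-byte examples in this thread.

```php
<?php
// Hypothetical helper: true as soon as any shared run reaches $needed characters.
function has_common_run(string $a, string $b, int $needed = 5): bool
{
    $lenA = strlen($a);
    $lenB = strlen($b);
    // $prev[$j] is the length of the run ending at $a[$i - 1] and $b[$j - 1].
    $prev = array_fill(0, $lenB + 1, 0);

    for ($i = 0; $i < $lenA; $i++) {
        // Every cell starts at 0, which is also the value kept on a mismatch.
        $curr = array_fill(0, $lenB + 1, 0);
        for ($j = 0; $j < $lenB; $j++) {
            if ($a[$i] === $b[$j]) {
                // Extend the run from the diagonal neighbour.
                $curr[$j + 1] = $prev[$j] + 1;
                if ($curr[$j + 1] >= $needed) {
                    return true; // early exit, no need to fill the rest of the table
                }
            }
        }
        $prev = $curr;
    }
    return false;
}

$first  = has_common_run('llepho', 'lelephone'); // true, the run "lepho" reaches 5
$second = has_common_run('aaaccccc', 'baaccb');  // false, the longest run is 4
```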

 

[edit] Better examples.

Edited by requinix

This seems to work for what I need:

 

$input = "xxxxssworxxxickle";
$wordlist = array("password", "pickle", "passwest", "wordgame", "swords", "orxxx");
$charactercheckcount = 5;
$charactercheckcountoffset = $charactercheckcount-1;
for($x = 0; $x < count($wordlist); $x++) {
	$wordlistitem = $wordlist[$x];
	$wordlistitemlength = strlen($wordlistitem);	
	$loop = 0;
	while ($loop < $wordlistitemlength-$charactercheckcountoffset) {
		$checkstring = substr($wordlistitem, $loop, $charactercheckcount);
		$match = strpos($input, $checkstring);
		if ($match !== false) { // strpos() returns 0 for a match at the start, which is falsy
			echo "Match found";
			exit();
		}
		++$loop;
	}
}
  • 1 month later...

MySQL supports regexp searches - I do this on a regular basis from my PHP code. See http://dev.mysql.com/doc/refman/5.1/en/regexp.html

As I said in the very first reply to this thread, which you clearly didn't read,

MySQL doesn't support the regex syntax you would need for this

Well ... before you get snooty about it - exactly what is it about finding "a match of 5 consecutive characters" that have come from an HTML form that MySQL's REGEXP can't do?

 

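// NOTE: make sure any data used to access a database is properly escaped - this example does not do this.

$form_input = 'test1234';

$start = 0; // offset of the five-character chunk to search for

$consecutive_chars = substr($form_input, $start, 5);

// REGEXP matches anywhere in the field, so no wildcards are needed around the chunk.
$sql = "SELECT * FROM `table` WHERE `field` REGEXP '" . $consecutive_chars . "'";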

 

The algorithm can get more complex as more fields or tables are searched, but overall it's a simple search as I understand it.

Edited by rama schneider

Right. Now repeat that for every substring.

SELECT * FROM table WHERE field REGEXP 'test1' OR field REGEXP 'est12' OR field REGEXP 'st123' OR field REGEXP 't1234'
(And since all that does is check string contents a LIKE might be better.)

REGEXP (test1|est12| ....) - it has always worked for me. As you point out, if one is going to check each possibility one at a time, then LIKE would probably be quicker. But REGEXP would work well for what the original poster wants to do.

 

The main point being that one can offload this simple type of search to the MySQL server which is very efficient at doing just this thing.
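A small sketch of building that single alternation from the form input. The function name `build_regexp_alternation()` is made up, and the sketch assumes the input is validated as letters and digits only, so no chunk can contain a regex metacharacter that would need escaping.

```php
<?php
// Hypothetical helper: "test1|est12|st123|t1234" for the input "test1234".
function build_regexp_alternation(string $input, int $size = 5): string
{
    // Assumption: alphanumeric input only, so nothing needs regex escaping.
    if (!ctype_alnum($input)) {
        throw new InvalidArgumentException('Input must be alphanumeric.');
    }
    $chunks = [];
    for ($i = 0, $len = strlen($input); $i + $size <= $len; $i++) {
        $chunks[] = substr($input, $i, $size);
    }
    return implode('|', array_unique($chunks));
}

$pattern = build_regexp_alternation('test1234');
// Bind $pattern to the placeholder through a prepared statement
// rather than concatenating it into the SQL string.
$sql = 'SELECT * FROM `table` WHERE `field` REGEXP ?';
```

An input shorter than five characters yields an empty pattern, so the caller would want to skip the query in that case.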
