REGEXP - anything, including space

neuroxik · August 16, 2012

Hey everyone.

So I want to be able to find "John Bee Doe" when the user input is "John Doe". I'm using REGEXP because I'm already using what I call $name_similes containing an array of similar names (ex.: Britney, Britanny, etc).

I'm stuck on this part though (John Bee Doe). I tried using REGEXP imploding on spaces after each word (search term) to replace with ([[:alpha:]]|[[:space:]]) which I assumed would include either a space, but it doesn't work. Anybody hae any advice or ideas ?

Christian F. · August 16, 2012

You're looking for a single character there, which is either a letter (a-z) or a whitespace character. Which is why you're not finding anything.

I recommend using the \w for letters, \d for digits (0-9) and \s for whitespace instead. Gives you, paradoxically enough, a much cleaner and easier to read RegExp. You can also read more about Regular Expressions here: http://regular-expressions.info/

Now, that said I don't quite understand what you're trying to say with the middle part about $name_smilies. Do you have any code to show, that can help shed some light on this? I suspect you might be doing things in a bit overly complex manner.

neuroxik · August 16, 2012

For code, there's quite a bit, but since you ask:

function prepStrForSrc($str=NULL,$nm_similes=NULL) {
$r = array(
	'go_on_with_src' => FALSE,
	'src_str' => ltrim($str), # trim only leading spaces
	'passed_thru_prep_func' => 1,
	'orig_str' => $str
);

$tr_str = trim($str);
if(empty($tr_str)) return $r;

$r['go_on_with_src'] = TRUE;
$r['src_str'] = strtolower_utf8($r['src_str']);

$has_backslashes = strpos($r['src_str'],'\\');
if($has_backslashes===FALSE) $r['src_str'] = addslashes($r['src_str']);

if(empty($nm_similes)) $treat_similes = FALSE;
else $treat_similes = TRUE; # even if no treat_similes, still want seperators on each name pieces for mysql src

$pcs = explode(" ",$r['src_str']);
$num_pcs = count($pcs); // 'ali' will return 1, 'ali ' (with space) will return 2, so trim each then to know if empty, but atleast now know (when space) that previous is "[word]"

/* works this way: [[:<:]] (name1|name2|name3) [[:>:]] 
	without the () around name1|name2|name3, it'll find alizee with ali (not wanted) -- wouldn't do that on last since not alot of chances someone will write "John Doe " (instead of "John Doe") */

$new_terms = array();
$is_last = FALSE; // until set to TRUE. "John Doe", on first loop (John) still false, then true on "Doe".
$i = 1;
if($num_pcs > 1) {
	foreach($pcs as $k=>$word) {
		$trimmed = trim($word);
		if($i == $num_pcs) $is_last = TRUE;

		if(!empty($trimmed)) {
			if($treat_similes) {
				# cond 1
				$t = getKeyNameSimiles2($trimmed,$nm_similes,$is_last);
				if($t['similes_found']===TRUE) $tmp_new_word = $t['nm_similes_formatted'];
				else $tmp_new_word = $t['term']; // hmm... still need to put a [[:<:]] around it if not last

				$new_terms[] = $tmp_new_word;
			}
			else {
				# cond 2
				$tmp_new_word = sqlTermWrappers($trimmed,$is_last);
				$new_terms[] = $trimmed;
			}
		}
		$i++;
	}
}
else { // no [word] wrappers because maybe word ain't finished
				# cond 3
	// only one str word, maybe incomplete yet, but still want to check (if strlen > 1) for similes
	if(strlen($pcs[0])>1) {
		$t = getKeyNameSimiles2($pcs[0],$nm_similes,$is_last=TRUE);
		if($t['similes_found']===TRUE) $tmp_new_word = $t['nm_similes_formatted'];
		else $tmp_new_word = $t['term'];
	}
	else $tmp_new_word = $pcs[0];
	$new_terms[] = $tmp_new_word;
}

// implode $new_terms , also use chars for mysql here BETWEEN each term, if necessary, use here
$new_terms_specchars = array();
foreach($new_terms as $k=>$v) { $new_terms_specchars[] = specCharVariations($v); }
unset($new_terms);
$r['src_str'] = implode(" ",$new_terms_specchars);

/* now, to find "John Doe" even if write "Doe John", have to use formated return as above, then separate like this (string_query_1|string_query_2), string_query_2 for Doe Jogn. No spaces (works) << DO LATER */

return $r;
}

function sqlTermWrappers($str=NULL,$is_last=TRUE) { # $is_last can also be used if only one term has been entered with no spaces yet, used as 'not_necessarily_complete', so no () around
if(empty($str)) return $str;
if($is_last) $str = sprintf('[[:<:]]%s[[:>:]]',$str);
else $str = sprintf('[[:<:]](%s)[[:>:]]',$str);
return $str;
}

function specCharVariations($str=NULL) {
if(empty($str)) return $str;	
#$str = utf8_fix($str);
// DON'T INCLUDE mysql query REGEX chars such as [:<
$str = preg_replace("(e|?|?|?|?)","(e|?|?|?|?)",$str); // we don't want these, only for test purposes
$str = preg_replace("(a|?|?|?|?|?)","(a|?|?|?|?|?)",$str); // ?
$str = preg_replace("(i|?|?|?|?)","(i|?|?|?|?)",$str);
$str = preg_replace("(o|?|?|?|?|?)","(o|?|?|?|?|?)",$str);
$str = preg_replace("(u|?|?|?)","(u|?|?|?)",$str);
$str = preg_replace("(?|c)","(?|c)",$str);
$str = preg_replace("(?|n)","(?|n)",$str);
/* CHANGES DONE HERE MUST BE PASTED INTO ../config.php:specCharVariations() <<<<<<<<<<<<<<<<<<<<<<<<<<<<<< */
//$str = utf8_decode($str);
return $str;
}

function strToAlpha($str=NULL) {
if(empty($str)) return $str;
// expected to already be strtolowered
$str = preg_replace("(?|?|?|?)","e",$str); 
$str = preg_replace("(?|?|?|?|?|?)","a",$str);
$str = preg_replace("(?|?|?|?)","i",$str);
$str = preg_replace("(?|?|?|?|?)","o",$str);
$str = preg_replace("(?|?|?)","u",$str);
$str = str_replace('?','c',$str);
$str = preg_replace("(?|ń)","n",$str);
$str = str_replace('?','ae',$str);
$str = str_replace('?','oe',$str);
return $str;
}

Then prepStrForSrc calls getKeyNameSimiles2() :

function getKeyNameSimiles2($term=NULL,$nm_similes=NULL,$is_last=FALSE) {
# also chk comments inside prepStrForSrc() func
$r['similes_found'] = FALSE;
$r['term'] = $term;
$r['nm_similes_formatted'] = '';

// if $is_last is TRUE, then don't limit to that (ex.: John Doe, user input: "John Do", want to find "John Doe", so no () around "Do" )

if(empty($term) || strlen($term) < 2) return $r;
if(empty($nm_similes) || !is_array($nm_similes)) return $r;

// now need to temporarily switch to strict alpha (example: Jos?e becomes Josee). Will use specCharVariations() anyway, so no interference
$term_alpha = strToAlpha($term);

foreach($nm_similes as $k=>$v) {
	if(is_array($v)) {
		foreach($v as $k2=>$name) {
			if($term_alpha == $name) {
				$r['similes_found'] = TRUE;
				$nm_similes_arr = $nm_similes[$k];
				$new_terms_f = array();
				foreach($nm_similes_arr as $alt) {
					$new_terms_f[] = addslashes(strtolower_utf8($alt));
				}
				$r['nm_similes_formatted'] = implode("|",$new_terms_f);
				$r['nm_similes_formatted'] = sqlTermWrappers($r['nm_similes_formatted'],$is_last);
				break 2;
			}
		}
	}
}
return $r;
}

Above, it'll get similarities from another array, which is not quite important to mention here but I just didn't want the type of answer "you don't need REGEXP for this", that's why I gave the context of similar names, alternative letters found in specCharVariations so that writing "St?phanie" still finds "Stephanie" as well as "Stephany".

Then I call it this way:

/*

I'M NOT INCLUDING ALL THE CLEANING HERE, like if strlen < 2, addslashes, etc

*/

$ns_src = utf8_decode($ns_src);

$q_str = "SELECT * FROM `gni` WHERE `name` REGEXP '".$ns_src."' ";

neuroxik · August 16, 2012

You're looking for a single character there, which is either a letter (a-z) or a whitespace character. Which is why you're not finding anything.

I recommend using the \w for letters, \d for digits (0-9) and \s for whitespace instead. Gives you, paradoxically enough, a much cleaner and easier to read RegExp. You can also read more about Regular Expressions here: http://regular-expressions.info/

Now, that said I don't quite understand what you're trying to say with the middle part about $name_smilies. Do you have any code to show, that can help shed some light on this? I suspect you might be doing things in a bit overly complex manner.

I'm sorry I'm bumping, can't find the edit button on my previous post, but:

how would you write space or word? Would (\s|\w) do the job ? But I'm also thinking, another word means surrounding spaces too, so could it be ( \s | (\s \w \s) ) (without the spaces)

Christian F. · August 16, 2012

OK... I haven't had the time, desire, nor the necessary data to sit down and figure out exactly what your code does. Since I not only am missing some input examples, but also some of the code that you're calling (don't know where you're getting $ns_src from, or what's been done to it previously.

In any case, that's besides the point for this thread, so no need to post it here. What I will say, however, is that it seems that you've made an overly complicated solution, which could instead have been solved with "metaphone ()" or "soundex ()".

Now, to the RegExp problem:

The three character classes I gave above matches only a single character, not a group. So in your last RegExp, you're asking for a single space, or a single space followed by a single letter and then a single space. Each captured in their own sub group, so that group 1 == " " and group 2 == " a " (for example). You'll want to use both quantifiers and, most likely, word boundaries to construct your RegExp.

Though, I'm feeling a bit trepidation about the way you want to construct that RegExp, as I suspect your focus is too narrow. A RegExp has to be constructed with the entire string in mind, or at the very least the neighbouring elements to that which you want to have captured. So if you could post some example input strings, we can possibly help you with pointing out a far better way to construct that RegExp; Or even the whole process, if you're lucky.

My final advice, which is something you should do in either case, is to go to the site I linked previously and read up. If you're going to work with Regular Expressions, then you really need to know how to use them. They are a very powerful tool, but also a very complex one. That makes them hard to get right and very easy to get wrong, thus becoming a liability.

neuroxik · August 16, 2012

Wow, you've opened me up to a world of possibilities with metaphone, soundex and the similar functions. If only I would've known about this before. Thanks.

Sorry for over-complicating my examples, I'll try keeping with the basic output for these examples' sake. Thanks for all your input, by the way.

Here's a list of user input strings and what they generate:

SELECT * FROM `gni` WHERE `name` REGEXP '".$ns_src."' ORDER BY `name` ASC

1 - "Anna " outputs http://img211.imageshack.us/img211/9466/13116717.jpg (I'm using a screenshot because upon clicking Preview, the special characters aren't encoded properly here)

(It searches Ana|Anna , having found that alternate spelling in $name_similes array)

The reason for the multiple a's is that different languages use different characters sometimes, as I mentioned earlier.

It returns, among other too numerous results:

Ana Beatriz Barros

Anna Christine Speckhart

2 - "Anna Chri" outputs http://img411.imageshack.us/img411/2329/99856546.jpg (screenshot)

This one only outputs this (which is what is expected)

Anna Christine Speckhart

Note: There's no [[:<:]]word boundaries[[:>:]] around the last "Chri" because a user will rarely put a space at the end of his last word, and I also want to be able to fetch with partial results.

Where the problem comes is if someone enters "Anna Speckhart" in the input. (screenshot just in case: http://img32.imageshack.us/img32/1576/33794545.jpg )

"Christine" being the middle name, it doesn't find it since the name is Anna Christine Speckhart. That's why I was wondering about the "[word] [word | space] [word]" (to put it in a vulgar way).

Christian F. · August 16, 2012

Hehe, I suspected as much. In that case I would really recommend you to refactor your code, and base it upon the metaphone () function (or whatever else works best for you).

That said, it will still not help completely against this "missing middle name" issue, but it's a good start. Create a sound key for each part of the name, and then store it in the database as a single string (with spaces). Then you can use word boundaries and the dot to facilitate for a missing middle name:

$first name)(\s+|\b.*\b)($last name

neuroxik · August 17, 2012

[..] recommend you to refactor your code, and base it upon the metaphone () function (or whatever else works best for you).

[..]
$first name)(\s+|\b.*\b)($last name

I'm pretty much out for the rest of the week but I'll try that out some time soon and let you know if I come across a problem or come and praise you if this works. Thanks alot.

I hate having overkilled on the code because I ignored about metaphone and the similar functions, but hey, we're always learning.

Christian F. · August 17, 2012

I know how you feel, been there myself a couple of times. Though, now you won't forget about those functions at least.

Also, next time you're facing a daunting and/or complicated task think for yourself "there has to be a better way", and see if you can't find some existing solutions on the net somewhere. Chances are that at least one person have had exactly the same problem as you, and posted the easiest solution to the problem.

Had to run while writing the previous post, so I didn't have the time to add a note about the RegExp syntax being general. Thus you may have to adjust it a bit to make it MySQL-compatible. Not too familiar with MySQL's RegExp flavour, I'm afraid.

fenway · August 20, 2012

MySQL's RE capabilities are extremely limited -- though there are some pretty powerful UDFs available as plugins.

Sign In

REGEXP - anything, including space

Recommended Posts

neuroxik

Link to comment

Share on other sites

Christian F.

Link to comment

Share on other sites

neuroxik

Link to comment

Share on other sites

neuroxik

Link to comment

Share on other sites

Christian F.

Link to comment

Share on other sites

neuroxik

Link to comment

Share on other sites

Christian F.

Link to comment

Share on other sites

neuroxik

Link to comment

Share on other sites

Christian F.

Link to comment

Share on other sites

fenway

Link to comment

Share on other sites

Join the conversation

Browse

Activity

Important Information