dsaba Posted February 19, 2008

I'm having trouble writing a regex to match a phrase while ignoring text that is inside HTML tags. A programmatic algorithm for this is surely possible, but I'd like to do it in regex.

Haystack:
<b>this</b> hello <span id="this"> whatever </span> man

What is IN CAPS below is the expected match (it also includes the spaces surrounding the capitalized terms):
<b>THIS</b> HELLO <span id="this"> WHATEVER </span> MAN

So it should match the phrase:
this\s*hello\s*whatever\s*man

Basically, I want to match this phrase and skip/ignore all the text inside HTML tags as if it were not there. A programmatic approach would simply be to remove all the HTML tags and then analyze the data. How could this be done in regex?

I've tried these things:
(?<=(?![^<]+>)this)(?![^<]+>)hello
(?![^<]+>)this\s* //this correctly matches "this" outside HTML tags, but combining it into the whole phrase gets more complicated

Thanks for looking.
effigy Posted February 19, 2008

<pre>
<?php
$data = '<b>this</b> hello <span id="this"> whatever </span> man';
print_r(
    preg_split('/<[^>]+>/', $data, -1, PREG_SPLIT_NO_EMPTY)
);
?>
</pre>
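For anyone following along, here is a minimal sketch of how the split pieces could be used to answer the original question: glue them back together and run the phrase pattern against the tag-free text. The variable names are only illustrative, and the flags differ slightly from the post above (no PREG_SPLIT_NO_EMPTY, since empty pieces are harmless once everything is joined).

```php
<?php
$data = '<b>this</b> hello <span id="this"> whatever </span> man';

// Split on anything that looks like a tag, then join the remaining text.
$pieces = preg_split('/<[^>]+>/', $data);
$text = implode('', $pieces);

// Match the phrase against the tag-free text; \s* absorbs the
// extra whitespace left behind where tags used to be.
$found = preg_match('/this\s*hello\s*whatever\s*man/i', $text, $m);

echo $found ? "matched: {$m[0]}\n" : "no match\n";
```

This only tells you that the phrase occurs in the visible text. It does not tell you where the match sits in the original haystack, because the offsets refer to the stripped string.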
dsaba Posted February 19, 2008

Ah yes, preg_split! Thanks effigy.

The concept behind my question is an interesting and useful one. I developed these functions as a general-purpose solution to my original problem.

Basically, preg_match_ignore_pat() returns an array just like preg_match_all() with the PREG_OFFSET_CAPTURE flag set, except that it searches for your $searchPat in a haystack while ignoring any data matching your $ignorePat. In other words, it searches through your $hay for $searchPat as if the data matched by $ignorePat were invisible.

phrase2pat() takes a plain (non-regex) search string and turns it into a preg-compliant search pattern. You don't have to use it, but it's helpful for turning regular search strings into patterns for preg_match_all(). You can also pass any pattern of your own, including ones with subgroups and even named subgroups.

A good example of this is simulating a phrase search on a web page in your browser. If you are looking at a Google cache page where certain terms are highlighted, the phrase reads the same to the user despite its different appearance, while in the source it has HTML tags inserted into it. Searching the data while ignoring HTML tags lets you look at it the way a browser user sees it and find words that sit next to each other. This is one example of use; the sky is the limit!
Enjoy

<?php
//turn a plain search phrase into a preg pattern
function phrase2pat($searchPhrase, $ignoreSpaces = true, $ignoreCase = true) {
    $searchPhrase = preg_quote($searchPhrase, '~');
    if ($ignoreSpaces) {
        $searchPhrase = implode('\s*', preg_split('~\s+~', $searchPhrase));
    }
    $searchPat = "~$searchPhrase~";
    if ($ignoreCase) {
        $searchPat .= 'i';
    }
    return $searchPat;
}

//search $hay for $searchPat as if everything matching $ignorePat were not there
function preg_match_ignore_pat($searchPat, $ignorePat, $hay) {
    $matches = preg_match_all($ignorePat, $hay, $out);
    if ($matches && !empty($out[0])) {
        $splitArr = preg_split($ignorePat, $hay, -1, PREG_SPLIT_OFFSET_CAPTURE);
        $splitArr2 = preg_split($ignorePat, $hay);
        $searchHay = implode('', $splitArr2);
        $findPhrase = preg_match_all($searchPat, $searchHay, $phraseOut, PREG_OFFSET_CAPTURE);
        //phrase was found
        if ($findPhrase && !empty($phraseOut[0])) {
            //make offArr: offset in the stripped hay => offset in the real hay
            $offArr = array();
            $newOff = 0;
            foreach ($splitArr as $k => $pieceInfo) {
                $offArr[$newOff] = $pieceInfo[1];
                $newOff = $newOff + strlen($pieceInfo[0]);
            }
            //find real offsets of matches of found phrase in haystack
            $pOut = array();
            foreach ($phraseOut as $subKey => $subArr) {
                foreach ($subArr as $matchInfo) {
                    //newMatchRealOff = closestRealOff + (newMatchNewOff - closestNewOff)
                    //find real off, start of match
                    $newOff_start = $matchInfo[1];
                    $x = $newOff_start + 1;
                    do {
                        $x--;
                    } while (!array_key_exists($x, $offArr));
                    $realOff_start = $offArr[$x] + ($newOff_start - $x);
                    //find real off, end of match
                    $newOff_end = ($matchInfo[1] + strlen($matchInfo[0])) - 1;
                    $x = $newOff_end + 1;
                    do {
                        $x--;
                    } while (!array_key_exists($x, $offArr));
                    $realOff_end = $offArr[$x] + ($newOff_end - $x);
                    $len = ($realOff_end - $realOff_start) + 1;
                    //out array
                    $pOut[$subKey][] = array(substr($hay, $realOff_start, $len), $realOff_start);
                }
            }
            return $pOut;
        } else {
            //phrase wasn't found
            return false;
        }
    } else {
        //the ignore data is not present in the haystack, so search through it normally with preg_match_all
        $matches = preg_match_all($searchPat, $hay, $out, PREG_OFFSET_CAPTURE);
        if ($matches && !empty($out[0])) {
            return $out;
        } else {
            return false;
        }
    }
}
?>
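The core trick in preg_match_ignore_pat() is the offset map ($offArr above). As a rough, standalone illustration of just that step, the sketch below builds the map for a short haystack and translates one offset back. The helper name to_original_offset() is made up for this example and is not part of the functions above.

```php
<?php
$hay = '<b>this</b> hello';

// Each piece comes back as [text, offset in the original haystack].
$pieces = preg_split('~<[^>]+>~', $hay, -1, PREG_SPLIT_OFFSET_CAPTURE);

// Map: offset in the stripped text => offset in the original haystack.
// If several pieces start at the same stripped offset, the last one wins.
$map = array();
$strippedOffset = 0;
foreach ($pieces as $piece) {
    $map[$strippedOffset] = $piece[1];
    $strippedOffset += strlen($piece[0]);
}

// Hypothetical helper: walk down to the closest known stripped offset,
// then add the distance from it (expects a non-negative offset).
function to_original_offset(array $map, int $strippedOffset): int {
    $key = $strippedOffset;
    while (!array_key_exists($key, $map)) {
        $key--;
    }
    return $map[$key] + ($strippedOffset - $key);
}

// In the stripped text "this hello", the word "hello" starts at offset 5.
$real = to_original_offset($map, 5);
echo substr($hay, $real, 5), " found at real offset $real\n";
```

The map always contains key 0 because the first piece starts at stripped offset 0, so the backwards walk terminates for any non-negative offset.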
-----------EXAMPLES-------------------------

Searching for a regular string phrase:

<?php
echo '<pre>';
$ignoreDataPat = '~<[^>]+>~';
$searchPat = phrase2pat('this hello whatever man');
$hay = '<b>this</b> hello <span id="this"> what<html tag>ever </span> man whatever';
$arr = preg_match_ignore_pat($searchPat, $ignoreDataPat, $hay);
print_r($arr);
echo '</pre>';

//RETURNS
/*
<pre>Array
(
    [0] => Array
        (
            [0] => Array
                (
                    [0] => this</b> hello <span id="this"> what<html tag>ever </span> man
                    [1] => 3
                )

        )

)
</pre>
*/
?>

Searching for a custom pattern:

<?php
echo '<pre>';
$ignoreDataPat = '~<[^>]+>~';
$searchPat = '~this\s*(?P<first>hello)\s*(?P<second>whatever)\s*man~i';
$hay = '<b>this</b> hello <span id="this"> what<html tag>ever </span> man whatever';
$arr = preg_match_ignore_pat($searchPat, $ignoreDataPat, $hay);
print_r($arr);
echo '</pre>';

//RETURNS
/*
<pre>Array
(
    [0] => Array
        (
            [0] => Array
                (
                    [0] => this</b> hello <span id="this"> what<html tag>ever </span> man
                    [1] => 3
                )

        )

    [first] => Array
        (
            [0] => Array
                (
                    [0] => hello
                    [1] => 12
                )

        )

    [1] => Array
        (
            [0] => Array
                (
                    [0] => hello
                    [1] => 12
                )

        )

    [second] => Array
        (
            [0] => Array
                (
                    [0] => what<html tag>ever
                    [1] => 35
                )

        )

    [2] => Array
        (
            [0] => Array
                (
                    [0] => what<html tag>ever
                    [1] => 35
                )

        )

)
</pre>
*/
?>
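To see what kind of pattern a plain phrase turns into before it is searched for, here is a small standalone sketch that repeats the same steps as phrase2pat() inline (quote, split on whitespace, join with \s*). The second phrase is an invented input, included only to show that regex metacharacters get escaped.

```php
<?php
// Same steps as phrase2pat() with both options enabled, written inline.
function sketch_phrase_to_pattern(string $phrase): string {
    $quoted = preg_quote($phrase, '~');            // escape regex metacharacters
    $words = preg_split('~\s+~', $quoted);         // split on runs of whitespace
    return '~' . implode('\s*', $words) . '~i';    // allow any whitespace between words
}

$patA = sketch_phrase_to_pattern('this hello whatever man');
$patB = sketch_phrase_to_pattern('C++ rocks');     // made-up phrase with metacharacters

echo $patA, "\n"; // ~this\s*hello\s*whatever\s*man~i
echo $patB, "\n"; // ~C\+\+\s*rocks~i
```

Because the words are joined with \s* rather than \s+, the resulting pattern also matches when the words run together with no whitespace at all, which is what lets it match across a removed tag.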
dsaba Posted February 19, 2008

I added the match text without the ignore-pattern data to the output array. You can make the edit yourself by changing the "out array" line to:

$pOut[$subKey][] = array($matchInfo[0], substr($hay, $realOff_start, $len), $realOff_start);

So now each entry contains:
[0] the match without the ignore-pattern data
[1] the match with the ignore-pattern data
[2] the offset at which it was found in the haystack