
Matching whole phrase (not individual terms) not in html tags



I'm having trouble writing a regex that matches a phrase while ignoring any text inside HTML tags.

 

Surely a programming algorithm to match this is possible, but I'd like to do it in regex.

 

 

 

Haystack:

<b>this</b> hello <span id="this"> whatever </span> man

 

The expected match is shown IN CAPS below (it also includes the spaces surrounding the capitalized terms):

 

<b>THIS</b> HELLO <span id="this"> WHATEVER </span> MAN

 

so it should match the phrase:         

this\s*hello\s*whatever\s*man

 

 

 

Basically, I want to match this phrase and skip all the text inside HTML tags as if it were not there. A programmatic approach would simply be to remove all the HTML tags and then analyze what is left. How could this be done in regex?
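For reference, the "remove the tags first" approach mentioned above can be sketched as follows. This is only a minimal illustration: the tag pattern `~<[^>]+>~` is a simplifying assumption (it is not a real HTML parser), and the offset it reports is relative to the stripped text, not to the original haystack.

```php
<?php
// Sample haystack from the question.
$hay = '<b>this</b> hello <span id="this"> whatever </span> man';

// Remove anything that looks like a tag (assumed pattern, not a full HTML parser).
$stripped = preg_replace('~<[^>]+>~', '', $hay);

// Search the remaining text for the phrase.
$found = preg_match('~this\s*hello\s*whatever\s*man~i', $stripped, $m, PREG_OFFSET_CAPTURE);

// $m[0][0] is the matched text; $m[0][1] is its offset in $stripped, NOT in $hay.
var_dump($found, $m[0][0], $m[0][1]);
```

The drawback is visible in the last comment: once the tags are gone, the match position no longer tells you where the phrase sits in the original markup.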

 

 

 

I've tried these things:

 

(?<=(?![^<]+>)this)(?![^<]+>)hello

 

(?![^<]+>)this\s*

// This correctly matches "this" when it is not inside an HTML tag, but extending it to the whole phrase gets more complicated.
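One possible pure-regex direction, sketched here only as an illustration and not as the approach taken later in this thread, is to let the pattern itself tolerate tags between the characters and words of the phrase. The helper name `phrase2tagTolerantPat()` is hypothetical, the tag pattern is the same simplified `<[^>]+>`, and the sketch assumes single-byte (ASCII) phrases. Note that it does not stop a match from starting inside a tag, for example in an attribute value.

```php
<?php
// Hypothetical helper: build a pattern that allows tags between the
// characters of each word, and whitespace and/or tags between words.
function phrase2tagTolerantPat($phrase) {
	$gap = '(?:<[^>]+>)*';          // zero or more tags inside a word
	$wordGap = '(?:\s|<[^>]+>)*';   // whitespace and/or tags between words
	$parts = array();
	foreach (preg_split('~\s+~', trim($phrase)) as $word) {
		// Escape each character separately, then allow tags between them.
		$chars = array_map(function ($c) {
			return preg_quote($c, '~');
		}, str_split($word));
		$parts[] = implode($gap, $chars);
	}
	return '~' . implode($wordGap, $parts) . '~i';
}

$hay = '<b>this</b> hello <span id="this"> what<html tag>ever </span> man whatever';
preg_match(phrase2tagTolerantPat('this hello whatever man'), $hay, $m, PREG_OFFSET_CAPTURE);

// $m[0][0] is the matched text including the tags, $m[0][1] its offset in $hay.
var_dump($m[0][0], $m[0][1]);
```

The generated pattern grows with the length of the phrase, so this is mainly practical for short search strings.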

 

 

Thanks for looking.

Ah yes preg_split! :)

Thanks effigy.

 

The idea behind my question turned out to be an interesting and useful one. I developed the functions below for general use around that idea, and of course they are also the solution to my original problem.

 

Basically, preg_match_ignore_pat() returns an array shaped like the matches array that preg_match_all() fills in when given the PREG_OFFSET_CAPTURE flag (or false when nothing is found), except that it searches the haystack for your $searchPat while ignoring any data that matches your $ignorePat. In effect, it searches through $hay for $searchPat as if the data matched by $ignorePat were invisible. The phrase2pat() function takes a plain, non-regex search string and turns it into a valid preg search pattern. You don't have to use it, but it's helpful for turning ordinary search strings into patterns for the preg functions. You can pass any kind of pattern as $searchPat, including ones with subgroups and even named subgroups.
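The key step behind "searching as if the ignored data were invisible" is translating an offset in the stripped text back to an offset in the original haystack. A minimal sketch of that mapping is below; the helper names `buildOffsetMap()` and `toRealOffset()` are hypothetical and only illustrate the idea, and the sketch assumes a non-negative offset.

```php
<?php
// Build a map: offset in the stripped text => offset in the original text,
// with one entry per piece left between the ignored spans.
function buildOffsetMap($ignorePat, $hay) {
	$map = array();
	$strippedOff = 0;
	foreach (preg_split($ignorePat, $hay, -1, PREG_SPLIT_OFFSET_CAPTURE) as $piece) {
		// An empty piece is overwritten by the next piece at the same key.
		$map[$strippedOff] = $piece[1];
		$strippedOff += strlen($piece[0]);
	}
	return $map;
}

// Translate a (non-negative) offset in the stripped text back to the original text.
function toRealOffset(array $map, $strippedOff) {
	$x = $strippedOff;
	while (!array_key_exists($x, $map)) {
		$x--; // walk back to the start of the piece containing this offset
	}
	return $map[$x] + ($strippedOff - $x);
}

$hay = '<b>this</b> hello';
$map = buildOffsetMap('~<[^>]+>~', $hay);
print_r($map);

// Offset 5 is the "h" of "hello" in the stripped text "this hello".
echo toRealOffset($map, 5), "\n";
```

Because the map always contains key 0, the backward walk terminates for any non-negative offset.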

 

An excellent example of this is simulating a phrase search on a web page in your browser. If you are looking at a Google cache page where certain terms are highlighted, the phrase looks the same to the reader apart from its appearance, while in the source it has HTML tags mixed into it. Searching through the data while ignoring HTML tags lets you look at the data the way a browser or user sees it and find words that are next to each other, and so on. This is just one example of use; the sky is the limit! Enjoy :)

<?php
// Turn a plain search phrase into a preg pattern.
function phrase2pat($searchPhrase, $ignoreSpaces = true, $ignoreCase = true) {
	// Escape regex metacharacters and the ~ delimiter
	$searchPhrase = preg_quote($searchPhrase, '~');
	if ($ignoreSpaces) {
		// Any run of whitespace in the phrase matches any amount of whitespace (even none)
		$searchPhrase = implode('\s*', preg_split('~\s+~', $searchPhrase));
	}
	$searchPat = "~$searchPhrase~";
	if ($ignoreCase) {
		$searchPat .= 'i';
	}
	return $searchPat;
}

// Search $hay for $searchPat as if everything matched by $ignorePat were not there.
// Returns the matches with offsets into the original $hay, or false if nothing was found.
function preg_match_ignore_pat($searchPat, $ignorePat, $hay) {
	$matches = preg_match_all($ignorePat, $hay, $out);
	if ($matches && !empty($out[0])) {
		// Pieces between the ignored data, each with its real offset in $hay
		$splitArr = preg_split($ignorePat, $hay, -1, PREG_SPLIT_OFFSET_CAPTURE);
		$splitArr2 = preg_split($ignorePat, $hay);

		// Haystack with the ignored data removed
		$searchHay = implode('', $splitArr2);
		$findPhrase = preg_match_all($searchPat, $searchHay, $phraseOut, PREG_OFFSET_CAPTURE);

		//phrase was found
		if ($findPhrase && !empty($phraseOut[0])) {
			//make offArr: offset in $searchHay => real offset in $hay, one entry per piece
			$offArr = array();
			$newOff = 0;
			foreach ($splitArr as $pieceInfo) {
				$offArr[$newOff] = $pieceInfo[1];
				$newOff = $newOff + strlen($pieceInfo[0]);
			}

			//find real offsets of matches of found phrase in haystack
			$pOut = array();
			foreach ($phraseOut as $subKey => $subArr) {
				foreach ($subArr as $matchInfo) {
					//newMatchRealOff = closestRealOff + (newMatchNewOff - closestNewOff)
					$newOff_start = $matchInfo[1];
					if ($newOff_start < 0) {
						//subgroup did not take part in the match: keep the -1 offset
						$realOff_start = -1;
						$len = 0;
					} else {
						//find real off, start of match
						$x = $newOff_start + 1;
						do {
							$x--;
						} while (!array_key_exists($x, $offArr));
						$realOff_start = $offArr[$x] + ($newOff_start - $x);

						//find real off, end of match (an empty match has no last character)
						$len = 0;
						if (strlen($matchInfo[0]) > 0) {
							$newOff_end = ($matchInfo[1] + strlen($matchInfo[0])) - 1;
							$x = $newOff_end + 1;
							do {
								$x--;
							} while (!array_key_exists($x, $offArr));
							$realOff_end = $offArr[$x] + ($newOff_end - $x);
							$len = ($realOff_end - $realOff_start) + 1;
						}
					}

					//out array
					$pOut[$subKey][] = array(substr($hay, $realOff_start, $len), $realOff_start);
				}
			}
			return $pOut;
		} else {
			//phrase wasn't found
			return false;
		}
	} else {
		//the ignore data is not present in the haystack, so search through it normally with preg_match_all
		$matches = preg_match_all($searchPat, $hay, $out, PREG_OFFSET_CAPTURE);
		if ($matches && !empty($out[0])) {
			return $out;
		} else {
			return false;
		}
	}
}
?>

 

 

 

 

-----------EXAMPLES-------------------------

Searching for a regular string phrase:

<?php
echo '<pre>';
$ignoreDataPat = '~<[^>]+>~';
$searchPat = phrase2pat('this hello whatever man');
$hay = '<b>this</b> hello <span id="this"> what<html tag>ever </span> man whatever';
$arr = preg_match_ignore_pat($searchPat, $ignoreDataPat, $hay);
print_r($arr);
echo '</pre>';
//RETURNS
/*
<pre>Array
(
    [0] => Array
        (
            [0] => Array
                (
                    [0] => this</b> hello <span id="this"> what<html tag>ever </span> man
                    [1] => 3
                )

        )

)
</pre>
*/
?>

 

Searching for a custom pattern:

<?php
echo '<pre>';
$ignoreDataPat = '~<[^>]+>~';
$searchPat = '~this\s*(?P<first>hello)\s*(?P<second>whatever)\s*man~i';
$hay = '<b>this</b> hello <span id="this"> what<html tag>ever </span> man whatever';
$arr = preg_match_ignore_pat($searchPat, $ignoreDataPat, $hay);
print_r($arr);
echo '</pre>';
//RETURNS
/*
<pre>Array
(
    [0] => Array
        (
            [0] => Array
                (
                    [0] => this</b> hello <span id="this"> what<html tag>ever </span> man
                    [1] => 3
                )

        )

    [first] => Array
        (
            [0] => Array
                (
                    [0] => hello
                    [1] => 12
                )

        )

    [1] => Array
        (
            [0] => Array
                (
                    [0] => hello
                    [1] => 12
                )

        )

    [second] => Array
        (
            [0] => Array
                (
                    [0] => what<html tag>ever
                    [1] => 35
                )

        )

    [2] => Array
        (
            [0] => Array
                (
                    [0] => what<html tag>ever
                    [1] => 35
                )

        )

)
</pre>
*/
?>

*I also added the matched text without the ignore-pattern data to the output array. You can make the edit yourself by changing the $pOut line in preg_match_ignore_pat() to:

 

$pOut[$subKey][] = array($matchInfo[0], substr($hay, $realOff_start, $len), $realOff_start);

 

so now it outputs:

[0] without the ignore pat data
[1] with the ignore pat data
[2] the offset found within the haystack
