steven fullman Posted December 31, 2009

Hello! I really hope you can help...

I have a function which replaces keywords with... other keywords. The trouble is, it also replaces words within <html> tags, which is a problem, since I'm now replacing (for example) <div class="good"> with <div class="great">... which obviously breaks the rendering.

What I'm trying to do (and, Boy! have I been trying...) is to ignore anything inside <tags>. I know preg_replace is the way to go, but I've been trying and failing for 2 days now... and I'd really appreciate your help.

In short, I need my preg_replace to replace all instances of my keyword... unless it's inside <html tags>. I need someone smarter than me!

Here's my code (without my attempt to "ignore" stuff within <html tags>):

    if (!function_exists("wordmash")) {
        function wordmash($parts) {
            // Load the comma-separated word list and build the replacement arrays
            $wordlist = file_get_contents('dictionary.txt', true);
            $dictionary = explode(",", $wordlist);
            $htmldictionary = array();
            foreach ($dictionary as $dicword) {
                // htmlcode() is a separate helper (defined elsewhere) that transforms each word
                $htmldictionary[] = htmlcode($dicword);
                $htmldictionary_u[] = htmlcode(strtoupper($dicword));
                $htmldictionary_u1[] = htmlcode(ucfirst($dicword));
                $htmldictionary_ucwords[] = htmlcode(ucwords($dicword));
            }
            // Replace each dictionary word, in each of the four case variants
            for ($i = 0; $i < count($dictionary); $i++) {
                $parts = preg_replace("/\b$dictionary[$i]\b/", $htmldictionary[$i], $parts);
                $parts = preg_replace("/\b" . strtoupper($dictionary[$i]) . "\b/", $htmldictionary_u[$i], $parts);
                $parts = preg_replace("/\b" . ucfirst($dictionary[$i]) . "\b/", $htmldictionary_u1[$i], $parts);
                $parts = preg_replace("/\b" . ucwords($dictionary[$i]) . "\b/", $htmldictionary_ucwords[$i], $parts);
            }
            return $parts;
        }
    }

If you need more info... please let me know!

Kind regards,
Steve

(I don't know the rules of this forum, but I'd like to buy the first person to clear this up for me a beer via PayPal... if that's acceptable?)
cags Posted December 31, 2009

Something along these lines should do the trick...

    $keyword = 'good';
    $replacement = 'great';
    $input = 'Please replace good if it is not inside a tag like <div class="good"> blah</div>';
    $pattern = "#(?=[^>]*(?:<|$))\b$keyword\b#s";
    $output = preg_replace($pattern, $replacement, $input);
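With that sample input, $output should come out as:

    Please replace great if it is not inside a tag like <div class="good"> blah</div>

i.e. the "good" inside the class attribute is left alone.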
steven fullman Posted January 1, 2010 (Author)

Hi Cags!

That's awesome, it seems to work perfectly... thank you!

I wonder if I can ask for a final piece of advice? If you look at my function from the first post, you'll notice it:

1. Opens up a text file of dictionary words (currently 17,000+)
2. Builds an array
3. Runs another function against the dictionary words (let's say, for argument's sake, it transforms them into unicode characters)
4. Then runs the regex you gave me. It does this 4 times, to account for the different cases.

Trouble is, the function performs the regex 17,000+ times x 4 (once for each dictionary word, and 3 more times for the other case variants)...

As you can imagine, that's hugely intensive, and runs very slowly...

Can you think of a more efficient way to perform the regex... without having to loop through so many times?

Kind regards, and thanks again for your help.

Steve
salathe Posted January 1, 2010

There are numerous points in your code which might not be as efficient as other methods. How intensive is "hugely intensive" and how slow is "very slowly", if you don't mind my asking?

As for performing the regex more efficiently, there are a couple of factors. First is just the number of function calls (e.g. 68,000 times) — calling any function 68,000 times will take a while. Secondly, the regular expression itself could likely be much more optimised.
steven fullman Posted January 1, 2010 (Author)

Hi Salathe,

Thanks for the reply!

Well, I guess "intensive" is the wrong word. What I meant was that the number of function calls is probably excessive.

To perform this function across an article of, say, 500 words takes about 15 seconds. The problem is, there are probably 10 articles per web page that this function runs against... so, most of the time, my page times out before the function completes.

If I take out the regex which filters out the <html> tags, it runs much faster. But I wonder if there's a better way to perform the preg_replace without having to loop it 68,000 times?

Kind regards,
Steve
cags Posted January 1, 2010

I think a lot of the advice we might give you will depend on what exactly it is you're trying to do. You say there may be 10 articles to a page and this takes 10+ seconds per article; why are you processing 10 articles at a time like this? Is your script some kind of scraper? What server are you running the code on? If you are doing heavy-duty scraping you're probably going to need a dedicated server, in which case you could change the time-out for scripts.

It would probably be useful if we knew a) what exactly the function htmlcode does and b) what the format of the entries in your wordlist text file looks like.
steven fullman Posted January 1, 2010 (Author)

Hi there,

Yes, I realise I need to give you a better understanding. OK, this code is for a WordPress plugin. It performs several actions, one of which replaces dictionary words with synonyms... Because WordPress can display many articles on a single page, the code will cycle through all of them when a page is requested.

Here's another example:

    function synonymize($string) {
        $buffer = $string;
        $folder = PLUGINDIR . "/myplugin";
        $synonymsfile = $folder . '/synonyms.txt';
        if ($synonymsfile = fopen($synonymsfile, "r")) {
            while (!feof($synonymsfile)) {
                // Read a line and strip the line-ending characters
                $synonyms = fgets($synonymsfile);
                $synonyms = str_replace("\n", "", $synonyms);
                $synonyms = str_replace("\r", "", $synonyms);
                // Only process lines that are at least 2 characters long
                if (!(strlen($synonyms) < 2)) {
                    $synonymlist = explode(",", $synonyms);
                    $oldword = $synonymlist[0];
                    $synonym = $synonymlist[1];
                    // Replace the word and its ucfirst variant, skipping matches inside HTML tags
                    $buffer = preg_replace("#(?=[^>]*(?:<|$))\b$oldword\b#s", $synonym, $buffer);
                    $buffer = preg_replace("#(?=[^>]*(?:<|$))\b" . ucfirst($oldword) . "\b#s", ucfirst($synonym), $buffer);
                }
            }
            fclose($synonymsfile);
        }
        return $buffer;
    }

So, as you can see, I open a text file (synonyms.txt)... the format is like this:

    abandonment,leaving behind
    abase,lower
    abash,humiliate
    abashed,embarrassed
    abate,stop
    abating,narrowing
    ABC,essentials
    abdicate,renounce
    abdication,resignation
    abdomen,belly
    abduct,kidnap
    abduction,kidnap
    ... 17,000 more lines ...
    zilch,zero
    zing,dynamism
    zip,vitality
    zipper,closure
    zippy,rapid
    zit,blackhead
    zombie,android

So, here's what happens:

1. I open the synonyms.txt file read-only.
2. I get a line, and remove any line-ending characters.
3. If the line is 2 characters or longer, I explode the line on the comma.
4. I then perform the preg_replace to replace the word with its synonym.
5. I then perform the preg_replace again to replace a ucfirst version of the word with its synonym.
6. Get the next line, and perform steps 3-5 again.

Now, steps 3-5 end up being executed 17,000 times, each with 2 preg_replace calls.

So, the problem I face is that, without the regex you kindly gave me in your first reply, I find that words within <html tags> also get "synonymed"... and with the extra regex code it runs particularly slowly.

But the real issue is that I suspect the way I'm attacking the problem is all wrong. It seems like there must be a better way to achieve the same result without having to create 17,000 arrays and call preg_replace 34,000 times...!

I hope that makes sense, and I really appreciate your help.

Kind regards,
Steve
cags Posted January 1, 2010

Hmm... the whole approach of doing something this 'complex' every time you load an article, especially when x number of articles can be loaded at once, seems a little redundant. I've never worked with WordPress, let alone created a plugin for it, but is there a particular reason why you cannot 'synonymize' the articles as they are created? Obviously you'd still want to optimise the script, but performance would be less important.

I'll be honest with you, optimisation has never been my strong point. I can normally come up with several different approaches, but I'm never entirely sure which will yield the best result, so I always end up just experimenting with them all until I come up with something.

I'd imagine that rather than looping through the entire array for standard case, uppercase, title case and ucfirst, you would be better off using preg_replace_callback: match the word in a case-insensitive manner, then work out which replacement to use specifically for that word. In my mind that reduces the number of loops performed by 75%, so I'd have thought that would improve performance. I also wonder if it would be faster to split the input into words, compare each word against the dictionary, make any replacements, then stitch it back together. I can't really do much experimentation though, as I don't have a dictionary file nor any articles to play about with.

I'm sure salathe will probably have some good ideas, just a matter of wait and see.
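To illustrate the callback idea (untested, and assuming PHP 5.3+ for the anonymous function), something along these lines is what I have in mind; $synonyms here is a hypothetical array keyed by the lowercase form of each dictionary word:

    // Build one pattern containing every dictionary word, matched case-insensitively,
    // reusing the same lookahead as before to skip matches inside tags.
    // (A 17,000-word alternation may need splitting into chunks in practice.)
    $words = array_map('preg_quote', array_keys($synonyms));
    $pattern = "#(?=[^>]*(?:<|$))\b(" . implode('|', $words) . ")\b#is";

    $output = preg_replace_callback($pattern, function ($m) use ($synonyms) {
        // Look up the replacement, then mirror the case of the matched word.
        $replacement = $synonyms[strtolower($m[1])];
        if ($m[1] === strtoupper($m[1])) {               // WORD -> REPLACEMENT
            return strtoupper($replacement);
        }
        if ($m[1] === ucfirst(strtolower($m[1]))) {      // Word -> Replacement
            return ucfirst($replacement);
        }
        return $replacement;                             // word -> replacement
    }, $input);

That way the article only gets scanned once per pattern, instead of once per word per case variant.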
salathe Posted January 1, 2010

There are lots of things that you could do to improve the execution time of the plugin. I'll touch on a couple, but it might take a few major iterations to create something you're really happy with.

Important point #1 is that each time you call the main function it re-opens, loads, parses and closes the synonyms file. Call the function 5 times, for 5 articles on a page, and it does lots of identical work 5 times. Ideally we would like to cut this down to only once (if at all, see later) per page execution. There are a number of different ways you could achieve this:

1. Load and parse the file into memory once and keep that hanging around for all of the other function calls, e.g. using a static function variable.
2. Change the dictionary into something easier to work with, e.g. a PHP array that you can just include (or require_once into a static variable).
3. Doubtless there are other ways too.

The next one is calling preg_replace many dozens of thousands of times. As with the above, we need to whittle this down if possible. One way is to do what cags mentioned and grab the target word in whatever state of case it is and deal with the case elsewhere, e.g. in a callback.

There's also optimising the regular expression itself. Currently, due to the arrangement of the lookahead, at every position in the article, every following non-greater-than character is looked at, then the less-than-or-end alternation is checked, then potentially a whole heap of backtracking happens until less-than-or-end is met, and then (and only then!) the engine moves on to check for the target word itself. That's a lot of overhead happening for each and every character in the article, even for non-matches. One quick fix would be to match the target word first, then later check that it is not within an HTML tag.

Even with a callback, preg_replace_callback might be called once for each word in the synonyms list (for each article). You could experiment with using arrays for the pattern and/or subject arguments to see if combining the regular expressions and/or splitting up the article helps at all.

Also, as cags mentioned, it would be far better to do the 'synonymization' once (when the article is published? or on some other event) than on each page load. This is a lot of work to be doing each and every time the page is loaded.

At the number of times they're being called, even functions like str_replace, strlen and explode might have a discernible effect on the running time of the function. If possible, it might be worth looking for other ways to achieve the same result: e.g. is strlen really needed (are there ever fewer than 3 characters on a line?)? You could replace the str_replace lines either with a single call to it or just by using rtrim around the fgets call. This last paragraph, though, is likely more about eking out tiny changes compared to the big stuff above.
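As a rough, untested sketch of point 1 and of the regex reordering (the helper name is made up; the file path is the one from your code):

    // Parse synonyms.txt once per request and keep the result in a static variable.
    function get_synonyms()
    {
        static $synonyms = null;

        if ($synonyms === null) {
            $synonyms = array();
            $lines = file(PLUGINDIR . '/myplugin/synonyms.txt',
                          FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES);
            if ($lines === false) {
                return $synonyms; // file missing or unreadable
            }
            foreach ($lines as $line) {
                if (strpos($line, ',') === false) {
                    continue; // skip malformed lines
                }
                list($word, $replacement) = explode(',', rtrim($line), 2);
                $synonyms[$word] = $replacement;
            }
        }

        return $synonyms;
    }

    // "Match the word first, then check it is not inside a tag": the lookahead now
    // only runs where the word has actually matched (assumes any literal > in the
    // article text itself is encoded as &gt;).
    $pattern = '#\b' . preg_quote($oldword, '#') . '\b(?![^<]*>)#';

Every call after the first just returns the already-parsed array, so the file I/O and explode work happen once per page rather than once per article.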
steven fullman Posted January 1, 2010 (Author)

This reply blew my mind. I had to send you a PM!

Kind regards,
Steve
steven fullman Posted January 1, 2010 (Author)

There's a reason why I run these regexes at "page request": I don't want to change someone's posts permanently.

Kind regards,
Steve
thebadbad Posted January 3, 2010

Quoting salathe: "One quick fix would be to match the target word first, then later check that it is not within an HTML tag."

@OP, you can find an alternative pattern that uses a look-behind instead of a look-ahead in this post: http://www.phpfreaks.com/forums/index.php/topic,258333.msg1215689.html#msg1215689
steven fullman Posted January 4, 2010 (Author)

Thanks! I'll look into that... it looks like exactly what I need!

Kind regards,
Steve