steven fullman Posted December 31, 2009

Hello! I really hope you can help...

I have a function which replaces keywords with... other keywords. The trouble is, it also replaces words within <html> tags, which is a problem, since I'm now replacing (for example) <div class="good"> with <div class="great">... which obviously breaks the rendering.

What I'm trying to do (and, Boy! have I been trying...) is to ignore anything inside <tags>. I know preg_replace is the way to go, but I've been trying and failing for 2 days now... and I'd really appreciate your help.

In short, I need my preg_replace to replace all instances of my keyword... unless it's inside <html tags>. I need someone smarter than me!

Here's my code (without my attempt to "ignore" stuff within <html tags>):

    if (!function_exists("wordmash")) {
        function wordmash($parts) {
            // Load the comma-separated word list and build the replacement arrays
            $wordlist = file_get_contents('dictionary.txt', true);
            $dictionary = explode(",", $wordlist);
            $htmldictionary = array();
            foreach ($dictionary as $dicword) {
                // htmlcode() is a separate helper (defined elsewhere) that transforms each word
                $htmldictionary[] = htmlcode($dicword);
                $htmldictionary_u[] = htmlcode(strtoupper($dicword));
                $htmldictionary_u1[] = htmlcode(ucfirst($dicword));
                $htmldictionary_ucwords[] = htmlcode(ucwords($dicword));
            }
            // Replace each dictionary word, in each of the four case variants
            for ($i = 0; $i < count($dictionary); $i++) {
                $parts = preg_replace("/\b$dictionary[$i]\b/", $htmldictionary[$i], $parts);
                $parts = preg_replace("/\b" . strtoupper($dictionary[$i]) . "\b/", $htmldictionary_u[$i], $parts);
                $parts = preg_replace("/\b" . ucfirst($dictionary[$i]) . "\b/", $htmldictionary_u1[$i], $parts);
                $parts = preg_replace("/\b" . ucwords($dictionary[$i]) . "\b/", $htmldictionary_ucwords[$i], $parts);
            }
            return $parts;
        }
    }

If you need more info... please let me know!

Kind regards,
Steve

(I don't know the rules of this forum, but I'd like to buy the first person to clear this up for me a beer via PayPal... if that's acceptable?)
cags Posted December 31, 2009

Something along these lines should do the trick...

    $keyword = 'good';
    $replacement = 'great';
    $input = 'Please replace good if it is not inside a tag like <div class="good"> blah</div>';
    $pattern = "#(?=[^>]*(?:<|$))\b$keyword\b#s";
    $output = preg_replace($pattern, $replacement, $input);
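With that sample input, $output should come out as:

    Please replace great if it is not inside a tag like <div class="good"> blah</div>

i.e. the "good" inside the class attribute is left alone.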
steven fullman Posted January 1, 2010 (Author)

Hi Cags!

That's awesome, it seems to work perfectly... thank you!

I wonder if I can ask for a final piece of advice? If you look at my function from the first post, you'll notice it:

1. Opens up a text file of dictionary words (currently 17,000+)
2. Builds an array
3. Runs another function against the dictionary words (let's say, for argument's sake, it transforms them into unicode characters)
4. Then runs the regex you gave me. It does this 4 times, to account for the different cases.

Trouble is, the function performs the regex 17,000+ times x 4 (once for each dictionary word, and 3 more times for the other case variants)...

As you can imagine, that's hugely intensive, and runs very slowly...

Can you think of a more efficient way to perform the regex... without having to loop through so many times?

Kind regards, and thanks again for your help.

Steve
salathe Posted January 1, 2010

There are numerous points in your code which might not be as efficient as other methods. How intensive is "hugely intensive" and how slow is "very slowly", if you don't mind my asking?

As for performing the regex more efficiently, there are a couple of factors. First is just the number of function calls (e.g. 68,000 times) — calling any function 68,000 times will take a while. Secondly, the regular expression itself could likely be much more optimised.
steven fullman Posted January 1, 2010 (Author)

Hi Salathe,

Thanks for the reply!

Well, I guess "intensive" is the wrong word. What I meant was that the number of function calls is probably excessive.

To perform this function across an article of, say, 500 words takes about 15 seconds. The problem is, there are probably 10 articles per web page that this function runs against... so, most of the time, my page times out before the function completes.

If I take out the regex which filters out the <html> tags, it runs much faster. But I wonder if there's a better way to perform the preg_replace without having to loop it 68,000 times?

Kind regards,
Steve
cags Posted January 1, 2010

I think a lot of the advice we might give you will depend on what exactly it is you're trying to do. You say there may be 10 articles to a page and this takes 10+ seconds per article; why are you processing 10 articles at a time like this? Is your script some kind of scraper? What server are you running the code on? If you are doing heavy-duty scraping you're probably going to need a dedicated server, in which case you could change the time-out for scripts.

It would probably be useful if we knew a) what exactly the function htmlcode does and b) what the format of the entries in your wordlist text file looks like.
steven fullman Posted January 1, 2010 (Author)

Hi there,

Yes, I realise I need to give you a better understanding. OK, this code is for a WordPress plugin. It performs several actions, one of which replaces dictionary words with synonyms... Because WordPress can display many articles on a single page, the code will cycle through all of them when a page is requested.

Here's another example:

    function synonymize($string) {
        $buffer = $string;
        $folder = PLUGINDIR . "/myplugin";
        $synonymsfile = $folder . '/synonyms.txt';
        if ($synonymsfile = fopen($synonymsfile, "r")) {
            while (!feof($synonymsfile)) {
                // Read a line and strip the line-ending characters
                $synonyms = fgets($synonymsfile);
                $synonyms = str_replace("\n", "", $synonyms);
                $synonyms = str_replace("\r", "", $synonyms);
                // Only process lines that are at least 2 characters long
                if (!(strlen($synonyms) < 2)) {
                    $synonymlist = explode(",", $synonyms);
                    $oldword = $synonymlist[0];
                    $synonym = $synonymlist[1];
                    // Replace the word and its ucfirst variant, skipping matches inside HTML tags
                    $buffer = preg_replace("#(?=[^>]*(?:<|$))\b$oldword\b#s", $synonym, $buffer);
                    $buffer = preg_replace("#(?=[^>]*(?:<|$))\b" . ucfirst($oldword) . "\b#s", ucfirst($synonym), $buffer);
                }
            }
            fclose($synonymsfile);
        }
        return $buffer;
    }

So, as you can see, I open a text file (synonyms.txt)... the format is like this:

    abandonment,leaving behind
    abase,lower
    abash,humiliate
    abashed,embarrassed
    abate,stop
    abating,narrowing
    ABC,essentials
    abdicate,renounce
    abdication,resignation
    abdomen,belly
    abduct,kidnap
    abduction,kidnap
    ... 17,000 more lines ...
    zilch,zero
    zing,dynamism
    zip,vitality
    zipper,closure
    zippy,rapid
    zit,blackhead
    zombie,android

So, here's what happens:

1. I open the synonyms.txt file read-only.
2. I get a line, and remove any line-ending characters.
3. If the line is 2 characters or longer, I explode the line on the comma.
4. I then perform the preg_replace to replace the word with its synonym.
5. I then perform the preg_replace again to replace a ucfirst version of the word with its synonym.
6. Get the next line, and perform steps 3-5 again.

Now, steps 3-5 end up being executed 17,000 times, each with 2 preg_replace calls.

So, the problem I face is that, without the regex you kindly gave me in your first reply, I find that words within <html tags> also get "synonymed"... and with the extra regex code it runs particularly slowly.

But the real issue is that I suspect the way I'm attacking the problem is all wrong. It seems like there must be a better way to achieve the same result without having to create 17,000 arrays and call preg_replace 34,000 times...!

I hope that makes sense, and I really appreciate your help.

Kind regards,
Steve
cags Posted January 1, 2010

Hmm... the whole approach of doing something this 'complex' every time you load an article, especially when x number of articles can be loaded at once, seems a little redundant. I've never worked with WordPress, let alone created a plugin for it, but is there a particular reason why you cannot 'synonymize' the articles as they are created? Obviously you'd still want to optimise the script, but performance would be less important.

I'll be honest with you, optimisation has never been my strong point. I can normally come up with several different approaches, but I'm never entirely sure which will yield the best result, so I always end up just experimenting with them all until I come up with something.

I'd imagine that rather than looping through the entire array for standard case, uppercase, title case and ucfirst, you would be better off using preg_replace_callback: match the word in a case-insensitive manner, then work out which replacement to use specifically for that word. In my mind that reduces the number of loops performed by 75%, so I'd have thought that would improve performance. I also wonder if it would be faster to split the input into words, compare each word against the dictionary, make any replacements, then stitch it back together. I can't really do much experimentation though, as I don't have a dictionary file nor any articles to play about with.

I'm sure salathe will probably have some good ideas, just a matter of wait and see.
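To illustrate the callback idea (untested, and assuming PHP 5.3+ for the anonymous function), something along these lines is what I have in mind; $synonyms here is a hypothetical array keyed by the lowercase form of each dictionary word:

    // Build one pattern containing every dictionary word, matched case-insensitively,
    // reusing the same lookahead as before to skip matches inside tags.
    // (A 17,000-word alternation may need splitting into chunks in practice.)
    $words = array_map('preg_quote', array_keys($synonyms));
    $pattern = "#(?=[^>]*(?:<|$))\b(" . implode('|', $words) . ")\b#is";

    $output = preg_replace_callback($pattern, function ($m) use ($synonyms) {
        // Look up the replacement, then mirror the case of the matched word.
        $replacement = $synonyms[strtolower($m[1])];
        if ($m[1] === strtoupper($m[1])) {               // WORD -> REPLACEMENT
            return strtoupper($replacement);
        }
        if ($m[1] === ucfirst(strtolower($m[1]))) {      // Word -> Replacement
            return ucfirst($replacement);
        }
        return $replacement;                             // word -> replacement
    }, $input);

That way the article only gets scanned once per pattern, instead of once per word per case variant.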
salathe Posted January 1, 2010

There are lots of things that you could do to improve the execution time of the plugin. I'll touch on a couple, but it might take a few major iterations to create something you're really happy with.

Important point #1 is that each time you call the main function it re-opens, loads, parses and closes the synonyms file. Call the function 5 times, for 5 articles on a page, and it does lots of identical work 5 times. Ideally we would like to cut this down to only once (if at all, see later) per page execution. There are a number of different ways you could achieve this:

1. Load and parse the file into memory once and keep that hanging around for all of the other function calls, e.g. using a static function variable.
2. Change the dictionary into something easier to work with, e.g. a PHP array that you can just include (or require_once into a static variable).
3. Doubtless there are other ways too.

The next one is calling preg_replace many dozens of thousands of times. As with the above, we need to whittle this down if possible. One way is to do what cags mentioned and grab the target word in whatever state of case it is and deal with the case elsewhere, e.g. in a callback.

There's also optimising the regular expression itself. Currently, due to the arrangement of the lookahead, at every position in the article, every following non-greater-than character is looked at, then the less-than-or-end alternation is checked, then potentially a whole heap of backtracking happens until less-than-or-end is met, and then (and only then!) the engine moves on to check for the target word itself. That's a lot of overhead happening for each and every character in the article, even for non-matches. One quick fix would be to match the target word first, then later check that it is not within an HTML tag.

Even with a callback, preg_replace_callback might be called once for each word in the synonyms list (for each article). You could experiment with using arrays for the pattern and/or subject arguments to see if combining the regular expressions and/or splitting up the article helps at all.

Also, as cags mentioned, it would be far better to do the 'synonymization' once (when the article is published? or on some other event) than on each page load. This is a lot of work to be doing each and every time the page is loaded.

At the number of times they're being called, even functions like str_replace, strlen and explode might have a discernible effect on the running time of the function. If possible, it might be worth looking for other ways to achieve the same result: e.g. is strlen really needed (are there ever fewer than 3 characters on a line?)? You could replace the str_replace lines either with a single call to it or just by using rtrim around the fgets call. This last paragraph, though, is likely more about eking out tiny changes compared to the big stuff above.
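As a rough, untested sketch of point 1 and of the regex reordering (the helper name is made up; the file path is the one from your code):

    // Parse synonyms.txt once per request and keep the result in a static variable.
    function get_synonyms()
    {
        static $synonyms = null;

        if ($synonyms === null) {
            $synonyms = array();
            $lines = file(PLUGINDIR . '/myplugin/synonyms.txt',
                          FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES);
            if ($lines === false) {
                return $synonyms; // file missing or unreadable
            }
            foreach ($lines as $line) {
                if (strpos($line, ',') === false) {
                    continue; // skip malformed lines
                }
                list($word, $replacement) = explode(',', rtrim($line), 2);
                $synonyms[$word] = $replacement;
            }
        }

        return $synonyms;
    }

    // "Match the word first, then check it is not inside a tag": the lookahead now
    // only runs where the word has actually matched (assumes any literal > in the
    // article text itself is encoded as &gt;).
    $pattern = '#\b' . preg_quote($oldword, '#') . '\b(?![^<]*>)#';

Every call after the first just returns the already-parsed array, so the file I/O and explode work happen once per page rather than once per article.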
steven fullman Posted January 1, 2010 (Author)

This reply blew my mind. I had to send you a PM!

Kind regards,
Steve
steven fullman Posted January 1, 2010 (Author)

There's a reason why I run these regexes at "page request": I don't want to change someone's posts permanently.

Kind regards,
Steve
thebadbad Posted January 3, 2010

Quoting salathe: "One quick fix would be to match the target word first, then later check that it is not within an HTML tag."

@OP, you can find an alternative pattern that uses a look-behind instead of a look-ahead in this post: http://www.phpfreaks.com/forums/index.php/topic,258333.msg1215689.html#msg1215689
steven fullman Posted January 4, 2010 (Author)

Thanks! I'll look into that... it looks like exactly what I need!

Kind regards,
Steve