iambaz Posted December 19, 2006

Hi everyone,

This has been giving me a headache for about a week now. I have created a glossary on my website, and I need an expression that finds a given word in a body of HTML text and surrounds it with <strong></strong> tags. The problem is that I only want it to find the word when it is between <p></p> tags and not between <a></a> tags (or <h></h>, <li>, etc. tags).

I did manage to get it working so that it only picks up text between the <p> tags; however, it still picks up the word when it is inside an <a> tag that is itself inside a <p> tag.

The current expression I have for this is:

'(!<h[1-9]>)|(!>)|(?!<.*?) CONTENT (?![^<>]*?>)|(!</a>)|(!</h[1-9]>)'s

I hope I have explained myself well. I am desperate enough that if anyone can fix this for me, or point me in the right direction, I will send them a small payment in thanks for their time (I hope this isn't against board rules). Thanks in advance for any help.

Barry
c4onastick Posted December 20, 2006

Barry,

I think I may have a solution for you. Usually when a regex gets this complicated, I try to break it down into smaller, easier tasks. In this case you're essentially page scraping, so I'd start by pulling out everything inside the <p> tags.

[code]preg_match_all( '/<p>(.+?)<\/p>/s', $html, $chocolate_outside );[/code]

That'll be half the battle right there.

After that, you know there are also tags in there that you don't want, like <a>'s, <h[1-9]>'s, and <(u|o)l>'s. Since we don't care what's in them, we can just remove them.

[code]
$not_cool = array(
    "/<a\s+href='[^']+'>.+?<\/a>/",
    // \1 must capture the whole tag name so the closing tag matches it
    '/<(h[1-9])>.+?<\/\1>/',
    // lists usually span several lines, hence the /s modifier
    '/<((?:u|o)l)>.+?<\/\1>/s'
);
$caramel_coating = preg_replace( $not_cool, '', $chocolate_outside[1][0] );
[/code]

Then just match the stuff you need: the words in the <strong> tags.

[code]preg_match_all( '/<strong>([\w ]+)<\/strong>/s', $caramel_coating, $more_matches );[/code]

All together it looks like this.

[code]
$strings = "<p><h3>Title here</h3>Here is a <strong>very strong word</strong> this "
    . "<strong>one</strong> should also be picked up. But "
    . "<a href='somewhere.html'><strong>Not this one</strong></a>. \n"
    . "A list <ul>\n\t<li><strong>wont be</strong></li>\n\t<li>Picked up</li>\n</ul> "
    . "although this <strong>last one</strong> could get picked up.</p>\n"
    . "<ul>\n\t<li><strong>Not this one</strong></li>\n</ul>";

// Step 1: grab everything between the <p> tags
preg_match_all( '/<p>(.+?)<\/p>/s', $strings, $chocolate_outside );
echo "Between p tags: ".$chocolate_outside[1][0]."\n";

// Step 2: throw away the links, headings and lists
$not_cool = array(
    "/<a\s+href='[^']+'>.+?<\/a>/",
    '/<(h[1-9])>.+?<\/\1>/',
    '/<((?:u|o)l)>.+?<\/\1>/s'
);
$caramel_coating = preg_replace( $not_cool, '', $chocolate_outside[1][0] );
echo "Removed all not cool tags: ".$caramel_coating."\n";

// Step 3: match the <strong> words that are left
preg_match_all( '/<strong>([\w ]+)<\/strong>/s', $caramel_coating, $more_matches );
foreach ($more_matches[1] as $gooey_center) {
    echo "The warm gooey center: ".$gooey_center."\n";
}
[/code]

This should give you this (for my little contrived test scrape):

[code]
Between p tags: <h3>Title here</h3>Here is a <strong>very strong word</strong> this <strong>one</strong> should also be picked up. But <a href='somewhere.html'><strong>Not this one</strong></a>. 
A list <ul>
	<li><strong>wont be</strong></li>
	<li>Picked up</li>
</ul> although this <strong>last one</strong> could get picked up.
Removed all not cool tags: Here is a <strong>very strong word</strong> this <strong>one</strong> should also be picked up. But . 
A list  although this <strong>last one</strong> could get picked up.
The warm gooey center: very strong word
The warm gooey center: one
The warm gooey center: last one
[/code]

Hopefully that gets you close. It's definitely not bullet-proof by any stretch, but it's a start!
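For the original goal of adding the <strong> tags (rather than reading them back out), the same skip-what-you-don't-want idea can be sketched in a single pass. This is only a rough sketch under some assumptions: highlight_term() and the sample markup are made up for illustration, headings are limited to <h1>-<h6>, and nested elements of the same type (a list inside a list) are not handled.

```php
<?php
// Hypothetical helper, not from the thread: wrap $term in <strong> inside
// <p>...</p>, skipping <a>, <h1>-<h6>, <ul> and <ol> elements.
function highlight_term($html, $term)
{
    $pattern = '/<(a|h[1-6]|ul|ol)\b[^>]*>.*?<\/\1>'   // a whole protected element
             . '|<[^>]+>'                               // any other single tag
             . '|\b(' . preg_quote($term, '/') . ')\b'  // the glossary term itself
             . '/is';

    return preg_replace_callback(
        '/(<p\b[^>]*>)(.*?)(<\/p>)/is',
        function ($p) use ($pattern) {
            $body = preg_replace_callback(
                $pattern,
                function ($m) {
                    // Group 2 only exists when the term alternative matched;
                    // protected elements and tags are returned unchanged.
                    return isset($m[2]) ? '<strong>' . $m[2] . '</strong>' : $m[0];
                },
                $p[2]
            );
            return $p[1] . $body . $p[3];
        },
        $html
    );
}

// Made-up sample markup for illustration
$html = '<p>A widget is small. See <a href="widget.html">widget docs</a>.</p>'
      . '<ul><li>widget list</li></ul>';
echo highlight_term($html, 'widget'), "\n";
```

The trick is the alternation: protected elements and other tags are matched first and handed back untouched, so the term is only wrapped when it shows up in plain paragraph text. In the sample, only the first "widget" ends up in <strong> tags; the ones in the link and in the list outside the paragraph are left alone.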
Nicklas Posted December 21, 2006

You could do something like this:

[code]
<?php
// A callback function used by preg_replace_callback()
function strong($match, $word) {
    // Replace all <> tags and bbcode [] stuff with | (pipe) chars
    $tmp = preg_replace('/(<|\[).*?(>|])(.*?(<|\[)\s*\/.*?(>|]))?/s', '|', $match);

    // Surround the target word with <strong> tags
    // (preg_quote() keeps special characters in $word from being read as regex syntax)
    $tmp = preg_replace('/\b' . preg_quote($word, '/') . '\b/is', '<strong>\\0</strong>', $tmp);

    // Match all <> tags and bbcode [] stuff from the un-edited $match var
    preg_match_all('/(<|\[).*?(>|])(.*?(<|\[)\s*\/.*?(>|]))?/s', $match, $tag);

    // Put all the tags back into $tmp by replacing the | chars with the tags stored in the $tag array
    foreach ($tag[0] as $html) {
        $tmp = preg_replace('/\|/', $html, $tmp, 1);
    }

    // ... and return the string
    return $tmp;
}

// The word you want to replace
$word = "text";

// ... and the string
$str = 'bla bla text <p>here is some text <a href="text.php">text link</a> and a bit <i>more text</i> here</p>. End of text.';

// Replace all $word's using a callback function.
// The /e modifier is not available in PHP 7 and later, so the callback
// goes through preg_replace_callback().
$str = preg_replace_callback(
    '/(?<=<p>).*?\b' . preg_quote($word, '/') . '\b.*?(?=<\/p>)/is',
    function ($m) use ($word) {
        return strong($m[0], $word);
    },
    $str
);

// and the result is...
echo $str;
?>
[/code]
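One thing to watch with the pipe placeholder: a literal | in the text, or something like $1 inside a tag, can throw off the step that puts the tags back. A possible variation on the same mask-and-restore idea is sketched below. It is only a sketch: strong_masked() is a made-up name, and it assumes the input never contains a NUL byte, which is used as the placeholder.

```php
<?php
// Hypothetical variant, not from the thread: mask tags with a NUL byte and
// restore them through a callback, so "$1" or "\1" inside a tag is never
// treated as a backreference and a literal "|" in the text is left alone.
function strong_masked($text, $word)
{
    $tagPattern  = '/(<|\[).*?(>|])(.*?(<|\[)\s*\/.*?(>|]))?/s';
    $placeholder = "\x00"; // assumption: NUL never occurs in the input

    // Remember every tag (with its contents), then mask it
    preg_match_all($tagPattern, $text, $tags);
    $masked = preg_replace($tagPattern, $placeholder, $text);

    // Wrap the word in whatever is left
    $masked = preg_replace(
        '/\b' . preg_quote($word, '/') . '\b/i',
        '<strong>$0</strong>',
        $masked
    );

    // Put the tags back in their original order
    $queue = $tags[0];
    return preg_replace_callback(
        '/\x00/',
        function () use (&$queue) {
            return array_shift($queue);
        },
        $masked
    );
}

// Made-up sample string for illustration
echo strong_masked('price text <a href="#">$1 text</a> a|b text', 'text'), "\n";
```

Because the callback returns each stored tag verbatim, nothing inside the tag is interpreted as replacement syntax, and the link text in the sample stays exactly as it was while the two plain occurrences of "text" get wrapped.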