iambaz Posted December 19, 2006

Hi Everyone,

This has been causing me a headache for about a week now. I have created a glossary on my website. I need an expression that finds a given word in a body of HTML text and surrounds it with <strong></strong> tags. The problem is, I only want it to find the word if it is between <p></p> tags and isn't between <a></a> tags (or <h></h>, <li> tags, etc.).

I did manage to get it working so that it picks up text just between the <p> tags; however, it still picked up the word when it was in an <a> tag inside a <p> tag.

The current expression I have to do this is:

'(!<h[1-9]>)|(!>)|(?!<.*?) CONTENT (?![^<>]*?>)|(!</a>)|(!</h[1-9]>)'s

I hope I have explained myself well. I am so desperate that if anyone could fix this for me, or point me in the right direction, I will sort them out with a small payment in thanks for their time (I hope this isn't against board rules or anything).

Thanks in advance for any help.

Barry
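To see why the lookahead in that expression is not enough on its own, here is a minimal sketch. The glossary word widget and the HTML snippet are made up for illustration; only the (?![^<>]*>) part comes from the expression above. That lookahead rejects a word that sits inside a tag itself (such as an href value), but it says nothing about text between <a> and </a>, so the link text still gets wrapped:

```php
<?php
// Hypothetical sample input: one word outside the link, one in the href,
// and one in the link text.
$html = '<p>A widget is <a href="widget.html">a widget page</a>.</p>';

// Match the word only when the next angle bracket is not a closing ">",
// i.e. when the word is not inside a tag's own markup.
$pattern = '/\bwidget\b(?![^<>]*>)/i';

$result = preg_replace($pattern, '<strong>$0</strong>', $html);
echo $result, "\n";
// prints: <p>A <strong>widget</strong> is <a href="widget.html">a <strong>widget</strong> page</a>.</p>
```

The href value is left alone, but the word inside the link text is not, which matches the behaviour described in the post.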
c4onastick Posted December 20, 2006

Barry,

I think I may have a solution for you. Usually when a regex gets this complicated, I try to break it down into smaller, easier tasks. In this case you're essentially page scraping, so I'd start by pulling out all the <p> tags.

[code]preg_match_all( '/<p>(.+?)<\/p>/s', $html, $chocolate_outside );[/code]

That'll be half the battle right there.

After that, you know there are also tags in there that you don't want, like <a>'s, <h[1-9]>'s, and <(u|o)l>'s. Since we really don't care what's in them, we can just remove them.

[code]$not_cool = array(
    "/<a\s+href='[^']+'>.+?<\/a>/",
    '/<(h[1-9])>.+?<\/\1>/',
    '/<((?:u|o)l)>.+?<\/\1>/s'
);
$caramel_coating = preg_replace( $not_cool, '', $chocolate_outside[1][0] );[/code]

Then just match the stuff you need: the words in the <strong> tags.

[code]preg_match_all( '/<strong>([\w ]+)<\/strong>/s', $caramel_coating, $more_matches );[/code]

All together it looks like this.

[code]$strings = "<p><h3>Title here</h3>Here is a <strong>very strong word</strong> this <strong>one</strong> should also be picked up. But <a href='somewhere.html'><strong>Not this one</strong></a>. 
A list <ul>\n\t<li><strong>wont be</strong></li>\n\t<li>Picked up</li>\n</ul> although this <strong>last one</strong> could get picked up.</p>\n<ul>\n\t<li><strong>Not this one</strong></li>\n</ul>";

// Step 1: pull out everything between the <p> tags
preg_match_all( '/<p>(.+?)<\/p>/s', $strings, $chocolate_outside );
echo "Between p tags: ".$chocolate_outside[1][0]."\n";

// Step 2: strip the elements we don't care about (links, headings, lists)
$not_cool = array(
    "/<a\s+href='[^']+'>.+?<\/a>/",
    '/<(h[1-9])>.+?<\/\1>/',
    '/<((?:u|o)l)>.+?<\/\1>/s'
);
$caramel_coating = preg_replace( $not_cool, '', $chocolate_outside[1][0] );
echo "Removed all not cool tags: ".$caramel_coating."\n";

// Step 3: match the <strong> text that is left
preg_match_all( '/<strong>([\w ]+)<\/strong>/s', $caramel_coating, $more_matches );
foreach ($more_matches[1] as $gooey_center) {
    echo "The warm gooey center: ".$gooey_center."\n";
}[/code]

This should give you this (for my little contrived test scrape):

[code]Between p tags: <h3>Title here</h3>Here is a <strong>very strong word</strong> this <strong>one</strong> should also be picked up. But <a href='somewhere.html'><strong>Not this one</strong></a>. A list <ul> <li><strong>wont be</strong></li> <li>Picked up</li></ul> although this <strong>last one</strong> could get picked up.
Removed all not cool tags: Here is a <strong>very strong word</strong> this <strong>one</strong> should also be picked up. But . A list although this <strong>last one</strong> could get picked up.
The warm gooey center: very strong word
The warm gooey center: one
The warm gooey center: last one[/code]

Hopefully that gets you close. It's definitely not bullet-proof by any stretch, but it's a start!
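The same break-it-into-steps idea can also be pointed at the task from the first post, which is wrapping a glossary word rather than reading out existing <strong> text. What follows is only a sketch under assumptions: highlight_in_paragraphs() is a hypothetical helper name, the sample HTML and the word widget are invented, and it treats <a>, <h1> to <h6> and <li> as the elements to skip. It takes each <p> block, splits the body into protected pieces (whole skipped elements and individual tags) and plain text, and only touches the plain text:

```php
<?php
// Hypothetical helper: wrap $word in <strong> inside <p> blocks only,
// skipping <a>, heading and <li> elements and the markup of any tag.
function highlight_in_paragraphs(string $html, string $word): string
{
    $wordPattern = '/\b' . preg_quote($word, '/') . '\b/i';

    return preg_replace_callback(
        '/(<p\b[^>]*>)(.*?)(<\/p>)/is',
        function (array $p) use ($wordPattern): string {
            // One capture group, so preg_split() returns
            // text, delimiter, text, delimiter, ... in that order.
            $parts = preg_split(
                '/(<a\b[^>]*>.*?<\/a>|<h[1-6]\b[^>]*>.*?<\/h[1-6]>|<li\b[^>]*>.*?<\/li>|<[^>]+>)/is',
                $p[2],
                -1,
                PREG_SPLIT_DELIM_CAPTURE
            );
            foreach ($parts as $i => $part) {
                // Even indexes are plain text; odd indexes are protected pieces.
                if ($i % 2 === 0) {
                    $parts[$i] = preg_replace($wordPattern, '<strong>$0</strong>', $part);
                }
            }
            return $p[1] . implode('', $parts) . $p[3];
        },
        $html
    );
}

$sample = '<p>A widget is <a href="widget.html">a widget page</a> with <em>widget</em> parts.</p><h2>widget</h2>';
$out = highlight_in_paragraphs($sample, 'widget');
echo $out, "\n";
// prints: <p>A <strong>widget</strong> is <a href="widget.html">a widget page</a> with <em><strong>widget</strong></em> parts.</p><h2>widget</h2>
```

Like the code above, this is regex on HTML and so not bullet-proof: nested or malformed markup can still trip it up.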
Nicklas Posted December 21, 2006

You could do something like this:

[code]<?php
// Helper called from the preg_replace_callback() callback below
function strong($match, $word) {
    // Replace all <> tags and bbcode [] stuff with | (pipe) chars
    $tmp = preg_replace('/(<|\[).*?(>|])(.*?(<|\[)\s*\/.*?(>|]))?/s', '|', $match);

    // Surround the target word with <strong> tags
    $tmp = preg_replace('/\b' . preg_quote($word, '/') . '\b/is', '<strong>\\0</strong>', $tmp);

    // Match all <> tags and bbcode [] stuff from the un-edited $match var
    preg_match_all('/(<|\[).*?(>|])(.*?(<|\[)\s*\/.*?(>|]))?/s', $match, $tag);

    // Put all the tags back into $tmp by replacing the | chars with the tags stored in the $tag array
    foreach ($tag[0] as $html) {
        $tmp = preg_replace('/\|/', $html, $tmp, 1);
    }

    // ... and return the string
    return $tmp;
}

// the word you want to replace
$word = "text";

// ...and the string
$str = 'bla bla text <p>here is some text <a href="text.php">text link</a> and a bit <i>more text</i> here</p>. End of text.';

// Replace all $word's using a callback-function
$str = preg_replace_callback(
    '/(?<=<p>).*?\b' . preg_quote($word, '/') . '\b.*?(?=<\/p>)/is',
    function ($m) use ($word) { return strong($m[0], $word); },
    $str
);

// and the result is...
echo $str;
?>[/code]
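One caveat with the pipe placeholder above: text that already contains a | character, or tags containing $ or backslashes, can confuse the put-it-back step. A possible variation, sketched here under assumptions, keeps the same hide, edit, restore idea but uses numbered markers. The name mask_edit_restore() is hypothetical, the sketch only protects <> tags and whole <a> elements (not bbcode), and it assumes the word is not purely numeric, since a number could collide with a marker's index:

```php
<?php
// Hypothetical helper: hide tags and whole <a> elements behind numbered
// markers, wrap $word in the remaining text, then restore the hidden pieces.
function mask_edit_restore(string $fragment, string $word): string
{
    $stash = [];

    // Step 1: swap each protected piece for a NUL-delimited index.
    $masked = preg_replace_callback(
        '/<a\b[^>]*>.*?<\/a>|<[^>]+>/is',
        function (array $m) use (&$stash): string {
            $stash[] = $m[0];
            return "\x00" . (count($stash) - 1) . "\x00";
        },
        $fragment
    );

    // Step 2: wrap the word in what is left (plain text only).
    $masked = preg_replace(
        '/\b' . preg_quote($word, '/') . '\b/i',
        '<strong>$0</strong>',
        $masked
    );

    // Step 3: restore by index, so the stored markup is never parsed
    // as a replacement pattern.
    return preg_replace_callback(
        '/\x00(\d+)\x00/',
        function (array $m) use ($stash): string {
            return $stash[(int) $m[1]];
        },
        $masked
    );
}

$out = mask_edit_restore('a | b text <a href="text.php">text link</a> costs $5 text', 'text');
echo $out, "\n";
// prints: a | b <strong>text</strong> <a href="text.php">text link</a> costs $5 <strong>text</strong>
```

The literal pipe and the $5 in the sample pass through untouched because nothing in the restore step scans for them.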