
Glossary regex problem


iambaz


Hi Everyone,

This has been causing me a headache for about a week now. I have created a glossary on my website. I need a regular expression that finds a given word in a body of HTML text and wraps that word in <strong></strong> tags.

The problem is, I only want it to match the word if it is between <p></p> tags and isn't between <a></a> tags (or <h></h>, <li>, etc. tags).

I did manage to get it working so that it picks up text only between the <p> tags; however, it still picked up the word when it was in an <a> tag inside a <p> tag.

The current expression I have to do this is:

'(!<h[1-9]>)|(!>)|(?!<.*?) CONTENT (?![^<>]*?>)|(!</a>)|(!</h[1-9]>)'s

I hope I have explained myself well. I am desperate enough that if anyone could fix this for me, or point me in the right direction, I will sort them out with a small payment in thanks for their time (I hope this isn't against board rules or anything).

Thanks for any help in advance.

Barry


Barry,

I think I may have a solution for you. When a regex gets this complicated, I usually try to break it down into smaller, easier tasks. In this case you're essentially page scraping, so I'd start by pulling out the contents of all the <p> tags.
[code]preg_match_all( '/<p>(.+?)<\/p>/s', $html, $chocolate_outside );[/code]
That'll be half the battle right there.
After that, you know there are also tags in there that you don't want, like <a>'s, <h[1-9]>'s, and <(u|o)l>'s. Since we really don't care what's in them, we can just remove them.
[code]$not_cool = array( "/<a\s+href='[^']+'>.+?<\/a>/", '/<(h[1-9])>.+?<\/\1>/', '/<((?:u|o)l)>.+?<\/\1>/s');
$caramel_coating = preg_replace( $not_cool, '', $chocolate_outside[1][0]);[/code]
Then just match the stuff you need: the words in the <strong> tags.
[code]preg_match_all( '/<strong>([\w ]+)<\/strong>/s', $caramel_coating, $more_matches);[/code]
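One thing worth watching when writing patterns like these: PHP treats "\1" inside a double-quoted string as an octal escape (a single byte with value 1), so a backreference only reaches the regex engine intact from a single-quoted string (or written as "\\1"). A small sketch of the difference, where $html, $single and $double are just names made up for the demonstration:

```php
<?php
// Made-up sample input for this demonstration only.
$html = "<ul><li>one</li></ul> tail";

// Single quotes: \1 reaches PCRE as a backreference to group 1 ("ul").
$single = '/<((?:u|o)l)>.+?<\/\1>/s';

// Double quotes: "\1" is converted by PHP to the byte chr(1) before
// PCRE ever sees it, so the pattern looks for "</" + chr(1) + ">".
$double = "/<((?:u|o)l)>.+?<\/\1>/s";

var_dump(strlen('\1'));               // int(2): a backslash and a "1"
var_dump(strlen("\1"));               // int(1): one control byte
var_dump(preg_match($single, $html)); // int(1): the list is matched
var_dump(preg_match($double, $html)); // int(0): nothing matches
```

Escapes that PHP does not recognise, such as \s or \/, pass through double quotes unchanged, which is why the <a> pattern still works either way.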

All together it looks like this.
[code]$strings = "<p><h3>Title here</h3>Here is a <strong>very strong word</strong> this <strong>one</strong> should also be picked up. But <a href='somewhere.html'><strong>Not this one</strong></a>. A list <ul>\n\t<li><strong>wont be</strong></li>\n\t<li>Picked up</li>\n</ul> although this <strong>last one</strong> could get picked up.</p>\n<ul>\n\t<li><strong>Not this one</strong></li>\n</ul>";

preg_match_all( '/<p>(.+?)<\/p>/s', $strings, $chocolate_outside );

echo "Between p tags: ".$chocolate_outside[1][0]."\n";

$not_cool = array( "/<a\s+href='[^']+'>.+?<\/a>/", '/<(h[1-9])>.+?<\/\1>/', '/<((?:u|o)l)>.+?<\/\1>/s');
$caramel_coating = preg_replace( $not_cool, '', $chocolate_outside[1][0]);

echo "Removed all not cool tags: ".$caramel_coating."\n";

preg_match_all( '/<strong>([\w ]+)<\/strong>/s', $caramel_coating, $more_matches);

foreach ($more_matches[1] as $gooey_center)
{
    echo "The warm gooey center: ".$gooey_center."\n";
}
[/code]
This should give you this (for my little contrived test scrape):
[code]Between p tags: <h3>Title here</h3>Here is a <strong>very strong word</strong> this <strong>one</strong> should also be picked up. But <a href='somewhere.html'><strong>Not this one</strong></a>. A list <ul>
        <li><strong>wont be</strong></li>
        <li>Picked up</li>
</ul> although this <strong>last one</strong> could get picked up.
Removed all not cool tags: Here is a <strong>very strong word</strong> this <strong>one</strong> should also be picked up. But . A list  although this <strong>last one</strong> could get picked up.
The warm gooey center: very strong word
The warm gooey center: one
The warm gooey center: last one
[/code]
Hopefully that gets you close. It's definitely not bullet-proof by any stretch, but it's a start!
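The same break-it-down idea can also be pointed at the original goal of wrapping a glossary word rather than extracting text. What follows is only a rough sketch under some assumptions: the function name glossary_highlight() is made up, the markup is reasonably well formed, and the set of tags to skip (<a>, <h1> to <h6>, <li>) is a guess at what the glossary needs. It takes each <p> block, splits it into tags and text, and only touches text that is not inside one of the skipped tags.

```php
<?php
// Hypothetical helper (not from the thread): wrap $word in <strong> tags
// inside <p>...</p>, skipping text nested in <a>, <h1>-<h6> or <li>.
function glossary_highlight($html, $word)
{
    // Escape the word so characters like "." or "+" are matched literally.
    $wordPattern = '/\b' . preg_quote($word, '/') . '\b/i';

    return preg_replace_callback(
        '/(<p\b[^>]*>)(.*?)(<\/p>)/is',
        function ($m) use ($wordPattern) {
            // With PREG_SPLIT_DELIM_CAPTURE the result alternates:
            // even indexes are text, odd indexes are tags.
            $parts = preg_split('/(<[^>]+>)/', $m[2], -1, PREG_SPLIT_DELIM_CAPTURE);
            $skipDepth = 0;

            foreach ($parts as $i => $part) {
                if ($i % 2 === 1) {
                    // Track whether we are inside a tag we want to skip.
                    if (preg_match('/^<(\/?)(?:a|h[1-6]|li)\b/i', $part, $tag)) {
                        $skipDepth = ($tag[1] === '') ? $skipDepth + 1 : max(0, $skipDepth - 1);
                    }
                } elseif ($skipDepth === 0) {
                    // Plain text outside any skipped tag: wrap the word.
                    $parts[$i] = preg_replace($wordPattern, '<strong>$0</strong>', $part);
                }
            }

            return $m[1] . implode('', $parts) . $m[3];
        },
        $html
    );
}

$str = 'bla bla text <p>here is some text <a href="text.php">text link</a> and a bit <i>more text</i> here</p>. End of text.';
echo glossary_highlight($str, 'text'), "\n";
```

Because the word is only ever replaced in the text pieces, attribute values such as href="text.php" are left alone. Like the code above, this is not bullet-proof: nested <p> tags, comments, or a ">" inside an attribute would confuse it.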
You could do something like this:
[code]<?php

// Helper called for each matched <p> block: wraps $word in <strong> tags
// while leaving anything inside other tags (or bbcode) untouched
function strong($match, $word) {

    // Matches <> tags and bbcode [] stuff, including their contents up to the closing tag
    $tag_pattern = '/(<|\[).*?(>|])(.*?(<|\[)\s*\/.*?(>|]))?/s';

    // Replace all <> tags and bbcode [] stuff with | (pipe) chars
    // (this assumes the text itself contains no | characters)
    $tmp = preg_replace($tag_pattern, '|', $match);

    // Surround the target word with <strong> tags
    $tmp = preg_replace('/\b' . preg_quote($word, '/') . '\b/is', '<strong>\\0</strong>', $tmp);

    // Match all <> tags and bbcode [] stuff from the un-edited $match var
    preg_match_all($tag_pattern, $match, $tag);

    // Put all the tags back into $tmp by replacing the | chars with the tags stored in the $tag array
    foreach ($tag[0] as $html) {
        $tmp = preg_replace('/\|/', $html, $tmp, 1);
    }

    // ... and return the string
    return $tmp;
}


// the word you want to replace
$word = "text";

// ...and the string
$str = 'bla bla text <p>here is some text <a href="text.php">text link</a> and a bit <i>more text</i> here</p>. End of text.';

// Replace all $word's inside <p>...</p> using a callback function
$str = preg_replace_callback(
    '/(?<=<p>).*?\b' . preg_quote($word, '/') . '\b.*?(?=<\/p>)/is',
    function ($m) use ($word) {
        return strong($m[0], $word);
    },
    $str
);

// and the result is...
echo $str;

?>[/code]
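A caveat on the step that puts the tags back: when a tag is passed to preg_replace() as the replacement string, sequences such as $1 or \0 inside it are read as backreferences, and a | inside an already restored tag can be mistaken for the next placeholder. Here is a minimal sketch of one way around that, assuming a made-up helper name restore_placeholders(); it uses preg_replace_callback(), whose return value is inserted literally and is not rescanned.

```php
<?php
// Hypothetical helper (not from the thread): put saved tags back in place
// of the | placeholders, one tag per placeholder, in order.
function restore_placeholders($text, array $tags)
{
    return preg_replace_callback(
        '/\|/',
        function ($m) use (&$tags) {
            // The returned string is inserted as-is, so "$1" or "\0" inside
            // a tag is not treated as a backreference. If the saved tags run
            // out, the pipe is left where it was.
            return $tags ? array_shift($tags) : $m[0];
        },
        $text
    );
}

// Made-up sample data: the first tag contains both "$1" and a pipe.
$tags = array('<a href="page.php?ref=$1">a|b</a>', '<i>more</i>');
echo restore_placeholders('see | and | here', $tags), "\n";
```

The whole string is handled in a single pass, so each placeholder is visited exactly once no matter what the restored tags contain.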

