Glossary regex problem


iambaz

Hi Everyone,

This has been causing me a headache for about a week now. I have created a glossary on my website. I need the expression to find a given word in a body of HTML text and surround the given word with the <strong></strong> tags.

The problem is that I only want it to find the word if it is between <p></p> tags and isn't between <a></a> tags (or <h></h>, <li></li> and similar tags).

I did manage to get it working so that it picks up text just between the <p> tags. However, it still picked up the word when it was in an <a> tag that was inside a <p> tag.

The current expression I have to do this is:

'(!<h[1-9]>)|(!>)|(?!<.*?) CONTENT (?![^<>]*?>)|(!</a>)|(!</h[1-9]>)'s

I hope I have explained myself well. I am desperate enough that if anyone could fix this for me, or point me in the right direction, I will sort them out with a small payment in thanks for their time (I hope this isn't against board rules or anything).

Thanks for any help in advance.

Barry


Barry,

I think I may have a solution for you. Usually when a regex gets really complicated like this, I try to break it down into smaller tasks that are easier. In this case, you're essentially page scraping, so I'd start just by pulling out all the <p> tags.
[code]preg_match_all( '/<p>(.+?)<\/p>/s', $html, $chocolate_outside );[/code]
That'll be half the battle right there.

After that, you know that there are also tags in there that you don't want, like <a>'s, <h[1-9]>'s, and <(u|o)l>'s. Since we really don't care what's in there, we can just remove it. Note the single quotes around the patterns that use \1 (inside double quotes PHP would read \1 as an octal escape), and the /s flag so that a list spanning several lines is still matched.
[code]$not_cool = array( "/<a\s+href='[^']+'>.+?<\/a>/", '/<(h[1-9])>.+?<\/\1>/', '/<((?:u|o)l)>.+?<\/\1>/s' );
$caramel_coating = preg_replace( $not_cool, '', $chocolate_outside[1][0] );[/code]
Then just match the stuff you need: the words in the <strong> tags.
[code]preg_match_all( '/<strong>([\w ]+)<\/strong>/s', $caramel_coating, $more_matches );[/code]

All together it looks like this.
[code]$strings = "<p><h3>Title here</h3>Here is a <strong>very strong word</strong> this <strong>one</strong> should also be picked up. But <a href='somewhere.html'><strong>Not this one</strong></a>. A list <ul>\n\t<li><strong>wont be</strong></li>\n\t<li>Picked up</li>\n</ul> although this <strong>last one</strong> could get picked up.</p>\n<ul>\n\t<li><strong>Not this one</strong></li>\n</ul>";

preg_match_all( '/<p>(.+?)<\/p>/s', $strings, $chocolate_outside );

echo "Between p tags: ".$chocolate_outside[1][0]."\n";

$not_cool = array( "/<a\s+href='[^']+'>.+?<\/a>/", '/<(h[1-9])>.+?<\/\1>/', '/<((?:u|o)l)>.+?<\/\1>/s');
$caramel_coating = preg_replace( $not_cool, '', $chocolate_outside[1][0]);

echo "Removed all not cool tags: ".$caramel_coating."\n";

preg_match_all( '/<strong>([\w ]+)<\/strong>/s', $caramel_coating, $more_matches);

foreach ($more_matches[1] as $gooey_center)
{
    echo "The warm gooey center: ".$gooey_center."\n";
}
[/code]
This should give you this (for my little contrived test scrape):
[code]Between p tags:

<h3>Title here</h3>Here is a <strong>very strong word</strong> this <strong>one</strong> should also be picked up. But <a href='somewhere.html'><strong>Not this one</strong></a>. A list <ul>
        <li><strong>wont be</strong></li>
        <li>Picked up</li>
</ul> although this <strong>last one</strong> could get picked up.

Removed all not cool tags:

Here is a <strong>very strong word</strong> this <strong>one</strong> should also be picked up. But . A list  although this <strong>last one</strong> could get picked up.

The warm gooey center: very strong word
The warm gooey center: one
The warm gooey center: last one
[/code]
Hopefully that gets you close. It's definitely not bullet-proof by any stretch, but it's a start!
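If the goal is to add the <strong> tags rather than read them back out, the same break-it-into-small-steps idea can be turned around. Here is a rough sketch, not bullet-proof either: split the HTML into tag chunks and text chunks, keep count of whether you are inside a <p> and inside an <a>, <li> or heading, and only touch the text chunks that qualify. The function name highlight_glossary_word() and the sample string are made up for illustration, and the sketch assumes that no attribute value contains a > character and that every <li> has an explicit closing tag.

```php
<?php
// Hypothetical helper (not from the thread): wraps whole-word matches of $word
// in <strong> tags, but only in text that is inside <p> and outside <a>, <li>
// and heading tags.
function highlight_glossary_word($html, $word)
{
    // Split into alternating text and tag chunks; the captured tags are kept,
    // so odd indexes are always tags and even indexes are always text
    $chunks = preg_split('/(<[^>]+>)/', $html, -1, PREG_SPLIT_DELIM_CAPTURE);
    $word_pattern = '/\b' . preg_quote($word, '/') . '\b/i';
    $p_depth = 0;     // how many <p> tags are currently open
    $skip_depth = 0;  // how many <a>, <li> or heading tags are currently open

    foreach ($chunks as $i => $chunk) {
        if ($i % 2 === 1) {
            // A tag chunk: update the counters and leave the tag untouched
            if (preg_match('/^<(\/?)(p|a|li|h[1-6])\b/i', $chunk, $m)) {
                $delta = ($m[1] === '/') ? -1 : 1;
                if (strtolower($m[2]) === 'p') {
                    $p_depth = max(0, $p_depth + $delta);
                } else {
                    $skip_depth = max(0, $skip_depth + $delta);
                }
            }
            continue;
        }
        // A text chunk: only edit it inside <p> and outside the skipped tags
        if ($p_depth > 0 && $skip_depth === 0) {
            $chunks[$i] = preg_replace($word_pattern, '<strong>$0</strong>', $chunk);
        }
    }

    return implode('', $chunks);
}

$sample = '<h2>Regex</h2><p>A regex is handy. See <a href="regex.html">regex docs</a>.</p>';
echo highlight_glossary_word($sample, 'regex') . "\n";
// prints: <h2>Regex</h2><p>A <strong>regex</strong> is handy. See <a href="regex.html">regex docs</a>.</p>
```

Because tags are never passed through the word pattern, a word that appears in an attribute such as href="regex.html" is left alone.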

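You could do something like this:
[code]<?php

// Helper called for every <p>...</p> body that contains the target word
function strong($match, $word) {

    // Matches a <> tag or bbcode [] tag, plus everything up to its closing tag if there is one
    $tag_pattern = '/(<|\[).*?(>|])(.*?(<|\[)\s*\/.*?(>|]))?/s';

    // Replace all <> tags and bbcode [] stuff with | (pipe) chars
    // (this assumes the text itself contains no | chars)
    $tmp = preg_replace($tag_pattern, '|', $match);

    // Surround the target word with <strong> tags; preg_quote() keeps any
    // special regex characters in $word from being read as part of the pattern
    $tmp = preg_replace('/\b' . preg_quote($word, '/') . '\b/i', '<strong>\\0</strong>', $tmp);

    // Match all <> tags and bbcode [] stuff from the un-edited $match var
    preg_match_all($tag_pattern, $match, $tag);

    // Put all the tags back into $tmp by replacing the | chars with the tags stored in the $tag array
    foreach ($tag[0] as $html) {
        // A callback is used so that $1 or \1 inside a tag is not read as a backreference
        $tmp = preg_replace_callback('/\|/', function () use ($html) { return $html; }, $tmp, 1);
    }

    // ... and return the string
    return $tmp;
}


// the word you want to replace
$word = "text";

// ...and the string
$str = 'bla bla text <p>here is some text <a href="text.php">text link</a> and a bit <i>more text</i> here</p>. End of text.';

// Replace all $word's using a callback-function
// (preg_replace_callback() is used because the /e modifier was removed in PHP 7)
$pattern = '/(?<=<p>).*?\b' . preg_quote($word, '/') . '\b.*?(?=<\/p>)/is';
$str = preg_replace_callback($pattern, function ($m) use ($word) {
    return strong($m[0], $word);
}, $str);

// and the result is...
echo $str;

?>[/code]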
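A variant of the same idea that needs no placeholder characters and handles several glossary words in one pass, offered as a sketch only. It relies on the regex engine trying alternatives from left to right: whole <a>...</a> elements and single tags are matched first and handed back unchanged, so only a bare word ever gets wrapped. The function name bold_glossary_terms() and the sample data are invented for illustration, and it assumes plain, non-nested <p> tags without attributes.

```php
<?php
// Hypothetical helper (not from the thread): bolds any of $terms inside each
// <p>...</p>, leaving tags and whole <a>...</a> elements untouched.
function bold_glossary_terms($html, array $terms)
{
    if (!$terms) {
        return $html; // nothing to look for
    }

    $quoted = array();
    foreach ($terms as $term) {
        $quoted[] = preg_quote($term, '/');
    }

    // Tried in this order at each position: a whole link, any single tag, a glossary term
    $inner = '/<a\b[^>]*>.*?<\/a>|<[^>]+>|\b(?:' . implode('|', $quoted) . ')\b/is';

    $bold = function ($m) {
        // Markup is returned unchanged; only a bare term gets wrapped
        return $m[0][0] === '<' ? $m[0] : '<strong>' . $m[0] . '</strong>';
    };

    // Apply the inner pattern to the body of every <p>...</p> only
    return preg_replace_callback('/(<p>)(.*?)(<\/p>)/is', function ($p) use ($inner, $bold) {
        return $p[1] . preg_replace_callback($inner, $bold, $p[2]) . $p[3];
    }, $html);
}

$sample = 'text outside <p>Some text with a <a href="text.php">text link</a> and <i>more text</i>, plus a regex.</p>';
echo bold_glossary_terms($sample, array('text', 'regex')) . "\n";
// prints: text outside <p>Some <strong>text</strong> with a <a href="text.php">text link</a> and <i>more <strong>text</strong></i>, plus a <strong>regex</strong>.</p>
```

If one term is a prefix of another (say "regex" and "regex engine"), list the longer one first so the alternation picks it up.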