Struggling with a regex match. Appreciate any help.


Anti-Moronic

Here is the text:

 

<div class="left">Lorem Ipsum is simply dummy text of the printing and</div> typesetting industry. Lorem Ipsum has been the industry's standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scramble'd it to make-shift type <a href="google.com">specimen book</a> and something [tag]else[/tag].

 

Essentially what I'm trying to do is extract all of the words above while abiding by these rules:

 

1. word can contain dash and apostrophe (scramble'd and make-shift above)

2. word cannot be within a link tag

3. word cannot be within a block tag - [tag]

4. word cannot be part of a tag name or other markup (the class in class=, div, a, tag, etc.)

 

That's about it. Any advice on where I might start with that? I'm currently experimenting with it but all I have is this:

 

(\s?)[a-zA-Z0-9\'\-]+(\s?|\,|\.)

 

Shameful, I know.
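As a starting point for rule 1 only, here is a minimal sketch of a word pattern that allows apostrophes and dashes inside a word but not at its edges. The function name extract_words is made up for illustration, and the sketch assumes the text has already had its markup removed.

```php
<?php
// Hypothetical helper (not from the thread): extract words from plain text.
// A word is a run of letters/digits, optionally followed by groups of
// one apostrophe or dash plus more letters/digits (scramble'd, make-shift).
function extract_words(string $text): array
{
    preg_match_all("/[A-Za-z0-9]+(?:['-][A-Za-z0-9]+)*/", $text, $matches);
    return $matches[0];
}

$words = extract_words("scramble'd it to make-shift type - and 'quoted' text.");
echo implode(', ', $words), "\n";
// prints: scramble'd, it, to, make-shift, type, and, quoted, text
```

Unlike the attempt above, this does not capture the surrounding whitespace or punctuation, so each match is the bare word.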

Thanks for the response.

 

strip_tags only removes the tags themselves, so the link text would still be left behind. I need to remove all links, and tags of certain other types, together with their content. So the links and their text should go, but the content inside the div should stay.

 

Thanks.
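One way to get that behaviour, sketched under some assumptions: remove the links and the [tag] blocks together with their content first, then let strip_tags drop the remaining tags while keeping their text. The function name extract_visible_words is hypothetical, and the regex-based removal assumes well-formed, non-nested markup, so it should not be relied on for arbitrary HTML.

```php
<?php
// Hypothetical helper (not from the thread). Assumes well-formed,
// non-nested <a> elements and [tag]...[/tag] blocks.
function extract_visible_words(string $html): array
{
    // Rule 2: drop <a ...>...</a> together with the link text.
    $text = preg_replace('~<a\b[^>]*>.*?</a>~is', ' ', $html);

    // Rule 3: drop [name]...[/name] blocks together with their content.
    $text = preg_replace('~\[(\w+)[^\]]*\].*?\[/\1\]~is', ' ', $text);

    // Rule 4: remove the remaining tags but keep what is inside them.
    $text = strip_tags($text);

    // Rule 1: words may contain internal apostrophes and dashes.
    preg_match_all("/[A-Za-z0-9]+(?:['-][A-Za-z0-9]+)*/", $text, $matches);
    return $matches[0];
}

$sample = '<div class="left">Lorem and</div> type '
        . '<a href="google.com">specimen book</a> and something [tag]else[/tag].';
echo implode(', ', extract_visible_words($sample)), "\n";
// prints: Lorem, and, type, and, something
```

Note that strip_tags removes a tag without inserting a space, so words separated only by tags would run together. With nested links or broken HTML, a DOM parser is likely the safer route.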

Please specify what you mean by "within." Which of these words are not "within" a tag?

WordA <a href="http://www.wordb.com" class="wordc">WordD</a>

If your answer is "wordA and wordD," then remove the HTML tags with strip_tags, remove the square-bracket tags the same way using preg_replace with the pattern /\[[^\]]+\]/, and continue with your matching.
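A minimal sketch of that "wordA and wordD" route, for illustration only. The wrapper name strip_all_tags is made up here; strip_tags and preg_replace are the standard PHP functions mentioned above.

```php
<?php
// Hypothetical wrapper (not a PHP built-in): removes HTML tags and
// square-bracket tags, keeping the text between them.
function strip_all_tags(string $input): string
{
    // Remove HTML tags such as <a ...> and </a>, keeping their content.
    $text = strip_tags($input);

    // Remove [tag] and [/tag] markers, again keeping their content.
    return preg_replace('/\[[^\]]+\]/', '', $text);
}

echo strip_all_tags(
    'WordA <a href="http://www.wordb.com" class="wordc">WordD</a> and [tag]else[/tag].'
), "\n";
// prints: WordA WordD and else.
```

The attribute values (wordb, wordc) disappear with the tag, while the link text and the text inside [tag] remain available for matching.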

 

If your answer is "wordA only," you're going to have a lot more trouble. It's going to be extremely difficult, especially once you run into nested tags or malformed HTML.

 

-Dan
