Here is the text:
<div class="left">Lorem Ipsum is simply dummy text of the printing and</div> typesetting industry. Lorem Ipsum has been the industry's standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scramble'd it to make-shift type <a href="google.com">specimen book</a> and something [tag]else[/tag].
Essentially what I'm trying to do is extract all of the words above while abiding by these rules:
1. A word can contain a dash or an apostrophe (scramble'd and make-shift above).
2. A word cannot be inside a link tag.
3. A word cannot be inside a block tag such as [tag].
4. A word cannot be part of a tag name or of the HTML markup itself (class in class=", div, a, tag, etc.).
That's about it. Any advice on where I might start with that? I'm currently experimenting with it but all I have is this:
(\s?)[a-zA-Z0-9\'\-]+(\s?|\,|\.)
Shameful, I know.
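One possible starting point, as a rough sketch rather than a definitive solution: instead of one big pattern, strip the unwanted parts in stages and only then match words. The names below (LINK_RE, BLOCK_RE, HTML_TAG_RE, WORD_RE, extract_words) are made up for illustration, and the patterns assume simple, non-nested markup like the sample above. For messier real-world HTML, a proper parser would probably be safer than regex.

```python
import re

# Hypothetical helper names; all patterns assume simple, non-nested markup.
# Rule 2: a link element together with everything inside it.
LINK_RE = re.compile(r"<a\b[^>]*>.*?</a\s*>", re.IGNORECASE | re.DOTALL)
# Rule 3: a bracket block such as [tag]...[/tag], together with its contents.
BLOCK_RE = re.compile(r"\[(\w+)[^\]]*\].*?\[/\1\]", re.DOTALL)
# Rule 4: any remaining HTML tag (tag name and attributes), keeping the inner text.
HTML_TAG_RE = re.compile(r"<[^>]+>")
# Rule 1: letters/digits, optionally joined by single dashes or apostrophes.
WORD_RE = re.compile(r"[A-Za-z0-9]+(?:['-][A-Za-z0-9]+)*")


def extract_words(text):
    """Return the words in text that survive rules 1-4."""
    text = LINK_RE.sub(" ", text)      # drop links and their contents
    text = BLOCK_RE.sub(" ", text)     # drop [tag] blocks and their contents
    text = HTML_TAG_RE.sub(" ", text)  # drop leftover tags, keep their text
    return WORD_RE.findall(text)


sample = (
    '<div class="left">Lorem Ipsum is simply dummy text of the printing and</div> '
    "typesetting industry. Lorem Ipsum has been the industry's standard dummy text "
    "ever since the 1500s, when an unknown printer took a galley of type and "
    "scramble'd it to make-shift type "
    '<a href="google.com">specimen book</a> and something [tag]else[/tag].'
)

words = extract_words(sample)
print(words)
```

Replacing each removed span with a space (rather than an empty string) keeps neighbouring words from being glued together, which is why "and" and "typesetting" around the closing div stay separate.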