Anti-Moronic Posted October 29, 2011

Here is the text:

<div class="left">Lorem Ipsum is simply dummy text of the printing and</div> typesetting industry. Lorem Ipsum has been the industry's standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scramble'd it to make-shift type <a href="google.com">specimen book</a> and something [tag]else[/tag].

Essentially, what I'm trying to do is extract all of the words above while abiding by these rules:

1. A word can contain a dash or an apostrophe (scramble'd and make-shift above).
2. A word cannot be within a link tag.
3. A word cannot be within a block tag, e.g. [tag].
4. A word cannot be part of a tag name or HTML (class in class=", div, a, tag, etc.).

That's about it. Any advice on where I might start with that? I'm currently experimenting with it, but all I have is this:

(\s?)[a-zA-Z0-9\'\-]+(\s?|\,|\.)

Shameful, I know.
silkfire Posted October 29, 2011

There's a strip_tags() function in PHP, but it only works for HTML, not for block tags like [tag].
Anti-Moronic Posted October 29, 2011

Thanks for the response. strip_tags() only removes the tags themselves, so the link text would still be left behind. I need to remove all links, or tags of a certain type, together with their content. In other words, I want to remove the links and their text, but keep the content within the div. Thanks.
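To make the difference concrete, here is a minimal sketch of what the post describes. It assumes the links and [tag] blocks are well formed and not nested; the variable names and the two removal patterns are illustrative and do not come from the thread.

```php
<?php
// Sample input, shortened from the text in the first post.
$text = 'scramble\'d it to make-shift type <a href="google.com">specimen book</a>'
      . ' and something [tag]else[/tag].';

// strip_tags() drops the markup (attributes included) but keeps the inner text,
// so "specimen book" survives.
$stripped = strip_tags($text);

// To drop a link together with its content, remove the whole element instead.
// Assumption: well-formed, non-nested <a> elements.
$withoutLinks = preg_replace('~<a\b[^>]*>.*?</a>~is', '', $text);

// Same idea for the [tag]...[/tag] block.
$withoutBlocks = preg_replace('~\[tag\].*?\[/tag\]~is', '', $withoutLinks);

echo $stripped, PHP_EOL;
echo $withoutBlocks, PHP_EOL;
```

Regex-based removal like this is likely to break on nested or malformed markup, so treat it as a starting point rather than a robust parser.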
silkfire Posted October 30, 2011

I'm quite sure strip_tags() will remove href="google.com" as well.
ManiacDan Posted October 30, 2011

Please specify what you mean by "within." Which of these words is not "within" a tag:

WordA <a href="http://www.wordb.com" class="wordc">WordD</a>

If your answer is "wordA and wordD," then remove the tags with strip_tags, emulate the same functionality for the block tags with the preg_replace pattern /\[[^\]]+\]/, and continue with your matching.

If your answer is "wordA only," you're going to have a lot more trouble. It's going to be extremely difficult, especially when you start dealing with nested tags or erroneous HTML.

-Dan
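The two cases above could be sketched roughly as follows. The function name extractWords, its $keepInnerText flag, and the word pattern are hypothetical and only illustrate the idea; only strip_tags and the /\[[^\]]+\]/ pattern come from the post. The "wordA only" branch assumes well-formed, non-nested markup, which is exactly where the post warns things get difficult.

```php
<?php
/**
 * Hypothetical helper (not from the thread): returns the words in $text.
 * $keepInnerText = true  -> "wordA and wordD": only the tags are removed.
 * $keepInnerText = false -> "wordA only": links and [x]...[/x] blocks are
 *                           removed together with their content.
 */
function extractWords(string $text, bool $keepInnerText): array
{
    if (!$keepInnerText) {
        // Assumption: well-formed, non-nested links and bracket blocks.
        $text = preg_replace('~<a\b[^>]*>.*?</a>~is', ' ', $text);
        $text = preg_replace('~\[(\w+)\b[^\]]*\].*?\[/\1\]~is', ' ', $text);
    }

    // Pad tags with a space so adjacent words are not glued together,
    // then remove the remaining HTML tags.
    $text = strip_tags(str_replace('<', ' <', $text));

    // Remove any remaining [bracket] tags (pattern from the post above).
    $text = preg_replace('/\[[^\]]+\]/', ' ', $text);

    // A word: letters/digits, optionally joined by single apostrophes or dashes.
    preg_match_all("/[A-Za-z0-9]+(?:['-][A-Za-z0-9]+)*/", $text, $matches);

    return $matches[0];
}

$sample = '<div class="left">Lorem Ipsum is simply</div> typesetting scramble\'d it'
        . ' to make-shift type <a href="google.com">specimen book</a>'
        . ' and something [tag]else[/tag].';

print_r(extractWords($sample, false));
```

For input that may be nested or malformed, a real HTML parser such as PHP's DOMDocument is probably a safer choice than regular expressions.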