
BBCode-Stripping Regex Help for a PHP Novice


andy5000


Hello,

 

PHP Beginner here. Please go easy on me. My problem is as follows:

 

I want to display the latest post from a forum on a webpage. I want to completely strip out [ quote], [ img], [ flash] and [ youtube] tags, including the text between the tags.

 

I then want to simply remove any other bbcode tags. For example, [ b], [ i], [ list], [ code], etc. Leaving the text between the tags in place.

 

I have tried searching the forum and the web in general, but most bbcode regex questions seem to be about converting bbcode to HTML rather than simply stripping tags and contents. Plus, the fact that I'm a beginner and don't really know what I'm doing makes things a lot harder to understand.

 

I currently have the following (please excuse me for any awfully written code):

 

$messagebody = $row['post_text'];

// Remove [quote] [img] [flash] [youtube] code entirely (including contents)
// :-(

// Remove [b] [i] [u] [list] [*], etc (just tags)
// :-(

// Remove any nasty html tags
$messagebody = strip_tags($messagebody);


print '<p>'.$row['username'].' says: <a href="forum/viewtopic.php?t='.$row['post_id'].'">'.$messagebody.'</a></p>';

 

The commented out frowny faces represent my failure at knowing what to do.

 

I apologise if this gets asked all the time, I did try searching, honest!

Any help, or links to somewhere I can find an answer, would be much appreciated.

 

Thanks,

Andy


There is no real difference between converting bbcode tags to HTML and removing them. Either way you are attempting to match the tags and replace them with something. The main difference is that you will be replacing them with empty strings rather than HTML tags. To remove tags such as [b], [i], etc., you are probably better off simply using str_replace, since you don't actually need to match a pattern.

 

// the tags to remove
$tags = array('[b]','[/b]', '[i]', '[/i]');

// remove them
$output = str_replace($tags, '', $input);

For matching the patterns, what do you mean by...

 

I want to completely strip out [ quote], [ img], [ flash] and [ youtube] tags, including the text between the tags.

Do you wish to get rid of, for example, a whole [quote]...[/quote] block completely? Will any of the tags be nested? I can't see that most of them would, but I know you get quotes of quotes. Do any of the tags (and if so, which ones) support attributes, e.g. [url=...]Google[/url]?


Hi Cags, Thanks for the reply.

 

I've done a little bit of reading and playing with str_replace, but have found that unfortunately in a phpBB3 database, bbcode is stored like [b:12312312] and [i:5633]. I don't know what the numbers represent, but I believe they are unique. Therefore, [ b] isn't actually matching anything.

 

For the "completely stripping out" bit, I would ideally want to remove the whole thing for Quote, URL, IMG, Flash and Youtube tags, including anything in between the opening and close tag. For anything else other than these tags, I would like to remove just the tags, keeping what is between the opening and closing tag.

 

For example, this post:

 

[quote]Introduce Yourself![/quote]

Hello my name is [i]Jason[/i], have a look at my [b]COOL[/b] website!
[url=http://www.jason.com]My Website[/url]

 

Would become:

 

Hello my name is Jason, have a look at my COOL website!

 

Nesting is a good point. Quote tags could contain a whole host of any other tags. Italics/Bold/Youtube tags could all appear within a quote. As well as quotes within quotes. This is becoming more complicated than I thought it would be.

 

The following tags are the ones I want to basically nuke and remove in their entirety:

 

quote

img

flash

youtube

url

 

Out of these, the "Quote", "Flash" and "Youtube" tags can have optional attributes.


You can match that format using something along the lines of...

 

"#\[b:[0-9]+?\]#"

 

As for url, img, flash and youtube you'd probably be looking at something along the lines of...

 

"#\[quote[^\]*]\].*?\[/quote\]#"

 

...obviously replacing the word url with the others for the other tag names to make the extra patterns. I was thinking that quotes would be difficult because they can be nested, but then it hit me, it's irrelevant since you wish to strip the contents. Just make the pattern greedy and it'll match all the way from the  first open tag to the close tag. As such the same as above should work, but without the question mark to make it lazy.

 

"#\[url[^\]*]\].*\[/url\]#"

 

Nb. These solutions probably aren't perfect, they aren't even tested, but it should give you a good idea.
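
As a rough sketch, the [b:...] pattern above might be widened to cover several simple tags at once. It assumes the suffix after the colon is made up of letters and digits, and that only the tags themselves should go while the text between them stays. The function name and the default tag list are made up for illustration:

```php
<?php
// Hypothetical helper: strips only the tags themselves and keeps the text between them.
function strip_simple_tags(string $text, array $tags = ['b', 'i', 'u']): string
{
    // Escape each tag name so that names like "*" are treated literally
    $names = implode('|', array_map(function ($tag) {
        return preg_quote($tag, '#');
    }, $tags));

    // "[", optional "/", a tag name, optional ":id", "]"
    // e.g. [b], [/b], [b:12312312], [/b:12312312]
    $pattern = '#\[/?(?:' . $names . ')(?::[0-9a-z]+)?\]#i';

    // preg_replace returns null on a PCRE error; fall back to the untouched text
    return preg_replace($pattern, '', $text) ?? $text;
}

echo strip_simple_tags('Hello my name is [i:5633]Jason[/i:5633], have a look at my [b:12312312]COOL[/b:12312312] website!');
// prints: Hello my name is Jason, have a look at my COOL website!
```

Tags that are not in the list, such as [quote], are left alone, so they can be handled separately.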


Hello again,

 

Thanks for posting those! They seem to be doing the trick in some cases, however I have now noticed that it is possible to have the following bbcode:

 

[quote="Persons Username":c123]Quote Body Text[/quote:c123]

 

I have taken the code you posted above and played around with it a bit, and got the following:

 

$messagebody = preg_replace('#\[quote(.*)\].*?\[/quote(.*)\]#','',$messagebody);

 

This appears to cover all possible quotes that could appear, except for ones where the quote body text contains a new line. I have read that the . (period) in regex means any character except for new lines - this is obviously what is causing it to not match.

 

Is there a way I can do any character including new lines?

 

I have played around with changing the .*? bit to [./n]*? and loads of other possibilities, but no luck. Can you tell I don't really "get" regular expressions?  :D


Firstly I've noticed a few typos in my last post, the patterns should have read...

 

"#\[url[^\]]*\].*?\[/url\]#"
"#\[quote[^\]]*\].*\[/quote\]#"

 

But there is a problem with the second one (the quote one). Because I made the match greedy, if you had an input of the form...

 

[quote:123]This is a quote.[/quote] Normal text to stay. [quote]This is another quote[/quote]

 

... you would replace everything, since the greedy pattern will keep going as far as it can. The problem is that if you make it lazy instead, you run into trouble with this format...

 

Something. [quote:123]This is a quote. [quote:124]This is another inside a quote[/quote] More of the original quote[/quote]

 

... as the output string will be...

 

Something. More of the original quote[/quote]

 

This is the problem with nested tags. The tags having IDs in them, where the ID in the closing tag matches the one in the opening tag, does negate this problem in your specific case, but in order to solve it we would need to know that the ID exists in all quote tags. Is that the case?
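
To see the trade-off in action, here is a small sketch that runs the corrected quote pattern both with and without the question mark against the two inputs above. The variable names are illustrative only:

```php
<?php
// Same pattern twice; the only difference is the ? after .*
$greedy = '#\[quote[^\]]*\].*\[/quote\]#';
$lazy   = '#\[quote[^\]]*\].*?\[/quote\]#';

$siblings = '[quote:123]This is a quote.[/quote] Normal text to stay. [quote]This is another quote[/quote]';
$nested   = 'Something. [quote:123]This is a quote. [quote:124]This is another inside a quote[/quote] More of the original quote[/quote]';

// Greedy runs from the first opening tag to the last closing tag
var_dump(preg_replace($greedy, '', $siblings)); // empty string: the text between the quotes is lost
var_dump(preg_replace($greedy, '', $nested));   // only "Something. " is left

// Lazy stops at the first closing tag it finds
var_dump(preg_replace($lazy, '', $siblings));   // " Normal text to stay. " survives
var_dump(preg_replace($lazy, '', $nested));     // a stray "[/quote]" is left behind
```

Neither variant handles both inputs correctly, which is why nesting needs something more than a greedy/lazy switch.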

 

In answer to your question: yes, by default the . doesn't match the newline character. We can force it to by going into single-line mode using the s modifier. Modifiers are placed after the closing delimiter.
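
A minimal sketch of the s modifier in use. It assumes the lazy quote pattern from above, with [^\]]* added to the closing tag so that it tolerates IDs such as :c123. The variable names are illustrative:

```php
<?php
// A quote whose body spans two lines, in the format shown earlier in the thread
$post = "[quote=\"Persons Username\":c123]Quote Body\nText[/quote:c123] Reply here.";

// Identical patterns, except that the second has the s modifier after the closing delimiter
$withoutS = '#\[quote[^\]]*\].*?\[/quote[^\]]*\]#';
$withS    = '#\[quote[^\]]*\].*?\[/quote[^\]]*\]#s';

var_dump(preg_replace($withoutS, '', $post) === $post);       // true: . stops at the newline, nothing matches
var_dump(preg_replace($withS, '', $post) === ' Reply here.'); // true: the whole quote is removed
```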

 

 


Hello,

 

Unfortunately, I think when multiple quotes appear in the same post, they share the same ID/code thing. For example:

 

[quote:c123][quote:c123]How are you?[/quote:c123]

Good Thanks[/quote:c123]

That's good to hear.

 

Argh  :(

 

On a sidenote, though, the single-line mode works a treat. Which just leaves me with the problem mentioned in your previous post...

 

Is there a solution, or am I now firmly up the creek with no paddles in sight?


There is a solution, but off the top of my head I'll be damned if I can remember what it is as I've never personally worked with bbcode. You might have to use preg_replace_callback, which I've also never used. There's an example in the documentation of working with nested bbcode but it doesn't look especially simple. I haven't got time to look through it in any detail right now so I'm afraid you might have to wait for somebody else on that one (if you can't figure out the example).


Well I managed to quickly throw this together. It's not fully tested and can't say I've looked at the pattern enough to fully understand it, I just modified it for the obvious difference between the example on the site and your requirements...

 

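// Sample input: a nested quote followed by a second, separate quote
$input = "plain [quote] deep [quote:123] deeper [/quote:123] deep [/quote] plain [quote:123] another test [/quote:123] does it work";

// Removes every [quote]...[/quote] block, including quotes nested inside other quotes
function stripquotes($input) {
    // (?R) recurses into the whole pattern, so a nested quote is matched as part of the outer one
    $regex = '#\[quote(?:[^]]*)]((?:[^[]|\[(?!/?quote(?:[^]]*)])|(?R))+)\[/quote(?:[^]]*)]#';

    // When called back by preg_replace_callback, $input is the array of matches:
    // swap it for an empty string so the matched quote is removed
    if (is_array($input)) {
        $input = '';
    }

    return preg_replace_callback($regex, 'stripquotes', $input);
}

$output = stripquotes($input);
echo $output;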
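
For anyone working through the pattern above, here is a hedged, annotated sketch of the same idea. It assumes PCRE's extended mode (the x modifier) so the pattern can be split across lines with comments, and uses ~ as the delimiter because # starts a comment in that mode. Since every match is replaced with an empty string, it uses plain preg_replace, which should give the same result as the callback version here. The function name is made up for illustration:

```php
<?php
// Hypothetical helper; the pattern mirrors the one posted above.
function strip_quotes_annotated(string $input): string
{
    $regex = '~
        \[quote[^]]*]            # opening tag, with any attributes or id
        (?:
            [^[]                 # any character that is not an opening bracket
          | \[(?!/?quote[^]]*])  # a bracket that does not start a quote tag
          | (?R)                 # or a complete nested quote, via recursion
        )+
        \[/quote[^]]*]           # closing tag, with any id
    ~x';

    // preg_replace returns null on a PCRE error; fall back to the untouched text
    return preg_replace($regex, '', $input) ?? $input;
}

$input = "plain [quote] deep [quote:123] deeper [/quote:123] deep [/quote] plain [quote:123] another test [/quote:123] does it work";
echo strip_quotes_annotated($input);
// prints: plain  plain  does it work   (the spaces around each removed quote remain)
```

The pattern contains no dot, and negated character classes such as [^[] already match newlines, so the s modifier is not needed for quotes that span several lines.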

 

Of course depending on


Hi Cags,

 

That is absolutely fantastic. Tested it, and it does exactly what I was asking for.

 

I am very grateful for the time you put in to helping me out!

 

Now I have something that works... I'm going to try and work through it and understand how it works :D

 

Thanks again,

Andy
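
Pulling the thread together, the original goal might be sketched roughly as follows. This is an untested-in-production outline, not a definitive implementation: the function names are invented, the list of tags to remove entirely is taken from the posts above, and the catch-all pattern for the remaining tags is an assumption that will also strip any other bracketed text starting with a letter or an asterisk.

```php
<?php
// Hypothetical helper: removes a tag pair and everything inside it, nested pairs included.
function strip_block(string $text, string $tag): string
{
    $t = preg_quote($tag, '#');
    // Same recursive idea as the quote pattern in the thread, built for any tag name.
    // \b stops "url" from also matching a longer tag name that merely starts with "url".
    $regex = '#\[' . $t . '\b[^]]*]'
           . '(?:[^[]|\[(?!/?' . $t . '\b[^]]*])|(?R))*'
           . '\[/' . $t . '\b[^]]*]#i';

    return preg_replace($regex, '', $text) ?? $text;
}

// Hypothetical helper: applies the whole clean-up to one post body.
function clean_post(string $text): string
{
    // 1. Tags to remove entirely, contents included
    foreach (['quote', 'img', 'flash', 'youtube', 'url'] as $tag) {
        $text = strip_block($text, $tag);
    }

    // 2. Any remaining [tag], [/tag], [tag=attr] or [tag:id]: drop the tag, keep the text
    $text = preg_replace('#\[/?[a-z*][^]]*]#i', '', $text) ?? $text;

    // 3. Remove HTML tags, then tidy the leading and trailing whitespace
    return trim(strip_tags($text));
}

$post = "[quote]Introduce Yourself![/quote]\n\n"
      . "Hello my name is [i]Jason[/i], have a look at my [b]COOL[/b] website!\n"
      . "[url=http://www.jason.com]My Website[/url]";

echo clean_post($post);
// prints: Hello my name is Jason, have a look at my COOL website!
```

An opening tag with no matching close is not treated as a block; step 2 simply removes the tag and leaves its text in place.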

