Stripping unwanted (extra) </table> tags

qwikaddotcom · May 24, 2013

Hi!

Let's say somebody posts an ad that contains HTML tables, and the tables have extra closing </table> tags. Example:

 
<table border="0">
<tr>
</td>
 
Some text....
 
</td>
<td>
 
Some text....
 
</td>
</tr>
 
</table>
 
</table>
</table>

Is there a way to remove extra closing tags (just extra </table> tags) from any html table?

Thank you for any input.

ginerjm · May 24, 2013

I think the best way is to get the poster to stop doing it!

That said - I think you have to parse the text of the page and do your own thing to count start tags and end tags and skip the end tags when they outnumber the starts.
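
The counting idea above can be sketched roughly like this. It is only a minimal illustration, not a tested production filter: the function name stripUnmatchedTableClosers is made up for this example, and it assumes the markup contains no <table> text inside comments, scripts or attribute values.

<?php
// Hypothetical helper: walks over every <table ...> and </table> tag in order,
// keeps a running depth, and drops any </table> that has no open <table> to close.
function stripUnmatchedTableClosers(string $html): string
{
    $depth = 0; // number of <table> tags currently open

    $result = preg_replace_callback(
        '~<(/?)table\b[^>]*>~i',
        function (array $m) use (&$depth): string {
            if ($m[1] === '') {   // opening tag: one more table is open
                $depth++;
                return $m[0];
            }
            if ($depth > 0) {     // closing tag that matches an open table
                $depth--;
                return $m[0];
            }
            return '';            // closing tag with nothing left to close: strip it
        },
        $html
    );

    // preg_replace_callback() returns null on a regex error; fall back to the input.
    return $result ?? $html;
}

Only table tags are touched, so everything else in the ad (including other stray tags such as an extra </td>) is left exactly as it was posted.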

qwikaddotcom · May 24, 2013

How would you accomplish something like that with let's say preg_match or regex?

Jessica · May 24, 2013

You wouldn't; you would use a DOM parser. I like SimpleDOM.
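
As a rough illustration of the DOM-parser route, here is a sketch that uses PHP's bundled DOMDocument class as a stand-in (it is not SimpleDOM, and the function name normalizeHtmlFragment is made up for this example). It assumes the dom extension is enabled and that the input is plain ASCII or Latin-1, since loadHTML() does not assume UTF-8 by default. Note that a parser normalizes the whole fragment, so other stray tags are dropped as well, not only </table>.

<?php
// Hypothetical helper: parses a messy HTML fragment and serializes it again,
// letting the parser discard end tags that do not match any open element.
function normalizeHtmlFragment(string $fragment): string
{
    // Collect parser warnings internally instead of emitting PHP warnings.
    $previous = libxml_use_internal_errors(true);

    $doc = new DOMDocument();
    $doc->loadHTML('<html><body>' . $fragment . '</body></html>');

    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $body = $doc->getElementsByTagName('body')->item(0);
    if ($body === null) {
        return '';
    }

    // Serialize only the children of <body>, i.e. the original fragment.
    $out = '';
    foreach ($body->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }
    return $out;
}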

qwikaddotcom · May 24, 2013

I am not familiar with SimpleDOM. I found something about it here: http://simplehtmldom.sourceforge.net/ but it will take me a lot of time to figure out how to remove extra </table> tags with it. Can you suggest a solution? The idea is that if there's an unwanted closing table tag (whether one or many), it should be stripped. Any help will be appreciated.

Jessica · May 24, 2013

Yes, I got the idea. It's up to you to write the code.

If you want a quick solution done for you, either in regex or with a DOM parser, you'll need to post in Freelancing. Otherwise, pick which way you want to go and make an effort, and we can help then.

kicken · May 24, 2013

You could try Tidy, it may be able to clean it up for you.

jazzman1 · May 24, 2013

Just treat the table content as a simple string, trim the content, and remove only the duplicates you don't want.

jazzman1 · May 24, 2013

If your problem is only "</table>", you can remove all whitespace between tags ("> <"), then remove every "</table>" and add just one back.

Take a look at this. It's not a very elegant solution because I don't have much time, but it should work.

<?php

$str = '<table border="0">
<tr>

</td>
 
Some text....
 
</td>
<td>
 
Some text....
 
</td>

</tr>  

</table>

</table>

</table>
';

$tbl = preg_replace('~>(\s+)?<~', '><', $str);

$html = implode('',array_unique(explode('</table>', $tbl)));

echo $html.'</table>';

Results:

<table border="0"><tr></td>

Some text....

</td><td>

Some text....

</td></tr>
</table>

qwikaddotcom · May 24, 2013

It's a great recommendation, but can you show me how it can be done with either preg_match, regex or SimpleDOM? All it has to do is clean all extra </table> tags (closing table tags). Everything else can stay. Thank you.

qwikaddotcom · May 24, 2013

I guess you posted it just a second before I did. LOL.

qwikaddotcom · May 24, 2013

But this will work just for one particular example. I need something more universal. I need a preg_match or something that will pretty much strip all extra </table> tags in all kinds of html table structures...

jazzman1 · May 24, 2013

For a more universal solution, there is a "Freelance Section" on the forum, or... I highly recommend that you start learning regex. I am a big fan of it.

http://www.regular-expressions.info/

qwikaddotcom · May 26, 2013

I know I am probably getting all of you annoyed with my posts about the same thing, but I need something different. Actually, what I need (as I have figured it out now) is something way simpler than what I thought I needed:

If there is already a closing </table> tag at the end of a table, any other closing </table> tags after that last closing tag should be stripped.

In other words, it doesn't matter what happens inside the table, what matters is after the table is closed and there's no new opening <table> tag, any extra closing </table> tags must be stripped. I think this is doable. AND... that will eliminate 95% of the wrong tables (from what I've seen happening in the posts).

The difference between what I need now and what I needed before is that there will be no need for parsing. The only "parsing" involved will go like this: "OK, the table has a closing tag and there are no opening tags, but there are extra closing tags... strip them!"

Can you suggest how this can be done with preg_match or something similar? I'd really appreciate it. Preg_match, str_replace or preg_replace work best for me, because I can apply them directly to the markdown.

Thank you a lot!

jazzman1 · May 26, 2013

What is the problem with using my script above?

Jessica · May 26, 2013

That is the exact same question you already posted.

qwikaddotcom · May 26, 2013

The difference is (at least I thought there was a difference) that before, as I understood it, the markup needed to be parsed, whereas now it can probably be accomplished with a one-liner. For example (although it's a different solution):

 $text = preg_replace('/<([^<>]+)>/e', '"<" .str_replace("&quot;", \'"\', "$1").">"', $text);

Or do I still have to use the solution offered by jazzman1, where the table itself has to be in the script?

qwikaddotcom · May 26, 2013

Also, without the table in the script, it will strip all unwanted </table> tags, not just a definitive number of them.

qwikaddotcom · May 26, 2013

I guess I've taken the easiest way out for now. I've tried different lines and ended up with this one. It does what I want... for now:

$text = preg_replace( '/(s*<\/table\s*\/?>\s*)+/', "</table>", $text);

Thanks everyone for your input!

jazzman1 · May 26, 2013

No, it's not correct.

You should make the "\s" optional!
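
To make the point concrete: in the pattern posted above, the leading "s*" has no backslash, so it matches the literal letter "s" rather than whitespace, and a string such as "rows</table>" loses its trailing "s". A possible corrected version is sketched below. The function name collapseTableClosers is made up for this example, and the pattern is only one way to write it.

<?php
// Hypothetical helper: collapses a run of adjacent </table> tags, with optional
// whitespace before each one, into a single </table>.
function collapseTableClosers(string $html): string
{
    // \s*        optional whitespace before each closing tag
    // </table\s*> the closing tag itself, tolerating "</table >"
    // (?: ... )+  one or more of these in a row
    $result = preg_replace('~(?:\s*</table\s*>)+~i', '</table>', $html);

    // preg_replace() returns null on a regex error; fall back to the input.
    return $result ?? $html;
}

This only merges closers that sit directly next to each other. In valid markup a nested table lives inside a cell, so its </table> is always followed by </td> and never directly by another </table>. A stray </table> that appears later, after other content, is not caught by this approach.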

Monkuar · May 26, 2013

Use htmlspecialchars to stop that crap. Don't EVER allow users to use HTML.

jazzman1 · May 26, 2013

The htmlspecialchars function just converts the predefined special HTML characters to their entities, doesn't it?

Can you give us an example of what you mean?
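
For reference, a minimal sketch of what htmlspecialchars does to a posted table. The variable names and sample input are made up for this example. The tags are not removed; they are turned into entities, so the browser shows them as literal text instead of rendering a table.

<?php
// Sample ad text as a user might post it (illustrative input only).
$userInput = '<table><tr><td>Some text</td></tr></table></table>';

// Convert <, >, &, and both quote characters to HTML entities.
$escaped = htmlspecialchars($userInput, ENT_QUOTES, 'UTF-8');

echo $escaped;

This makes stray closing tags harmless, but it also means ads can no longer contain working tables at all, which is a different trade-off from stripping only the extra </table> tags.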
