
[SOLVED] HELP: nested tags?


mab


Hello @all,

 

I'm currently struggling with regex and got stuck on this little problem.

 

I have source code with nested tables and text between the separate tables. I want to remove all tables, but all text that is not inside a table should stay. I am using the preg_replace function (PCRE) to do this. At the moment it works with non-nested tables, but as soon as there are nested tables it doesn't work properly.

 

Hope my explanation is understandable. Anybody out there who can help? I appreciate any suggestions ...

 

Thanks so much in advance!!

 

Well, to explain a bit more, here's some code:

 

 

$pattern = '{(<[ \\n\\r\\t]*(table)(>|[^>]*>))(.*?)(<[ \\n\\r\\t]*/[ \\n\\r\\t]*(\2)(>|[^>]*>))}is';
$replacement = '';
echo preg_replace($pattern, $replacement, $subject);

 

And here is an example for the 'subject':

 

<h1>Some tables and text</h1>

<table>
  <tr>
    <th>England</th>
    <th>Paris</th>
    <th>Munich</th>
  </tr>
  <tr>
    <td>
		<table>
  			<tr>
    			<th>London</th>
	    	<th>Brighton</th>
    			<th>Cambridge</th>
  			</tr>
  			<tr>
    			<td>rain</td>
    			<td>sun</td>
    			<td>wind</td>
  			</tr>
			</table>
    </td>
    <td>sun</td>
    <td>wind</td>
  </tr>  
</table>
This is a text between the tables which should not be removed.
<table>
  <tr>
    <th>London</th>
    <th>Paris</th>
    <th>Munich</th>
  </tr>
  <tr>
    <td>rain</td>
    <td>sun</td>
    <td>wind</td>
  </tr>
  </table>
This is a text after the tables which should also not be removed.
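To see why the lazy approach breaks on nesting, here is a minimal sketch. It assumes a simplified version of the pattern above (the optional-whitespace handling is dropped) and a made-up one-line subject; neither comes from the original post. The lazy `.*?` stops at the first closing tag it meets, which belongs to the inner table, so the tail of the outer table is left behind.

```php
<?php
// Simplified stand-in for the lazy pattern (assumption: whitespace handling omitted).
$pattern = '~<table[^>]*>.*?</table[^>]*>~is';

// Hypothetical subject: one table nested inside another.
$subject = 'before<table><tr><td><table><tr><td>inner</td></tr></table></td></tr></table>after';

// The match runs from the OUTER opening tag to the INNER closing tag.
$result = preg_replace($pattern, '', $subject);

echo $result, "\n"; // prints: before</td></tr></table>after
```

The leftover `</td></tr></table>` is exactly the part of the outer table that follows the inner closing tag.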


<?php
$html='<h1>Some tables and text</h1>

<table>
  <tr>
    <th>England</th>
    <th>Paris</th>
    <th>Munich</th>
  </tr>
  <tr>
    <td>
		<table>
  			<tr>
    			<th>London</th>
	    	<th>Brighton</th>
    			<th>Cambridge</th>
  			</tr>
  			<tr>
    			<td>rain</td>
    			<td>sun</td>
    			<td>wind</td>
  			</tr>
			</table>
    </td>
    <td>sun</td>
    <td>wind</td>
  </tr>  
</table>
This is a text between the tables which should not be removed.
<table>
  <tr>
    <th>London</th>
    <th>Paris</th>
    <th>Munich</th>
  </tr>
  <tr>
    <td>rain</td>
    <td>sun</td>
    <td>wind</td>
  </tr>
  </table>';
$html=preg_replace('~<table[^>]*>(?:(?>(?:(?!</?table[^>]*>).)+)|(?0))*</table>~is','',$html);
echo $html;
?>

try

<?php
$test = '<h1>Some tables and text</h1>

<table>
  <tr>
    <th>England</th>
    <th>Paris</th>
    <th>Munich</th>
  </tr>
  <tr>
    <td>
		<table>
  			<tr>
    			<th>London</th>
	    	<th>Brighton</th>
    			<th>Cambridge</th>
  			</tr>
  			<tr>
    			<td>rain</td>
    			<td>sun</td>
    			<td>wind</td>
  			</tr>
			</table>
    </td>
    <td>sun</td>
    <td>wind</td>
  </tr>  
</table>
This is a text between the tables which should not be removed.
<table>
  <tr>
    <th>London</th>
    <th>Paris</th>
    <th>Munich</th>
  </tr>
  <tr>
    <td>rain</td>
    <td>sun</td>
    <td>wind</td>
  </tr>
  </table>
This is a text after the tables which should also not be removed.';
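$out = '';
$start = 0;                      // offset where the current run of non-table text begins
$a = strpos($test, '<table');    // position of the next opening tag
$b = strpos($test, '</table');   // position of the next closing tag
$open_tag = 0;                   // current nesting depth
while ($a !== false || $b !== false) {
    // Take the opening branch only if an opening tag exists and comes first
    // (comparing an offset against false would give misleading results).
    if ($a !== false && ($b === false || $a < $b)) {
        if ($open_tag == 0) {
            // Outside any table: keep the text that precedes this tag.
            $out .= substr($test, $start, $a - $start);
        }
        $open_tag++;
        $a = strpos($test, '<table', $a + 1);
    } else {
        // Closing tag: resume just past its '>'.
        $open_tag--;
        $start = strpos($test, '>', $b + 1) + 1;
        $b = strpos($test, '</table', $start);
    }
}
if ($open_tag) die('HTML error!');   // unbalanced tags
$out .= substr($test, $start);       // text after the last table
echo $out;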
?>
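For comparison, the same depth-counting idea can be sketched with preg_split instead of strpos. This is only an illustration: the helper name `strip_tables` and the short test string are made up here, and it assumes the tags are plain `<table ...>` / `</table>` without the extra whitespace variants from the first post.

```php
<?php
// Hypothetical helper: drop <table>...</table> blocks by tracking nesting depth.
function strip_tables(string $html): string
{
    // Split on table tags, keeping the tags themselves as tokens.
    $tokens = preg_split('~(</?table\b[^>]*>)~i', $html, -1, PREG_SPLIT_DELIM_CAPTURE);
    $depth = 0;
    $out = '';
    foreach ($tokens as $token) {
        if (preg_match('~^<table\b~i', $token)) {
            $depth++;                       // entering a (possibly nested) table
        } elseif (preg_match('~^</table\b~i', $token)) {
            if ($depth === 0) {
                throw new InvalidArgumentException('Unbalanced </table>');
            }
            $depth--;                       // leaving a table
        } elseif ($depth === 0) {
            $out .= $token;                 // text outside every table is kept
        }
    }
    if ($depth !== 0) {
        throw new InvalidArgumentException('Unclosed <table>');
    }
    return $out;
}

echo strip_tables('A<table><tr><td><table></table></td></tr></table>B<TABLE class="x"></TABLE>C'), "\n"; // prints: ABC
```

Unlike the loop above it throws on unbalanced tags instead of calling die(), which is just a design choice for the sketch.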

Thanks for all your help. I tried both solutions and both work for me. I really appreciate it.

 

Using regex is quite new for me and I don't fully understand the first solution. The difficult part seems to be the one in parentheses:

(?:(?>(?:(?!</?table[^>]*>).)+)|(?0))*

 

 

Well, I recognized the atomic group, negative lookahead and that there is an alternation.

 

So what does the first alternation do? 

(?>(?:(?!</?table[^>]*>).)+)

Am I right that it matches any character, but first looks ahead to check that no opening or closing table tag starts at that position? And that the atomic group keeps the match as a whole, so it can only be given back as a whole?

 

 

And the second alternation?

(?0)

What is this doing? And to which part of the regex does this refer?

 

 

I would be really happy if you could give a short explanation, so that I understand how to create such a pattern on my own next time.

 

So thanks again for all the nice and fast help!

This is more commonly written with (?R) instead of (?0); both recurse into the whole pattern. The numbered form does help illustrate that you could add lookahead or lookbehind around the construct, put the nested part in capture group 1, and recurse into just that group with (?1).
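As a rough illustration, here are the three spellings side by side, plus an annotated copy using the x flag so each piece can carry a comment. The short subject string and the array keys are made up for this sketch; only the pattern itself comes from the solution above.

```php
<?php
// Hypothetical subject: a nested table, a flat table, and text around them.
$subject = 'A<table><tr><td><table><tr><td>x</td></tr></table></td></tr></table>B<table></table>C';

$patterns = [
    // Recursion into the whole pattern, numbered form.
    '(?0)' => '~<table[^>]*>(?:(?>(?:(?!</?table[^>]*>).)+)|(?0))*</table>~is',
    // Same thing with the more common spelling.
    '(?R)' => '~<table[^>]*>(?:(?>(?:(?!</?table[^>]*>).)+)|(?R))*</table>~is',
    // The table construct is capture group 1; (?1) recurses into that group only.
    '(?1)' => '~(<table[^>]*>(?:(?>(?:(?!</?table[^>]*>).)+)|(?1))*</table>)~is',
    // Annotated copy: the x flag ignores pattern whitespace and allows # comments.
    'annotated' => '~
        <table[^>]*>                   # an opening table tag
        (?:                            # repeat either alternative, zero or more times
            (?>                        #   atomic group: never backtracked into once matched
                (?:
                    (?!</?table[^>]*>) #   only where no table tag (open or close) starts
                    .                  #   consume one character
                )+
            )
          |
            (?R)                       #   or a complete nested table, matched recursively
        )*
        </table>                       # the closing tag that pairs with the opening one
    ~isx',
];

$results = [];
foreach ($patterns as $label => $pattern) {
    $results[$label] = preg_replace($pattern, '', $subject);
    echo $label, ' => ', $results[$label], "\n"; // each line should end in ABC
}
```

So the first alternative eats ordinary content up to the next table tag, and the second alternative handles a nested table by running the same pattern again at that position.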

 

A complete background is in Friedl's "Mastering Regular Expressions" but for a quick PHP regex syntax overview:

http://us3.php.net/manual/en/reference.pcre.pattern.syntax.php

 

Search for "Recursive Patterns" on that page and you will find a discussion of the general pattern, although their example matches nested/non-nested groups of parentheses instead. It is simpler to construct a pattern with single bounding characters such as ( and ) than with table tags, but the theory is the same.
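A minimal sketch of that simpler case, written in the style of the manual's example rather than copied from it (the subject string is invented for illustration):

```php
<?php
// Balanced parentheses: an opening paren, then any mix of non-paren runs
// and nested balanced groups, then the closing paren.
$pattern = '~\( (?: (?>[^()]+) | (?R) )* \)~x';

// Hypothetical subject with nested, flat, and unbalanced parentheses.
$subject = 'f(a, (b + c), d) and g(e) but not )h(';

preg_match_all($pattern, $subject, $matches);
print_r($matches[0]); // the two balanced groups: "(a, (b + c), d)" and "(e)"
```

Swap `\(` and `\)` for the table tags and `[^()]` for the lookahead-guarded dot, and you get the structure of the table pattern.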

 

 
