
[SOLVED] HELP: nested tags?


mab


Hello @all,

 

I'm currently struggling with regex and got stuck on this little problem.

 

I have source code with nested tables and text between the separate tables. I want to remove all tables, but all text that is not inside a table should stay. I am using the preg_replace function (a PCRE function) to do this. At the moment it works with non-nested tables, but as soon as there are nested tables it doesn't work properly.

 

Hope my explanation is understandable. Anybody out there who can help? I appreciate any suggestions ...

 

Thanks so much in advance!!

 

Well, to explain a bit more, here's some code:

 

 

$pattern = '{(<[ \\n\\r\\t]*(table)(>|[^>]*>))(.*?)(<[ \\n\\r\\t]*/[ \\n\\r\\t]*(\2)(>|[^>]*>))}is';
$replacement = '';
echo preg_replace($pattern, $replacement, $subject);

 

And here is an example of the 'subject':

 

<h1>Some tables and text</h1>

<table>
  <tr>
    <th>England</th>
    <th>Paris</th>
    <th>Munich</th>
  </tr>
  <tr>
    <td>
		<table>
  			<tr>
    			<th>London</th>
	    	<th>Brighton</th>
    			<th>Cambridge</th>
  			</tr>
  			<tr>
    			<td>rain</td>
    			<td>sun</td>
    			<td>wind</td>
  			</tr>
			</table>
    </td>
    <td>sun</td>
    <td>wind</td>
  </tr>  
</table>
This is a text between the tables which should not be removed.
<table>
  <tr>
    <th>London</th>
    <th>Paris</th>
    <th>Munich</th>
  </tr>
  <tr>
    <td>rain</td>
    <td>sun</td>
    <td>wind</td>
  </tr>
  </table>
This is a text after the tables which should also not be removed.

Link to comment
Share on other sites

<?php
$html='<h1>Some tables and text</h1>

<table>
  <tr>
    <th>England</th>
    <th>Paris</th>
    <th>Munich</th>
  </tr>
  <tr>
    <td>
		<table>
  			<tr>
    			<th>London</th>
	    	<th>Brighton</th>
    			<th>Cambridge</th>
  			</tr>
  			<tr>
    			<td>rain</td>
    			<td>sun</td>
    			<td>wind</td>
  			</tr>
			</table>
    </td>
    <td>sun</td>
    <td>wind</td>
  </tr>  
</table>
This is a text between the tables which should not be removed.
<table>
  <tr>
    <th>London</th>
    <th>Paris</th>
    <th>Munich</th>
  </tr>
  <tr>
    <td>rain</td>
    <td>sun</td>
    <td>wind</td>
  </tr>
  </table>';
$html=preg_replace('~<table[^>]*>(?:(?>(?:(?!</?table[^>]*>).)+)|(?0))*</table>~is','',$html);
echo $html;
?>

Link to comment
Share on other sites

Try this:

<?php
$test = '<h1>Some tables and text</h1>

<table>
  <tr>
    <th>England</th>
    <th>Paris</th>
    <th>Munich</th>
  </tr>
  <tr>
    <td>
		<table>
  			<tr>
    			<th>London</th>
	    	<th>Brighton</th>
    			<th>Cambridge</th>
  			</tr>
  			<tr>
    			<td>rain</td>
    			<td>sun</td>
    			<td>wind</td>
  			</tr>
			</table>
    </td>
    <td>sun</td>
    <td>wind</td>
  </tr>  
</table>
This is a text between the tables which should not be removed.
<table>
  <tr>
    <th>London</th>
    <th>Paris</th>
    <th>Munich</th>
  </tr>
  <tr>
    <td>rain</td>
    <td>sun</td>
    <td>wind</td>
  </tr>
  </table>
This is a text after the tables which should also not be removed.';
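$out = '';
$start = 0;                     // offset where the next kept text begins
$a = strpos($test, '<table');   // position of the next opening tag
$b = strpos($test, '</table');  // position of the next closing tag
$open_tag = 0;                  // current nesting depth
while ($a !== false || $b !== false) {
	if ($a !== false && ($b === false || $a < $b)) {
		// An opening tag comes next; at depth 0, keep the text before it.
		if ($open_tag == 0) {
			$out .= substr($test, $start, $a - $start);
		}
		$open_tag++;
		$a = strpos($test, '<table', $a + 1);
	} else {
		// A closing tag comes next; kept text resumes after its '>'.
		$open_tag--;
		$start = strpos($test, '>', $b + 1) + 1;
		$b = strpos($test, '</table', $start);
	}
}
// A non-zero depth means the table tags were unbalanced.
if ($open_tag) die('HTML error!');
$out .= substr($test, $start);
echo $out;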
?>

Link to comment
Share on other sites

Thanks for all your help. I tried both solutions and both work for me. I really appreciate it.

 

Regex is quite new to me and I don't fully understand the first solution. The difficult part seems to be the one in parentheses:

(?:(?>(?:(?!</?table[^>]*>).)+)|(?0))*

 

 

Well, I recognized the atomic group, the negative lookahead, and that there is an alternation.

 

So what does the first alternative do?

(?>(?:(?!</?table[^>]*>).)+)

Am I right that it matches any character, but first looks ahead to check that no opening or closing table tag starts at that position? And that the atomic group keeps the match as a whole, so it can only be given back as a whole?

 

 

And the second alternative?

(?0)

What does this do, and which part of the regex does it refer to?

 

 

I would be really happy if you could give a short explanation, so that I understand how to create such a pattern on my own next time.

 

So thanks again for all the nice and fast help!

Link to comment
Share on other sites

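(?0) recurses into the whole pattern; the more common way of writing that is (?R). Writing it as (?0) helps to illustrate that you could instead put the nested part in capture group 1 and recurse into it with (?1), for example when the pattern also contains lookaheads or lookbehinds that should not be repeated at every nesting level.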
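As a rough sketch of that point, here is the table pattern from this thread written once with (?R) and once with the nested part in capture group 1, called via (?1). The variable names and the shortened sample input are made up for illustration, and the x flag is only there so the first pattern can carry comments.

```php
<?php
// Hypothetical sample input, shortened from the one posted in this thread.
$html = 'before<table><tr><td><table><tr><td>inner</td></tr></table></td></tr></table>'
      . 'between<table><tr><td>x</td></tr></table>after';

// Recursion into the whole pattern: (?R) means the same as (?0).
$whole = '~
    <table[^>]*>                              # an opening table tag
    (?:
        (?> (?: (?! </?table[^>]*> ) . )+ )   # a run of text with no table tag in it
      |
        (?R)                                  # or one complete nested table
    )*
    </table>                                  # the closing tag for this level
~isx';

// Recursion into capture group 1 only: (?1) re-runs just that group.
$group = '~(<table[^>]*>(?:(?>(?:(?!</?table[^>]*>).)+)|(?1))*</table>)~is';

$withR = preg_replace($whole, '', $html);
$with1 = preg_replace($group, '', $html);

echo $withR, "\n";   // beforebetweenafter
echo $with1, "\n";   // beforebetweenafter
```

Both variants remove the same text here. The (?1) form only starts to matter once something sits outside group 1 that should be checked once rather than at every level.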

 

A complete background is in Friedl's "Mastering Regular Expressions" but for a quick PHP regex syntax overview:

http://us3.php.net/manual/en/reference.pcre.pattern.syntax.php

 

Search for "Recursive Patterns" on that page and you will see the discussion of the general pattern, although their example matches nested parentheses instead of tags. It is simpler to construct a pattern with single-character delimiters such as ( and ) than with table tags, but the theory is the same.
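For comparison, a minimal sketch of the parentheses case. It is written from memory in the spirit of the manual's example, not copied from it, and the sample text is hypothetical.

```php
<?php
// One parenthesised group, however deeply nested:
//   \(         an opening parenthesis
//   [^()]++    a run of non-parenthesis characters (possessive, so never given back)
//   (?R)       or a complete nested group (recursion into the whole pattern)
//   \)         the closing parenthesis for this level
$pattern = '~\((?:[^()]++|(?R))*\)~';

$text = 'f(a, g(b, (c))) + h(d)';   // hypothetical sample input

preg_match_all($pattern, $text, $m);
print_r($m[0]);   // the two top-level groups: "(a, g(b, (c)))" and "(d)"
```

The table pattern has the same shape: the opening and closing tags take the place of ( and ), and the lookahead-guarded dot takes the place of [^()], because "not a table tag" cannot be expressed as a single character class.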

 

 
