preg_match_all nested HTML tags

random_ · October 7, 2013

So I have an HTML file with this code:

          <div class="className" id="va_56">
            <div class="newclName">              
              <div class="another">
                <a href="/va56/md"><img src="http://imageshack.us/someimage.jpg" id="name_56" /></a>
              </div>
              <p><a href="/va56/md">md</a></p>
                            
              <p class="ptext">
                <span class="de">
                  <span class="done">(5369 max)</span>
                  Some text: 82%
                </span>
              </p>
            </div>
          </div>
                

          <div class="className" id="va_57">
            <div class="newclName">              
              <div class="another">
                <a href="/va57/md"><img src="http://imageshack.us/someimage2.jpg" id="name_57" /></a>
              </div>
              <p><a href="/va57/md">md</a></p>
                            
              <p class="ptext">
                <span class="de">
                  <span class="done">(469 max)</span>
                  Some text: 50%
                </span>
              </p>
            </div>
          </div>

I need to extract those nested HTML div tags, including the first one, <div class="className" id="va_56"> and <div class="className" id="va_57">, and any others that might occur in the file.

I only managed to get either the class attribute, or the id attribute, or only some of the div content, but without the class, the id and the other nested content.

This is the code I used:

	$url = file_get_contents("file.html");

	function srch($var)
	{
		/* $tmp = preg_match_all('/(<div)(<\/div>)/', $var, $matches); */
		$tmp = preg_match_all('/(id="(va_\w*)")/is', $var, $matches);
		/* $tmp = preg_match_all('/(<div \w*)(.*)(<\/div>)/', $var, $matches); */
		$result = array($tmp);
		array_push($result, $matches[2]);
		array_push($result, count($matches[2]));
		return $result;
	}

	$result = srch($url);
	echo '<pre>';
	var_dump($result);
	echo '</pre>';

When I use something like /(<div).(<\/div>)/, why won't it return all the content between the first div tag and the closing div tag it finds? Or maybe I don't need preg_match_all at all and should just use preg_match instead. I know it returns only one result, but if we consider just the first part of the posted HTML (the div with id va_56), would it get the content between the tags? I would rather use preg_match_all to avoid loops.

.josh · October 7, 2013

Regex is not good for trying to parse html content, especially nested html content. Regex is good at parsing regular languages. HTML is a context-free language. Use DOM instead.
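For reference, a minimal sketch of what the DOM route could look like for the markup in the question, using PHP's bundled DOMDocument and DOMXPath classes. The sample HTML is a shortened, assumed version of the posted file, and the $blocks array is just an illustrative name.

```php
<?php
// Shortened sample modelled on the question; not the full original file.
$html = <<<'HTML'
<div class="className" id="va_56">
  <div class="newclName">
    <p><a href="/va56/md">md</a></p>
  </div>
</div>
<div class="className" id="va_57">
  <div class="newclName">
    <p><a href="/va57/md">md</a></p>
  </div>
</div>
HTML;

$doc = new DOMDocument();
// Collect libxml warnings internally instead of printing them for sloppy markup.
libxml_use_internal_errors(true);
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);

// Select every outer div by class; the nested divs come along as child nodes.
$blocks = [];
foreach ($xpath->query('//div[@class="className"]') as $div) {
    // saveHTML($node) serialises that element together with its nested content.
    $blocks[$div->getAttribute('id')] = $doc->saveHTML($div);
}

print_r(array_keys($blocks));
```

Each entry in $blocks holds one complete outer div, nested tags included, keyed by its id.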

random_ · October 8, 2013

I know about DOM; I used it to build custom URLs from an HTML page. But now I want to get to grips with regular expressions, so I don't need to parse the output. I just want it returned so I can see what preg_match_all matched and returned (int() is not enough info). I thought HTML would be a good example because of its complexity (<=./, text, etc.), so I tried one example: I copy/pasted this HTML and used preg_match_all instead of preg_match to catch more than one occurrence in the output.

I know I'm doing it wrong. Learning from complex to simple is backwards; it should be the other way around.

So again I hit a wall. This code works:

$tmp = preg_match_all('/<div class="(\w*)" id="(\w*)">/', $file, $matches);

The problem is how to match all the text up to the end tag <\/div>. I tried ., but does . match all text, including special characters, or should I use square brackets to define what to expect there, e.g. [a-z*+,A-Z*+,\t,\n,\Q,\w,\W,\s,\d]? That bracket definition doesn't work; think of it just as an illustration.

One more time: it's just practice and I don't need this to be parsed.

Hmmmm, now I just realised something. This is the $pattern: /<div class="(\w*)" id="(\w*)">/ and if I type something between / / directly it won't get returned unless it is a metacharacter, e.g. \w, so $pattern should look like this: /<\w\s\w="\w"\s\w\"\w">/ (it's just an illustration, I know this doesn't work, so tell me what I'm doing wrong here).

.josh · October 8, 2013

Okay well then here are a couple of tips:

1) Don't use / as the pattern delimiter. Since you are using it as your delimiter, you have to escape it if you want to use it in your pattern, and html uses it. It's not a necessity, it's a convenience/readability thing. It's good practice to use a delimiter that won't appear in your pattern so you don't have to make your pattern uglier with extra escapes. Popular alternatives are the pound sign # or tilde ~.

example:

$tmp = preg_match_all('~<div class="(\w*)" id="(\w*)">~', $file, $matches);

2) Read up on the difference between greedy matching .* vs. lazy matching .*?. Basically, with html you should never use .* because you will almost certainly get more than you bargained for. On that note..

3) As much as possible, avoid "match alls". As much as possible, use quantifiers with character classes instead. In general they are safer than matching everything. For example, instead of this:

~<a.*?id='someid'.*?>~

Do this:

~<a[^>]*id='someid'[^>]*>~
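A small sketch of why the character class is the safer of the two. The markup here is made up for illustration: the id sits on a neighbouring span rather than on the anchor, so a correct pattern should find nothing.

```php
<?php
// Hypothetical markup: the id belongs to the <span>, not to the <a>.
$html = "<a href='/x'>x</a> <span id='someid'>y</span>";

// Lazy dot: still free to run past the end of the <a> tag to find the id.
$lazyHits = preg_match("~<a.*?id='someid'.*?>~", $html, $lazy);

// Negated class: [^>]* cannot leave the tag, so nothing matches here.
$classHits = preg_match("~<a[^>]*id='someid'[^>]*>~", $html, $strict);

var_dump($lazyHits, $lazy[0] ?? null); // 1, and a match that spills into the <span>
var_dump($classHits);                  // 0
```

The lazy version reports a false positive that starts at the anchor and ends inside the span; the character-class version correctly reports no match.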

4) Read up on modifiers, particularly the "s" modifier. It changes the behavior of the dot so that it also matches newlines, which lets the "match alls" you do have to use span multiple lines.
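A quick sketch of the difference the modifier makes, using a paragraph whose content sits on its own line (the sample string is an assumption loosely based on the posted HTML):

```php
<?php
// The text between the tags spans several lines.
$html = "<p class=\"ptext\">\n  Some text: 82%\n</p>";

// Without "s" the dot refuses to match a newline, so the pattern fails.
$without = preg_match('~<p[^>]*>(.*?)</p>~', $html, $m1);

// With "s" the dot matches newlines too, so the lazy match can reach </p>.
$with = preg_match('~<p[^>]*>(.*?)</p>~s', $html, $m2);

var_dump($without);     // 0
var_dump($with);        // 1
var_dump(trim($m2[1])); // "Some text: 82%"
```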

5) Make the pattern flexible to cover inconsistencies in markup. For example, you may see any of these things:

<a id = 'someid' >
<a id='someid'>
<a id= 'someid'>
<a id= "someid">
etc..

The above would best be handled like this:

~<a[^>]*id\s*=\s*(['"])((?:(?!\1).)+)\1[^>]*>~

Though there are certainly less complex patterns that you can probably get away with, if you know the format isn't going to deviate. But even this doesn't consider if someone doesn't use quotes around the attribute.. for that you'd have to match expected format of the attribute - which is "easy" for something like an id.. but what about arbitrary attributes with arbitrary values? What about ignoring escaped quotes? It is not that regex can't handle all these things - it can (well, most of them, anyways), but this is one of the things that make using DOM a better choice than regex for parsing html. The DOM accounts for this stuff already. Your regex will also have to account for it. It's like insisting on writing your own in_array or sort function from scratch instead of using the baked in functions.

6) Read up on referencing captured groups. For example, you can see in the example code in the previous point that I used \1. First I matched the opening quote, regardless of whether it was single or double, and put it into a captured group. Then I referenced it to match the value of the attribute by matching, one at a time, characters that were not the captured quote. Then I referenced it once again for the closing quote. Technically I probably didn't need that final bit in that example since the example just matches to the end of the tag, but in reality you'll want to have it if you're matching other stuff in it. This will come in handy for matching nested stuff, especially when used with zero width assertions.
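To make the backreference idea concrete, here is a sketch that runs a pattern of that shape, written with a non-capturing group around the negative lookahead, against the four tag variants listed in point 5. The nowdoc strings are only there so that neither quote style needs escaping in PHP.

```php
<?php
// Group 1 captures the opening quote; (?!\1) keeps the value from running
// past that same quote; the final \1 requires the matching closing quote.
$pattern = <<<'REGEX'
~<a[^>]*id\s*=\s*(['"])((?:(?!\1).)+)\1[^>]*>~
REGEX;

$html = <<<'HTML'
<a id = 'someid' >
<a id='someid'>
<a id= 'someid'>
<a id= "someid">
HTML;

preg_match_all($pattern, $html, $matches);

print_r($matches[1]); // the quote character each tag used
print_r($matches[2]); // the id value found in each tag
```

All four variants match, and $matches[2] contains someid four times.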

7) Read up on zero-width assertions: lookaheads and lookbehinds, known collectively as lookarounds. This is one of the more "advanced" parts of regex, the part people struggle with the most. But it really comes in handy when matching up nested content.
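As one small illustration of a lookahead (not a full nested-tag solution), the sketch below checks that a div carries class="className" without consuming any text, so the id can then be captured wherever it sits in the tag. The sample tags are assumptions modelled on the HTML in the question.

```php
<?php
// Hypothetical opening tags: same class, attributes in different orders.
$html = '<div id="va_56" class="className"><div class="newclName">'
      . '<div class="className" id="va_57">';

// (?=...) peeks ahead inside the tag for the class and consumes nothing,
// so the id capture that follows works regardless of attribute order.
$pattern = '~<div(?=[^>]*\bclass="className")[^>]*\bid="(\w+)"[^>]*>~';

preg_match_all($pattern, $html, $matches);

print_r($matches[1]); // va_56 and va_57; the newclName div is skipped
```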

8) Read up on conditions (conditional subpatterns). This will also come in handy for nested tags, particularly tags that do not have full closing tags (e.g. <br />).
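A sketch of a conditional subpattern, applied here to the unquoted-attribute case raised in point 5 rather than to nested tags: the closing quote is required only if an opening quote was captured. The sample tags and the assumption that id values consist of word characters and hyphens are illustrative.

```php
<?php
// (["'])? optionally captures an opening quote as group 1.
// (?(1)\1) means: if group 1 took part in the match, require the same
// quote again; otherwise require nothing.
$pattern = <<<'REGEX'
~<a[^>]*\bid\s*=\s*(["'])?([\w-]+)(?(1)\1)[^>]*>~
REGEX;

// The last tag has mismatched quotes and should be rejected.
$html = <<<'HTML'
<a id="one"> <a id='two'> <a id=three> <a id="four'>
HTML;

preg_match_all($pattern, $html, $matches);

print_r($matches[2]); // one, two, three
```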

random_ · October 8, 2013

Thank you .josh, you were more than helpful.

I was building my own pattern, but now I see it's too literal and not flexible like yours in the example above:

$tmp = preg_match_all('/<(\w*\s\w*="\w*"\s\w*="\w*")>/', $file, $matches);

I knew about escaping certain characters like / or ), but I didn't know that I could use a pattern delimiter other than / /. I don't know whether ' " = < > can be escaped or whether they belong to any special character group...

I ran into greedy matching today when I used /<.*>/ as a pattern (it just returns everything), and I noticed that if I write it as /<(.*)>/ it spits out another array containing everything without the wrapping < >.
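For anyone following along, a minimal sketch of those two patterns plus a lazy variant, run on a made-up string. $matches[0] always holds the full matches, while $matches[1] holds what the first pair of parentheses captured.

```php
<?php
$html = '<p>one</p><p>two</p>';

// Greedy, no group: a single match running from the first < to the last >.
preg_match_all('/<.*>/', $html, $greedy);

// Greedy with a group: index 1 holds the same text minus the outer < >.
preg_match_all('/<(.*)>/', $html, $grouped);

// Lazy: .*? stops at the first > it can reach, so each tag matches on its own.
preg_match_all('/<(.*?)>/', $html, $lazy);

print_r($greedy[0]);  // the whole string as one match
print_r($grouped[1]); // "p>one</p><p>two</p"
print_r($lazy[1]);    // "p", "/p", "p", "/p"
```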

I also ran into lookaheads and lookbehinds today, when I was unable to match the rest of that code by adding \r\n and multiple \s to match the next tag using the code I posted, but I didn't quite understand them.

Excellent tip about pattern flexibility. I would never have thought of that and would rather have used a literal definition with some conditions... wrong, I know.

Now I know how much I don't know, so I will need to learn.

Once more, thanks.
