[SOLVED] regexp html parsing help

agrafuese · October 5, 2007

Hello, I want to parse some links (just the text part) from HTML documents, but I only want to grab the ones that are contained within a certain DIV. Each link is separated by a comma and then a space (the last link is followed by just a space, obviously), and there can be any number of links within the DIV (it's not always the same). Here's what a typical page would look like:

<div id="the_id"><a href="one_url.html">Some Text</a>, <a href="another_url.html">Some Text</a> </div>

This is the regex I have so far (using preg_match_all). I know it's wrong, but I wanted to show where my thinking is so someone can tell me how far off I am:

"/<div\sid=\"the_id\">(<a\shref=\"[^\"]+\">(.*)<\/a>,?\s)*<\/div>/siU"
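
A note on why this attempt falls short: in PCRE, a capturing group that sits inside a repeated group keeps only the text from its last iteration. The pattern above therefore matches the whole DIV once but reports only the final link's text. The sketch below is a quick way to see this; the variable names ($html, $pattern) are made up for the illustration, and the sample string is the one from the later post.

```php
<?php
// Sample markup with two links inside the target DIV (taken from the thread).
$html = '<div id="the_id"><a href="one_url.html">Some Text1</a>, '
      . '<a href="another_url.html">Some Text2</a> </div>';

// Same pattern as above, written in single quotes so the double quotes need no escaping.
$pattern = '/<div\sid="the_id">(<a\shref="[^"]+">(.*)<\/a>,?\s)*<\/div>/siU';

preg_match_all($pattern, $html, $matches);

// One overall match (the whole DIV); group 2 holds only the last link's text.
print_r($matches[2]);
```

So the regex is not far off as a matcher, but a repeated group cannot hand back every link on its own.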

agrafuese · October 6, 2007

Maybe I was a bit too vague about this question...here's a better way to put it.

I have this very simple regex to extract the text from links (not the url, just the text), like so:

"/<a\shref=\"[^\"]+\">(.*)<\/a>,?\s/siU"

I know it's not as fool-proof as a normal link parsing regex should be, but that's fine for me at the moment. Now, what I want to do is extract only the links that are nested within a particular DIV. As of right now, I have the following code which does the job using a preg_match statement and a preg_match_all statement (see below), but I was wondering if this is possible with just one preg_match_all statement. This is something I am more curious about because I'd like to do it the "right" way, but like I said, the code I have right now is technically working. Any help would be appreciated.


$whole_string = '<div id="the_id"><a href="one_url.html">Some Text1</a>, <a href="another_url.html">Some Text2</a> </div>';
$whole_pattern = "/<div\sid=\"the_id\">(.*)<\/div>/siU";
$join_string = ''; // initialize before appending, to avoid an undefined-variable notice

if (preg_match($whole_pattern, $whole_string, $sub_matches)) { // extract the links from the DIV
    $sub_pattern = "/<a\shref=\"[^\"]+\">(.*)<\/a>/siU";
    preg_match_all($sub_pattern, $sub_matches[1], $matches, PREG_SET_ORDER); // extract the text from the links

    foreach ($matches as $match) {
        $join_string .= $match[1] . ', '; // formats the link text like so: Some Text1, Some Text2, etc...
    }
}

$insert_string = substr($join_string, 0, -2); // removes the last comma and space

...etc...
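
On the "just one preg_match_all" question, one possible approach is the \G anchor, which matches only at the position where the previous match ended. The sketch below is an illustration, not a tested production solution: it assumes the links directly follow the opening DIV tag and are separated by nothing but an optional comma and whitespace, as in the sample string. The names $single_pattern and $link_texts are made up for the example.

```php
<?php
$whole_string = '<div id="the_id"><a href="one_url.html">Some Text1</a>, '
              . '<a href="another_url.html">Some Text2</a> </div>';

// Each match must begin either at the opening DIV tag or exactly where the
// previous match ended (\G). (?!^) keeps \G from matching at the very start
// of the string, so links outside the DIV are never picked up.
$single_pattern = '/(?:<div\sid="the_id">|\G(?!^),?\s*)<a\shref="[^"]+">(.*?)<\/a>/si';

preg_match_all($single_pattern, $whole_string, $matches);

$link_texts = $matches[1];                     // text of each link in the DIV
$insert_string = implode(', ', $link_texts);   // joins without a trailing comma

echo $insert_string, "\n";
```

Using implode() also removes the need to trim the trailing comma and space with substr(). If the markup between links can vary more than this assumption allows, the two-step version above is probably the safer choice.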

agrafuese · October 6, 2007

UPDATE: I ended up going a different route altogether on this project, so I no longer need the advice. If you still want to answer the question, however, feel free.
