[SOLVED] regexp html parsing help


agrafuese

Hello, I want to parse some links (just the text part) from HTML documents, but only the ones contained within a certain DIV. The links are separated by a comma and a space (the last link is followed by just a space), and the number of links in the DIV varies. Here's what a typical page looks like:

 

<div id="the_id"><a href="one_url.html">Some Text</a>, <a href="another_url.html">Some Text</a> </div>

 

Here is the regex I have so far (used with preg_match_all). I know it's wrong, but I wanted to show where my thinking is so someone can tell me how far off I am :) :

 

"/<div\sid=\"the_id\">(<a\shref=\"[^\"]+\">(.*)<\/a>,?\s)*<\/div>/siU"
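A likely reason this pattern falls short: when a capturing group sits inside a repeated group, PCRE keeps only the value from the last repetition, so the DIV matches once and only the final link's text is captured. The sketch below illustrates this; the numbered link text and the variable names ($html, $m, $captured) are assumptions added for clarity, not part of the original post.

```php
<?php
// Sample markup from the post; the link text is numbered here (an assumption)
// so the two links can be told apart in the result.
$html = '<div id="the_id"><a href="one_url.html">Some Text1</a>, <a href="another_url.html">Some Text2</a> </div>';

// The pattern exactly as posted above.
$pattern = "/<div\sid=\"the_id\">(<a\shref=\"[^\"]+\">(.*)<\/a>,?\s)*<\/div>/siU";

// The whole DIV is consumed by a single match.
$count = preg_match_all($pattern, $html, $m);

// Group 2 is overwritten on every repetition of the outer group,
// so only the last link's text survives.
$captured = $m[2];

print_r($captured);
```

In other words, the pattern does match the DIV, but it cannot return one capture per link.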


Maybe I was a bit too vague with this question... here's a better way to put it.

 

I have this very simple regex to extract the text from links (not the url, just the text), like so:

 

"/<a\shref=\"[^\"]+\">(.*)<\/a>,?\s/siU"

 

I know it's not as foolproof as a proper link-parsing regex should be, but that's fine for me at the moment. What I want now is to extract only the links nested within a particular DIV. Right now the code below does the job with a preg_match call followed by a preg_match_all call, but I was wondering whether it's possible with just one preg_match_all call. I'm mostly curious because I'd like to do it the "right" way; like I said, the code I have is technically working. Any help would be appreciated.

 


$whole_string = '<div id="the_id"><a href="one_url.html">Some Text1</a>, <a href="another_url.html">Some Text2</a> </div>';
$whole_pattern = "/<div\sid=\"the_id\">(.*)<\/div>/siU";
$insert_string = '';
if (preg_match($whole_pattern, $whole_string, $sub_matches)) { // extract the contents of the DIV
    $sub_pattern = "/<a\shref=\"[^\"]+\">(.*)<\/a>/siU";
    preg_match_all($sub_pattern, $sub_matches[1], $matches, PREG_SET_ORDER); // extract the text from the links
    $join_string = ''; // initialize before appending to avoid an undefined-variable notice
    foreach ($matches as $match) {
        $join_string .= $match[1] . ', '; // formats the link text like so: Some Text1, Some Text2, etc...
    }
    $insert_string = substr($join_string, 0, -2); // removes the last comma and space
}
// ...etc...
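One way a single preg_match_all call might handle this is PCRE's \G anchor, which matches only at the position where the current match attempt starts, so each link must either directly follow the opening DIV tag or directly follow the previous matched link. This is a minimal sketch under the markup assumptions in the post (links separated by a comma and a space, nothing else between them), not a general HTML parser. The links placed outside the DIV and the variable names are hypothetical and only there to show they are skipped.

```php
<?php
// Sample from the post, plus hypothetical links before and after the DIV
// to check that they are not picked up.
$whole_string = '<a href="pre.html">Before</a> '
    . '<div id="the_id"><a href="one_url.html">Some Text1</a>, <a href="another_url.html">Some Text2</a> </div>'
    . ' <a href="post.html">After</a>';

// Either start at the opening DIV tag, or continue (\G) exactly where the
// previous match ended, skipping the optional comma and whitespace.
// (?!^) stops \G from matching at the very start of the subject.
$pattern = '/(?:<div\sid="the_id">|\G(?!^),?\s*)<a\shref="[^"]+">(.*?)<\/a>/si';

preg_match_all($pattern, $whole_string, $matches);

// $matches[1] holds one entry per link inside the DIV.
$insert_string = implode(', ', $matches[1]);

echo $insert_string, "\n";
```

The chain breaks as soon as something other than a link follows, such as the closing DIV tag, so later links are ignored. If the markup inside the DIV is less regular than the sample, the two-step version is probably the safer choice.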
