agrafuese Posted October 5, 2007

Hello, I want to parse some links (just the text part) from HTML documents, but I only want to grab the ones that are contained within a certain DIV. Each link is separated by a comma and then a space (the last link is followed by just a space, obviously), and there can be any number of links within the DIV (it's not always the same). Here's what a typical page would look like:

    <div id="the_id"><a href="one_url.html">Some Text</a>, <a href="another_url.html">Some Text</a> </div>

This is the code for the regex I have so far (using preg_match_all). I know it's wrong, but I just wanted to show where I am in terms of thinking so someone can tell me how far off I am:

    "/<div\sid=\"the_id\">(<a\shref=\"[^\"]+\">(.*)<\/a>,?\s)*<\/div>/siU"

Link to comment https://forums.phpfreaks.com/topic/72037-solved-regexp-html-parsing-help/
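One likely reason the pattern above falls short, offered as a hedged observation rather than a diagnosis from the thread: when a capturing group sits inside a repeated group, PCRE keeps only the text from the last repetition, so a single match over the whole DIV cannot return every link text. A minimal sketch, assuming the sample markup from the post (the link texts are made distinct and the variable names are illustrative, not from the original):

```php
<?php
// Sample markup modeled on the post; link texts are made distinct for illustration.
$html = '<div id="the_id"><a href="one_url.html">Some Text1</a>, '
      . '<a href="another_url.html">Some Text2</a> </div>';

// Same idea as the posted pattern, using a lazy (.*?) instead of the U flag
// and a non-capturing outer group so the link text is group 1.
$pattern = '/<div\sid="the_id">(?:<a\shref="[^"]+">(.*?)<\/a>,?\s)*<\/div>/si';

$count = preg_match_all($pattern, $html, $matches);

// The whole DIV is consumed by a single match, and group 1 holds only
// the text captured on the final pass through the repeated group.
print_r($matches[1]);
```

Run against this sample, the pattern matches once and group 1 contains only "Some Text2"; the first link's text is lost.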
agrafuese Posted October 6, 2007

Maybe I was a bit too vague about this question, so here's a better way to put it. I have this very simple regex to extract the text from links (not the URL, just the text), like so:

    "/<a\shref=\"[^\"]+\">(.*)<\/a>,?\s/siU"

I know it's not as foolproof as a normal link-parsing regex should be, but that's fine for me at the moment. Now, what I want to do is extract only the links that are nested within a particular DIV. Right now I have the following code, which does the job using a preg_match call followed by a preg_match_all call (see below), but I was wondering if this is possible with just one preg_match_all call. I'm mostly curious because I'd like to do it the "right" way; like I said, the code I have right now is technically working. Any help would be appreciated.

    $whole_string = '<div id="the_id"><a href="one_url.html">Some Text1</a>, <a href="another_url.html">Some Text2</a> </div>';
    $whole_pattern = "/<div\sid=\"the_id\">(.*)<\/div>/siU";
    preg_match($whole_pattern, $whole_string, $sub_matches); // extract the links from the DIV

    $sub_pattern = "/<a\shref=\"[^\"]+\">(.*)<\/a>/siU";
    preg_match_all($sub_pattern, $sub_matches[1], $matches, PREG_SET_ORDER); // extract the text from the links

    $join_string = ''; // initialize before appending to avoid an undefined-variable notice
    foreach ($matches as $match) {
        $join_string .= $match[1] . ', '; // formats the link text like so: Some Text1, Some Text2, etc...
    }
    $insert_string = substr($join_string, 0, -2); // removes the last comma and space
    ...etc...
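On the question of doing this with a single preg_match_all call: one possible approach, which nobody in the thread proposed, is PCRE's \G anchor, which matches only at the position where the previous match ended. The sketch below assumes the links directly follow the opening DIV tag and are separated by nothing more than an optional comma and whitespace, as in the sample; the extra links outside the DIV are invented here to show that they are skipped.

```php
<?php
// Sample markup: the links outside the DIV are hypothetical additions.
$html = '<a href="x.html">Outside</a> '
      . '<div id="the_id"><a href="one_url.html">Some Text1</a>, '
      . '<a href="another_url.html">Some Text2</a> </div> '
      . '<a href="y.html">Also outside</a>';

// A match must begin either at the DIV's opening tag or exactly where the
// previous match ended (\G); (?!\A) stops \G from firing at the very start.
$pattern = '/(?:<div\sid="the_id">|\G(?!\A),?\s*)<a\shref="[^"]+">(.*?)<\/a>/si';

preg_match_all($pattern, $html, $matches);

// implode() joins the texts without leaving a trailing ", " to trim.
$insert_string = implode(', ', $matches[1]);
echo $insert_string, "\n";
```

Each link becomes its own match, so group 1 collects every link text inside the DIV, and the chain stops at the first thing that is not a link (here, the closing tag). If the markup inside the DIV is less regular than the sample, the two-step version from the post is probably the more robust choice.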
agrafuese Posted October 6, 2007

UPDATE: I ended up going a different route altogether on this project, so I no longer need the advice. If you still want to answer the question, however, feel free.
Archived
This topic is now archived and is closed to further replies.