agrafuese Posted October 5, 2007

Hello, I want to parse some links (just the text part) from HTML documents, but I only want to grab the ones that are contained within a certain DIV. Each link is separated by a comma and then a space (the last link is followed by just a space), and the number of links within the DIV varies from page to page. Here's what a typical page would look like:

    <div id="the_id"><a href="one_url.html">Some Text</a>, <a href="another_url.html">Some Text</a> </div>

This is the regex I have so far (using preg_match_all). I know it's wrong, but I wanted to show where I am in my thinking so that someone can tell me how far off I am:

    "/<div\sid=\"the_id\">(<a\shref=\"[^\"]+\">(.*)<\/a>,?\s)*<\/div>/siU"
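A note on how far off that pattern is: as far as I can tell it does match the sample DIV, but a capturing group that sits inside a repeated group only keeps what it captured on the last repetition. So preg_match_all should report one match for the whole DIV, with only the final link's text in the inner group. The sketch below is just a way to check that. It uses the pattern from the post (re-quoted with single quotes) and swaps in the distinct link texts Some Text1 and Some Text2 so the effect is visible; `$html`, `$pattern`, `$count` and `$lastText` are names made up for the illustration.

```php
<?php
// Sample markup, with distinct link texts so the captures can be told apart.
$html = '<div id="the_id"><a href="one_url.html">Some Text1</a>, '
      . '<a href="another_url.html">Some Text2</a> </div>';

// The pattern from the post above, unchanged apart from the quoting style.
$pattern = '/<div\sid="the_id">(<a\shref="[^"]+">(.*)<\/a>,?\s)*<\/div>/siU';

$count = preg_match_all($pattern, $html, $matches, PREG_SET_ORDER);

// The DIV matches once as a whole. Group 2 holds only the text captured
// on the final pass through the repeated outer group.
$lastText = $matches[0][2];
echo $count, ' match, captured text: ', $lastText, "\n";
```

If that reading is right, the pattern is not far off as a test for the DIV's structure; it just cannot hand back every link text through one repeated group.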
agrafuese Posted October 6, 2007

Maybe I was a bit too vague about this question, so here's a better way to put it. I have this very simple regex to extract the text from links (not the URL, just the text):

    "/<a\shref=\"[^\"]+\">(.*)<\/a>,?\s/siU"

I know it's not as foolproof as a proper link-parsing regex should be, but that's fine for me at the moment. Now, what I want to do is extract only the links that are nested within a particular DIV. Right now I have the following code, which does the job using one preg_match call followed by one preg_match_all call, but I was wondering whether this is possible with just one preg_match_all call. I'm mostly curious because I'd like to do it the "right" way; as I said, the code I have right now is technically working. Any help would be appreciated.

    $whole_string = '<div id="the_id"><a href="one_url.html">Some Text1</a>, <a href="another_url.html">Some Text2</a> </div>';
    $whole_pattern = "/<div\sid=\"the_id\">(.*)<\/div>/siU";
    preg_match($whole_pattern, $whole_string, $sub_matches); // extract the links from the DIV

    $sub_pattern = "/<a\shref=\"[^\"]+\">(.*)<\/a>/siU";
    preg_match_all($sub_pattern, $sub_matches[1], $matches, PREG_SET_ORDER); // extract the text from the links

    $join_string = ''; // start empty so the first append does not raise an undefined-variable notice
    foreach ($matches as $match) {
        $join_string .= $match[1] . ', '; // formats the link text like so: Some Text1, Some Text2, etc...
    }
    $insert_string = substr($join_string, 0, -2); // removes the last comma and space
    ...etc...
agrafuese Posted October 6, 2007

UPDATE: I ended up going a different route altogether on this project, so I no longer need the advice. If you still want to answer the question, however, feel free.
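For anyone who finds this thread later: a single preg_match_all call does seem possible by using the `\G` anchor, which matches only at the point where the previous match ended. What follows is a minimal sketch, not a robust HTML parser. It assumes the links sit back to back inside the DIV, separated by nothing but an optional comma and whitespace, and that the attributes are written exactly as in the sample. `$whole_string`, `$matches` and `$insert_string` are reused from the code posted earlier; `$pattern` is a name introduced here.

```php
<?php
// Sample markup from the second post.
$whole_string = '<div id="the_id"><a href="one_url.html">Some Text1</a>, '
              . '<a href="another_url.html">Some Text2</a> </div>';

// Each match either starts at the DIV's opening tag, or continues (\G)
// exactly where the previous match ended, skipping an optional comma and
// any whitespace. (?!\A) keeps \G from matching at the very start of the
// string, where there is no previous match to continue from.
$pattern = '/(?:<div\sid="the_id">|\G(?!\A),?\s*)<a\shref="[^"]+">(.*?)<\/a>/si';

preg_match_all($pattern, $whole_string, $matches);

// $matches[1] holds one entry per link text. implode() joins them without
// leaving a trailing separator that would need trimming.
$insert_string = implode(', ', $matches[1]);
echo $insert_string, "\n";
```

On the sample this should print Some Text1, Some Text2. The chain of `\G` matches breaks as soon as something other than a comma, whitespace or another link follows, so links after the closing DIV tag are not picked up. Anything more irregular (nested tags, extra attributes, other markup between the links) is probably better served by the two-step version or by a real HTML parser.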