t_machine Posted May 22, 2009

Hi, I am trying to parse a page that contains many links. Some of the links wrap <h2> tags, and those are the ones I need. How can I get only those links? I am using the following, but it returns every link:

    $regexp = "<a\s[^>]*href=(\"??)([^\" >]*?)\\1[^>]*>(.*)<\/a>";

The <h2> links on the page look like this:

    <a href="link to page"><h2>LINK NAME</h2></a>

Thanks for any help.
Masna Posted May 22, 2009

    $regexp = "<a\s[^>]*href=(\"??)([^\" >]*?)\\1[^>]*><h2>(.*)<\/h2><\/a>";
nrg_alpha Posted May 22, 2009

Do you mean something along these lines? Example:

    $str = '<a href="http://link to page"><h2>LINK NAME(1)</h2></a> .. some text .. <a href="http://link to page2"><h2>LINK NAME(2)</h2></a>';
    if (preg_match_all('#<a [^>]*href=[\'"]([^\'"]+)[\'"].*?><h2>(.+?)</h2></a>#i', $str, $matches)) {
        // Keys are the href values, values are the h2 contents.
        $arr = array_combine($matches[1], $matches[2]);
        echo '<pre>' . print_r($arr, true);
    }

Outputs (via right-click, view source):

    Array
    (
        [http://link to page] => LINK NAME(1)
        [http://link to page2] => LINK NAME(2)
    )

All I've done here is capture the href values and the h2 contents, then combine the two arrays so that the keys are the href values and the values are the h2 contents. This is not necessary; I just did it for organizational purposes. The href and h2 values are still stored in $matches[1] and $matches[2] respectively.

Alternatively, instead of using array_combine, you could use a simple foreach loop to extract the captured values:

    foreach ($matches[1] as $key => $val) {
        echo 'Link: ' . $matches[1][$key] . ' - h2: ' . $matches[2][$key] . "<br />\n";
    }

EDIT - One can also use DOMDocument / XPath instead of regex to find these tags. As a side note on those tags, it seems the website in question has validation issues.
t_machine Posted May 22, 2009 (Author)

Thanks very much for the reply. You are right, the site does have poor coding. That may be the reason the code you both posted did not work. Below is exactly how their links appear in the page. I am not sure whether placing the class before the href makes a difference to the preg match.

    <a class="CLASS_NAME" href="URL" alt="ALT_TEXT">
    <h2 class="ANOTHER_CLASS" style="display: inline;">LINK NAME</h2>
    <br/>
    </a>

Thanks again for the help.
nrg_alpha Posted May 22, 2009

Let's try the DOMDocument / XPath way. Would something like this work?

    $dom = new DOMDocument;
    // Obviously, use the real website URL in question instead.
    // The @ suppresses warnings caused by invalid markup.
    @$dom->loadHTMLFile('http://somesite.whatever');
    $xpath = new DOMXPath($dom);
    // Select every <h2> that is a direct child of an <a>.
    $aTag = $xpath->query('//a/h2');
    foreach ($aTag as $val) {
        echo $val->parentNode->getAttribute('href') . ' - ' . $val->nodeValue . "<br />\n";
    }
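For anyone who wants to try the XPath approach without fetching a remote page, here is a minimal sketch that runs it against the markup t_machine posted above. It assumes PHP 5 or later with the DOM extension. The function name h2_links and the second, plain link in the sample are made up for illustration, and libxml_use_internal_errors() is swapped in for the @ operator as one possible way to keep invalid markup from printing warnings.

```php
<?php
// Sample markup based on the thread; URL, OTHER_URL and the names are placeholders.
$html = '<p>Some text</p>
<a class="CLASS_NAME" href="URL" alt="ALT_TEXT">
<h2 class="ANOTHER_CLASS" style="display: inline;">LINK NAME</h2>
<br/>
</a>
<a href="OTHER_URL">plain link without a heading</a>';

// Hypothetical helper: returns an array of href => h2 text.
function h2_links($html)
{
    // Collect parser warnings internally instead of silencing them with @.
    $previous = libxml_use_internal_errors(true);
    $dom = new DOMDocument;
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($dom);
    $links = array();
    // //a[h2] selects every <a> that has an <h2> as a direct child.
    foreach ($xpath->query('//a[h2]') as $a) {
        $h2 = $xpath->query('h2', $a)->item(0);
        $links[$a->getAttribute('href')] = trim($h2->nodeValue);
    }
    return $links;
}

print_r(h2_links($html));
```

Selecting //a[h2] rather than //a/h2 returns the anchors themselves, so the href is read from the matched node instead of going through parentNode. Either query should find the same links.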
t_machine Posted May 22, 2009 (Author)

Brilliant! This works exactly the way I need it. Thanks very much.
t_machine Posted May 23, 2009 (Author)

Is there a PHP 4 version of DOMDocument? The code above worked great on my local server, which uses PHP 5, but it does not work on my main server, which runs PHP 4.
Axeia Posted May 23, 2009

I'd upgrade the server instead.

http://www.php.net/archive/2007.php

    PHP 4 end of life announcement [13-Jul-2007]

    Today it is exactly three years ago since PHP 5 has been released. In those three years it has seen many improvements over PHP 4. PHP 5 is fast, stable & production-ready and as PHP 6 is on the way, PHP 4 will be discontinued.

    The PHP development team hereby announces that support for PHP 4 will continue until the end of this year only. After 2007-12-31 there will be no more releases of PHP 4.4. We will continue to make critical security fixes available on a case-by-case basis until 2008-08-08. Please use the rest of this year to make your application suitable to run on PHP 5.

    For documentation on migration for PHP 4 to PHP 5, we would like to point you to our migration guide. There is additional information available in the PHP 5.0 to PHP 5.1 and PHP 5.1 to PHP 5.2 migration guides as well.

PHP 4 doesn't even get security fixes anymore.
.josh Posted May 23, 2009

nrg, your regex didn't work for him because it assumed <h2> instead of <h2[^>]*>. The h2 tags on his page carry attributes.
nrg_alpha Posted May 24, 2009

.josh wrote: "nrg your regex didn't work for him because it assumed <h2> instead of <h2[^>]*>"

Yeah, I was going by what the OP gave initially:

t_machine wrote: "The <h2> links on the page are like the following: <a href="link to page"><h2>LINK NAME</h2></a>"

So I was figuring the h2 tags contained no attributes :-/ (that'll learn me).
.josh Posted May 24, 2009

Oh sure, blame it on the OP for giving inaccurate info. As if that happens all the time. Oh wait...
t_machine Posted May 24, 2009 (Author)

Sorry for not posting the exact code in the first place. My host will not update to PHP 5, and I would like a fix until I get a new host. The fix that Crayon Violent added still did not get any results. This is how the links are laid out on the page:

    <a class="CLASS_NAME" href="URL" alt="ALT_TEXT">
    <h2 class="ANOTHER_CLASS" style="display: inline;">LINK NAME</h2>
    <br/>
    </a>

I am using file_get_contents, which fetches the content fine, but parsing it still gives no result. I am using the following, with Crayon's fix included:

    $regexp = "<a\s[^>]*href=(\"??)([^\" >]*?)\\1[^>]*><h2[^>]*>(.*)<\/h2><\/a>";
    if (preg_match_all("/$regexp/siU", $input, $matches)) {
        foreach ($matches as $match) {
            echo $match[2];
            echo $match[3];
        }
    }

Thanks again for any help.
nrg_alpha Posted May 24, 2009

My advice: switch hosting providers (seriously). The jump from version 4 to 5 is significant enough to warrant switching over. Even the PHP team themselves no longer support PHP 4.

    $str = 'Some text...<a class="CLASS_NAME" href="URL" alt="ALT_TEXT"> <h2 class="ANOTHER_CLASS" style="display: inline;">LINK NAME</h2> <br/> </a> some more text...<a class="CLASS_NAME" href="URL2" alt="ALT_TEXT"> <h2 class="ANOTHER_CLASS" style="display: inline;">LINK NAME2</h2> <br/> </a>';
    // .*? between the <a ...> and the <h2 ...> allows for whitespace, and
    // <h2[^>]*> allows for attributes on the h2 tag.
    if (preg_match_all('#<a [^>]*href=[\'"]([^\'"]+)[\'"].*?>.*?<h2[^>]*>(.+?)</h2>#si', $str, $matches)) {
        foreach ($matches[1] as $key => $val) {
            echo 'Link: ' . $matches[1][$key] . ' - h2: ' . $matches[2][$key] . "<br />\n";
        }
    }
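A note on why t_machine's previous attempt probably found nothing: that pattern requires <h2 to follow the closing > of the <a ...> tag immediately and </a> to follow </h2> immediately, while the real markup has line breaks and a <br/> in between. Separately, with the default PREG_PATTERN_ORDER, foreach ($matches as $match) walks over capture groups rather than individual links. The sketch below is one possible variant, not the pattern from the posts above: it allows whitespace before the h2 and passes PREG_SET_ORDER so that each $match is one link. The names $input, $pattern and $found are only illustrative, and it uses nothing beyond preg_match_all, so it does not depend on the PHP 5 DOM extension.

```php
<?php
// Sample input based on the markup quoted in the thread (placeholders throughout).
$input = 'Some text...<a class="CLASS_NAME" href="URL" alt="ALT_TEXT">
<h2 class="ANOTHER_CLASS" style="display: inline;">LINK NAME</h2>
<br/>
</a> some more text...<a class="CLASS_NAME" href="URL2" alt="ALT_TEXT">
<h2 class="ANOTHER_CLASS" style="display: inline;">LINK NAME2</h2>
<br/>
</a>';

// Group 1: href value. Group 2: h2 contents.
// \s* tolerates whitespace between the <a ...> tag and the <h2 ...> tag.
$pattern = '#<a\s[^>]*href=[\'"]([^\'"]+)[\'"][^>]*>\s*<h2[^>]*>(.*?)</h2>#si';

$found = array();
// PREG_SET_ORDER groups results per match: $match[0] is the full match,
// $match[1] the href and $match[2] the h2 text.
if (preg_match_all($pattern, $input, $matches, PREG_SET_ORDER)) {
    foreach ($matches as $match) {
        $found[] = array('href' => $match[1], 'text' => $match[2]);
        echo 'Link: ' . $match[1] . ' - h2: ' . $match[2] . "<br />\n";
    }
}
```

Because this variant only allows whitespace between the two tags, it will miss links where other markup sits between the <a ...> and the <h2 ...>; nrg_alpha's .*? is more permissive in that respect.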
Daniel0 Posted May 25, 2009

You're not allowed to put block elements inside inline elements. Just saying...