How to get only links with h2 tags?

t_machine · May 22, 2009

Hi, I am trying to parse a page that contains many links. Some of the links wrap <h2> tags, and those are the ones I need. How can I get only those links?

I am using the following but it returns every link.

$regexp = "<a\s[^>]*href=(\"??)([^\" >]*?)\\1[^>]*>(.*)<\/a>";

The <h2> links on the page are like the following:

<a href="link to page"><h2>LINK NAME</h2></a>

Thanks for any help

Masna · May 22, 2009

$regexp = "<a\s[^>]*href=(\"??)([^\" >]*?)\\1[^>]*><h2>(.*)<\/h2><\/a>";

nrg_alpha · May 22, 2009

Do you mean something along these lines?

Example:

$str = '<a href="http://link to page"><h2>LINK NAME(1)</h2></a> .. some text .. <a href="http://link to page2"><h2>LINK NAME(2)</h2></a>';
if (preg_match_all('#<a [^>]*href=[\'"]([^\'"]+)[\'"].*?><h2>(.+?)</h2></a>#i', $str, $matches)) {
    $arr = array_combine($matches[1], $matches[2]);
    echo '<pre>' . print_r($arr, true);
}

Outputs (via right-click, view source):

Array
(
    [http://link to page] => LINK NAME(1)
    [http://link to page2] => LINK NAME(2)
)

All I've done here is capture the href values and the h2 contents, then combine the two arrays so that the keys are the href values and the values are the h2 contents. This is not necessary, though; I just did it for organizational purposes. The href and h2 values are still stored in the arrays $matches[1] and $matches[2] respectively.

Alternatively, instead of using array_combine, you could use a simple foreach loop to extract the captured values:

foreach ($matches[1] as $key => $val) {
    echo 'Link: ' . $matches[1][$key] . ' - h2: ' . $matches[2][$key] . "<br />\n";
}

EDIT - You could also use DOMDocument / XPath instead of regex to find these tags. As a side note, judging by those tags, the website in question seems to have validation issues.

t_machine · May 22, 2009

Thanks very much for the reply. You are right, the site is poorly coded. That may be why the code you both posted did not work. Below is exactly how their links appear in the page. I am not sure whether placing the class before the href makes a difference to the preg match.

<a class="CLASS_NAME" href="URL" alt="ALT_TEXT">
<h2 class="ANOTHER_CLASS" style="display: inline;">LINK NAME</h2>
<br/>
</a>

Thanks again for the help.

nrg_alpha · May 22, 2009

Let's try the DOMDocument / XPath way:

Would something like this work?

$dom = new DOMDocument;
@$dom->loadHTMLFile('http://somesite.whatever'); // obviously, use the real website url in question instead...
$xpath = new DOMXPath($dom);
$aTag = $xpath->query('//a/h2');

foreach ($aTag as $val) {
    echo $val->parentNode->getAttribute('href') . ' - ' . $val->nodeValue . "<br />\n";
}
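
For anyone who wants to try this without fetching a live page, here is a minimal sketch of the same idea run against a string. The sample markup and its placeholder values (URL, OTHER_URL) are copied or made up for illustration, and the variable names are my own. It uses the query //a[h2], which selects the anchors themselves rather than the h2 nodes, so no parentNode hop is needed; treat it as one possible variation rather than the only way to do it.

```php
<?php
// Sample markup in the shape posted in this thread; all values are placeholders.
$html = '<p>Some text</p>
<a class="CLASS_NAME" href="URL" alt="ALT_TEXT">
<h2 class="ANOTHER_CLASS" style="display: inline;">LINK NAME</h2>
<br/>
</a>
<a href="OTHER_URL">plain link without a heading</a>';

$dom = new DOMDocument;
libxml_use_internal_errors(true); // collect parser warnings instead of printing them
$dom->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($dom);
$links = array();

// a[h2] matches only anchors that have an h2 element as a direct child
foreach ($xpath->query('//a[h2]') as $a) {
    $h2 = $a->getElementsByTagName('h2')->item(0);
    $links[$a->getAttribute('href')] = trim($h2->nodeValue);
}

print_r($links);
```

The plain link is skipped because it has no h2 child. Using libxml_use_internal_errors() is an alternative to the @ operator for keeping warnings about invalid markup quiet.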

t_machine · May 22, 2009

Brilliant! This works exactly the way I need it. Thanks very much.

t_machine · May 23, 2009

Is there a PHP 4 version of DOMDocument? The code above worked great on my local server, which uses PHP 5, but it does not work on my main server, which runs PHP 4.

Axeia · May 23, 2009

I'd upgrade the server instead

http://www.php.net/archive/2007.php

PHP 4 end of life announcement

[13-Jul-2007]

Today it is exactly three years ago since PHP 5 has been released. In those three years it has seen many improvements over PHP 4. PHP 5 is fast, stable & production-ready and as PHP 6 is on the way, PHP 4 will be discontinued.

The PHP development team hereby announces that support for PHP 4 will continue until the end of this year only. After 2007-12-31 there will be no more releases of PHP 4.4. We will continue to make critical security fixes available on a case-by-case basis until 2008-08-08. Please use the rest of this year to make your application suitable to run on PHP 5.

For documentation on migration for PHP 4 to PHP 5, we would like to point you to our migration guide. There is additional information available in the PHP 5.0 to PHP 5.1 and PHP 5.1 to PHP 5.2 migration guides as well.

PHP4 doesn't even get security fixes anymore.

.josh · May 23, 2009

nrg, your regex didn't work for him because it assumed a bare <h2> instead of <h2[^>]*>, which allows attributes.
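
A rough sketch of the difference, run against markup in the shape the OP posted. The patterns are adapted from the ones earlier in the thread; note that I have also added \s* between the tags (my own assumption, since the real markup has a line break between the <a> and the <h2>), so the only difference between the two patterns below is the h2 part.

```php
<?php
// Markup in the shape the OP posted; all values are placeholders.
$html = '<a class="CLASS_NAME" href="URL" alt="ALT_TEXT">
<h2 class="ANOTHER_CLASS" style="display: inline;">LINK NAME</h2>
<br/>
</a>';

// Literal <h2>: stops matching as soon as the tag carries attributes.
$strict = '#<a [^>]*href=[\'"]([^\'"]+)[\'"][^>]*>\s*<h2>(.+?)</h2>#si';

// <h2[^>]*>: tolerates any attributes inside the opening tag.
$loose = '#<a [^>]*href=[\'"]([^\'"]+)[\'"][^>]*>\s*<h2[^>]*>(.+?)</h2>#si';

$strictCount = preg_match_all($strict, $html, $m1);
$looseCount  = preg_match_all($loose, $html, $m2);

echo "strict: $strictCount, loose: $looseCount\n";
```

With the loose pattern, $m2[1] holds the href values and $m2[2] holds the h2 contents.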

nrg_alpha · May 24, 2009

nrg your regex didn't work for him because it assumed <h2> instead of <h2[^>]*>

Yeah, I was going by what the OP gave initially:

The <h2> links on the page are like the following:
<a href="link to page"><h2>LINK NAME</h2></a>

So I was figuring the h2 tags contained no attributes :-/ (that'll learn me).

.josh · May 24, 2009

oh sure, blame it on the OP for giving inaccurate info. As if that happens all the time. Oh wait...

t_machine · May 24, 2009

Sorry for not posting the exact code in the first place.

My host will not update to PHP 5, and I would like a fix until I get a new host. The fix that Crayon Violent added still did not get any results.

This is how the links are laid out on the page:

<a class="CLASS_NAME" href="URL" alt="ALT_TEXT">
<h2 class="ANOTHER_CLASS" style="display: inline;">LINK NAME</h2>
<br/>
</a>

I am using file_get_contents, which fetches the content fine, but parsing it still gives no result. I am using the following, with Crayon's fix included:

$regexp = "<a\s[^>]*href=(\"??)([^\" >]*?)\\1[^>]*><h2[^>]*>(.*)<\/h2><\/a>";
if (preg_match_all("/$regexp/siU", $input, $matches)) {
    foreach ($matches as $match) {
        echo $match[2];
        echo $match[3];
    }
}

Thanks again for any help

nrg_alpha · May 24, 2009

My advice: switch hosting providers (seriously). The jump from version 4 to 5 is significant enough to warrant switching over. Even the PHP team themselves no longer support PHP 4. As a stopgap, something along these lines should handle the markup you posted:

$str = 'Some text...<a class="CLASS_NAME" href="URL" alt="ALT_TEXT">
<h2 class="ANOTHER_CLASS" style="display: inline;">LINK NAME</h2>
<br/>
</a> some more text...<a class="CLASS_NAME" href="URL2" alt="ALT_TEXT">
<h2 class="ANOTHER_CLASS" style="display: inline;">LINK NAME2</h2>
<br/>
</a>';

if (preg_match_all('#<a [^>]*href=[\'"]([^\'"]+)[\'"].*?>.*?<h2[^>]*>(.+?)</h2>#si', $str, $matches)) {
    foreach ($matches[1] as $key => $val) {
        echo 'Link: ' . $matches[1][$key] . ' - h2: ' . $matches[2][$key] . "<br />\n";
    }
}
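
One more thing that may be worth checking: the foreach ($matches as $match) loop in the earlier attempt treats each element of $matches as a single match, but with the default PREG_PATTERN_ORDER each element is a whole capture group instead. If you prefer looping match by match, passing PREG_SET_ORDER gives that layout. Below is a small sketch under that assumption; the pattern is a tightened variant of the one above (it uses [^>]*>\s* so a link without an h2 is not swallowed into the next match), and the sample input and the $found array are just for illustration.

```php
<?php
// Two sample links in the shape posted in this thread; all values are placeholders.
$input = 'Some text...<a class="CLASS_NAME" href="URL" alt="ALT_TEXT">
<h2 class="ANOTHER_CLASS" style="display: inline;">LINK NAME</h2>
<br/>
</a> some more text...<a class="CLASS_NAME" href="URL2" alt="ALT_TEXT">
<h2 class="ANOTHER_CLASS" style="display: inline;">LINK NAME2</h2>
<br/>
</a>';

$pattern = '#<a [^>]*href=[\'"]([^\'"]+)[\'"][^>]*>\s*<h2[^>]*>(.+?)</h2>#si';

$found = array();

// PREG_SET_ORDER groups the captures per match:
// $match[1] is the href, $match[2] is the h2 content.
if (preg_match_all($pattern, $input, $matches, PREG_SET_ORDER)) {
    foreach ($matches as $match) {
        $found[] = $match[1] . ' - ' . $match[2];
        echo 'Link: ' . $match[1] . ' - h2: ' . $match[2] . "<br />\n";
    }
}
```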

Daniel0 · May 25, 2009

You're not allowed to put block elements inside inline elements. Just saying...
