How to get only links with h2 tags?


t_machine

Hi, I am trying to parse a page that contains many links. Some of the links wrap an <h2> tag, and those are the ones I need. How can I get only those links?

I am using the following but it returns every link.

 

$regexp = "<a\s[^>]*href=(\"??)([^\" >]*?)\\1[^>]*>(.*)<\/a>"; 

 

The <h2> links on the page are like the following:

<a href="link to page"><h2>LINK NAME</h2></a>

 

Thanks for any help :)

Link to comment
Share on other sites

Do you mean something along these lines?

 

Example:

$str = '<a href="http://link to page"><h2>LINK NAME(1)</h2></a> .. some text .. <a href="http://link to page2"><h2>LINK NAME(2)</h2></a>';
if(preg_match_all('#<a [^>]*href=[\'"]([^\'"]+)[\'"].*?><h2>(.+?)</h2></a>#i', $str, $matches)){
    $arr = array_combine($matches[1], $matches[2]);
    echo '<pre>'.print_r($arr,true);
}

 

Outputs (via right-click view source):

Array
(
    [http://link to page] => LINK NAME(1)
    [http://link to page2] => LINK NAME(2)
)

 

All I've done here is capture the href values and the h2 contents, then combine the two arrays so that the href values become the keys and the h2 contents become the values. The combining step is not necessary; I only did it for organizational purposes. The href and h2 values are still stored in the arrays $matches[1] and $matches[2] respectively.

 

Alternatively, instead of using array_combine, you could use a simple foreach loop to extract the captured values:

 

foreach($matches[1] as $key=>$val){
echo 'Link: ' . $matches[1][$key] . ' - h2: ' . $matches[2][$key] . "<br />\n";
}
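
A third option, sketched here only as an illustration: passing the PREG_SET_ORDER flag to preg_match_all groups the results per match rather than per capture group, so each loop item carries its own href and h2 text. The variable names $sets, $set and $lines are made up for this example; the input string and pattern are the ones from the example above.

```php
<?php
// Sample input copied from the example above.
$str = '<a href="http://link to page"><h2>LINK NAME(1)</h2></a> .. some text .. '
     . '<a href="http://link to page2"><h2>LINK NAME(2)</h2></a>';

$pattern = '#<a [^>]*href=[\'"]([^\'"]+)[\'"].*?><h2>(.+?)</h2></a>#i';

// PREG_SET_ORDER groups the results per match instead of per capture group.
$lines = array();
if (preg_match_all($pattern, $str, $sets, PREG_SET_ORDER)) {
    foreach ($sets as $set) {
        // $set[0] is the whole match, $set[1] the href, $set[2] the h2 text.
        $lines[] = 'Link: ' . $set[1] . ' - h2: ' . $set[2];
    }
}
echo implode("\n", $lines), "\n";
```

With the default order (PREG_PATTERN_ORDER), looping over $matches directly walks the capture groups, not the individual links, which is an easy mistake to make.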

 

EDIT - You can also use DOMDocument / XPath instead of regex to find these tags. As a side note, judging by those tags, the website in question seems to have validation issues.

Link to comment
Share on other sites

Thanks very much for the reply. You are right, the site does have poor coding. That may be the reason the code you both posted did not work. Below is the exact way their links appear in the page. I am not sure whether placing the class before the href makes a difference to the preg match.

 

<a class="CLASS_NAME" href="URL" alt="ALT_TEXT">
<h2 class="ANOTHER_CLASS" style="display: inline;">LINK NAME</h2>
<br/>
</a>

 

Thanks again for the help.

Link to comment
Share on other sites

Let's try the DOMDocument / XPath way:

 

Would something like this work?

$dom = new DOMDocument;
@$dom->loadHTMLFile('http://somesite.whatever'); // obviously, use the real website url in question instead...
$xpath = new DOMXPath($dom);
$aTag = $xpath->query('//a/h2');

foreach ($aTag as $val) {
echo $val->parentNode->getAttribute('href') . ' - ' . $val->nodeValue . "<br />\n";
}
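
If the page has already been fetched into a string, roughly the same query can be run with loadHTML instead of loadHTMLFile. This is only a sketch under a few assumptions: PHP 5 or later with the DOM extension, a made-up $html sample in the shape of the markup posted above, and libxml_use_internal_errors() used in place of the @ operator to keep parser warnings quiet. The XPath //a[h2] selects the links themselves rather than the h2 elements.

```php
<?php
// Sample markup in the shape the OP posted; the URLs are placeholders.
$html = '<html><body>
<a class="CLASS_NAME" href="URL" alt="ALT_TEXT">
<h2 class="ANOTHER_CLASS" style="display: inline;">LINK NAME</h2>
<br/>
</a>
<a href="OTHER">plain link without an h2</a>
</body></html>';

// Collect parser warnings instead of silencing them with @.
libxml_use_internal_errors(true);
$dom = new DOMDocument;
$dom->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($dom);
$links = array();
// Select only <a> elements that have an <h2> child.
foreach ($xpath->query('//a[h2]') as $a) {
    $h2 = $a->getElementsByTagName('h2')->item(0);
    $links[$a->getAttribute('href')] = trim($h2->nodeValue);
}
print_r($links);
```

The plain link is skipped because it has no h2 child, so no attribute or whitespace handling is needed in the query itself.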

Link to comment
Share on other sites

I'd upgrade the server instead

 

http://www.php.net/archive/2007.php

PHP 4 end of life announcement

[13-Jul-2007]

 

Today it is exactly three years ago since PHP 5 has been released. In those three years it has seen many improvements over PHP 4. PHP 5 is fast, stable & production-ready and as PHP 6 is on the way, PHP 4 will be discontinued.

 

The PHP development team hereby announces that support for PHP 4 will continue until the end of this year only. After 2007-12-31 there will be no more releases of PHP 4.4. We will continue to make critical security fixes available on a case-by-case basis until 2008-08-08. Please use the rest of this year to make your application suitable to run on PHP 5.

 

For documentation on migration for PHP 4 to PHP 5, we would like to point you to our migration guide. There is additional information available in the PHP 5.0 to PHP 5.1 and PHP 5.1 to PHP 5.2 migration guides as well.

PHP4 doesn't even get security fixes anymore.

Link to comment
Share on other sites

nrg, your regex didn't work for him because it assumed <h2> instead of <h2[^>]*>

 

Yeah, I was going by what the OP gave initially:

 

The <h2> links on the page are like the following:

<a href="link to page"><h2>LINK NAME</h2></a>

 

So I was figuring the h2 tags contained no attributes :-/ (that'll learn me).
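
To make the difference concrete, here is a small sketch that runs both forms of the h2 sub-pattern against markup in the shape the OP posted. The names $strict, $loose, $strictCount and $looseCount are invented for this illustration only.

```php
<?php
// Markup in the shape the OP later posted: the h2 carries attributes.
$str = '<a class="CLASS_NAME" href="URL" alt="ALT_TEXT">
<h2 class="ANOTHER_CLASS" style="display: inline;">LINK NAME</h2>
<br/>
</a>';

// A literal <h2> cannot match an opening tag that has attributes.
$strict = '#<h2>(.+?)</h2>#si';
// <h2[^>]*> also accepts any attributes before the closing bracket.
$loose  = '#<h2[^>]*>(.+?)</h2>#si';

$strictCount = preg_match_all($strict, $str, $m1);
$looseCount  = preg_match_all($loose, $str, $m2);

echo "strict: $strictCount, loose: $looseCount\n";
echo $m2[1][0], "\n";
```

The strict form finds nothing here, while the loose form captures the heading text.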

 

 

Link to comment
Share on other sites

Sorry for not posting the exact code in the first place :(

My host will not update to PHP 5 and I would like a fix until I get a new host. The fix that Crayon Violent added still did not get any results.

 

This is how the links are laid out on the page:

<a class="CLASS_NAME" href="URL" alt="ALT_TEXT">
<h2 class="ANOTHER_CLASS" style="display: inline;">LINK NAME</h2>
<br/>
</a>

 

I am using file_get_contents, which fetches the content fine, but parsing it still gives no result. I am using the following, with Crayon's fix included:

 

 $regexp = "<a\s[^>]*href=(\"??)([^\" >]*?)\\1[^>]*><h2[^>]*>(.*)<\/h2><\/a>"; 
if(preg_match_all("/$regexp/siU", $input, $matches)) {
     foreach($matches as $match) {
       echo $match[2];
       echo $match[3];

     }
}

 

Thanks again for any help :)

Link to comment
Share on other sites

My advice, switch hosting providers (seriously). The jump from version 4 to 5 is significant enough and warrants switching over. Even the PHP team themselves no longer support PHP 4.

 

$str = 'Some text...<a class="CLASS_NAME" href="URL" alt="ALT_TEXT">
<h2 class="ANOTHER_CLASS" style="display: inline;">LINK NAME</h2>
<br/>
</a> some more text...<a class="CLASS_NAME" href="URL2" alt="ALT_TEXT">
<h2 class="ANOTHER_CLASS" style="display: inline;">LINK NAME2</h2>
<br/>
</a>';

if(preg_match_all('#<a [^>]*href=[\'"]([^\'"]+)[\'"].*?>.*?<h2[^>]*>(.+?)</h2>#si', $str, $matches)){
foreach($matches[1] as $key=>$val){
	echo 'Link: ' . $matches[1][$key] . ' - h2: ' . $matches[2][$key] . "<br />\n";
}
}
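
One caveat worth checking on a page with many links: with the s modifier, the .*? between the link and the <h2 can run past a closing </a>, so a plain link that comes before an h2 link may get paired with the wrong heading. A possible guard, shown only as a sketch, is to forbid that stretch from stepping over </a> with a negative lookahead. The $guarded pattern and the sample input below are assumptions for illustration and have not been tried against the real site.

```php
<?php
// Hypothetical input: one plain link followed by one h2 link.
$str = '<a href="PLAIN">no heading here</a> '
     . '<a class="CLASS_NAME" href="URL" alt="ALT_TEXT">
<h2 class="ANOTHER_CLASS" style="display: inline;">LINK NAME</h2>
<br/>
</a>';

// Pattern from the post above: .*? may run past </a> into the next link.
$original = '#<a [^>]*href=[\'"]([^\'"]+)[\'"].*?>.*?<h2[^>]*>(.+?)</h2>#si';
// Variant: (?:(?!</a>).)*? refuses to step over a closing </a>.
$guarded  = '#<a\s[^>]*href=[\'"]([^\'"]+)[\'"][^>]*>(?:(?!</a>).)*?<h2[^>]*>(.+?)</h2>#si';

preg_match_all($original, $str, $m1);
preg_match_all($guarded, $str, $m2);

echo 'original: ', $m1[1][0], ' - ', $m1[2][0], "\n";
echo 'guarded: ', $m2[1][0], ' - ', $m2[2][0], "\n";
```

On this input the original pattern pairs the plain link's href with the heading, while the guarded variant skips the plain link and reports only the h2 link. Lookaheads are plain PCRE syntax, so this does not depend on PHP 5.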

Link to comment
Share on other sites
