[SOLVED] Getting external links.

Ninjakreborn · September 2, 2009

<?php
/* Return a list of all links found on a specific "page" */
function getLinks($url) {
    //$url = "http://www.example.net/somepage.html";
    $input = @file_get_contents($url) or die("Could not access file: $url");
    $regexp = "<a\s[^>]*href=(\"??)([^\" >]*?)\\1[^>]*>(.*)<\/a>";
    if(preg_match_all("/$regexp/siU", $input, $matches, PREG_SET_ORDER)) {
      echo '<pre>';
      print_r($matches);
      echo '</pre>';
    }
}
?>

The function above fetches all the links present on a page. How can I make it get only external links? I guess all external links would have to start with http, so I have to check whether each one matches. I'm only mediocre at regex. Any advice?

Ninjakreborn · September 2, 2009

I could loop through the matches and use strstr to pull out anything that isn't http, leaving me with just the external links. However, that is resource intensive. If anyone knows an easier way, let me know.
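For reference, the loop-and-filter idea could be sketched roughly like this. The helper name filterHttpLinks() and the sample hrefs are hypothetical, and the sketch assumes the hrefs have already been pulled out of $matches. It checks only the start of each string rather than searching the whole string with strstr.

```php
<?php
/* Hypothetical helper: keep only hrefs that start with http:// or https:// */
function filterHttpLinks(array $hrefs) {
    $external = array();
    foreach ($hrefs as $href) {
        // strncasecmp() compares only the first N characters, ignoring case
        if (strncasecmp($href, 'http://', 7) === 0 || strncasecmp($href, 'https://', 8) === 0) {
            $external[] = $href;
        }
    }
    return $external;
}

// Made-up sample data standing in for the hrefs captured by preg_match_all()
$hrefs = array('/about.html', 'http://www.example.net/', 'HTTPS://example.org/page', 'mailto:someone@example.net');
print_r(filterHttpLinks($hrefs));
```

With the getLinks() function above, the href is the second capture group, so each entry would come from $match[2] when looping over $matches.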

Thanks.

Ninjakreborn · September 2, 2009

$regexp = "<a\s[^>]*href=(\"??)(http[^\" >]*?)\\1[^>]*>(.*)<\/a>";

That works.

Thanks.

Garethp · September 2, 2009

They could start with www, or even just end with .com.

I'd type in google.com.

It could be https.

It could be a subdomain, such as mail.google.com.

How about this?

'~<a[^>]+href=(\'|")?(http|https|www)?(.+)\.(com|net|org|gov)(.+)(\'|")?[^>]+>(.+)</a>~'

I suppose the list of com, net, org, gov, etc. is there because you could have a file ending with .exe, and you wouldn't want that mistaken for an external URL.

Though I suppose the biggest flaw in my regex is that if the href isn't wrapped in ' or ", the (.+) after the extension would just keep going and going until the end.
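One way to avoid guessing at prefixes and TLD lists is to compare hosts instead of pattern-matching the whole URL. The following is only a rough sketch using PHP's parse_url(); the function name isExternalLink() and the sample URLs are made up for illustration. It treats an href as external only when it has a host component that differs from the page's host, so a bare google.com with no scheme is treated as a relative path (which is how a browser would resolve it), and a subdomain such as mail.google.com counts as a different host.

```php
<?php
/* Hypothetical helper: true when $href points at a different host than $pageUrl */
function isExternalLink($href, $pageUrl) {
    $pageHost = parse_url($pageUrl, PHP_URL_HOST);
    $linkHost = parse_url($href, PHP_URL_HOST);
    // Relative links have no host component, so they stay on the same site
    if (!is_string($linkHost) || $linkHost === '') {
        return false;
    }
    // Host names are case-insensitive
    return strcasecmp($linkHost, (string) $pageHost) !== 0;
}

$page = 'http://www.example.net/somepage.html';
var_dump(isExternalLink('/downloads/setup.exe', $page));          // relative path
var_dump(isExternalLink('https://mail.google.com/', $page));      // different host
var_dump(isExternalLink('http://www.example.net/other', $page));  // same host
```

Whether www.example.net and example.net should count as the same site is a design choice this sketch does not make; it compares the host strings exactly.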

nrg_alpha · September 3, 2009

When it comes to parsing html, I would consider using DOM and XPath instead of regex.

For example, let's take the Wikipedia URL http://en.wikipedia.org/wiki/Main_Page. Given that tons of its links start with /wiki/, we'll make the assumption that all links that don't start with that are external. We could use DOM/XPath like so:

$dom = new DOMDocument;
@$dom->loadHTMLFile('http://en.wikipedia.org/wiki/Main_Page');
$xpath = new DOMXPath($dom);
$aTag = $xpath->query('//a[@href]'); // get all anchor tags that contain the attribute href

$externalLinks = array();
foreach ($aTag as $val){
    // strpos() returns 0 only when the href begins with /wiki/
    if(strpos($val->getAttribute('href'), '/wiki/') !== 0){
        $externalLinks[] = $val->getAttribute('href');
    }
}

echo '<pre>'.print_r($externalLinks, true).'</pre>';

Garethp · September 3, 2009

I didn't know you could do that! Can you link me to a few good DOM tutorials? I find I have trouble learning off PHP.net

nrg_alpha · September 3, 2009

You would have to google it. I am learning from small examples here and there. As far as actual tutorials go, you would have to hunt around: search for things like XPath tutorials and DOM tutorials, and you'll find some sites to start off with. I think W3Schools.com is one of them (admittedly I'm not too fond of it, but I still learn the odd thing there from time to time), as well as tizag.com and such. I don't have an actual list of sites I frequent with regards to learning DOM/XPath.

As for the previous solution, I could also have used the XPath starts-with() function:

$dom = new DOMDocument;
@$dom->loadHTMLFile('http://en.wikipedia.org/wiki/Main_Page');
$xpath = new DOMXPath($dom);
$aTag = $xpath->query('//a[@href and not(starts-with(@href, "/wiki/"))]'); // fetch all anchor tags whose href attribute doesn't start with /wiki/
foreach ($aTag as $val){
    echo $val->getAttribute('href') . "<br />\n";
}
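
The same kind of query can be tried without hitting the network by loading a string. This is a small sketch under assumptions: the HTML snippet is invented, and the XPath keeps only hrefs beginning with http:// or https:// instead of excluding /wiki/, so it is a variation on the approach above rather than the code from this post.

```php
<?php
// Invented sample markup standing in for a fetched page
$html = '<html><body>'
      . '<a href="/wiki/Main_Page">Main page</a>'
      . '<a href="http://www.example.org/">Example</a>'
      . '<a name="top">No href here</a>'
      . '<a href="https://secure.example.com/login">Login</a>'
      . '</body></html>';

$dom = new DOMDocument;
@$dom->loadHTML($html); // @ hides warnings about sloppy markup
$xpath = new DOMXPath($dom);

// Anchors whose href starts with http:// or https://
$nodes = $xpath->query('//a[starts-with(@href, "http://") or starts-with(@href, "https://")]');

$externalLinks = array();
foreach ($nodes as $node) {
    $externalLinks[] = $node->getAttribute('href');
}
print_r($externalLinks);
```

XPath 1.0 string functions are case-sensitive, so an href written as HTTP://... would be missed by this query.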

Ok, it's getting late here, and I'm tired. :sleeping:
