
[SOLVED] Getting external links.


Ninjakreborn


<?php
/* Return a list of all links found on a specific "page" */
function getLinks($url) {
    //$url = "http://www.example.net/somepage.html";
    // Double quotes so that $url is interpolated into the error message
    $input = @file_get_contents($url) or die("Could not access file: $url");
    $regexp = "<a\s[^>]*href=(\"??)([^\" >]*?)\\1[^>]*>(.*)<\/a>";
    if(preg_match_all("/$regexp/siU", $input, $matches, PREG_SET_ORDER)) {
      echo '<pre>';
      print_r($matches);
      echo '</pre>';
    }
}
?>

The function above fetches a page and collects every link on it. How can I make it return only external links? I guess every external link would have to start with http, so I would need to check for that. I am only mediocre at regex. Any advice?


They could start with www, or even just end with .com

 

I'd type in google.com

 

Could be https

 

Could be a subdomain, such as mail.google.com

 

How about this?

 

'~<a[^>]+href=(\'|")?(http|https|www)?(.+)\.(com|net|org|gov)(.+)(\'|")?[^>]+>(.+)</a>~'

 

I suppose the list of com, net, org, gov, etc. is there because a link could point to a file ending in .exe, and you wouldn't want that mistaken for an external URL.

 

Though I suppose the biggest flaw in my regex is that if the href isn't wrapped in ' or ", the (.+) after the extension would just keep going and going until the end.
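
The cases listed above (https, subdomains, a bare google.com) are awkward to cover with one pattern. As a rough alternative sketch, the host of each href can be compared with PHP's parse_url() instead. The helper name isExternalHref() and the example.net base host are hypothetical, not from this thread, and treating subdomains of the base host as internal is an assumption you may want to change.

```php
<?php
/**
 * Hypothetical helper: returns true when $href names a host other than
 * $baseHost. Subdomains of $baseHost are treated as internal (an assumption).
 */
function isExternalHref($href, $baseHost) {
    $host = parse_url(trim($href), PHP_URL_HOST);
    if (!is_string($host) || $host === '') {
        // No host part: relative path, #fragment, mailto:, or unparsable input
        return false;
    }
    $host = strtolower($host);
    $baseHost = strtolower($baseHost);
    if ($host === $baseHost) {
        return false;
    }
    // mail.example.net counts as internal when the base host is example.net
    $suffix = '.' . $baseHost;
    return substr($host, -strlen($suffix)) !== $suffix;
}

var_dump(isExternalHref('https://www.google.com/', 'example.net'));
var_dump(isExternalHref('/somepage.html', 'example.net'));
```

Note that an href written as just google.com has no scheme, so a browser resolves it as a relative path; this sketch therefore reports it as not external.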


When it comes to parsing html, I would consider using DOM and XPath instead of regex.

For example, take the Wikipedia URL http://en.wikipedia.org/wiki/Main_Page. Given that tons of its links start with /wiki/, we'll assume that every link that doesn't start with that is external. We could use DOM/XPath like so:

 

 

$dom = new DOMDocument;
@$dom->loadHTMLFile('http://en.wikipedia.org/wiki/Main_Page'); // @ suppresses warnings about malformed markup
$xpath = new DOMXPath($dom);
$aTag = $xpath->query('//a[@href]'); // get all anchor tags that contain the attribute href

$externalLinks = array(); // initialize so print_r() works even when nothing matches
foreach ($aTag as $val){
    // keep the link only if it does not start with /wiki/
    if(strpos($val->getAttribute('href'), '/wiki/') !== 0){
        $externalLinks[] = $val->getAttribute('href');
    }
}

echo '<pre>'.print_r($externalLinks, true).'</pre>';
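
The /wiki/ assumption only fits Wikipedia. A more general sketch, under the assumption that "external" means "the href names a different host", combines DOM with parse_url(). The function name extractExternalLinks() and the inline HTML are made up for illustration, and it uses loadHTML() on a string so it can run without network access.

```php
<?php
/**
 * Hypothetical helper: return every href in $html whose host differs
 * from $baseHost. Links without a host (relative, #fragment) are skipped.
 */
function extractExternalLinks($html, $baseHost) {
    // Collect libxml parse warnings internally instead of silencing them with @
    $previous = libxml_use_internal_errors(true);
    $dom = new DOMDocument;
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $externalLinks = array();
    foreach ($dom->getElementsByTagName('a') as $a) {
        $href = $a->getAttribute('href');
        $host = parse_url($href, PHP_URL_HOST);
        // Case-insensitive host comparison; null/false hosts are ignored
        if (is_string($host) && strcasecmp($host, $baseHost) !== 0) {
            $externalLinks[] = $href;
        }
    }
    return $externalLinks;
}

$html = '<html><body><p>'
      . '<a href="/wiki/Main_Page">Main</a> '
      . '<a href="http://en.wikipedia.org/wiki/PHP">PHP</a> '
      . '<a href="https://www.php.net/">php.net</a> '
      . '<a href="#top">Top</a>'
      . '</p></body></html>';

print_r(extractExternalLinks($html, 'en.wikipedia.org'));
```

With this input only the php.net link is returned, because it is the only href whose host is not en.wikipedia.org.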


I didn't know you could do that! Can you link me to a few good DOM tutorials? I find I have trouble learning off PHP.net

 

You would have to google it. I am learning from small examples here and there. As far as actual tutorials go, you would have to hunt around. Google things like XPath tutorials and DOM tutorials and you'll find some sites to start with. I think W3Schools.com is one of them (admittedly I'm not too fond of it, but I still learn the odd thing there from time to time), as well as tizag.com and such. Anyway, just google it, as I don't have an actual list of sites I frequent for learning DOM/XPath.

 

 

As for the previous solution, I could also have used the XPath starts-with() function:

 

$dom = new DOMDocument;
@$dom->loadHTMLFile('http://en.wikipedia.org/wiki/Main_Page');
$xpath = new DOMXPath($dom);
$aTag = $xpath->query('//a[@href and not(starts-with(@href, "/wiki/"))]'); // fetch all anchor tags whose href attribute doesn't start with /wiki/
foreach ($aTag as $val){
    echo $val->getAttribute('href') . "<br />\n";
}
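
To see what a starts-with() filter actually lets through, here is a small offline sketch that runs the predicate against an inline HTML string rather than the live Wikipedia page. The markup is invented for illustration, and the predicate is written in the `@href and not(starts-with(...))` form.

```php
<?php
// Made-up markup standing in for a fetched page
$html = '<html><body>
  <a href="/wiki/PHP">internal</a>
  <a href="https://www.php.net/">external</a>
  <a href="#top">in-page anchor</a>
  <a name="no-href">no href</a>
</body></html>';

$dom = new DOMDocument;
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

// Anchors that have an href which does not start with /wiki/
$query = '//a[@href and not(starts-with(@href, "/wiki/"))]';

$hrefs = array();
foreach ($xpath->query($query) as $a) {
    $hrefs[] = $a->getAttribute('href');
}

print_r($hrefs);
```

The in-page anchor #top passes the filter along with the php.net link, which shows the limit of the "everything not under /wiki/ is external" assumption.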

 

Ok, it's getting late here, and I'm tired.  :sleeping:

