[SOLVED] Getting external links.


/* Return a list of all links found on a specific "page" */
function getLinks($url) {
    //$url = "http://www.example.net/somepage.html";
    // Double quotes so that $url is actually interpolated into the message
    $input = @file_get_contents($url) or die("Could not access file: $url");
    $regexp = "<a\s[^>]*href=(\"??)([^\" >]*?)\\1[^>]*>(.*)<\/a>";
    if(preg_match_all("/$regexp/siU", $input, $matches, PREG_SET_ORDER)) {
      // In each match, [2] is the href value and [3] is the link text
      echo '<pre>';
      print_r($matches);
      echo '</pre>';
    }
    return $matches;
}

That function above goes and gets all the links that are present on a website. How can I make it get only external links? I guess all external links would have to start with http, so I have to check whether each one matches. I am only mediocre at regex. Any advice?

They could start with www, or even just end with .com


I'd type in google.com


Could be an https


Could be a subdomain, such as mail.google.com


how about this?




I 'spose the list of com, net, org, gov, etc. is there because you could have a file ending with .exe, and you wouldn't want that mistaken for an external URL


Though I 'spose the biggest flaw in my regex would be that if it doesn't have ' or " in the regex, then the (.+) after the extension would just keep going and going till the end.
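To make the TLD-list idea and the quote problem concrete, here is a rough sketch. It is not the pattern from the post above (that one isn't shown here); findTldLinks() and the four TLDs are just assumptions for illustration. The href value is bounded by its own quote character, so nothing can run on to the end of the page.

```php
<?php
// Sketch only: findTldLinks() is a hypothetical helper, and the TLD list
// (com|net|org|gov) is an example, not a complete list.
function findTldLinks($html) {
    // (["\'])  remembers which quote opened the href
    // \1       requires the same quote to close it, so the match cannot run on
    $pattern = '~href=(["\'])((?:https?://)?[a-z0-9.-]+\.(?:com|net|org|gov)(?:[/?#][^"\']*)?)\1~i';
    preg_match_all($pattern, $html, $matches);
    return $matches[2]; // the captured href values
}
```

This skips something like setup.exe, but a file that happens to be named command.com would still be picked up, and unquoted href values are ignored entirely, so treat it as a rough filter at best.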

When it comes to parsing HTML, I would consider using DOM and XPath instead of regex.

For example, let's look at the Wikipedia URL http://en.wikipedia.org/wiki/Main_Page. Given that tons of its links start with /wiki/, we'll make the assumption that all links that don't start with that are external. We could use DOM/XPath like so:



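$dom = new DOMDocument;
@$dom->loadHTMLFile('http://en.wikipedia.org/wiki/Main_Page'); // @ suppresses warnings from malformed markup
$xpath = new DOMXPath($dom);
$aTag = $xpath->query('//a[@href]'); // get all anchor tags that contain the attribute href

$externalLinks = array();
foreach ($aTag as $val){
    $href = $val->getAttribute('href');
    if(strpos($href, '/wiki/') !== 0){ // href does not start with /wiki/
        $externalLinks[] = $href;
    }
}

echo '<pre>'.print_r($externalLinks, true).'</pre>';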
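The /wiki/ assumption only holds for Wikipedia. A more general sketch, assuming "external" means "the href has a host and it differs from the page's host", could lean on parse_url(). getExternalLinks() and its parameters are made-up names for illustration, and it takes the HTML as a string so it can be tried without a network request.

```php
<?php
// Sketch: collect hrefs whose host differs from the page's own host.
// getExternalLinks(), $html and $pageUrl are hypothetical names.
function getExternalLinks($html, $pageUrl) {
    $pageHost = strtolower((string) parse_url($pageUrl, PHP_URL_HOST));

    $dom = new DOMDocument;
    @$dom->loadHTML($html); // @ hides warnings about sloppy markup
    $xpath = new DOMXPath($dom);

    $external = array();
    foreach ($xpath->query('//a[@href]') as $a) {
        $href = $a->getAttribute('href');
        // parse_url() gives null for relative paths, fragments, mailto: and so on
        $host = parse_url($href, PHP_URL_HOST);
        if (is_string($host) && strtolower($host) !== $pageHost) {
            $external[] = $href;
        }
    }
    return $external;
}
```

With this check https and subdomains are handled by the host comparison (mail.google.com counts as different from google.com), while a bare href like google.com has no host as far as parse_url() is concerned and is treated as a relative path.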

I didn't know you could do that! Can you link me to a few good DOM tutorials? I find I have trouble learning off PHP.net


You would have to google it. I am learning off of small examples here and there. As far as actual tutorials go, you would have to hunt around. Google stuff like XPath tutorials and DOM tutorials and you'll find some sites to start off with. I think W3Schools.com is one of them (which admittedly I'm not too fond of, but I still learn the odd thing from them from time to time), as well as tizag.com and such. Anyway, just google it, as I don't have an actual list of sites I frequent with regards to learning DOM/XPath.



As for the previous solution, I could also have used the starts-with() function:


$dom = new DOMDocument;
@$dom->loadHTMLFile('http://en.wikipedia.org/wiki/Main_Page');
$xpath = new DOMXPath($dom);
// fetch all anchor tags with an href attribute that doesn't start with /wiki/
$aTag = $xpath->query('//a[@href and not(starts-with(@href, "/wiki/"))]');
foreach ($aTag as $val){
    echo $val->getAttribute('href') . "<br />\n";
}
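The same starts-with() function can also express the original idea from the first post, that external links start with http. A minimal sketch, with getAbsoluteLinks() being a made-up name and the HTML passed in as a string:

```php
<?php
// Sketch: let XPath pick out absolute http(s) links directly.
// getAbsoluteLinks() is a hypothetical helper name.
function getAbsoluteLinks($html) {
    $dom = new DOMDocument;
    @$dom->loadHTML($html); // @ hides warnings about sloppy markup
    $xpath = new DOMXPath($dom);

    // XPath 1.0 has no regex support, so test both schemes explicitly
    $query = '//a[starts-with(@href, "http://") or starts-with(@href, "https://")]';

    $links = array();
    foreach ($xpath->query($query) as $a) {
        $links[] = $a->getAttribute('href');
    }
    return $links;
}
```

Keep in mind that starts-with() is case-sensitive, and that absolute links pointing back to the same site are returned too, so this only narrows things down to "absolute" rather than strictly "external".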


Ok, it's getting late here, and I'm tired.  :sleeping:

