Ninjakreborn Posted September 2, 2009

<?php
/* Return a list of all links found on a specific "page" */
function getLinks($url) {
    // $url = "http://www.example.net/somepage.html";
    // Double quotes so that $url is interpolated into the error message
    $input = @file_get_contents($url) or die("Could not access file: $url");
    // Group 1: optional quote, group 2: href value, group 3: link text
    $regexp = "<a\s[^>]*href=(\"??)([^\" >]*?)\\1[^>]*>(.*)<\/a>";
    if (preg_match_all("/$regexp/siU", $input, $matches, PREG_SET_ORDER)) {
        echo '<pre>';
        print_r($matches);
        echo '</pre>';
    }
}
?>

The function above fetches a page and prints every link found on it. How can I make it return only external links? I assume every external link has to start with http, so I need to check for that in the match. I'm only mediocre at regex. Any advice?
Ninjakreborn Posted September 2, 2009

I could loop through the matches and use strstr() to drop anything that doesn't contain http, leaving me with just the external links. However, that seems resource intensive. If anyone knows an easier way, let me know. Thanks.
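For reference, the loop described above could look roughly like the sketch below. It is only an illustration: filterAbsoluteLinks() is a hypothetical helper name, the sample hrefs are made up, and it checks for a leading http:// or https:// with stripos() rather than strstr(), so that "http" appearing later in a URL does not count. It makes a single pass over the list, so the cost grows linearly with the number of links.

```php
<?php
/**
 * Keep only hrefs that begin with an explicit http:// or https:// scheme.
 * filterAbsoluteLinks() is a hypothetical helper, not part of the thread's code.
 */
function filterAbsoluteLinks(array $hrefs): array
{
    $external = [];
    foreach ($hrefs as $href) {
        // stripos() returns 0 only when the needle sits at the very start
        if (stripos($href, 'http://') === 0 || stripos($href, 'https://') === 0) {
            $external[] = $href;
        }
    }
    return $external;
}

// Made-up sample data: only the two http(s) entries are kept
$sample = ['/about.html', 'http://www.example.net/', 'HTTPS://example.org/a', 'mailto:me@example.net'];
print_r(filterAbsoluteLinks($sample));
```

With PREG_SET_ORDER output from getLinks(), the href of each match would be element 2 of the match array, assuming the original pattern's grouping.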
Ninjakreborn Posted September 2, 2009

$regexp = "<a\s[^>]*href=(\"??)(http[^\" >]*?)\\1[^>]*>(.*)<\/a>";

That works. Thanks.
Garethp Posted September 2, 2009

They could start with www, or even just end with .com; I'd type in google.com. It could be https, or it could be a subdomain such as mail.google.com. How about this?

'~<a[^>]+href=(\'|")?(http|https|www)?(.+)\.(com|net|org|gov)(.+)(\'|")?[^>]+>(.+)</a>~'

I suppose the list of com, net, org, gov and so on is needed because you could have a file ending in .exe, and you wouldn't want that mistaken for an external URL. The biggest flaw in my regex is probably that if the href isn't wrapped in ' or ", the (.+) after the extension will just keep going until the end.
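One way to sidestep the list of extensions is to compare hosts instead of matching text. The sketch below is an assumption-laden illustration, not anyone's posted solution: isExternalLink() and the $pageHost parameter are hypothetical names, and "external" is taken to mean "names a host other than the page's own". Note that a bare google.com inside an href has no scheme or leading //, so browsers resolve it as a relative path; the sketch therefore treats it as internal, the same as setup.exe.

```php
<?php
/**
 * Hypothetical helper: a link counts as external when it names a host
 * that differs from the host of the page it was found on.
 */
function isExternalLink(string $href, string $pageHost): bool
{
    // parse_url() yields null for a missing host and false for a malformed URL
    $host = parse_url($href, PHP_URL_HOST);
    if (!is_string($host) || $host === '') {
        return false; // relative links such as "/page.html" or "setup.exe" carry no host
    }
    // Host names are case-insensitive; subdomains count as different hosts here
    return strcasecmp($host, $pageHost) !== 0;
}

// Made-up sample hrefs, checked against a page served from www.example.net
foreach (['https://mail.google.com/', '//mail.google.com/inbox', 'google.com', 'setup.exe'] as $href) {
    echo $href, ' => ', var_export(isExternalLink($href, 'www.example.net'), true), "\n";
}
```

Whether mail.google.com should count as external to google.com is a policy choice; this sketch simply compares full host names.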
nrg_alpha Posted September 3, 2009

When it comes to parsing HTML, I would consider using DOM and XPath instead of regex. Let's take the Wikipedia URL http://en.wikipedia.org/wiki/Main_Page as an example. Given that tons of those links start with /wiki/, we'll assume that all links that don't start with that are external. We could use DOM/XPath like so:

$dom = new DOMDocument;
@$dom->loadHTMLFile('http://en.wikipedia.org/wiki/Main_Page');
$xpath = new DOMXPath($dom);
$aTag = $xpath->query('//a[@href]'); // get all anchor tags that contain the attribute href
$externalLinks = array(); // initialise so print_r() works even when nothing matches
foreach ($aTag as $val) {
    $href = $val->getAttribute('href');
    if (strpos($href, '/wiki/') !== 0) { // href does not start with /wiki/
        $externalLinks[] = $href;
    }
}
echo '<pre>' . print_r($externalLinks, true) . '</pre>';
Garethp Posted September 3, 2009

I didn't know you could do that! Can you link me to a few good DOM tutorials? I find I have trouble learning from PHP.net.
nrg_alpha Posted September 3, 2009

You would have to google it. I am learning from small examples here and there. As far as actual tutorials go, you would have to hunt around. Google things like "XPath tutorial" and "DOM tutorial" and you'll find some sites to start with. I think W3Schools.com is one of them (admittedly I'm not too fond of it, but I still learn the odd thing from it from time to time), as well as tizag.com and such. I don't have an actual list of sites I frequent for learning DOM/XPath.

As for the previous solution, I could also have used the starts-with() function:

$dom = new DOMDocument;
@$dom->loadHTMLFile('http://en.wikipedia.org/wiki/Main_Page');
$xpath = new DOMXPath($dom);
// fetch all anchor tags whose href attribute doesn't start with /wiki/
$aTag = $xpath->query('//a[@href != starts-with(@href, "/wiki/")]');
foreach ($aTag as $val) {
    echo $val->getAttribute('href') . "<br />\n";
}

OK, it's getting late here, and I'm tired.
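The predicate above compares @href with the boolean returned by starts-with(); XPath 1.0 handles that by converting the node-set to a boolean first, which is why it selects the intended nodes. A more conventional spelling is not(starts-with(...)). The sketch below shows that form on a small made-up HTML string (the markup and variable names are assumptions for illustration) so it can be run without fetching Wikipedia.

```php
<?php
// Hypothetical stand-in markup so the sketch runs without a network request
$html = '<html><body>'
      . '<a href="/wiki/Main_Page">Main Page</a>'
      . '<a href="http://www.example.net/">Example</a>'
      . '<a name="top">anchor without href</a>'
      . '<a href="https://www.example.org/wiki/">Other</a>'
      . '</body></html>';

libxml_use_internal_errors(true); // collect parser warnings instead of emitting them
$dom = new DOMDocument;
$dom->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($dom);
// Require an href attribute, then exclude values that begin with /wiki/
$nodes = $xpath->query('//a[@href and not(starts-with(@href, "/wiki/"))]');

$hrefs = [];
foreach ($nodes as $node) {
    $hrefs[] = $node->getAttribute('href');
}
print_r($hrefs); // the two absolute URLs, in document order
```

Using libxml_use_internal_errors() instead of the @ operator is a design choice in this sketch: it silences markup warnings without also hiding unrelated errors.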