byte1918 Posted June 6, 2009

Hi, I'm trying to make a script which can get all of a website's links. This is my code so far:

```php
<?php
set_time_limit(0);
// $link = $samba, they have the same values; $linka is supposed to store all the links on the website, and $not is supposed to remember all links that have already been scanned, so I don't go into an infinite loop.
function get($link,$samba,$linka=array(),$not=array()){
    $get = file_get_contents($link);
    // preg_match all the links.
    preg_match_all('/<a.*href="(.*)".*>.*<\/a>/smU',$get,$path);
    // preg match css links, I will need these sometime further on.
    preg_match_all('/href="(.*\.css)"/',$get,$css);
    $links = array();
    $links = array_merge($path[1],$css[1]);
    $horse = array_unique($links);
    $not = array();
    // in this foreach I try to remove all external links, or links which I shouldn't take into consideration.
    foreach ($horse as $key => $adr){
        if (substr($adr,0,7) == "http://") unset($horse[$key]);
        if (substr($adr,0,3) == "www")     unset($horse[$key]);
        if (substr($adr,0,6) == "mailto")  unset($horse[$key]);
        if (substr($adr,0,3) == "../")     unset($horse[$key]);
        if (substr($adr,0,6) == "../../")  unset($horse[$key]);
    }
    // in this foreach I add the url ($samba) to the found address.. for example the link found is contact.html and I end up with the address => http://www.whatever.com/contact.html
    foreach ($horse as $key => $adr)
        $horse[$key] = $samba.$adr;
    // now here is the problem.. I go through every link one by one and check whether it has already been scanned; if it hasn't, I add the link to the array $not and call the function recursively.
    foreach ($horse as $adr){
        if (!in_array($adr,$not)){
            $not[] = $adr;
            $linka = array_merge($horse,get($adr,$samba,$linka,$not));
        }
    }
    // the problem is that (I think) the $not array doesn't keep the values that have already been scanned when I call the function again, and therefore my main function keeps going into an endless loop.
    $haha = array_unique($linka);
    return $haha;
}

$link = "http://rcsys.ro/";
$a = get($link,$link,$q=array(),$not=array());
print_r($a);
?>
```

I posted comments in the code with my problem; maybe someone who has already done this has the time to check it out. And yeah, I know my coding looks like crap.
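For comparison, here is a minimal sketch of pulling the same href values with PHP's DOMDocument instead of a regex. It assumes the fetch succeeds and the markup is parseable (loadHTML warnings suppressed), and it is only an illustration of the extraction step, not the full crawler:

```php
<?php
// Sketch: extract href values with DOMDocument instead of preg_match_all().
$html = file_get_contents('http://rcsys.ro/');

$doc = new DOMDocument();
@$doc->loadHTML($html);   // suppress warnings from sloppy markup

$hrefs = array();

// <a href="..."> links
foreach ($doc->getElementsByTagName('a') as $anchor) {
    $href = $anchor->getAttribute('href');
    if ($href !== '') {
        $hrefs[] = $href;
    }
}

// <link rel="stylesheet" href="..."> tags, for the CSS part
foreach ($doc->getElementsByTagName('link') as $tag) {
    if (strtolower($tag->getAttribute('rel')) == 'stylesheet') {
        $hrefs[] = $tag->getAttribute('href');
    }
}

print_r(array_unique($hrefs));
?>
```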
gevans Posted June 6, 2009

Are you calling this function multiple times from one script?
.josh Posted June 6, 2009

I was working on something like this a while back. I never got around to completing it. I remember it mostly kind of sort of worked. A lot of stuff is missing, like doing something about relative links that use ../, or removing a link from the master list if cURL returns false, etc... May or may not inspire something other than a bowel movement...

```php
<form name='getPageForm' action='' method='post'>
  Domain (example: http://www.mysite.com/ note: currently only works with root domain name (no starting at xyz folder): <br/>
  <input type='text' name='pageName' size='50' /><br />
  Number of links <input type='text' name='numLinks' size='2' value='50' /> (will not be exact. will return #+ whatever extra on current page iteration)<br />
  <input type='submit' name='getPage' value='load' />
</form>
<?php
class scraper {

    var $linkList;       // list of data scraped for current page
    var $rootURL;        // root domain entered in from form
    var $maxLinks;       // max links from form
    var $masterLinkList; // master list of links scraped

    /* function __construct: constructor, used to do initial property assignments, based on form input */
    function __construct($rootURL,$max) {
        $this->rootURL = $rootURL;
        $this->maxLinks = $max;
        $this->masterLinkList[] = $this->rootURL;
    } // end function __construct

    /* function scrapePage: goal is to scrape the page content of the url passed to it and return all
       potential links. Problem is that not all links are neatly placed inside a href tags, so using the
       php DOM will not always return all the links on the page. Solution so far is to assume that
       regardless of where the actual link resides, chances are it's within quotes, so the idea is to
       grab all things that are wrapped in quotes. */
    function scrapePage($url) {
        $linkList = array();
        $userAgent = 'Googlebot/2.1 (http://www.googlebot.com/bot.html)';
        // make the cURL request to $url
        $ch = curl_init();
        curl_setopt($ch, CURLOPT_USERAGENT, $userAgent);
        curl_setopt($ch, CURLOPT_URL, $url);
        curl_setopt($ch, CURLOPT_FAILONERROR, true);
        curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
        curl_setopt($ch, CURLOPT_AUTOREFERER, true);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
        curl_setopt($ch, CURLOPT_TIMEOUT, 10);
        $file = @curl_exec($ch);
        if (!$file) {
            $this->linkList = $linkList;
        } else {
            // assume everything inside quotes is a possible link
            preg_match_all('~[\'"]([^\'"]*)[\'"]~',$file,$links);
            // assign results to linkList
            $this->linkList = $links[1];
        }
    } // end function scrapePage

    /* function filterLinks: goal is to go through each item in linkList (stuff pulled from scrapePage)
       and try to validate it as an internal link. So far we use basename to look for a valid page
       extension (specified in $validPageExt). Need to add the ability for the user to enter valid page
       extensions for the target domain. We also attempt to filter out external links by checking if an
       element starts with 'http' and, if it does, whether it starts with rootURL. We also assume that if
       there's a space somewhere in the element it's not valid. Yes, that's not 100%, because you can
       technically have a link with spaces in it, but most people don't actually code that way, and it
       filters out a lot of stuff between quotes, so the benefits far outweigh the cost. */
    function filterLinks() {
        // remove all elements that do not have valid basename extensions
        $validPageExt = array('htm','html','php','php4','php5','asp','aspx','cfm');
        foreach ($this->linkList as $k => $v) {
            $v = basename($v);
            $v = explode('.',$v);
            if (!in_array($v[1],$validPageExt)) unset($this->linkList[$k]);
        } // end foreach linkList

        // remove external links, convert relatives to absolute
        foreach ($this->linkList as $k => $v) {
            // if $v starts with http...
            if (substr($v,0,4) == "http") {
                // if absolute link is not from domain, delete it
                if (!preg_match('~^'.rtrim($this->rootURL,'/').'~i',$v)) unset($this->linkList[$k]);
            } else {
                // if not start with http, assume it is relative, add rootURL to it
                $this->linkList[$k] = rtrim($this->rootURL,'/') . '/' . ltrim($this->linkList[$k],'/');
            } // end else
        } // end foreach linkList

        // assume that if there's a space in there, it's not a valid link
        foreach ($this->linkList as $k => $v) {
            if (strpos($v,' ')) unset($this->linkList[$k]);
        } // end foreach linkList

        // filter out duplicates and reset keys
        $this->linkList = array_unique($this->linkList);
        $this->linkList = array_values($this->linkList);
    } // end function filterLinks

    /* function addLinksToMasterLinkList: goal here is, once data is retrieved from the current link and
       filtered, we add the links to the master link list. Also we remove dupes from the master list and
       reset keys. This function could probably be put inside filterLinks (and it was initially...); I
       couldn't decide whether it deserved its own function or not, so I ended up going for it. */
    function addLinksToMasterLinkList() {
        // add each link to master link list
        foreach ($this->linkList as $v) $this->masterLinkList[] = $v;
        // filter out duplicates on master link list and reset keys
        $this->masterLinkList = array_unique($this->masterLinkList);
        $this->masterLinkList = array_values($this->masterLinkList);
    } // end function addLinksToMasterLinkList

    /* function getLinks: basically the main engine of this bot. Goal is to go down the master link list
       and call each of the other functions until we've passed the max links specified. It's not coded to
       stop at exactly maxLinks; it's coded so that if the count is less than max, it scrapes another
       page. So the end result will be the count before the last iteration, plus whatever extra is on the
       last page. For example, if max is 50 and so far we're at 45 links, another page gets scraped; if
       that page has 10 links on it, the end result will be 55, not 50 links. Also, we make sure to break
       out of the while loop if there are no more links on the master link list to grab data from. This is
       for when the site only has a total of, say, 20 links and you set the number of links to 100; it
       will break out of the loop. */
    function getLinks() {
        // start at first element
        $x = 0;
        // while there are fewer links in the master link list than the max allowed...
        while ((count($this->masterLinkList) < $this->maxLinks)) {
            // break out of loop and end scraping if there are no more links on the master list
            if (!$this->masterLinkList[$x]) break;
            // scrape current page in the master link list
            $this->scrapePage($this->masterLinkList[$x]);
            // filter results from the scrape
            $this->filterLinks();
            // add filtered results to master list
            $this->addLinksToMasterLinkList();
            // move to next link in master link list
            $x++;
        } // end while count < max
    } // end function getLinks

    /* function dumpLinkList: simple function to dump out results. mostly a debugging thing */
    function dumpLinkList() {
        echo "<pre>"; print_r($this->masterLinkList); echo "</pre>";
    } // end function dumpLinkList

} // *** end class scraper

// if user enters url...
if ($_POST['pageName']) {
    // create object
    $scraper = new scraper($_POST['pageName'],$_POST['numLinks']);
    // grab links
    $scraper->getLinks();
    // dump out results
    $scraper->dumpLinkList();
} // end if $_POST
?>
```
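On the missing ../ handling mentioned above, a minimal sketch of one way it could be filled in is a helper that collapses . and .. segments against the directory of the page being scraped. resolveRelative() is a made-up name for illustration, and it ignores query strings, fragments, and ports:

```php
<?php
// Sketch: resolve a relative link (possibly containing ../) against the page it was found on.
function resolveRelative($baseUrl, $relative)
{
    $parts  = parse_url($baseUrl);
    $scheme = $parts['scheme'] . '://';
    $host   = $parts['host'];
    // directory of the current page, e.g. /docs for /docs/page.html
    $dir = isset($parts['path']) ? rtrim(dirname($parts['path']), '/\\') : '';

    $segments = explode('/', $dir . '/' . $relative);
    $resolved = array();
    foreach ($segments as $seg) {
        if ($seg === '' || $seg === '.') {
            continue;            // skip empty and "current dir" segments
        }
        if ($seg === '..') {
            array_pop($resolved); // go up one directory
        } else {
            $resolved[] = $seg;
        }
    }
    return $scheme . $host . '/' . implode('/', $resolved);
}

echo resolveRelative('http://www.mysite.com/docs/page.html', '../css/main.css');
// prints: http://www.mysite.com/css/main.css
?>
```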
byte1918 Posted June 6, 2009 (Author)

> Are you calling this function multiple times from one script?

Nope, just once.. from here: $linka = array_merge($horse, get($adr, $samba, $linka, $not));
gevans Posted June 6, 2009

On line 13:

$not=array();

That is reassigning $not with an empty array, so the visited list you pass into the function gets wiped on every call. Get rid of the line!
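A minimal sketch of that idea taken one step further: share the visited list by reference, so every level of the recursion sees the same array and a URL scanned anywhere in the tree is never rescanned. crawl() and its filtering below are simplified stand-ins for the original get(), not the poster's actual code:

```php
<?php
// Sketch: $visited is shared by reference across all recursive calls.
function crawl($url, $base, array &$visited = array())
{
    if (in_array($url, $visited)) {
        return array();           // already scanned, stop here
    }
    $visited[] = $url;            // mark before recursing

    $html = @file_get_contents($url);
    if ($html === false) {
        return array();
    }

    preg_match_all('/<a[^>]+href="([^"]+)"/i', $html, $m);

    $found = array();
    foreach ($m[1] as $link) {
        // keep only simple relative links, same idea as the original filter
        if (preg_match('/^(https?:\/\/|www\.|mailto:|\.\.\/)/', $link)) {
            continue;
        }
        $abs = $base . $link;
        $found[] = $abs;
        $found = array_merge($found, crawl($abs, $base, $visited));
    }
    return array_unique($found);
}

$links = crawl('http://rcsys.ro/', 'http://rcsys.ro/');
print_r($links);
?>
```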
byte1918 Posted June 6, 2009 (Author)

> On line 13; $not=array(); That is reassigning $not with an empty array. Get rid of the line!

Thanks a lot, Crayon Violent, for the script, and gevans for pointing out my mistake!