Hi,

I'm trying to write a script that can collect all the links on a website.

 

This is my code so far:

<?php
set_time_limit(0);

// $link and $samba start out with the same value. $linka is supposed to store all the
// links found on the website, and $not is supposed to remember every link that has
// already been scanned, so I don't go into an infinite loop.
function get($link, $samba, $linka = array(), $not = array()) {
    $get = file_get_contents($link);

    // preg_match all the anchor links.
    preg_match_all('/<a.*href="(.*)".*>.*<\/a>/smU', $get, $path);
    // preg_match the CSS links; I will need these further on.
    preg_match_all('/href="(.*\.css)"/', $get, $css);

    $links = array_merge($path[1], $css[1]);
    $horse = array_unique($links);
    $not = array();

    // In this foreach I try to remove all external links, or links which I shouldn't
    // take into consideration.
    foreach ($horse as $key => $adr) {
        if (substr($adr, 0, 7) == "http://")
            unset($horse[$key]);
        if (substr($adr, 0, 3) == "www")
            unset($horse[$key]);
        if (substr($adr, 0, 6) == "mailto")
            unset($horse[$key]);
        if (substr($adr, 0, 3) == "../")
            unset($horse[$key]);
        if (substr($adr, 0, 6) == "../../")
            unset($horse[$key]);
    }

    // In this foreach I prepend the base URL ($samba) to each address found. For example,
    // if the link found is contact.html, I turn it into http://www.whatever.com/contact.html.
    foreach ($horse as $key => $adr)
        $horse[$key] = $samba . $adr;

    // Now here is the problem: I go through every link one by one and check whether it
    // has already been scanned. If it hasn't, I add the link to the $not array and call
    // the function recursively.
    foreach ($horse as $adr) {
        if (!in_array($adr, $not)) {
            $not[] = $adr;
            $linka = array_merge($horse, get($adr, $samba, $linka, $not));
        }
    }

    // The problem is that (I think) the $not array doesn't keep the values that have
    // already been scanned when I call the function again, and therefore my main
    // function keeps going into an endless loop.
    $haha = array_unique($linka);
    return $haha;
}

$link = "http://rcsys.ro/";
$a = get($link, $link, $q = array(), $not = array());
print_r($a);
?>

 

 

I posted comments in the code explaining my problem; maybe someone who has already done this has the time to check it out. And yeah, I know my code looks like crap.
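
For reference, one common way around the problem described in the last comment is to share a single "already scanned" list across every recursive call instead of resetting it inside the function (the $not = array(); line inside get() wipes it on every call). Below is a minimal sketch of that idea, passing the visited array by reference; the names crawl and $visited are made up for illustration, and the link filtering is just the same checks as in the code above.

<?php
// Minimal sketch: one shared $visited array is passed by reference through every
// recursive call, so a URL is never scanned twice and the recursion can't loop forever.
function crawl($url, $base, array &$visited) {
    if (in_array($url, $visited)) {
        return array();              // already scanned, stop here
    }
    $visited[] = $url;               // remember it before recursing

    $html = @file_get_contents($url);
    if ($html === false) {
        return array();
    }

    preg_match_all('/<a[^>]*href="([^"]*)"/i', $html, $m);

    $found = array();
    foreach (array_unique($m[1]) as $adr) {
        // same filtering idea as above: skip external, mailto and parent-directory links
        if (substr($adr, 0, 7) == "http://" || substr($adr, 0, 3) == "www" ||
            substr($adr, 0, 6) == "mailto" || substr($adr, 0, 3) == "../") {
            continue;
        }
        $abs = $base . $adr;
        $found[] = $abs;
        // recurse with the SAME $visited array
        $found = array_merge($found, crawl($abs, $base, $visited));
    }
    return array_unique($found);
}

$visited = array();
print_r(crawl("http://rcsys.ro/", "http://rcsys.ro/", $visited));
?>

Making $not a by-reference parameter (or returning it alongside the links) would have the same effect in the original function; the key point is that every level of the recursion has to see the same visited list.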

 

I was working on something like this a while back. I never got around to completing it, but I remember it mostly kind of sort of worked. There's a lot of stuff missing, like doing something about relative links that use ../ or removing a link from the master list if cURL returns false, etc. (see the sketch after the code for one take on the ../ part). May or may not inspire something other than a bowel movement...

 

<form name='getPageForm' action='' method='post'>
  Domain (example: http://www.mysite.com/; note: currently only works with a root domain name, not starting at an xyz folder): <br/>
  <input type='text' name='pageName' size='50' /><br />
  Number of links <input type='text' name='numLinks' size='2' value='50' /> (will not be exact; will return # plus whatever extra is on the current page iteration)<br />
  <input type='submit' name='getPage' value='load' />
</form>

<?php
class scraper {
var $linkList;        // list of data scraped for current page
var $rootURL;         // root domain entered in from form
var $maxLinks;        // max links from form
var $masterLinkList;  // master list of links scraped

/* 
  function __construct:
  constructor, used to do initial property assignments, based on form input
*/
function __construct($rootURL,$max) {
  $this->rootURL = $rootURL;
  $this->maxLinks = $max;
  $this->masterLinkList[] = $this->rootURL;
} // end function __construct

/*
  function scrapePage:
  goal is to scrape the page content of the url passed to it and return all potential links.
  problem is that not all links are neatly placed inside <a href> tags, so using the php DOM
  will not always return all the links on the page. Solution so far is to assume that,
  regardless of where the actual link resides, chances are it's within quotes, so the idea
  is to grab all things that are wrapped in quotes.
*/
function scrapePage($url) {
  $linkList = array();
  $userAgent = 'Googlebot/2.1 (http://www.googlebot.com/bot.html)';

  // make the cURL request to $url
  $ch = curl_init();
  curl_setopt($ch, CURLOPT_USERAGENT, $userAgent);
  curl_setopt($ch, CURLOPT_URL, $url);
  curl_setopt($ch, CURLOPT_FAILONERROR, true);
  curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
  curl_setopt($ch, CURLOPT_AUTOREFERER, true);
  curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
  curl_setopt($ch, CURLOPT_TIMEOUT, 10);
  $file = @curl_exec($ch);
  if (!$file) {
    $this->linkList = $linkList;
  } else {
    // assume everything inside quotes is a possible link
    preg_match_all('~[\'"]([^\'"]*)[\'"]~', $file, $links);
    // assign results to linkList
    $this->linkList = $links[1];
  }
} // end function scrapePage

/*
  function filterLinks:
  goal is to go through each item in linkList (stuff pulled from scrapePage) and try
  to validate it as an internal link. So far we use basename to look for a valid
  page extension (specified in $validPageExt). Need to add the ability for the user
  to enter valid page extensions for the target domain.

  We also attempt to filter out external links by checking whether the element starts
  with 'http' and, if it does, whether it starts with rootURL.

  We also assume that if there's a space somewhere in the element, it's not valid.
  That's not 100% accurate, because you can technically have a link with spaces in it,
  but most people don't actually code that way, and it filters out a lot of the junk
  between quotes, so the benefits far outweigh the cost.
*/
function filterLinks() {
  // remove all elements that do not have valid basename extensions
  $validPageExt = array('htm','html','php','php4','php5','asp','aspx','cfm');
  foreach ($this->linkList as $k => $v) {
    $v = basename($v);
    $v = explode('.', $v);
    if (!isset($v[1]) || !in_array($v[1], $validPageExt))
      unset($this->linkList[$k]);
  } // end foreach linkList

  // remove external links, convert relatives to absolute
  foreach ($this->linkList as $k => $v) {
    // if $v starts with http...
    if (substr($v,0,4) == "http") {
      // if the absolute link is not from this domain, delete it
      if (!preg_match('~^'.rtrim($this->rootURL,'/').'~i',$v)) unset($this->linkList[$k]);
    } else {
      // if it does not start with http, assume it is relative and prepend rootURL
      $this->linkList[$k] = rtrim($this->rootURL,'/') . '/' . ltrim($this->linkList[$k],'/');
    } // end else
  } // end foreach linkList

  // assume that if there's a space in there, it's not a valid link
  foreach ($this->linkList as $k => $v) {
    if (strpos($v,' ') !== false) unset($this->linkList[$k]);
  } // end foreach linkList

  // filter out duplicates and reset keys
  $this->linkList = array_unique($this->linkList);
  $this->linkList = array_values($this->linkList);
} // end function filterLinks

/*
  function addLinksToMasterLinkList:
  goal here is that once data is retrieved from the current link and filtered, we add
  the filtered links to the master link list. We also remove dupes from the master list
  and reset the keys. This function could probably be folded into filterLinks (and it
  was initially...); I couldn't decide whether it deserved its own function or not, so
  I ended up going for it.
*/
function addLinksToMasterLinkList() {
  // add each link to the master link list
  foreach ($this->linkList as $v) $this->masterLinkList[] = $v;

  // filter out duplicates on the master link list and reset keys
  $this->masterLinkList = array_unique($this->masterLinkList);
  $this->masterLinkList = array_values($this->masterLinkList);
} // end function addLinksToMasterLinkList

/*
  function getLinks:
  basically the main engine of this bot. Goal is to go down the master link list
  and call each of the other functions until we've passed the max links specified.
  It's not coded to stop at exactly maxLinks; it's coded so that if the count is less
  than max, it scrapes another page. So the end result will be the count before the
  last iteration, plus whatever is on the last page. For example, if max is 50 and
  so far we're at 45 links, another page gets scraped. If that page has 10 links on
  it, the end result will be 55 links, not 50.

  Also, we make sure to break out of the while loop if there are no more links on
  the master link list to grab data from. That way, if the site only has a total of,
  say, 20 links and you set the number of links to 100, it will still break out of
  the loop.
*/
function getLinks() {
  // start at the first element
  $x = 0;
  // while there are fewer links in the master link list than the max allowed...
  while (count($this->masterLinkList) < $this->maxLinks) {
    // break out of the loop and end scraping if there are no more links on the master list
    if (!isset($this->masterLinkList[$x])) break;
    // scrape the current page in the master link list
    $this->scrapePage($this->masterLinkList[$x]);
    // filter results from the scrape
    $this->filterLinks();
    // add filtered results to the master list
    $this->addLinksToMasterLinkList();
    // move to the next link in the master link list
    $x++;
  } // end while count < max
} // end function getLinks

/*
  function dumpLinkList:
  simple function to dump out results.  mostly a debugging thing
*/
function dumpLinkList() {
  echo "<pre>"; print_r($this->masterLinkList); echo "</pre>";
} // end function dumpLinkList

} //*** end class scraper

// if user enters url...
if (!empty($_POST['pageName'])) {
  // create object
  $scraper = new scraper($_POST['pageName'], $_POST['numLinks']);
  // grab links
  $scraper->getLinks();
  // dump out results
  $scraper->dumpLinkList();
} // end if $_POST

?>
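
On the "../" gap mentioned at the top of this post: once a relative link has been joined to rootURL, a small helper along these lines could collapse the ../ and ./ segments. This is only a sketch under that assumption; normalizePath is a made-up name and is not part of the class above.

<?php
// Sketch: resolve "../" and "./" segments in an absolute URL's path,
// e.g. "http://www.mysite.com/a/b/../c.html" -> "http://www.mysite.com/a/c.html".
function normalizePath($url) {
    $parts = parse_url($url);
    $segments = explode('/', isset($parts['path']) ? $parts['path'] : '/');
    $stack = array();
    foreach ($segments as $seg) {
        if ($seg === '' || $seg === '.') {
            continue;             // skip empty and "current directory" segments
        } elseif ($seg === '..') {
            array_pop($stack);    // go up one level
        } else {
            $stack[] = $seg;
        }
    }
    return $parts['scheme'] . '://' . $parts['host'] . '/' . implode('/', $stack);
}

echo normalizePath('http://www.mysite.com/a/b/../c.html'); // http://www.mysite.com/a/c.html
?>

Inside filterLinks(), something like this could be applied right after a relative link gets rootURL prepended to it.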
