
Scraper Help


twittoris


I have a scraper that takes the links off a designated webpage and then lists the URLs it found on that page. Can someone help me with the next part I want to do, which is to have the script grab the contents of each link and save it as an HTML file, one file per page?

 

<form name='getPageForm' action='' method='post'>

  Domain (example: http://www.mysite.com/; note: currently only works with the root domain name, not starting from a subfolder): <br/>

<input type='text' name='pageName' size='50' /><br />

Number of links <input type='text' name='numLinks' size='2' value='50' /> (will not be exact; it returns that number plus whatever extra links are on the last page scraped)<br />

    <input type='submit' name='getPage' value='load' />

</form>

 

<?php

 

class scraper {

public $linkList = array();        // list of data scraped for current page

public $rootURL;                   // root domain entered in from form

public $maxLinks;                  // max links from form

public $masterLinkList = array();  // master list of links scraped

 

/*

  function __construct:

  constructor, used to do initial property assignments, based on form input

*/

 

function __construct($rootURL,$max) {

  $this->rootURL = $rootURL;

  $this->maxLinks = $max;

$this->masterLinkList[] = $this->rootURL; 

} // end function __construct

 

/*

  function scrapePage:

  goal is to scrape the page content of the url passed to it and return all potential links.

  problem is that not all links are neatly placed inside <a href> tags, so using the php DOM

will not always return all the links on the page. The solution so far is to assume that,

regardless of where the actual link resides, chances are it's within quotes, so the idea is

to grab everything that is wrapped in quotes.

*/

function scrapePage($url) {

  $linkList = array();

  $userAgent = 'Googlebot/2.1 (http://www.googlebot.com/bot.html)';

 

 

  // make the cURL request to $url

  $ch = curl_init();

  curl_setopt($ch, CURLOPT_USERAGENT, $userAgent);

  curl_setopt($ch, CURLOPT_URL,$url);

  curl_setopt($ch, CURLOPT_FAILONERROR, true);

  curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);

  curl_setopt($ch, CURLOPT_AUTOREFERER, true);

  curl_setopt($ch, CURLOPT_RETURNTRANSFER,true);

  curl_setopt($ch, CURLOPT_TIMEOUT, 10);

  $file = @curl_exec($ch);

  curl_close($ch);

  if (!$file) {

$this->linkList = $linkList;

  } else {

    // assume everything inside quotes is a possible link

  preg_match_all('~[\'"]([^\'"]*)[\'"]~',$file,$links);

    // assign results to linkList

  $this->linkList = $links[1];

  }

} // end function scrapePage

 

/*

  function filterLinks:

  goal is to go through each item in linkList (stuff pulled from scrapePage) and try

to validate it as an internal link.  So far we use basename to look for a valid

page extension (specified in $validPageExt). Still need to add the ability for the

user to enter valid page extensions for the target domain.

We also attempt to filter out external links by checking whether the element starts

with 'http' and, if it does, whether it starts with rootURL.

We also assume that if there's a space somewhere in the element, it's not valid.

Yes, that's not 100% accurate, because you can technically have a link with spaces

in it, but most people don't actually code that way, and it filters out a lot of

stuff between quotes, so the benefits far outweigh the cost.

*/

function filterLinks() {

  // remove all elements that do not have valid basename extensions

  $validPageExt = array('htm','html','php','php4','php5','asp','aspx','cfm');

  foreach ($this->linkList as $k => $v) {

    // keep only the path portion so query strings don't hide the extension

    $ext = strtolower((string) pathinfo((string) parse_url($v, PHP_URL_PATH), PATHINFO_EXTENSION));

    if (!in_array($ext, $validPageExt))

      unset($this->linkList[$k]);

  } // end foreach linkList

 

  // remove external links, convert relatives to absolute

  foreach ($this->linkList as $k => $v) {

// if $v starts with http...

  if (substr($v,0,4) == "http") {

      // if absolute link is not from domain, delete it

      if (stripos($v, rtrim($this->rootURL,'/')) !== 0) unset($this->linkList[$k]);

} else {

      // if not start with http, assume it is relative, add rootURL to it

      $this->linkList[$k] = rtrim($this->rootURL,'/') . '/' . ltrim($this->linkList[$k],'/');

} // end else

} // end foreach linkList

 

  // assume that if there's a space in there, it's not a valid link

  foreach ($this->linkList as $k => $v) {

    if (strpos($v,' ') !== false) unset($this->linkList[$k]);

  } // end foreach linkList

 

// filter out duplicates and reset keys

  $this->linkList = array_unique($this->linkList);

  $this->linkList = array_values($this->linkList);

} // end function filterLinks

 

/*

  function addLinksToMasterLinkList:

  goal here is, once the data from the current link has been retrieved and filtered,

to add those links to the master link list. We also remove dupes from the master

list and reset the keys.  This function could probably live inside filterLinks (and

it did initially...); I couldn't decide whether it deserved its own function or not,

so I ended up giving it one.

*/

function addLinksToMasterLinkList() {

// add each link to master link list

foreach ($this->linkList as $v) $this->masterLinkList[] = $v;

 

  // filter out duplicates on master link list and reset keys

  $this->masterLinkList = array_unique($this->masterLinkList);

$this->masterLinkList = array_values($this->masterLinkList);

} // end function addLinksToMasterLinkList

 

/*

  function getLinks:

  basically the main engine of this bot.  The goal is to go down the master link list

and call each of the other functions until we've passed the specified max links.

It's not coded to stop at exactly maxLinks; it's coded so that if the count is less

than max, it scrapes another page.  So the end result will be the count before the

last iteration, plus whatever is on the last page.  For example, if max is 50

and so far we're at 45 links, another page gets scraped; if that page has 10

links on it, the end result will be 55 links, not 50.

Also, we make sure to break out of the while loop if there are no more links on

the master link list to grab data from.  This covers the case where the site only

has, say, 20 links in total and you set the number of links to 100; the loop just

ends instead of running forever.

*/

function getLinks() {

  // start at first element

  $x = 0;

  // while there are less links in the master link list than the max allowed...

while (count($this->masterLinkList) < $this->maxLinks) {

    // break out of loop and end scraping if there are no more links on the master list

    if (!isset($this->masterLinkList[$x])) break;

    // scrape current page in the master link list

    $this->scrapePage($this->masterLinkList[$x]);

    // filter results from the scrape

$this->filterLinks();

    // add filtered results to master list

    $this->addLinksToMasterLinkList();

    // move to next link in master link list

    $x++;

} // end while count < max

}// end function getLinks

 

/*

  function dumpLinkList:

  simple function to dump out results.  mostly a debugging thing

*/

function dumpLinkList () {

echo "<pre>";print_r($this->masterLinkList); echo "</pre>";

 

} // end function dumpLinkList
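
/*

  function savePages:

  suggested addition (untested sketch) for the question at the top of the post:

  grab the contents of each link on the master link list and save each one as its

  own .html file.  The 'pages' directory default and the md5-based file names are

  just assumptions for this sketch; call it after getLinks().

*/

function savePages($dir = 'pages') {
  // make sure the output directory exists
  if (!is_dir($dir)) mkdir($dir, 0755, true);

  $userAgent = 'Googlebot/2.1 (http://www.googlebot.com/bot.html)';

  foreach ($this->masterLinkList as $url) {
    // fetch the page the same way scrapePage does
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_USERAGENT, $userAgent);
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_FAILONERROR, true);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    curl_setopt($ch, CURLOPT_AUTOREFERER, true);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_TIMEOUT, 10);
    $file = @curl_exec($ch);
    curl_close($ch);

    // skip anything that could not be fetched
    if (!$file) continue;

    // md5 of the url keeps the file name filesystem-safe and unique per page
    file_put_contents($dir . '/' . md5($url) . '.html', $file);
  } // end foreach masterLinkList
} // end function savePages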

 

} //*** end class scraper

 

// if user enters url...

 

 

if (!empty($_POST['pageName'])) {

  // create object

  $scraper = new scraper($_POST['pageName'], (int) $_POST['numLinks']);

  // grab links

  $scraper->getLinks();

  // dump out results

  $scraper->dumpLinkList();

} // end if $_POST

 

?> 
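
If the sketched savePages() method above were added to the class, the only change needed at the bottom of the script would be one extra call after dumpLinkList(). This is just an illustration; savePages() and the pages/ folder are assumptions from that sketch, not part of the original scraper:

  // fetch each collected link and write it out as pages/<md5 of url>.html
  $scraper->savePages('pages');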
