
Scraper Help


twittoris


I have a scraper that takes the links off a designated webpage and then lists the URLs it found on that page. Can someone help me with the next part I want to do, which is to have the script grab the contents of each link and save it as an HTML file, one file per page?

 

<form name='getPageForm' action='' method='post'>

  Domain (example: http://www.mysite.com/; note: currently only works with the root domain name, not starting from a subfolder): <br/>

<input type='text' name='pageName' size='50' /><br />

Number of links <input type='text' name='numLinks' size='2' value='50' /> (will not be exact; it returns that number plus whatever extra links are on the last page scraped)<br />

    <input type='submit' name='getPage' value='load' />

</form>

 

<?php

 

class scraper {

public $linkList = array();        // list of data scraped for current page

public $rootURL;                   // root domain entered in from form

public $maxLinks;                  // max links from form

public $masterLinkList = array();  // master list of links scraped

 

/*

  function __construct:

  constructor, used to do initial property assignments, based on form input

*/

 

function __construct($rootURL,$max) {

  $this->rootURL = $rootURL;

  $this->maxLinks = $max;

$this->masterLinkList[] = $this->rootURL; 

} // end function __construct

 

/*

  function scrapePage:

  goal is to scrape the page content of the url passed to it and return all potential links.

  problem is that not all links are neatly placed inside <a href> tags, so using the php DOM

will not always return all the links on the page. The solution so far is to assume that,

regardless of where the actual link resides, chances are it's within quotes, so the idea is

to grab everything that is wrapped in quotes.

*/

function scrapePage($url) {

  $linkList = array();

  $userAgent = 'Googlebot/2.1 (http://www.googlebot.com/bot.html)';

 

 

  // make the cURL request to $url

  $ch = curl_init();

  curl_setopt($ch, CURLOPT_USERAGENT, $userAgent);

  curl_setopt($ch, CURLOPT_URL,$url);

  curl_setopt($ch, CURLOPT_FAILONERROR, true);

  curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);

  curl_setopt($ch, CURLOPT_AUTOREFERER, true);

  curl_setopt($ch, CURLOPT_RETURNTRANSFER,true);

  curl_setopt($ch, CURLOPT_TIMEOUT, 10);

  $file = @curl_exec($ch);

  curl_close($ch);

  if (!$file) {

$this->linkList = $linkList;

  } else {

    // assume everything inside quotes is a possible link

  preg_match_all('~[\'"]([^\'"]*)[\'"]~',$file,$links);

    // assign results to linkList

  $this->linkList = $links[1];

  }

} // end function scrapePage

 

/*

  function filterLinks:

  goal is to go through each item in linkList (stuff pulled from scrapePage) and try

to validate it as an internal link.  So far we use basename to look for a valid

page extension (specified in $validPageExt). Still need to add the ability for the

user to enter valid page extensions for the target domain.

We also attempt to filter out external links by checking whether the element starts

with 'http' and, if it does, whether it starts with rootURL.

We also assume that if there's a space somewhere in the element, it's not valid.

Yes, that's not 100% accurate, because you can technically have a link with spaces

in it, but most people don't actually code that way, and it filters out a lot of

stuff between quotes, so the benefits far outweigh the cost.

*/

function filterLinks() {

  // remove all elements that do not have valid basename extensions

  $validPageExt = array('htm','html','php','php4','php5','asp','aspx','cfm');

  foreach ($this->linkList as $k => $v) {

    // keep only the path portion so query strings don't hide the extension

    $ext = strtolower((string) pathinfo((string) parse_url($v, PHP_URL_PATH), PATHINFO_EXTENSION));

    if (!in_array($ext, $validPageExt))

      unset($this->linkList[$k]);

  } // end foreach linkList

 

  // remove external links, convert relatives to absolute

  foreach ($this->linkList as $k => $v) {

// if $v starts with http...

  if (substr($v,0,4) == "http") {

      // if absolute link is not from domain, delete it

      if (stripos($v, rtrim($this->rootURL,'/')) !== 0) unset($this->linkList[$k]);

} else {

      // if not start with http, assume it is relative, add rootURL to it

      $this->linkList[$k] = rtrim($this->rootURL,'/') . '/' . ltrim($this->linkList[$k],'/');

} // end else

} // end foreach linkList

 

  // assume that if there's a space in there, it's not a valid link

  foreach ($this->linkList as $k => $v) {

    if (strpos($v,' ') !== false) unset($this->linkList[$k]);

  } // end foreach linkList

 

// filter out duplicates and reset keys

  $this->linkList = array_unique($this->linkList);

  $this->linkList = array_values($this->linkList);

} // end function filterLinks

 

/*

  function addLinksToMasterLinkList:

  goal here is, once the data from the current link has been retrieved and filtered,

to add those links to the master link list. We also remove dupes from the master

list and reset the keys.  This function could probably live inside filterLinks (and

it did initially...); I couldn't decide whether it deserved its own function or not,

so I ended up giving it one.

*/

function addLinksToMasterLinkList() {

// add each link to master link list

foreach ($this->linkList as $v) $this->masterLinkList[] = $v;

 

  // filter out duplicates on master link list and reset keys

  $this->masterLinkList = array_unique($this->masterLinkList);

$this->masterLinkList = array_values($this->masterLinkList);

} // end function addLinksToMasterLinkList

 

/*

  function getLinks:

  basically the main engine of this bot.  The goal is to go down the master link list

and call each of the other functions until we've passed the specified max links.

It's not coded to stop at exactly maxLinks; it's coded so that if the count is less

than max, it scrapes another page.  So the end result will be the count before the

last iteration, plus whatever is on the last page.  For example, if max is 50

and so far we're at 45 links, another page gets scraped; if that page has 10

links on it, the end result will be 55 links, not 50.

Also, we make sure to break out of the while loop if there are no more links on

the master link list to grab data from.  This covers the case where the site only

has, say, 20 links in total and you set the number of links to 100; the loop just

ends instead of running forever.

*/

function getLinks() {

  // start at first element

  $x = 0;

  // while there are less links in the master link list than the max allowed...

while (count($this->masterLinkList) < $this->maxLinks) {

    // break out of loop and end scraping if there are no more links on the master list

    if (!isset($this->masterLinkList[$x])) break;

    // scrape current page in the master link list

    $this->scrapePage($this->masterLinkList[$x]);

    // filter results from the scrape

$this->filterLinks();

    // add filtered results to master list

    $this->addLinksToMasterLinkList();

    // move to next link in master link list

    $x++;

} // end while count < max

}// end function getLinks

 

/*

  function dumpLinkList:

  simple function to dump out results.  mostly a debugging thing

*/

function dumpLinkList () {

echo "<pre>";print_r($this->masterLinkList); echo "</pre>";

 

} // end function dumpLinkList
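
/*

  function savePages:

  suggested addition (untested sketch) for the question at the top of the post:

  grab the contents of each link on the master link list and save each one as its

  own .html file.  The 'pages' directory default and the md5-based file names are

  just assumptions for this sketch; call it after getLinks().

*/

function savePages($dir = 'pages') {
  // make sure the output directory exists
  if (!is_dir($dir)) mkdir($dir, 0755, true);

  $userAgent = 'Googlebot/2.1 (http://www.googlebot.com/bot.html)';

  foreach ($this->masterLinkList as $url) {
    // fetch the page the same way scrapePage does
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_USERAGENT, $userAgent);
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_FAILONERROR, true);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    curl_setopt($ch, CURLOPT_AUTOREFERER, true);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_TIMEOUT, 10);
    $file = @curl_exec($ch);
    curl_close($ch);

    // skip anything that could not be fetched
    if (!$file) continue;

    // md5 of the url keeps the file name filesystem-safe and unique per page
    file_put_contents($dir . '/' . md5($url) . '.html', $file);
  } // end foreach masterLinkList
} // end function savePages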

 

} //*** end class scraper

 

// if user enters url...

 

 

if (!empty($_POST['pageName'])) {

  // create object

  $scraper = new scraper($_POST['pageName'], (int) $_POST['numLinks']);

  // grab links

  $scraper->getLinks();

  // dump out results

  $scraper->dumpLinkList();

} // end if $_POST

 

?> 
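
If the sketched savePages() method above were added to the class, the only change needed at the bottom of the script would be one extra call after dumpLinkList(). This is just an illustration; savePages() and the pages/ folder are assumptions from that sketch, not part of the original scraper:

  // fetch each collected link and write it out as pages/<md5 of url>.html
  $scraper->savePages('pages');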
