
Building a Web Crawler with PHP


joeiscoolone


Hi, I am trying to build a web crawler in PHP, but I need some help. I know how to parse text and links out of a page, but I don't know how to follow those links to the pages they point to and index each page. Also, once you have parsed the links out of a page, how do you make sure you index every page they link to, then crawl those pages for links in turn and follow them as well?

Thanks

The answer depends on what you're using the crawler for.  For a simple crawler, you can use recursion.

First, create a function which crawls a URL. Then, whenever you encounter a link on that page, call the same function again with the link as its argument instead of the original URL.
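A minimal sketch of that recursive idea, not a finished crawler. The names `extractLinks` and `crawl`, the `$seen` array, and the depth limit are my own assumptions, added so the recursion stops on pages that link back to each other. It also assumes every href is already an absolute URL. The fetcher is passed in as a callable so the logic can be tried without touching the network.

```php
<?php
// Hypothetical helper: return the href values of all <a> tags in $html.
function extractLinks(string $html): array
{
    preg_match_all('/<a\b[^>]*\bhref="([^"]*)"/i', $html, $matches);
    return $matches[1]; // capture group 1 = the href values only
}

// Recursively crawl $url.
// $fetch : any callable that takes a URL and returns its HTML, or false on failure
// $seen  : URLs already visited (assumption: needed to avoid infinite loops)
// $depth : how many more link levels to follow (assumption: a safety limit)
function crawl(string $url, callable $fetch, array &$seen, int $depth = 2): void
{
    if ($depth < 0 || isset($seen[$url])) {
        return; // too deep, or already crawled
    }
    $seen[$url] = true;

    $html = $fetch($url);
    if ($html === false) {
        return; // page could not be fetched
    }

    // "Index" the page here, then recurse into every link found on it.
    foreach (extractLinks($html) as $link) {
        crawl($link, $fetch, $seen, $depth - 1);
    }
}
```

For a real run you could pass `'file_get_contents'` as `$fetch`, for example `$seen = []; crawl('http://www.domainnamehere.com/', 'file_get_contents', $seen);`, provided `allow_url_fopen` is enabled. Relative links would need to be resolved against the page URL first, which this sketch does not do.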

For something more heavy-duty, a good approach is to store every link you've seen in a database. When you've finished with the current page, you pick another link from the database and crawl that. This gives you much more flexibility in the order of crawling, which you'll need for anything large-scale.
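Roughly, the stored-links approach could look like the sketch below. To keep it self-contained, a plain PHP array stands in for the database table (URL mapped to a status of `pending` or `done`); in a real crawler each array operation would be a query instead. The names `findLinks` and `crawlQueue` and the `$maxPages` limit are assumptions of mine, and hrefs are again assumed to be absolute URLs.

```php
<?php
// Hypothetical helper: return the href values of all <a> tags in $html.
function findLinks(string $html): array
{
    preg_match_all('/<a\b[^>]*\bhref="([^"]*)"/i', $html, $m);
    return $m[1];
}

// Crawl starting at $start, one stored link at a time.
// $fetch    : any callable that takes a URL and returns its HTML, or false
// $maxPages : upper bound on pages fetched (assumption: a safety limit)
// Returns the "table": url => 'pending' | 'done'.
function crawlQueue(string $start, callable $fetch, int $maxPages = 100): array
{
    $status  = [$start => 'pending']; // stand-in for a database table
    $crawled = 0;

    // Pick the oldest pending URL, much like selecting one pending row.
    while ($crawled < $maxPages
        && ($url = array_search('pending', $status, true)) !== false) {
        $status[$url] = 'done';
        $crawled++;

        $html = $fetch($url);
        if ($html === false) {
            continue; // fetch failed; move on to the next stored link
        }

        // Store every link not seen before so it gets crawled later.
        foreach (findLinks($html) as $link) {
            if (!isset($status[$link])) {
                $status[$link] = 'pending';
            }
        }
    }
    return $status;
}
```

Because the next URL is chosen in a separate step, the crawl order can be changed (by site, by priority, by age) without touching the fetching code, which is the flexibility mentioned above.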

I am trying to build a crawler that will search an entire site. I am having problems when I extract links. I use regular expressions to parse out the links, but for each one I get three things: first, a link made up of my domain plus the domain of the site I'm crawling, which takes me to an error page if I click it; then the http://whatever URL as text; and then the anchor text for that link. It does this for every link it loops through. How do I get just the link (http://whatever) so I can pass it to the function that goes to that page? Here is the script:
[code]
<?php
$url = 'http://www.domainnamehere.com/';
if( !$url )
{
  die( "You need to define a URL to process." );
}
else if( substr($url,0,7) != "http://" )
{
  $url = "http://$url";
}
if( !($fd = fopen($url,"r")) )
  die( "Could not open URL!" );

while( $buf = fgets($fd,1024) )
{
  preg_match_all("/<a.*? href=\"(.*?)\".*?>(.*?)<\/a>/i",$buf,$links);

  for( $i = 0; $links[$i]; $i++ )
  {
    for( $j = 0; $links[$i][$j]; $j++ )
    {
      $cur_link = addslashes( strtolower($links[$i][$j]) );
      print "Indexing: $cur_link<br>";
    }
  }
}
fclose($fd);
?>
[/code]
Thanks
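
On the question in the post above: `preg_match_all` fills `$links` with one sub-array per capture group, so `$links[0]` holds the whole `<a ...>...</a>` tags, `$links[1]` the href values, and `$links[2]` the anchor text. The two nested loops print all three. Printing the whole tag makes the browser render it as a link, and a relative href then resolves against the crawler's own domain, which is a likely cause of the broken link described. A minimal sketch that reads only group 1 follows; the function name `hrefsOnly` and the sample HTML are made up for illustration, and the pattern is slightly tightened from the one in the script.

```php
<?php
// Return only the href values found in $html.
function hrefsOnly(string $html): array
{
    preg_match_all('/<a\b[^>]*?\bhref="([^"]*)"[^>]*>(.*?)<\/a>/is', $html, $links);
    // $links[0] = whole tags, $links[1] = href values, $links[2] = anchor text
    return $links[1];
}

// Sample input, invented for this example.
$html = '<p><a class="x" href="http://whatever/">Whatever</a> and <a href="/about">About</a></p>';

foreach (hrefsOnly($html) as $href) {
    // Escape before printing so nothing is rendered as markup in the browser.
    print "Indexing: " . htmlspecialchars($href) . "<br>\n";
}
```

Matching against the whole page instead of 1024-byte chunks from `fgets` also avoids missing links that are split across two reads. The `strtolower` call from the original script is left out on purpose, since URL paths can be case-sensitive.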

Archived

This topic is now archived and is closed to further replies.
