Building a Web Crawler with PHP

joeiscoolone · December 18, 2006

Hi, I am trying to build a web crawler in PHP but I need some help. I know how to parse text and links out of a page, but I don't know how to follow those links to the pages they point to and index each page. Also, once you have parsed the links out of a page, how do you make sure you index every page they link to, then crawl those pages for links and follow them in turn?

Thanks

btherl · December 18, 2006

The answer depends on what you're using the crawler for. For a simple crawler, you can use recursion.

First, create a function which crawls a URL. Then when you encounter a link, just call that function again with the link as its argument, instead of the original URL.
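A rough sketch of that recursive idea follows. The names crawl() and $fetchLinks are made up for illustration: $fetchLinks stands in for whatever code downloads a page and returns the absolute URLs found on it. The $visited array and the depth limit are additions assumed here so the recursion stops when pages link back to each other.

```php
<?php
// Recursive crawl sketch. $fetchLinks is a hypothetical callable that takes
// a URL and returns an array of absolute URLs found on that page.
function crawl(string $url, callable $fetchLinks, array &$visited, int $maxDepth = 3): void
{
    // Stop at the depth limit, and never crawl the same URL twice;
    // without this check, two pages linking to each other recurse forever.
    if ($maxDepth < 0 || isset($visited[$url])) {
        return;
    }
    $visited[$url] = true; // "index" the page here

    foreach ($fetchLinks($url) as $link) {
        // Same function, called with the link instead of the original URL.
        crawl($link, $fetchLinks, $visited, $maxDepth - 1);
    }
}
```

Passing the fetcher in as a callable keeps the recursion separate from the download and parsing code, so either part can be swapped out later.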

For something more heavy-duty, a good approach would be to store all the links you've seen in a database. Then when you've finished with the current page, you pick another link from the database and crawl that. This gives you much more flexibility in the order of crawling, which you'll need for something large-scale.
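The database approach could look something like the sketch below. It assumes PDO with the SQLite driver; the table name pages, its columns, and the frontier_* function names are all hypothetical. Any database would do, though the INSERT OR IGNORE syntax is specific to SQLite.

```php
<?php
// Database-backed crawl queue sketch (assumes the pdo_sqlite extension).
// Table, column, and function names are hypothetical.

function frontier_init(PDO $db): void
{
    $db->exec(
        'CREATE TABLE IF NOT EXISTS pages (
            url     TEXT PRIMARY KEY,
            crawled INTEGER NOT NULL DEFAULT 0
        )'
    );
}

// Remember a URL. The primary key plus INSERT OR IGNORE silently
// drops URLs that have already been seen.
function frontier_add(PDO $db, string $url): void
{
    $stmt = $db->prepare('INSERT OR IGNORE INTO pages (url) VALUES (?)');
    $stmt->execute([$url]);
}

// Pick the oldest uncrawled URL and mark it as crawled.
// Returns null when there is nothing left to crawl.
function frontier_next(PDO $db): ?string
{
    $url = $db->query(
        'SELECT url FROM pages WHERE crawled = 0 ORDER BY rowid LIMIT 1'
    )->fetchColumn();

    if ($url === false) {
        return null;
    }
    $stmt = $db->prepare('UPDATE pages SET crawled = 1 WHERE url = ?');
    $stmt->execute([$url]);
    return $url;
}
```

The main loop would then call frontier_next() until it returns null, fetch each page, and call frontier_add() for every link found. Changing the ORDER BY clause is where the flexibility in crawl order comes from.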

joeiscoolone · December 18, 2006

I am trying to build a crawler that will search an entire site. I am having problems when I extract links. I use regular expressions to parse out the links, but for each one I get three things: first a clickable link that combines my domain with the domain of the site I'm crawling, which takes me to an error page if I click it; then the http://whatever URL as text; and then the anchor text for that link. It does this for every link it loops through. How do I get just the http://whatever link so I can pass it to the function that goes to that page? Here is the script:
[code]
<?php
$url = ('http://www.domainnamehere.com/');
if( !$url )
{
    die( "You need to define a URL to process." );
}
else if( substr($url,0,7) != "http://" )
{
    $url = "http://$url";
}
if( !($fd = fopen($url,"r")) )
    die( "Could not open URL!" );

while( $buf = fgets($fd,1024) )
{
    preg_match_all("/<a.*? href=\"(.*?)\".*?>(.*?)<\/a>/i",$buf,$links);

    for( $i = 0; $links[$i]; $i++ )
    {
        for( $j = 0; $links[$i][$j]; $j++ )
        {
            $cur_link = addslashes( strtolower($links[$i][$j]) );
            print "Indexing: $cur_link<br>";
        }
    }
}
fclose($fd);
?>
[/code]
Thanks
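For reference, the three outputs described above match how preg_match_all() fills its result array by default: $links[0] holds the whole matched tags, $links[1] the href values, and $links[2] the anchor texts, and the nested loop prints all three. Printing a whole `<a>` tag into the page makes the browser render it as a link, and a relative href then resolves against the domain the script runs on. A possible fix, sketched below, reads only group 1 and turns relative hrefs into absolute URLs. The function names extract_links() and resolve_url() are made up, the resolver is deliberately simplified (no "../" handling, no ports), and the pattern only matches double-quoted href attributes.

```php
<?php
// Return the href values found in $html as de-duplicated absolute URLs.
function extract_links(string $html, string $baseUrl): array
{
    // Only capture group 1 (the href value) is used; $m[0] holds the
    // whole matched tags and is ignored.
    preg_match_all('/<a\s[^>]*?href="([^"]*)"/i', $html, $m);

    $links = [];
    foreach ($m[1] as $href) {
        $abs = resolve_url($baseUrl, trim($href));
        if ($abs !== null) {
            $links[$abs] = true; // array keys de-duplicate
        }
    }
    return array_keys($links);
}

// Simplified, hypothetical resolver: handles absolute, protocol-relative,
// root-relative, and same-directory hrefs. Returns null for links that
// should not be crawled (fragments, mailto:, javascript:).
function resolve_url(string $base, string $href): ?string
{
    if ($href === '' || $href[0] === '#' || preg_match('/^(mailto|javascript):/i', $href)) {
        return null;
    }
    if (preg_match('#^https?://#i', $href)) {
        return $href; // already absolute
    }
    $p = parse_url($base);
    if (substr($href, 0, 2) === '//') {
        return $p['scheme'] . ':' . $href; // protocol-relative
    }
    $root = $p['scheme'] . '://' . $p['host'];
    if ($href[0] === '/') {
        return $root . $href; // relative to the site root
    }
    // Otherwise relative to the directory of the base page.
    $path = $p['path'] ?? '/';
    $dir  = substr($path, 0, strrpos($path, '/') + 1);
    return $root . $dir . $href;
}
```

Note that the original script also reads the page 1024 bytes at a time with fgets(), so a tag split across two reads would be missed. Matching against the whole page at once, for example the result of file_get_contents(), avoids that.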

drifter · December 18, 2006

You may want to look at something like PHPDig and save yourself some work. (I think it is still around, right? I have not used it in a few years.)

joeiscoolone · December 18, 2006

PHPDig is still around, but I want to build my own crawler.

joeiscoolone · December 20, 2006

Can someone point me to a tutorial on how to build a web crawler in PHP, one that crawls links and checks robots.txt? I have looked but can't find one.
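For the robots.txt part, the basic check might look like the sketch below. It is a simplified reading of the format: it only collects Disallow rules for one user agent and treats each rule as a plain path prefix, ignoring Allow lines and wildcards. The function names robots_disallowed() and robots_allows() are made up for illustration.

```php
<?php
// Collect the Disallow path prefixes that apply to $agent.
// Simplified: Allow lines and wildcard patterns are ignored.
function robots_disallowed(string $robotsTxt, string $agent = '*'): array
{
    $rules = [];
    $applies = false;      // does the current group apply to $agent?
    $inAgentLines = false; // was the previous rule line a User-agent line?

    foreach (preg_split('/\r\n|\r|\n/', $robotsTxt) as $line) {
        $line = trim(preg_replace('/#.*/', '', $line)); // strip comments
        if ($line === '' || strpos($line, ':') === false) {
            continue;
        }
        [$field, $value] = array_map('trim', explode(':', $line, 2));
        $field = strtolower($field);

        if ($field === 'user-agent') {
            $match = (strcasecmp($value, $agent) === 0);
            // Consecutive User-agent lines share one group of rules.
            $applies = $inAgentLines ? ($applies || $match) : $match;
            $inAgentLines = true;
        } else {
            $inAgentLines = false;
            // An empty Disallow value means "nothing is disallowed".
            if ($field === 'disallow' && $applies && $value !== '') {
                $rules[] = $value;
            }
        }
    }
    return $rules;
}

// True if $path does not start with any disallowed prefix.
function robots_allows(string $path, array $rules): bool
{
    foreach ($rules as $prefix) {
        if (strpos($path, $prefix) === 0) {
            return false;
        }
    }
    return true;
}
```

A crawler would fetch /robots.txt once per host, keep the returned rules, and call robots_allows() with the path of each URL before requesting it.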

joeiscoolone · December 21, 2006

*bump*
