joeiscoolone Posted December 18, 2006
Hi, I am trying to build a web crawler in PHP, but I need some help. I know how to parse text and links out of a page, but I don't know how to follow those links and index the pages they lead to. Also, once I have parsed the links out of a page, how do I make sure all the linked pages get indexed, and then crawl those pages in turn for more links?
Thanks
https://forums.phpfreaks.com/topic/31041-building-a-web-crawler-with-php/
btherl Posted December 18, 2006
The answer depends on what you're using the crawler for. For a simple crawler, you can use recursion. First, create a function which crawls a URL. Then, whenever you encounter a link, call that function again with the link as its argument instead of the original URL.
For something more heavy-duty, a good approach is to store every link you've seen in a database. When you've finished with the current page, you pick another URL from the database and crawl that. This gives you much more flexibility in the order of crawling, which you'll need for anything large-scale.
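The recursive approach above could be sketched like this (a minimal sketch only; crawl_page, extract_links, and the $visited array are my own illustrative names, and a real crawler would also need politeness delays, a depth or page limit, and better HTML parsing than a regex):

```php
<?php
// Pull href values out of a chunk of HTML with a simple regex.
function extract_links($html) {
    preg_match_all('/<a[^>]+href="([^"]+)"/i', $html, $m);
    return $m[1];
}

// Recursively crawl: index the page, then call ourselves on each link.
// $visited stops us from crawling the same URL twice or looping forever.
function crawl_page($url, &$visited, $depth = 0, $max_depth = 3) {
    if (isset($visited[$url]) || $depth > $max_depth) {
        return;
    }
    $visited[$url] = true;

    $html = @file_get_contents($url);
    if ($html === false) {
        return;
    }

    // ... index $html here ...

    foreach (extract_links($html) as $link) {
        crawl_page($link, $visited, $depth + 1, $max_depth);
    }
}

$visited = array();
crawl_page('http://www.example.com/', $visited);
?>
```

The visited set is the important part: without it, two pages that link to each other will recurse until PHP runs out of stack or memory.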
joeiscoolone Posted December 18, 2006 Author
I am trying to build a crawler that will search an entire site. I am having problems when I extract links. I use regular expressions to parse them out, but for every link my loop prints a mix of things: a link with my domain and the crawled site's domain mashed together (which leads to an error page if I click it), then the http://whatever as text, and then the anchor text. How do I get just the URL (http://whatever) so I can pass it to the function that crawls that page? Here is the script:
[code]<?php
$url = 'http://www.domainnamehere.com/';

if (!$url) {
    die("You need to define a URL to process.");
} else if (substr($url, 0, 7) != "http://") {
    $url = "http://$url";
}

// Fetch the whole page at once, so <a> tags split across lines
// are not missed the way line-by-line fgets() reads would miss them.
$page = file_get_contents($url);
if ($page === false) {
    die("Could not open URL!");
}

// Capture group 1 is the href, group 2 is the anchor text.
preg_match_all("/<a.*?href=\"(.*?)\".*?>(.*?)<\/a>/i", $page, $links);

// $links[0] holds the full <a> tags and $links[2] the anchor text;
// $links[1] holds only the URLs, so loop over that.
foreach ($links[1] as $cur_link) {
    $cur_link = addslashes(strtolower($cur_link));
    print "Indexing: $cur_link<br>";
}
?>[/code]
Thanks
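The error pages described above usually come from relative links (like /about.html or faq.html) being treated as absolute URLs. Before crawling, each href needs to be resolved against the page it came from. A helper along these lines is one way to do it (a sketch under assumptions: resolve_link is my own name, and it deliberately ignores "../" segments, "//host" links, and fragments):

```php
<?php
// Sketch: turn a relative href into an absolute URL, resolved
// against the base page it was found on. Covers the common cases
// only; a full resolver must also handle "../", "//host", and "#".
function resolve_link($href, $base) {
    if (preg_match('/^https?:\/\//i', $href)) {
        return $href;                       // already absolute
    }
    $parts = parse_url($base);
    $root  = $parts['scheme'] . '://' . $parts['host'];
    if (substr($href, 0, 1) === '/') {
        return $root . $href;               // root-relative link
    }
    // Path-relative link: append to the base page's directory
    $dir = isset($parts['path']) ? rtrim(dirname($parts['path']), '/') : '';
    return $root . $dir . '/' . $href;
}
?>
```

With something like this, every $cur_link from the regex can be normalized to a full URL before being handed to the crawl function.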
drifter Posted December 18, 2006
You may want to look at something like PhpDig and save yourself some work... (I think that is still around, right? I have not used it in a few years.)
joeiscoolone Posted December 18, 2006 Author
PHPDig is still around, but I want to build my own crawler.
joeiscoolone Posted December 20, 2006 Author
Can someone point me to a tutorial on building a web crawler in PHP, one that follows links and respects robots.txt? I have looked but can't find one.
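The robots.txt side of that question is simple enough to sketch without a tutorial. Something like the following reads the Disallow rules that apply to all user-agents (illustrative only; disallowed_paths and is_allowed are my own names, and this ignores per-bot sections, Allow lines, and wildcard patterns that real parsers handle):

```php
<?php
// Sketch: collect Disallow prefixes from robots.txt text that apply
// to every crawler (the "*" user-agent). A production parser should
// also honor bot-specific sections, Allow rules, and wildcards.
function disallowed_paths($robots_txt) {
    $paths   = array();
    $applies = false;
    foreach (explode("\n", $robots_txt) as $line) {
        $line = trim(preg_replace('/#.*/', '', $line));  // strip comments
        if (preg_match('/^User-agent:\s*(.+)$/i', $line, $m)) {
            $applies = (trim($m[1]) === '*');
        } elseif ($applies && preg_match('/^Disallow:\s*(\S+)/i', $line, $m)) {
            $paths[] = $m[1];
        }
    }
    return $paths;
}

// A path is allowed unless it starts with a disallowed prefix.
function is_allowed($path, $disallowed) {
    foreach ($disallowed as $prefix) {
        if (strpos($path, $prefix) === 0) {
            return false;
        }
    }
    return true;
}
?>
```

In a crawler you would fetch http://thesite.com/robots.txt once with file_get_contents(), build the list, and call is_allowed() on each URL's path before crawling it.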
joeiscoolone Posted December 21, 2006 Author
*bump*
Archived
This topic is now archived and is closed to further replies.