advancedfuture Posted March 25, 2007

I am writing a simple spider that will grab links off of a page. So far I have no problem grabbing all the links off the first seed page, but I am stuck on how to get it to follow those links to the next page and grab additional links perpetually. Here's what I've written so far.

<?php

$seed = 'http://www.dmoz.org/Arts/Animation/Anime/Distribution/Companies/';
$data = file_get_contents($seed);

if (preg_match_all("/http:\/\/[^\"\s']+/", $data, $links))
{
    for ($i = 0; $i < count($links[0]); $i++)
    {
        echo "<font size=\"2\" face=\"verdana\">".$links[0][$i]."</font><br>";
    }
}

?>
Orio Posted March 25, 2007

Create a function and call it recursively... But be warned, you can end up in an endless loop here, so don't use set_time_limit(0). Also, calling a function recursively can get slow in PHP after a few levels, so keep that in mind too.

<?php

$seed = 'http://www.dmoz.org/Arts/Animation/Anime/Distribution/Companies/';
spider_man($seed);

function spider_man($url)
{
    echo "Following ".$url." <br>\n";
    $data = file_get_contents($url);
    if (preg_match_all("/http:\/\/[^\"\s']+/", $data, $links))
    {
        foreach ($links[0] as $link)
        {
            spider_man($link);
        }
    }
}

?>

Orio.
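A rough sketch of that same recursive approach with two safety nets added, so pages that link back to each other can't send the spider into an endless loop. This is only an illustration, not code from the thread; the $max_depth and $visited names are made up for the example.

<?php

$visited = array();

function spider_man($url, $depth = 0, $max_depth = 2)
{
    global $visited;

    // Stop once we are deep enough or have already fetched this URL.
    if ($depth > $max_depth || isset($visited[$url]))
    {
        return;
    }
    $visited[$url] = true;

    echo "Following ".$url." <br>\n";

    $data = @file_get_contents($url); // @ keeps dead links from spewing warnings
    if ($data === false)
    {
        return;
    }

    if (preg_match_all("/http:\/\/[^\"\s']+/", $data, $links))
    {
        foreach ($links[0] as $link)
        {
            spider_man($link, $depth + 1, $max_depth);
        }
    }
}

spider_man('http://www.dmoz.org/Arts/Animation/Anime/Distribution/Companies/');

?>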
advancedfuture Posted March 25, 2007 Author

yeah I was working more on it. I was trying to output the links to a cache file, then on the next loop call the first line of the cache file and make that URL the new seed. Only problem I am having with my wife. Is when the loop starts over again.. the file gets emptied, so it's deleting the data in cache.txt before looping. See my code below (sorry, hope my comments are sufficient!):

spider.php

<?php

//INITIAL SEED STARTING SITE
$seed = 'http://www.dmoz.org/Arts/Animation/Anime/Distribution/Companies/';

//CACHE FILE FOR LINK QUEUE
$filename = "cache.txt";

//START SPIDER LOOP
for ($i = 1; $i < 3; $i++)
{
    $data = file_get_contents($seed);

    if (preg_match_all("/http:\/\/[^\"\s']+/", $data, $links))
    {
        // WRITE PAGE LINKS TO CACHE FILE & PRINT LINK TO SCREEN
        for ($i = 0; $i < count($links[0]); $i++)
        {
            $q = $links[0][$i] . "\n";
            $f = fopen($filename, "a");
            fwrite($f, $q);
            print $q . "<br>"; //print lines to screen
            fclose($f);
        }
    }

    //GET NEXT LINK FROM CACHE AND PLACE IN SEED THEN DELETE LINK FROM CACHE.
    //IMPORT CACHE.TXT TO ARRAY
    $oldfiledata = file($filename);

    //SET SEED TO FIRST LINE OF CACHE
    $seed = $oldfiledata[0];

    //DELETE FIRST LINE FROM ARRAY
    $newfiledata = array_shift($oldfiledata);

    //OPEN CACHE.TXT AND IMPORT NEW ARRAY DATA
    $file = fopen($filename, "w");
    fwrite($file, $newfiledata);
    fclose($file);

} //Close Loop

?>

..... /edit LOL i just realized how tired I am when I wrote above: Only problem I am having with my wife.
Orio Posted March 25, 2007

Replace this part:

$f = fopen($filename,"a");
fwrite ($f, $q);
print $q . "<br>"; //print lines to screen
fclose($f);

With:

file_put_contents($filename, file_get_contents($filename).$q);
print $q . "<br>"; //print lines to screen

Orio.
advancedfuture Posted March 25, 2007 Author

I don't think that's the actual issue in my code. I think it's more along the lines of the bottom section of the code (I modified it some more). The $seed value keeps getting returned as:

"Arrayhttp://report-abuse.dmoz.org/?cat=Arts/Animation/Anime/Distribution/Companies"

It inserting the word "Array" is probably screwing it up big time.

//GET NEXT LINK FROM CACHE AND PLACE IN SEED THEN DELETE LINK FROM CACHE.
//IMPORT CACHE.TXT TO ARRAY
$oldfiledata = file($filename);

//SET SEED TO FIRST LINE OF CACHE
$seed = $oldfiledata[0];
print "<b>" . $oldfiledata[0] . "</b>";

//DELETE FIRST LINE FROM ARRAY
$oldfiledata = array_shift($oldfiledata);

//OPEN CACHE.TXT AND IMPORT NEW ARRAY DATA
for ($j = 0; $j < count($oldfiledata[0]); $j++)
{
    $q = $oldfiledata[0][$j] . "\n";
    $file = fopen($filename, "w");
    fwrite($file, $oldfiledata);
    fclose($file);
}

} //Close Loop

?>
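As a side note, the "Arrayhttp://..." pattern is the classic sign of an array being treated as a string somewhere: when PHP has to cast an array to a string, for example by concatenating it with a URL, it produces the literal word "Array". It is also worth remembering that array_shift() returns the element it removes, not the remaining array. A tiny standalone illustration of both behaviours (not the thread's code):

<?php

$lines = array("http://example.com/a\n", "http://example.com/b\n");

// array_shift() returns the *removed* element; $lines itself shrinks by one.
$first = array_shift($lines);
echo $first;          // "http://example.com/a"
echo count($lines);   // 1 -- only the second URL is left

// Concatenating an array with a string casts the array to the word "Array".
$glued = $lines . "http://example.com/c";
echo $glued;          // "Arrayhttp://example.com/c" (plus a conversion notice)

?>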
advancedfuture Posted March 25, 2007 Author

OK, fixed the problem with it not looping and resetting the seed! The file is looping and following all links. Here's my final code. The ONLY problem now is that the cache.txt file has HUGE amounts of white space between entries. I don't know where it's putting all those carriage returns in from; it should just be line after line.

final code spider.php

<?php

//INITIAL SEED STARTING SITE
$seed = 'http://www.dmoz.org/Arts/Animation/Anime/Distribution/Companies/';

//CACHE FILE FOR LINK QUEUE
$filename = "cache.txt";

//START SPIDER LOOP
for ($i = 1; $i < 2000; $i++)
{
    $data = file_get_contents($seed);

    if (preg_match_all("/http:\/\/[^\"\s']+/", $data, $links))
    {
        // WRITE PAGE LINKS TO CACHE FILE & PRINT LINK TO SCREEN
        for ($i = 0; $i < count($links[0]); $i++)
        {
            $q = $links[0][$i] . "\n";
            $f = fopen($filename, "a");
            fwrite($f, $q);
            print $q . "<br>"; //print lines to screen
            fclose($f);
        }
    }

    //GET NEXT LINK FROM CACHE AND PLACE IN SEED THEN DELETE LINK FROM CACHE.
    //IMPORT CACHE.TXT TO ARRAY
    $oldfiledata = file($filename);

    //DELETE FIRST LINE FROM ARRAY
    array_shift($oldfiledata);
    $newfiledata = implode("\n", $oldfiledata);

    //OPEN CACHE.TXT AND IMPORT NEW ARRAY DATA
    $file = fopen($filename, "w");
    fwrite($file, $newfiledata);
    fclose($file);

    //SET SEED TO FIRST LINE OF CACHE
    $var1 = file($filename);
    $seed = $var1[0];

} //Close Loop

?>
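A likely explanation for the growing whitespace, judging from the code above rather than anything confirmed in the thread: file() keeps the trailing "\n" on every line it reads, and implode("\n", ...) then inserts another newline between lines that already end in one, so each pass through the loop adds an extra blank line between entries. One way around it is to strip the newlines when reading the cache back in. A sketch of just that section (it assumes cache.txt already exists):

<?php

$filename = "cache.txt";

// Read the cache without the trailing newlines on each line.
$oldfiledata = file($filename, FILE_IGNORE_NEW_LINES);

// Drop the first entry, as in the original script.
array_shift($oldfiledata);

// Write the remaining lines back with exactly one newline per line.
file_put_contents($filename, implode("\n", $oldfiledata) . "\n");

// The new seed is the first line of what is left.
$seed = $oldfiledata[0];

?>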
Archived
This topic is now archived and is closed to further replies.