
[SOLVED] Simple Spider


advancedfuture


I am writing a simple spider that will grab links off a page. So far I have no problem grabbing all the links off the first seed page, but I am stuck on how to get it to follow those links to the next page and grab additional links perpetually.

 

Here's what I've written so far.  ;D

 

<?php
// Fetch the seed page and pull out every absolute http:// link it contains.
$seed = 'http://www.dmoz.org/Arts/Animation/Anime/Distribution/Companies/';
$data = file_get_contents($seed);
if (preg_match_all("/http:\/\/[^\"\s']+/", $data, $links))
{
	for ($i = 0; $i < count($links[0]); $i++)
	{
		echo "<font size=\"2\" face=\"verdana\">".$links[0][$i]."</font><br>";
	}
}
?>

 

 


Create a function and call it recursively... But be warned: you can enter an endless loop here, so don't use set_time_limit(0). Also, calling a function recursively can get slow in PHP after more than a few levels, so keep that in mind too.

 

<?php

$seed = 'http://www.dmoz.org/Arts/Animation/Anime/Distribution/Companies/';
spider_man($seed);

function spider_man($url)
{
	echo "Following ".$url." <br>\n";
	$data = file_get_contents($url);
	if (preg_match_all("/http:\/\/[^\"\s']+/", $data, $links))
	{
		foreach ($links[0] as $link)
		{
			spider_man($link);
		}
	}
}

?>
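
As written, though, this will happily recurse into the same pages forever. A minimal guard, as a sketch only: the $visited list and the depth cap below are illustrative additions, not part of the snippet above.

<?php
// Sketch only: $visited and the depth cap are illustrative additions.
$visited = array();

function spider_man($url, $depth = 0)
{
	global $visited;

	// Stop once we are too deep or have already seen this URL.
	if ($depth > 3 || isset($visited[$url])) {
		return;
	}
	$visited[$url] = true;

	echo "Following ".$url." <br>\n";
	$data = @file_get_contents($url);  // @ suppresses warnings on dead links
	if ($data !== false && preg_match_all("/http:\/\/[^\"\s']+/", $data, $links)) {
		foreach ($links[0] as $link) {
			spider_man($link, $depth + 1);
		}
	}
}

spider_man('http://www.dmoz.org/Arts/Animation/Anime/Distribution/Companies/');
?>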

 

 

Orio.


Yeah, I was working more on it. I was trying to output the links to a cache file, then on the next loop call the first line of the cache file and make that URL the new seed. Only problem I am having with my wife. Is when the loop starts over again.. The file gets emptied, so it's deleting the data in cache.txt before looping.

 

See my code below (sorry, hope my comments are sufficient!):

 

spider.php

 

<?php

//INITIAL SEED STARTING SITE
$seed = 'http://www.dmoz.org/Arts/Animation/Anime/Distribution/Companies/';

//CACHE FILE FOR LINK QUEUE
$filename = "cache.txt";

//START SPIDER LOOP
for ($i = 1; $i < 3; $i++)
{
	$data = file_get_contents($seed);
	if (preg_match_all("/http:\/\/[^\"\s']+/", $data, $links))
	{
		// WRITE PAGE LINKS TO CACHE FILE & PRINT LINK TO SCREEN
		// (separate counter $j so the outer loop's $i is untouched)
		for ($j = 0; $j < count($links[0]); $j++)
		{
			$q = $links[0][$j] . "\n";
			$f = fopen($filename, "a");
			fwrite($f, $q);
			print $q . "<br>"; //print lines to screen
			fclose($f);
		}
	}

	//GET NEXT LINK FROM CACHE AND PLACE IN SEED THEN DELETE LINK FROM CACHE.

	//IMPORT CACHE.TXT TO ARRAY
	$oldfiledata = file($filename);

	//SET SEED TO FIRST LINE OF CACHE
	$seed = $oldfiledata[0];

	//DELETE FIRST LINE FROM ARRAY
	$newfiledata = array_shift($oldfiledata);

	//OPEN CACHE.TXT AND IMPORT NEW ARRAY DATA
	$file = fopen($filename, "w");
	fwrite($file, $newfiledata);
	fclose($file);

} //Close Loop

?>

 

 
.....

/edit

 

LOL, I just realized how tired I was when I wrote above:

Only problem I am having with my wife.

I don't think that's the actual issue in my code.

 

I think it's more along the lines of the bottom section of the code (I modified it some more).

 

The $seed value keeps getting returned as:

"Arrayhttp://report-abuse.dmoz.org/?cat=Arts/Animation/Anime/Distribution/Companies"

 

The inserted word "Array" is probably screwing it up big time: PHP converts an array to the literal string "Array" whenever one is used as a string, so whatever is being written to the cache is still an array at that point.

 

//GET NEXT LINK FROM CACHE AND PLACE IN SEED THEN DELETE LINK FROM CACHE.

//IMPORT CACHE.TXT TO ARRAY
$oldfiledata = file($filename);

//SET SEED TO FIRST LINE OF CACHE
$seed = $oldfiledata[0];
print "<b>" . $oldfiledata[0] . "</b>";

//DELETE FIRST LINE FROM ARRAY
$oldfiledata = array_shift($oldfiledata);

//OPEN CACHE.TXT AND IMPORT NEW ARRAY DATA
for ($j = 0; $j < count($oldfiledata[0]); $j++)
{
	$q = $oldfiledata[0][$j] . "\n";
	$file = fopen($filename, "w");
	fwrite($file, $oldfiledata);
	fclose($file);
}

} //Close Loop

?>
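
For the record, the "Array" prefix is the telltale here: array_shift() removes the first element in place and returns that element, so $oldfiledata = array_shift($oldfiledata); replaces the whole queue with a single URL string, and anything that later treats an array as a string comes out as the literal "Array". A corrected pop might look like this sketch, reusing the variable names from the post above:

<?php
// Sketch of the intended queue pop (variable names follow the post above).
$filename = "cache.txt";

//IMPORT CACHE.TXT TO ARRAY (one element per line)
$oldfiledata = file($filename);

// array_shift() removes the first element IN PLACE and returns it:
// capture the URL here; $oldfiledata keeps the rest of the queue.
$seed = trim(array_shift($oldfiledata));

//WRITE THE REMAINING QUEUE BACK (file() already kept each trailing "\n")
$file = fopen($filename, "w");
fwrite($file, implode("", $oldfiledata));
fclose($file);
?>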


OK, fixed the problem with it not looping and resetting the seed!

 

The file is looping and following all links. Here's my final code. The ONLY problem now is that the cache.txt file has HUGE amounts of white space between entries. I don't know where it's putting all those carriage returns in from. It should just be line after line.

 

Final code, spider.php:

<?php

//INITIAL SEED STARTING SITE
$seed = 'http://www.dmoz.org/Arts/Animation/Anime/Distribution/Companies/';

//CACHE FILE FOR LINK QUEUE
$filename = "cache.txt";

//START SPIDER LOOP
for ($i = 1; $i < 2000; $i++)
{
	$data = file_get_contents($seed);
	if (preg_match_all("/http:\/\/[^\"\s']+/", $data, $links))
	{
		// WRITE PAGE LINKS TO CACHE FILE & PRINT LINK TO SCREEN
		// (separate counter $j so the outer loop's $i is untouched)
		for ($j = 0; $j < count($links[0]); $j++)
		{
			$q = $links[0][$j] . "\n";
			$f = fopen($filename, "a");
			fwrite($f, $q);
			print $q . "<br>"; //print lines to screen
			fclose($f);
		}
	}

	//GET NEXT LINK FROM CACHE AND PLACE IN SEED THEN DELETE LINK FROM CACHE.

	//IMPORT CACHE.TXT TO ARRAY
	$oldfiledata = file($filename);

	//DELETE FIRST LINE FROM ARRAY
	array_shift($oldfiledata);
	$newfiledata = implode("\n", $oldfiledata);

	//OPEN CACHE.TXT AND IMPORT NEW ARRAY DATA
	$file = fopen($filename, "w");
	fwrite($file, $newfiledata);
	fclose($file);

	//SET SEED TO FIRST LINE OF CACHE
	$var1 = file($filename);
	$seed = $var1[0];

} //Close Loop

?>
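
A likely cause of the blank lines: file() keeps each line's trailing newline, so implode("\n", ...) puts a second line break between every pair of entries. One possible fix, as a sketch (the FILE_IGNORE_NEW_LINES flag requires PHP 5+):

<?php
$filename = "cache.txt";

// FILE_IGNORE_NEW_LINES strips each line's trailing newline on read,
// so the implode() below is the only thing adding line breaks back.
$oldfiledata = file($filename, FILE_IGNORE_NEW_LINES);

//DELETE FIRST LINE FROM ARRAY
array_shift($oldfiledata);
$newfiledata = implode("\n", $oldfiledata);

$file = fopen($filename, "w");
fwrite($file, $newfiledata);
fclose($file);

// The seed read back from the file carries a "\n" too; trim it off.
$var1 = file($filename);
$seed = trim($var1[0]);
?>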


