
[SOLVED] Simple Spider


advancedfuture


I am writing a simple spider that will grab links off a page. So far I have no problem grabbing all the links off the first seed page, but I am stuck on how to get it to follow those links to the next page and grab additional links perpetually.

 

Here's what I've written so far.  ;D

 

<?php
// Fetch the seed page and pull out every absolute http:// link it contains.
$seed = 'http://www.dmoz.org/Arts/Animation/Anime/Distribution/Companies/';
$data = file_get_contents($seed);
if (preg_match_all("/http:\/\/[^\"\s']+/", $data, $links))
{
	for ($i = 0; $i < count($links[0]); $i++)
	{
		echo "<font size=\"2\" face=\"verdana\">".$links[0][$i]."</font><br>";
	}
}
?>

 

 


Create a function and call it recursively... But be warned: you can enter an endless loop here, so don't use set_time_limit(0). Also, calling a function recursively can get slow in PHP after more than a few levels, so keep that in mind too.

 

<?php

$seed = 'http://www.dmoz.org/Arts/Animation/Anime/Distribution/Companies/';
spider_man($seed);

function spider_man($url)
{
	echo "Following ".$url." <br>\n";
	$data = file_get_contents($url);
	if (preg_match_all("/http:\/\/[^\"\s']+/", $data, $links))
	{
		foreach ($links[0] as $link)
		{
			spider_man($link);
		}
	}
}

?>
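
As written, though, this will happily recurse into the same pages forever. A minimal guard, as a sketch only: the $visited list and the depth cap below are illustrative additions, not part of the snippet above.

<?php
// Sketch only: $visited and the depth cap are illustrative additions.
$visited = array();

function spider_man($url, $depth = 0)
{
	global $visited;

	// Stop once we are too deep or have already seen this URL.
	if ($depth > 3 || isset($visited[$url])) {
		return;
	}
	$visited[$url] = true;

	echo "Following ".$url." <br>\n";
	$data = @file_get_contents($url);  // @ suppresses warnings on dead links
	if ($data !== false && preg_match_all("/http:\/\/[^\"\s']+/", $data, $links)) {
		foreach ($links[0] as $link) {
			spider_man($link, $depth + 1);
		}
	}
}

spider_man('http://www.dmoz.org/Arts/Animation/Anime/Distribution/Companies/');
?>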

 

 

Orio.


Yeah, I was working more on it. I was trying to output the links to a cache file, then on the next loop call the first line of the cache file and make that URL the new seed. Only problem I am having with my wife. Is when the loop starts over again.. The file gets emptied, so it's deleting the data in cache.txt before looping.

 

See my code below (sorry, hope my comments are sufficient!):

 

spider.php

 

<?php

//INITIAL SEED STARTING SITE
$seed = 'http://www.dmoz.org/Arts/Animation/Anime/Distribution/Companies/';

//CACHE FILE FOR LINK QUEUE
$filename = "cache.txt";

//START SPIDER LOOP
for ($i = 1; $i < 3; $i++)
{
	$data = file_get_contents($seed);
	if (preg_match_all("/http:\/\/[^\"\s']+/", $data, $links))
	{
		// WRITE PAGE LINKS TO CACHE FILE & PRINT LINK TO SCREEN
		// (separate counter $j so the outer loop's $i is untouched)
		for ($j = 0; $j < count($links[0]); $j++)
		{
			$q = $links[0][$j] . "\n";
			$f = fopen($filename, "a");
			fwrite($f, $q);
			print $q . "<br>"; //print lines to screen
			fclose($f);
		}
	}

	//GET NEXT LINK FROM CACHE AND PLACE IN SEED THEN DELETE LINK FROM CACHE.

	//IMPORT CACHE.TXT TO ARRAY
	$oldfiledata = file($filename);

	//SET SEED TO FIRST LINE OF CACHE
	$seed = $oldfiledata[0];

	//DELETE FIRST LINE FROM ARRAY
	$newfiledata = array_shift($oldfiledata);

	//OPEN CACHE.TXT AND IMPORT NEW ARRAY DATA
	$file = fopen($filename, "w");
	fwrite($file, $newfiledata);
	fclose($file);

} //Close Loop

?>

 

 
.....

/edit

 

LOL, I just realized how tired I was when I wrote above:

Only problem I am having with my wife.

I don't think that's the actual issue in my code.

 

I think it's more along the lines of the bottom section of the code (I modified it some more).

 

The $seed value keeps getting returned as:

"Arrayhttp://report-abuse.dmoz.org/?cat=Arts/Animation/Anime/Distribution/Companies"

 

The inserted word "Array" is probably screwing it up big time: PHP converts an array to the literal string "Array" whenever one is used as a string, so whatever is being written to the cache is still an array at that point.

 

//GET NEXT LINK FROM CACHE AND PLACE IN SEED THEN DELETE LINK FROM CACHE.

//IMPORT CACHE.TXT TO ARRAY
$oldfiledata = file($filename);

//SET SEED TO FIRST LINE OF CACHE
$seed = $oldfiledata[0];
print "<b>" . $oldfiledata[0] . "</b>";

//DELETE FIRST LINE FROM ARRAY
$oldfiledata = array_shift($oldfiledata);

//OPEN CACHE.TXT AND IMPORT NEW ARRAY DATA
for ($j = 0; $j < count($oldfiledata[0]); $j++)
{
	$q = $oldfiledata[0][$j] . "\n";
	$file = fopen($filename, "w");
	fwrite($file, $oldfiledata);
	fclose($file);
}

} //Close Loop

?>
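
For the record, the "Array" prefix is the telltale here: array_shift() removes the first element in place and returns that element, so $oldfiledata = array_shift($oldfiledata); replaces the whole queue with a single URL string, and anything that later treats an array as a string comes out as the literal "Array". A corrected pop might look like this sketch, reusing the variable names from the post above:

<?php
// Sketch of the intended queue pop (variable names follow the post above).
$filename = "cache.txt";

//IMPORT CACHE.TXT TO ARRAY (one element per line)
$oldfiledata = file($filename);

// array_shift() removes the first element IN PLACE and returns it:
// capture the URL here; $oldfiledata keeps the rest of the queue.
$seed = trim(array_shift($oldfiledata));

//WRITE THE REMAINING QUEUE BACK (file() already kept each trailing "\n")
$file = fopen($filename, "w");
fwrite($file, implode("", $oldfiledata));
fclose($file);
?>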


OK, fixed the problem with it not looping and resetting the seed!

 

The file is looping and following all links. Here's my final code. The ONLY problem now is that the cache.txt file has HUGE amounts of white space between entries. I don't know where it's putting all those carriage returns in from. It should just be line after line.

 

Final code, spider.php:

<?php

//INITIAL SEED STARTING SITE
$seed = 'http://www.dmoz.org/Arts/Animation/Anime/Distribution/Companies/';

//CACHE FILE FOR LINK QUEUE
$filename = "cache.txt";

//START SPIDER LOOP
for ($i = 1; $i < 2000; $i++)
{
	$data = file_get_contents($seed);
	if (preg_match_all("/http:\/\/[^\"\s']+/", $data, $links))
	{
		// WRITE PAGE LINKS TO CACHE FILE & PRINT LINK TO SCREEN
		// (separate counter $j so the outer loop's $i is untouched)
		for ($j = 0; $j < count($links[0]); $j++)
		{
			$q = $links[0][$j] . "\n";
			$f = fopen($filename, "a");
			fwrite($f, $q);
			print $q . "<br>"; //print lines to screen
			fclose($f);
		}
	}

	//GET NEXT LINK FROM CACHE AND PLACE IN SEED THEN DELETE LINK FROM CACHE.

	//IMPORT CACHE.TXT TO ARRAY
	$oldfiledata = file($filename);

	//DELETE FIRST LINE FROM ARRAY
	array_shift($oldfiledata);
	$newfiledata = implode("\n", $oldfiledata);

	//OPEN CACHE.TXT AND IMPORT NEW ARRAY DATA
	$file = fopen($filename, "w");
	fwrite($file, $newfiledata);
	fclose($file);

	//SET SEED TO FIRST LINE OF CACHE
	$var1 = file($filename);
	$seed = $var1[0];

} //Close Loop

?>
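
A likely cause of the blank lines: file() keeps each line's trailing newline, so implode("\n", ...) puts a second line break between every pair of entries. One possible fix, as a sketch (the FILE_IGNORE_NEW_LINES flag requires PHP 5+):

<?php
$filename = "cache.txt";

// FILE_IGNORE_NEW_LINES strips each line's trailing newline on read,
// so the implode() below is the only thing adding line breaks back.
$oldfiledata = file($filename, FILE_IGNORE_NEW_LINES);

//DELETE FIRST LINE FROM ARRAY
array_shift($oldfiledata);
$newfiledata = implode("\n", $oldfiledata);

$file = fopen($filename, "w");
fwrite($file, $newfiledata);
fclose($file);

// The seed read back from the file carries a "\n" too; trim it off.
$var1 = file($filename);
$seed = trim($var1[0]);
?>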


