Web Crawler: How to

logged_with_bugmenot · July 11, 2007

I want to search certain remote sites for certain types of files. I don't know where to start. Any ideas or links?

AbydosGater · July 11, 2007

Firstly: A script like this would need to be done on the command line (CLI).

You could do it like this:

1. Open a socket to each of your predefined servers.

2. Request the index page of the site with an HTTP "GET /" request.

3. Save all the HTML the server sends back into a variable.

4. Scan the variable for links (the "href" attributes of "a" tags), parse them, and save all the files (and folders, if you're going advanced) into an array (or arrays, e.g. $files[], $folders[]).

5. Go through all the links you have saved and pull out the ones that end in the file type you want. Save the full URLs of those to another array.

6. At this stage you have all the files you wanted from that site (or at least the ones linked on the page you requested) saved in a variable.

7. Now continue the loop: request each of the pages you have found and scan them for links in the same way.

And so it keeps going, until it doesn't find any more files, or your server crashes.
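The steps above can be sketched roughly as follows. This is only an illustration, not a tested crawler: the function names (extractLinks, resolveUrl, crawl), the page limit and the same-host restriction are assumptions added here so the loop stops before "your server crashes", and the URL resolution is deliberately simplified (it does not handle "../" segments). The page fetcher is passed in as a callable, so a socket, cURL or file_get_contents could be plugged in.

```php
<?php
// Step 4: pull the href of every <a> tag out of an HTML string.
function extractLinks(string $html): array
{
    if ($html === '') {
        return [];
    }
    $dom = new DOMDocument();
    $previous = libxml_use_internal_errors(true); // tolerate messy real-world HTML
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $links = [];
    foreach ($dom->getElementsByTagName('a') as $a) {
        $href = trim($a->getAttribute('href'));
        if ($href !== '') {
            $links[] = $href;
        }
    }
    return $links;
}

// Step 5 ("full URLs"): simplified resolution of an href against the page URL.
// Returns null for links that cannot be crawled (mailto:, javascript:, ...).
function resolveUrl(string $href, string $base): ?string
{
    $href = (string) preg_replace('/#.*$/', '', $href); // drop any #fragment
    if ($href === '') {
        return null;
    }
    if (preg_match('#^https?://#i', $href)) {
        return $href; // already absolute
    }
    if (preg_match('#^[a-z][a-z0-9+.-]*:#i', $href)) {
        return null; // some other scheme
    }
    $parts = parse_url($base);
    if ($parts === false || !isset($parts['scheme'], $parts['host'])) {
        return null;
    }
    $origin = $parts['scheme'] . '://' . $parts['host']
        . (isset($parts['port']) ? ':' . $parts['port'] : '');
    if (strpos($href, '//') === 0) {
        return $parts['scheme'] . ':' . $href; // protocol-relative link
    }
    if ($href[0] === '/') {
        return $origin . $href; // root-relative link
    }
    $path = $parts['path'] ?? '/';
    $dir = substr($path, 0, strrpos($path, '/') + 1); // directory of the current page
    return $origin . $dir . $href;
}

// Steps 1-3 and 7: breadth-first loop over a queue of URLs.
// $fetch is any callable that returns the page HTML as a string, or null on failure.
function crawl(string $startUrl, callable $fetch, int $maxPages = 50): array
{
    $host = parse_url($startUrl, PHP_URL_HOST);
    $queue = [$startUrl];
    $seen = [$startUrl => true]; // every URL discovered so far
    $requested = 0;

    while ($queue && $requested < $maxPages) {
        $url = array_shift($queue);
        $requested++;
        $html = $fetch($url);
        if (!is_string($html)) {
            continue; // fetch failed, nothing to scan
        }
        foreach (extractLinks($html) as $href) {
            $absolute = resolveUrl($href, $url);
            if ($absolute === null || isset($seen[$absolute])) {
                continue;
            }
            if (parse_url($absolute, PHP_URL_HOST) !== $host) {
                continue; // assumption: stay on the site we started from
            }
            $seen[$absolute] = true;
            $queue[] = $absolute;
        }
    }
    return array_keys($seen);
}
```

The $seen array is what keeps the loop from requesting the same page twice when pages link back to each other, and $maxPages puts an upper bound on the number of requests.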

[PS: I have never done this before, so sorry if I have made a mistake.]

Andy

per1os · July 11, 2007

AbydosGater wrote: "Firstly: A script like this would need to be done on the command line (CLI)."

Not really. You just need to use file_get_contents (www.php.net/file_get_contents).
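As a rough sketch of that suggestion, a page can be fetched with file_get_contents and an HTTP stream context. The function name fetchPage, the user agent string and the 10 second timeout are made up for this example, and fetching over HTTP this way assumes allow_url_fopen is enabled in php.ini.

```php
<?php
// Fetch a page body with file_get_contents(); returns null on failure.
// fetchPage, the user agent and the default timeout are illustrative assumptions.
function fetchPage(string $url, int $timeoutSeconds = 10): ?string
{
    $context = stream_context_create([
        'http' => [
            'method'          => 'GET',
            'timeout'         => $timeoutSeconds,     // give up on slow servers
            'user_agent'      => 'ExampleCrawler/0.1',
            'follow_location' => 1,                   // follow HTTP redirects
        ],
    ]);

    // @ hides the warning PHP raises on failure; the false return value is checked instead.
    $body = @file_get_contents($url, false, $context);
    return $body === false ? null : $body;
}
```

Calling fetchPage('http://www.somesite.com/') would then give back the HTML as a string, or null if the request failed, with no socket handling needed.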

The tricky part is that you have to grab all the links from the site and then open them up too, parse them, and so on. I created something similar in C#, but that was a pain in the ass.

To be honest, with PHP's timeouts etc. you are better off writing a desktop application in a language such as C++ or Java.
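If you do stay with PHP, one way to live with the execution time limit is to work in batches. The sketch below is only one possible approach, under the assumption that unfinished URLs can be stored somewhere and picked up by the next run; processWithBudget is a hypothetical helper, not a built-in. Note that max_execution_time defaults to 30 seconds for web requests, while the CLI defaults to no limit.

```php
<?php
// Ask PHP to lift the script time limit; some hosts may not allow this.
set_time_limit(0);

// Handle queued URLs until the time budget runs out, then hand back the rest.
// $handle is any callable that processes one URL (fetch it, parse it, store links).
function processWithBudget(array $queue, callable $handle, float $budgetSeconds): array
{
    $deadline = microtime(true) + $budgetSeconds;
    while ($queue && microtime(true) < $deadline) {
        $handle(array_shift($queue));
    }
    return $queue; // URLs still waiting: save them and resume on the next run
}
```

The budget is only checked between URLs, so each individual request still needs its own timeout.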

But either way have fun.

AbydosGater · July 11, 2007

Frost: Ohh, I actually didn't think of that. It would be much easier to do it that way, sorry. Good thinking.

I would also agree. I have never heard of a fully finished web spider/crawler in PHP. Maybe it has been done, but anyone who wants one uses other languages, like Frost said: C# or others.

Andy

AbydosGater · August 28, 2007

Hey, I know this topic is dead, but you never know, this might be helpful if you're still looking for an answer. You could use something like this; it will gather all the links for you. Then you just have to go through each link and see what it ends with. If it ends with what you want, save it to an array.

Andy

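<?php
// Database connection. The DSN, user name and password are placeholders;
// adjust them for your own server. The links table needs url and gathered_from columns.
$pdo = new PDO('mysql:host=localhost;dbname=crawler;charset=utf8mb4', 'db_user', 'db_password');
$pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);

// Store one link together with the page it was gathered from.
// A prepared statement is used because mysql_query() was removed in PHP 7
// and building the query by string interpolation allowed SQL injection.
function storeLink(PDO $pdo, $url, $gathered_from) {
    $stmt = $pdo->prepare('INSERT INTO links (url, gathered_from) VALUES (?, ?)');
    $stmt->execute([$url, $gathered_from]);
}

$target_url = "http://www.somesite.com/";
$userAgent = 'Your Web Crawlers Name';

// make the cURL request to $target_url
$ch = curl_init();
curl_setopt($ch, CURLOPT_USERAGENT, $userAgent);
curl_setopt($ch, CURLOPT_URL, $target_url);
curl_setopt($ch, CURLOPT_FAILONERROR, true);     // treat HTTP errors (4xx/5xx) as failures
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);  // follow redirects
curl_setopt($ch, CURLOPT_AUTOREFERER, true);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);  // return the body instead of printing it
curl_setopt($ch, CURLOPT_TIMEOUT, 10);
$html = curl_exec($ch);
if (!$html) { // false on failure, or an empty body
    echo "cURL error number: " . curl_errno($ch) . "\n";
    echo "cURL error: " . curl_error($ch) . "\n";
    exit(1);
}

// parse the HTML into a DOMDocument (@ silences warnings about malformed markup)
$dom = new DOMDocument();
@$dom->loadHTML($html);

// grab all the links on the page
$xpath = new DOMXPath($dom);
$hrefs = $xpath->evaluate("/html/body//a");

for ($i = 0; $i < $hrefs->length; $i++) {
    $href = $hrefs->item($i);
    $url = $href->getAttribute('href');
    storeLink($pdo, $url, $target_url);
    echo "Link stored: $url\n";
}
?>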
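
The remaining step described in the post, going through each link and seeing what it ends with, might look something like the sketch below. The function name filterByExtension and the example extension are assumptions for illustration; the check is done on the path part of the URL so that query strings and fragments do not get in the way.

```php
<?php
// Keep only the URLs whose path ends in one of the wanted file extensions.
// filterByExtension is a hypothetical helper, not part of the script above.
function filterByExtension(array $urls, array $extensions): array
{
    $wanted = array_map('strtolower', $extensions);
    $matches = [];
    foreach ($urls as $url) {
        // Look at the path only, so "?download=1" or "#page=2" are ignored.
        $path = parse_url($url, PHP_URL_PATH);
        if (!is_string($path)) {
            continue;
        }
        $ext = strtolower(pathinfo($path, PATHINFO_EXTENSION));
        if ($ext !== '' && in_array($ext, $wanted, true)) {
            $matches[] = $url; // save it to the result array
        }
    }
    return $matches;
}
```

For example, filterByExtension($links, ['pdf']) would keep a link such as http://www.somesite.com/files/report.PDF?download=1 and drop http://www.somesite.com/index.html.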
