Firstly: A script like this would need to be done on the command line (CLI).

 

You could do it like this:

1. Open a socket to each of your predefined servers.

2. Request the index of the site with an HTTP "GET /" request.

3. Save all the HTML the server sends back into a variable.

4. Scan the variable for links (the "href" attributes of the "a" tags), parse them, and save all the files (and folders, if you're going advanced) into an array (or arrays, e.g. $files[], $folders[]).

5. Go through all the links you have saved and pull out the ones that end in the file type you want. Save the full URLs of those to another array.

6. At this stage you have all the files you wanted from that site (or at least the ones linked on the page you requested) saved in a variable.

7. Now continue the loop: request each of the linked pages you have and scan them for links in the same way...
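Steps 1 to 3 could be sketched roughly like this. It is only a minimal sketch: the function names are made up for illustration, it speaks plain HTTP/1.0 on port 80 (no HTTPS, no redirects), and it assumes the server answers with a normal header block followed by the body.

```php
<?php
// Build a minimal HTTP/1.0 request for one path (step 2).
// HTTP/1.0 is used so the server does not reply with chunked encoding.
function buildGetRequest($host, $path = '/') {
    return "GET $path HTTP/1.0\r\n"
         . "Host: $host\r\n"
         . "Connection: close\r\n\r\n";
}

// Split a raw HTTP response into its header block and its body (step 3).
function splitResponse($raw) {
    $parts = explode("\r\n\r\n", $raw, 2);
    return array(
        'headers' => $parts[0],
        'body'    => isset($parts[1]) ? $parts[1] : '',
    );
}

// Open a socket, send the request and read everything the server sends back.
// Returns the response body, or false if the connection could not be opened.
function fetchViaSocket($host, $path = '/', $port = 80, $timeout = 10) {
    $fp = @fsockopen($host, $port, $errno, $errstr, $timeout);
    if ($fp === false) {
        return false;
    }
    fwrite($fp, buildGetRequest($host, $path));
    $raw = stream_get_contents($fp);
    fclose($fp);

    $response = splitResponse($raw);
    return $response['body'];
}
```

Calling fetchViaSocket('www.somesite.com') would then give you the HTML of the index page in a variable, ready for the link scan in step 4.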

 

And it keeps going on and on until it doesn't find any more files, or until your server crashes :P
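To keep the loop from running until the server falls over, one option is a queue plus a list of pages already visited, with a hard limit on the page count. The sketch below is an assumption about how that might look, not a finished crawler: crawl(), extractLinks() and the $fetch callable are hypothetical names, and it assumes every link is already a full URL (relative links are not resolved).

```php
<?php
// Pull the href of every <a> element out of an HTML string (step 4).
function extractLinks($html) {
    $links = array();
    if (trim($html) === '') {
        return $links;
    }
    $dom = new DOMDocument();
    @$dom->loadHTML($html); // @ hides warnings about malformed markup
    foreach ($dom->getElementsByTagName('a') as $a) {
        $href = $a->getAttribute('href');
        if ($href !== '') {
            $links[] = $href;
        }
    }
    return $links;
}

// Breadth-first crawl with a visited set and a page limit (step 7).
// $fetch is a hypothetical callable: it takes a URL and returns HTML or false.
function crawl($startUrl, $fetch, $maxPages = 50) {
    $queue   = array($startUrl);
    $visited = array();

    while ($queue && count($visited) < $maxPages) {
        $url = array_shift($queue);
        if (isset($visited[$url])) {
            continue; // already requested this page
        }
        $visited[$url] = true;

        $html = call_user_func($fetch, $url);
        if ($html === false) {
            continue; // fetch failed, move on to the next URL
        }
        foreach (extractLinks($html) as $link) {
            if (!isset($visited[$link])) {
                $queue[] = $link;
            }
        }
    }
    return array_keys($visited);
}
```

Passing the fetch step in as a callable means you can swap sockets, cURL or anything else in without touching the loop, and the visited set is what stops two pages that link to each other from being requested forever.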

 

[PS: I have never done this before, so sorry if I have made a mistake.]

Andy


"Firstly: A script like this would need to be done on the command line (CLI)."

 

Not really. You just need to use file_get_contents (www.php.net/file_get_contents).
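For example, something along these lines. This is only a sketch: fetchPage() and the user agent string are made-up names, and fetching an http:// URL this way assumes allow_url_fopen is enabled in php.ini.

```php
<?php
// Fetch a page with file_get_contents.
// Returns the HTML as a string, or false if the request failed.
function fetchPage($url, $timeout = 10) {
    // The 'http' options only apply when $url is an http:// or https:// URL.
    $context = stream_context_create(array(
        'http' => array(
            'method'     => 'GET',
            'timeout'    => $timeout,             // seconds before giving up
            'user_agent' => 'ExampleCrawler/0.1', // placeholder name
        ),
    ));
    return @file_get_contents($url, false, $context);
}
```

A call like fetchPage('http://www.somesite.com/') gives you the whole page in one string, with no socket handling of your own.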

 

The tricky part is that you have to grab all the links from the site and then open them up too, parse them, etc. I created something similar in C#, but that was a pain in the ass.

 

To be honest, with PHP timeouts etc. you are better off coding it as a desktop application in a language such as C++ or Java.
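If you do stay with PHP, the timeout can be worked around to a degree. From the command line max_execution_time defaults to 0 (no limit), and in a web request you can call set_time_limit(). Another option is to crawl in small batches with a time budget and carry the rest of the queue over to the next run. A rough sketch of the batch idea, where crawlBatch() and $handler are hypothetical names:

```php
<?php
// Process queued URLs until the time budget is used up, then return whatever
// is left so that a later run can pick it up.
// $handler is a hypothetical callable that does the work for one URL.
function crawlBatch(array $queue, $handler, $budgetSeconds = 20) {
    $deadline = microtime(true) + $budgetSeconds;

    while ($queue && microtime(true) < $deadline) {
        call_user_func($handler, array_shift($queue));
    }
    return $queue; // URLs that were not reached in this run
}
```

The leftover queue would have to be saved somewhere (a file or a database table) between runs. Note that the budget is only checked between URLs, so one slow request can still overrun it.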

 

But either way have fun.


Frost: Ohhh, actually I didn't think of that. It would be much easier to do it that way :P Sorry. Good thinking.

 

I would also agree. I have never heard of a fully finished web spider/crawler in PHP. Maybe it has been done, but anyone who wants one uses other languages, like C# as frost said, or others.

 

Andy


  • 1 month later...

Hey, I know this topic is dead, but you never know, it might be helpful if you're still looking for an answer. You could use something like this; it will gather all the links for you. Then you just have to go through each link and see what it ends with. If it ends with what you want, save it to an array.

 

Andy

 

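<?php
// Assumption: the original post never opened a database connection, so the
// DSN and credentials below are placeholders. The mysql_* functions were
// removed in PHP 7, so PDO with a prepared statement is used instead, which
// also keeps quotes in a URL from breaking the query.
// A "links" table with "url" and "gathered_from" columns must already exist.
$pdo = new PDO('mysql:host=localhost;dbname=crawler', 'db_user', 'db_pass');
$pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);

// store one link together with the page it was gathered from
function storeLink(PDO $pdo, $url, $gathered_from) {
    $stmt = $pdo->prepare('INSERT INTO links (url, gathered_from) VALUES (?, ?)');
    $stmt->execute(array($url, $gathered_from));
}

$target_url = "http://www.somesite.com/";
$userAgent = 'Your Web Crawlers Name';

// make the cURL request to $target_url
$ch = curl_init();
curl_setopt($ch, CURLOPT_USERAGENT, $userAgent);
curl_setopt($ch, CURLOPT_URL, $target_url);
curl_setopt($ch, CURLOPT_FAILONERROR, true);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($ch, CURLOPT_AUTOREFERER, true);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_TIMEOUT, 10);
$html = curl_exec($ch);
if (!$html) {
    echo "cURL error number: " . curl_errno($ch) . "\n";
    echo "cURL error: " . curl_error($ch) . "\n";
    exit(1);
}

// parse the html into a DOMDocument (@ hides warnings about malformed markup)
$dom = new DOMDocument();
@$dom->loadHTML($html);

// grab all the links on the page
$xpath = new DOMXPath($dom);
$hrefs = $xpath->evaluate("/html/body//a");

for ($i = 0; $i < $hrefs->length; $i++) {
    $href = $hrefs->item($i);
    $url = $href->getAttribute('href');
    storeLink($pdo, $url, $target_url);
    echo "Link stored: $url\n";
}
?>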
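
The "see what it ends with" step could be sketched like this. filterByExtension() is a made-up helper, not part of the script above; it looks only at the path part of each link, so a query string such as ?dl=1 does not get in the way, and it compares extensions case-insensitively.

```php
<?php
// Keep only the links whose path ends in one of the wanted extensions.
// $extensions is a list of lower-case extensions without the dot, e.g. array('mp3').
function filterByExtension(array $links, array $extensions) {
    $wanted = array();
    foreach ($links as $link) {
        $path = parse_url($link, PHP_URL_PATH);
        if (!is_string($path)) {
            continue; // no path at all, or a URL that could not be parsed
        }
        $ext = strtolower(pathinfo($path, PATHINFO_EXTENSION));
        if (in_array($ext, $extensions, true)) {
            $wanted[] = $link;
        }
    }
    return $wanted;
}
```

You would collect the URLs into an array in the loop (or read them back from the links table) and pass that array in along with the extensions you are after.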
