
Web spider/crawler using PHP


Gruzin

Recommended Posts

Hey guys,

 

I'm working on a project that includes developing a search engine site, so I need to write a program which will index (crawl) pages and store a cache in a database, just like Google does.

 

Is it possible with PHP? And where do I have to start?

 

Thanks,

George


This is a bot detector I found on some site; I don't know if it will be useful to you:


<?php

    // List of user-agent substrings that identify known crawlers.
    $botlist = array(
        "Teoma", "alexa", "froogle", "inktomi", "looksmart",
        "URL_Spider_SQL", "Firefly", "NationalDirectory", "Ask Jeeves",
        "TECNOSEEK", "InfoSeek", "WebFindBot", "girafabot", "crawler",
        "www.galaxy.com", "Googlebot", "Scooter", "Slurp", "appie",
        "FAST", "WebBug", "Spade", "ZyBorg", "rabaz");

    foreach ($botlist as $bot) {

        // stripos() replaces the deprecated ereg(); the old
        // $HTTP_USER_AGENT-style globals are replaced with $_SERVER.
        if (stripos($_SERVER['HTTP_USER_AGENT'], $bot) !== false) {

            if ($bot == "Googlebot") {
                if (substr($_SERVER['REMOTE_HOST'], 0, 11) == "216.239.46.") {
                    $bot = "Googlebot Deep Crawl";
                } elseif (substr($_SERVER['REMOTE_HOST'], 0, 7) == "64.68.8") {
                    $bot = "Google Freshbot";
                }
            }

            $url = "http://" . $_SERVER['SERVER_NAME'] . $_SERVER['PHP_SELF'];
            if ($_SERVER['QUERY_STRING'] != "") {
                $url .= "?" . $_SERVER['QUERY_STRING'];
            }

            // settings
            $to      = "[email protected]";
            $subject = "Detected: $bot on $url";
            $body    = "$bot was detected on $url\n\n"
                     . "Date.............: " . date("F j, Y, g:i a") . "\n"
                     . "Page.............: " . $url . "\n"
                     . "Robot Name.......: " . $_SERVER['HTTP_USER_AGENT'] . "\n"
                     . "Robot Address....: " . $_SERVER['REMOTE_ADDR'] . "\n"
                     . "Robot Host.......: " . $_SERVER['REMOTE_HOST'] . "\n";

            mail($to, $subject, $body);
        }
    }

?>

Ted

Look at http://uk2.php.net/curl

 

Pop a url from the queue ( read this : http://en.wikipedia.org/wiki/Random_walk )

 

Curl grabs the page text. Then use either:

 

- simple regex to get URLs

- HTML document parser

 

Put the *unseen* URLs into the queue. Then either:

 

- Write page report

- Update your google-like database

- Spam my comments

 

(that last one is a joke  :) )

 

keep looping until you have read the 'net.
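The steps above can be sketched as a simple loop. This is my own minimal version, with a hypothetical fetch_page() stub standing in for cURL (the stub serves two hard-coded pages so the sketch runs without network access; a real crawler would do a curl_exec() there):

```php
<?php

// Stand-in for a cURL fetch; returns the raw HTML of a page.
function fetch_page($url) {
    $fake_web = array(
        "http://example.com/"  => '<a href="http://example.com/a">A</a>',
        "http://example.com/a" => '<a href="http://example.com/">home</a>',
    );
    return isset($fake_web[$url]) ? $fake_web[$url] : "";
}

// Crawl loop: pop a URL, grab the page text, extract links,
// queue the *unseen* ones, repeat until the queue is empty.
function crawl($seed, $max_pages = 100) {
    $queue   = array($seed);
    $seen    = array($seed => true);
    $visited = array();

    while (count($queue) > 0 && count($visited) < $max_pages) {
        $url  = array_shift($queue);   // pop a URL from the queue
        $html = fetch_page($url);
        $visited[] = $url;

        // Simple regex to get URLs out of the page text.
        if (preg_match_all('/href="([^"]+)"/i', $html, $m)) {
            foreach ($m[1] as $link) {
                if (!isset($seen[$link])) {
                    $seen[$link] = true;
                    $queue[] = $link;
                }
            }
        }
    }
    return $visited;
}
```

The $seen array is what keeps the walk from looping forever on pages that link back to each other; in a real crawler that set would live in the database.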

 

monk.e.boy

You will first need a socket streaming system that just fetches pages and holds them in the cache; all that class would do is dump its output into the extractor directory. The extractor would then index each page and load all the links it found into the socket stream object: if the stream object is running, it would add all those URLs to its interface, keyed by the id of the page that contained them; if not, it would be started. I have all kinds of functions and classes that can help you; if you want them, PM me your email and I will send them to you.

 

Here is an example of the extractor class, which is very efficient. It converts all links to full, verified URLs and checks each link after converting it (../path/page.html => http://www.site.com/path/page.html); if element[0] equals 0, the link is active (HTTP/1.1 200). It uses socket streaming to load 100 URLs at a time (you can load even more if you want), and the class can set how many connections per server are allowed, so you don't overload a single server you are connecting to.
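The relative-to-absolute conversion step can be sketched on its own. This is my own rough helper, not the poster's class; resolve_url() is a hypothetical name, and it only handles the common cases (already-absolute links, root-relative paths, and ../ segments):

```php
<?php

// Resolve a link found on $base into a full URL.
// e.g. ../path/page.html on http://www.site.com/dir/page.html
//      => http://www.site.com/path/page.html
function resolve_url($base, $link) {
    if (preg_match('#^https?://#i', $link)) {
        return $link;                        // already absolute
    }

    $parts  = parse_url($base);
    $origin = $parts['scheme'] . '://' . $parts['host'];

    if (substr($link, 0, 1) == '/') {
        return $origin . $link;              // root-relative path
    }

    // Directory of the base page, e.g. /dir/ for /dir/page.html
    $dir = isset($parts['path'])
         ? preg_replace('#/[^/]*$#', '/', $parts['path'])
         : '/';

    // Collapse ./ and ../ segments.
    $segments = array();
    foreach (explode('/', $dir . $link) as $seg) {
        if ($seg == '' || $seg == '.') continue;
        if ($seg == '..') array_pop($segments);
        else $segments[] = $seg;
    }
    return $origin . '/' . implode('/', $segments);
}
```

Verifying that the resolved URL is actually alive (the HTTP 200 check described above) would be a separate fetch step on top of this.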

 

// php.net home page (extractor) done in real time

 

http://www.ya-right.com/extractor.php

 

 

orintf

Archived

This topic is now archived and is closed to further replies.

