
Web spider/crawler using PHP


Gruzin

Recommended Posts

Hey guys,

 

I'm working on a project that includes developing a search engine site, so I need to write a program which will index (crawl) pages and store a cache in a database, just like Google does.

 

Is it possible with PHP? And where do I have to start?

 

Thanks,

George


This is a bot detector I found on some site; I don't know if it will be useful to you:


<?php

    // List of user-agent substrings that identify known crawlers.
    $botlist = array(
        "Teoma", "alexa", "froogle", "inktomi", "looksmart",
        "URL_Spider_SQL", "Firefly", "NationalDirectory", "Ask Jeeves",
        "TECNOSEEK", "InfoSeek", "WebFindBot", "girafabot", "crawler",
        "www.galaxy.com", "Googlebot", "Scooter", "Slurp", "appie",
        "FAST", "WebBug", "Spade", "ZyBorg", "rabaz");

    foreach ($botlist as $bot) {

        // stripos() replaces the deprecated ereg(); the old
        // $HTTP_USER_AGENT-style globals are replaced with $_SERVER.
        if (stripos($_SERVER['HTTP_USER_AGENT'], $bot) !== false) {

            if ($bot == "Googlebot") {
                if (substr($_SERVER['REMOTE_HOST'], 0, 11) == "216.239.46.") {
                    $bot = "Googlebot Deep Crawl";
                } elseif (substr($_SERVER['REMOTE_HOST'], 0, 7) == "64.68.8") {
                    $bot = "Google Freshbot";
                }
            }

            $url = "http://" . $_SERVER['SERVER_NAME'] . $_SERVER['PHP_SELF'];
            if ($_SERVER['QUERY_STRING'] != "") {
                $url .= "?" . $_SERVER['QUERY_STRING'];
            }

            // settings
            $to      = "[email protected]";
            $subject = "Detected: $bot on $url";
            $body    = "$bot was detected on $url\n\n"
                     . "Date.............: " . date("F j, Y, g:i a") . "\n"
                     . "Page.............: " . $url . "\n"
                     . "Robot Name.......: " . $_SERVER['HTTP_USER_AGENT'] . "\n"
                     . "Robot Address....: " . $_SERVER['REMOTE_ADDR'] . "\n"
                     . "Robot Host.......: " . $_SERVER['REMOTE_HOST'] . "\n";

            mail($to, $subject, $body);
        }
    }

?>

Ted

Look at http://uk2.php.net/curl

 

Pop a url from the queue ( read this : http://en.wikipedia.org/wiki/Random_walk )

 

Curl grabs the page text. Then use either:

 

- simple regex to get URLs

- HTML document parser

 

Put the *unseen* URLs into the queue. Then either:

 

- Write page report

- Update your google-like database

- Spam my comments

 

(that last one is a joke  :) )

 

keep looping until you have read the 'net.
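The steps above can be sketched as a simple loop. This is my own minimal version, with a hypothetical fetch_page() stub standing in for cURL (the stub serves two hard-coded pages so the sketch runs without network access; a real crawler would do a curl_exec() there):

```php
<?php

// Stand-in for a cURL fetch; returns the raw HTML of a page.
function fetch_page($url) {
    $fake_web = array(
        "http://example.com/"  => '<a href="http://example.com/a">A</a>',
        "http://example.com/a" => '<a href="http://example.com/">home</a>',
    );
    return isset($fake_web[$url]) ? $fake_web[$url] : "";
}

// Crawl loop: pop a URL, grab the page text, extract links,
// queue the *unseen* ones, repeat until the queue is empty.
function crawl($seed, $max_pages = 100) {
    $queue   = array($seed);
    $seen    = array($seed => true);
    $visited = array();

    while (count($queue) > 0 && count($visited) < $max_pages) {
        $url  = array_shift($queue);   // pop a URL from the queue
        $html = fetch_page($url);
        $visited[] = $url;

        // Simple regex to get URLs out of the page text.
        if (preg_match_all('/href="([^"]+)"/i', $html, $m)) {
            foreach ($m[1] as $link) {
                if (!isset($seen[$link])) {
                    $seen[$link] = true;
                    $queue[] = $link;
                }
            }
        }
    }
    return $visited;
}
```

The $seen array is what keeps the walk from looping forever on pages that link back to each other; in a real crawler that set would live in the database.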

 

monk.e.boy

You will first need a socket streaming system that just fetches pages and holds them in the cache; all that class would do is dump its output into the extractor directory. The extractor would then index each page and load all the links it found into the socket stream object: if the stream object is running, it would add all those URLs to its interface, keyed by the id of the page that contained them; if not, it would be started. I have all kinds of functions and classes that can help you; if you want them, PM me your email and I will send them to you.

 

Here is an example of the extractor class, which is very efficient. It converts all links to full, verified URLs and checks each link after converting it (../path/page.html => http://www.site.com/path/page.html); if element[0] equals 0, the link is active (HTTP/1.1 200). It uses socket streaming to load 100 URLs at a time (you can load even more if you want), and the class can set how many connections per server are allowed, so you don't overload a single server you are connecting to.
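The relative-to-absolute conversion step can be sketched on its own. This is my own rough helper, not the poster's class; resolve_url() is a hypothetical name, and it only handles the common cases (already-absolute links, root-relative paths, and ../ segments):

```php
<?php

// Resolve a link found on $base into a full URL.
// e.g. ../path/page.html on http://www.site.com/dir/page.html
//      => http://www.site.com/path/page.html
function resolve_url($base, $link) {
    if (preg_match('#^https?://#i', $link)) {
        return $link;                        // already absolute
    }

    $parts  = parse_url($base);
    $origin = $parts['scheme'] . '://' . $parts['host'];

    if (substr($link, 0, 1) == '/') {
        return $origin . $link;              // root-relative path
    }

    // Directory of the base page, e.g. /dir/ for /dir/page.html
    $dir = isset($parts['path'])
         ? preg_replace('#/[^/]*$#', '/', $parts['path'])
         : '/';

    // Collapse ./ and ../ segments.
    $segments = array();
    foreach (explode('/', $dir . $link) as $seg) {
        if ($seg == '' || $seg == '.') continue;
        if ($seg == '..') array_pop($segments);
        else $segments[] = $seg;
    }
    return $origin . '/' . implode('/', $segments);
}
```

Verifying that the resolved URL is actually alive (the HTTP 200 check described above) would be a separate fetch step on top of this.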

 

// php.net home page (extractor) done in real time

 

http://www.ya-right.com/extractor.php

 

 

orintf

Archived

This topic is now archived and is closed to further replies.

