Gruzin Posted February 20, 2007
Hey guys, I'm working on a project which includes developing a search engine site. So I need to write a program which will index (crawl) pages and store a cache in a db, just like Google does. Is it possible with PHP? And where do I have to start?
Thanks, George
ted_chou12 Posted February 20, 2007
This is a bot detector. I found it on some site; I don't know if it will be useful to you:

<?php
// List of user-agent substrings belonging to known crawlers
$botlist = array("Teoma", "alexa", "froogle", "inktomi", "looksmart",
    "URL_Spider_SQL", "Firefly", "NationalDirectory", "Ask Jeeves",
    "TECNOSEEK", "InfoSeek", "WebFindBot", "girafabot", "crawler",
    "www.galaxy.com", "Googlebot", "Scooter", "Slurp", "appie",
    "FAST", "WebBug", "Spade", "ZyBorg", "rabaz");

$userAgent  = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';
// REMOTE_HOST is only set when the server does hostname lookups; fall back to the IP
$remoteHost = isset($_SERVER['REMOTE_HOST']) ? $_SERVER['REMOTE_HOST'] : $_SERVER['REMOTE_ADDR'];

foreach ($botlist as $bot) {
    if (stripos($userAgent, $bot) !== false) {
        if ($bot == "Googlebot") {
            if (substr($remoteHost, 0, 11) == "216.239.46.") {
                $bot = "Googlebot Deep Crawl";
            } elseif (substr($remoteHost, 0, 7) == "64.68.8") {
                $bot = "Google Freshbot";
            }
        }

        // Build the full URL of the page the bot requested
        $url = "http://" . $_SERVER['SERVER_NAME'] . $_SERVER['PHP_SELF'];
        if (!empty($_SERVER['QUERY_STRING'])) {
            $url .= "?" . $_SERVER['QUERY_STRING'];
        }

        // settings
        $to      = "[email protected]";
        $subject = "Detected: $bot on $url";
        $body    = "$bot was detected on $url\n\n"
                 . "Date.............: " . date("F j, Y, g:i a") . "\n"
                 . "Page.............: " . $url . "\n"
                 . "Robot Name.......: " . $userAgent . "\n"
                 . "Robot Address....: " . $_SERVER['REMOTE_ADDR'] . "\n"
                 . "Robot Host.......: " . $remoteHost . "\n";

        mail($to, $subject, $body);
    }
}
?>

Ted
Gruzin Posted February 20, 2007
Thanks Ted. Actually I want to write it myself, at least I want to try, so I just need to know how this whole thing works...
Regards, George
monk.e.boy Posted February 20, 2007
Look at http://uk2.php.net/curl

Pop a URL from the queue (read this: http://en.wikipedia.org/wiki/Random_walk). cURL grabs the page text. Then use either:
- a simple regex to get the URLs, or
- an HTML document parser,
and put the *unseen* URLs into the queue. Then either:
- write a page report,
- update your Google-like database, or
- spam my comments (that last one is a joke).

Keep looping until you have read the 'net.
monk.e.boy
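A minimal sketch of the loop described above, assuming the cURL extension is available. It keeps the queue and the "seen" list in memory and pulls links out with a regex; a real crawler would persist both to a database, resolve relative links against the current page, and respect robots.txt:

<?php
// Minimal crawl loop sketch: pop a URL, fetch it with cURL,
// pull out links with a regex, queue any URL we have not seen yet.
// The seed URL below is a placeholder; swap in your own starting point.

$queue = array('http://example.com/');   // URLs waiting to be crawled
$seen  = array();                        // URLs already fetched
$maxPages = 50;                          // stop condition for this sketch

while (!empty($queue) && count($seen) < $maxPages) {
    $url = array_shift($queue);          // pop the next URL

    if (isset($seen[$url])) {
        continue;
    }
    $seen[$url] = true;

    // Fetch the page text with cURL
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    curl_setopt($ch, CURLOPT_TIMEOUT, 10);
    curl_setopt($ch, CURLOPT_USERAGENT, 'MyCrawler/0.1');
    $html = curl_exec($ch);
    curl_close($ch);

    if ($html === false) {
        continue;                        // skip pages that failed to load
    }

    // At this point you would index/cache $html in your database.

    // Simple regex link extraction; an HTML parser (e.g. DOMDocument) is more robust.
    if (preg_match_all('/href\s*=\s*["\']([^"\'#]+)["\']/i', $html, $matches)) {
        foreach ($matches[1] as $link) {
            // Only queue absolute http(s) URLs in this sketch;
            // relative links would need to be resolved against $url first.
            if (preg_match('#^https?://#i', $link) && !isset($seen[$link])) {
                $queue[] = $link;
            }
        }
    }
}
?>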
Gruzin Posted February 20, 2007
Thanks mate! I think cURL is the thing I want for now...
Regards, George
printf Posted February 20, 2007
You will first need a socket streaming system that just fetches pages and holds them in the cache; all that class would do is dump its output into the extractor directory. The extractor would then index each page and load all the links it finds back into the socket stream object: if that object is running, the URLs are added to its interface under the id of the page that contained them; if it is not running, it gets started. I have all kinds of functions and classes that can help you; if you want them, PM me your email and I will send them to you.
Here is an example of the extractor class, which is very efficient. It converts all links to full, verified URLs and checks each link after converting it (../path/page.html => http://www.site.com/path/page.html). If element[0] equals 0, the link is active (HTTP/1.1 200 OK). It uses socket streaming to load 100 URLs at a time, and you can load even more if you want; the class can also set how many connections per server are allowed, so you don't overload a single server you are connecting to.
// php.net home page (extractor) done in real time
http://www.ya-right.com/extractor.php
printf
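printf's class itself isn't posted in the thread, but the link-normalising step he describes might look roughly like the sketch below. resolveUrl() and linkIsAlive() are simplified helpers written for this example: they only handle the common cases and use a plain cURL HEAD request rather than his socket-streaming approach.

<?php
// Sketch: resolve a relative link against the page it was found on,
// then do a quick cURL HEAD request to see whether it answers 200.

function resolveUrl($base, $link)
{
    // Already absolute?
    if (preg_match('#^https?://#i', $link)) {
        return $link;
    }

    $parts  = parse_url($base);
    $scheme = $parts['scheme'] . '://';
    $host   = $parts['host'];

    if (substr($link, 0, 1) === '/') {
        // Root-relative link
        return $scheme . $host . $link;
    }

    // Directory of the base page, e.g. /dir/ for /dir/index.html
    $dir = isset($parts['path']) ? preg_replace('#/[^/]*$#', '/', $parts['path']) : '/';

    // Collapse ../ segments one directory at a time
    while (substr($link, 0, 3) === '../') {
        $link = substr($link, 3);
        $dir  = preg_replace('#[^/]+/$#', '', $dir);
    }

    return $scheme . $host . $dir . $link;
}

function linkIsAlive($url)
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_NOBODY, true);         // HEAD request only
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_TIMEOUT, 5);
    curl_exec($ch);
    $code = curl_getinfo($ch, CURLINFO_HTTP_CODE);
    curl_close($ch);
    return $code == 200;                            // HTTP/1.1 200 OK
}

// Example: ../path/page.html found on http://www.site.com/dir/index.html
$abs = resolveUrl('http://www.site.com/dir/index.html', '../path/page.html');
// $abs is now http://www.site.com/path/page.html
?>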
Archived
This topic is now archived and is closed to further replies.