hackalive Posted June 8, 2010
Okay, so thorpe has locked my post and pointed out it is far too vague. So here I go again. Firstly, I know how to build a basic PHP cURL web page crawler. Now my question is more specific: how do I get site stats to recognise the crawler the way other ones such as GoogleBot and MLBot are listed? (MLBot lists itself as MLBot http://www.metadatalabs.com/mlbot, so mine would be HackAliveCrawler http://mysite.com/hackalivecrawler or similar.) And how do I get it to recognise ROBOTS meta tags, i.e. NOINDEX, NOFOLLOW or INDEX, FOLLOW etc., and robots.txt? (To do this I first need the name part to work, e.g. GoogleBot or MLBot.) Hope this is less vague and can produce some answers. Links to tutorials or other forums that will achieve these desired results, as well as personal opinions and comments, are most welcome. Many thanks in advance.
ignace Posted June 8, 2010
Now my question is more specific: how do I get site stats to recognise the crawler the way other ones such as GoogleBot and MLBot are listed?
The User-Agent request header.
And how do I get it to recognise ROBOTS meta tags, i.e. NOINDEX, NOFOLLOW or INDEX, FOLLOW etc., and robots.txt?
Upon crawling you check whether a robots.txt file exists and adhere to it. You crawl only the pages (or pages within directories) which are not blacklisted in the robots.txt file, and in each fetched page you search for a ROBOTS meta tag. If the robots meta tag contains NOINDEX or NOFOLLOW, you act accordingly.
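The two checks ignace describes can be sketched in PHP roughly like this. This is a simplified illustration only, not a full robots.txt parser: the function names are made up, the robots.txt grammar handling is minimal (no Allow lines, no wildcards), and the meta regex assumes the name attribute comes before content.

```php
<?php
// Sketch: honouring robots.txt and the ROBOTS meta tag.
// Assumes the crawler calls itself "HackAliveCrawler"; helper names are hypothetical.

// Check whether $path on $host is disallowed for our bot by robots.txt.
function isAllowedByRobotsTxt($host, $path, $botName) {
    $rules = @file_get_contents("http://$host/robots.txt");
    if ($rules === false) {
        return true; // no robots.txt means crawling is not restricted
    }
    $applies = false; // does the current User-agent group apply to us?
    $allowed = true;
    foreach (preg_split('/\r\n|\r|\n/', $rules) as $line) {
        $line = trim(preg_replace('/#.*/', '', $line)); // strip comments
        if (stripos($line, 'User-agent:') === 0) {
            $agent   = trim(substr($line, 11));
            $applies = ($agent === '*' || stripos($botName, $agent) !== false);
        } elseif ($applies && stripos($line, 'Disallow:') === 0) {
            $prefix = trim(substr($line, 9));
            if ($prefix !== '' && strpos($path, $prefix) === 0) {
                $allowed = false; // path falls under a disallowed prefix
            }
        }
    }
    return $allowed;
}

// Pull the content of a <meta name="robots" ...> tag out of fetched HTML.
function robotsMetaContent($html) {
    $re = '/<meta[^>]+name=["\']robots["\'][^>]*content=["\']([^"\']+)["\']/i';
    if (preg_match($re, $html, $m)) {
        return strtolower($m[1]); // e.g. "noindex,nofollow"
    }
    return 'index,follow'; // default when no tag is present
}
?>
```

Your crawler would call isAllowedByRobotsTxt() before fetching a page, and robotsMetaContent() afterwards to decide whether to index the page and follow its links.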
hackalive Posted June 8, 2010
Okay, thanks once again ignace. I'm just unsure how to set the User-Agent directive; if you know a good link for this or can tell me, it will be much appreciated. Thanks again ignace.
hackalive Posted June 8, 2010
Oh, by the way, it will be running as part of a cron job and sometimes via direct browser. Thought I might just add this as it may affect the User-Agent directive mentioned.
ignace Posted June 8, 2010
http://www.w3.org/Protocols/HTTP/HTRQ_Headers.html#user-agent
It's part of the request headers.
hackalive Posted June 8, 2010
Okay, so I get that it is part of the request headers. How then can I set my own, like MLBot and the Google crawler do, for when it is executed via a cron job?
hackalive Posted June 8, 2010
Because MLBot has "MLBot (http://www.metadatalabs.com/mlbot)"; this is what I want to achieve.
ignace Posted June 8, 2010
User-Agent: HackAliveCrawler http://mysite.com/hackalivecrawler
Adam Posted June 8, 2010
I'm guessing you mean with PHP? You can send it as a header: header('User-Agent: HackAliveCrawler http://mysite.com/hackalivecrawler');
hackalive Posted June 8, 2010
Yes, with PHP. Thanks a million once again to ignace (who many a time has saved me from trawling the entire internet), and thanks also to MrAdam.
Adam Posted June 8, 2010
Sorry, if you're using cURL you'll need to set the cURL option: curl_setopt($ch, CURLOPT_USERAGENT, 'HackAliveCrawler http://mysite.com/hackalivecrawler');
hackalive Posted June 8, 2010
Thanks again.
hackalive Posted June 8, 2010
Okay, slight problem. This code
<?php header('User-Agent: HackAliveCrawler http://mysite.com/hackalivecrawler'); echo $_SERVER['HTTP_USER_AGENT']; ?>
returns Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.1; Trident/4.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; InfoPath.3; .NET4.0C; .NET4.0E), not HackAliveCrawler http://mysite.com/hackalivecrawler. Any reason why? If I use USER_AGENT instead of HTTP_USER_AGENT it returns nothing. Thanks once again.
Adam Posted June 8, 2010
I believe that'll be the browser over-writing whatever you send. If you're using cURL to crawl the pages as I mentioned before, just send it as a cURL option. Edit: should add I'm only speculating about the browser; I'm not fully sure how that part works.
hackalive Posted June 8, 2010
Yeah, I am just doing it on a sample page at the moment with no cURL, just exactly what I have posted.
ignace Posted June 8, 2010
header() obviously doesn't work because it sends response headers, not request headers. You need cURL in order to do that.
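That distinction is the key point of the whole thread: $_SERVER['HTTP_USER_AGENT'] shows the User-Agent of the request that arrived at your script (the browser's), while header() only shapes the response you send back. To put a custom User-Agent on an outgoing request without cURL, PHP's HTTP stream wrapper can also do it. A minimal sketch, assuming allow_url_fopen is enabled and using example.com as a placeholder target:

```php
<?php
// Sketch: sending a custom User-Agent on an outgoing request via a stream context.
// 'user_agent' is a standard option of PHP's http stream context.
$context = stream_context_create(array(
    'http' => array(
        'user_agent' => 'HackAliveCrawler http://mysite.com/hackalivecrawler',
    ),
));

// The fetched server will now see the crawler's name in its logs and stats.
$html = file_get_contents('http://example.com/', false, $context);
?>
```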
hackalive Posted June 8, 2010
MrAdam, I get Warning: curl_setopt(): supplied argument is not a valid cURL handle resource for this code
<?php curl_setopt($ch, CURLOPT_USERAGENT, 'HackAliveCrawler http://mysite.com/hackalivecrawler'); echo $_SERVER['HTTP_USER_AGENT']; echo "|"; echo $_SERVER['USER_AGENT']; ?>
trq Posted June 8, 2010
Warning: curl_setopt(): supplied argument is not a valid cURL handle resource
You haven't defined $ch.
hackalive Posted June 8, 2010
Okay, thorpe, so what do I set $ch as?
trq Posted June 8, 2010
Why not take a look at the manual for the cURL extension?
Adam Posted June 8, 2010
That's not how it works. As thorpe has just replied, you need to define $ch (look at the cURL examples in the manual), but even then it only sets the user agent for the request you're going to make with cURL.
hackalive Posted June 8, 2010
<?php $ch = curl_init(); curl_setopt($ch, CURLOPT_USERAGENT, 'HackAliveCrawler http://mysite.com/hackalivecrawler'); ?>
Yeah, so if I do the above and then do the crawl using cURL, the user agent will be recorded as HackAliveCrawler, yes?
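Yes, provided the request is made through that same handle. Tying the thread's pieces together, a complete minimal request might look like this (example.com is a placeholder for the page being crawled):

```php
<?php
// Sketch: a full cURL request with a custom crawler User-Agent.
$ch = curl_init('http://example.com/');

// The string sites will record in their stats/logs for this crawler.
curl_setopt($ch, CURLOPT_USERAGENT, 'HackAliveCrawler http://mysite.com/hackalivecrawler');

// Return the page body from curl_exec() instead of printing it.
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);

$html = curl_exec($ch);
if ($html === false) {
    echo 'cURL error: ' . curl_error($ch);
}
curl_close($ch);

// $html now holds the page, ready for the robots meta tag / link extraction step.
?>
```

Because it's the request headers that carry the User-Agent, this works the same whether the script is run from a browser or from a cron job; $_SERVER['HTTP_USER_AGENT'] in the calling script is irrelevant to what the crawled site sees.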
trq Posted June 8, 2010
And this is the same guy going on about designing and implementing his own OS virtualization?
hackalive Posted June 8, 2010
Well, thorpe, my OS virtualization project has started and is going well, so umm, yeah. Just trying to clear this cURL thing out of my inbox (so to speak).
trq Posted June 8, 2010
I'm sure it is.