otuatail Posted April 29, 2010

Hi. I am getting some hits from robots. One of these guys claims that he adheres to the robots.txt file. Can someone tell me if I have got it wrong? Only the 4 specified bots should look at my site, including the root index.php:

User-agent: Googlebot
Disallow: /news/
Disallow: /cms/

User-agent: Slurp
Disallow: /news/
Disallow: /cms/

User-agent: Teoma
Disallow: /news/
Disallow: /cms/

User-agent: msnbot
Disallow: /news/
Disallow: /cms/

User-agent: *
Disallow: /
teamatomic Posted April 29, 2010

Put the general restriction first, followed by the bot-specific rules.

HTH
Teamatomic
otuatail Posted April 30, 2010

OK, this is getting confusing. I have had an email from the operator of a web robot. What I am doing is, when I get a hit from somewhere, I call a function isRobot(). This reads $_SERVER['HTTP_USER_AGENT'] and logs information in a database. My complaint was that I am getting records in my DB with his information. He says that web robots access my website in order to read the robots.txt file, and that is why I get the record. My interpretation is that the robot reads the file and exits. If what he says is correct, then all robots, including the bad ones, will always leave a record. Any info on this please. TIA, Desmond.
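The User-Agent check Desmond's isRobot() describes can be sketched like this (Python rather than the site's PHP, and the bot-token list is a hypothetical sample, not a complete one):

```python
import re

# A few well-known crawler tokens plus generic ones; real lists are much longer.
BOT_PATTERN = re.compile(r"googlebot|slurp|teoma|msnbot|crawler|spider|bot", re.I)

def is_robot(user_agent):
    """Return True if the User-Agent string looks like a crawler."""
    return bool(user_agent) and bool(BOT_PATTERN.search(user_agent))

print(is_robot("Mozilla/5.0 (compatible; Googlebot/2.1)"))           # True
print(is_robot("Mozilla/5.0 (Windows NT 6.1; rv:1.9.2) Firefox/3.6"))  # False
```

Note this only catches bots that identify themselves honestly; a "bad" bot can send any User-Agent it likes, which is why the thread later moves to IP-based detection.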
teamatomic Posted April 30, 2010

Yes, they come and read the file, so there should be a log entry for their access to the robots.txt file. But just because they read the file does not mean they have to honor its rules. Does your logfile show evidence that the bot in question accessed any file other than robots.txt?

HTH
Teamatomic
otuatail Posted April 30, 2010

No. What I was annoyed at before was this: when you visit a website, it says you are the 1,342,355th person to visit. I know this is crap, because 90% of those hits are web robots. My idea was first to write a robots.txt file to exclude all but one or two of them. I then had the idea (because of the bad ones) of checking $_SERVER['HTTP_USER_AGENT']; if I can tell that it is a bot, don't increase the hit counter, but log the entry in the database anyway. I have a list of known bots that I don't want, and I exit(); so they don't get anywhere. Problem is, there are thousands of them. I was hoping that the text file would keep them out. What I am now told is that I will always get a visit from a robot regardless. The problem is I don't want to get this far and still have to examine $_SERVER['HTTP_USER_AGENT']. This means that, regardless of my text file, my database will record a robot's visit as if it were a human one.

Desmond.
andrewgauger Posted April 30, 2010

Couldn't help but not help: http://ars.userfriendly.org/cartoons/?id=20060315&mode=classic was all I found on the subject.
otuatail Posted April 30, 2010

The problem I saw is having a website that said you were the 1000th visitor when 90% of those were robots. I created a MySQL table, and every page calls a function that logs the datetime, IP address, the page being viewed, and the browser information. I checked for Googlebot and Yahoo, and identified them in the table as robots. With this I can see how many people have visited the website in a month, and which pages; I can separate the people from the robots and do the same. A successful website would have these visitors, and once the robots are stripped out you could tell if your website is not as successful and work out why. I created a robots.txt file allowing 4 major robots and refusing the others. I see others in the table, but because I did not identify them they are counted as real visitors. I have to build a hit list of all of these as I find them.

This guy says that he does observe the robots file, and that by reading this file he will cause a hit. I thought this would not happen at this stage. This means that there is no way I can get a true count of real visitors, as both good and bad robots will cause a hit just by reading the robots.txt file.

Any suggestions? TIA, Desmond.
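The kind of visit table Desmond describes could look like this (sketched with SQLite for a self-contained example; his site uses MySQL, and all column names here are assumptions):

```python
import sqlite3

# In-memory table mirroring the fields the post lists:
# datetime, IP, page viewed, browser info, plus a robot flag.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE visits (
        visited_at TEXT,
        ip         TEXT,
        page       TEXT,
        user_agent TEXT,
        is_robot   INTEGER
    )
""")
conn.execute("INSERT INTO visits VALUES (datetime('now'), '66.249.65.1', '/index.php', 'Googlebot/2.1', 1)")
conn.execute("INSERT INTO visits VALUES (datetime('now'), '203.0.113.7', '/index.php', 'Firefox/3.6', 0)")

# Separate the people from the robots, as the post describes.
humans = conn.execute("SELECT COUNT(*) FROM visits WHERE is_robot = 0").fetchone()[0]
print(humans)  # 1
```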
teamatomic Posted April 30, 2010

1. Take it as a given that a bot will read the robots.txt file first.
2. Grab the IP of anything that reads the robots.txt file.
3. Don't count anything with the IP of the robot.

#1 makes an assumption that may not always be true, but it's probably the closest you can come. You may also want to read this thread: http://www.phpfreaks.com/forums/index.php/topic,291102.msg1378430.html#msg1378430

HTH
Teamatomic
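Teamatomic's three steps can be sketched as follows (a minimal Python illustration with made-up example IPs; a real site would persist the robot-IP set in the database rather than in memory):

```python
robot_ips = set()   # IPs seen fetching robots.txt
human_hits = 0      # the visitor counter

def record_hit(ip, path):
    """Apply steps 1-3: robots.txt readers are bots; don't count their later hits."""
    global human_hits
    if path == "/robots.txt":
        robot_ips.add(ip)      # step 2: remember the crawler's IP
        return
    if ip in robot_ips:        # step 3: don't count anything from that IP
        return
    human_hits += 1            # everyone else counts as a visitor

record_hit("66.249.65.1", "/robots.txt")   # step 1: the bot reads robots.txt first
record_hit("66.249.65.1", "/index.php")    # ...so this page hit is not counted
record_hit("203.0.113.7", "/index.php")    # a browser that never read robots.txt
print(human_hits)  # 1
```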
otuatail Posted April 30, 2010

Thanks teamatomic. The problem is:

2) Grab the IP of anything that reads the robots.txt file.
How do I check this? Does $_SERVER['REMOTE_ADDR'] on each web page not get this?

3) Don't count anything with the IP of the robot.
How do I know the IPs of robots? There are TOO many, and not all bots are bad. It seems that all bots will be recorded by my storedata function; if they stopped at just reading the file, that would be great.

1) Take it as a given that a bot will read the robots.txt file first.
This doesn't seem to be the problem. I seem to get a hit regardless.
darkfreaks Posted April 30, 2010

Your code is disallowing those 4 bots. Try this:

User-agent: *
Disallow: /

User-agent: Googlebot
Allow: /news/
Allow: /cms/

User-agent: Slurp
Allow: /news/
Allow: /cms/

User-agent: Teoma
Allow: /news/
Allow: /cms/

User-agent: msnbot
Allow: /news/
Allow: /cms/
DennsR Posted April 30, 2010

You may want to check out a free program: BotSplit at http://www.BotSplit.com (disclosure: I wrote the program). This program operates offline to do an aggressive job of rooting out robot visitors. Note that there are 7 rules used and listed. Checking for access to robots.txt is a subset of one of the rules. Some of the rules are applied after the current session restarts, and others require visibility of the entire session. Thus, trying to do this in real time, before access is made, is *tough*.

As to what you are trying to do, telling visitors which visitor number they are: let me suggest you calculate the number of human visitors offline and reset your online count periodically, then increment your online count either with every visitor or with a fraction representing experience. Note that tracking the IPs of bots is an exercise in futility; the MSDN robot used 37 different IP addresses in one run alone.
otuatail Posted April 30, 2010

Thanks DennsR for this information and the hard work you have done in this area. It is so annoying thinking your website is fantastic, only to realise that no one looks at it. It's also annoying when you visit a website that says you are the millionth person, making you think this is a very good, hot and trusted website. I will take all this on board and re-work the website.

Desmond.
teamatomic Posted May 1, 2010

2) Grab the IP of anything that reads the robots.txt file. How do I check this? Does $_SERVER['REMOTE_ADDR'] on each web page not get this?
Parse the access log. Apache does a good job of logging.

3) Don't count anything with the IP of the robot. How do I know the IPs of robots? There are TOO many.
See the above answer.

HTH
Teamatomic
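Pulling robots.txt readers out of an Apache access log, as teamatomic suggests, can be sketched like this (the sample lines and IPs are made up, and the whitespace-split field positions assume the common combined log format; adjust to your own LogFormat):

```python
# Two fabricated combined-format log lines standing in for a real access log.
sample_log = [
    '66.249.65.1 - - [30/Apr/2010:10:00:00 +0000] "GET /robots.txt HTTP/1.1" 200 123 "-" "Googlebot/2.1"',
    '203.0.113.7 - - [30/Apr/2010:10:00:05 +0000] "GET /index.php HTTP/1.1" 200 456 "-" "Firefox/3.6"',
]

robot_ips = set()
for line in sample_log:
    parts = line.split()
    ip, path = parts[0], parts[6]   # field 0 = client IP, field 6 = request path
    if path == "/robots.txt":
        robot_ips.add(ip)           # this IP fetched robots.txt, so treat it as a bot

print(robot_ips)  # {'66.249.65.1'}
```

Any IP in that set can then be excluded from the visitor counter, per the earlier three-step recipe.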