
robot text file


otuatail

Recommended Posts

Hi, I am getting some hits from robots. One of these guys claims that he adheres to the robots.txt file. Can someone tell me if I have got it wrong? Only 4 specified bots should look at my site, including the root index.php.

 

User-agent: Googlebot 
Disallow: /news/
Disallow: /cms/

User-agent: Slurp
Disallow: /news/
Disallow: /cms/

User-agent: Teoma
Disallow: /news/
Disallow: /cms/

User-agent: msnbot
Disallow: /news/
Disallow: /cms/

User-agent: *
Disallow: /

Link to comment
Share on other sites

Ok, this is getting confusing. I have had an email from the author of a web robot.

What I am doing is: when I get a hit from somewhere,

I do the following. Call a function isRobot();

This gets $_SERVER['HTTP_USER_AGENT']; and logs information in a database.
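A minimal sketch of that flow might look like the following (the `visits` table, its columns, and the user-agent pattern are my assumptions, not the poster's actual schema):

```php
<?php
// Crude user-agent check: most well-behaved crawlers identify themselves.
function looksLikeBot(string $agent): bool {
    return (bool) preg_match('/bot|crawl|spider|slurp|teoma/i', $agent);
}

// Called on every page hit: logs the visit and flags apparent robots.
// The "visits" table and its columns are illustrative assumptions.
function logVisit(PDO $db): bool {
    $agent = $_SERVER['HTTP_USER_AGENT'] ?? '';
    $isBot = looksLikeBot($agent);
    $stmt  = $db->prepare(
        'INSERT INTO visits (ip, page, agent, is_robot)
         VALUES (:ip, :page, :agent, :bot)'
    );
    $stmt->execute([
        ':ip'    => $_SERVER['REMOTE_ADDR'] ?? '',
        ':page'  => $_SERVER['REQUEST_URI'] ?? '',
        ':agent' => $agent,
        ':bot'   => (int) $isBot,
    ]);
    return $isBot;
}
```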

 

My complaint was that I am getting records in my DB with his information. He says that web robots access my website in order to read the robots file, and that is why I get the record. My interpretation is that the robot reads the file and exits. If what he says is correct, then all robots, including the bad ones, will always leave a record.

 

Any info on this please.

 

TIA Desmond.

 

 


Yes, they come and read the file, so there should be a log entry for their access to the robots.txt file. But just because they read the file does not mean they have to honor its rules. Does your logfile show evidence that the bot in question accessed any other file but robots.txt?

 

 

HTH

Teamatomic


No, what I was annoyed at before was this: when you visit a website, it says you are the 1342355th person to visit the website.

I know this is crap, because 90% of these are web robots.

My idea was to first write a robot file to exclude all but one or two of them.

I then had the idea (because of the bad ones) of checking $_SERVER['HTTP_USER_AGENT'];

if I can tell that it is a bot, then don't increase the hit counter, but log the entry in the database anyway.

 

I have a list of known bots that I don't want, and I call exit(); so they don't get anywhere.

Problem is, there are thousands of them.
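The blocklist-and-exit() approach described above could be sketched like this (the bot names here are placeholders, not a real list):

```php
<?php
// Names here are placeholders; a real blocklist would be much longer.
$blockedBots = ['BadBot', 'EvilCrawler', 'SiteSnagger'];

// Case-insensitive substring match against the user-agent string.
function isBlockedBot(string $agent, array $blocked): bool {
    foreach ($blocked as $name) {
        if (stripos($agent, $name) !== false) {
            return true;
        }
    }
    return false;
}

// At the top of each page:
// if (isBlockedBot($_SERVER['HTTP_USER_AGENT'] ?? '', $blockedBots)) {
//     exit(); // stop before the page (and the hit counter) runs
// }
```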

I was hoping that the text file would keep them out. What I am now told is that I will always get a visit from a robot regardless.

The problem is I don't want to get this far and have to examine $_SERVER['HTTP_USER_AGENT'];

This means that, regardless of my text file, my database will always record these robot visits as if they were human.

 

Desmond.

 

 


The problem I saw is having a website that said you were the 1000th visitor when 90% of these were robots. I created a MySQL table. Every page calls a function to log the datetime, IP address, the page being viewed, and the browser information. I checked for Googlebot and Yahoo, and identified them in the table as robots. With this I can see how many people have visited the website in a month, and which pages. I can separate the people from the robots and do the same. A successful website would have these visitors, and you could tell if your website is not as successful once the robots are taken out, and work out why.

I created a robots file allowing 4 major robots and refusing the others. I see others in the table, but because I did not identify them they are counted as real visitors. I have to build a hit list of all of these as I find them.

This guy says that he does observe the robots file, and that by reading this file I will get a hit. I thought this would not happen at this stage. This means that there is no way I can get a true count of real visitors, as both good and bad robots will cause a hit just by reading the robots text file.

 

Any suggestions

 

TIA Desmond.


1. Take it as a given that a bot will read the robots.txt file first.

2. Grab the IP of anything that reads the robots.txt file.

3. Don't count anything with the IP of the robot.

 

#1 makes an assumption that may not always be true, but it's probably the closest you can come.
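The three steps above could be sketched like this. It uses a flat file instead of a MySQL table for brevity, and assumes requests for robots.txt are routed through PHP (e.g. via an Apache rewrite rule); all names are illustrative:

```php
<?php
// Step 1+2: anything that fetches robots.txt is assumed to be a bot,
// so record its IP (one IP per line in a flat file).
function recordRobotsTxtFetch(string $ip, string $ipFile): void {
    file_put_contents($ipFile, $ip . PHP_EOL, FILE_APPEND | LOCK_EX);
}

// Step 3: skip the hit counter for any IP previously seen fetching robots.txt.
function isKnownBotIp(string $ip, string $ipFile): bool {
    if (!is_file($ipFile)) {
        return false;
    }
    $ips = file($ipFile, FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES);
    return in_array($ip, $ips, true);
}
```

In the page counter, something like `if (!isKnownBotIp($_SERVER['REMOTE_ADDR'], $ipFile)) { $count++; }` would then skip anything that had read robots.txt.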

 

You may also want to read this thread

http://www.phpfreaks.com/forums/index.php/topic,291102.msg1378430.html#msg1378430

 

 

HTH

Teamatomic

 

 


Thanks Teamatomic,

the problem is:

 

2)

grab the IP of anything that reads the robots.txt file.

How do I check this? Does my

$_SERVER['REMOTE_ADDR'] on each web page not get this?

 

3) don't count anything with the IP of the robot.

How do I know the IPs of the robots? There are TOO many.

 

Not all bots are bad. It seems that all bots will be recorded by my storedata function;

if they stopped at just reading the file, that would be great.

 

1) Take it as a given that a bot will read the robots.txt file first

This doesn't seem to be the problem. I seem to get a hit regardless.

 


Your robots.txt is disallowing those 4 bots from /news/ and /cms/.

 

try this:

User-Agent: *
Disallow: /

User-agent: Googlebot
Allow: /news/
Allow: /cms/

User-agent: Slurp
Allow: /news/
Allow: /cms/

User-agent: Teoma
Allow: /news/
Allow: /cms/

User-agent: msnbot
Allow: /news/
Allow: /cms/



You may want to check out a free program: BotSplit at http://www.BotSplit.com (disclosure: I wrote the program). 

 

This program operates offline to do an aggressive job of rooting out robot visitors. Note that there are 7 rules used and listed. Checking for access to robots.txt is a subset of one of the rules. Some of the rules are applied after the current session restarts, and others require visibility of the entire session.

 

Thus, trying to do this in real time before access is made is *tough*.

 

As to what you are trying to do (tell visitors which visitor number they are): let me suggest calculating the number of human visitors offline and resetting your online count periodically. Then increment your online count either with every visitor, or by a fraction representing experience.

 

Note that tracking the IPs of bots is an exercise in futility. The MSDN robot used 37 different IP addresses in one run alone.


Thanks DennsR

for this information and the hard work you have done in this area. It is so annoying thinking your website is fantastic, only to realise that no one looks at it. It's also annoying when you visit a website that says you are the millionth person, making you think this is a very good, hot and trusted website. I will take all this on board and re-work the website.

 

Desmond.

 


2)

grab the IP of anything that reads the robots.txt file.

How do I check this? Does my

$_SERVER['REMOTE_ADDR'] on each web page not get this?

Parse the access log. Apache does a good job of logging.

3) don't count anything with the IP of the robot.

How do I know the IPs of the robots? There are TOO many.

See the above answer.
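Pulling the robot IPs out of the access log could look something like this (it assumes Apache's common log format; the exact format on the poster's server may differ):

```php
<?php
// Collect the distinct client IPs that requested /robots.txt from an
// Apache access log in common log format, e.g.:
// 66.249.66.1 - - [10/Oct/2009:13:55:36 -0700] "GET /robots.txt HTTP/1.1" 200 120
function robotIpsFromLog(string $logFile): array {
    $ips = [];
    foreach (file($logFile, FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES) as $line) {
        // The client IP is the first field; the request is the quoted
        // "GET /path HTTP/1.x" field.
        if (preg_match('#^(\S+).*"(?:GET|HEAD) /robots\.txt[ ?]#', $line, $m)) {
            $ips[$m[1]] = true; // keyed by IP to deduplicate
        }
    }
    return array_keys($ips);
}
```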

 

 

HTH

Teamatomic

