otuatail Posted September 12, 2008

Hi. I have tried asking this before but got nowhere. I have a robots.txt file like:

User-agent: Googlebot
Disallow: /news/
Disallow: /cms/

User-agent: Slurp
Disallow: /news/
Disallow: /cms/

This allows Google and Yahoo to crawl pages other than the ones listed above. My problem is: if there is a new robot, how do you get the User-agent string? For example, I have been told that to stop Yahoo you use Slurp. How do you arrive at that? The complete string is:

Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)

Am I to understand that I only need to use a part of this, i.e. Slurp, and that the robots do a string test to see if Slurp is in their string? Or is there a separate list of strings just for the robots.txt file?

Desmond.
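As a sanity check on what those rules actually do, Python's standard-library robots.txt parser can stand in for a compliant crawler (this is only an illustration of how the rules are interpreted, not how any particular engine implements them; the bot name "SomeOtherBot" is made up):

```python
from urllib.robotparser import RobotFileParser

# The rules from the post, fed to Python's stdlib parser,
# which interprets them the way a compliant crawler would.
rules = """\
User-agent: Googlebot
Disallow: /news/
Disallow: /cms/

User-agent: Slurp
Disallow: /news/
Disallow: /cms/
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

print(parser.can_fetch("Googlebot", "/news/story.html"))     # False: disallowed
print(parser.can_fetch("Googlebot", "/about.html"))          # True: not listed
print(parser.can_fetch("SomeOtherBot", "/news/story.html"))  # True: no record matches it
```

Note the last line: a bot whose name matches no User-agent record is allowed everywhere, which is exactly why the question of finding the right name matters.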
Daniel0 Posted September 12, 2008

You could just do User-agent: * instead.
otuatail Posted September 12, 2008 (Author)

Yes, but that would defeat the object. I don't want to kill off all robots; they can be useful. As I said: is there an up-to-date list of all robots.txt tokens, or do the robots use a strstr()?

Desmond.
Daniel0 Posted September 12, 2008

I'm sorry, that doesn't make sense. You don't want to disallow all, but you want to keep a list of all robots and put it in there? That's essentially the same thing. Just use whitelisting: disallow all, but allow the few you want.
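A whitelist of that kind needs no special extensions at all: name the bots you trust, give them their restricted rules, and end with a catch-all record that shuts out everyone else. A minimal sketch, using the directories from the original post:

```
User-agent: Googlebot
Disallow: /news/
Disallow: /cms/

User-agent: Slurp
Disallow: /news/
Disallow: /cms/

User-agent: *
Disallow: /
```

A well-behaved crawler uses the most specific record that matches its name, so Googlebot and Slurp follow their own rules while every other compliant bot hits the `User-agent: *` record and stays out entirely.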
otuatail Posted September 12, 2008 (Author)

Sorry, you do not understand. What do the lines below do?

User-agent: Slurp
Disallow: /news/
Disallow: /cms/

Stop Microsoft's robot from looking at news & cms? No, this is for Yahoo. How do you know that? It is not documented anywhere. I want to add a line

User-agent: WXYZ

where WXYZ is a particular robot that I don't want. How do I find out what to put in there? For example, this is a new one:

Mozilla/5.0 (X11; U; Linux i686 (x86_64); en-US; rv:1.8.1.4) Gecko/20080721 BonEcho/2.0.0.4

I am not interested here in what it is or why I want to disallow it, just what I replace WXYZ with. My first post was to do with string searching.
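The original 1994 robots exclusion standard actually answers the string-searching question: it recommends that a robot treat a User-agent line as matching if its value is a case-insensitive substring of the robot's own name token, without version information. That token (e.g. "Slurp") is published by the crawler's owner; the long Mozilla/5.0 (...) string is only the HTTP User-Agent header. A minimal sketch of that check (the exact matching each engine performs is up to that engine):

```python
def record_applies(ua_line_value: str, robot_token: str) -> bool:
    """Sketch of the match the 1994 robots.txt standard recommends:
    a record applies if the User-agent value is a case-insensitive
    substring of the robot's name token -- roughly the strstr()
    test guessed at in the first post."""
    return ua_line_value.lower() in robot_token.lower()

# 'Yahoo! Slurp' is the name Yahoo publishes for its crawler;
# 'WXYZ' stands in for an arbitrary token, as in the post above.
print(record_applies("Slurp", "Yahoo! Slurp"))  # True
print(record_applies("slurp", "Yahoo! Slurp"))  # True: case-insensitive
print(record_applies("WXYZ", "Yahoo! Slurp"))   # False
```

So the thing to put after `User-agent:` is the crawler's published token, not anything cut out of the browser-style header string.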
Daniel0 Posted September 12, 2008

Ask the owner of the robot, then?
corbin Posted September 12, 2008

Like Daniel said, you just find the bot's name somewhere. A simple Google search turned up MSN's bots' names (http://www.google.com/search?hl=en&q=msn+search+user-agent&btnG=Search). If it's a more obscure search engine, good luck. Also, bots don't have to obey your robots file, if it's something malicious you're worried about.
CroNiX Posted September 15, 2008

Not too hard with Google. Search "web+bot+list". Anyway, the above post is correct: there is no law that a bot has to report itself as a bot, and if there were, it would be ignored. Trying to implement this seems like a waste of your time. Actually, I have read that having a robots.txt is a bad idea, because it shows some of your directory structure to unscrupulous people who look for exploits.
Daniel0 Posted September 15, 2008

"Actually, I have read that having a robots.txt is a bad idea because it shows some of your directory structure to unscrupulous people who look for exploits."

You should only use it for things that you don't want search engines to know about; you shouldn't use it for protection. Anything that you need to keep safe must be protected with a login, or perhaps removed from the document root.
corbin Posted September 15, 2008

"Actually, I have read that having a robots.txt is a bad idea because it shows some of your directory structure to unscrupulous people who look for exploits."

Anything that's not publicly linked to won't need to be flagged as hidden, if the robot is legit and well behaved. Googlebot and the like don't go looking for your includes folder (for example), so unless you have a link to it somewhere, it won't be indexed.
Daniel0 Posted September 16, 2008

Well, other people might link to it, corbin. But things like an include folder or config folder shouldn't be within the document root in the first place.
corbin Posted September 16, 2008

Yeah, I agree that it shouldn't be in the doc root, but on some lower-end hosting there sometimes isn't a choice (very rarely). Also, if someone is linking to it, he/she has already found it, meaning other people could just as easily. I personally either deny the folder with .htaccess or make the files fail if directly accessed. I guess a search engine wouldn't like that, though.
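The .htaccess approach mentioned above can be as small as two lines, assuming the host permits .htaccess overrides (Apache 2.2-era syntax; Apache 2.4 uses "Require all denied" instead):

```
# .htaccess placed inside the folder to hide (e.g. includes/)
Order allow,deny
Deny from all
```

Unlike a robots.txt entry, this blocks everyone over HTTP, bots and humans alike, without advertising the folder's existence in a publicly readable file.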