
robots.txt files explained


otuatail

Recommended Posts

Hi

 

I have tried asking this before but got nowhere. I have a robots.txt file like this:

 

User-agent: Googlebot

Disallow: /news/

Disallow: /cms/

 

User-agent: Slurp

Disallow: /news/

Disallow: /cms/

 

This allows Google and Yahoo to crawl pages other than the ones listed above. My problem is: if there is a new robot, how do you get its User-agent string? For example, I have been told that to stop Yahoo you use Slurp. How do you arrive at that?

 

The complete string is

Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)

 

Am I to understand that I only need to use a part of this, i.e. Slurp, and that the robots do a string test to see if Slurp is in their string? Or is there a separate list of strings just for the robots.txt file?
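From what I have read so far, well-behaved robots compare the short product token they crawl under (Slurp, Googlebot, and so on) against each User-agent line, usually as a case-insensitive substring match, so only that token goes in the file. A minimal sketch with Python's standard urllib.robotparser (example.com is just a placeholder) seems to behave that way:

# Sketch: the short token "Slurp" in robots.txt is matched against the name
# a crawler identifies itself with, not the full Mozilla/5.0 string it sends.
from urllib import robotparser

rules = [
    "User-agent: Slurp",
    "Disallow: /news/",
    "Disallow: /cms/",
]

rp = robotparser.RobotFileParser()
rp.parse(rules)

print(rp.can_fetch("Slurp", "http://example.com/news/story.html"))      # False - blocked
print(rp.can_fetch("Slurp", "http://example.com/about.html"))           # True  - allowed
print(rp.can_fetch("Googlebot", "http://example.com/news/story.html"))  # True  - no rule for it here

The Disallow matching itself is a prefix test on the URL path, which is why the trailing / matters.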

 

 

Desmond.

 

 

 


Sorry, you do not understand.

 

What do the lines below do?

User-agent: Slurp

Disallow: /news/

Disallow: /cms/

 

Do they stop Microsoft's robot from looking at news & cms? No.

This block is for Yahoo. How do you know that? It is not documented anywhere.

 

I want to add a line

User-agent: WXYZ, where WXYZ is a particular robot that I don't want. How do I find out what to put in there?

For example, this is a new one:

Mozilla/5.0 (X11; U; Linux i686 (x86_64); en-US; rv:1.8.1.4) Gecko/20080721 BonEcho/2.0.0.4

 

I am not interested here in what it is or why I want to disallow it. Just what do I replace WXYZ with?
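If I understand it correctly, the value after User-agent: is the bot's short product token (the name its vendor documents for crawling), not the whole Mozilla/5.0 string; the string above actually looks like an ordinary Firefox 2 browser build (BonEcho), so it may not read robots.txt at all. Assuming a hypothetical bot whose documented token is ExampleBot, the block would be:

# ExampleBot is a made-up token - substitute the real bot's documented name
User-agent: ExampleBot
Disallow: /

A bot that simply ignores robots.txt would have to be blocked server-side instead.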

 

My first post was to do with string searching.

 

 

 


Not too hard with Google. Search "web+bot+list". Anyway, the above post is correct. There is no law that a bot has to report itself as a bot, and if there were, it would be ignored. This seems like a waste of your time to try to implement what you are trying to do. Actually, I have read that having a robots.txt is a bad idea because it shows some of your directory structure to unscrupulous people who look for exploits.


Actually, I have read that having a robots.txt is a bad idea because it shows some of your directory structure to unscrupulous people who look for exploits.

 

You should only use it for things that you don't want search engines to know about. You shouldn't use it for protection. Anything that you need to keep safe must be protected with a login or perhaps removed from the document root.


"Actually, I have read that having a robots.txt is a bad idea because it shows some of your directory structure to unscrupulous people who look for exploits."

 

 

Anything that's not publicly linked to won't need to be flagged as hidden, as long as the robot is legitimate and well-behaved. Googlebot and the like don't go looking for your includes folder (for example), so unless you have a link to it somewhere, it won't be indexed.


Yeah, I agree that it shouldn't be in the doc root, but on some lower-end hosting there sometimes isn't a choice (very rarely).

 

Also, if someone is linking to it, he/she has already found it, meaning other people could just as easily. I personally either deny the folder with .htaccess or make the files fail if accessed directly. I guess a search engine wouldn't like that, though.
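For reference, a minimal sketch of the .htaccess approach, assuming Apache 2.4 (the older Order/Deny syntax applies on 2.2), dropped into the folder you want to protect:

# Deny all direct HTTP requests to this folder (Apache 2.4)
Require all denied

# Apache 2.2 equivalent:
# Order allow,deny
# Deny from all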

