otuatail Posted September 12, 2008

Hi. I have tried asking this before but got nowhere. I have a robots.txt file like:

User-agent: Googlebot
Disallow: /news/
Disallow: /cms/

User-agent: Slurp
Disallow: /news/
Disallow: /cms/

This allows Google and Yahoo to crawl pages other than the ones listed above. My problem is: if there is a new robot, how do you get the User-agent string? For example, I have been told that to stop Yahoo you use Slurp. How do you arrive at that? The complete string is:

Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)

Am I to understand that I only need to use a part of this, i.e. Slurp, and that the robots do a string test to see if Slurp is in their string? Or is there a separate list of strings just for the robots.txt file?

Desmond.
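As a sanity check on what those rules actually do, Python's standard-library robots.txt parser can stand in for a compliant crawler (this is only an illustration of how the rules are interpreted, not how any particular engine implements them; the bot name "SomeOtherBot" is made up):

```python
from urllib.robotparser import RobotFileParser

# The rules from the post, fed to Python's stdlib parser,
# which interprets them the way a compliant crawler would.
rules = """\
User-agent: Googlebot
Disallow: /news/
Disallow: /cms/

User-agent: Slurp
Disallow: /news/
Disallow: /cms/
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

print(parser.can_fetch("Googlebot", "/news/story.html"))     # False: disallowed
print(parser.can_fetch("Googlebot", "/about.html"))          # True: not listed
print(parser.can_fetch("SomeOtherBot", "/news/story.html"))  # True: no record matches it
```

Note the last line: a bot whose name matches no User-agent record is allowed everywhere, which is exactly why the question of finding the right name matters.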
Daniel0 Posted September 12, 2008

You could just do User-agent: * instead.
otuatail Posted September 12, 2008 (Author)

Yes, but that would defeat the object. I don't want to kill off all robots; they can be useful. As I said: is there an up-to-date list of all robots.txt tokens, or do the robots use a strstr()?

Desmond.
Daniel0 Posted September 12, 2008

I'm sorry, that doesn't make sense. You don't want to disallow all, but you want to keep a list of all robots and put it in there? That's essentially the same thing. Just use whitelisting: disallow all, but allow the few you want.
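A whitelist of that kind needs no special extensions at all: name the bots you trust, give them their restricted rules, and end with a catch-all record that shuts out everyone else. A minimal sketch, using the directories from the original post:

```
User-agent: Googlebot
Disallow: /news/
Disallow: /cms/

User-agent: Slurp
Disallow: /news/
Disallow: /cms/

User-agent: *
Disallow: /
```

A well-behaved crawler uses the most specific record that matches its name, so Googlebot and Slurp follow their own rules while every other compliant bot hits the `User-agent: *` record and stays out entirely.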
otuatail Posted September 12, 2008 (Author)

Sorry, you do not understand. What do the lines below do?

User-agent: Slurp
Disallow: /news/
Disallow: /cms/

Stop Microsoft's robot from looking at news & cms? No, this is for Yahoo. How do you know that? It is not documented anywhere. I want to add a line

User-agent: WXYZ

where WXYZ is a particular robot that I don't want. How do I find out what to put in there? For example, this is a new one:

Mozilla/5.0 (X11; U; Linux i686 (x86_64); en-US; rv:1.8.1.4) Gecko/20080721 BonEcho/2.0.0.4

I am not interested here in what it is or why I want to disallow it, just what I replace WXYZ with. My first post was to do with string searching.
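The original 1994 robots exclusion standard actually answers the string-searching question: it recommends that a robot treat a User-agent line as matching if its value is a case-insensitive substring of the robot's own name token, without version information. That token (e.g. "Slurp") is published by the crawler's owner; the long Mozilla/5.0 (...) string is only the HTTP User-Agent header. A minimal sketch of that check (the exact matching each engine performs is up to that engine):

```python
def record_applies(ua_line_value: str, robot_token: str) -> bool:
    """Sketch of the match the 1994 robots.txt standard recommends:
    a record applies if the User-agent value is a case-insensitive
    substring of the robot's name token -- roughly the strstr()
    test guessed at in the first post."""
    return ua_line_value.lower() in robot_token.lower()

# 'Yahoo! Slurp' is the name Yahoo publishes for its crawler;
# 'WXYZ' stands in for an arbitrary token, as in the post above.
print(record_applies("Slurp", "Yahoo! Slurp"))  # True
print(record_applies("slurp", "Yahoo! Slurp"))  # True: case-insensitive
print(record_applies("WXYZ", "Yahoo! Slurp"))   # False
```

So the thing to put after `User-agent:` is the crawler's published token, not anything cut out of the browser-style header string.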
Daniel0 Posted September 12, 2008

Ask the owner of the robot, then?
corbin Posted September 12, 2008

Like Daniel said, you just find the bot's name somewhere. A simple Google search turned up MSN's bots' names (http://www.google.com/search?hl=en&q=msn+search+user-agent&btnG=Search). If it's a more obscure search engine, good luck. Also, bots don't have to obey your robots file, if it's something malicious you're worried about.
CroNiX Posted September 15, 2008

Not too hard with Google. Search "web+bot+list". Anyway, the above post is correct: there is no law that a bot has to report itself as a bot, and if there were, it would be ignored. Trying to implement this seems like a waste of your time. Actually, I have read that having a robots.txt is a bad idea, because it shows some of your directory structure to unscrupulous people who look for exploits.
Daniel0 Posted September 15, 2008

"Actually, I have read that having a robots.txt is a bad idea because it shows some of your directory structure to unscrupulous people who look for exploits."

You should only use it for things that you don't want search engines to know about; you shouldn't use it for protection. Anything that you need to keep safe must be protected with a login, or perhaps removed from the document root.
corbin Posted September 15, 2008

"Actually, I have read that having a robots.txt is a bad idea because it shows some of your directory structure to unscrupulous people who look for exploits."

Anything that's not publicly linked to won't need to be flagged as hidden, if the robot is legit and well behaved. Googlebot and the like don't go looking for your includes folder (for example), so unless you have a link to it somewhere, it won't be indexed.
Daniel0 Posted September 16, 2008

Well, other people might link to it, corbin. But things like an include folder or config folder shouldn't be within the document root in the first place.
corbin Posted September 16, 2008

Yeah, I agree that it shouldn't be in the doc root, but on some lower-end hosting there sometimes isn't a choice (very rarely). Also, if someone is linking to it, he/she has already found it, meaning other people could just as easily. I personally either deny the folder with .htaccess or make the files fail if directly accessed. I guess a search engine wouldn't like that, though.
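The .htaccess approach mentioned above can be as small as two lines, assuming the host permits .htaccess overrides (Apache 2.2-era syntax; Apache 2.4 uses "Require all denied" instead):

```
# .htaccess placed inside the folder to hide (e.g. includes/)
Order allow,deny
Deny from all
```

Unlike a robots.txt entry, this blocks everyone over HTTP, bots and humans alike, without advertising the folder's existence in a publicly readable file.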