redsmurph Posted March 11, 2012

I've developed a file upload service of sorts. Think TwitPic, TwitVid and Twaud combined and you're not far off. I need to determine which content pages are viewed most. Right now I simply count every page request.

The problem, of course, is that agent/bot requests are counted too. For some reason search engines (or other services) are very keen on accessing pages with many pictures on them, so those accesses can easily swamp the real ones and skew the count.

Is there a blacklist of search engine user agents I could deploy, or is there a more generic way of telling whether a request comes from a search engine rather than a user? I still want the content pages indexed by search engines, so they shouldn't become "invisible" to them, but I need to be able to tell which accesses come primarily from real users. No login is needed to see the content, so I can't use that as a differentiator.

I suspect there are also image-grabbing agents accessing my pages, and honestly they could be shut out completely.

Thanks in advance,
Anders
xyph Posted March 11, 2012

Check the User-Agent header. Most friendly crawlers identify themselves in the request headers. This isn't true of unfriendly crawlers, though.
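A minimal sketch of the check xyph describes, in Python. The token list and the function name `looks_like_bot` are illustrative assumptions, not a maintained blacklist, and the generic `bot` token can misfire on the occasional legitimate browser string. It only catches crawlers that identify themselves.

```python
import re

# Illustrative, incomplete list of substrings that self-identifying crawlers
# commonly put in their User-Agent header. This is an assumption for the
# sketch, not an authoritative blacklist.
BOT_TOKENS = ("googlebot", "bingbot", "slurp", "baiduspider", "yandexbot",
              "bot", "crawler", "spider")

# One case-insensitive pattern that matches any of the tokens above.
BOT_PATTERN = re.compile("|".join(re.escape(t) for t in BOT_TOKENS),
                         re.IGNORECASE)


def looks_like_bot(user_agent):
    """Return True if the User-Agent is missing or contains a crawler token."""
    if not user_agent:
        # Assumption: a request without a User-Agent is not a normal browser.
        return True
    return BOT_PATTERN.search(user_agent) is not None
```

In a request handler, the page view would be counted only when `looks_like_bot(...)` returns False for the incoming header, while the page itself is still served to everyone so search engines can keep indexing it.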
redsmurph Posted March 11, 2012

I already know to check the user agent; the question is what to check for. There must be a whitelist or blacklist of user agents I could check against, right? A whitelist might work just as well, provided it's updated daily, as I would otherwise miss accesses from the many new mobile phones coming out every day. I'm using WURFL for device characteristics, and possibly that would work for this too: counting only user agents WURFL knows about, since it also contains PC browsers.

As every media file upload site needs this, I'd be surprised if there's no tried and true solution around.

Cheers,
Anders
ignace Posted March 11, 2012

Check out robots.txt @ http://browsers.garykeith.com/downloads Could be a good starting point.
redsmurph Posted March 11, 2012

Thanks mate. It admittedly looks a bit like WURFL, even though I understand the intent is a bit different.

Anders
xyph Posted March 11, 2012

robots.txt isn't ideal. It will only disallow access to certain crawlers. You want Google to index your page, but you don't want its requests to increase your hit counter.

Quoting redsmurph: "I already know to check the user agent, so the question is what to check for?"

There is such a list. Did you try looking for it? http://www.google.com/search?q=bot+user+agent
ignace Posted March 12, 2012

Quoting my earlier post: "Check out robots.txt @ http://browsers.garykeith.com/downloads Could be a good starting point."

Ehm, it must have been quite late when I posted this. I meant: take the user-agent strings listed in that robots.txt file and match them against the HTTP_USER_AGENT string (some are missing, of course: all the legit ones, so you should add those too). Obviously, just putting a simple robots.txt in your root would do nothing.

But you'll never be 100% sure that you count only real user views and not bot views, since a bot could send the User-Agent string of, say, Chrome or Firefox. I think the main reason Google uses JS is to make sure it does not count crawlers (use a clever <noscript> to count those that have JS disabled?). IMO that's the way to go.
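The JS-based idea could be sketched on the server side roughly as follows, in Python. The assumption is that the page embeds a small script which calls a separate beacon endpoint, and only that endpoint increments the counter; plain page requests (which is all most crawlers make) are never counted. The class name `ViewCounter`, the method `record_beacon`, and the token list are hypothetical names made up for this sketch.

```python
from collections import Counter


class ViewCounter:
    """Counts views reported by a JavaScript beacon, not raw page requests."""

    def __init__(self, bot_tokens=("bot", "crawler", "spider", "slurp")):
        # page_id -> number of counted views
        self.views = Counter()
        # Illustrative token list (an assumption); extend it with the
        # user-agent strings from whatever list you decide to trust.
        self.bot_tokens = tuple(t.lower() for t in bot_tokens)

    def _is_bot(self, user_agent):
        ua = (user_agent or "").lower()
        # An empty User-Agent is treated as a bot (assumption).
        return not ua or any(token in ua for token in self.bot_tokens)

    def record_beacon(self, page_id, user_agent):
        """Call this from the beacon endpoint; returns True if the view counted."""
        if self._is_bot(user_agent):
            return False
        self.views[page_id] += 1
        return True
```

The user-agent check stays in as a second filter because some crawlers do execute JavaScript. As noted above, neither filter helps against a bot that sends a browser's User-Agent string and runs scripts.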