Counting real page views

redsmurph · March 11, 2012

I've developed a file upload service of sorts. Think TwitPic, TwitVid and Twaud combined and it's not that far off.

I need to determine which content pages are most viewed. Right now I just count each and every page request, so...

The problem I run into (of course) is that agent/bot requests are counted as well. For some reason search engines (or other services) are very keen on accessing pages with many pictures on them, so those accesses can easily dominate the totals and skew the view counts.

Is there a black list of search engine user agents I could deploy, or is there a more generic way of knowing whether I get a request from a search engine rather than a user?

I still want the content pages indexed by search engines, so they shouldn't become "invisible" to them, yet I need to be able to determine which accesses come from real users.

There's no need to log in to see the content, so I can't use that as a differentiator.

I suspect there are also image grabbing agents accessing my pages, and honestly they could be shut out completely.

Thanks in advance,

Anders

xyph · March 11, 2012

Check out their USER AGENT.

Most friendly crawlers will identify themselves within the request headers.

This isn't true for unfriendly crawlers though.

redsmurph · March 11, 2012

I already know to check the user agent, so the question is what to check for?

There must be a white list or black list of user agents I could check against, right? A white list might be as good, provided it's updated on a daily basis, as I would otherwise miss accesses from the many new mobile phones coming out every day.

I'm using WURFL for device characteristics, and possibly that would work for this too, counting only user agents WURFL knows about, as it also contains PC browsers.

As all media file upload sites need this, I'd be surprised if there's no tried and true solution around.

Cheers,

Anders

ignace · March 11, 2012

Check out robots.txt @ http://browsers.garykeith.com/downloads Could be a good starting point.

redsmurph · March 11, 2012

Thanks mate.

It admittedly looks a bit like WURFL, even though I understand the intent is a bit different.

Anders

xyph · March 11, 2012

robots.txt isn't ideal. It will only disallow access to certain crawlers.

You want Google to index your page, but you don't want its requests to increase your hit counter.

Quoting redsmurph: "I already know to check the user agent, so the question is what to check for?"

There is, did you try looking for it?

http://www.google.com/search?q=bot+user+agent

ignace · March 12, 2012

Quoting my earlier post: "Check out robots.txt @ http://browsers.garykeith.com/downloads Could be a good starting point."

Euhm.. must have been quite late when I posted this. I meant: take the user-agent strings listed in that robots.txt file and match them against the HTTP_USER_AGENT string (some are missing, of course: all the legit ones, so you should add these too). Obviously, just putting a simple robots.txt in your root would do nothing for your counter.
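To make the matching idea concrete, here is a minimal sketch in Python. The marker list, the function names (`is_probable_bot`, `count_view`) and the in-memory `counters` dict are all assumptions for illustration; a real deployment would load its markers from a maintained list and store counts in a database. Substring matching is only a heuristic, so expect some false positives and false negatives.

```python
# Hypothetical, deliberately short list of lowercase substrings that
# commonly appear in crawler User-Agent strings. It is NOT complete;
# extend it from a maintained list of known bots.
BOT_MARKERS = ("bot", "crawl", "spider", "slurp")


def is_probable_bot(user_agent):
    """Return True if the User-Agent looks like a crawler.

    Assumption: a missing or empty User-Agent is treated as non-human.
    """
    if not user_agent:
        return True
    ua = user_agent.lower()
    return any(marker in ua for marker in BOT_MARKERS)


def count_view(counters, page_id, user_agent):
    """Increment the view counter for page_id unless the request looks
    like it came from a bot. Returns True if the view was counted."""
    if is_probable_bot(user_agent):
        return False
    counters[page_id] = counters.get(page_id, 0) + 1
    return True


if __name__ == "__main__":
    counters = {}
    count_view(counters, "pic-1", "Mozilla/5.0 (Windows NT 6.1; rv:10.0) Gecko/20100101 Firefox/10.0")
    count_view(counters, "pic-1", "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)")
    print(counters)
```

The page is still served to the crawler as usual, so indexing is unaffected; only the counter update is skipped.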

But you'll never be 100% sure that you count only real user views and not bot views, since bots can send the User-Agent string of, say, Chrome or Firefox. I think the main reason Google uses JS is to make sure it does not count crawlers (use a clever <noscript> to count those that have JS disabled?). IMO that's the way to go.
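A rough sketch of that JS approach, again in Python and under stated assumptions: the `/count` path, the `page` and `js` query parameters, and both function names are made up for illustration. The page embeds a tiny script that requests a counting URL, with a <noscript> image as fallback, and the server counts only those beacon requests instead of raw page requests.

```python
from html import escape
from urllib.parse import parse_qs, quote


def tracking_snippet(page_id, beacon_path="/count"):
    """Build the HTML to embed in a content page.

    Browsers running JS request the js=1 URL; browsers with JS disabled
    load the <noscript> image and request the js=0 URL instead.
    """
    pid = quote(str(page_id), safe="")
    js_url = f"{beacon_path}?page={pid}&js=1"
    img_url = f"{beacon_path}?page={pid}&js=0"
    return (
        f'<script>new Image().src = "{js_url}";</script>\n'
        f'<noscript><img src="{escape(img_url)}" width="1" height="1" alt=""></noscript>'
    )


def record_beacon(counters, query_string):
    """Record one beacon hit; call this from the handler for beacon_path.

    JS and noscript hits are kept apart because a crawler that fetches
    images may also request the <noscript> URL.
    """
    params = parse_qs(query_string)
    page = params.get("page", [None])[0]
    if page is None:
        return False
    kind = "js" if params.get("js", ["0"])[0] == "1" else "noscript"
    per_page = counters.setdefault(page, {"js": 0, "noscript": 0})
    per_page[kind] += 1
    return True


if __name__ == "__main__":
    print(tracking_snippet(42))
    counters = {}
    record_beacon(counters, "page=42&js=1")
    print(counters)
```

Bots that execute JavaScript would still be counted, so this could be combined with a User-Agent check on the beacon request.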
