I've developed a file upload service of sorts. Think TwitPic, TwitVid, and Twaud combined, and that's not far off.

 

I need to determine which content pages are most viewed. Right now I just count each and every page request, so...

 

The problem I run into (of course) is that agent/bot requests are counted as well, and for some reason search engines (and other services) are very keen on accessing pages with many pictures on them, so those accesses can easily swamp the counts and skew them badly.

 

Is there a blacklist of search engine user agents I could deploy, or is there a more generic way of knowing whether a request comes from a search engine rather than a user?

 

I still want the content pages indexed by search engines, so they shouldn't become "invisible" to them, yet I need a way to determine which accesses come from real users.

 

There's no need to log in to see the content, so I can't use that as a differentiator.

 

I suspect there are also image-grabbing agents accessing my pages, and honestly, those could be shut out completely.

 

Thanks in advance,

Anders

 


I already know to check the user agent, so the question is what to check for?

 

There must be a whitelist or blacklist of user agents I could check against, right? A whitelist might be just as good, provided it's updated daily, as I would otherwise miss accesses from the many new mobile phones coming out every day.
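For the blacklist route, here's a minimal sketch of what the check could look like, assuming a hand-maintained token list. The tokens below are illustrative, not an exhaustive blacklist, and increment_view_counter() stands in for whatever counting code you already have:

```php
<?php
// Minimal sketch of a user-agent blacklist check. The token list is
// illustrative only; in production, load a real, regularly updated list
// from a file or database table.
function is_known_bot($userAgent)
{
    $botTokens = array(
        'googlebot', 'bingbot', 'slurp', 'baiduspider', 'yandex', // search engines
        'crawler', 'spider', 'bot',                               // generic giveaways
        'curl', 'wget', 'libwww', 'python',                       // scripted clients / grabbers
    );

    $ua = strtolower($userAgent);
    foreach ($botTokens as $token) {
        if (strpos($ua, $token) !== false) {
            return true; // matched a bot token: don't count this view
        }
    }
    return false;
}

$ua = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';
if (!is_known_bot($ua)) {
    // increment_view_counter($pageId); // your existing counting code
}
```

Note that substring matching on 'bot' is deliberately aggressive and can misfire on legitimate agents, so tune the list to your own traffic.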

 

I'm using WURFL for device characteristics, and possibly that would work for this too: count only user agents WURFL knows about, since it also covers PC browsers.
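If you go the WURFL route, something along these lines might work. The exact API depends on your WURFL version, so treat this as a sketch against the classic wurfl-php library, where an unrecognised user agent falls back to the catch-all "generic" device; the paths and cache settings are assumptions:

```php
<?php
// Sketch only: classic wurfl-php API. Adjust paths and persistence
// settings to your installation.
require_once '/path/to/WURFL/Application.php';

$config = new WURFL_Configuration_InMemoryConfig();
$config->wurflFile('/path/to/wurfl.xml')
       ->persistence('file', array('dir' => '/tmp/wurfl-cache'));

$wurflManagerFactory = new WURFL_WURFLManagerFactory($config);
$wurflManager = $wurflManagerFactory->create();

$device = $wurflManager->getDeviceForHttpRequest($_SERVER);

// Whitelist idea: only count agents WURFL actually recognises.
// Unknown agents match the catch-all "generic" device.
if ($device->id !== 'generic') {
    // increment_view_counter($pageId);
}
```

The obvious caveat: a crawler that sends a mainstream browser's User-Agent string will pass this check too, so it only filters out the honest bots.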

 

As all media file upload sites need this, I'd be surprised if there were no tried-and-true solution around.

 

Cheers,

Anders

robots.txt isn't ideal. It can only ask compliant crawlers to stay away from certain paths; it doesn't block anything by itself.

 

You want Google to index your page, but you don't want its requests to increase your hit counter.

 

I already know to check the user agent, so the question is what to check for?

 

There is, did you try looking for it?

http://www.google.com/search?q=bot+user+agent

Check out the robots.txt file at http://browsers.garykeith.com/downloads. It could be a good starting point.

 

Euhm... must have been quite late when I posted this. I meant: take the user-agent strings from within that robots.txt file and match them against the HTTP_USER_AGENT string (some are missing, of course: all the legit ones; you should add those too). Obviously, just putting a simple robots.txt in your root would do nothing.
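As an alternative to parsing that file yourself: PHP's built-in get_browser() can do the matching for you, provided the browscap directive in php.ini points at an up-to-date browscap.ini (the same site distributes those). The record it returns includes a crawler flag for known bots. A sketch, with increment_view_counter() again standing in for your own counting code:

```php
<?php
// Requires browscap=... to be set in php.ini and pointing at a current
// browscap.ini file; get_browser() then matches HTTP_USER_AGENT for you.
$info = get_browser(null, true); // null = use $_SERVER['HTTP_USER_AGENT']

// Depending on the browscap file, the flag may come back as a boolean,
// "1", or the string "true"/"false"; normalise it before testing.
$isCrawler = ($info !== false)
    && !empty($info['crawler'])
    && $info['crawler'] !== 'false';

if (!$isCrawler) {
    // increment_view_counter($pageId); // count only non-crawler hits
}
```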

 

But you'll never be 100% sure you're counting only real user views and not bot views, since a bot can send the User-Agent string of, say, Chrome or Firefox. I think the main reason Google Analytics counts views via JS is to make sure it doesn't count crawlers (use a clever <noscript> fallback to count visitors that have JS disabled?). IMO that's the way to go.
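To make the JS + <noscript> idea concrete, here's a sketch. The page fires a beacon request from JavaScript, with a <noscript> image as the fallback; most crawlers fetch neither. count.php, increment_view_counter(), and $pageId are made-up names for illustration:

```php
<?php /* In the page template: */ ?>
<script>
  // Real browsers execute this and request the counting pixel
  // (cache-busted with a timestamp).
  (function () {
    var img = new Image();
    img.src = '/count.php?page=<?php echo urlencode($pageId); ?>&js=1&t=' + new Date().getTime();
  })();
</script>
<noscript>
  <!-- JS-disabled browsers still load this image. -->
  <img src="/count.php?page=<?php echo urlencode($pageId); ?>&js=0" width="1" height="1" alt="">
</noscript>
```

And the counting endpoint:

```php
<?php
// count.php: increments the counter and returns a 1x1 transparent GIF
// so the <img> request gets a valid response.
$pageId = isset($_GET['page']) ? $_GET['page'] : '';
if ($pageId !== '') {
    // increment_view_counter($pageId, $_GET['js'] === '1' ? 'js' : 'noscript');
}

header('Content-Type: image/gif');
echo base64_decode('R0lGODlhAQABAIAAAAAAAP///yH5BAEAAAAALAAAAAABAAEAAAIBRAA7');
```

Logging the js flag separately also tells you how many real visitors browse with JavaScript disabled.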
