[SOLVED] check referrer is a search engine


Dragen

Hi,

I've got a file which collects data about the people who view a site. It records the referrer, and I'm now trying to work out whether the referring url is a search engine or not. I've got a large list of search engines, and I could simply check whether the url starts with the search engine's url.

such as:

if(strpos($ref, $s_engine) === 0){ // $s_engine is the search engine's url, $ref is the referrer's
echo 'search engine found';
}else{
echo 'not a search engine';
}

But that wouldn't take into account all of the sub-domains that search engines use. For instance, google isn't just google.com. You've also got:

google.com
images.google.com
news.google.com
maps.google.com
etc

and even then you've got the .co.uk, .whatever endings.

 

I'm using parse_url on the referrer to get the hostname:

<?php
$ref = 'http://www.google.co.uk/search?q=jkgvyk';
$ref = parse_url($ref, PHP_URL_HOST);

echo $ref; // this would output: www.google.co.uk
?>

Which is simple enough. I can then get rid of the 'www.' and '.co.uk' and I'm left with google which I can run through the database.

 

Now if I have 'www.images.google' it's more awkward. I don't want to have to store every different subdomain and every '.com', '.co.uk' etc.

So I'm trying to extract just the part I need. It's also because if someone has a url such as:

http://www.google.mysite.com/

I blatantly don't want that to be classed as a search engine url.
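To make the problem concrete, here is a minimal sketch of the naive prefix test. The helper name starts_with_engine and the stem 'http://www.google.' are made up for illustration. It shows both failure modes: a real subdomain is missed and a look-alike host is accepted.

```php
<?php
// Hypothetical helper: true when $ref begins with the given URL stem.
function starts_with_engine(string $ref, string $stem): bool
{
    return strncmp($ref, $stem, strlen($stem)) === 0;
}

// Assumed stem, cut short so that .com, .co.uk etc. all match.
$stem = 'http://www.google.';

var_dump(starts_with_engine('http://www.google.co.uk/search?q=jkgvyk', $stem)); // bool(true)
var_dump(starts_with_engine('http://images.google.com/', $stem));               // bool(false), subdomain missed
var_dump(starts_with_engine('http://www.google.mysite.com/', $stem));           // bool(true), false positive
```

So whatever the final check is, it probably has to work on the hostname's parts rather than on the start of the raw url.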

 

Any ideas?


hmm... Thinking about it, all I need to get rid of is the '.co.uk', and '.com' etc.

I could easily do that using str_replace and an array containing every possible ending, but that's a bit annoying. There must be a general-purpose regular expression that can catch them all, but all the ones I've seen have major problems, such as not recognising a lot of the less well known endings.
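One thing to watch with the str_replace idea: it replaces a match anywhere in the string, not only at the end, so a host that merely contains '.com' gets mangled. A rough sketch of an end-anchored alternative follows. The helper strip_suffix and the four-entry suffix list are assumptions for illustration, and the list is nowhere near complete.

```php
<?php
// Hypothetical helper: remove one known suffix, but only from the END of $host.
function strip_suffix(string $host, array $suffixes): string
{
    // Try longer suffixes first so '.co.uk' is preferred over '.uk'.
    usort($suffixes, fn($a, $b) => strlen($b) - strlen($a));
    foreach ($suffixes as $suffix) {
        if (substr($host, -strlen($suffix)) === $suffix) {
            return substr($host, 0, -strlen($suffix));
        }
    }
    return $host; // no known suffix found, leave the host unchanged
}

$suffixes = ['.com', '.co.uk', '.uk', '.us']; // illustrative only

// str_replace also eats the '.com' inside '.computing':
echo str_replace($suffixes, '', 'comet.computing.co.uk'), "\n"; // cometputing

// The anchored version only touches the ending:
echo strip_suffix('comet.computing.co.uk', $suffixes), "\n";    // comet.computing
```

This still needs a list of endings, so it doesn't remove the annoyance, it only avoids the wrong matches.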


just thought I'd say that I've solved it.. really simple.

I wrote a function which I run through the urls to check them:

<?php
function search_hits(array $hit_list, $k){
	// Engine name => url. Each key must be the bare domain label of its url.
	// (Held in a local variable so the function also runs outside a class.)
	$engines = array(
		'alexa' => 'http://www.alexa.com/',
		'altavista' => 'http://www.altavista.com/',
		'ask' => 'http://www.ask.com/',
		'dogpile' => 'http://www.dogpile.com/',
		'exalead' => 'http://www.exalead.com/',
		'gigablast' => 'http://www.gigablast.com/',
		'google' => 'http://www.google.com/',
		'live' => 'http://www.live.com/',
		'searchenginewatch' => 'http://searchenginewatch.com/',
		'yahoo' => 'http://search.yahoo.com/',
		'yell' => 'http://www.yell.com/',
	);

	$r = array();

	foreach($hit_list as $v){
		if(isset($v[$k]) && ($v[$k] != '') && ($host = parse_url($v[$k], PHP_URL_HOST))){
			// Drop 'www.' and then a trailing .xx, .xx.xx or .xxx ending.
			// preg_replace replaces ereg_replace, which was removed in PHP 7.
			$host = preg_replace('/\.([a-z]{2}(\.[a-z]{2})?|[a-z]{3})$/', '', str_replace('www.', '', $host));
			// If a subdomain is left, keep what follows the first dot (images.google => google).
			if(strstr($host, '.') !== false){
				$host = strstr($host, '.');
			}
			$host = trim($host, '.');

			if(array_key_exists($host, $engines)){
				$r[$host]['ip'][] = $v['ip'];
			}
		}
	}

	// Matches grouped by engine name, or false when nothing matched.
	return $r ? $r : false;
}
?>

Basically I have an array with the name and url for each search engine (the name must be the domain part of the url, e.g. 'google' for http://www.google.com/).

Then go through my array of urls and get the host with parse_url, which gives me something like:

www.mydomain.com

 

I use str_replace to get rid of the 'www.' and a regular expression to remove the end section. It checks for several end combinations:

.'2 letters' (e.g. .us)

.'2 letters'.'2 letters' (e.g. .co.uk)

.'3 letters' (e.g. .com)

 

Then simply go through the search engine array for matches.

 

simple ;)
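For anyone who wants to try the steps on single urls, here is a small sketch that walks through them one at a time. The helper name engine_key is made up, it uses preg_replace rather than the ereg functions, it only strips a leading 'www.', and it keeps the last remaining label instead of cutting at the first dot, so treat it as a variation on the function above rather than a copy of it.

```php
<?php
// Hypothetical helper: reduce a referrer url to the label that would be
// looked up in the engines array, or null if the url has no host.
function engine_key(string $url): ?string
{
    $host = parse_url($url, PHP_URL_HOST);
    if (!is_string($host) || $host === '') {
        return null;
    }
    $host = strtolower($host);
    // Step 1: drop a leading 'www.'.
    $host = preg_replace('/^www\./', '', $host);
    // Step 2: drop a trailing .xx, .xx.xx or .xxx ending.
    $host = preg_replace('/\.([a-z]{2}(\.[a-z]{2})?|[a-z]{3})$/', '', $host);
    // Step 3: keep only the last label that is left.
    $pos = strrpos($host, '.');
    return $pos === false ? $host : substr($host, $pos + 1);
}

var_dump(engine_key('http://www.google.co.uk/search?q=jkgvyk')); // string(6) "google"
var_dump(engine_key('http://images.google.com/'));               // string(6) "google"
var_dump(engine_key('http://www.google.mysite.com/'));           // string(6) "mysite"
var_dump(engine_key('http://www.google.com.au/'));               // string(3) "com"
```

The last line shows a limit of the pattern: a '3 letters'.'2 letters' ending such as .com.au only loses its '.au', so that referrer would not be matched. Endings longer than three letters (e.g. .info) are not stripped either.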
