Dragen Posted November 12, 2007 Share Posted November 12, 2007

Hi, I've got a file which collects data about people who view a site. It collects the referrer, and I'm now trying to work out whether the referring URL is a search engine or not. I've got a large list of search engines, and I could simply check whether the URL starts with the search engine's URL, such as:

if(ereg('^' . $s_engine, $ref)){ // $s_engine is the search engine's URL, $ref is the referrer's
    echo 'search engine found';
}else{
    echo 'not a search engine';
}

But that wouldn't take into account all of the subdomains search engines use. For instance, Google isn't just google.com; you've got:

google.com
images.google.com
news.google.com
maps.google.com

and so on, and even then you've got the .co.uk and other endings. I'm using parse_url on the referrer to get the hostname:

<?php
$ref = 'http://www.google.co.uk/search?q=jkgvyk';
$ref = parse_url($ref, PHP_URL_HOST);
echo $ref; // this would output: www.google.co.uk
?>

Which is simple enough. I can then get rid of the 'www.' and '.co.uk' and I'm left with 'google', which I can run through the database. But if I have 'www.images.google' it's more awkward. I don't want to have to store every different subdomain and '.com', '.co.uk' etc., so I'm trying to extract just the part I need. It also matters because if someone has a URL such as http://www.google.mysite.com/ I blatantly don't want that to be classed as a search engine URL. Any ideas?
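One way to sketch this (my own assumption, not a tested solution — the `is_search_engine` name, the regex, and the example `$engines` list are all hypothetical) is to suffix-match the host against each engine's domain, allowing an optional subdomain prefix and a short trailing TLD, so that images.google.co.uk matches but google.mysite.com does not:

```php
<?php
// Hypothetical sketch: check whether a referrer's host belongs to a known
// search-engine domain, regardless of subdomain ("images.google.com") or
// country ending ("google.co.uk").
function is_search_engine($ref, array $engines)
{
    $host = parse_url($ref, PHP_URL_HOST);
    if ($host === false || $host === null) {
        return false;
    }
    foreach ($engines as $engine) {
        // The engine name must sit on a label boundary and be followed
        // only by a TLD (".com", ".us", ".co.uk"), never by another
        // domain like ".mysite.com".
        $pattern = '/(^|\.)' . preg_quote($engine, '/') . '\.([a-z]{2,3}\.)?[a-z]{2,4}$/i';
        if (preg_match($pattern, $host)) {
            return true;
        }
    }
    return false;
}

var_dump(is_search_engine('http://images.google.co.uk/search?q=x', array('google'))); // bool(true)
var_dump(is_search_engine('http://www.google.mysite.com/', array('google')));         // bool(false)
?>
```

The anchor at the end of the pattern is what rules out google.mysite.com: after the engine name, only TLD-shaped labels may remain.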
Dragen Posted November 12, 2007 Author Share Posted November 12, 2007

Hmm... thinking about it, all I need to get rid of is the '.co.uk', '.com' etc., which I could easily do using str_replace and an array containing every possible ending, but that's a bit tedious. There must be a general-purpose regex that can catch it, but all the ones I've seen have major problems, such as not recognising a lot of the less well-known endings.
Dragen Posted November 13, 2007 Author Share Posted November 13, 2007

any ideas?
Dragen Posted November 17, 2007 Author Share Posted November 17, 2007

Just thought I'd say that I've solved it... really simple. I wrote a function which I run the URLs through to check them:

<?php
function search_hits(array $hit_list, $k){
    $engines = array(
        'alexa' => 'http://www.alexa.com/',
        'altavista' => 'http://www.altavista.com/',
        'ask' => 'http://www.ask.com/',
        'dogpile' => 'http://www.dogpile.com/',
        'exalead' => 'http://www.exalead.com/',
        'gigablast' => 'http://www.gigablast.com/',
        'google' => 'http://www.google.com/',
        'live' => 'http://www.live.com/',
        'searchenginewatch' => 'http://searchenginewatch.com/',
        'yahoo' => 'http://search.yahoo.com/',
        'yell' => 'http://www.yell.com/',
    );
    $r = array();
    foreach($hit_list as $v){
        if(isset($v[$k]) && ($v[$k] != '') && ($host = parse_url($v[$k], PHP_URL_HOST))){
            // strip a leading 'www.' and the trailing ending (.com, .us, .co.uk etc.)
            $host = ereg_replace('\.([a-z]{2}(\.[a-z]{2})?|[a-z]{3})$', '', str_replace('www.', '', $host));
            // if subdomains remain (e.g. 'images.google'), keep only the last label
            if(($tail = strrchr($host, '.')) !== false){
                $host = $tail;
            }
            $host = trim($host, '.');
            if(array_key_exists($host, $engines)){
                $r[$host]['ip'][] = $v['ip'];
            }
        }
    }
    return count($r) ? $r : false;
}
?>

Basically I have an array with the name and URL of each search engine (the key must be the domain part of the URL). I go through my array of URLs and get the host with parse_url, which gives me something like www.mydomain.com. Then I use str_replace to get rid of the 'www.' and ereg_replace to remove the end section. It checks for several end combinations:

.'2 letters' (i.e. .us)
.'2 letters'.'2 letters' (i.e. .co.uk)
.'3 letters' (i.e. .com)

Then I simply go through the search engine array for matches. Simple!
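For anyone reading this later: the ereg family was deprecated in PHP 5.3 and removed in PHP 7, so the host-trimming step would need a preg-based equivalent. A sketch of the same logic (the `engine_key` name is mine, the regex is a direct translation of the ereg_replace pattern above):

```php
<?php
// Hypothetical preg-based version of the host-trimming step.
// Strips a leading "www." and a trailing ending such as ".com", ".us"
// or ".co.uk", then keeps only the last remaining label,
// so "www.images.google.co.uk" becomes "google".
function engine_key($host)
{
    $host = preg_replace('/^www\./', '', $host);
    $host = preg_replace('/\.([a-z]{2}(\.[a-z]{2})?|[a-z]{3})$/', '', $host);
    $parts = explode('.', $host);
    return end($parts);
}

echo engine_key('www.images.google.co.uk'); // google
?>
```

Note that anchoring the www pattern with ^ also avoids a subtle issue with plain str_replace('www.', ...), which would strip "www." anywhere in the host, not just at the start.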