Crawler (2)


hackalive

Okay so thorpe has locked my post and pointed out it is far too vague. So here I go again.

 

Firstly, I know how to build a basic PHP cURL web page crawler. My question is more specific than that: how do I get Site Stats to recognise my crawler by name, the way other crawlers such as Googlebot and MLBot are listed (MLBot http://www.metadatalabs.com/mlbot, so mine would be HackAliveCrawler http://mysite.com/hackalivecrawler or similar)? And how do I get the crawler to recognise ROBOTS meta tags (i.e. NOINDEX, NOFOLLOW, or INDEX, FOLLOW, etc.) and robots.txt? To do this I need the first part to work, the name part (e.g. Googlebot or MLBot). Hope this is less vague and can produce some answers.

 

Links to tutorials or other forums that will achieve the desired result, as well as personal opinions and comments, are most welcome. Many thanks in advance.


Now my question is more specific to this, how do I get Site Stats to recognise the crawler as other ones such as Google Bot and MLBot are listed

 

The User-Agent request header. Stats packages list a crawler under whatever User-Agent string it sends with its requests.

 

And how do I get it to recognise ROBOTS meta tags, i.e. NOINDEX, NOFOLLOW, or INDEX, FOLLOW etc. and robots.txt (to do this I need the first part to work, the name part, e.g. Googlebot or MLBot).

 

Before crawling a site, check whether a robots.txt file exists and adhere to it. Crawl only the pages (or pages within directories) that are not disallowed in robots.txt, and look for a ROBOTS meta tag in each page you fetch. If the ROBOTS meta tag contains a directive such as NOINDEX or NOFOLLOW, act accordingly.
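
A rough sketch of both checks, assuming a very simple robots.txt (only `User-agent` and `Disallow` lines, plain prefix matching, no wildcards or `Allow` rules). The function names `isPathAllowed` and `getRobotsMetaDirectives` are made up for illustration, and a real crawler would probably want a fuller parser:

```php
<?php
// Hypothetical helper: returns false if $path is disallowed for $agent by
// the given robots.txt body. Simplification: rules from the "*" group and
// from any group naming our crawler are all applied together.
function isPathAllowed(string $robotsTxt, string $agent, string $path): bool
{
    $applies = false;      // are we inside a group that applies to us?
    $inAgentLines = false; // was the previous directive a User-agent line?
    foreach (preg_split('/\r\n|\r|\n/', $robotsTxt) as $line) {
        $line = trim(preg_replace('/#.*/', '', $line)); // strip comments
        if ($line === '' || strpos($line, ':') === false) {
            continue;
        }
        [$field, $value] = array_map('trim', explode(':', $line, 2));
        $field = strtolower($field);
        if ($field === 'user-agent') {
            // Consecutive User-agent lines belong to the same group.
            if (!$inAgentLines) {
                $applies = false;
            }
            $inAgentLines = true;
            if ($value === '*' || ($value !== '' && stripos($agent, $value) !== false)) {
                $applies = true;
            }
        } else {
            $inAgentLines = false;
            if ($field === 'disallow' && $applies && $value !== ''
                && strpos($path, $value) === 0) {
                return false; // path starts with a disallowed prefix
            }
        }
    }
    return true;
}

// Hypothetical helper: returns the lower-cased directives found in
// <meta name="robots" content="..."> tags, e.g. ['noindex', 'nofollow'].
// Assumes $html is a non-empty string.
function getRobotsMetaDirectives(string $html): array
{
    $doc = new DOMDocument();
    @$doc->loadHTML($html); // suppress warnings from sloppy real-world HTML
    $directives = [];
    foreach ($doc->getElementsByTagName('meta') as $meta) {
        if (strtolower($meta->getAttribute('name')) === 'robots') {
            foreach (explode(',', $meta->getAttribute('content')) as $d) {
                $directives[] = strtolower(trim($d));
            }
        }
    }
    return $directives;
}
```

If the returned list contains `noindex`, skip indexing the page; if it contains `nofollow`, don't queue the links found on it.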


okay slight problem, this code

<?php
header('User-Agent: HackAliveCrawler http://mysite.com/hackalivecrawler');
echo $_SERVER['HTTP_USER_AGENT'];
?>

 

returns

Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.1; Trident/4.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; InfoPath.3; .NET4.0C; .NET4.0E)

 

not the HackAliveCrawler string I set. Any reason why?

 

if I use USER_AGENT and not HTTP_USER_AGENT it returns nothing

 

thanks once again

 


I believe that'll be the browser overwriting whatever you send. If you're using cURL to crawl the pages, as I mentioned before, just send it as a cURL option.

 

Edit: Should add I'm only speculating about the browser, I'm not fully sure how that part works.
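
For what it's worth, a sketch of what is probably going on, assuming the script is opened in a browser: `header()` adds a header to the response the script sends back, while `$_SERVER['HTTP_USER_AGENT']` is filled from the User-Agent header of the incoming request, i.e. whatever the visiting browser sent. Request headers appear in `$_SERVER` with an `HTTP_` prefix, which is why plain `USER_AGENT` comes back empty. The helper name `visitorUserAgent` is made up for illustration:

```php
<?php
// Hypothetical helper: reads the User-Agent of whoever requested this page
// from a $_SERVER-style array. Nothing the script sends with header()
// shows up here, because header() only affects the outgoing response.
function visitorUserAgent(array $server): string
{
    return $server['HTTP_USER_AGENT'] ?? '(no User-Agent sent)';
}
```

Calling `visitorUserAgent($_SERVER)` from a page opened in a browser returns that browser's string, regardless of any `header('User-Agent: ...')` call made earlier in the script.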


MrAdam

Warning: curl_setopt(): supplied argument is not a valid cURL handle resource

 

for this code

<?php
curl_setopt($ch, CURLOPT_USERAGENT, 'HackAliveCrawler http://mysite.com/hackalivecrawler');
echo $_SERVER['HTTP_USER_AGENT'];
echo "|";
echo $_SERVER['USER_AGENT'];
?>


MrAdam

Warning: curl_setopt(): supplied argument is not a valid cURL handle resource

 

for this code

<?php
curl_setopt($ch, CURLOPT_USERAGENT, 'HackAliveCrawler http://mysite.com/hackalivecrawler');
echo $_SERVER['HTTP_USER_AGENT'];
echo "|";
echo $_SERVER['USER_AGENT'];
?>

 

You haven't defined $ch.


MrAdam

Warning: curl_setopt(): supplied argument is not a valid cURL handle resource

 

for this code

<?php
curl_setopt($ch, CURLOPT_USERAGENT, 'HackAliveCrawler http://mysite.com/hackalivecrawler');
echo $_SERVER['HTTP_USER_AGENT'];
echo "|";
echo $_SERVER['USER_AGENT'];
?>

 

That's not how it works. As thorpe has just replied, you need to define $ch (look at the cURL examples in the manual), but even then it only sets the user agent for the request you're going to make with cURL. It doesn't change what $_SERVER['HTTP_USER_AGENT'] holds in your own script.
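
Something along these lines, as a minimal sketch. The helper names `crawlerCurlOptions` and `fetchPage`, the timeout and the redirect setting are assumptions for illustration; only `CURLOPT_USERAGENT` is the part that matters here:

```php
<?php
// Hypothetical helper: builds the cURL options for one crawler request.
function crawlerCurlOptions(string $url, string $userAgent): array
{
    return [
        CURLOPT_URL            => $url,
        CURLOPT_USERAGENT      => $userAgent, // sent as the User-Agent request header
        CURLOPT_RETURNTRANSFER => true,       // return the body instead of printing it
        CURLOPT_FOLLOWLOCATION => true,       // follow redirects
        CURLOPT_TIMEOUT        => 10,         // seconds (arbitrary choice)
    ];
}

// Hypothetical helper: fetches $url and returns the body, or null on failure.
function fetchPage(string $url, string $userAgent): ?string
{
    $ch = curl_init(); // this is the $ch that was never defined
    curl_setopt_array($ch, crawlerCurlOptions($url, $userAgent));
    $body = curl_exec($ch);
    // The handle is released when $ch goes out of scope.
    return $body === false ? null : $body;
}
```

A call such as `fetchPage('http://example.com/', 'HackAliveCrawler http://mysite.com/hackalivecrawler')` sends that string to the remote site, and that is where it ends up in logs and stats.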
