Identifying web crawlers / spiders by IP address


ajetrumpet

hey guys,

in the attached image, I'm logged into another forum I'm part of and I'm looking at the page called "who's online".  I have a PHP traffic report page that, when accessed, echoes out database data stored by another PHP script, which captures visitor data (IP address of the ISP, referrer page, date/time of visit) using PHP superglobal variables.  my question is: how does this forum script know the identity of the Google spiders?  in my traffic report, I am only capturing the IP address of the ISP as the identifying information.  from what I understand, it's not possible to capture the actual location of the visitor, only the ISP's location.  if I look up the IP address on an IP lookup website, I can see that it is a Google spider, but can this be done through PHP scripting?

[attached image: visitor identification]

Friendly spiders such as Google's identify themselves via the User-Agent header in their HTTP requests.  For example, Google sends:

Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)

This header is most likely how the forum decides whether a visitor is Google or not.
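
As a rough sketch of the User-Agent check described above, something like the following could label known crawlers in a traffic report. The function name `identifyCrawler` and the signature list are assumptions made for illustration; the forum software may do this differently, and the list is nowhere near exhaustive.

```php
<?php
// Hypothetical helper: returns a label for a known crawler, or null
// when the User-Agent string matches none of the signatures below.
function identifyCrawler(string $userAgent): ?string
{
    // Substring to look for => label to show. Illustrative only.
    $signatures = [
        'Googlebot' => 'Google',
        'bingbot'   => 'Bing',
    ];

    foreach ($signatures as $needle => $label) {
        // stripos() is case-insensitive and returns false when not found.
        if (stripos($userAgent, $needle) !== false) {
            return $label;
        }
    }

    return null;
}

// The header is optional, so fall back to an empty string.
$ua = $_SERVER['HTTP_USER_AGENT'] ?? '';
echo identifyCrawler($ua) ?? 'not a known crawler';
```

Keep in mind that the User-Agent header is set by the client, so anyone can send a Googlebot string. That is fine for a "who's online" display but not for anything that needs to be trusted.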

If you follow the link in Google's User-Agent header, they describe how to verify that an IP belongs to Googlebot by doing a reverse DNS lookup on it.
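
A minimal sketch of that reverse DNS check, assuming the hostname should end in googlebot.com or google.com (check Google's page for the current list of domains). The function names `isGoogleHost` and `verifyGooglebot` are made up for this example. The forward lookup at the end confirms that the hostname really resolves back to the same IP, since reverse DNS records on their own can be faked by whoever controls the IP block.

```php
<?php
// Hypothetical helper: true if $host sits under googlebot.com or google.com.
// The domain list is an assumption; verify it against Google's documentation.
function isGoogleHost(string $host): bool
{
    return (bool) preg_match('/\.(googlebot|google)\.com$/i', $host);
}

// Hypothetical helper: reverse lookup, domain check, then forward lookup.
function verifyGooglebot(string $ip): bool
{
    // gethostbyaddr() returns the IP unchanged when the lookup fails,
    // and false when the input is malformed.
    $host = gethostbyaddr($ip);
    if ($host === false || $host === $ip || !isGoogleHost($host)) {
        return false;
    }

    // gethostbynamel() returns a list of IPv4 addresses, or false on
    // failure. It does not handle IPv6, so this sketch is IPv4 only.
    $forward = gethostbynamel($host);

    return $forward !== false && in_array($ip, $forward, true);
}

// Example use in a logging script:
// $isGoogle = verifyGooglebot($_SERVER['REMOTE_ADDR']);
```

DNS lookups are slow, so it is probably better to run this once when the visit is logged, or to cache the result per IP, rather than on every page view of the report.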

Other spiders may or may not have a similar IP verification mechanism; you'd have to research them individually.

 

kicken,

I ran a test with all of these included:

<?php

echo "ip - " . $_SERVER['REMOTE_ADDR'];
echo "<br>";
echo "gethostbyaddr - " . gethostbyaddr($_SERVER['REMOTE_ADDR']);
echo "<br>";
echo "uname - " . php_uname();
echo "<br>";
echo "gethostname() - " . gethostname();
echo "<br>";
echo "HTTP_HOST - " . $_SERVER['HTTP_HOST'];
echo "<br>";
echo "SERVER_NAME - " . $_SERVER['SERVER_NAME'];

?>

this is really good info and I think I'll use it.  one question though:   HTTP_HOST and SERVER_NAME return the same result.  is there any scenario where they would *not* return the same?

additionally kicken,

I think I might have a corrupted file.  My query for my report is:

    $sql = mysqli_query($conn, "SELECT ip
                                     , page
                                     , CASE WHEN referrer = ''
                                            THEN 'N/A'
                                            ELSE referrer
                                       END as referrer     
                                     , DATE_FORMAT(date, '%m/%d/%y') as date
                                     , TIME_FORMAT(logged, '%T') as time
                                FROM tblTraffic 
                                ORDER BY date DESC, time DESC");

and my PHP echo code is:

<body>
    <table border='1'>
        <tr>
            <th>VISITOR IP ADDRESS, ISP NAME</th>
            <th>VISITOR DOMAIN ADDRESS<th>
            <th>PAGE VISITED</th>
            <th>DATE</th>
            <th>TIME</th>
        </tr>
        
        <?php
            // printing table rows
            while($row = mysqli_fetch_row($sql)) {
                echo '<tr>'; 
                foreach ($row as $key => $col) {
                    echo "<td>$col</td>";
                }
                echo '</tr>';
            }
            
        ?>
    </table>
</body>

I attached an image of what I'm seeing as output.  There is an extra column without a header, and data is still being output in it even though I'm not querying 6 columns.  can you see something wrong with this?

output.jpg
