
Hi

 

Please take a look at the following code. It is running as a shell process. I have tested it a number of times and it simply stops inserting data into the database at random points: one run it may insert 100 records and then stop doing anything, the next it may only insert 20 and stop, and the next test may insert 500 and then stop.

 

It's really bugging me!

 

<?php
set_time_limit(0);

require 'dbconnect.php';

// starting point
$startingUrl = 'http://www.alexa.com/topsites/category';

// set xPath
$xPath = "/html/body//div[@class='categories top']//ul//a";

// scrape through categories at all levels
if ( scrapeCategories ( $startingUrl, $xPath ) ) {
        
    // terminate app
    exit;
    
}

function scrapeCategories ( $url, $currentXPath, $parentID = 0 ) {
    
    // sleep for random seconds 5-20
    //sleep ( rand(5,20));
    
    // reset error number
    $errorNo = 0;
    $errorTxt = '';
    $return = '';
    
    // loop while a timeout is occurring, we can't connect, we get an empty reply, or a 403
    do {
        
        // load up a random proxy    
        $proxy = mysql_query ( "SELECT * FROM `proxies` ORDER BY rand() LIMIT 0,1");
        $proxy = mysql_fetch_array ( $proxy );
    
        // start curl
        $curl = curl_init ( $url );
                
        // set curl options
        curl_setopt ( $curl, CURLOPT_FAILONERROR, true );
        curl_setopt ( $curl, CURLOPT_FOLLOWLOCATION, true );
        curl_setopt ( $curl, CURLOPT_RETURNTRANSFER, true );
        curl_setopt ( $curl, CURLOPT_CONNECTTIMEOUT, 10 );
        curl_setopt ( $curl, CURLOPT_COOKIE, "safe-mode=off" );
        curl_setopt ( $curl, CURLOPT_PROXY, $proxy['ip'] . ':' . $proxy['port'] );
        curl_setopt ( $curl, CURLOPT_HEADER, true );
        curl_setopt ( $curl, CURLOPT_PROXYTYPE, CURLPROXY_SOCKS5 );
        curl_setopt ( $curl, CURLOPT_TIMEOUT, 10 );
        
        // run curl
        $return = curl_exec ( $curl );
            
        // get curl error number
        $errorNo = curl_errno ( $curl );
        $errorTxt = curl_error ( $curl );
        
        // close off curl and free memory
        curl_close ( $curl );

    } while ( $errorNo > 0 && $errorTxt != 'The requested URL returned error: 404' );
    
    // the retry loop above exits either on success (errorNo == 0)
    // or on a 404 ("The requested URL returned error: 404");
    // only carry on if we actually fetched the page
    if ( $errorNo == 0 && $errorTxt != 'The requested URL returned error: 404' ) {
       
        // got the page !!!!
        
        // create DOM object
        $dom = new DOMDocument();
        
        // load up page html into dom
        $dom->loadHTML ( $return );
        
        // create xpath object
        $xpath = new DOMXpath ( $dom );
        
        // find links
        $hrefs = $xpath->evaluate ( $currentXPath );
        
        
        // if we have links, then there are categories to scrape!
        if ( $hrefs->length > 0 ) {
            
            for ( $i = 0; $i < $hrefs->length; $i++ ) {
               
               // get link object
               $href = $hrefs->item ( $i );
               
               // get url
               $linkUrl = 'http://www.alexa.com' . $href->getAttribute ( 'href' );
               
               // get link text
               $linkText = $href->nodeValue;
               
               echo "<a href='http://www.alexa.com$linkUrl'>$linkText</a><br />";
               // insert into database
               mysql_query ( "INSERT INTO `categories` VALUES (
                   '0',
                   '$parentID',
                   '" . trim(utf8_encode ($linkText)) . "',
                   '" . trim(utf8_encode ($linkUrl)) . "',
                   '0',
                   '0'
               );") or die ( mysql_error());
               
               // get the category id
               //$catID = mysql_insert_id();
               
               // scrape this sub category
               
               // new xpath
               //$xpath = "/html/body//div[@id='catList']//ul//a";
               
               // recursive function call
               //scrapeCategories ( 'http://www.alexa.com' . $linkUrl, $xpath, $catID );
               
            }
        
            // destroy DOM
            unset ( $dom );
            unset ( $xpath );
            
            //return;
            
            // select all categories with scraped = '0'
            $categories = mysql_query ( "SELECT * FROM `categories` WHERE `scraped` = '0' AND `parent_id` = '$parentID'") or die ( mysql_error() );
            
            if ( mysql_num_rows ( $categories ) ) {
                
                while ( $category = mysql_fetch_array ( $categories ) ) {
                    
                    // set xpath for sub categories
                    $xpath = "/html/body//div[@id='catList']//ul//a";
                    
                    // set this category to scraped!
                    mysql_query ( "UPDATE `categories` SET `scraped` = '1' WHERE `id` = '$category[id]'") or die ( mysql_error() );
                    
                    // scrape!
                    scrapeCategories ( $category['url'], $xpath, $category['id'] );
                    
                    
                    
                }
                
                // clear memory                
                return;
            } else {
            
                // clear memory
                return;
            
            }
        
        } else {
      
            return;
        }
    
    }
}

?>


Are you running this via the CLI or a browser? If a browser, some browsers have their own timeouts built in, i.e. if no data is displayed on the page or sent to the browser in x amount of time, they time out. So you may want to try running it from the CLI if you are not doing so already.

 

If you insist on using a browser, you will need to try flushing some data to the browser in an attempt to keep it alive; however, that only works in some browsers.
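In case it helps, here is a minimal sketch of that kind of keep-alive flushing. The work loop, the 50-iteration interval, and the padding size are placeholders of mine, not anything taken from the script above:

<?php
// sketch only: periodically push some output so the browser doesn't give up
// on a long-running request; the loop and interval below are placeholders
for ( $i = 1; $i <= 1000; $i++ ) {

    // ... do one unit of the real work here ...

    if ( $i % 50 == 0 ) {
        // some browsers ignore very small chunks, so pad the output a little
        echo str_repeat ( ' ', 1024 ) . "processed $i items<br />\n";
        if ( ob_get_level() > 0 ) {
            ob_flush(); // flush PHP's own output buffer if one is active
        }
        flush(); // ask the web server / SAPI to send the data now
    }
}
?>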

I think it dies when it tries to use one of the proxies. That would make sense, as the script dies at random times and you are fetching random proxies from a table. Also try writing the errors to a file (that's what I do with my cron scripts); then you can debug it a lot more easily.
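Something along these lines would do for the logging; the helper name and log path are just examples of mine, not anything from your script:

<?php
// sketch only: append a timestamped message to a log file so a CLI/cron run
// can be inspected after the fact; the path is an arbitrary example
function logError ( $message, $logFile = '/tmp/scraper-errors.log' ) {
    file_put_contents ( $logFile, date ( 'Y-m-d H:i:s' ) . '  ' . $message . "\n", FILE_APPEND );
}

// e.g. right after curl_close() in the posted script:
// logError ( "curl error $errorNo: $errorTxt for $url via {$proxy['ip']}:{$proxy['port']}" );
?>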

Hi

 

The problem there is that, if you notice, the proxy selection is inside a loop which checks whether an error other than a 404 was produced, so it would just loop back and pick a better proxy!
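For reference, that retry logic boils down to the loop below (condensed from the script above). The $maxTries cap is my addition, an assumed safeguard rather than anything in the original; the original do/while has no limit, so if every proxy keeps failing with something other than a 404, the script will spin in this loop indefinitely without inserting anything, which would look exactly like it has stopped:

<?php
// condensed from the posted script; $maxTries is NOT in the original,
// it is an assumed safeguard so a run of bad proxies can't block forever
require 'dbconnect.php';

$url      = 'http://www.alexa.com/topsites/category';
$maxTries = 10;
$tries    = 0;

do {
    // pick a random proxy, exactly as in the original
    $proxy = mysql_fetch_array ( mysql_query ( "SELECT * FROM `proxies` ORDER BY rand() LIMIT 0,1" ) );

    $curl = curl_init ( $url );
    curl_setopt ( $curl, CURLOPT_FAILONERROR, true );
    curl_setopt ( $curl, CURLOPT_RETURNTRANSFER, true );
    curl_setopt ( $curl, CURLOPT_PROXY, $proxy['ip'] . ':' . $proxy['port'] );
    curl_setopt ( $curl, CURLOPT_PROXYTYPE, CURLPROXY_SOCKS5 );
    curl_setopt ( $curl, CURLOPT_CONNECTTIMEOUT, 10 );
    curl_setopt ( $curl, CURLOPT_TIMEOUT, 10 );

    $return   = curl_exec ( $curl );
    $errorNo  = curl_errno ( $curl );
    $errorTxt = curl_error ( $curl );
    curl_close ( $curl );

} while ( $errorNo > 0
          && $errorTxt != 'The requested URL returned error: 404'
          && ++$tries < $maxTries );
?>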
