Code Stops Working At Random Intervals


glenelkins


Hi

 

Please take a look at the following code. It runs as a shell process. I have tested it a number of times, and it simply stops inserting data into the database at random intervals: one run it may insert 100 records and then stop doing anything, the next it may only insert 20 before stopping, and the next test it may insert 500 and then stop.

 

It's really bugging me!

 

<?php
set_time_limit(0);

require 'dbconnect.php';

// starting point
$startingUrl = 'http://www.alexa.com/topsites/category';

// set xPath
$xPath = "/html/body//div[@class='categories top']//ul//a";

// scrape through categories at all levels
if ( scrapeCategories ( $startingUrl, $xPath ) ) {
        
    // terminate app
    exit;
    
}

function scrapeCategories ( $url, $currentXPath, $parentID = 0 ) {
    
    // sleep for random seconds 5-20
    //sleep ( rand(5,20));
    
    // reset error number
    $errorNo = 0;
    $errorTxt = '';
    $return = '';
    
    // loop while a timeout is occurring, we can't connect, we get an empty reply or a 403
    do {
        
        // load up a random proxy    
        $proxy = mysql_query ( "SELECT * FROM `proxies` ORDER BY rand() LIMIT 0,1");
        $proxy = mysql_fetch_array ( $proxy );
    
        // start curl
        $curl = curl_init ( $url );
                
        // set curl options
        curl_setopt ( $curl, CURLOPT_FAILONERROR, true );
        curl_setopt ( $curl, CURLOPT_FOLLOWLOCATION, true);
        curl_setopt ( $curl, CURLOPT_RETURNTRANSFER, true );
        curl_setopt ( $curl, CURLOPT_CONNECTTIMEOUT, 10 );
        curl_setopt ( $curl, CURLOPT_COOKIE, "safe-mode=off");
        curl_setopt( $curl, CURLOPT_PROXY, $proxy['ip'] . ':' . $proxy['port'] );
        curl_setopt ( $curl, CURLOPT_HEADER, true );
        curl_setopt ( $curl, CURLOPT_PROXYTYPE, CURLPROXY_SOCKS5 );
        curl_setopt($curl, CURLOPT_TIMEOUT, 10);
        
        // run curl
        $return = curl_exec ( $curl );
            
        // get curl error number
        $errorNo = curl_errno ( $curl );
        $errorTxt = curl_error ( $curl );
        
        // close off curl and free memory
        curl_close ( $curl );

    } while ( $errorNo > 0 && $errorTxt != 'The requested URL returned error: 404' );
    
    // only carry on if curl finished without an error
    // and the response wasn't a 404 (i.e. the page exists)
    if ( $errorNo == 0 && $errorTxt != 'The requested URL returned error: 404' ) {
       
        // got the page !!!!
        
        // create DOM object
        $dom = new DOMDocument();
        
        // load up page html into dom
        $dom->loadHTML ( $return );
        
        // create xpath object
        $xpath = new DOMXpath ( $dom );
        
        // find links
        $hrefs = $xpath->evaluate ( $currentXPath );
        
        
        // if we have links, then there are categories to scrape!
        if ( $hrefs->length > 0 ) {
            
            for ( $i = 0; $i < $hrefs->length; $i++ ) {
               
               // get link object
               $href = $hrefs->item ( $i );
               
               // get url
               $linkUrl = 'http://www.alexa.com' . $href->getAttribute ( 'href' );
               
               // get link text
               $linkText = $href->nodeValue;
               
               // $linkUrl already includes the http://www.alexa.com prefix
               echo "<a href='$linkUrl'>$linkText</a><br />";
               // insert into database
               mysql_query ( "INSERT INTO `categories` VALUES (
                   '0',
                   '$parentID',
                   '" . trim(utf8_encode ($linkText)) . "',
                   '" . trim(utf8_encode ($linkUrl)) . "',
                   '0',
                   '0'
                   );") or die ( mysql_error());
               
               // get the category id
               //$catID = mysql_insert_id();
               
               // scrape this sub category
               
               // new xpath
               //$xpath = "/html/body//div[@id='catList']//ul//a";
               
               // recursive function call
               //scrapeCategories ( 'http://www.alexa.com' . $linkUrl, $xpath, $catID );
               
            }
        
            // destroy DOM
            unset ( $dom );
            unset ( $xpath );
            
            //return;
            
            // select all categories with scraped = '0'
            $categories = mysql_query ( "SELECT * FROM `categories` WHERE `scraped` = '0' AND `parent_id` = '$parentID'") or die ( mysql_error() );
            
            if ( mysql_num_rows ( $categories ) ) {
                
                while ( $category = mysql_fetch_array ( $categories ) ) {
                    
                    // set xpath for sub categories
                    $xpath = "/html/body//div[@id='catList']//ul//a";
                    
                    // set this category to scraped!
                    mysql_query ( "UPDATE `categories` SET `scraped` = '1' WHERE `id` = '$category[id]'") or die ( mysql_error() );
                    
                    // scrape!
                    scrapeCategories ( $category['url'], $xpath, $category['id'] );
                    
                    
                    
                }
                
                // clear memory                
                return;
            } else {
            
                // clear memory
                return;
            
            }
        
        } else {
      
            return;
        }
    
    }
}

?>


Are you running this via the CLI or a browser? If a browser, bear in mind that some browsers have their own built-in timeouts, i.e. if no data is displayed on the page or sent to the browser within a certain amount of time, they give up. So you may want to try running it from the CLI if you are not already.
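
For instance, a quick guard at the top of the script (scrape.php here is just a placeholder name) makes sure it only ever runs from the command line, so browser timeouts never come into play:

<?php
// illustrative only: refuse to run under a web server
if ( php_sapi_name() !== 'cli' ) {
    die ( "Run this from the command line instead, e.g.: php scrape.php\n" );
}
?>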

 

If you insist on using a browser, you will need to try to flush some data to the browser periodically in an attempt to keep the connection alive; however, that only works in some browsers.
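
For example, something along these lines inside your main loop (a minimal sketch; the padding size and message are arbitrary):

<?php
// minimal keep-alive sketch: push some output to the browser on each pass
echo str_repeat(' ', 1024) . "still working...<br />\n"; // padding helps some browsers render early
if ( ob_get_level() > 0 ) {
    ob_flush();   // flush PHP's own output buffer if one is active
}
flush();          // ask the web server / SAPI to send the buffered output now
?>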


I think it dies when it tries to use one of the proxies. That would make sense, as the script dies at random times and you are fetching random proxies from a table. Also, try writing the errors to a file (that's what I do with my cron scripts); then you can debug it a lot more easily.
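
Something like this is usually enough (the log path is just an example):

<?php
// minimal logging sketch: append a timestamped line to a file you can tail
function logError ( $message ) {
    file_put_contents ( '/tmp/scraper.log',
        date ( 'Y-m-d H:i:s' ) . ' ' . $message . "\n",
        FILE_APPEND );
}

// then e.g. inside the retry loop:
// logError ( "curl error $errorNo ($errorTxt) via {$proxy['ip']}:{$proxy['port']} for $url" );
?>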


Hi

 

The problem with that theory is that, if you notice, the proxy selection is inside a loop that keeps retrying whenever an error other than a 404 is produced, so it would just loop back round and pick a better proxy!
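
To illustrate, this is the retry pattern stripped down (purely a sketch; pickRandomProxy() and fetchViaProxy() stand in for the real query and curl calls, and the echo is only there so you can watch the retries happen):

<?php
// illustrative sketch: keep switching proxies until the request
// succeeds or the target genuinely returns a 404
$attempt = 0;
do {
    $attempt++;
    $proxy = pickRandomProxy();                                       // hypothetical helper
    list ( $return, $errorNo, $errorTxt ) = fetchViaProxy ( $url, $proxy ); // hypothetical helper
    echo "attempt $attempt via {$proxy['ip']}:{$proxy['port']} -> errno $errorNo\n";
} while ( $errorNo > 0 && $errorTxt != 'The requested URL returned error: 404' );
?>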
