Script Running As Shell Process Stops Doing Anything


glenelkins

Hi

 

I have written some code that scrapes links. The links are spread across thousands of categories and sub-categories, so I first created a category scraper that walks down through the category levels and saves each category's name and URL. The best way I could think of to make this work was a recursive function that uses cURL through a random proxy server. ... I finally managed to get this to complete the scraping, but not before it seemed to stop doing anything; I had to kill the process and restart it from where it had left off.

 

The second script loads the categories from the database and loops through them, again using cURL through a random proxy each time, to pull the links off each category page. When I looked this morning, the process was still running on the server, but it was inserting nothing new into the database.

 

Why do my scripts keep hanging? They still show in the process list but do absolutely nothing after a while. I don't understand why the script works, stores a load of data, and then just stops for no apparent reason. Could there be a memory leak somewhere, perhaps with cURL not giving memory back? I would have thought my server would tell me if that were the case, though.
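One way to test the memory-leak theory is to log the process's memory use on every pass of the loop. The sketch below assumes the script runs from the command line; `memoryLogLine` is a made-up helper name, and the loop is only a stand-in for the real scraping loop.

```php
<?php
// Hypothetical helper, not part of the original scripts: returns one log line
// with the current and peak memory use of this PHP process in megabytes.
function memoryLogLine ( $iteration ) {
    return sprintf (
        "iteration %d: current=%.2f MB peak=%.2f MB",
        $iteration,
        memory_get_usage ( true ) / 1048576,
        memory_get_peak_usage ( true ) / 1048576
    );
}

// Stand-in for the scraping loop: each pass keeps some data alive on purpose
// so that the logged figures have a reason to rise.
$kept = array();
for ( $i = 1; $i <= 3; $i++ ) {
    $kept[] = str_repeat ( 'x', 1024 * 1024 );
    echo memoryLogLine ( $i ) . PHP_EOL;
}
```

If the logged figures stay flat right up to the point where the script goes quiet, memory is probably not the cause. A script that exceeds `memory_limit` normally dies with a fatal "Allowed memory size exhausted" error rather than sitting idle in the process list, so a process that is still listed but doing nothing looks more like one that is blocked waiting on something.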


<?php

// Load up database connection
require 'dbconnect.php';

// create xpath
$_xPath = "/html/body//li[@class='site-listing']//h2//a";

// select all categories
$_result = mysql_query ( "SELECT * FROM `categories` WHERE `scraped` = '0'");

// loop through categories
while ( $category = mysql_fetch_array ( $_result ) ) {
    
    // set return flag for ping
    $return = 1;
    
    // chose a random proxy and make sure it is responding
    do {
        
        // select
        $_proxy = mysql_query ( "SELECT * FROM `proxies` ORDER BY rand() LIMIT 0,1");
        
        // check a proxy is selected
        if ( mysql_num_rows ( $_proxy ) ) {
            
            $_proxy = mysql_fetch_array ( $_proxy );
        
            // ping
            exec ( 'ping ' . $_proxy['ip'] . ' -t 10 -c 5',$output, $return );
            
            // if the return value is not 0..the proxy is not responding...so delete it
            if ( $return != 0 ) {
            
                echo "PING FAILED - DELETING PROXY<br />";
                
                 // delete
                mysql_query ( "DELETE FROM `proxies` WHERE `id` = '$_proxy[id]'");

                
            } else {
            
                echo "PING SUCCESS<br />";
                
                break;
            
            }
            
            
            
        } else {
            
            // no proxies in database
            die ( "NO PROXIES IN DATABASE");
        }
        
        
    } while ( $return != 0 );
    
    // get the page
    $_page = getPage ( 'http://www.alexa.com' . $category['url'], $_proxy['ip'], $_proxy['port'] );
    
    // check to see if the page is an array - if so its error
    if ( !is_array ( $_page ) ) {
        
        // update the category as being scraped
        mysql_query ( "UPDATE `categories` SET `scraped` = '1' WHERE `id` = '$category[id]'");
        
        // we have html...scrape websites
        $dom = new DOMDocument();

        // load up page html
        $dom->loadHTML ( $_page );

        // create xpath
        $xpath = new DOMXPath ( $dom );

        // find websites links
        $hrefs = $xpath->evaluate ( $_xPath );

        // if we have links
        if ( $hrefs->length > 0 ) {

            for ( $i = 0; $i < $hrefs->length; $i++ ) {

                $href= $hrefs->item ($i);

                $url = $href->nodeValue;
                        
                mysql_query ( "INSERT INTO `websites` VALUES (
				'0',
				'$category[id]',
				'" . utf8_encode( $url ) . "'
			);" );
                    
           }
           
        }
        
    } else {
        
        // store error
        mysql_query ( "INSERT INTO `failed_websites` VALUES (
                        '',
                        '$category[id]',
                        '$category[url]',
                        '" . $_page[1] . "',
                        '" . $_proxy['ip'] . $_proxy['port']  . "'
                    );");
    }

    
}

/* function: getPage
    grabs a specified url using CURL and returns the HTML
    on failure, returns the curl error as an array ( code, message );
*/
function getPage ( $url, $proxy, $port ) {

    // start curl
    $_curl = curl_init();
    
    // set curl options
    curl_setopt ( $_curl, CURLOPT_URL, $url );
    curl_setopt ( $_curl, CURLOPT_FAILONERROR, true );
    curl_setopt ( $_curl, CURLOPT_FOLLOWLOCATION, true);
    curl_setopt ( $_curl, CURLOPT_AUTOREFERER, true );
    curl_setopt ( $_curl, CURLOPT_RETURNTRANSFER, true );
    curl_setopt ( $_curl, CURLOPT_CONNECTTIMEOUT, 20 );
    curl_setopt ( $_curl, CURLOPT_COOKIE, "safe-mode=off");
    curl_setopt( $_curl, CURLOPT_PROXY, $proxy );
    curl_setopt( $_curl, CURLOPT_PROXYPORT, $port);
    
    // get the html
    $_html = curl_exec ( $_curl );
    
    //check to see if there is an error
    if (  curl_errno($_curl) > 0 ) {
        
        // make error array
        $_error = array ( curl_errno ( $_curl ), curl_error ( $_curl ) );


        curl_close($_curl);
        
        return $_error;
    
    } else{
        
        curl_close ( $_curl );
        
        return $_html;
    
    }
    
}


?>
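A likely candidate for the stall, though not confirmed from the post alone, is that `getPage` sets `CURLOPT_CONNECTTIMEOUT` but no `CURLOPT_TIMEOUT`. The connect timeout only limits the connection phase, so a proxy that accepts the connection and then never sends any data could leave `curl_exec` waiting indefinitely. The sketch below is a variant with an overall limit; `buildCurlOptions` and `getPageWithTimeout` are hypothetical names, and the 60 second total and 5 redirect limits are assumed values, not taken from the original.

```php
<?php
// Hypothetical helper: collects the cURL options in one array so the limits
// are easy to see. The 60 second total timeout is an assumed value.
function buildCurlOptions ( $url, $proxy, $port, $totalTimeout = 60 ) {
    return array (
        CURLOPT_URL            => $url,
        CURLOPT_FAILONERROR    => true,
        CURLOPT_FOLLOWLOCATION => true,
        CURLOPT_MAXREDIRS      => 5,              // assumed cap on redirect chains
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_PROXY          => $proxy,
        CURLOPT_PROXYPORT      => (int) $port,
        CURLOPT_CONNECTTIMEOUT => 20,             // limit for the connection phase only
        CURLOPT_TIMEOUT        => $totalTimeout,  // limit for the whole transfer
    );
}

// Same return convention as getPage above: the HTML string on success,
// array ( code, message ) on failure.
function getPageWithTimeout ( $url, $proxy, $port ) {
    $curl = curl_init();
    curl_setopt_array ( $curl, buildCurlOptions ( $url, $proxy, $port ) );

    $html = curl_exec ( $curl );

    if ( curl_errno ( $curl ) > 0 ) {
        $result = array ( curl_errno ( $curl ), curl_error ( $curl ) );
    } else {
        $result = $html;
    }

    curl_close ( $curl );

    return $result;
}
```

With a total timeout in place, a stalled proxy should surface as a cURL error array, which the existing `is_array` check already routes to `failed_websites`. Separately, `exec()` appends to `$output` when the array already holds elements, so the `$output` array in the ping loop grows with every call unless it is reset to an empty array first.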
