glenelkins Posted May 21, 2010

Hi, I have written some code that scrapes links. The links sit in thousands of categories and sub-categories, so in the first instance I built a category scraper that walks down through the category levels and saves each category name and its URL. The best way I could think of to make this work was a recursive function that fetches each page with cURL through a random proxy server. I eventually got this to complete the scraping, but only after it seemed to stop doing anything and I had to kill the process and restart it from where it left off.

The second script loads the saved categories from the database and loops through them, again using cURL through a random proxy each time, to pull the links off each category page. I looked this morning and the process is still running on the server, but it is inserting nothing new into the database.

Why do my scripts keep hanging? They still show in the process list but do absolutely nothing after a while. I don't understand why a script runs fine, stores a load of info, then just stops for no apparent reason. Could there be a memory leak somewhere, perhaps with cURL not giving memory back? I would have thought the server would tell me if that were the case, though.
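The memory-leak theory above can be checked directly rather than guessed at. A minimal sketch, purely hypothetical (the label and where you call it from are up to you): log PHP's own memory use on each pass of the scrape loop, and see whether the number climbs steadily (a leak) or stays flat while the script is stuck (a blocked request).

```php
<?php
// Hypothetical diagnostic: print the script's real allocated memory with a
// label and timestamp, so a leak shows up as a steadily climbing number.
function logMemory($label)
{
    // memory_get_usage(true) reports the real size allocated from the system
    $mb = round(memory_get_usage(true) / 1048576, 2);
    echo date('H:i:s') . " [$label] {$mb} MB\n";
    return $mb;
}

// Call once per loop iteration, e.g.:
$before = logMemory('loop start');
// ... one scrape iteration would go here ...
$after = logMemory('after iteration');
```

If the reading is flat when the script goes quiet, the problem is a hung request, not a leak.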
glenelkins Posted May 21, 2010 (Author)

bump
kenrbnsn Posted May 21, 2010

If you don't post your code, we can't help you. Please post the code between code tags.

Ken
glenelkins Posted May 21, 2010 (Author)

    <?php
    // Load up database connection
    require 'dbconnect.php';

    // XPath for the site-listing links
    $_xPath = "/html/body//li[@class='site-listing']//h2//a";

    // select all categories not yet scraped
    $_result = mysql_query ( "SELECT * FROM `categories` WHERE `scraped` = '0'" );

    // loop through categories
    while ( $category = mysql_fetch_array ( $_result ) ) {

        // set return flag for ping
        $return = 1;

        // choose a random proxy and make sure it is responding
        do {
            // select one at random
            $_proxy = mysql_query ( "SELECT * FROM `proxies` ORDER BY rand() LIMIT 0,1" );

            // check a proxy is selected
            if ( mysql_num_rows ( $_proxy ) ) {
                $_proxy = mysql_fetch_array ( $_proxy );

                // ping: -c 5 = five packets, -w 10 = ten-second deadline
                // (the original used -t 10, but on Linux -t sets the TTL, not a timeout)
                exec ( 'ping ' . $_proxy['ip'] . ' -w 10 -c 5', $output, $return );

                // if the return value is not 0 the proxy is not responding, so delete it
                if ( $return != 0 ) {
                    echo "PING FAILED - DELETING PROXY<br />";
                    mysql_query ( "DELETE FROM `proxies` WHERE `id` = '$_proxy[id]'" );
                } else {
                    echo "PING SUCCESS<br />";
                    break;
                }
            } else {
                // no proxies in database
                die ( "NO PROXIES IN DATABASE" );
            }
        } while ( $return != 0 );

        // get the page
        $_page = getPage ( 'http://www.alexa.com' . $category['url'], $_proxy['ip'], $_proxy['port'] );

        // if getPage returned an array, it is an error
        if ( !is_array ( $_page ) ) {

            // mark the category as scraped
            mysql_query ( "UPDATE `categories` SET `scraped` = '1' WHERE `id` = '$category[id]'" );

            // we have html - scrape the website links
            $dom = new DOMDocument();
            $dom->loadHTML ( $_page );

            $xpath = new DOMXPath ( $dom );
            $hrefs = $xpath->evaluate ( $_xPath );

            // if we have links, insert each one
            if ( $hrefs->length > 0 ) {
                for ( $i = 0; $i < $hrefs->length; $i++ ) {
                    $href = $hrefs->item ( $i );
                    $url  = $href->nodeValue;
                    mysql_query ( "INSERT INTO `websites` VALUES ( '0', '$category[id]', '" . utf8_encode ( $url ) . "' );" );
                }
            }
        } else {
            // store the error
            mysql_query ( "INSERT INTO `failed_websites` VALUES ( '', '$category[id]', '$category[url]', '" . $_page[1] . "', '" . $_proxy['ip'] . $_proxy['port'] . "' );" );
        }
    }

    /*
        function: getPage
        grabs a specified url using cURL and returns the HTML
        on failure, returns the curl error as an array ( code, message )
    */
    function getPage ( $url, $proxy, $port ) {
        // start curl
        $_curl = curl_init();

        // set curl options
        curl_setopt ( $_curl, CURLOPT_URL, $url );
        curl_setopt ( $_curl, CURLOPT_FAILONERROR, true );
        curl_setopt ( $_curl, CURLOPT_FOLLOWLOCATION, true );
        curl_setopt ( $_curl, CURLOPT_AUTOREFERER, true );
        curl_setopt ( $_curl, CURLOPT_RETURNTRANSFER, true );
        curl_setopt ( $_curl, CURLOPT_CONNECTTIMEOUT, 20 );
        curl_setopt ( $_curl, CURLOPT_COOKIE, "safe-mode=off" );
        curl_setopt ( $_curl, CURLOPT_PROXY, $proxy );
        curl_setopt ( $_curl, CURLOPT_PROXYPORT, $port );

        // get the html
        $_html = curl_exec ( $_curl );

        // check to see if there is an error
        if ( curl_errno ( $_curl ) > 0 ) {
            // make error array
            $_error = array ( curl_errno ( $_curl ), curl_error ( $_curl ) );
            curl_close ( $_curl );
            return $_error;
        } else {
            curl_close ( $_curl );
            return $_html;
        }
    }
    ?>
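One likely cause of the hang, for what it's worth: getPage() sets CURLOPT_CONNECTTIMEOUT, which only limits the connection phase, but never sets CURLOPT_TIMEOUT, so a proxy that accepts the connection and then stalls mid-transfer can block curl_exec() forever. A minimal sketch of a capped request (the function name and the 60-second limit are assumptions, not from the original code):

```php
<?php
// Sketch: cap the whole transfer, not just the connect phase.
// Without CURLOPT_TIMEOUT a stalled proxy can block curl_exec() indefinitely.
function getPageCapped($url, $proxy, $port)
{
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 20); // connect phase only
    curl_setopt($ch, CURLOPT_TIMEOUT, 60);        // whole request (assumed limit)
    curl_setopt($ch, CURLOPT_PROXY, $proxy);
    curl_setopt($ch, CURLOPT_PROXYPORT, $port);

    $html = curl_exec($ch);
    if (curl_errno($ch) > 0) {
        // same error-array convention as getPage(): array(code, message)
        $err = array(curl_errno($ch), curl_error($ch));
        curl_close($ch);
        return $err;
    }
    curl_close($ch);
    return $html;
}
```

With the cap in place, a dead proxy surfaces as a timeout error in the array return instead of a silently stuck process.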