glenelkins Posted May 26, 2010

Hi,

Please take a look at the following code. It is running as a shell process. I have tested it a number of times and it simply stops inserting data into the database at random intervals: one run may insert 100 records and then stop doing anything, the next may insert only 20 and stop, and the next may insert 500 and then stop. It's really bugging me!

<?php
set_time_limit(0);
require 'dbconnect.php';

// starting point
$startingUrl = 'http://www.alexa.com/topsites/category';

// set xPath
$xPath = "/html/body//div[@class='categories top']//ul//a";

// scrape through categories at all levels
if ( scrapeCategories ( $startingUrl, $xPath ) ) {
    // terminate app
    exit;
}

function scrapeCategories ( $url, $currentXPath, $parentID = 0 ) {
    // sleep for random seconds 5-20
    //sleep ( rand(5,20));

    // reset error number
    $errorNo = 0;
    $errorTxt = '';
    $return = '';

    // loop while timeout is occurring or can't connect or empty reply or 403
    do {
        // load up a random proxy
        $proxy = mysql_query ( "SELECT * FROM `proxies` ORDER BY rand() LIMIT 0,1");
        $proxy = mysql_fetch_array ( $proxy );

        // start curl
        $curl = curl_init ( $url );

        // set curl options
        curl_setopt ( $curl, CURLOPT_FAILONERROR, true );
        curl_setopt ( $curl, CURLOPT_FOLLOWLOCATION, true);
        curl_setopt ( $curl, CURLOPT_RETURNTRANSFER, true );
        curl_setopt ( $curl, CURLOPT_CONNECTTIMEOUT, 10 );
        curl_setopt ( $curl, CURLOPT_COOKIE, "safe-mode=off");
        curl_setopt( $curl, CURLOPT_PROXY, $proxy['ip'] . ':' . $proxy['port'] );
        curl_setopt ( $curl, CURLOPT_HEADER, true );
        curl_setopt ( $curl, CURLOPT_PROXYTYPE, CURLPROXY_SOCKS5 );
        curl_setopt($curl, CURLOPT_TIMEOUT, 10);

        // run curl
        $return = curl_exec ( $curl );

        // get curl error number
        $errorNo = curl_errno ( $curl );
        $errorTxt = curl_error ( $curl );

        // close off curl and free memory
        curl_close ( $curl );
    } while ( $errorNo > 0 && $errorTxt != 'The requested URL returned error: 404' );

    // if the error number is bigger than 400
    // and the text reads: The requested URL returned error: 404
    // then the page does not exist!
    if ( $errorNo == 0 && $errorTxt != 'The requested URL returned error: 404' ) {
        // got the page !!!!
        // create DOM object
        $dom = new DOMDocument();

        // load up page html into dom
        $dom->loadHTML ( $return );

        // create xpath object
        $xpath = new DOMXpath ( $dom );

        // find links
        $hrefs = $xpath->evaluate ( $currentXPath );

        // if we have links, then there are categories to scrape!
        if ( $hrefs->length > 0 ) {
            for ( $i = 0; $i < $hrefs->length; $i++ ) {
                // get link object
                $href = $hrefs->item ( $i );

                // get url
                $linkUrl = 'http://www.alexa.com' . $href->getAttribute ( 'href' );

                // get link text
                $linkText = $href->nodeValue;

                echo "<a href='http://www.alexa.com$linkUrl'>$linkText</a><br />";

                // insert into database
                mysql_query ( "INSERT INTO `categories` VALUES ( '0', '$parentID', '" . trim(utf8_encode ($linkText)) . "', '" . trim(utf8_encode ($linkUrl)) . "', '0', '0' );") or die ( mysql_error());

                // get the category id
                //$catID = mysql_insert_id();

                // scrape this sub category
                // new xpath
                //$xpath = "/html/body//div[@id='catList']//ul//a";

                // recursive function call
                //scrapeCategories ( 'http://www.alexa.com' . $linkUrl, $xpath, $catID );
            }

            // destroy DOM
            unset ( $dom );
            unset ( $xpath );
            //return;

            // select all categories with scraped = '0'
            $categories = mysql_query ( "SELECT * FROM `categories` WHERE `scraped` = '0' AND `parent_id` = '$parentID'") or die ( mysql_error() );

            if ( mysql_num_rows ( $categories ) ) {
                while ( $category = mysql_fetch_array ( $categories ) ) {
                    // set xpath for sub categories
                    $xpath = "/html/body//div[@id='catList']//ul//a";

                    // set this category to scraped!
                    mysql_query ( "UPDATE `categories` SET `scraped` = '1' WHERE `id` = '$category[id]'") or die ( mysql_error() );

                    // scrape!
                    scrapeCategories ( $category['url'], $xpath, $category['id'] );
                }

                // clear memory
                return;
            } else {
                // clear memory
                return;
            }
        } else {
            return;
        }
    }
}
?>

Link to comment: https://forums.phpfreaks.com/topic/202985-code-stops-working-at-random-intervals/
premiso Posted May 26, 2010

Are you running this via the CLI or a browser? Some browsers have their own built-in timeouts, i.e. if no data is displayed on the page or sent to the browser within a certain amount of time, they time out. So you may want to try running it from the CLI if you are not doing so already. If you insist on using a browser, you will need to flush some data to the browser to try to keep the connection alive; however, that only works in some browsers.
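For anyone who does run a long script through a browser, the flushing idea could look roughly like the sketch below. `sendKeepAlive()` is a made-up helper name, not something from the script above, and the 256-byte padding is an arbitrary choice. It assumes PHP 7 or later for the return type declaration. Whether the output actually reaches the browser also depends on web server and proxy buffering, so treat this as an illustration only.

```php
<?php
// Hypothetical helper (not from the thread): send a little output so the
// client sees activity while a long-running loop is still working.
function sendKeepAlive(): int
{
    // Arbitrary padding size; some clients ignore very small chunks.
    $padding = str_repeat(' ', 256) . "\n";
    echo $padding;

    // If PHP-level output buffering is active, flush that buffer first ...
    if (ob_get_level() > 0) {
        ob_flush();
    }
    // ... then ask PHP to flush its system write buffers.
    flush();

    // Return the number of bytes written, mainly so callers can log it.
    return strlen($padding);
}
```

It would be called once per loop iteration, for example right after each page has been processed.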
glenelkins (Author) Posted May 26, 2010

Hi, I am running it from the CLI.
mattal999 Posted May 26, 2010

I think it dies when it tries to use one of the proxies. That would make sense, as the script dies at random times and you are fetching random proxies from a table. Also, try writing the errors to a file (that's what I do with my cron scripts); then you can debug it a lot more easily.
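A minimal sketch of the "write the errors to a file" suggestion might look like this. The function name `logScrapeError()` and the log line layout are assumptions made for illustration, and the type declarations assume PHP 7 or later; the values passed in would come from `curl_errno()` and `curl_error()` in the original loop.

```php
<?php
// Hypothetical helper (not from the thread): append one timestamped line
// per failed request so a stalled run can be inspected afterwards.
function logScrapeError(string $logFile, string $url, string $proxy, int $errorNo, string $errorTxt): void
{
    $line = sprintf(
        "[%s] url=%s proxy=%s errno=%d error=%s\n",
        date('Y-m-d H:i:s'),
        $url,
        $proxy,
        $errorNo,
        $errorTxt
    );

    // FILE_APPEND keeps earlier entries; LOCK_EX avoids interleaved writes
    // if more than one process logs to the same file.
    file_put_contents($logFile, $line, FILE_APPEND | LOCK_EX);
}
```

Inside the `do { ... } while` loop it could be called whenever `$errorNo > 0`, passing `$proxy['ip'] . ':' . $proxy['port']` as the proxy string, so the log shows which proxy each failure came from.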
glenelkins (Author) Posted May 26, 2010

Hi, the problem with that is, if you look, the proxy selection is inside a loop that checks whether an error other than a 404 was produced, so it would loop back and pick a better proxy!
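One thing worth noting about that retry loop is that it has no upper limit, so a run of bad proxies would keep it spinning for as long as the errors continue. The sketch below shows a capped version of the same idea. `fetchWithRetries()` and the shape of the array returned by `$attempt` are assumptions made for illustration (PHP 7.1 or later for the nullable return type); in the real script the callable would wrap the curl calls, pick a fresh proxy each time, and could read the status with `curl_getinfo($curl, CURLINFO_HTTP_CODE)` instead of comparing error text.

```php
<?php
// Hypothetical sketch (not from the thread): retry with a hard cap so a
// run of dead proxies cannot keep the loop going indefinitely.
// $attempt is any callable returning
//   ['body' => string, 'errno' => int, 'http_code' => int].
function fetchWithRetries(callable $attempt, int $maxAttempts = 5): ?string
{
    for ($try = 1; $try <= $maxAttempts; $try++) {
        $result = $attempt($try);

        if ($result['errno'] === 0) {
            return $result['body'];   // success
        }
        if ($result['http_code'] === 404) {
            return null;              // page does not exist; retrying will not help
        }
        // Any other error: loop round and try again (with a new proxy).
    }

    return null;                      // gave up after $maxAttempts failures
}
```

A `null` return then means either "page not found" or "no proxy worked", and the caller can log that and move on to the next category rather than waiting.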