
Problem Processing Large cURL Request


joallen


I am having a heck of a time trying to process a large cURL request. I keep running into issues with the MySQL server timing out, and also with using the callback function within the cURL script (see below). What I am attempting to do is use cURL to log a user into a system (*due to legality issues I cannot specify which) and pull all of their work for the day. I have been successful at pulling all of the work, but each order contains multiple sub-items, each with a specific URL. For instance, 300 work orders translate to approximately 2,000 sub-items.

 

Pulling the 300 work orders takes approximately 1.6 minutes. For some reason, pulling just 10 sub-items takes upwards of 3 minutes. After hundreds of attempts (and I am not exaggerating), I have finally decided to reach out to see if someone can take a look at my script and offer some knowledge.

 

Here is the process from a logic standpoint:

  1. Pull all user login data from database and log them into the system through cURL (*Works fine)
  2. Request all activity and customer information and Insert into database (*Works fine)
  3. Get all sub-items and insert them into the database (*ISSUES)

 

Here is the process from a script standpoint:

  1. The user clicks the "Import" button, which sends an AJAX request to run the importWork PHP function. This function only handles requesting the activity and customer information through cURL. (Due to the amount of time it takes to process the sub-items, I have broken the process up.)
  2. The importWork function returns, via JSON, the number of work orders processed.
  3. ***In testing I have also had the importWork function store the URLs for all of the sub-items in my database. The only issue is that the logins start to time out (not on my server, but on the server I am pulling the data from) before all the sub-items can be processed.
  4. The JavaScript automatically sends another AJAX request to pull all of the sub-items.

I am using a cURL multi function to process the URL requests. The function returns an array containing the HTML for each of the URLs. I then parse the HTML to find the hrefs I need to access the work orders, customer information, and sub-items.
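As a simplified illustration of that parsing step (not my actual code), something like this pulls the hrefs out of one returned page using PHP's built-in DOM extension; the "workorder" substring filter is just a placeholder for whatever pattern actually identifies the links:

// Sketch only: extract hrefs from one returned HTML string.
// The "workorder" filter is a hypothetical placeholder.
function extractHrefs($html)
{
    $doc = new DOMDocument();
    @$doc->loadHTML($html);              // suppress warnings from sloppy markup
    $xpath = new DOMXPath($doc);

    $hrefs = array();
    foreach ($xpath->query('//a[@href]') as $a) {
        $href = $a->getAttribute('href');
        if (strpos($href, 'workorder') !== false) {   // hypothetical filter
            $hrefs[] = $href;
        }
    }
    return $hrefs;
}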

 

So overall, my question is: what is the best way to handle a large cURL request of 2,000 URLs? Below you will see the rolling_curl function which I am attempting to use to handle the line items. For some reason it doesn't work at all. What I would like to do is simply send an array of URLs to the rolling_curl function and have it request the HTML for each URL. Once a URL finishes processing, the callback should run to insert the data into the database. I figured that would be the best way to handle such a large request in a timely manner.

 

 

ROLLING CURL FUNCTION:

Explanation: a function puts all of the sub-item URLs and the corresponding activity IDs into an associative array and passes it to the rolling_curl function. The callback function parses the HTML and inserts the needed data into the database. The only thing this function is doing at the moment is dumping "Failed!". I have run the script using the same URLs through the standard cURL multi function (see below) and verified that it is pulling the HTML, so it isn't an issue with the URLs.

public function rolling_curl($urldata, $callback = null, $custom_options = null) {
    set_time_limit(0);

    // extract data from $urldata
    $urls = $urldata['urls'];
    $activities = $urldata['activities'];

    // make sure the rolling window isn't greater than the # of urls
    $rolling_window = 95;
    $rolling_window = (sizeof($urls) < $rolling_window) ? sizeof($urls) : $rolling_window;

    $master = curl_multi_init();

    // add additional curl options here
    $std_options = array(
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_FOLLOWLOCATION => true,
        CURLOPT_MAXREDIRS      => 5
    );
    $options = ($custom_options) ? ($std_options + $custom_options) : $std_options;

    // start the first batch of requests
    for ($i = 0; $i < $rolling_window; $i++) {
        $ch = curl_init();
        $options[CURLOPT_URL] = $urls[$i];
        curl_setopt_array($ch, $options);
        curl_multi_add_handle($master, $ch);
    }

    do {
        while (($execrun = curl_multi_exec($master, $running)) == CURLM_CALL_MULTI_PERFORM);
        if ($execrun != CURLM_OK) {
            break;
        }

        // a request was just completed -- find out which one
        while ($done = curl_multi_info_read($master)) {
            $info = curl_getinfo($done['handle']);

            if ($info['http_code'] == 200) {
                // request successful. process the output using the callback function.
                // note: with CURLOPT_FOLLOWLOCATION the effective url can differ from the
                // requested one, in which case array_search() will not find a match.
                $output = curl_multi_getcontent($done['handle']);
                $ref = array_search($info['url'], $urls);
                if (is_callable($callback) && $ref !== false) {
                    $callback($output, $activities[$ref], 1);
                }
            } else {
                // request failed. log the status code and any curl error for debugging.
                error_log('Request failed: HTTP ' . $info['http_code'] . ' - ' . curl_error($done['handle']));
            }

            // start a new request, but only while there are urls left in the list
            // (it's important to do this before removing the old handle)
            if ($i < sizeof($urls)) {
                $ch = curl_init();
                $options[CURLOPT_URL] = $urls[$i++];  // increment i
                curl_setopt_array($ch, $options);
                curl_multi_add_handle($master, $ch);
            }

            // remove the curl handle that just completed, whether it succeeded or not
            curl_multi_remove_handle($master, $done['handle']);
        }

        // wait for activity on any of the handles instead of spinning at full speed
        if ($running) {
            curl_multi_select($master, 1);
        }
    } while ($running);

    curl_multi_close($master);
    return true;
}

STANDARD cURL MULTI FUNCTION:

public function requestData($urls)
  {
  	set_time_limit(0);
    // Create GET requests for each URL
    $mh = curl_multi_init();
    $ch = array();
    $res = array();
    foreach($urls as $i => $url)
    {
      $ch[$i] = curl_init($url);
      curl_setopt($ch[$i], CURLOPT_RETURNTRANSFER, 1);
      curl_multi_add_handle($mh, $ch[$i]);
    }

    // Start performing the request
    do {
        $execReturnValue = curl_multi_exec($mh, $runningHandles);
    } while ($execReturnValue == CURLM_CALL_MULTI_PERFORM);
    // Loop and continue processing the request
    while ($runningHandles && $execReturnValue == CURLM_OK) {
      // Wait for activity on any of the handles (blocks for up to 1 second by default)
      $numberReady = curl_multi_select($mh);
      if ($numberReady != -1) {
        // Pull in any new data, or at least handle timeouts
        do {
          $execReturnValue = curl_multi_exec($mh, $runningHandles);
        } while ($execReturnValue == CURLM_CALL_MULTI_PERFORM);
      } else {
        // Some curl builds return -1 here immediately; sleep briefly to avoid a tight busy-loop
        usleep(100000);
      }
    }

    // Check for any errors
    if ($execReturnValue != CURLM_OK) {
      trigger_error("Curl multi read error $execReturnValue\n", E_USER_WARNING);
    }

    // Extract the content
    foreach($urls as $i => $url)
    {
      // Check for errors
      $curlError = curl_error($ch[$i]);
      if($curlError == "") {
        $res[$i] = curl_multi_getcontent($ch[$i]);
      } else {
        return "Curl error on handle $i: $curlError\n";
      }
      // Remove and close the handle
      curl_multi_remove_handle($mh, $ch[$i]);
      curl_close($ch[$i]);
    }
    // Clean up the curl_multi handle
    curl_multi_close($mh);
    
    return $res;
  }

Any assistance would be greatly appreciated!!! I am banging my head against my desk at this point =0). I am open to any suggestions, and I will completely scrap the code and take an alternate approach if you would be so kind as to direct me accordingly.

 

FYI - I am running on a hosted, shared server which I have little control over. PHP extensions might not be a route I can take at this point, but if there is something you know of that will assist me, shoot it my way and I will talk with my hosting provider.

 

THANK YOU!!!!


Your rolling_curl function is checking for a 200 status code, while your other function is not. To start debugging, I'd suggest dumping the $info variable in your rolling_curl function to see what sort of data it contains. Perhaps the server is sending back some other status code with the work orders, which is causing it to fail.


First off, I sincerely appreciate your quick response!!  :happy-04:

 

The $info variable returns the following:

array(20) {
  ["url"]=>
  string(241) "I REMOVED THE URL DUE TO THE LEGAL STUFF =0) BUT IT WAS HERE"
  ["content_type"]=>
  string(23) "text/html;charset=UTF-8"
  ["http_code"]=>
  int(200)
  ["header_size"]=>
  int(220)
  ["request_size"]=>
  int(271)
  ["filetime"]=>
  int(-1)
  ["ssl_verify_result"]=>
  int(0)
  ["redirect_count"]=>
  int(0)
  ["total_time"]=>
  float(0.939846)
  ["namelookup_time"]=>
  float(1.5E-5)
  ["connect_time"]=>
  float(0.173083)
  ["pretransfer_time"]=>
  float(0.371921)
  ["size_upload"]=>
  float(0)
  ["size_download"]=>
  float(7950)
  ["speed_download"]=>
  float(8458)
  ["speed_upload"]=>
  float(0)
  ["download_content_length"]=>
  float(-1)
  ["upload_content_length"]=>
  float(0)
  ["starttransfer_time"]=>
  float(0.886433)
  ["redirect_time"]=>
  float(0)
}

And the $content variable did return the corresponding HTML for the URL. It seems to have processed fine.

It took 7.11 seconds to log the user in, fetch the sub-item URLs (there were 35 of them), and call the rolling_curl script to dump the aforementioned data.

 

 

Let me elaborate a little further. Consider the following functions (albeit they are simply for logic's sake). The rolling_curl function will return true if it completes; should I be checking for that, and if so, how would I go about doing it?

function getSubItems(){
    //log in users
    //get urls and put into an array named $urls
    //get activities and put into an array named $activities

    $curlData = array('urls'=>$urls,'activities'=>$activities);
    $rd = new curlProcess;
    $processCurl = $rd->rolling_curl($curlData,'processSubItem');
}

function processSubItem($content, $activity, $updated){
   //updated will always be 1 for now, you will see it in the rolling_curl callback.
   
   //array of data needed 
   $dataneeded = array('blah','blah','blah','etc','etc');
   $datatoinsert = array();
   //Parse html using simpleHTMLDOM scripts
   $html = str_get_html($content);
   $tds = $html->find('td');
   foreach ($tds AS $td){
         //Get all the data I need and put into an array
         if (in_array($td->innerhtml, $dataneeded)) {
               $datatoinsert[] = $td->innerhtml;
         }
   }
   foreach ($datatoinsert AS $data){
         //Insert the data into the database
   }
}

FYI: while I was writing this I ran the script again; it took 3.3 minutes and returned "Failed!" for all 35 sub-items.

 

Thank you in advance!!!


I finally got it to work, but it was one of those "I have no clue what I did to make it work" type things. I modified a few SQL statements in the processSubItem function which may have been hanging up the script.

 

I also added "return true" at the end of the processSubItem function, which I did not think was necessary at all. Can someone advise whether it is necessary to return something if you do not intend to receive any info from a function? In this case, the rolling_curl function calls processSubItem each time a cURL request completes. Is it necessary for processSubItem to return a value?

 

Also, can someone advise what would constitute a "large" cURL request? Thousands of URLs seems large to me, but how many URLs should a cURL multi request be able to process, and generally how fast? I know it depends on the server I am requesting information from, but I look at it this way: if I navigate to a page on that server it displays almost instantly, so should the cURL request for that same URL take about the same amount of time?

 

This is bothering me because another site has somewhat already accomplished what I am trying to do. But, of course, I cannot view their PHP script and there is no way they would be willing to share it with me =0).

 

Both of the cURL scripts work perfectly, by the way; I do not believe that was the issue. So for those who are reading up on cURL multi requests, the two provided above will more than likely suffice for your needs (again, I am able to pull 300 URLs in at most 1.6 minutes).

 

I appreciate the help!!


Can someone advise if it is necessary to return something if you do not intend to receive any info from a function?

The rolling_curl function you showed above does not use the return value of the callback in any way, so a return true is unnecessary and adding it would have no effect on the outcome of the script. You said you were seeing "Failed" for each of the requests. Given the code you showed, the only way that would appear is if the $info['http_code'] value was for some reason not equal to 200, which is why I suggested you check that. Assuming the dump you posted and the source of rolling_curl are accurate, the value is 200 and you should not be seeing a failed message.

 

 

but what should a cURL multi request be able to process? and generally how fast?

cURL can process however many URLs you need, especially with the rolling method where only a certain number of URLs are active at a given moment. The only real factor is how long it takes to process them, which is pretty much completely dependent on:

- the network connection to the server

- how many parallel requests the remote server allows/can handle

- the time your processing function takes to complete.

 

The first two items you have essentially no control over; you are just at their mercy. The last item you can look into and try to optimize as much as possible. For example, use a multi-row insert when adding the data to your DB rather than one query per data item.
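Something along these lines batches everything from one page into a single query. This is only a minimal sketch, assuming a PDO connection and a hypothetical sub_items(activity_id, value) table; adjust the column list to whatever you are actually storing:

// Minimal sketch of a multi-row insert, assuming a PDO connection ($pdo)
// and a hypothetical sub_items(activity_id, value) table.
function insertSubItems(PDO $pdo, $activityId, array $values)
{
    if (empty($values)) {
        return;
    }

    // Build one INSERT with a (?, ?) placeholder pair per row
    $placeholders = implode(', ', array_fill(0, count($values), '(?, ?)'));
    $sql = "INSERT INTO sub_items (activity_id, value) VALUES $placeholders";

    $params = array();
    foreach ($values as $value) {
        $params[] = $activityId;
        $params[] = $value;
    }

    $stmt = $pdo->prepare($sql);
    $stmt->execute($params);   // one round-trip instead of one query per item
}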

 

I know it depends on the server I am requesting information from, but I look at it this way, if I navigate to a page on that server it displays almost instantly, should the cURL request occur in the same amount of time for that same url?

Each individual cURL request should behave very similarly, yes, but when you are trying to perform several requests all at once you may run into issues caused by either the network or the remote server. For example, the server may be set up to only allow a certain number of concurrent connections per IP (or in general), which means that if you exceed that number you will be waiting for one request to finish before the next one starts downloading.
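If that is the case, the knobs on your side are the size of the rolling window and per-handle timeouts, so one stalled transfer does not hold up the whole batch. A rough sketch of options that could be passed through the $custom_options parameter of rolling_curl; the numbers are guesses, not recommendations, and note the window size is hard-coded to 95 in the version posted above, so it would need to be lowered there or turned into a parameter:

// Hypothetical values; tune them against whatever the remote server tolerates.
$custom_options = array(
    CURLOPT_CONNECTTIMEOUT => 10,   // give up on connections that never open
    CURLOPT_TIMEOUT        => 30    // give up on transfers that stall
);

$rd = new curlProcess;
$processCurl = $rd->rolling_curl($curlData, 'processSubItem', $custom_options);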

