Improving efficiency on a time-consuming Cron job


RealityRipple

I run a service on a shared host provided by 1and1. When a user subscribes, it stores their login information for a website, regularly fetches a page from that site, parses it, and stores the extracted data on my server. The user later downloads the data through a client program running on their computer. The whole thing is for tracking bandwidth usage on satellite Internet, which has very strict limits.

 

The issue I'm running into is that the cron job, which runs every 15 minutes, is slowing down now that I have a handful of clients, and it's starting to throw 500 Internal Server Errors on my website for two or three minutes every time it runs. Each client potentially requires loading a different address and parsing a differently laid-out page, essentially as an individual script, so I tried to simplify matters by making a single script that exec's the client script with the required parameters in a loop. (One strange thing I noticed is that my web host doesn't allow this kind of exec "splitting" on client-requested pages, though it works fine within cron jobs or when run from SSH. Since the main script's loop waits for each exec to complete before starting the next one, I don't see why they call it splitting at all, but that's shared hosting for you.)

 

Anyway, I'd like to know the best way to do this kind of page loading (HTTPS, POSTs, and GETs are involved regardless of the address) and data saving efficiently from PHP scripts called by a cron job. Is there a way to actually split these calls into separate threads or processes so they don't wait on each other and, ideally, run simultaneously without interfering with the site's normal operation? Do I need to talk to 1and1? Should I be using some other looping methodology?


The first thing you should probably do is profile your cron script and try to identify any areas that are running slowly.  You may be able to speed it up with alternative algorithms or functions.  If you want, post the script(s) here and someone can look at them for you.

 

As far as trying to do some type of threading, if 1and1 offers pcntl you can use pcntl_fork to spawn new processes from the main script.  If it doesn't offer that extension but does allow exec, you can spawn a new PHP process and put it into the background by appending >/dev/null 2>&1 & to the end of the command line.
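As a rough sketch of what either approach could look like (the worker script path, the PHP binary path, and the do_client_work() function below are hypothetical placeholders, not anything from your actual setup):

<?php
// Rough sketch only: /usr/bin/php5, /path/to/worker.php and do_client_work() are
// hypothetical placeholders for whatever your per-client script actually is.

$clients = array('client-a', 'client-b', 'client-c'); // hypothetical client identifiers

if (function_exists('pcntl_fork')) {
    // Option 1: fork one child process per client (requires the pcntl extension).
    foreach ($clients as $client) {
        $pid = pcntl_fork();
        if ($pid === 0) {
            do_client_work($client); // hypothetical per-client function
            exit(0);                 // the child exits so it doesn't continue the loop
        }
    }
    while (pcntl_wait($status) > 0); // the parent waits for all children to finish
} else {
    // Option 2: spawn background PHP processes; the redirects plus the trailing "&"
    // make exec() return immediately instead of waiting for each child to finish.
    foreach ($clients as $client) {
        exec('/usr/bin/php5 /path/to/worker.php ' . escapeshellarg($client) . ' >/dev/null 2>&1 &');
    }
}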

 

However, I would try the profiling first to make sure the script is running as efficiently as possible.  Spawning off new processes will not necessarily run any faster, since you're still doing the same amount of work.  You'd only see an improvement from multiple processes if the server is not busy doing other tasks and can spare other CPU cores to run them in parallel.  On a shared host there may not be many spare CPU cycles to go around between all the other sites and scripts trying to run, so you might see little to no improvement in such a case.

 


I'm not actually doing a lot of work; the time-consuming part is entirely the HTTPS communication for logging into the sites. There was some complexity because I have to store passwords in a decryptable format on my site, which I did via byte-mangling, but I got around that by keeping the passwords in temporary memory and flushing it regularly. I also know that 1and1 doesn't provide any of that threading functionality, or a way to install it T_T

 

I'll try adding that bit to the end of the exec command, since that's basically all I'm allowed to do. The whole system is far too complex to paste anywhere, since I use quite a few forms of encryption (with no HTTPS available on my server, I basically had to create my own SRP-based system for sending passwords, plus my own storage method, because the entire internet says "never store passwords, just hashes" and that would defeat the purpose of this system). The actual code run by cron just requests and parses various HTML page layouts; there's not much to improve unless I wanted to muck about in the DOM.

Edit - Wow, I just added >/dev/null 2>&1 & to the end of the exec command and it ran them all simultaneously in less than a minute. That definitely did the trick - even if they still cause an ISE while they're running, they finish so quickly it shouldn't bother a thing.


That method seems extremely confusing for returning data. I need to get the "effective URL" and the contents of pages I read via both GET and POST (depending on the host and page I'm grabbing at the time). Given each host's differences in load times and request styles, running the whole thing linearly through curl_multi_exec for each request seems like it would increase the total time taken. I'll keep looking into it, though.


There is a library called rolling-curl that you could look into; it may simplify the process for you.

 

The idea with curl multi exec, though, is that you set up the list of URLs you need to download and then enter a loop.  cURL will attempt to download them all in parallel, and as each URL finishes downloading you can run it through whatever processing function you need.  This is just pseudo-code, but the process essentially looks something like this:

 

addUrlsToList();
while (urlsToDownload() > 0){
   doDownload(); //curl will try and retrieve data from any of the urls in the list
   if (urlIsFinished()){
      processUrlData();
   }
}

 

Like I said that's just pseudo-code to illustrate the process, the way you actually go about setting it up is a bit more complex, but the library should help with that.
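To make that loop a little more concrete, here is a minimal sketch using plain curl_multi (not rolling-curl itself); the URL list and the process() function are hypothetical placeholders:

<?php
// Minimal curl_multi sketch; the URLs and process() below are hypothetical placeholders.
$urls = array('https://example.com/login', 'https://example.net/login');

$mh = curl_multi_init();
foreach ($urls as $url) {
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    curl_multi_add_handle($mh, $ch);
}

do {
    curl_multi_exec($mh, $running);
    // Handle each request as soon as it finishes instead of waiting for the whole batch.
    while ($done = curl_multi_info_read($mh)) {
        $ch   = $done['handle'];
        $body = curl_multi_getcontent($ch);
        $info = curl_getinfo($ch);          // includes the effective URL after redirects
        process($body, $info);              // hypothetical per-page processing function
        curl_multi_remove_handle($mh, $ch);
        curl_close($ch);
    }
    if ($running) {
        curl_multi_select($mh, 1.0);        // wait for activity rather than busy-looping
    }
} while ($running);

curl_multi_close($mh);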

 


Hmm... the constancy of rolling-curl is a bit more than I need, but it at least sounds like the best source of an implementation example... The issue I see with that sort of linear loop layout is that I need to send a POST or GET to a login page, get cookie data from the response and parse it to make sure the login succeeded, then request another page using that cookie data to grab its contents accurately. It sounds like the loop would get blocked either while A) I finish getting all the "login" request data, or B) I fetch the usage page after each login, preventing the loop from checking whether another request has finished until it's done. It may be better to run it as a single thread anyway, since the server doesn't seem to take kindly to multithreading, but that just seems to go against modern development techniques. If my site starts complaining again in some manner, I'll switch over to a single-threaded curl-multi system and see how that does, but so far things seem to be running a lot smoother using exec spawns - once I fixed a little problem with multiple threads reusing the same cookie jar (I hadn't considered that being an issue when it ran in a linear loop).


Alright, it's still causing 500 errors for a bit, and it looks like some threads are being interrupted, so I'm looking at rolling-curl as you suggested. I'm having some implementation issues, though. The way the class works seems to assume the script will only request one URL per address, with no sane way to start another request from within the callback. If I just send a single request in the "second volley", it acts like a standard curl request, and every callback blocks until that request completes. It may help if I explain the exact steps my program MUST take -

 

1) Using a supplied E-Mail address, it checks whether the domain exists as a subdomain of another server (via gethostbyname). If the subdomain exists, it runs Type2; if not, Type1. It also grabs other required data tied to that E-Mail address, including the password. The E-Mail address is then used as a Username.

2) Type1 requests the domain's login page, supplying the Username and Password as standard URI variables (GET). Type2 requests a different page on a different domain, with a subdomain matching the supplied domain name, passing the Username and Password as POST variables instead.

3) The response of Type1 is checked for its contents; the redirect URL of Type2 is checked to see whether it has become the home page, is still the login page, or is some other page, each of which is checked for various variables, possibly re-attempting the login if there was a failure (back to Step 2).

4) A second request (or more, per the Type2 provision above) is made for the usage data. In both Type1 and Type2, the cookie data received in Step 3 is passed along.

5) The response from Step 4 is parsed and the information is saved.

 

I'm not sure rolling-curl would handle anything beyond step 2 efficiently, and there don't seem to be any truly efficient ways of handling all of the steps without threading facilities my host doesn't have. How can I batch up these parallel runs?


I've never personally used rolling cURL or curl multi before.  The last time I did anything in parallel I rolled my own solution because I didn't know about them, and I haven't had to do anything like that since learning of them.

 

Based on the readme for rolling curl, however, something like the below should be the setup you're after:

<?php

class MyRequest extends RollingCurlRequest {
    private $mType;
    private $mRc;

    const TYPE1 = 1;
    const TYPE2 = 2;

    public function __construct($url, $type, $rc) {
        $this->mType = $type;
        $this->mRc   = $rc;
        parent::__construct($url);
    }

    // Called from the RollingCurl callback once this request has finished downloading.
    public function ProcessResponse($responseBody, $responseInfo) {
        switch ($this->mType) {
            case self::TYPE1:
                $this->ProcessType1($responseBody, $responseInfo);
                break;
            case self::TYPE2:
                $this->ProcessType2($responseBody, $responseInfo);
                break;
        }
    }

    private function ProcessType1($body, $info) {
        //blah blah blah
    }

    private function ProcessType2($body, $info) {
        //blah blah blah
    }
}

// The callback runs once per completed request and hands off to the request object.
$rc = new RollingCurl(function ($response, $info, $request) {
    $request->ProcessResponse($response, $info);
});

foreach ($Emails as $email) {
    $type = null; // determine the type of the initial request (TYPE1 or TYPE2) here
    $url  = null; // determine the URL of the initial request here
    $req  = new MyRequest($url, $type, $rc);
    $rc->add($req);
}

$rc->execute();

 

You set up all your initial Type1 or Type2 requests by adding them to the rolling curl object.  As each one completes, it calls the callback function, which in turn calls your processing function.

 

If, during the processing function, you determine that you need to issue another request, you can queue it simply by creating a new request object and calling the add method again.

 

Eg:

$req = new MyRequest($newUrl, $this->mType, $this->mRc);
$this->mRc->add($req);

 

 


The problem I'm seeing is that adding a new request will do nothing if the "execute" function is already complete, and adding just one and executing it alone is no different from a standard curl_exec, meaning it'd block the whole thing until that one finished.

 

The method I'm trying right now is a simple use of curl_multi_exec, blocking until all the requests are complete and then moving on to the next set once the first batch has finished. It then loops this second run of multi_execs until there's nothing left to read, in case multiple logon attempts are required. This has the unfortunate side effect of "blocking" while waiting for the slowest page, but it seems that no matter what you do, there will be blocking of some magnitude somewhere.

 

(BTW, I'm making these judgements about rolling curl based on the PHP source at http://code.google.com/p/rolling-curl/source/browse/trunk/RollingCurl.php - particularly lines 200-300.)


The problem I'm seeing is that adding a new request will do nothing if the "execute" function is already complete,

 

That won't be the case if you add the request during the processing callback function.  After rolling cURL calls your function, it checks whether there are any pending requests yet to be processed (which will include the one(s) you just added during the callback).  If there are, it adds the next request to the curl multi handle and begins processing it immediately.

 

The only potential issue I see is if you start out with just a single request, since in that instance it skips the curl multi stuff and just does a single execution.  In that scenario, any requests added after the initial call to execute() would be skipped.
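One way to avoid that edge case, as a rough sketch (the padding URL below is a hypothetical placeholder), is to make sure at least two requests are queued before calling execute(); the padded request's response can simply be ignored in its processing function:

// Workaround sketch: rolling-curl falls back to a plain single curl_exec when only one
// request is queued, so pad the queue with a harmless extra request before executing.
// The URL is a hypothetical placeholder, and its ProcessResponse() output is ignored.
if (count($Emails) === 1) {
    $rc->add(new MyRequest('https://example.com/ping', MyRequest::TYPE1, $rc));
}
$rc->execute();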

 


Ah, I see, clever single threading... So it should never block anything because it's just setting up another request, which will be handled in turn...

 

I'll have to make sure the single-request execution never occurs, I guess. Thank you very much - I'll let you know if there are any further problems.


It's almost perfect, but for some reason it won't unlink the cookie files it created after it's done processing the last callback. I surmise this is because the handle is removed after the callback occurs, so the file is probably technically still in use at that point. For now, I'm wiping each file at the beginning (if it exists) to ensure a clean set of cookies each time. I assume the execute() call blocks until everything is done, though, so would it be safe to clear the files away after that instead?
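If it helps, a possible cleanup sketch, assuming the cookie jars follow a per-client naming scheme (the /tmp path and file name pattern are hypothetical): since execute() doesn't return until the whole queue has been processed and the handles closed, removing the files right after it should be safe.

$rc->execute(); // blocks until every queued request has finished

// Sketch only: the cookie jar location and naming pattern are hypothetical.
foreach (glob('/tmp/cookies_*.txt') as $jar) {
    @unlink($jar); // by this point no cURL handle should still have the file open
}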


2 weeks later...

Wow, this server just does not want to do this job. It's still getting intermittent 500 errors, often when loading two or three MantisBT bug pages at once, or when other semi-heavy PHP requests hit the server. When I first set up the script, I ran into problems because the cron job was trying to run it with an old PHP4 install, and I had to point it at my PHP5 folder. There's also an experimental PHP6 install, which I could also use. Do you think pushing the cron job's work onto the PHP4 install was intentional, and if so, should I try reverting the code to PHP4 or running it under PHP6 to solve this problem? I suppose I should ask 1and1, but they take forever to get ahold of, rarely know what they're talking about, and afterward try to sign me up for trials of things I don't want. :-\
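For reference, pointing cron at a specific PHP binary just means spelling out its path in the crontab entry; the binary and script paths below are hypothetical, since 1and1's layout varies:

# Hypothetical crontab entry: run the collector every 15 minutes with the PHP5 binary.
*/15 * * * * /usr/bin/php5 /path/to/main_script.php >/dev/null 2>&1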

