Improving efficiency on a time-consuming Cron job
#1
Posted 11 January 2013 - 07:09 PM
The issue I'm running into is that this cron job, which runs every 15 minutes, is slowing down now that I have a handful of clients, and it is starting to pop ISE 500s on my website for two or three minutes every time it runs. Each client potentially has to load a different address, parse a different layout of data, and so on, basically as an individual script. To simplify matters, I made a single script which exec's the client script with the required parameters in a loop. (One strange thing I noticed is that my webhost doesn't allow this kind of exec "splitting" on client-requested pages, but it works fine within cron jobs or when run from SSH. Since the loop of the main script waits for each exec to complete before the next one starts, I don't see why they label it splitting, but that's shared hosting for you.)
Anyway, I would like to know the best method for loading these pages (there's HTTPS, POSTs and GETs involved regardless of the address) and saving the data efficiently through PHP scripts called by a cron job. Is there a way to actually split these calls into threads so they don't wait on each other, and hopefully execute simultaneously so as not to interfere with the site's normal operations? Do I need to talk to 1and1? Should I be using some other looping methodology?
#2
Posted 11 January 2013 - 08:19 PM
As far as trying to do some type of threading, if 1and1 offers pcntl you can use pcntl_fork() to spawn new processes from the main script. If it doesn't offer that extension but allows you to use exec() you can spawn a new PHP process and put it into the background by adding >/dev/null 2>&1 & to the end of the command line.
However, I would try profiling first to make sure the script is running as efficiently as possible. Spawning new processes will not necessarily make things any faster, as you're still doing the same amount of work. You'd only see an improvement from multiple processes if the server is not busy with other tasks and can spare CPU cores to run the processes in parallel. On a shared host there might not be many spare CPU cycles, with all the other sites and scripts trying to run, so you'd see little to no improvement in that case.
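As a rough illustration of the background exec() approach described above, here is a minimal sketch. The function names, the worker script path and the client identifiers are hypothetical, and it assumes a `php` CLI binary on the PATH and a POSIX shell.

```php
<?php
// Build a shell command that runs a PHP script detached from the caller.
// Assumption: a `php` CLI binary is on the PATH and the shell is POSIX-like.
function backgroundCommand($script, $argument)
{
    return sprintf(
        'php %s %s > /dev/null 2>&1 &',
        escapeshellarg($script),
        escapeshellarg($argument)
    );
}

// Launch one background worker per client without waiting for any of them.
// Because output is redirected and the command ends in "&", exec() returns
// as soon as the shell has started the process.
function spawnClients($script, array $clients)
{
    foreach ($clients as $client) {
        exec(backgroundCommand($script, $client));
    }
}
```

The main cron script would then call something like spawnClients(__DIR__ . '/client_job.php', array('client-a', 'client-b')) and exit, leaving the workers to finish on their own. Nothing here limits how many workers run at once, so on a shared host it may be worth spawning them in small groups.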
Did I help you out? Feeling generous? I accept tips via Paypal or Bitcoin @ 14mDxaob8Jgdg52scDbvf3uaeR61tB2yC7
#3
Posted 11 January 2013 - 08:35 PM
I'll try tossing that little bit to the end of exec, because that's basically all I'm allowed to do. The whole system is a lot more complex than I could paste anywhere, since I used quite a few forms of encryption (no HTTPS on my server available, I had to basically create my own SRP-based system for sending passwords, then my own storage method because the entire internet says "never store passwords, just hashes", and that would defeat the purpose of this system). The actual code being run in cron is all just requesting and parsing various HTML page layouts; there's nothing much to be improved unless I wanted to muck about in DOM.
Edit - Wow, I just added >/dev/null 2>&1 & to the end of exec and it ran them all simultaneously in less than a minute. That definitely did the trick - even if they still cause an ISE when they're running, they went so quickly it shouldn't bother a thing.
Edited by RealityRipple, 11 January 2013 - 08:46 PM.
#4
Posted 11 January 2013 - 09:55 PM
#5
Posted 11 January 2013 - 11:14 PM
#6
Posted 12 January 2013 - 02:34 AM
The idea with curl multi exec though is that you'd set up your list of URLs that you need to download and then enter a loop. curl will attempt to download them all in parallel, and once one of the URLs has finished downloading you can run it through whatever processing function you need. This is just pseudo-code, but the process essentially looks something like this:
addUrlsToList();
while (urlsToDownload() > 0){
    doDownload(); //curl will try and retrieve data from any of the urls in the list
    if (urlIsFinished()){
        processUrlData();
    }
}
Like I said, that's just pseudo-code to illustrate the process; the way you actually go about setting it up is a bit more complex, but the library should help with that.
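For comparison, the same loop written directly against PHP's curl_multi_* functions might look roughly like the sketch below. The function name fetchAll, the callback signature and the 30-second timeout are assumptions made for illustration, not part of any library.

```php
<?php
// Download several URLs in parallel and pass each result to $onDone as soon
// as that particular transfer finishes. fetchAll() is a hypothetical helper.
function fetchAll(array $urls, $onDone)
{
    $mh = curl_multi_init();
    foreach ($urls as $url) {
        $ch = curl_init($url);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
        curl_setopt($ch, CURLOPT_TIMEOUT, 30); // assumed per-request limit
        curl_multi_add_handle($mh, $ch);
    }

    do {
        // Let curl make progress on every transfer that is still active.
        curl_multi_exec($mh, $running);

        // Hand over every transfer that has completed since the last pass.
        while ($done = curl_multi_info_read($mh)) {
            $ch   = $done['handle'];
            $info = curl_getinfo($ch);
            $body = ($done['result'] === CURLE_OK)
                ? curl_multi_getcontent($ch)
                : null; // null signals a failed transfer
            call_user_func($onDone, $info['url'], $body, $info);
            curl_multi_remove_handle($mh, $ch);
        }

        // Sleep until there is network activity instead of spinning.
        if ($running > 0 && curl_multi_select($mh, 1.0) === -1) {
            usleep(10000);
        }
    } while ($running > 0);

    curl_multi_close($mh);
}
```

The callback receives the URL, the body (or null on failure) and the curl_getinfo() array, so slow pages never hold up the processing of fast ones.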
#7
Posted 12 January 2013 - 02:54 AM
Edited by RealityRipple, 12 January 2013 - 02:55 AM.
#8
Posted 14 January 2013 - 05:47 PM
1) Using a supplied e-mail address, it checks whether the domain exists as a subdomain of another server (via gethostbyname). If the subdomain exists, it runs Type2; if not, Type1. It also grabs other required data via that e-mail address, including the password. The e-mail address is then used as the username.
2) Type1 requests the domain's login page, supplying the username and password via standard URI variables (GET). Type2 requests a different page on a different domain, with a subdomain matching the supplied domain name, and passes the username and password via POST variables instead.
3) The response of Type1 is checked for contents. For Type2, the URL it was redirected to is checked to see whether it is now the home page, still the login page, or some other page; each case checks various variables and may re-attempt the login if there was a failure (back to Step 2).
4) A second request (or a later one, in the Type2 retry case above) is made for the usage data. The cookie data received in Step 3 is passed along in both Type1 and Type2.
5) The response of Step 4 is parsed and the information saved.
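The Step 1 check might be sketched as follows. This is only an illustration: the constant and function names are hypothetical, how the candidate host name is built from the e-mail domain is left to the caller, and the resolver is a parameter purely so the logic can be tried without real DNS. It relies on gethostbyname() returning the unmodified host name when the lookup fails.

```php
<?php
// Hypothetical labels for the two request flows described in the steps.
const REQUEST_TYPE1 = 1;
const REQUEST_TYPE2 = 2;

// True when $host resolves to an address. gethostbyname() returns its input
// unchanged when the lookup fails, so a changed value means success.
function hostResolves($host, $resolver = 'gethostbyname')
{
    return call_user_func($resolver, $host) !== $host;
}

// Pick Type2 when the candidate subdomain exists, Type1 otherwise.
function pickRequestType($candidateHost, $resolver = 'gethostbyname')
{
    return hostResolves($candidateHost, $resolver)
        ? REQUEST_TYPE2
        : REQUEST_TYPE1;
}
```

In the cron script the default resolver would be used, for example pickRequestType($candidateHost), where $candidateHost is whatever host name Step 1 derives from the e-mail address.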
I'm not sure that rolling-curl would be efficient at handling more than Step 2, and there don't seem to be any truly efficient methods for handling all of the steps without threading methods my host doesn't have. How can I batch these steps up to run in parallel?
#9
Posted 14 January 2013 - 06:26 PM
Based on the readme for rolling curl however, something like the below should be the setup you're after:
<?php
// One request in the rolling curl queue. It remembers which client type it
// belongs to and keeps a reference to the queue so it can add follow-ups.
class MyRequest extends RollingCurlRequest {
    private $mType;
    private $mRc;

    const TYPE1 = 1;
    const TYPE2 = 2;

    public function __construct($url, $type, $rc){
        $this->mType = $type;
        $this->mRc = $rc;
        parent::__construct($url);
    }

    // Dispatch the finished response to the handler for this request's type.
    public function ProcessResponse($responseBody, $responseInfo){
        switch ($this->mType){
            case self::TYPE1:
                $this->ProcessType1($responseBody, $responseInfo);
                break;
            case self::TYPE2:
                $this->ProcessType2($responseBody, $responseInfo);
                break;
        }
    }

    private function ProcessType1($body, $info){
        //blah blah blah
    }

    private function ProcessType2($body, $info){
        //blah blah blah
    }
}

// The callback runs once for each completed request.
$rc = new RollingCurl(function($response, $info, $request){
    $request->ProcessResponse($response, $info);
});

foreach ($Emails as $email){
    $type = MyRequest::TYPE1; // placeholder: determine type of initial request
    $url = 'http://example.com/'; // placeholder: determine URL of initial request
    $req = new MyRequest($url, $type, $rc);
    $rc->add($req);
}

$rc->execute();
You set up all your initial Type1 or Type2 requests by adding them to the rolling curl object. As each one completes, it will call the callback function, which in turn calls your processing function.
If during the processing function you determine that you need to issue another request, you can add it to the queue by creating a new request object and calling the add method again.
Eg:
$req = new MyRequest($newUrl, $this->mType, $this->mRc);
$this->mRc->add($req);
#10
Posted 14 January 2013 - 06:56 PM
The method I'm trying right now is a simple use of curl_multi_exec, blocking until every request in the first batch is complete and then moving on to the next requests. It will then loop this second run of multi_execs until there's nothing left to read, in case multiple login attempts are required. This will have the unfortunate side effect of "blocking" to wait for the slowest page, but it seems like no matter what you do, there will be blocking of some magnitude somewhere.
(BTW, I'm making these judgements on rolling curl based on the PHP source at http://code.google.c...RollingCurl.php - particularly lines 200-300.)
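A batch-by-batch version of that idea could look something like the sketch below. The names fetchBatch and runInRounds are hypothetical, and the convention that the processing callback returns a follow-up URL (or null when a client is finished) is an assumption made to keep the example short.

```php
<?php
// Fetch every URL in the batch in parallel and block until all are done.
// Returns the response bodies under the same keys as the input array.
function fetchBatch(array $urls)
{
    $mh = curl_multi_init();
    $handles = array();
    foreach ($urls as $key => $url) {
        $ch = curl_init($url);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
        curl_multi_add_handle($mh, $ch);
        $handles[$key] = $ch;
    }

    do {
        curl_multi_exec($mh, $running);
        // Wait for activity rather than spinning while the slowest page loads.
        if ($running > 0 && curl_multi_select($mh, 1.0) === -1) {
            usleep(10000);
        }
    } while ($running > 0);

    $bodies = array();
    foreach ($handles as $key => $ch) {
        $bodies[$key] = curl_multi_getcontent($ch);
        curl_multi_remove_handle($mh, $ch);
    }
    curl_multi_close($mh);
    return $bodies;
}

// Run rounds of batches until no callback asks for a follow-up request.
// $process($url, $body) returns the next URL to fetch, or null when done.
function runInRounds(array $urls, $process)
{
    while (count($urls) > 0) {
        $bodies = fetchBatch($urls);
        $next = array();
        foreach ($bodies as $key => $body) {
            $followUp = call_user_func($process, $urls[$key], $body);
            if ($followUp !== null) {
                $next[] = $followUp;
            }
        }
        $urls = $next;
    }
}
```

Each round still waits for its slowest page, as noted above, but the total time is roughly the sum of the slowest request in each round rather than the sum of every request.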
#11
Posted 14 January 2013 - 07:17 PM
"The problem I'm seeing is that adding a new request will do nothing if the 'execute' function is already complete,"
That won't be the case if you add the request during the processing callback function. After rolling cURL calls your function, it checks whether there are any pending requests yet to be processed (which will include the ones you just added during the callback). If there are, it adds them to the curl multi handle and begins processing them immediately.
The only potential issue I see is if you start out with just a single request, as in that instance it skips the curl multi stuff and just does a single execution. In that scenario any requests added after the initial call to execute() would be skipped.
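The general pattern, where work added from inside a callback is still picked up because the loop re-checks the queue after every callback, can be shown with a plain array. This is only an illustration of the idea and not RollingCurl's actual code; drainQueue and the item names are made up.

```php
<?php
// Process items until the queue is empty. The callback returns an array of
// new items to append, which are processed in the same run.
function drainQueue(array $queue, $callback)
{
    $processed = array();
    while (count($queue) > 0) {
        $item = array_shift($queue);
        $processed[] = $item;
        // Anything the callback asks for is queued before the next check.
        foreach (call_user_func($callback, $item) as $newItem) {
            $queue[] = $newItem;
        }
    }
    return $processed;
}

// A login step that schedules a follow-up usage request when it completes.
$order = drainQueue(array('login-a'), function ($item) {
    return $item === 'login-a' ? array('usage-a') : array();
});
// $order is now array('login-a', 'usage-a')
```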
#12
Posted 14 January 2013 - 07:20 PM
I'll have to make sure the single request execution never occurs, I guess. Thank you very much - I'll let you know if there are any further problems.
#13
Posted 15 January 2013 - 05:15 AM
#14
Posted 15 January 2013 - 11:16 AM
#15
Posted 15 January 2013 - 01:24 PM
#16
Posted 24 January 2013 - 11:38 PM