
A PHP file whose job is to check a website and update an index will take a long time. It uses curl to go out to the site and ping each individual article, compare it against the local database, and then timestamp anything that's changed for sorting purposes.
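
Roughly, the script does something like this (the articles table and its columns are just placeholders for the example):

<?php
// sequential_check.php --- rough sketch of the sequential approach; the articles
// table and its columns (url, content_hash, changed_at) are just placeholders.
$db = new PDO('mysql:host=localhost;dbname=index_db', 'user', 'pass');

foreach ($db->query("SELECT id, url, content_hash FROM articles") as $article) {
    $ch = curl_init($article['url']);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_TIMEOUT, 30);
    $body = curl_exec($ch);    // blocks until the remote site responds
    curl_close($ch);

    if ($body === false) {
        continue;              // network error; skip this article
    }

    // timestamp the article if its content has changed since the last check
    if (md5($body) !== $article['content_hash']) {
        $stmt = $db->prepare("UPDATE articles SET content_hash = ?, changed_at = NOW() WHERE id = ?");
        $stmt->execute([md5($body), $article['id']]);
    }
}
?>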

 

It works great for one or two things, but when confronted with a hundred, the curl script just can't get through them all in a timely manner (although I haven't actually tried it yet).

 

Forking the individual curl calls is a possibility (which I've never done before), but the script also needs to be manually called from a web browser. Is there a single solution for this, or do I build the fork page for cron and a separate ajax page for the browser?


Why does it need to be called from a browser? This doesn't seem like browser-related work. If you do need a web-based UI, it would be better to implement a RESTful interface and have the working script called via AJAX, then notify the UI when it's finished.

Okay, I think I get it. The RESTful document sends the list, and the admin UI works through it. But I'd still have to fork processes via cron, no?


I'm not sure I know enough about your system and requirements to say definitively "yea" or "nay" on that.

 

Here's how one of my apps works.

 

1. The UI page is loaded.

2. The user picks a task.

3. Pressing the control for this task activates JS that makes an AJAX call to a PHP handler.

 

Here's the handler, more or less:

<?php
//doit.php --- For a RESTful home page, this will call one of the *_doit.php files based on the GET string.
$who = $_GET['doit'];
$doitfile = "doits/" . $who . "_doit.php";
echo "Running $doitfile\n"; //for the AJAX; this subsystem needs improved thought given to it.
include $doitfile;
?>

 

The script's echo output (see above comment) is returned, and JS in the UI reads it into a "status" DIV. The $doitfile follows this same logic, echoing needed information when finished, which also gets returned to the status DIV. The $doitfiles take anywhere from a couple of minutes to a day or more to run, depending on what they're doing (and to $whom).

 

Cron isn't involved. I've not forked anything either, but I've considered it (we currently have to run a big job or two over the weekend, because there's no point in the user sitting around doing nothing for two days) ;)

 

Hope this helps,

Since processing time is also your issue, I guess you could fork, or you could just write multiple handlers and have the UI call them all. I'm not sure it matters; I guess forking might be better if it means less front-end work. Either way, you're going to end up pegging the CPU at 100%, I'd imagine, unless you have significant machine resources or the app runs in a cluster/cloud.

 

This stuff ain't too easy, is it? :)

It sounds like what you'll probably want to do is have your curl script act as a daemon process that you start up from the command line, either manually or via a helper script. That process would read the URLs it needs to fetch/parse out of a queue somewhere, such as a database, memcache entry, etc. As it completes its work it could provide status updates via a similar mechanism.
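
A rough sketch of what such a daemon loop could look like, assuming a simple url_queue table (the table and column names are just for illustration):

<?php
// queue_daemon.php --- rough sketch of a long-running worker pulling URLs out of
// a database queue; the url_queue table and its columns are made-up examples.
$db = new PDO('mysql:host=localhost;dbname=index_db', 'user', 'pass');

while (true) {
    $row = $db->query("SELECT id, url FROM url_queue WHERE status = 'pending' ORDER BY id LIMIT 1")
              ->fetch(PDO::FETCH_ASSOC);

    if ($row === false) {
        sleep(5);              // queue is empty; wait before polling again
        continue;
    }

    $ch = curl_init($row['url']);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    $body = curl_exec($ch);
    curl_close($ch);

    // record the outcome so the web UI (or cron) can check on progress
    $status = ($body === false) ? 'failed' : 'done';
    $stmt = $db->prepare("UPDATE url_queue SET status = ?, finished_at = NOW() WHERE id = ?");
    $stmt->execute([$status, $row['id']]);
}
?>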

 

Your web UI then would just inject stuff into the queue and periodically check on its status rather than launch the curl process and wait on it. You could have a cron task that also periodically injects items into the queue for refreshing whenever necessary.
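
The injection side can then be as simple as something like this (again, url_queue and its columns are just example names):

<?php
// enqueue.php --- called from the web UI (e.g. via AJAX) or from a cron script;
// just pushes a URL into the same url_queue table the daemon reads from.
$db = new PDO('mysql:host=localhost;dbname=index_db', 'user', 'pass');

$stmt = $db->prepare("INSERT INTO url_queue (url, status, queued_at) VALUES (?, 'pending', NOW())");
$stmt->execute([$_POST['url']]);

echo "queued";   // the UI can poll the table later to see when the item is done
?>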

 

As for the curl script itself, by having it run as a background daemon you don't really have to worry as much about how long it takes; the process can just run indefinitely. One way to speed it up without having to mess around with forking and multi-process issues is to issue several requests at once and process the results as they come in. This is done using curl_multi_init and curl_multi_exec. There is a wrapper for this known as Rolling Curl that can simplify the task quite a bit.
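
For example, a rough sketch of using curl_multi to fetch several pages in parallel (the URLs are placeholders):

<?php
// multi_fetch.php --- rough sketch of fetching several pages in parallel with
// curl_multi; the URLs are placeholders.
$urls = array(
    'http://example.com/article/1',
    'http://example.com/article/2',
    'http://example.com/article/3',
);

$mh = curl_multi_init();
$handles = array();

foreach ($urls as $url) {
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_multi_add_handle($mh, $ch);
    $handles[$url] = $ch;
}

// run all the transfers; the downloads happen in parallel
do {
    curl_multi_exec($mh, $running);
    curl_multi_select($mh);    // wait for activity instead of busy-looping
} while ($running > 0);

// collect the results and clean up
foreach ($handles as $url => $ch) {
    $body = curl_multi_getcontent($ch);
    echo $url . " returned " . strlen($body) . " bytes\n";
    curl_multi_remove_handle($mh, $ch);
    curl_close($ch);
}
curl_multi_close($mh);
?>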

Well, the issue is more with the time than the processor, as a curl call isn't really that processor intensive. It's mostly just waiting on the internet to send the page.

 

I didn't know about curl_multi... it looks like the same thing as doing a curl loop, no?

 

Now, running it as a background daemon is an interesting idea. Are we talking about making a PHP script terminate and stay resident (TSR) or more of the cron daemon calling the php script every five minutes?

I didn't know about curl_multi... it looks like the same thing as doing a curl loop, no?

No, curl_multi will open several connections and download the information in parallel, whereas curl in a loop would do things in sequence, one at a time.

 

 

Are we talking about making a PHP script terminate and stay resident (TSR) or more of the cron daemon calling the php script every five minutes?

The script wouldn't terminate, it would keep going. You'd just set it up to run in the background, such as by using & when running it from a linux shell. You'd just set up some means of communicating with the script so you can tell it what URLs to download or any other actions you need it to do. You could write a small cron task that runs every few minutes just to make sure the script is still alive and working, and restart it if not (a watchdog, essentially).
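
The watchdog could be something as simple as this, run from cron every few minutes (the pidfile path and daemon filename are just examples):

<?php
// watchdog.php --- run from cron every few minutes; restarts the daemon if it has
// died. The pidfile path and daemon filename are just examples. Requires the
// posix extension for posix_kill().
$pidfile = '/var/run/curl_daemon.pid';

$pid = @file_get_contents($pidfile);

// posix_kill() with signal 0 just tests whether the process is still alive
if ($pid === false || !posix_kill((int)$pid, 0)) {
    // daemon is gone: relaunch it in the background and record the new pid
    $newpid = shell_exec('nohup php /path/to/curl_daemon.php > /dev/null 2>&1 & echo $!');
    file_put_contents($pidfile, trim($newpid));
}
?>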

The script wouldn't terminate, it would keep going. You'd just set it up to run in the background, such as by using & when running it from a linux shell. You'd just set up some means of communicating with the script so you can tell it what URLs to download or any other actions you need it to do. You could write a small cron task that runs every few minutes just to make sure the script is still alive and working, and restart it if not (a watchdog, essentially).

 

Interesting. Looking for a tutorial or example out there without much luck. Got a link / starting point?

Well, the issue is more with the time than the processor, as a curl call isn't really that processor intensive. It's mostly just waiting on the internet to send the page.

Yes, I suppose if all you're doing is downloading a page, it's not. We do a lot of page parsing after the download, in VMs, and CPU does become an issue, particularly if we use threaded workers for the task(s).

 

You'd just set up some means of communicating with the script so you can tell it what URLs to download or any other actions you need it to do.

Might be fun to run as a socket server in a "while(1)" loop....
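
Something like this, maybe (just a toy sketch; the port and the line-based protocol are made up):

<?php
// socket_daemon.php --- toy sketch of the "socket server in a while(1) loop" idea:
// the web side connects, writes a URL, and the daemon queues it for fetching.
$server = stream_socket_server('tcp://127.0.0.1:9000', $errno, $errstr);
if ($server === false) {
    die("Could not bind: $errstr\n");
}

while (1) {
    $conn = @stream_socket_accept($server, 5);   // wait up to 5 seconds for a client
    if ($conn === false) {
        continue;        // timed out; loop again (could do housekeeping here)
    }

    $url = trim(fgets($conn));
    fwrite($conn, "queued: $url\n");
    fclose($conn);

    // ...hand $url off to the queue / curl code here
}
?>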
