
php curl tasks too numerous for one script


michaellunsford


A PHP file whose job is to check a website and update an index will take a long time. It uses curl to go out to the web page, ping each individual article, compare it to the local database, and then timestamp anything that's changed for sorting purposes.

 

It works great for one or two things, but when confronted with a hundred, the curl script just can't handle all that and complete in a timely manner (although I haven't actually tried it yet).

 

Forking the individual curl calls is a possibility (which I've never done before), but the script also needs to be manually called from a web browser. Is there a single solution for this, or do I build the fork page for cron and a separate ajax page for the browser?
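
For what it's worth, here's a rough CLI-only sketch of what forking each curl call might look like. This is just me thinking out loud: it needs the pcntl extension, and the URLs and the per-child work are placeholders.

<?php
// Hypothetical sketch: fork one child process per URL (CLI only, pcntl extension required).
$urls = array('http://example.com/article/1', 'http://example.com/article/2');
$children = array();

foreach ($urls as $url) {
    $pid = pcntl_fork();
    if ($pid == -1) {
        die("fork failed\n");
    } elseif ($pid == 0) {
        // Child: fetch one URL, do its share of the work, then exit.
        $ch = curl_init($url);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
        $body = curl_exec($ch);
        curl_close($ch);
        // ...compare $body against the local database and timestamp changes here...
        exit(0);
    } else {
        $children[] = $pid; // Parent: remember the child's PID.
    }
}

// Parent waits for every child to finish before exiting.
foreach ($children as $pid) {
    pcntl_waitpid($pid, $status);
}
?>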


I'm not sure I know enough about your system and requirements to say a definitive "yea" or "nay" on that.

 

Here's how one of my apps works.

 

1. The UI page is loaded.

2. The user picks a task.

3. Pressing the control for this task activates JS that makes an AJAX call to a PHP handler.

 

Here's the handler, more or less:

<?php
//doit.php --- For a RESTful home page, this will call one of the *_doit.php files with appropriate GET string.
$who = $_GET['doit'];
$doitfile = "doits/" . $who . "_doit.php";
echo "Running $doitfile\n"; //for the AJAX; this subsystem needs improved thought given to it.
include $doitfile;
?>

 

The script's echo output (see the comment above) is returned, and JS in the UI reads it into a "status" DIV. The $doitfile follows this same logic, echoing any needed information when finished, which also gets returned to the status DIV. The $doitfiles take anywhere from a couple of minutes to a day or more to run, depending on what they're doing (and to $whom).
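
For illustration, here's a stripped-down sketch of what one of those *_doit.php files could look like; the file name and the sleep() stand-in are made up, and the real ones do far more work.

<?php
// example_doit.php --- hypothetical illustration of the pattern; included by doit.php above.
echo "Starting example task...\n"; // gets returned to the status DIV via the AJAX call

// ...the actual long-running work would go here...
sleep(2); // stand-in for the real work

echo "Example task finished.\n"; // also ends up in the status DIV
?>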

 

Cron isn't involved. I've not forked anything either, but I've considered it (we currently have to run a big job or two over the weekend because there's no point in the user doing nothing for two days) ;) ;)

 

Hope this helps,


As the time required for processing is also your issue, I guess you could fork, or you could just write multiple handlers and have the UI call them all. I'm not sure it matters; forking might be better if it's less front-end work. Either way, I'd imagine you're going to end up pegging the CPU at 100%, unless you have significant machine resources or the app runs in a cluster/cloud.

 

This stuff ain't too easy, is it? :)


It sounds like what you'll probably want to do is have your curl script act as a daemon process that you start up from the command line, either manually or via a helper script. That process would read the URLs it needs to fetch/parse out of a queue somewhere, such as a database, memcache entry, etc. As it completes its work it could provide status updates via a similar mechanism.

 

Your web UI then would just inject stuff into the queue and periodically check on its status rather than launch the curl process and wait on it. You could have a cron task that also periodically injects items into the queue for refreshing whenever necessary.
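
For example, the web-side handler could be as simple as the sketch below. The url_queue table, its columns, and the PDO connection details are placeholders I'm making up for illustration.

<?php
// Hypothetical web handler: only queues work and reports progress; it never runs curl itself.
$db = new PDO('mysql:host=localhost;dbname=myapp', 'user', 'pass');

if (isset($_GET['add'])) {
    // Inject a URL into the queue for the background worker to pick up.
    $stmt = $db->prepare("INSERT INTO url_queue (url, status, added_at) VALUES (?, 'pending', NOW())");
    $stmt->execute(array($_GET['add']));
    echo "queued";
} else {
    // Report how much work is left so the UI can poll this endpoint via AJAX.
    $row = $db->query("SELECT COUNT(*) FROM url_queue WHERE status = 'pending'")->fetch();
    echo "pending: " . $row[0];
}
?>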

 

As for the curl script itself, by having it running as a background daemon you don't really have to worry as much about how long it takes; the process can just run indefinitely. One way to speed it up without having to mess around with forking and multi-process issues is to issue several requests at once and process the results as they come in. This is done using curl_multi_init and curl_multi_exec. There is a wrapper for this known as Rolling Curl that can simplify the task quite a bit.
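
Here's a rough sketch of the curl_multi pattern; the URLs and the comparison step are placeholders, and a real worker would pull them out of the queue instead of a hard-coded list.

<?php
// Sketch: fetch several URLs in parallel with curl_multi instead of one at a time.
$urls = array(
    'http://example.com/article/1',
    'http://example.com/article/2',
    'http://example.com/article/3',
);

$mh = curl_multi_init();
$handles = array();

foreach ($urls as $url) {
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // capture the body instead of echoing it
    curl_setopt($ch, CURLOPT_TIMEOUT, 30);
    curl_multi_add_handle($mh, $ch);
    $handles[$url] = $ch;
}

// Drive all the transfers at once; curl_multi_exec is non-blocking, so loop until done.
do {
    $status = curl_multi_exec($mh, $running);
    if ($running) {
        curl_multi_select($mh); // wait for activity instead of busy-looping
    }
} while ($running && $status == CURLM_OK);

foreach ($handles as $url => $ch) {
    $body = curl_multi_getcontent($ch);
    // ...compare $body against the local database and timestamp changes here...
    curl_multi_remove_handle($mh, $ch);
    curl_close($ch);
}

curl_multi_close($mh);
?>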


Well, the issue is more with the time than the processor, as a curl call isn't really that processor intensive. It's mostly just waiting on the internet to send the page.

 

I didn't know about curl_multi... it looks like the same thing as doing a curl loop, no?

 

Now, running it as a background daemon is an interesting idea. Are we talking about making a PHP script terminate and stay resident (TSR) or more of the cron daemon calling the php script every five minutes?


I didn't know about curl_multi... it looks like the same thing as doing a curl loop, no?

No, curl_multi will open several connections and download the information in parallel, whereas curl in a loop would do things in sequence, one at a time.

 

 

Are we talking about making a PHP script terminate and stay resident (TSR) or more of the cron daemon calling the php script every five minutes?

The script wouldn't terminate; it would keep going. You'd just set it up to run in the background, such as by using & when running it from a Linux shell. You'd just set up some means of communicating with the script so you can tell it what URLs to download or any other actions you need it to do. You could write a small cron task that runs every few minutes just to make sure the script is still alive and working, and restart it if not (a watchdog, essentially).
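
Something along these lines, say a worker.php started with "php worker.php &". The url_queue table and columns match the made-up queue sketch earlier in the thread; the cron watchdog would just check that the process is still running (via ps or a pidfile) and relaunch it if not.

<?php
// Hypothetical long-running worker: keeps polling the queue and fetching whatever is pending.
$db = new PDO('mysql:host=localhost;dbname=myapp', 'user', 'pass');

while (true) {
    $row = $db->query("SELECT id, url FROM url_queue WHERE status = 'pending' LIMIT 1")
              ->fetch(PDO::FETCH_ASSOC);

    if ($row === false) {
        sleep(5); // nothing queued; wait a bit before checking again
        continue;
    }

    $ch = curl_init($row['url']);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    $body = curl_exec($ch);
    curl_close($ch);

    // ...compare $body with the stored copy and update timestamps here...

    $stmt = $db->prepare("UPDATE url_queue SET status = 'done' WHERE id = ?");
    $stmt->execute(array($row['id']));
}
?>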


The script wouldn't terminate; it would keep going. You'd just set it up to run in the background, such as by using & when running it from a Linux shell. You'd just set up some means of communicating with the script so you can tell it what URLs to download or any other actions you need it to do. You could write a small cron task that runs every few minutes just to make sure the script is still alive and working, and restart it if not (a watchdog, essentially).

 

Interesting. I've been looking for a tutorial or example out there without much luck. Got a link / starting point?


Well, the issue is more with the time than the processor, as a curl call isn't really that processor intensive. It's mostly just waiting on the internet to send the page.

Yes, I suppose if all you're doing is downloading a page, it's not. We do a lot of page parsing after the download, in VMs, and CPU does become an issue, particularly if we use threaded workers for the task(s).

 

You'd just set up some means of communicating with the script so you can tell it what URLs to download or any other actions you need it to do.

Might be fun to run as a socket server in a "while(1)" loop....
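
Something like this, roughly; the port and the one-line "FETCH <url>" protocol are made up for the sake of the sketch.

<?php
// Playful sketch of the "socket server in a while(1) loop" idea.
$server = stream_socket_server('tcp://127.0.0.1:9001', $errno, $errstr);
if ($server === false) {
    die("Could not bind: $errstr ($errno)\n");
}

while (1) {
    $conn = stream_socket_accept($server, -1); // block until a client connects
    if ($conn === false) {
        continue;
    }

    $line = trim(fgets($conn)); // e.g. "FETCH http://example.com/article/1"

    if (strpos($line, 'FETCH ') === 0) {
        // Hand the URL off to the queue / curl worker here.
        fwrite($conn, "queued\n");
    } else {
        fwrite($conn, "unknown command\n");
    }

    fclose($conn);
}
?>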

