
Efficiently allocating resources for large import of images


Recommended Posts

I'm hoping to get a little feedback on what you all believe is the best way to handle this efficiently in PHP. I am working on a script that imports a large amount of data from remote feeds; this facilitates the quick deployment of real estate web sites, but has to download a large number of images to each new site.

 

Assuming for now that the bottleneck isn't in the method (fsock vs. curl vs. ...), and that for each imported listing we're spending between 0.89 and 17.06 seconds on the image import process alone... what would you suggest for handling this over the space of 100-1000 occurrences?

 

As of right now I have two ideas in mind, both fairly rudimentary. The first is to shut the script down every 30-45 seconds, sleep for a second, and fire off an asynchronous request to start the script again. The second is to fire off a new asynchronous request to run the image imports separately from the main script. That would let the fast imports clear out quickly while the slower ones run in their own processes. The only thing that worries me is that 100 of these could be fired off every second. Even assuming half of them complete before the next round is fired, they would still pile up.
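For the second idea, a fire-and-forget request can be made by opening a socket, writing the request headers, and closing without reading the response, so the calling script doesn't block. A minimal sketch; the import-images.php endpoint and the listing_id parameter are hypothetical names, not anything from your codebase:

```php
<?php
// Fire-and-forget HTTP request: send the headers, then close the
// socket without waiting for the response body, so the caller can
// move on to the next listing immediately.
function fire_async_request(string $host, string $path): bool
{
    $fp = @fsockopen($host, 80, $errno, $errstr, 5); // 5 s connect timeout
    if ($fp === false) {
        return false; // host unreachable / DNS failure
    }
    $request  = "GET {$path} HTTP/1.1\r\n";
    $request .= "Host: {$host}\r\n";
    $request .= "Connection: Close\r\n\r\n";
    fwrite($fp, $request);
    fclose($fp); // don't read the response
    return true;
}

// Example (hypothetical endpoint):
// fire_async_request('example.com', '/import-images.php?listing_id=42');
```

The worker script on the other end should call `ignore_user_abort(true)` so closing the connection doesn't kill it mid-import.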

 

 


Local Caching.

 

You should try your best not to rely on an outside server, as your script execution varies with their response time/speed. If you must, you should always keep a recent local cache that you can fall back on if the server takes more than x milliseconds to respond.
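A sketch of that fallback logic with cURL, under assumptions: a per-image cache path you already maintain, and a millisecond budget (3000 ms here is a placeholder):

```php
<?php
// Fetch a remote image, but fall back to the local cached copy if the
// remote server is slower than the timeout budget or fails outright.
function fetch_with_fallback(string $url, string $cachePath, int $timeoutMs = 3000): ?string
{
    $ch = curl_init($url);
    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER    => true,
        CURLOPT_CONNECTTIMEOUT_MS => $timeoutMs,
        CURLOPT_TIMEOUT_MS        => $timeoutMs,
        CURLOPT_FOLLOWLOCATION    => true,
    ]);
    $body = curl_exec($ch);
    $ok = ($body !== false && curl_getinfo($ch, CURLINFO_HTTP_CODE) === 200);
    curl_close($ch);

    if ($ok) {
        file_put_contents($cachePath, $body); // refresh the cache
        return $body;
    }
    // Remote too slow or down: serve the stale local copy if we have one.
    return is_file($cachePath) ? file_get_contents($cachePath) : null;
}
```

A stale image is almost always better than a broken one, so the fallback returns whatever is cached, out of date or not.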

 

I would just run a cron job that scrapes the sites once a day and stores everything locally. If you need the information to be up-to-the-second, there's no super efficient way of doing this.

 

If you wanted to avoid cron, you could store a record of each cached page in the database along with a date. If a user requests that information and it's more than x seconds/days old, you could initiate the function to rescrape that particular site. This avoids your cron job grabbing information from sites that someone visited once and never looked at again.
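The staleness check itself is a one-liner; a sketch, assuming a hypothetical cache table with a fetched_at Unix-timestamp column:

```php
<?php
// Is a cached record older than the allowed max age?
// $now is injectable so the logic is testable; it defaults to time().
function is_stale(int $fetchedAt, int $maxAgeSeconds, ?int $now = null): bool
{
    $now = $now ?? time();
    return ($now - $fetchedAt) > $maxAgeSeconds;
}

// On a page request (PDO and the "cache" table/columns are assumptions):
// $row = $stmt->fetch(); // SELECT fetched_at FROM cache WHERE url = ?
// if (!$row || is_stale((int)$row['fetched_at'], 86400)) {
//     rescrape($url); // hypothetical rescrape routine
// }
```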

 

Regardless, I'd imagine you're going to use a LOT of bandwidth here if you're grabbing and serving images from thousands of real estate sites daily. Keep that in mind.

 

Why import pics? Just link to them?

 

Decent option, but if the outside site puts up hotlink protection or moves them, you're hooped. At least with a local copy, if the remote image no longer exists, you always have SOMETHING to fall back on, out of date or not.


That's what I'm doing; the program doesn't grab the information on the fly. An import is scheduled via cron and fired off once or twice a day depending on the needs of the particular site. That import grabs the data from an XML feed, turns it into a page, and then retrieves the images. It's this image retrieval that's causing the script to run for an extremely long time. With the image import disabled, the script takes about 30 seconds on average to import all the data from the remote XML feed. That's about 500-700 pages of information, with the images adding anywhere from 1-18 seconds per page.

 

If we average 4 seconds per page (low-balling) for the image import, we're looking at over half an hour of execution time (500 pages × 4 s ≈ 33 minutes). I'd rather not run a single process for that length of time. If I make the script sleep and restart every minute, I'm actually extending that time, but since a new process starts every minute it can't eat a ridiculous amount of memory.

 

If I fire off a separate request for each page to grab its images, the whole process may complete in far less time, but it would require 500-700 PHP scripts being fired in the space of 30 seconds... I'm leaning in this direction because it's not much different from a site that gets 1000-1400 visitors per minute, which a properly configured server can easily handle. If it's scheduled during the lowest-traffic time of day, it will have the least competition and there should be no visible effect on regular visitors on the front end.

 

Unless of course there's a better solution! I googled this for about an hour earlier, and while I'm quite positive I'm not the only person who's had to pull in a large number of remote files, there don't seem to be many people interested in writing about the methodology they used.

 


Unless you're grabbing many images at one time, there shouldn't be a big memory issue. Your idea of running several scripts at once would be harder on the server than one long request that parses one image at a time.

 

It really depends. If you have the resources and speed to do it all at once, go for it. The job gets done quicker. If you want it as more of a background process, then go one at a time.

 

Both solutions will work, one might just work better in your situation.
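There is also a possible middle ground between the two: keep a single long-running process, but let it fetch a handful of images in parallel with curl_multi, so you get most of the speedup without spawning hundreds of processes. A rough sketch; the 10-handle cap and 20 s per-image timeout are arbitrary placeholders:

```php
<?php
// Download a list of image URLs into $dir, at most $maxParallel
// transfers in flight at once. Returns how many files were saved.
function download_batch(array $urls, string $dir, int $maxParallel = 10): int
{
    $mh = curl_multi_init();
    $queue = $urls;
    $activeCount = 0;
    $saved = 0;

    $add = function () use (&$queue, &$activeCount, $mh, $dir) {
        $url = array_shift($queue);
        $ch  = curl_init($url);
        curl_setopt_array($ch, [
            CURLOPT_RETURNTRANSFER => true,
            CURLOPT_TIMEOUT        => 20,
            CURLOPT_FOLLOWLOCATION => true,
            // Stash the destination path on the handle itself.
            CURLOPT_PRIVATE => $dir . '/' . basename((string)parse_url($url, PHP_URL_PATH)),
        ]);
        curl_multi_add_handle($mh, $ch);
        $activeCount++;
    };

    // Prime the pool up to the cap.
    while ($queue && $activeCount < $maxParallel) {
        $add();
    }

    do {
        curl_multi_exec($mh, $running);
        curl_multi_select($mh, 1.0); // wait for activity on any handle
        while ($info = curl_multi_info_read($mh)) {
            $ch = $info['handle'];
            if ($info['result'] === CURLE_OK) {
                file_put_contents(curl_getinfo($ch, CURLINFO_PRIVATE), curl_multi_getcontent($ch));
                $saved++;
            }
            curl_multi_remove_handle($mh, $ch);
            curl_close($ch);
            $activeCount--;
            if ($queue) {
                $add(); // refill the slot that just freed up
            }
        }
    } while ($running > 0 || $queue || $activeCount > 0);

    curl_multi_close($mh);
    return $saved;
}
```

Because the cap is fixed, a slow remote server only ties up one of the ten slots instead of stalling the whole import, and nothing can pile up.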


It's much harder on the server, agreed. The worry is having one process running for 30-60 minutes; I think there's more potential for that to hang or hit a snag and spiral out of control eating memory. Then again, with 100 processes being fired off nearly simultaneously, each should be done in a fraction of the time, but there is potential for 100 processes to hang.
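If you do go the many-processes route, one way to bound the damage from hung workers is a slot counter guarded by flock, so no more than N imports ever run at once; new workers that can't get a slot simply exit. A rough sketch (the counter-file path and cap are placeholders, and note the limitation: a worker that crashes without calling release leaks its slot until the counter file is reset):

```php
<?php
// Try to claim one of $maxWorkers slots. The count lives in a small
// file protected by an exclusive lock.
function try_acquire_slot(string $counterFile, int $maxWorkers): bool
{
    $fp = fopen($counterFile, 'c+');
    if ($fp === false || !flock($fp, LOCK_EX)) {
        return false;
    }
    $count = (int)stream_get_contents($fp);
    $ok = $count < $maxWorkers;
    if ($ok) {
        ftruncate($fp, 0);
        rewind($fp);
        fwrite($fp, (string)($count + 1));
    }
    flock($fp, LOCK_UN);
    fclose($fp);
    return $ok;
}

// Give the slot back when the worker finishes.
function release_slot(string $counterFile): void
{
    $fp = fopen($counterFile, 'c+');
    if ($fp !== false && flock($fp, LOCK_EX)) {
        $count = max(0, (int)stream_get_contents($fp) - 1);
        ftruncate($fp, 0);
        rewind($fp);
        fwrite($fp, (string)$count);
        flock($fp, LOCK_UN);
        fclose($fp);
    }
}
```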

 

Augh. Okay. I'm going to test the multiple processes version on a sandbox server to see if I can hang it. :)

 

