Running PHP processes for extended periods of time


tobycatlin

Recommended Posts

Hello everybody

 

I have a PHP script that reads a big XML file and updates a number of database tables that drive our search engine. Depending on the number of XML files to be processed and the load on our servers, this often takes several hours or even days.

 

I call ignore_user_abort() so that the process doesn't stop when the user closes the browser, and I have also implemented a file lock system so that only one indexing process can run at any one time. The script also periodically writes status information to a file so that I can see what's going on and check that the process is still running.
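In outline, the setup looks something like this (heavily simplified, with made-up file names and a placeholder processFile() standing in for the parser):

<?php
// Simplified outline of the long-running indexer described above.
ignore_user_abort(true);   // keep going if the browser disconnects
set_time_limit(0);         // no execution time limit

$lockFile   = '/tmp/indexer.lock';
$statusFile = '/tmp/indexer.status';

if (file_exists($lockFile)) {
    exit("Another indexing run is already in progress\n");
}
file_put_contents($lockFile, getmypid());
// Note: flock() on an open handle would release the lock automatically if the
// process died, which would avoid the stale lock files mentioned below.

foreach ($xmlFiles as $i => $file) {          // $xmlFiles built elsewhere
    processFile($file);                       // placeholder for the parser
    file_put_contents($statusFile,
        date('c') . " processed " . ($i + 1) . " of " . count($xmlFiles) . "\n");
}

unlink($lockFile);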

 

Most of the time this setup works OK, although it is prone to crashing. Quite often recently it just stops updating the status file, and I have to restart the web server and delete the lock files before I can safely start another process. This is a major pain.

 

I would like to know if anyone has experience in running a PHP process for extended periods of time. Is it best to start the PHP script via the browser, or would it be better to use some kind of cron job to run a PHP script from the shell? The parser is complicated and would take a long time to rewrite in any other language, so I want to stick with PHP.

 

I also have the same kind of issues when parsing Apache access log files.

 

Any comments would be very welcome.

 

toby

Link to comment
Share on other sites

My first step in addressing this issue would be to optimize the process in any way possible.  I'm curious about the XML file itself.  How big is the file?  Does it only contain new information, or is it a file that continuously grows in size as data is appended to it?

 

I would also take a look at the operations your script is executing and in which order.  You don't need to rewrite the whole thing, but perhaps re-ordering some of the code will lead to faster execution times.

 

As for invoking the script, you really have two choices.  Either you do it from the command line or from the browser.  As Patrick pointed out, directing the output to /dev/null will mimic a fork on *nix systems and is one option when invoking a script that takes several hours or even days to execute from a browser.  The other choice would be to set up a cron job that searches a DB for a flag to start executing the script and setting that DB flag through the browser.
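A rough sketch of both approaches (the paths, table name and runIndexer() function here are just invented examples):

<?php
// Approach 1: browser-triggered. The page launches the CLI script detached,
// sends its output to /dev/null, and the HTTP request returns immediately.
exec('/usr/bin/php /path/to/indexer.php > /dev/null 2>&1 &');

// Approach 2: cron-driven. A crontab entry such as
//   */5 * * * * /usr/bin/php /path/to/check_flag.php
// runs this script, while the browser only sets a flag in a control table.
$db   = new PDO('oci:dbname=ORCL', 'scott', 'tiger');   // example connection only
$flag = $db->query('SELECT run_requested FROM index_control WHERE id = 1')
           ->fetchColumn();

if ($flag) {
    $db->exec('UPDATE index_control SET run_requested = 0 WHERE id = 1');
    runIndexer();   // placeholder for the actual indexing code
}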

 

However, due to the execution time required for your script, I think your first step should be to optimize where possible.  If you provide more details about the things I asked at the beginning of this post, I may be able to give you some ideas.

Link to comment
Share on other sites

Thanks for the replies, guys. I'll definitely try out Patrick's suggestion.

I'd also love to hear your ideas on optimising the parser, because I will have to rewrite the damn thing at some point. Before I start, I fully acknowledge this is a shocking bit of code and can be optimised loads; it's just that there is always something more important to do.

 

Each XML file describes the structure of a medical article.

In a very simplified example, an article contains a number of headings and each heading contains a number of words:

<article article_id="12323">
  <headings>
    <heading word_count="4" heading_id="1245">
      <content>heading content goes here</content>
      <words>
        <word>heading</word>
        <word>content</word>
        <word>goes</word>
        <word>here</word>
      </words>
    </heading>
    <!-- ...more headings... -->
  </headings>
</article>

 

There are other tables that hold other metadata, but the three main tables are articles, headings and words. The parser loads the whole XML file into memory and then converts it into a nested PHP array.

We then start looping through all the articles (multiple files are submitted at once) and do a database lookup on the articles table using the article_id: if there is no matching row we insert one, and if it has changed we update it. Then we move on to the heading data, do a select against the database to see if anything has changed and either insert, update or delete, and finally move on to the word table and do the same.

All deletions are saved in an array and done in one go at the end of the article iteration.

The database is Oracle.

The OS is Solaris.

There are around 2,500 articles in the articles table,

around 300,000 headings in the headings table,

and about 5 million plus words in the words table.

I have done some spikes to see if it would be quicker to just wipe all the data for an article and simply re-insert it, but this proved slower.

 

Ideally, I think the best way would be to load the file data into memory, do three queries to get all the article data from the database, then loop through the file data comparing it to the database data, and create a set of SQL statements that can be run in one big hit.
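As a very rough sketch of that idea for the headings table (column names simplified, and assuming an OCI8 connection in $conn):

<?php
// Rough sketch of the "diff in memory, then batch the SQL" idea for one
// article's headings. Table/column names are simplified examples.
$stmt = oci_parse($conn, 'SELECT heading_id, word_count FROM headings WHERE article_id = :id');
oci_bind_by_name($stmt, ':id', $articleId);
oci_execute($stmt);

$inDb = array();
while ($row = oci_fetch_assoc($stmt)) {
    $inDb[$row['HEADING_ID']] = $row;          // key by id for cheap lookups
}

$inserts = array();
$updates = array();
foreach ($fileHeadings as $h) {                // headings parsed from the XML
    if (!isset($inDb[$h['heading_id']])) {
        $inserts[] = $h;
    } elseif ($inDb[$h['heading_id']]['WORD_COUNT'] != $h['word_count']) {
        $updates[] = $h;
    }
    unset($inDb[$h['heading_id']]);
}
$deleteIds = array_keys($inDb);                // whatever is left no longer exists in the file

// $inserts, $updates and $deleteIds can now be applied in one pass each,
// reusing a single prepared statement per operation instead of a lookup per row.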

 

I can hear you saying: why go through this painful process when you could use Oracle Text or Lucene and create a much faster, simpler, more powerful search engine? I couldn't agree more. One day...

 

Thanks again for any bolts of brilliance.

Link to comment
Share on other sites

You didn't really answer my questions about the origins of the XML file you are parsing.

 

So here we go again:

 

Is this a single XML file?  In your original post you refer to it as "a big xml file", but in your second post you say "each xml file describes...".  So now I'm unclear whether you're working with a single file or multiple files.

 

What are the origins of the XML file(s)?  Are they created by your software or someone else's?  Do you have any control over the format / content of the XML file(s)?

 

I asked previously if the XML file contains only new information or if it contains everything you may have processed previously as well.  From your latest post, it sounds like the file contains old data that may or may not have changed.  If a large portion of your XML file is mostly static data that changes infrequently, you're wasting an awful lot of time processing a file of data you've already processed before.  If possible, you should set the XML file up so that it only contains data that needs to be inserted or updated and eliminate all data that hasn't changed.  Even better would be to separate the contents of the XML file into two files: insert.xml and update.xml.  Everything in insert.xml should be guaranteed to be a new article that requires insertion into the DB; everything in update.xml should be guaranteed to be data that is changed and requires updating.  This reduces the amount of database checking your script has to make.

 

How are you going about reading the file's contents into memory?  Are you loading the entire file into a single variable and then parsing that?  If so, I'd be curious to know what's happening with your page file usage.  You could be killing your system by trying to load the entire contents at once.

 

What kind of benchmarking do you have in place?  Do you even know which part of the script is eating up the most time?  Without proper benchmarking, you have no idea if it's spending 80% of its time running DB queries and 20% doing file I/O, or what's going on.  Without knowing where it's spending its time, it's hard to optimize.
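Even something crude will tell you where the time goes, e.g. wrapping the main phases in microtime() calls (the function names below are placeholders for whatever your script actually does):

<?php
// Crude per-phase timing plus a peak memory check.
$t    = microtime(true);
$data = parseXmlFile($file);                   // placeholder for your parser
printf("parse:   %.1fs, peak mem %.1f MB\n",
       microtime(true) - $t, memory_get_peak_usage(true) / 1048576);

$t = microtime(true);
updateDatabase($data);                         // placeholder for the DB work
printf("db work: %.1fs\n", microtime(true) - $t);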

 

The last thing I can think of is: what does your table setup look like?  More specifically, what types of indexes are declared on your tables?

 

Hope some of that helps.

Link to comment
Share on other sites

Those are useful comments.

 

Sorry I didn't answer your question. I did mention that multiple files are submitted at once, but looking back it isn't very clear. We have a Windows application that is distributed to our writers. It is basically a beefed-up word processor that allows us to allocate articles to people and hold the data in a central place. From this Windows app we use XSLT transforms to create the HTML files and the other XML files that need to be parsed to drive the search. The XML files contain a mixture of new info, changed info and information that needs to be taken out of the DB; it depends on what part of the article has been edited. Each XML file is around 2 MB in size, and around 20 files are generally submitted at once.

 

What I haven't mentioned so far is that there are three web environments used at various stages of dev and testing, and of course production. All the databases must be synchronised with the data held in the Windows app. Due to security, only the dev databases can be accessed externally, so when the XML file is created there is no knowledge of what is in any of the databases. So the parser on each server must decide what information to insert/update/delete based on what's in the file.

 

From monitoring CPU usage on the servers, it doesn't seem to be CPU bound. I think it is I/O bound, and this is mostly caused by excessive database lookups to see what action needs to be taken with a chunk of data. It's that old game: reduce the DB queries and improve the performance.

 

I totally agree about the benchmarking; I have been thinking about that. I have also come up with a load of functions that use regular expressions to pull out certain parts of the file rather than having to load the whole thing into memory.

 

No matter how much I optimise the parser, there will always be times when it needs to run for hours or even days. Once in a while we need to process all 2,500 files, so I am still very interested to see if that forking technique will help reliability.

 

Better get back to work...

Link to comment
Share on other sites

I totally agree about the benchmarking; I have been thinking about that. I have also come up with a load of functions that use regular expressions to pull out certain parts of the file rather than having to load the whole thing into memory.

 

If I understand correctly, you're using regexes on an XML file to avoid loading it into memory? Why not simply use an event-based, old-fashioned SAX parser?
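Something along these lines with PHP's expat-based xml_parser keeps memory use flat no matter how big the file gets (handler bodies left as stubs, file name just an example):

<?php
// Bare-bones SAX-style parse: the file is read in 4 KB chunks,
// so memory use stays flat regardless of the file size.
function startTag($parser, $name, $attrs) { /* <article>, <heading>, <word>... */ }
function endTag($parser, $name)           { /* flush a buffered row to a DB queue */ }
function charData($parser, $text)         { /* accumulate <content>/<word> text */ }

$parser = xml_parser_create();
xml_parser_set_option($parser, XML_OPTION_CASE_FOLDING, false);
xml_set_element_handler($parser, 'startTag', 'endTag');
xml_set_character_data_handler($parser, 'charData');

$fp = fopen('article.xml', 'r');
while (!feof($fp)) {
    xml_parse($parser, fread($fp, 4096), feof($fp));
}
fclose($fp);
xml_parser_free($parser);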

Link to comment
Share on other sites

Is your word processor set up so that it only saves files that have been modified?  If I were to open an article in your word processor, not change anything, and close it, does it rewrite the XML file to disk?  If the answer is yes, you need to change that behavior, because then your script can keep track of the last time it ran and only process files with a timestamp greater than or equal to the last time the script ran.
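As a sketch of that idea (the directory and marker file names are made up):

<?php
// Skip files that have not been touched since the last successful run.
$marker  = '/tmp/indexer.lastrun';
$lastRun = file_exists($marker) ? (int) file_get_contents($marker) : 0;

$toProcess = array();
foreach (glob('/data/articles/*.xml') as $file) {
    if (filemtime($file) >= $lastRun) {
        $toProcess[] = $file;
    }
}

// ...process $toProcess...

file_put_contents($marker, time());   // record only after a successful run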

 

It's that old game: reduce the DB queries and improve the performance.

 

I also suggest trying to reduce the input size so that there are fewer files to process and each file takes less time.

 

It's still unclear how complex your processing algorithm is, so perhaps that can be optimized as well.

Link to comment
Share on other sites
