
I'm using a MySQL database as a queue for pages that have been downloaded by a PHP-based crawler. Sets of pages will be added several times per second, and pages will be read by a separate PHP indexer running as a daemon, also several times a second. This would obviously lead to clashes - I guess the obvious solution is to wrap the crawler's writes in transactions, although this would cause delays for the indexer.
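One way to avoid clashes without making the indexer wait on crawler transactions is an atomic "claim" pattern: the indexer marks a batch of rows as its own in a single UPDATE, then reads back only the rows it claimed. A rough sketch, assuming an InnoDB `pages` table with a `status` column (table and worker names are illustrative):

```sql
-- Illustrative queue table: each downloaded page is a row with a status flag.
CREATE TABLE pages (
    id         INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    url        VARCHAR(255) NOT NULL,
    body       MEDIUMTEXT,
    status     ENUM('new', 'claimed', 'done') NOT NULL DEFAULT 'new',
    claimed_by VARCHAR(64) NULL,
    KEY (status)
) ENGINE=InnoDB;

-- Indexer: claim up to 50 unprocessed rows in one atomic statement.
-- Two indexer processes can never grab the same row, because each
-- UPDATE only matches rows still marked 'new'.
UPDATE pages
   SET status = 'claimed', claimed_by = 'indexer-1'
 WHERE status = 'new'
 LIMIT 50;

-- Read back the rows this worker just claimed.
SELECT id, url, body
  FROM pages
 WHERE status = 'claimed' AND claimed_by = 'indexer-1';

-- Mark them done (or DELETE them) once indexed.
UPDATE pages
   SET status = 'done'
 WHERE status = 'claimed' AND claimed_by = 'indexer-1';
```

If you run more than one indexer, each needs a unique `claimed_by` token (e.g. hostname plus PID), otherwise two workers would read back each other's batches.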


The other issue is the number of database round trips the indexer would make and the performance lag this would create. I guess this could be reduced by pulling records out of the db in batches, but then memory could become an issue. Another option would be to store the pages in flat files, but then I'm not sure how I could stop both applications from accessing the same file simultaneously.
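On the flat-file option: PHP's `flock()` gives advisory file locking, so the crawler and indexer can share one spool file without trampling each other. A minimal sketch, where the crawler appends records under an exclusive lock and the indexer drains the whole file in one locked batch (the path and record format are made up for the example):

```php
<?php
// Shared spool file (illustrative path).
$path = '/tmp/crawler_spool.txt';
@unlink($path);  // start clean for this demo

// Writer (crawler): append one record under an exclusive lock.
function spool_append($path, $line) {
    $fh = fopen($path, 'ab');
    if (flock($fh, LOCK_EX)) {       // block until we hold the lock
        fwrite($fh, $line . "\n");
        fflush($fh);
        flock($fh, LOCK_UN);
    }
    fclose($fh);
}

// Reader (indexer): take the whole spool and truncate it atomically,
// so one lock covers both the read and the reset.
function spool_drain($path) {
    $fh = fopen($path, 'c+b');       // create if missing, don't truncate
    $lines = array();
    if (flock($fh, LOCK_EX)) {
        $data = stream_get_contents($fh);
        if ($data !== '') {
            $lines = explode("\n", rtrim($data, "\n"));
        }
        ftruncate($fh, 0);
        flock($fh, LOCK_UN);
    }
    fclose($fh);
    return $lines;
}

spool_append($path, 'http://example.com/a');
spool_append($path, 'http://example.com/b');
$batch = spool_drain($path);         // the two URLs, in insertion order
```

Note that `flock()` locks are advisory: they only work if both the crawler and the indexer go through the same locking calls, and draining in one batch also answers the round-trip problem, since the indexer touches the file once per cycle rather than once per page.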


Does anyone have any experience of handling a situation like this, or any ideas on how to go about it efficiently?


I've also been looking at queue software like beanstalkd, but I'm not sure how I would implement it.
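For what it's worth, the usual PHP client for beanstalkd is pheanstalk (`composer require pda/pheanstalk`). Roughly, and assuming that library's API as documented in its README - the tube name, the page payload, and `index_page()` are placeholders, not real code from this thread:

```php
<?php
// Sketch of a crawler/indexer pair talking through beanstalkd.
// Requires a running beanstalkd server and the pheanstalk client.
require 'vendor/autoload.php';

use Pheanstalk\Pheanstalk;

$queue = Pheanstalk::create('127.0.0.1');

// Crawler side: push each downloaded page as one job.
$html = '<html>...</html>';            // the downloaded page (placeholder)
$queue->useTube('pages');
$queue->put(json_encode(array(
    'url'  => 'http://example.com/',
    'body' => $html,
)));

// Indexer daemon: block until a job arrives, process it, then delete it.
$queue->watch('pages');
while (true) {
    $job  = $queue->reserve();         // blocks until a job is ready
    $page = json_decode($job->getData(), true);
    index_page($page);                 // your indexing routine (hypothetical)
    $queue->delete($job);              // remove only after successful indexing
}
```

The nice property here is that `reserve()` hands each job to exactly one consumer, so the clash problem disappears, and a job that isn't deleted (e.g. the indexer crashes) goes back on the queue after its time-to-run expires.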


Any suggestions welcome - cheers!!
