Fetching multiple sites at once.


Anidazen

Hey guys,

I asked about this a couple of days ago and got no response, so I'm re-posting and elaborating.


Basically, here's the issue:
- I have a script that fetches several websites while the user waits. Because each search is custom, the results can't be cached efficiently - they have to be fetched in real time.

- Using standard cURL requests, fetching and parsing each website in sequence produces an uncomfortable load time. A high-quality loading bar can only buy you so much patience!

- A very kind member of these forums (printf) helped with a class a while ago, but printf no longer visits the boards (I believe). The problem is that the class is simply too unstable, with random timeouts occurring for one reason or another.



So I'm looking for a stable way to download more than one website at once in PHP. I really can't believe that wanting to do this is as rare as it seems to be; I'd have thought it would be mainstream.
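For what it's worth, PHP's bundled cURL extension can do this natively through the `curl_multi_*` functions, with no third-party class involved. A minimal sketch (the URLs are placeholders, and the timeout value is just an example):

```php
<?php
// Fetch several URLs concurrently with curl_multi (sketch; URLs are examples).
$urls = array(
    'http://www.example.com/',
    'http://www.example.org/',
    'http://www.example.net/',
);

$mh = curl_multi_init();
$handles = array();

foreach ($urls as $i => $url) {
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the body instead of printing it
    curl_setopt($ch, CURLOPT_TIMEOUT, 10);          // per-request timeout in seconds
    curl_multi_add_handle($mh, $ch);
    $handles[$i] = $ch;
}

// Run all transfers at once until every one has finished.
$running = null;
do {
    curl_multi_exec($mh, $running);
    curl_multi_select($mh); // wait for activity instead of busy-looping
} while ($running > 0);

// Collect each response body and clean up.
$results = array();
foreach ($handles as $i => $ch) {
    $results[$i] = curl_multi_getcontent($ch);
    curl_multi_remove_handle($mh, $ch);
    curl_close($ch);
}
curl_multi_close($mh);
```

The total wall-clock time is roughly that of the slowest single request rather than the sum of all of them, which is exactly the win you're after here.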

Anyway - does anyone have any suggestions for how to do this? I'm considering an AJAX-style approach: loading each request in an individual frame, then passing the information back either through the browser (JavaScript) or through the server (MySQL).




One glimmer of hope appears to be the "PECL HTTP" extension, from this site: http://pecl.php.net/package/pecl_http

It says it supports parallel requests in PHP 5+. I don't know anything about this, and maybe somebody on this forum can give me some more info. (Does this mean separate, concurrent page fetches are possible?) There seems to be very, very little community-based information on this extension, and the documentation is far from helpful.
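I haven't used pecl_http in anger, so treat this as a sketch of what its documentation describes: the extension's `HttpRequestPool` class sends several `HttpRequest` objects concurrently and blocks until they have all completed. The URLs below are placeholders:

```php
<?php
// Sketch of parallel requests with pecl_http 1.x (HttpRequestPool).
// Requires the PECL extension from http://pecl.php.net/package/pecl_http.
$pool = new HttpRequestPool(
    new HttpRequest('http://www.example.com/', HttpRequest::METH_GET),
    new HttpRequest('http://www.example.org/', HttpRequest::METH_GET)
);

$pool->send(); // blocks until every request in the pool has finished

// The pool is iterable; each element is the finished HttpRequest object.
foreach ($pool as $request) {
    printf("%s returned %d\n",
        $request->getUrl(),
        $request->getResponseCode());
}
```

So yes, it should mean separate, concurrent fetches within a single PHP request, much like `curl_multi` but with an object-oriented wrapper.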


Edit: forgot to mention - is there some other technology that would be better suited to this task than PHP?


So I know I've raised a lot of questions in a single post, but if people could give some help or advice on any of them, it would be appreciated.

The class has been updated many times; I think it's on version 1.2 now. I know earlier versions had some problems, but without knowing what you're doing with the class, it's difficult to work out how I could make you a custom version for your use case. The new version can fetch 1000 pages over 20 concurrent streams in less than 5 seconds. I have people using it with the extended XML class, fetching thousands of documents every hour. For Windows users I even added a service option: the class can listen on a given port and handle SOAP, XML, and HTTP requests. I have it running as a spider, and it does around 400,000+ pages an hour, including full indexing with the extended extractor class (page, images, CSS, JavaScript). PM me and I will help you...

printf
