topcat Posted December 20, 2010

I'm trying to create a simple search engine bot, and I've found the biggest issue is performance: it takes several hours to index all the links, and ideally I need to run it once a day. The bottleneck is the network transfer time when pulling the data from the remote URLs. This could obviously be reduced massively if I could read from several URLs simultaneously. Does anyone know how I could do this? I'm aware that PHP 5 has process-forking functionality that might solve the problem, but it's only available from the CLI. I've heard it can be done with cURL without having to go down that route; does anyone know how this can be done, or know of any examples on the net that I could look at?

As a cheeky follow-up, I was also thinking a potential problem could be the script getting out of hand and opening too many connections, and I don't want to cause a DoS attack! Is there a way to limit the number of connections to 10, for example? And is there a way of detecting when a cURL connection has finished reading a file, so that a connection manager class could start the next connection as soon as possible?

Thanks for any help people, I've been looking into this stuff for days!
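For reference, a rolling curl_multi approach covers both follow-up questions: the pool is capped at a fixed size, and curl_multi_info_read() reports the moment any transfer finishes so the next URL can be started straight away. Below is a minimal sketch of that pattern; the function name, the cap of 10 and the cURL options are illustrative, not a definitive implementation.

```php
<?php
// Rolling curl_multi sketch: keep at most $max connections open and start
// the next URL as soon as any transfer finishes. The function name, the
// cap of 10 and the options are illustrative.
function crawl_urls(array $urls, $max = 10)
{
    $results = array();
    $mh      = curl_multi_init();
    $active  = 0;

    $add = function ($url) use ($mh, &$active) {
        $ch = curl_init($url);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
        curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
        curl_setopt($ch, CURLOPT_TIMEOUT, 30);
        curl_multi_add_handle($mh, $ch);
        $active++;
    };

    // Prime the pool with the first batch of connections.
    while ($urls && $active < $max) {
        $add(array_shift($urls));
    }

    while ($active > 0) {
        // Drive all transfers; CURLM_CALL_MULTI_PERFORM means "call again immediately".
        do {
            $mrc = curl_multi_exec($mh, $running);
        } while ($mrc == CURLM_CALL_MULTI_PERFORM);

        // Wait for activity on any handle rather than busy-looping.
        if (curl_multi_select($mh) === -1) {
            usleep(100000);
        }

        // curl_multi_info_read() reports each handle the moment it has
        // finished, which is the "connection finished" signal asked about.
        while ($done = curl_multi_info_read($mh)) {
            $ch  = $done['handle'];
            $url = curl_getinfo($ch, CURLINFO_EFFECTIVE_URL);
            $results[$url] = curl_multi_getcontent($ch);
            curl_multi_remove_handle($mh, $ch);
            curl_close($ch);
            $active--;

            // Top the pool back up straight away.
            if ($urls) {
                $add(array_shift($urls));
            }
        }
    }

    curl_multi_close($mh);
    return $results;
}
```

Everything runs in a single PHP process, so no forking or threading is needed, and the cap keeps the crawler from opening an unbounded number of connections at once.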
QuickOldCar Posted December 21, 2010

You can try this: http://www.phpf1.com/manual/curl-multi-exec.html

Here's what I did. I have lists of URLs in text files, and I made a single page that has the cURL code and the MySQL update/insert statements (yes, I do a check, because I only want one copy and no duplicates). It reads the text file, takes the top URL, uses it, and deletes that top URL from the list. It connects to the site, grabs the info if the site is alive, and inserts it. At the end of the script I simply do a meta refresh with however many seconds I'd like to delay the process.

Done this way, you can also run as many instances as you like, on as many servers as you want. I started out doing, say, 20 simultaneous instances with a 1-second refresh. But now I add images as well, so I set it slower so the image is also available just after the post is published. But that's just me; I'm in no hurry.

So let's say you really want to make a search engine: I can easily see you using a single cURL request to get each page, then maybe Simple HTML DOM to parse the page for additional links, grab the titles or whatever plus the href of each link (save them to the database), and then also save those discovered links into text files for future crawling. Number the text files 1.txt, 2.txt, 3.txt, etc. Trim any whitespace and blank lines from the text files; you can even remove any duplicate URLs. In your cURL code, have the URL selection move on to 2.txt (and so on) when 1.txt is empty, and at the end just redirect to some empty page so it doesn't have to keep running in a loop.

Anyway, this is the system I've come up with and used for a year now, and it seems to work pretty well. If search engines were fast and easy to do, there would be a ton more of them out there; it just takes time to connect to all these sites.
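A rough sketch of the "take the top URL off the text file, use it, delete it" step described above, assuming a single running instance (queue.txt, the 1-second refresh and the database step are placeholders; running many instances against the same file would also need locking, e.g. flock(), which this omits):

```php
<?php
// One pass: pop the top URL from the queue file, fetch it with cURL,
// then schedule the next pass with a meta refresh. queue.txt is a placeholder.
$file  = 'queue.txt';
$lines = file($file, FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES);

if (!$lines) {
    // This list is exhausted; here you would move on to 2.txt, 3.txt, ...
    // or redirect to an empty page, as described above.
    exit('queue empty');
}

$url = trim(array_shift($lines));                         // top URL for this pass
file_put_contents($file, implode("\n", $lines) . "\n");   // delete it from the list

$ch = curl_init($url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($ch, CURLOPT_TIMEOUT, 30);
$html = curl_exec($ch);
curl_close($ch);

if ($html !== false) {
    // ... parse the page and do the duplicate-checked INSERT/UPDATE here ...
}

// Delay the next pass, as in the post above.
echo '<meta http-equiv="refresh" content="1">';
```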
topcat (Author) Posted December 21, 2010

Thanks for the reply. To be honest, it wasn't really a question of how to manage the data and the links (although that in itself could be a whole other conversation!) but more about the parallelism aspect of running several crawlers simultaneously from the same server.

Since I posted, though, I've been experimenting with curl_multi, which looks interesting, and also with splitting the crawler and indexer into separate modules, with the indexing done by a separate script running as a background process on the server. That way I could have the indexer and several crawlers running simultaneously, rather than a loop that has to get data from the remote source, index it, insert it into the db and then repeat, essentially indexing just one URL at a time. Although I guess I could fork processes from the command line and have several loops running simultaneously? There are so many options; I can see some serious testing ahead.

If anyone has experience with any of these approaches, or with other techniques, I'd love to hear about it. Thanks
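One way the crawler/indexer split could look, purely as a sketch (the spool directory, the file naming and the parsing step are assumptions, not anything from the thread): the curl_multi crawlers write each fetched page into a spool directory, and a separate CLI script started in the background keeps polling that directory and indexing whatever appears, so fetching and indexing never block each other.

```php
<?php
// indexer.php - a hypothetical background indexer, started from the CLI
// (for example "php indexer.php &") while the crawlers keep fetching.
$spool = __DIR__ . '/spool';

while (true) {
    $files = glob($spool . '/*.html');

    if (!$files) {
        sleep(5);               // nothing queued yet; poll again shortly
        continue;
    }

    foreach ($files as $path) {
        $html = file_get_contents($path);

        // ... parse $html here: extract the title and links and insert them
        //     into the database, just as the single-loop version did ...
        echo 'indexed ' . basename($path) . "\n";

        unlink($path);          // this page is done; remove it from the spool
    }
}
```

Whether the hand-off is a directory, a database table or a proper message queue is a design choice; the point is only that the crawlers and the indexer run as independent processes.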