multi curl optimized for speed


ItsPawl

I've been looking into methods of scraping data from pages and have found several examples that use multi-curl for this. I'm not used to curl, though, and am not completely sure how it works. I need to find the fastest reliable method of getting the content of a number of pages (about 160); I do need all, or close to all, of the pages on every run.

 

Here is an example i got from searching the web which i managed to implement:

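<?php
/**
 * Download several images in parallel with curl_multi.
 *
 * @param array $picsArr List of arrays, each with a 'url' key. Passed by
 *                       reference: each entry gets a 'data' key holding the
 *                       image data, which you can use directly or save in the next step.
 * @return bool false if $picsArr is empty, true otherwise
 */

function getAllPics(&$picsArr){

        if(count($picsArr)<=0) return false;

        $hArr = array();//handle array, keyed the same way as $picsArr

        foreach($picsArr as $k=>$pic){

                $h = curl_init();
                curl_setopt($h,CURLOPT_URL,$pic['url']);
                curl_setopt($h,CURLOPT_HEADER,0);
                curl_setopt($h,CURLOPT_RETURNTRANSFER,1);//return the image data instead of printing it

                $hArr[$k] = $h;//keep the original key so results map back correctly
        }

        $mh = curl_multi_init();
        foreach($hArr as $h)      curl_multi_add_handle($mh,$h);

        $running = null;
        do{
                curl_multi_exec($mh,$running);
                // wait for socket activity instead of spinning the CPU
                if($running > 0 && curl_multi_select($mh,1.0) === -1) usleep(1000);
        }while($running > 0);

        // get the result and save it in the result ARRAY
        foreach($hArr as $k => $h){
                $picsArr[$k]['data'] = curl_multi_getcontent($h);
        }

        // print each image subtype (e.g. "png"), then close all the connections
        foreach($hArr as $k => $h){
                $info = curl_getinfo($h);
                // content_type can be missing or not an image, so only echo on a match
                if(preg_match("/^image\/(.*)$/",(string)$info['content_type'],$matches)){
                        echo $tail = $matches[1];
                }
                curl_multi_remove_handle($mh,$h);
                curl_close($h);
        }
        curl_multi_close($mh);

        return true;
}
?>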

 

Since time is critical in my script, I'd like to ask whether you think this is a good implementation, or whether you can point me towards one that would save a noticeable amount of run time.


I'm trying to get the contents of all the pages in as short a time as possible. As I understand it, multi-curl lets me fetch all the pages in parallel instead of one after the other, which reduces the time spent waiting on latency. (I'm pretty sure calling file_get_contents for each page would take longer, unless maybe it was used with threads somehow.)
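To make the parallel idea concrete for pages rather than images, here is a minimal sketch of how such a fetch might look. The function name fetchAll, the result layout ('body', 'status', 'error') and the timeout values are my own assumptions, not something from the example above. It waits with curl_multi_select instead of spinning, and it uses curl_multi_info_read to find out which transfers failed, so the failed URLs can be retried on a second pass.

```php
<?php
/**
 * Fetch several URLs in parallel (hypothetical helper, a sketch only).
 *
 * @param array $urls    key => URL
 * @param int   $timeout per-transfer timeout in seconds (assumed value)
 * @return array key => ['body' => ?string, 'status' => int, 'error' => ?string]
 */
function fetchAll(array $urls, int $timeout = 10): array
{
    $mh = curl_multi_init();
    $handles = [];
    $results = [];

    foreach ($urls as $key => $url) {
        $h = curl_init();
        curl_setopt_array($h, [
            CURLOPT_URL            => $url,
            CURLOPT_RETURNTRANSFER => true,
            CURLOPT_CONNECTTIMEOUT => 5,        // assumed value
            CURLOPT_TIMEOUT        => $timeout,
        ]);
        curl_multi_add_handle($mh, $h);
        $handles[$key] = $h;
        // Pessimistic default, overwritten once the transfer reports back.
        $results[$key] = ['body' => null, 'status' => 0, 'error' => 'transfer did not complete'];
    }

    // Drive all transfers; sleep in curl_multi_select rather than busy-looping.
    $running = 0;
    do {
        $status = curl_multi_exec($mh, $running);
        if ($running > 0 && curl_multi_select($mh, 1.0) === -1) {
            usleep(1000); // select failed, back off briefly
        }
    } while ($running > 0 && $status === CURLM_OK);

    // Per-transfer result codes are only reported through curl_multi_info_read.
    while (($info = curl_multi_info_read($mh)) !== false) {
        $key = array_search($info['handle'], $handles, true);
        if ($key === false) {
            continue;
        }
        $results[$key]['error'] = $info['result'] === CURLE_OK
            ? null
            : curl_strerror($info['result']);
    }

    foreach ($handles as $key => $h) {
        if ($results[$key]['error'] === null) {
            $results[$key]['body'] = curl_multi_getcontent($h);
        }
        $results[$key]['status'] = curl_getinfo($h, CURLINFO_HTTP_CODE);
        curl_multi_remove_handle($mh, $h);
    }
    curl_multi_close($mh);

    return $results;
}
```

To get all, or close to all, of the pages on every run, one option would be to collect the keys whose 'error' is not null (or whose 'status' is not 200) and pass just those URLs to fetchAll a second time.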

 

I'm only asking whether anyone is familiar with these things and knows of a still faster way, since a low execution time is important in my program.
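For what it's worth, a few curl settings might shave off some more time, though whether they help depends on the servers being scraped and on the installed PHP and libcurl versions. The sketch below is only an illustration: speedOptions and tuneMultiHandle are hypothetical helper names, and the numbers are guesses. It asks for compressed responses, caps the wait per page, and (where the constants exist) limits connections per host and allows HTTP/2 multiplexing.

```php
<?php
/**
 * Per-transfer options that may reduce run time (hypothetical helper).
 * Meant to be passed to curl_setopt_array() along with CURLOPT_URL.
 */
function speedOptions(int $timeout = 10): array
{
    return [
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_ENCODING       => '',       // empty string: accept every encoding libcurl supports (gzip etc.)
        CURLOPT_CONNECTTIMEOUT => 5,        // assumed value: give up early on dead hosts
        CURLOPT_TIMEOUT        => $timeout, // cap total time per page
    ];
}

/**
 * Multi-handle tuning (hypothetical helper). The constants are guarded
 * because they only exist on newer PHP/libcurl builds.
 */
function tuneMultiHandle($mh, int $maxPerHost = 8): void
{
    if (defined('CURLMOPT_MAX_HOST_CONNECTIONS')) {
        // Avoid opening 160 connections to one server at once.
        curl_multi_setopt($mh, CURLMOPT_MAX_HOST_CONNECTIONS, $maxPerHost);
    }
    if (defined('CURLMOPT_PIPELINING') && defined('CURLPIPE_MULTIPLEX')) {
        // Let requests to the same host share one HTTP/2 connection if the server supports it.
        curl_multi_setopt($mh, CURLMOPT_PIPELINING, CURLPIPE_MULTIPLEX);
    }
}
```

Usage would be something like calling tuneMultiHandle($mh) right after curl_multi_init(), and curl_setopt_array($h, speedOptions() + [CURLOPT_URL => $url]) for each handle. It is worth timing a run with and without each setting, since none of them is guaranteed to be faster.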
