
multi curl optimized for speed


ItsPawl


I've been looking into methods of scraping data from pages and have found several examples of using multi-curl to achieve this. But I'm not used to curl and am not completely sure how it works, and I need to find the fastest reliable method (I do need all, or close to all, pages every run) of getting the content of a number of pages (about 160).

 

Here is an example I got from searching the web which I managed to implement:

<?php 
/** 
* 
* @param $picsArr Array of arrays, each element with a ['url'] key. 
* $picsArr will be filled with the image data, which you can use as you want or just save in the next step. 
**/ 

function getAllPics(&$picsArr){ 

        if(count($picsArr)<=0) return false; 

        $hArr = array();//handle array 

        foreach($picsArr as $k=>$pic){ 

                $h = curl_init(); 
                curl_setopt($h,CURLOPT_URL,$pic['url']); 
                curl_setopt($h,CURLOPT_HEADER,0); 
                curl_setopt($h,CURLOPT_RETURNTRANSFER,1);//return the image value 

                array_push($hArr,$h); 
        } 

        $mh = curl_multi_init(); 
        foreach($hArr as $k => $h)      curl_multi_add_handle($mh,$h); 

        $running = null; 
        do{ 
                curl_multi_exec($mh,$running); 
        }while($running > 0); 

        // get the result and save it in the result ARRAY 
        foreach($hArr as $k => $h){ 
                $picsArr[$k]['data'] = curl_multi_getcontent($h); 
        } 

        //close all the connections 
        foreach($hArr as $k => $h){ 
                $info = curl_getinfo($h); 
                preg_match("/^image\/(.*)$/",$info['content_type'],$matches); 
                echo $tail = $matches[1]; 
                curl_multi_remove_handle($mh,$h); 
        } 
        curl_multi_close($mh); 

        return true; 
} 
?> 

 

Since time is critical in my script, I would ask whether you think this is a good implementation, or if you can point me in the direction of one that will save me noticeable run-time.
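For comparison, here is a variant I have seen suggested that waits on the sockets with curl_multi_select() instead of calling curl_multi_exec() in a tight loop, and that sets timeouts so one slow page can't hold up the whole run. This is only a sketch: the fetchAll() name and the timeout values are placeholders, not a tested implementation:

<?php 
// Sketch: same parallel download idea, but the loop sleeps on network 
// activity via curl_multi_select() instead of spinning, and each handle 
// gets a timeout. Timeout values and the fetchAll() name are placeholders. 
function fetchAll(array $urls){ 

        $mh = curl_multi_init(); 
        $handles = array(); 

        foreach($urls as $k => $url){ 
                $h = curl_init($url); 
                curl_setopt($h, CURLOPT_RETURNTRANSFER, 1); 
                curl_setopt($h, CURLOPT_FOLLOWLOCATION, 1); 
                curl_setopt($h, CURLOPT_CONNECTTIMEOUT, 10);//assumed values 
                curl_setopt($h, CURLOPT_TIMEOUT, 30); 
                curl_multi_add_handle($mh, $h); 
                $handles[$k] = $h; 
        } 

        $running = null; 
        do{ 
                curl_multi_exec($mh, $running); 
                // Wait for activity on any transfer instead of looping as 
                // fast as possible; some curl versions return -1 here, so 
                // back off briefly in that case. 
                if($running > 0 && curl_multi_select($mh, 1.0) === -1){ 
                        usleep(100000); 
                } 
        }while($running > 0); 

        // collect the results and clean up the handles 
        $results = array(); 
        foreach($handles as $k => $h){ 
                $results[$k] = curl_multi_getcontent($h); 
                curl_multi_remove_handle($mh, $h); 
                curl_close($h); 
        } 
        curl_multi_close($mh); 

        return $results; 
} 
?> 

The select() call mainly saves CPU rather than wall-clock time, since the transfers are already running in parallel either way; the timeouts are what keep one slow or dead page from stalling the whole run.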


I'm trying to get all the pages' contents in as short a time as possible. As I understand it, using multi-curl allows me to get all the pages in parallel instead of one after the other, thus reducing latency wait times. (I'm pretty sure file_get_contents for each would take longer, unless maybe used with threads somehow.)
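If you want to check that on your own list of pages, a rough timing comparison like the one below would show the difference between fetching them one after the other and fetching them in parallel. It assumes the getAllPics() function from the first post and a placeholder $urls array standing in for the ~160 page addresses:

<?php 
// Rough timing comparison: sequential file_get_contents() vs the multi-curl 
// getAllPics() function above. $urls is a placeholder list of page addresses. 
$urls = array('http://example.com/page1', 'http://example.com/page2' /* ... */); 

// Sequential: each request waits for the previous one to finish. 
$start = microtime(true); 
$sequential = array(); 
foreach($urls as $url){ 
        $sequential[] = file_get_contents($url); 
} 
echo "Sequential: " . round(microtime(true) - $start, 2) . "s\n"; 

// Parallel: all requests are in flight at the same time. 
$start = microtime(true); 
$picsArr = array(); 
foreach($urls as $url){ 
        $picsArr[] = array('url' => $url); 
} 
getAllPics($picsArr); 
echo "Multi-curl: " . round(microtime(true) - $start, 2) . "s\n"; 
?> 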

 

I'm only asking if anyone is familiar with these things and knows of a still faster way, since a low execution time is important in my program.
