I've been looking into methods of scraping data from pages and have found several examples that use multi-curl to achieve this. But I'm not used to curl and am not completely sure how it works, and I need to find the fastest reliable method (I do need all, or close to all, pages on every run) of getting the content of a number of pages (about 160).
Here is an example I found while searching the web and managed to implement:
<?php
/**
 * Fetches all images in $picsArr in parallel using curl_multi.
 *
 * @param array $picsArr Array of arrays, each with a 'url' key.
 *                       On return, each entry also has a 'data' key
 *                       containing the raw image bytes.
 * @return bool false if $picsArr is empty, true otherwise.
 **/
function getAllPics(&$picsArr){
    if (count($picsArr) <= 0) return false;

    // Create one easy handle per URL, keyed the same as $picsArr
    // so results can be matched back to their entries.
    $hArr = array();
    foreach ($picsArr as $k => $pic) {
        $h = curl_init();
        curl_setopt($h, CURLOPT_URL, $pic['url']);
        curl_setopt($h, CURLOPT_HEADER, 0);
        curl_setopt($h, CURLOPT_RETURNTRANSFER, 1); // return the image data instead of printing it
        $hArr[$k] = $h;
    }

    $mh = curl_multi_init();
    foreach ($hArr as $h) curl_multi_add_handle($mh, $h);

    // Drive the transfers. curl_multi_select() blocks until there is
    // network activity, so this loop does not spin at 100% CPU the way
    // a bare do/while around curl_multi_exec() does.
    $running = null;
    do {
        $status = curl_multi_exec($mh, $running);
    } while ($status === CURLM_CALL_MULTI_PERFORM);
    while ($running && $status === CURLM_OK) {
        if (curl_multi_select($mh) === -1) usleep(100);
        do {
            $status = curl_multi_exec($mh, $running);
        } while ($status === CURLM_CALL_MULTI_PERFORM);
    }

    // Get the result of each transfer and save it in the result array.
    foreach ($hArr as $k => $h) {
        $picsArr[$k]['data'] = curl_multi_getcontent($h);
    }

    // Echo each image type, then close all the connections.
    foreach ($hArr as $k => $h) {
        $info = curl_getinfo($h);
        if (preg_match("/^image\/(.*)$/", $info['content_type'], $matches)) {
            echo $tail = $matches[1];
        }
        curl_multi_remove_handle($mh, $h);
        curl_close($h); // the easy handle must be closed separately
    }
    curl_multi_close($mh);
    return true;
}
?>
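For reference, this is roughly how I call it (the example.com URLs are just placeholders):

$pics = array(
    array('url' => 'http://example.com/a.jpg'),
    array('url' => 'http://example.com/b.jpg'),
);
getAllPics($pics);
// $pics[0]['data'] now holds the raw bytes of a.jpg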
Since time is critical in my script, I would like to ask whether you think this is a good implementation, or whether you can point me in the direction of one that will save me noticeable run-time.
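One variation I've come across (I haven't benchmarked it myself) starts only a fixed number of transfers at a time instead of opening all 160 connections at once, refilling the "window" as transfers finish; this is said to be gentler on the target server and on local socket limits. Here is a minimal sketch of that idea; the function name fetchRolling, the window size of 20, and the 30-second timeout are my own placeholders:

<?php
/**
 * Fetch $urls keeping at most $window transfers active at once.
 * Returns an array mapping url => response body.
 */
function fetchRolling(array $urls, $window = 20){
    $results  = array();
    $mh       = curl_multi_init();
    $inFlight = 0; // handles currently attached to the multi handle

    // Start one transfer and attach it to the multi handle.
    $add = function ($url) use ($mh, &$inFlight) {
        $h = curl_init($url);
        curl_setopt($h, CURLOPT_RETURNTRANSFER, 1);
        curl_setopt($h, CURLOPT_TIMEOUT, 30);   // give up on stuck transfers
        curl_setopt($h, CURLOPT_PRIVATE, $url); // remember which URL this handle belongs to
        curl_multi_add_handle($mh, $h);
        $inFlight++;
    };

    // Prime the window with the first $window URLs.
    foreach (array_splice($urls, 0, $window) as $url) $add($url);

    while ($inFlight) {
        // Let curl make progress on every active transfer.
        while (curl_multi_exec($mh, $running) === CURLM_CALL_MULTI_PERFORM);

        // Harvest every finished transfer and refill the window.
        while ($done = curl_multi_info_read($mh)) {
            $h = $done['handle'];
            $results[curl_getinfo($h, CURLINFO_PRIVATE)] = curl_multi_getcontent($h);
            curl_multi_remove_handle($mh, $h);
            curl_close($h);
            $inFlight--;
            if ($urls) $add(array_shift($urls));
        }

        // Block until there is activity instead of spinning.
        if ($inFlight) curl_multi_select($mh);
    }

    curl_multi_close($mh);
    return $results;
}
?>

Would something like this actually be noticeably faster for ~160 pages, or is firing them all off at once fine?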