ItsPawl Posted February 17, 2011

I've been looking into methods of scraping data from pages and have found several examples that use multi-curl to achieve this. But I'm not used to curl and am not completely sure how it works, and I need to find the fastest reliable method (I do need all, or close to all, pages every run) of getting the contents of a number of pages (about 160). Here is an example I found searching the web, which I managed to implement:

<?php
/**
 * @param array $picsArr Array of entries like array('url' => ...);
 *        each entry gets filled with a 'data' key holding the image data,
 *        which you can use as you want or just save in the next step.
 **/
function getAllPics(&$picsArr){
    if(count($picsArr) <= 0) return false;

    $hArr = array(); // handle array
    foreach($picsArr as $k => $pic){
        $h = curl_init();
        curl_setopt($h, CURLOPT_URL, $pic['url']);
        curl_setopt($h, CURLOPT_HEADER, 0);
        curl_setopt($h, CURLOPT_RETURNTRANSFER, 1); // return the image data instead of printing it
        $hArr[$k] = $h; // keep the same key as $picsArr so results line up
    }

    $mh = curl_multi_init();
    foreach($hArr as $h) curl_multi_add_handle($mh, $h);

    // run all the transfers; curl_multi_select() waits for network
    // activity instead of spinning the CPU in a tight loop
    $running = null;
    do{
        curl_multi_exec($mh, $running);
        if($running > 0) curl_multi_select($mh);
    }while($running > 0);

    // get the result and save it in the result array
    foreach($hArr as $k => $h){
        $picsArr[$k]['data'] = curl_multi_getcontent($h);
    }

    // close all the connections
    foreach($hArr as $h){
        $info = curl_getinfo($h);
        // echo the image type (e.g. "jpeg"); skip it if the response isn't an image
        if(preg_match("#^image/(.*)$#", $info['content_type'], $matches)){
            echo $matches[1];
        }
        curl_multi_remove_handle($mh, $h);
        curl_close($h); // free the easy handle as well
    }
    curl_multi_close($mh);
    return true;
}
?>

Since time is critical in my script, I'd like to ask whether you think this is a good implementation, or whether you can point me in the direction of one that will save me noticeable run time.
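For reference, this is roughly how I'm calling it (the URLs here are just placeholders, not my real pages):

<?php
// build the input the function expects: one array('url' => ...) entry per page
$picsArr = array(
    array('url' => 'http://example.com/pics/1.jpg'),
    array('url' => 'http://example.com/pics/2.jpg'),
);
getAllPics($picsArr);
// each entry now also has a 'data' key holding the response body
echo strlen($picsArr[0]['data']) . " bytes\n";
?>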
BlueSkyIS Posted February 17, 2011

I might not understand the complexity of the problem, but is file_get_contents() not an option? If possible, I would loop over each URL and use file_get_contents() to get the page content.
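Something like this, roughly (the URL list is just an example, and error handling is minimal):

<?php
// fetch each page one after the other
$urls = array('http://example.com/page1', 'http://example.com/page2');
$pages = array();
foreach ($urls as $url) {
    $html = file_get_contents($url);
    if ($html !== false) {   // file_get_contents() returns false on failure
        $pages[$url] = $html;
    }
}
?>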
ItsPawl (Author) Posted February 17, 2011

I'm trying to get all the pages' contents in as short a time as possible. As I understand it, using multi-curl lets me fetch all the pages in parallel instead of one after the other, which cuts out the per-request latency waits. (I'm pretty sure file_get_contents() for each would take longer, unless maybe it was used with threads somehow.) I'm only asking whether anyone familiar with these things knows of a still faster way, since low execution time is important in my program.
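In case it's useful to anyone, here's a crude sketch of how I'd time the two approaches against each other (the URLs are placeholders, and getAllPics() is the function from my first post):

<?php
$urls = array('http://example.com/page1', 'http://example.com/page2');

// sequential: each request blocks until that page has finished downloading
$t0 = microtime(true);
$pages = array();
foreach ($urls as $url) {
    $pages[] = file_get_contents($url);
}
echo "sequential: " . round(microtime(true) - $t0, 3) . "s\n";

// parallel: all requests run at once via multi-curl
$picsArr = array();
foreach ($urls as $url) {
    $picsArr[] = array('url' => $url);
}
$t0 = microtime(true);
getAllPics($picsArr);
echo "multi-curl: " . round(microtime(true) - $t0, 3) . "s\n";
?>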