dilbertone Posted May 24, 2011

I am currently writing a small parser and harvester that collects data from this website:

http://www.aktive-buergerschaft.de/buergerstiftungsfinder

I want all the foundations listed on that page (see the examples below). I think I need to choose between file_get_contents and curl to fetch the data. I also need some kind of parser, and I do not know which one to use here. Can you give me some hints?

First, the fetching part. With curl: I have never needed to use curl myself, but the obvious resource is the php.net example:

    <?php
    // create a new cURL resource
    $ch = curl_init();
    // set URL and other appropriate options
    curl_setopt($ch, CURLOPT_URL, "http://www.example.com/");
    curl_setopt($ch, CURLOPT_HEADER, 0);
    // return the response as a string instead of printing it
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    // grab URL and store the response body in $data
    $data = curl_exec($ch);
    // close cURL resource, and free up system resources
    curl_close($ch);
    // then you can use $data for parsing
    ?>

To be frank: if curl is not available, file_get_contents() will work too. I think it is only about 1-2 seconds slower, and the call is much easier:

    <?php
    $html = file_get_contents('http://www.example.com');
    // now all the HTML is in $html
    ?>

Anyway, I think the much more interesting part is the parsing. I have to parse the page to get the following data (see the site for examples: http://www.aktive-buergerschaft.de/buergerstiftungsfinder):

    Bürgerstiftung Lebensraum Aachen
    rechtsfähige Stiftung des bürgerlichen Rechts
    Ansprechpartner: Hubert Schramm
    Alexanderstr. 69/ 71
    52062 Aachen
    Telefon: 0241 - 4500130
    Telefax: 0241 - 4500131
    Email: [email protected]
    www.buergerstiftung-aachen.de
    >> Weitere Details zu dieser Stiftung

    Bürgerstiftung Achim
    rechtsfähige Stiftung des bürgerlichen Rechts
    Ansprechpartner: Helga Kühn
    Rotkehlchenstr. 72
    28832 Achim
    Telefon: 04202-84981
    Telefax: 04202-955210
    Email: [email protected]
    www.buergerstiftung-achim.de
    >> Weitere Details zu dieser Stiftung

    BürgerStiftung Region Ahrensburg
    rechtsfähige Stiftung des bürgerlichen Rechts
    Ansprechpartner: Dr. Michael Eckstein
    An der Reitbahn 3
    22926 Ahrensburg
    Telefon: 04102 - 67 84 89
    Telefax: 04102 - 82 34 56
    Email: [email protected]
    www.buergerstiftung-region-ahrensburg.de
    >> Weitere Details zu dieser Stiftung

Note the link ">> Weitere Details zu dieser Stiftung" ("Further details on this foundation") in each entry: I also need to grab the data that is "behind" this link.

Link to comment: https://forums.phpfreaks.com/topic/237364-file_get_contents-or-curl-which-one-to-take-for-a-little-parser/
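The two fetching options in the post can be combined into one helper. The following is only a minimal sketch under assumptions: the function name fetch_html and the 10-second default timeout are hypothetical and do not come from the thread. It prefers cURL when the extension is loaded and otherwise falls back to file_get_contents().

```php
<?php
// Hypothetical helper (not from the thread): fetch a URL as a string,
// preferring cURL and falling back to file_get_contents().
// Returns null when the request fails.
function fetch_html(string $url, int $timeout = 10): ?string
{
    if (function_exists('curl_init')) {
        $ch = curl_init();
        curl_setopt($ch, CURLOPT_URL, $url);
        // Return the body as a string instead of printing it.
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
        curl_setopt($ch, CURLOPT_TIMEOUT, $timeout);
        $body = curl_exec($ch);
        // The handle is released when $ch goes out of scope.
        return is_string($body) ? $body : null;
    }

    // Fallback: the 'http' timeout option applies to HTTP requests only.
    $context = stream_context_create(['http' => ['timeout' => $timeout]]);
    $body = @file_get_contents($url, false, $context);
    return $body === false ? null : $body;
}
```

A call such as fetch_html('http://www.aktive-buergerschaft.de/buergerstiftungsfinder') would then yield either the page source or null, so the parsing step can check for failure before running any pattern.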
xyph Posted May 24, 2011

Once you have the HTML data, the easiest way to grab parts of it is with RegEx.
dilbertone (Author) Posted May 24, 2011

Hello xyph, many thanks for the quick reply!

Quote: "Once you have the html data, the easiest way to grab parts is using RegEx"

Thanks for the hint, I will try it out. With this:

    function do_reg($text, $regex)
    {
        // Return the first match of $regex in $text, or "" if there is none.
        if (preg_match($regex, $text, $regs)) {
            return $regs[0];
        }
        return "";
    }

or this:

    function do_reg($text, $regex)
    {
        // Return every full match of $regex in $text as an array.
        preg_match_all($regex, $text, $result, PREG_PATTERN_ORDER);
        return $result[0];
    }

I will try both and see which regex fits best. Again, many thanks for any and all help!

db1
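For the data "behind" the ">> Weitere Details zu dieser Stiftung" links, the same preg_match_all approach could first collect the link targets. This is a rough sketch only: the function name extract_detail_links is hypothetical, and the assumed markup (a plain <a href="..."> tag whose text contains "Weitere Details") has not been checked against the real page.

```php
<?php
// Hypothetical helper: collect the href of every link whose text contains
// "Weitere Details". The assumed markup may not match the real site.
function extract_detail_links(string $html): array
{
    $pattern = '%<a\s[^>]*href="([^"]+)"[^>]*>[^<]*Weitere Details[^<]*</a>%i';
    preg_match_all($pattern, $html, $result, PREG_PATTERN_ORDER);
    // With PREG_PATTERN_ORDER, $result[1] holds the first capture group
    // (the href value) of every match.
    return $result[1];
}

// Made-up sample markup; the paths are placeholders.
$sample = '<a href="/detail/1">&gt;&gt; Weitere Details zu dieser Stiftung</a>'
        . '<a href="http://www.example.com/">Home</a>';
print_r(extract_detail_links($sample));
```

Each collected target can then be fetched and parsed like the list page. If the hrefs turn out to be relative, they would need to be resolved against the site's base URL first.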
xyph Posted May 24, 2011

The key is in the pattern. Something like

    preg_match_all(
        '%<dt>([^<]++)</dt>\s++
        <dd\ class="refo">([^<]++)</dd>\s++
        <dd>Ansprechpartner:\s++([^<]++)</dd>
        # etc
        %x',
        $subject, $result, PREG_SET_ORDER);
    print_r($result);

is what you want.
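To see what that pattern produces, it can be run against a small fragment. The sketch below makes several assumptions: the sample markup is modeled on the pattern itself and may differ from the real page, and the function name parse_foundations and the group names (name, legal_form, contact) are hypothetical additions. The u modifier is added because the sample text is UTF-8.

```php
<?php
// Hypothetical helper: apply the pattern from the post above, with named
// groups added, and return one associative array per foundation.
function parse_foundations(string $html): array
{
    // The x modifier ignores literal whitespace in the pattern, so the
    // space inside the <dd class="refo"> tag is escaped as "\ ".
    $pattern = '%<dt>(?<name>[^<]++)</dt>\s++
        <dd\ class="refo">(?<legal_form>[^<]++)</dd>\s++
        <dd>Ansprechpartner:\s++(?<contact>[^<]++)</dd>
        %xu';

    preg_match_all($pattern, $html, $matches, PREG_SET_ORDER);

    $foundations = [];
    foreach ($matches as $m) {
        $foundations[] = [
            'name'       => trim($m['name']),
            'legal_form' => trim($m['legal_form']),
            'contact'    => trim($m['contact']),
        ];
    }
    return $foundations;
}

// Sample markup is an assumption; only the text values come from the thread.
$subject = <<<HTML
<dl>
<dt>Bürgerstiftung Lebensraum Aachen</dt>
<dd class="refo">rechtsfähige Stiftung des bürgerlichen Rechts</dd>
<dd>Ansprechpartner: Hubert Schramm</dd>
<dt>Bürgerstiftung Achim</dt>
<dd class="refo">rechtsfähige Stiftung des bürgerlichen Rechts</dd>
<dd>Ansprechpartner: Helga Kühn</dd>
</dl>
HTML;

print_r(parse_foundations($subject));
```

Named groups keep the later code readable once the "# etc" part of the pattern grows to cover address, phone, and fax. If the real markup varies between entries, a DOM-based parser may be more robust than a single pattern.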
devWhiz Posted May 25, 2011

file_get_contents() takes less code and is quicker to write for many people, but I prefer curl. When I write scripts, a big factor is how fast I can grab data from a header. My scripts are optimized for servers, and I mainly write them to automate actions on MySpace and Facebook applications. All of them need to load a header fast, grab data, parse it, put the data I need into variables, and then build new headers and send the information to the server. I'm just rambling, my bad.

I don't know anything about preg_match yet. I think I might try to learn it soon.
dilbertone (Author) Posted May 25, 2011

Hello CueL3SS, many thanks!

Great to read you and your ideas. I will try everything that is written in this thread.

Greetings