file_get_contents or Curl - which one to take for a little parser

dilbertone · May 24, 2011

I am currently writing a small parser and harvester that collects the data from this website:

http://www.aktive-buergerschaft.de/buergerstiftungsfinder

I want to collect all the foundations listed on this page (see the examples below). I think I need to choose between file_get_contents and cURL to fetch the data.

I also need a parsing approach, and I do not know which one to use here. Can you give me some hints?

First, the fetching part, with cURL:

I have never needed to use cURL myself, but the obvious resource is the example on php.net:

<?php
// create a new cURL handle
$ch = curl_init();

// set the URL and other appropriate options
curl_setopt($ch, CURLOPT_URL, "http://www.example.com/");
curl_setopt($ch, CURLOPT_HEADER, 0);
// return the response body as a string instead of printing it
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);

// fetch the URL; $data holds the HTML, or false on failure
$data = curl_exec($ch);

// close the cURL handle and free up system resources
curl_close($ch);

// then you can use $data for parsing
?>

To be frank:

If cURL is not available, file_get_contents() will work too. I think it is only about 1-2 seconds slower, but the call is much simpler!

<?php
$html = file_get_contents('http://www.example.com');

// now all the HTML is in $html
?>
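
The choice described above (cURL when it is available, file_get_contents() otherwise) could be wrapped in a single helper. This is only a minimal sketch: the function name fetch_html and the 10-second timeout are assumptions made for illustration, not something taken from php.net or from this thread. Keep in mind that file_get_contents() can only open URLs when allow_url_fopen is enabled.

```php
<?php
// Hypothetical helper (name and timeout are illustrative assumptions).
// Returns the response body as a string, or null if the fetch failed.
function fetch_html(string $url, int $timeout = 10): ?string
{
    if (function_exists('curl_init')) {
        $ch = curl_init($url);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the body instead of printing it
        curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true); // follow HTTP redirects
        curl_setopt($ch, CURLOPT_TIMEOUT, $timeout);    // give up after $timeout seconds
        $body = curl_exec($ch);
        return $body === false ? null : $body;
    }

    // Fallback: needs allow_url_fopen=On for http:// URLs.
    $context = stream_context_create(['http' => ['timeout' => $timeout]]);
    $body = @file_get_contents($url, false, $context);
    return $body === false ? null : $body;
}
```

A call such as $html = fetch_html('http://www.example.com/'); then yields either the page source or null, so the parsing step can be skipped cleanly when the download fails.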

Anyway, I think the much more interesting part is the parsing.

I have to parse the page in order to get the following data (see the site for examples: http://www.aktive-buergerschaft.de/buergerstiftungsfinder):

Bürgerstiftung Lebensraum Aachen

rechtsfähige Stiftung des bürgerlichen Rechts

Ansprechpartner: Hubert Schramm

Alexanderstr. 69/ 71

52062 Aachen

Telefon: 0241 - 4500130

Telefax: 0241 - 4500131

Email: [email protected]

www.buergerstiftung-aachen.de

>> Weitere Details zu dieser Stiftung

Bürgerstiftung Achim

rechtsfähige Stiftung des bürgerlichen Rechts

Ansprechpartner: Helga Kühn

Rotkehlchenstr. 72

28832 Achim

Telefon: 04202-84981

Telefax: 04202-955210

Email: [email protected]

www.buergerstiftung-achim.de

>> Weitere Details zu dieser Stiftung

BürgerStiftung Region Ahrensburg

rechtsfähige Stiftung des bürgerlichen Rechts

Ansprechpartner: Dr. Michael Eckstein

An der Reitbahn 3

22926 Ahrensburg

Telefon: 04102 - 67 84 89

Telefax: 04102 - 82 34 56

Email: [email protected]

www.buergerstiftung-region-ahrensburg.de

>> Weitere Details zu dieser Stiftung

Note the link ">> Weitere Details zu dieser Stiftung" ("Further details about this foundation"): I also need to grab the data that is behind this link!
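
One possible way to find those detail links is PHP's DOM extension rather than a regex. The following is a sketch under assumptions: the helper name extract_detail_links is made up, and the sample markup is a guess at how the page might wrap the link, not a copy of the real site. It returns the raw href values, so relative links would still have to be resolved against the site's base URL before fetching them.

```php
<?php
// Hypothetical helper: returns the href of every link whose text
// contains "Weitere Details". The markup it expects is an assumption.
function extract_detail_links(string $html): array
{
    $previous = libxml_use_internal_errors(true); // real-world HTML is rarely valid
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $links = [];
    foreach ($xpath->query('//a[contains(., "Weitere Details")]') as $a) {
        $links[] = $a->getAttribute('href');
    }
    return $links;
}

// Assumed sample markup, for illustration only.
$sample = '<dl><dd><a href="/stiftung/1">&gt;&gt; Weitere Details zu dieser Stiftung</a></dd>'
        . '<dd><a href="/impressum">Impressum</a></dd></dl>';
print_r(extract_detail_links($sample));
```

Each returned link can then be fetched in turn, and the detail page parsed the same way as the overview page.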

xyph · May 24, 2011

Once you have the html data, the easiest way to grab parts is using RegEx

dilbertone · May 24, 2011

Hello xyph,

many thanks for the quick reply!

Once you have the html data, the easiest way to grab parts is using RegEx

Thanks for the hint, I will try it out, with this:

function do_reg($text, $regex)
{
    // return the first full match of $regex in $text, or "" if there is none
    if (preg_match($regex, $text, $regs)) {
        $result = $regs[0];
    } else {
        $result = "";
    }
    return $result;
}

or this:

function do_reg($text, $regex)
{
    // collect every full match of $regex in $text
    preg_match_all($regex, $text, $result, PREG_PATTERN_ORDER);
    $matches = array();
    for ($i = 0; $i < count($result[0]); $i++) {
        $matches[] = $result[0][$i];
    }
    return $matches;
}

I will try them out and see which regex fits best.

Again, many thanks for any and all help!

db1

xyph · May 24, 2011

The key is in the pattern. Something like

preg_match_all(
'%<dt>([^<]++)</dt>\s++
<dd\ class="refo">([^<]++)</dd>\s++
<dd>Ansprechpartner:\s++([^<]++)</dd>
# etc
%x', 
$subject, $result, PREG_SET_ORDER);
print_r($result);

is what you want
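
To see how a pattern of this shape behaves, it can be run against a small piece of sample markup. The sketch below is hedged accordingly: the wrapper name parse_foundations and the array keys are invented for illustration, and the dt/dd markup in $subject is only an assumption derived from the pattern itself, not verified against the live page.

```php
<?php
// Hypothetical wrapper around a pattern like the one above; the sample
// markup in $subject is an assumption about how the page is structured.
function parse_foundations(string $html): array
{
    // x modifier: whitespace and line breaks inside the pattern are ignored
    $pattern = '%<dt>([^<]++)</dt>\s++
        <dd\ class="refo">([^<]++)</dd>\s++
        <dd>Ansprechpartner:\s++([^<]++)</dd>
        %x';
    preg_match_all($pattern, $html, $matches, PREG_SET_ORDER);

    $foundations = [];
    foreach ($matches as $m) {
        $foundations[] = [
            'name'       => trim($m[1]),
            'legal_form' => trim($m[2]),
            'contact'    => trim($m[3]),
        ];
    }
    return $foundations;
}

$subject = implode("\n", [
    '<dt>Bürgerstiftung Achim</dt>',
    '<dd class="refo">rechtsfähige Stiftung des bürgerlichen Rechts</dd>',
    '<dd>Ansprechpartner: Helga Kühn</dd>',
]);
print_r(parse_foundations($subject));
```

With PREG_SET_ORDER each entry of the result holds one complete foundation, which makes it easy to turn every match into one record. If the real markup differs, only the pattern needs to change.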

devWhiz · May 25, 2011

file_get_contents() takes less code and is quicker to write for many, but I prefer cURL. When I write scripts, a big factor is how fast I can grab data from a header. My scripts are optimized for servers, and I mainly write them to automate actions on MySpace and Facebook applications. All of my scripts need to load a header fast, grab data, parse it, and put the data I need into variables; then I build new headers and send the information to the server. I'm just rambling, my bad. I don't know anything about preg_match yet, but I think I might try to learn it soon.

dilbertone · May 25, 2011

Hello CueL3SS, many thanks!

file_get_contents() takes less code and is quicker to write for many, but I prefer cURL. When I write scripts, a big factor is how fast I can grab data from a header. My scripts are optimized for servers, and I mainly write them to automate actions on MySpace and Facebook applications. All of my scripts need to load a header fast, grab data, parse it, and put the data I need into variables; then I build new headers and send the information to the server. I'm just rambling, my bad. I don't know anything about preg_match yet, but I think I might try to learn it soon.

Great to read you and your ideas!

I will try everything that is written in this thread. Greetings
