dilbertone Posted June 11, 2011

Hello dear community, good day!

I need to get all the data out of this site. The target is: www.aktive-buergerschaft.de/buergerstiftungen/unsere_leistungen/buergerstiftungsfinder

I am trying to scrape the data from that page, and I need every record it lists. I want to store the data in a MySQL database for easier retrieval. Here is an example of two entries:

Bürgerstiftung Lebensraum Aachen
rechtsfähige Stiftung des bürgerlichen Rechts
Ansprechpartner: Hubert Schramm
Alexanderstr. 69/ 71
52062 Aachen
Telefon: 0241 - 4500130
Telefax: 0241 - 4500131
Email: [email protected]
www.buergerstiftung-aachen.de
>> Weitere Details zu dieser Stiftung

Bürgerstiftung Achim
rechtsfähige Stiftung des bürgerlichen Rechts
Ansprechpartner: Helga Kühn
Rotkehlchenstr. 72
28832 Achim
Telefon: 04202-84981
Telefax: 04202-955210
Email: [email protected]
www.buergerstiftung-achim.de
>> Weitere Details zu dieser Stiftung

I also need the data that is "behind" the "Weitere Details" link. Is there any way to do this with an easy and understandable parser, one that can be understood and written by a newbie?
Well, I could do this with XPath, in PHP or Perl (with Mechanize). I started with a PHP approach. But if I run the code (see below) I get this result:

PHP Fatal error: Call to undefined function file_get_html() in /home/martin/perl/foundations/arbie_finder_de.php on line 5
martin@suse-linux:~/perl/foundations> cd foundations

It is caused by this code:

<?php
// Create DOM from URL or file
$html = file_get_html('www.aktive-buergerschaft.de/buergerstiftungen/unsere_leistungen/buergerstiftungsfinder');

// split it via body, so you only get to the contents inside body tag
$split = split('<body>', $html);
// it is usually in the top of the array but just check to be sure
$body = $split[1];
// split again with, say, <p class="divider">A</p>
$split = split('<p class="divider">A</p>', $body);
// now this should contain just the data table you want to process
$data = $split[1];

// Find all links from original html
foreach($html->find('a') as $element) {
    $link = $element->href;
    // check if this link is in our data table
    if(substr_count($data, $link) > 0) {
        // link is in our data table, follow the link
        $html = file_get_html($link);
        // do what you have to do
    }
}
?>

Some musings about my approach. The standard practice for scraping the pages would be:

1. Read the page into a string (file_get_html or whatever is being used now).
2. Split the string. This depends on the page structure. First split it via <body>, so one element of the array will contain the body, and so on until we get our target. I'm guessing the final split would be by <p class="divider">A</p>, since it has the link we described above.
3. If we wish to follow the link, just repeat the same process, but using the link.
4. Alternatively, we can search around for a PHP snippet that gets all links in a page. This is better if we have done 1 and 2 already, and we now have only the string inside the <body> tag. Much simpler that way.
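As a possible alternative to the string splitting in steps 1 to 4, here is a minimal sketch that uses PHP's bundled DOM extension (DOMDocument and DOMXPath), so it does not depend on file_get_html() at all. The sample markup, the function name extractLinks, and the div with id "content" are assumptions made up for illustration; the real page's structure may well differ.

```php
<?php
// Sketch only: the sample markup below is invented, not copied from the real site.

// Hypothetical helper: return the href of every element matched by an XPath query.
function extractLinks(string $html, string $xpathQuery = '//a[@href]'): array
{
    $doc = new DOMDocument();
    // Collect libxml warnings internally instead of printing them,
    // because real-world HTML is rarely well-formed.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $links = [];
    foreach ($xpath->query($xpathQuery) as $anchor) {
        $links[] = $anchor->getAttribute('href');
    }
    return $links;
}

$sample = '<html><body><div id="content">'
        . '<a href="/detail/aachen">Weitere Details</a>'
        . '<a href="/detail/achim">Weitere Details</a>'
        . '</div><a href="/impressum">Impressum</a></body></html>';

// Restrict the query to the data area so navigation links are skipped.
$detailLinks = extractLinks($sample, '//div[@id="content"]//a[@href]');
print_r($detailLinks);
```

Restricting the XPath query to one container replaces both the split() calls and the substr_count() check: only links inside the data area come back, and each of them could then be fetched and parsed the same way.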
Well, my question is: what can cause this error? I have no clue. It would be great if you have an idea. I look forward to your replies.

Update: Hmm, I could try this, admitting that it doesn't get any simpler than using simple_html_dom:

$records = array();
foreach($html->find('#content dl') as $contact) {
    $record = array();
    $record["name"] = $contact->find("dt", 0)->plaintext;
    foreach($contact->find("dd") as $field) {
        /* parse each $field->plaintext in order to obtain $fieldname */
        $record[$fieldname] = $field->plaintext;
    }
    $records[] = $record;
}

Well, I will try to work from here. Perhaps I should use a recent version of PHP to get the jQuery-like syntax... Hmm. Any ideas? I look forward to your replies.
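The placeholder comment in the loop above ("parse each $field->plaintext in order to obtain $fieldname") could be filled in roughly like this. It is a minimal sketch, assuming each field's text looks like "Label: value" as in the example entries (Ansprechpartner, Telefon, Telefax). The helper name parseField and the fallback key "other" are hypothetical, and the input strings are copied from the example entry rather than scraped.

```php
<?php
// Hypothetical helper: split "Label: value" into [label, value].
// Text without a recognizable label is returned under the key "other".
function parseField(string $text): array
{
    $text = trim($text);
    // Label = letters only, then a colon and at least one space.
    // Requiring the space keeps URLs such as "http://..." from being read as labels.
    if (preg_match('/^(\p{L}+):\s+(.+)$/u', $text, $m)) {
        return [strtolower($m[1]), trim($m[2])];
    }
    return ['other', $text];
}

// Stand-ins for the $field->plaintext values of one entry.
$fields = [
    'Ansprechpartner: Hubert Schramm',
    'Telefon: 0241 - 4500130',
    'Telefax: 0241 - 4500131',
    'www.buergerstiftung-aachen.de',
];

$record = ['name' => 'Bürgerstiftung Lebensraum Aachen'];
foreach ($fields as $field) {
    [$key, $value] = parseField($field);
    if ($key === 'other') {
        // Collect unlabeled lines in a list so they do not overwrite each other.
        $record['other'][] = $value;
    } else {
        $record[$key] = $value;
    }
}
print_r($record);
```

Keeping unlabeled lines (street, postcode, website) in a separate list is just one option; they could also be matched by position if the order turns out to be the same for every entry.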