dilbertone
Posted June 11, 2011

Hello dear community, good day!

I need to get all the data out of this site. The target is:

www.aktive-buergerschaft.de/buergerstiftungen/unsere_leistungen/buergerstiftungsfinder

I am trying to scrape the data from this page, and I need all the data in this link. I want to store the data in a MySQL database for easier retrieval. Here is an example of the records:

Bürgerstiftung Lebensraum Aachen
rechtsfähige Stiftung des bürgerlichen Rechts
Ansprechpartner: Hubert Schramm
Alexanderstr. 69/ 71
52062 Aachen
Telefon: 0241 - 4500130
Telefax: 0241 - 4500131
Email: info@buergerstiftung-aachen.de
www.buergerstiftung-aachen.de
>> Weitere Details zu dieser Stiftung

Bürgerstiftung Achim
rechtsfähige Stiftung des bürgerlichen Rechts
Ansprechpartner: Helga Kühn
Rotkehlchenstr. 72
28832 Achim
Telefon: 04202-84981
Telefax: 04202-955210
Email: info@buergerstiftung-achim.de
www.buergerstiftung-achim.de
>> Weitere Details zu dieser Stiftung

I need the data that is "behind" the link. Is there a way to do this with an easy and understandable parser, one that a newbie can understand and write?
Well, I could do this with XPath, in PHP or in Perl (with Mechanize). I started with a PHP approach. But if I run the code (see below), I get this result:

    PHP Fatal error: Call to undefined function file_get_html() in /home/martin/perl/foundations/arbie_finder_de.php on line 5
    martin@suse-linux:~/perl/foundations> cd foundations

It is caused by this code:

    <?php
    // Create DOM from URL or file
    $html = file_get_html('www.aktive-buergerschaft.de/buergerstiftungen/unsere_leistungen/buergerstiftungsfinder');

    // split it via body, so you only get to the contents inside body tag
    $split = split('<body>', $html);

    // it is usually in the top of the array but just check to be sure
    $body = $split[1];

    // split again with, say, <p class="divider">A</p>
    $split = split('<p class="divider">A</p>', $body);

    // now this should contain just the data table you want to process
    $data = $split[1];

    // Find all links from original html
    foreach($html->find('a') as $element) {
        $link = $element->href;

        // check if this link is in our data table
        if(substr_count($data, $link) > 0) {
            // link is in our data table, follow the link
            $html = file_get_html($link);
            // do what you have to do
        }
    }
    ?>

Some musings about my approach. The standard practice for scraping the pages would be:

1. Read the page into a string (file_get_html or whatever is being used now).
2. Split the string. This depends on the page structure. First split it via <body>, so one element of the array will contain the body, and so on until we get our target. I'm guessing the final split would be by <p class="divider">A</p>, since that part contains the links described above.
3. If we wish to follow a link, just repeat the same process, but using the link.
4. Alternatively, we can search around for a PHP snippet that gets all links in a page. This is better if we have done 1 and 2 already and now have only the string inside the <body> tag. Much simpler that way.
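A note on the fatal error and on steps 1 to 4: file_get_html() is not a PHP built-in. It is defined by the Simple HTML DOM library, so "Call to undefined function" most likely means simple_html_dom.php was never included in the script. The sketch below is one possible alternative that avoids the extra library altogether by using PHP's bundled DOM extension (DOMDocument and DOMXPath) instead of string splitting. The helper name extractLinks, the container XPath, and the sample markup are all made up for illustration; the real page structure may differ.

```php
<?php
// Hypothetical helper (not part of any library): return the href of every
// <a> element found inside the container matched by $containerXPath.
function extractLinks(string $html, string $containerXPath = '//body'): array
{
    // Collect parser warnings internally; real-world HTML is rarely valid.
    libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    $links = [];
    foreach ($xpath->query($containerXPath . '//a[@href]') as $a) {
        $links[] = $a->getAttribute('href');
    }
    // Drop duplicates and reindex from 0.
    return array_values(array_unique($links));
}

// Made-up sample markup standing in for the downloaded page.
$sample = '<html><body><div id="content">'
    . '<a href="/stiftung/aachen">Details</a>'
    . '<a href="/stiftung/achim">Details</a>'
    . '</div><div id="footer"><a href="/impressum">Impressum</a></div>'
    . '</body></html>';

// Only links inside the assumed data container are returned.
$links = extractLinks($sample, '//div[@id="content"]');
print_r($links);
```

Restricting the query to one container plays the role of the two split() calls: links outside the data area, such as the footer link in the sample, are never collected. For the live page, the HTML string could come from file_get_contents() with a full URL including the http:// scheme, assuming allow_url_fopen is enabled.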
My question is: what could cause this error? I have no clue. It would be great if you have an idea. I look forward to your replies.

Update: Hmm, I could try the following, admitting that it doesn't get any simpler than using simple_html_dom.

    $records = array();
    foreach($html->find('#content dl') as $contact) {
        $record = array();
        $record["name"] = $contact->find("dt", 0)->plaintext;
        foreach($contact->find("dd") as $field) {
            /* parse each $field->plaintext in order to obtain $fieldname */
            $record[$fieldname] = $field->plaintext;
        }
        $records[] = $record;
    }

I will try to work from here. Perhaps I will use a recent version of PHP to get the jQuery-like syntax. Any ideas? I look forward to hearing from you.
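The placeholder comment "parse each $field->plaintext in order to obtain $fieldname" could be filled in along these lines. This is only a sketch: it assumes each field arrives as one line of text, that labelled fields look like "Label: value" as in the example records, and that unlabelled lines (street, city, website) should be collected separately. The function names parseField and buildRecord and the 'other' key are invented for illustration.

```php
<?php
// Hypothetical helper: split "Label: value" into [label, value].
// Lines without a label come back as [null, text].
function parseField(string $text): array
{
    $text = trim($text);
    // The label must start with a letter and be followed by ": ", so a bare
    // URL or an address line is not mistaken for a labelled field.
    if (preg_match('/^(\p{L}[\p{L} .-]*):\s+(.+)$/u', $text, $m)) {
        return [$m[1], $m[2]];
    }
    return [null, $text];
}

// Hypothetical helper: build one record from a name and its field texts.
function buildRecord(string $name, array $fieldTexts): array
{
    $record = ['name' => $name, 'other' => []];
    foreach ($fieldTexts as $text) {
        [$label, $value] = parseField($text);
        if ($label === null) {
            $record['other'][] = $value;   // unlabelled line
        } else {
            $record[$label] = $value;      // e.g. $record['Telefon']
        }
    }
    return $record;
}

// Field texts taken from the example record in the post.
$record = buildRecord('Bürgerstiftung Achim', [
    'Ansprechpartner: Helga Kühn',
    'Rotkehlchenstr. 72',
    '28832 Achim',
    'Telefon: 04202-84981',
    'Email: info@buergerstiftung-achim.de',
    'www.buergerstiftung-achim.de',
]);
print_r($record);
```

Inside the loop over $contact->find("dd"), the field texts would be the $field->plaintext values. Keying the record by the label found in the text means the field names later map directly to column names when the records are written to MySQL.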