dilbertone
Posted June 4, 2011

Hello community,

many thanks for running this board. I love this site; it has helped me so often. You are great fellows!

Today I am working on a little PHP parser. I need to get all the data out of this site. The target is:

www.aktive-buergerschaft.de/buergerstiftungen/unsere_leistungen/buergerstiftungsfinder

I am trying to scrape the data from this page, and I also need the data behind each link on it. I want to store the data in a MySQL database for easier retrieval. Here is an example of the entries on the page:

Bürgerstiftung Lebensraum Aachen
rechtsfähige Stiftung des bürgerlichen Rechts
Ansprechpartner: Hubert Schramm
Alexanderstr. 69/ 71
52062 Aachen
Telefon: 0241 - 4500130
Telefax: 0241 - 4500131
Email: [email protected]
www.buergerstiftung-aachen.de
>> Weitere Details zu dieser Stiftung

Bürgerstiftung Achim
rechtsfähige Stiftung des bürgerlichen Rechts
Ansprechpartner: Helga Kühn
Rotkehlchenstr. 72
28832 Achim
Telefon: 04202-84981
Telefax: 04202-955210
Email: [email protected]
www.buergerstiftung-achim.de
>> Weitere Details zu dieser Stiftung

I need the data that is "behind" the ">> Weitere Details zu dieser Stiftung" ("more details about this foundation") link. Is there any way to do this with an easy and understandable parser, one that can be understood and written by a newbie?
Well, I could do this with XPath, in PHP or in Perl (with Mechanize). I started with a PHP approach, but if I run the code (see below) I get these results:

    martin@suse-linux:~> cd perl
    martin@suse-linux:~/perl> cd foundations
    martin@suse-linux:~/perl/foundations> php arbie_finder_de.php
    PHP Parse error: syntax error, unexpected '*' in /home/martin/perl/foundations/arbie_finder_de.php on line 3
    martin@suse-linux:~/perl/foundations> php arbie_finder_de.php
    PHP Parse error: syntax error, unexpected T_FOREACH in /home/martin/perl/foundations/arbie_finder_de.php on line 17
    martin@suse-linux:~/perl/foundations> ^C
    martin@suse-linux:~/perl/foundations>

The errors are caused by this code:

    <?php
    // Create DOM from URL or file
    $html = file_get_html('www.aktive-buergerschaft.de/buergerstiftungen/unsere_leistungen/buergerstiftungsfinder');

    // split it via body, so you only get to the contents inside body tag
    $split = split('<body>', $html);

    // it is usually in the top of the array but just check to be sure
    $body = $split[1];

    // split again with, say,<p class="divider">A</p>
    $split = split('<p class="divider">A</p>', $body);

    // now this should contain just the data table you want to process
    $data = $split[1]

    // Find all links from original html
    foreach($html->find('a') as $element) {
        $link = $element->href;
        // check if this link is in our data table
        if(substr_count($data, $link) > 0) {
            // link is in our data table, follow the link
            $html = file_get_html($link);
            // do what you have to do
        }
    }
    ?>

**Some musings about my approach:**

The standard practice for scraping the pages would be:

1. Read the page into a string (file_get_html or whatever is being used now).
2. Split the string. This depends on the page structure. First split it on <body>, so that one element of the array contains the body, and so on until we reach our target. I am guessing the final split would be on <p class="divider">A</p>, since that section contains the links described above.
3. If we wish to follow a link, just repeat the same process using the link.
4. Alternatively, we can search around for a PHP snippet that gets all links in a page. This is better if we have done 1 and 2 already and now have only the string inside the <body> tag. Much simpler that way.

My question is: what could be causing these errors? I have no clue. It would be great if you have an idea.

Looking forward to your replies.
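For comparison, steps 1 to 4 can be sketched with PHP's bundled DOM extension (DOMDocument and DOMXPath) instead of splitting strings. This is only a minimal sketch under stated assumptions: it swaps in DOMXPath for file_get_html (which comes from the separate Simple HTML DOM library and has to be included first), the sample markup and the helper name findDetailLinks are hypothetical, and the real page may be structured quite differently.

```php
<?php
// Hypothetical sample markup. The structure of the real page is an assumption.
$sampleHtml = <<<HTML
<html><body>
<p class="divider">A</p>
<div class="entry">
  <h3>Bürgerstiftung Lebensraum Aachen</h3>
  <a href="/stiftung/aachen">Weitere Details zu dieser Stiftung</a>
</div>
<div class="entry">
  <h3>Bürgerstiftung Achim</h3>
  <a href="/stiftung/achim">Weitere Details zu dieser Stiftung</a>
</div>
<a href="/impressum">Impressum</a>
</body></html>
HTML;

/**
 * Hypothetical helper: return the href of every link whose text contains $needle.
 */
function findDetailLinks(string $html, string $needle): array
{
    $dom = new DOMDocument();
    // Real-world HTML is rarely valid, so collect parser warnings instead of printing them.
    $previous = libxml_use_internal_errors(true);
    // The XML declaration prefix hints to libxml that the input is UTF-8.
    $dom->loadHTML('<?xml encoding="UTF-8">' . $html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($dom);
    $links = [];
    // Select every anchor that has an href, then filter by its link text.
    foreach ($xpath->query('//a[@href]') as $a) {
        if (strpos($a->textContent, $needle) !== false) {
            $links[] = $a->getAttribute('href');
        }
    }
    return $links;
}

$detailLinks = findDetailLinks($sampleHtml, 'Weitere Details');
print_r($detailLinks);
```

On the sample markup this collects only the two detail links and skips the unrelated one. To follow a link, the same function could be applied to the HTML fetched from each collected URL. As for the second parse error: an "unexpected T_FOREACH" usually means the statement before the foreach is not terminated, and in the code above the line `$data = $split[1]` has no semicolon.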