Jump to content

Dom Processing: code review of a little parser


dilbertone

Recommended Posts

 

hello community

 

many many thanks for running this board. I love this site. It has helped me so often!  You are great fellows. What i do today is workin on a little php-parser!

 

I need to get all the data out of this site.See the target: www.aktive-buergerschaft.de/buergerstiftungen/unsere_leistungen/buergerstiftungsfinder

I am trying to scrape the datas from a webpage, but I get need to get all the data in this link.

I want to store the data in a Mysql-db for the sake of a better retrieval!

 

see an example:

 

I need to get all the data out of this site.

 

see the target: www.aktive-buergerschaft.de/buergerstiftungen/unsere_leistungen/buergerstiftungsfinder

I am trying to scrape the datas from a webpage, but I get need to get all the data in this link.

 

 

see an example:

 

    Bürgerstiftung Lebensraum Aachen

        rechtsfähige Stiftung des bürgerlichen Rechts

        Ansprechpartner: Hubert Schramm

        Alexanderstr. 69/ 71

        52062 Aachen

        Telefon: 0241 - 4500130

        Telefax: 0241 - 4500131

        Email: [email protected]

        www.buergerstiftung-aachen.de

        >> Weitere Details zu dieser Stiftung

   

    Bürgerstiftung Achim

        rechtsfähige Stiftung des bürgerlichen Rechts

        Ansprechpartner: Helga Kühn

        Rotkehlchenstr. 72

        28832 Achim

        Telefon: 04202-84981

        Telefax: 04202-955210

        Email: [email protected]

        www.buergerstiftung-achim.de

        >> Weitere Details zu dieser Stiftung

 

I need to have the data that are "behind" the link - is there any way to do this

with a easy and understandable parser - one that can be understood and written by a newbie!?

well  i could do this with XPahts - in PHP or Perl - (with mechanize)

 

i started with an php-approach: But -if i run the code (see below) i get this results

 

    martin@suse-linux:~> cd perl
    martin@suse-linux:~/perl> cd foundations
    martin@suse-linux:~/perl/foundations> php arbie_finder_de.php
    PHP Parse error:  syntax error, unexpected '*' in /home/martin/perl/foundations/arbie_finder_de.php on line 3
    martin@suse-linux:~/perl/foundations> php arbie_finder_de.php
    PHP Parse error:  syntax error, unexpected T_FOREACH in /home/martin/perl/foundations/arbie_finder_de.php on line 17
    martin@suse-linux:~/perl/foundations> ^C
    martin@suse-linux:~/perl/foundations>

 

caused by this code here

    <?php
    
    // Create DOM from URL or file
    $html = file_get_html('www.aktive-buergerschaft.de/buergerstiftungen/unsere_leistungen/buergerstiftungsfinder');
    
    // split it via body, so you only get to the contents inside body tag
    $split = split('<body>', $html);
    // it is usually in the top of the array but just check to be sure
    $body = $split[1];
    // split again with, say,<p class="divider">A</p>
    $split = split('<p class="divider">A</p>', $body);
    // now this should contain just the data table you want to process
    $data = $split[1]
    
    // Find all links from original html
    foreach($html->find('a') as $element) {
           $link = $element->href;
           // check if this link is in our data table
           if(substr_count($data, $link) > 0) {
               // link is in our data table, follow the link
               $html = file_get_html($link);
              // do what you have to do
           }
    }
    
    
    ?>

 

 

**well some musings about my approach:**

 

 

the standard practice for scrapping the pages would be:

 

1. read the page into a string (file_get_html or whatever is being used now)

2. split the string, This depends on the page structure. First split it via <body>, so one element of the array will contain the body, and so on until we get our target. Well I'm guessing the final split would be by

 

<p class="divider">A</p>

, since it has the link we described above:

 

3. If we wish to follow the link, just repeat the same process, but using the link.

4. Alternatively, we can search around for a PHP snippet that gets all links in a page. This is better if we have done 1 and 2 already, and we now have only the string inside the <body> tag. Much simpler that way.

 

 

Well - my question is: what can this errors cause - i have no glue...would be great if you have an idea look forward

 

 

Archived

This topic is now archived and is closed to further replies.

×
×
  • Create New...

Important Information

We have placed cookies on your device to help make this website better. You can adjust your cookie settings, otherwise we'll assume you're okay to continue.