DOM Processing: code review of a little parser


dilbertone


Hello community,

 

Many thanks for running this board. I love this site; it has helped me so often, and you are great fellows. Today I am working on a little PHP parser.

 

I need to get all the data out of this site. The target is: www.aktive-buergerschaft.de/buergerstiftungen/unsere_leistungen/buergerstiftungsfinder

I am trying to scrape the data from this web page, and I need all the data that is reachable from this link.

I want to store the data in a MySQL database for easier retrieval.

 

Here is an example of the entries:

 

    Bürgerstiftung Lebensraum Aachen

        rechtsfähige Stiftung des bürgerlichen Rechts

        Ansprechpartner: Hubert Schramm

        Alexanderstr. 69/ 71

        52062 Aachen

        Telefon: 0241 - 4500130

        Telefax: 0241 - 4500131

        Email: info@buergerstiftung-aachen.de

        www.buergerstiftung-aachen.de

        >> Weitere Details zu dieser Stiftung

   

    Bürgerstiftung Achim

        rechtsfähige Stiftung des bürgerlichen Rechts

        Ansprechpartner: Helga Kühn

        Rotkehlchenstr. 72

        28832 Achim

        Telefon: 04202-84981

        Telefax: 04202-955210

        Email: info@buergerstiftung-achim.de

        www.buergerstiftung-achim.de

        >> Weitere Details zu dieser Stiftung
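
A minimal sketch of how one such entry might be turned into named fields before it goes into a database. It assumes each entry has already been reduced to plain-text lines like the ones above, which is an assumption about the page and not something verified against it. The function name `parse_entry` and the array keys (`contact`, `phone`, `zip` and so on) are made up for illustration.

```php
<?php
// Parse one entry, given as plain-text lines, into an associative array.
// parse_entry() and its field names are hypothetical, chosen for this sketch.
function parse_entry(array $lines)
{
    // The first line is taken to be the name of the foundation.
    $entry = array('name' => trim(array_shift($lines)), 'other' => array());

    // Labels as they appear in the sample data, mapped to field names.
    $labels = array(
        'Ansprechpartner' => 'contact',
        'Telefon'         => 'phone',
        'Telefax'         => 'fax',
        'Email'           => 'email',
    );

    foreach ($lines as $line) {
        $line = trim($line);
        if ($line === '' || strpos($line, '>>') === 0) {
            continue; // skip blank lines and the ">> Weitere Details" link text
        }
        if (preg_match('/^([^:]+):\s*(.+)$/u', $line, $m) && isset($labels[$m[1]])) {
            $entry[$labels[$m[1]]] = $m[2];   // "Label: value" line
        } elseif (preg_match('/^(\d{5})\s+(.+)$/u', $line, $m)) {
            $entry['zip']  = $m[1];           // five-digit postal code
            $entry['city'] = $m[2];
        } elseif (strpos($line, 'www.') === 0) {
            $entry['website'] = $line;
        } else {
            $entry['other'][] = $line;        // legal form, street, anything else
        }
    }
    return $entry;
}

$entry = parse_entry(array(
    'Bürgerstiftung Achim',
    'rechtsfähige Stiftung des bürgerlichen Rechts',
    'Ansprechpartner: Helga Kühn',
    'Rotkehlchenstr. 72',
    '28832 Achim',
    'Telefon: 04202-84981',
    'Telefax: 04202-955210',
    'Email: info@buergerstiftung-achim.de',
    'www.buergerstiftung-achim.de',
    '>> Weitere Details zu dieser Stiftung',
));
print_r($entry);
```

Lines that match no rule (legal form, street) are collected under `other` rather than guessed at, so nothing is lost if the real page has more variations than these two samples show.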

 

I need the data that is "behind" each link. Is there any way to do this

with an easy and understandable parser, one that a newbie can understand and write?

I could do this with XPath, in PHP or in Perl (with Mechanize).

 

I started with a PHP approach, but when I run the code (see below) I get these results:

 

    martin@suse-linux:~> cd perl
    martin@suse-linux:~/perl> cd foundations
    martin@suse-linux:~/perl/foundations> php arbie_finder_de.php
    PHP Parse error:  syntax error, unexpected '*' in /home/martin/perl/foundations/arbie_finder_de.php on line 3
    martin@suse-linux:~/perl/foundations> php arbie_finder_de.php
    PHP Parse error:  syntax error, unexpected T_FOREACH in /home/martin/perl/foundations/arbie_finder_de.php on line 17
    martin@suse-linux:~/perl/foundations> ^C
    martin@suse-linux:~/perl/foundations>

 

These errors are caused by the following code:

    <?php
    
    // Create DOM from URL or file
    $html = file_get_html('www.aktive-buergerschaft.de/buergerstiftungen/unsere_leistungen/buergerstiftungsfinder');
    
    // split it via body, so you only get to the contents inside body tag
    $split = split('<body>', $html);
    // it is usually in the top of the array but just check to be sure
    $body = $split[1];
    // split again with, say,<p class="divider">A</p>
    $split = split('<p class="divider">A</p>', $body);
    // now this should contain just the data table you want to process
    $data = $split[1]
    
    // Find all links from original html
    foreach($html->find('a') as $element) {
           $link = $element->href;
           // check if this link is in our data table
           if(substr_count($data, $link) > 0) {
               // link is in our data table, follow the link
               $html = file_get_html($link);
              // do what you have to do
           }
    }
    
    
    ?>
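
For comparison, a hedged sketch of the same flow that parses cleanly. PHP reports "unexpected T_FOREACH" when the statement before `foreach` is not terminated, and in the listing above `$data = $split[1]` has no semicolon. The first error mentions a `*` on line 3, which the listing does not contain, so it presumably came from an earlier version of the file. Two further points: `split()` was deprecated in PHP 5.3 and removed in PHP 7 (`explode()` does a literal split), and `file_get_html()` is not built in; it comes from the Simple HTML DOM library, which has to be included first. The sketch swaps that library for PHP's built-in `DOMDocument`, and the sample markup in `$page` is made up so that it runs without network access.

```php
<?php
// Hypothetical sample markup standing in for the real page.
$page = '<html><body><p>intro</p><p class="divider">A</p>'
      . '<a href="/stiftung/aachen">Details</a>'
      . '<a href="/stiftung/achim">Details</a></body></html>';

// explode() splits on a literal string; keep everything after the divider.
$parts = explode('<p class="divider">A</p>', $page, 2);
$data  = isset($parts[1]) ? $parts[1] : '';   // note the terminating semicolon

// Collect the href of every <a> in that fragment with the built-in DOM extension.
$links = array();
if ($data !== '') {
    $dom = new DOMDocument();
    libxml_use_internal_errors(true);   // the fragment is not a valid document
    $dom->loadHTML($data);
    libxml_clear_errors();
    foreach ($dom->getElementsByTagName('a') as $a) {
        $links[] = $a->getAttribute('href');
    }
}
print_r($links);
```

Because the links are read from the fragment after the divider, the `substr_count()` check from the original loop is no longer needed.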

 

 

**Some musings about my approach:**

 

 

The standard practice for scraping the pages would be:

 

1. Read the page into a string (with file_get_html or whatever is being used now).

2. Split the string. How to split depends on the page structure: first split on <body>, so that one element of the array contains the body, and continue until we reach our target. I am guessing the final split would be on <p class="divider">A</p>, since that section contains the links described above.

3. If we wish to follow a link, just repeat the same process using the link's URL.

4. Alternatively, we can search for a PHP snippet that gets all links in a page. This works better once steps 1 and 2 are done and we have only the string inside the <body> tag. It is much simpler that way.
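
The steps above could be sketched roughly as follows, using XPath through PHP's built-in `DOMDocument` and `DOMXPath` instead of string splitting. The function names `extract_links`, `resolve_link` and `crawl` are hypothetical, and so are the sample markup and the fake fetcher, which stand in for the real site so the sketch runs offline.

```php
<?php
// Step 4: return the href of every <a> matched by an XPath expression.
function extract_links($html, $query = '//a[@href]')
{
    $dom = new DOMDocument();
    libxml_use_internal_errors(true);   // real-world HTML is rarely valid
    $dom->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($dom);
    $links = array();
    foreach ($xpath->query($query) as $a) {
        $links[] = $a->getAttribute('href');
    }
    return $links;
}

// Deliberately minimal resolver: absolute URLs pass through unchanged,
// anything else is treated as a path from the site root.
function resolve_link($base, $href)
{
    if (preg_match('#^https?://#i', $href)) {
        return $href;
    }
    $p = parse_url($base);
    return $p['scheme'] . '://' . $p['host'] . '/' . ltrim($href, '/');
}

// Steps 1 to 3: take the listing page, then fetch every page it links to.
// $fetch is any callable that returns the HTML for a URL, for example
// 'file_get_contents' when allow_url_fopen is enabled.
function crawl($base, $listingHtml, $fetch)
{
    $pages = array();
    foreach (extract_links($listingHtml) as $href) {
        $url = resolve_link($base, $href);
        $pages[$url] = call_user_func($fetch, $url);
    }
    return $pages;
}

// Offline demonstration with made-up markup and a fake fetcher.
$listing = '<html><body><a href="/detail/1">one</a>'
         . '<a href="http://example.org/x">two</a></body></html>';
$fake = function ($url) {
    return "<html><body>page for $url</body></html>";
};
$pages = crawl('http://www.example.com/list', $listing, $fake);
print_r(array_keys($pages));
```

Passing the fetcher in as a callable keeps the parsing testable without touching the network. A narrower XPath expression as the second argument of `extract_links` would take the place of the repeated splitting in step 2, once the real page structure is known.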

 

 

My question is: what could cause these errors? I have no clue. It would be great if you have an idea. I look forward to your replies.

 

 
