Dom Processing: code review of a little parser [10 liner]


dilbertone

Hello dear community, good day!

I need to get all the data out of this site. The target is: www.aktive-buergerschaft.de/buergerstiftungen/unsere_leistungen/buergerstiftungsfinder

I am trying to scrape the data from this webpage, and I need to get all the data behind the links on it. I want to store the data in a MySQL database for easier retrieval.

 

see an example:

 

 

    Bürgerstiftung Lebensraum Aachen

        rechtsfähige Stiftung des bürgerlichen Rechts

        Ansprechpartner: Hubert Schramm

        Alexanderstr. 69/ 71

        52062 Aachen

        Telefon: 0241 - 4500130

        Telefax: 0241 - 4500131

        Email: info@buergerstiftung-aachen.de

        www.buergerstiftung-aachen.de

        >> Weitere Details zu dieser Stiftung

   

    Bürgerstiftung Achim

        rechtsfähige Stiftung des bürgerlichen Rechts

        Ansprechpartner: Helga Kühn

        Rotkehlchenstr. 72

        28832 Achim

        Telefon: 04202-84981

        Telefax: 04202-955210

        Email: info@buergerstiftung-achim.de

        www.buergerstiftung-achim.de

        >> Weitere Details zu dieser Stiftung

 

I need the data that is "behind" each link. Is there any way to do this with an easy and understandable parser, one that can be understood and written by a newbie?

Well, I could do this with XPath, in PHP or Perl (with Mechanize).

I started with a PHP approach, but if I run the code (see below) I get this result:

 

   

 

PHP Fatal error:  Call to undefined function file_get_html() in /home/martin/perl/foundations/arbie_finder_de.php on line 5

    martin@suse-linux:~/perl/foundations> cd foundations

 

 

caused by this code here

 

 

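    <?php
    // file_get_html() is not a PHP built-in; it is defined by the Simple HTML DOM
    // library. Assumption: simple_html_dom.php lies in the same directory as this script.
    require_once __DIR__ . '/simple_html_dom.php';
    
    // Create DOM from URL or file (a URL needs its scheme, otherwise it is treated
    // as a local file path; http:// is an assumption here)
    $html = file_get_html('http://www.aktive-buergerschaft.de/buergerstiftungen/unsere_leistungen/buergerstiftungsfinder');
    if (!$html) {
        exit("could not load the page\n");
    }
    
    // split it via body, so you only get to the contents inside body tag
    // (split() was removed in PHP 7 and expects a string, hence explode() on the serialized DOM)
    $split = explode('<body>', $html->save());
    // it is usually in the top of the array but just check to be sure
    $body = isset($split[1]) ? $split[1] : $split[0];
    // split again with, say,<p class="divider">A</p>
    $split = explode('<p class="divider">A</p>', $body);
    // now this should contain just the data table you want to process
    $data = isset($split[1]) ? $split[1] : $body;
    
    // Find all links from original html
    foreach($html->find('a') as $element) {
           $link = $element->href;
    
           // check if this link is in our data table
           if($link && substr_count($data, $link) > 0) {
               // link is in our data table, follow the link
               // (relative links must be made absolute first; a new variable is used
               // so the DOM we are looping over is not overwritten)
               $detail = file_get_html($link);
              // do what you have to do
           }
    }
    ?>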

 

Well, some musings about my approach.

The standard practice for scraping the pages would be:

1. Read the page into a string (file_get_html or whatever is being used now).

2. Split the string. This depends on the page structure. First split it via <body>, so one element of the array will contain the body, and so on until we get to our target. I am guessing the final split would be by <p class="divider">A</p>, since it has the links described above.

3. If we wish to follow a link, just repeat the same process, but using the link.

4. Alternatively, we can search around for a PHP snippet that gets all links in a page. This is better if we have done 1 and 2 already, and we now have only the string inside the <body> tag. Much simpler that way.
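
The link-collecting part of steps 1 to 4 could be sketched with PHP's built-in DOMDocument and DOMXPath classes, which need no extra library. This is only a sketch under assumptions: the sample markup below is hypothetical (the real page structure may differ), and extractLinks() is a helper name made up for illustration.

```php
<?php
// Hypothetical sample markup standing in for the downloaded page.
$sample = <<<HTML
<html><body>
<p class="divider">A</p>
<dl>
  <dt>Foundation one</dt>
  <dd><a href="/detail/aachen">Weitere Details zu dieser Stiftung</a></dd>
</dl>
<dl>
  <dt>Foundation two</dt>
  <dd><a href="/detail/achim">Weitere Details zu dieser Stiftung</a></dd>
</dl>
</body></html>
HTML;

// Hypothetical helper: returns the href of every link inside <body>.
function extractLinks(string $htmlSource): array
{
    $doc = new DOMDocument();
    // Real-world HTML is rarely valid; collect parser warnings instead of printing them.
    libxml_use_internal_errors(true);
    $doc->loadHTML($htmlSource);
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    $links = [];
    foreach ($xpath->query('//body//a[@href]') as $a) {
        $links[] = $a->getAttribute('href');
    }
    return $links;
}

$links = extractLinks($sample);
print_r($links);
```

For the real page, the string passed to extractLinks() would come from something like file_get_contents() on the full URL, and relative links would still have to be turned into absolute ones before following them.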

 

 

Well, my question is: what can cause this error? I have no clue. It would be great if you have an idea. I look forward to your replies.

 

 

Update: Hmm, I could try this, admitting that it doesn't get any simpler than using simple_html_dom:

 

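    $records = array();
    foreach($html->find('#content dl') as $contact) {
       $record = array();
       $record["name"] = trim($contact->find("dt", 0)->plaintext);
       foreach($contact->find("dd") as $i => $field) {
           /* parse each $field->plaintext in order to obtain $fieldname:
              "Telefon: 0241 - 4500130" gives "Telefon"; lines without a label get "line<n>" */
           $parts = explode(':', $field->plaintext, 2);
           $fieldname = count($parts) === 2 ? trim($parts[0]) : "line$i";
           $record[$fieldname] = trim(end($parts));
       }
       $records[] = $record;
    }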
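
Since the goal is to store the records in a database, here is a minimal sketch of that step with PDO and a prepared statement. Everything schema-related is an assumption: the table name foundations, its columns name and details, and the helper saveRecords() are made up for illustration. The demo uses an in-memory SQLite database only so that it runs without a server; for MySQL the DSN and the CREATE TABLE statement would have to be adapted.

```php
<?php
// Hypothetical helper and table layout: foundations(name, details).
function saveRecords(PDO $pdo, array $records): int
{
    $stmt = $pdo->prepare('INSERT INTO foundations (name, details) VALUES (:name, :details)');
    $saved = 0;
    foreach ($records as $record) {
        $name = $record['name'] ?? '';
        unset($record['name']);
        // The remaining fields vary per entry, so they are kept as one JSON string here.
        $stmt->execute([
            ':name'    => $name,
            ':details' => json_encode($record, JSON_UNESCAPED_UNICODE),
        ]);
        $saved++;
    }
    return $saved;
}

// Demo with in-memory SQLite. A MySQL DSN would look like
// 'mysql:host=localhost;dbname=foundations;charset=utf8mb4' (hypothetical names).
$pdo = new PDO('sqlite::memory:');
$pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
$pdo->exec('CREATE TABLE foundations (id INTEGER PRIMARY KEY, name TEXT, details TEXT)');

$records = [
    [
        'name'    => 'Bürgerstiftung Achim',
        'Telefon' => '04202-84981',
        'Email'   => 'info@buergerstiftung-achim.de',
    ],
];
$count = saveRecords($pdo, $records);
$row   = $pdo->query('SELECT name, details FROM foundations')->fetch(PDO::FETCH_ASSOC);
```

Keeping the variable fields in one JSON column avoids guessing a fixed set of columns up front; separate columns for phone, fax and email would be the alternative once the field names are known to be stable.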

 

Well, I will try to work from here. Perhaps I need a more recent version of PHP to get the jQuery-like syntax... hmmm...

 

Any ideas? I look forward to your replies.

 

 

 

 

 

 

 

 
