dilbertone

Members
  • Posts

    122
Posts posted by dilbertone

  1.  

    Hello dear Community, hello dear Andy

     

    I want to parse a site called the Foundation Finder. My Perl knowledge is pretty small!

    I have tried various tutorials (examples for Mechanize that I found on CPAN). Not all of them work; some of them are broken!

    Now I want to try a real-world task!

    The Foundation Finder task has several steps. Especially interesting for me as a PHP/Perl beginner is this site in Switzerland: http://www.edi.admin.ch/esv/00475/00698/index.html?lang=de&webgrab_path=http://esv2000.edi.admin.ch/d/entry.asp?Id=3221

     

     

    It has a dataset of 2700 foundations. All the data are free to use, with no copyright limitations on them.

    I have been thinking about a starting point: could I use a Perl module from CPAN and do the job with Perl? I guess that Mechanize or LWP could do a great job, or HTML::Parser. I am just wondering which is the best way to do the job. I guess I am at the start of a nice learning curve. This task will give me some nice PHP or Perl lessons.

    Or can we do this with Python as well? I guess so! So here I am!

     

    So here is a sample page for the real-world task, a governmental site in Switzerland with more than 2'700 foundations:

    http://www.edi.admin.ch/esv/00475/00698/index.html?lang=de&webgrab_path=http://esv2000.edi.admin.ch/d/entry.asp?Id=3221

    Can I do this with Mechanize?

    I would love to get a hint.

    Thanks, matz

  2. hello dear community,

     

    I am currently working on an approach to parse some sites that contain data on foundations in Switzerland, with details like goals, contact e-mail and the like.

    See http://www.foundationfinder.ch/ which has a dataset of 790 foundations. All the data are free to use, with no copyright limitations on them.

    I have tried it with PHP Simple HTML DOM Parser, but I have seen that it is difficult to get all the data needed to get it up and running.

    Who wants to jump in and help create this scraper/parser? I would love to hear from you.

    Please help me get up to speed with this approach.

     

     

    regards

    Dilbertone

     

  3. howdy myarro

     

    interesting thing -

     

    I am using file_get_contents to grab HTML pages. Seems to work fine as a part of a 1-shot function. But as soon as I include the function as part of a loop, it doesn't return anything...

     

    for ($i = 0; $i < 10; $i++) {
      ...
      getFile($urlArray[$i]);
      ...
    }

    What's the deal?

     

    Well, I am eager to see what the full code looks like. I am currently working on the same thing...

    Looking forward to hearing from you.

    cheers
    db1

    BTW, have you ever tried to do it with cURL?

  4. good evening dear community! Howdy,

     

     

    At the moment I am debugging some lines of code...

    Purpose: I want to process multiple web pages, much as a web spider/crawler might. I have some bits, but now I need improved spider logic. See the target URL http://192.68.214.70/km/asps/schulsuche.asp?q=e&a=50

    This page has more than 6000 results! How do I get all of them? I use the module LWP::Simple and I need better arguments that I can use to get all 6150 records.

     

    Attempt: Here are the first 5 page URLs:

    http://192.68.214.70/km/asps/schulsuche.asp?q=e&a=50&s=0 
    http://192.68.214.70/km/asps/schulsuche.asp?q=e&a=50&s=50 
    http://192.68.214.70/km/asps/schulsuche.asp?q=e&a=50&s=100 
    http://192.68.214.70/km/asps/schulsuche.asp?q=e&a=50&s=150 
    http://192.68.214.70/km/asps/schulsuche.asp?q=e&a=50&s=200
    

    We can see that the "s" parameter in the URL starts at 0 for page 1, then increases by 50 for each page thereafter. We can use this information to create a loop:

     

     

     

    
    #!/usr/bin/perl  
    use warnings;  
    use strict;  
    use LWP::Simple;  
    use HTML::TableExtract;  
    use Text::CSV;  
    
    my @cols = qw(  
        rownum  
        number  
        name  
        phone  
        type  
        website  
    );  
      
    my @fields = qw(  
        rownum  
        number  
        name  
        street  
        postal  
        town  
        phone  
        fax  
        type  
        website  
    );  
      
    my $i_first = "0";   
    my $i_last = "6100";   
    my $i_interval = "50";   
       
    for (my $i = $i_first; $i <= $i_last; $i += $i_interval) {   
    my $html = get("http://192.68.214.70/km/asps/schulsuche.asp?q=e&a=50&s=$i");   
    $html =~ tr/r//d;     # strip the carriage returns  
    $html =~ s/ / /g; # expand the spaces  
      
    my $te = new HTML::TableExtract();  
    $te->parse($html);  
      
    my $csv = Text::CSV->new({ binary => 1 });  
      
    foreach my $ts ($te->table_states) {  
    	foreach my $row ($ts->rows) {  
    			#trim leading/trailing whitespace from base fields  
    		s/^s+//, s/\s+$// for @$row;  
    
    		#load the fields into the hash using a "hash slice"  
    		my %h;  
    		@h{@cols} = @$row;  
      
    		#derive some fields from base fields, again using a hash slice  
    		@h{qw/name street postal town/} = split /n+/, $h{name};  
    		@h{qw/phone fax/} = split /n+/, $h{phone};  
      
    		#trim leading/trailing whitespace from derived fields  
    		s/^s+//, s/\s+$// for @h{qw/name street postal town/};  
      
    		$csv->combine(@h{@fields});  
    		print $csv->string, "\n";  
    	}  
    } 
    }
    
    
    

     

     

    I tested the code and got the following results; see below for the error messages shown on the command line.

    BTW, here are lines 57 and 58:

    	#trim leading/trailing whitespace from derived fields  
    		s/^s+//, s/\s+$// for @h{qw/name street postal town/};  

     

    what do you think?
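
    One possible reading of the garbled output in this post, sketched in Python rather than Perl: if the backslashes in `tr/\r//d` and `split /\n+/` were lost (an assumption, since the listing shows `tr/r//d` and `split /n+/`), the script deletes the letter "r" and splits on the letter "n". Applied to a header cell such as "Schulname / Straße / PLZ / Ort" (the sample string below is made up for illustration), that yields pieces like "Schul" and "ame ... Staße ... Ot", and it leaves the street, postal and town slots empty, which would fit the "uninitialized value" warnings.

```python
import re

# One header cell as the table might deliver it: four labels separated by newlines.
cell = "Schulname\nStraße\nPLZ\nOrt"


def as_posted(text):
    """Mimic tr/r//d and split /n+/ exactly as they appear in the listing."""
    without_r = text.replace("r", "")    # deletes the letter "r"
    return re.split(r"n+", without_r)    # splits on the letter "n"


def with_backslashes(text):
    """Mimic tr/\\r//d and split /\\n+/, the assumed intent."""
    without_cr = text.replace("\r", "")  # deletes carriage returns
    return re.split(r"\n+", without_cr)  # splits on newlines


print(as_posted(cell))         # ['Schul', 'ame\nStaße\nPLZ\nOt']
print(with_backslashes(cell))  # ['Schulname', 'Straße', 'PLZ', 'Ort']
```

    With only two pieces coming back from the split, a four-element hash slice has nothing to put into its last two slots, so any later substitution on those slots works on undefined values.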

     

     

     

    Staße
        PLZ
        Ot",,,Telefo,Fax,Schulat,Webseite
    Use of uninitialized value $_ in substitution (s///) at bavaria_all_guru.pl line 58.
    Use of uninitialized value $_ in substitution (s///) at bavaria_all_guru.pl line 58.
    Use of uninitialized value $_ in substitution (s///) at bavaria_all_guru.pl line 58.
    Use of uninitialized value $_ in substitution (s///) at bavaria_all_guru.pl line 58.
    "lfd. N.",Schul-numme,Schul,"ame
        Staße
        PLZ
        Ot",,,Telefo,Fax,Schulat,Webseite
    Use of uninitialized value $_ in substitution (s///) at bavaria_all_guru.pl line 58.
    Use of uninitialized value $_ in substitution (s///) at bavaria_all_guru.pl line 58.
    Use of uninitialized value $_ in substitution (s///) at bavaria_all_guru.pl line 58.
    Use of uninitialized value $_ in substitution (s///) at bavaria_all_guru.pl line 58.
    "lfd. N.",Schul-numme,Schul,"ame
        Staße
        PLZ
        Ot",,,Telefo,Fax,Schulat,Webseite
    Use of uninitialized value $_ in substitution (s///) at bavaria_all_guru.pl line 58.
    Use of uninitialized value $_ in substitution (s///) at bavaria_all_guru.pl line 58.
    Use of uninitialized value $_ in substitution (s///) at bavaria_all_guru.pl line 58.
    Use of uninitialized value $_ in substitution (s///) at bavaria_all_guru.pl line 58.
    "lfd. N.",Schul-numme,Schul,"ame
        Staße
        PLZ
        Ot",,,Telefo,Fax,Schulat,Webseite
    Use of uninitialized value $_ in substitution (s///) at bavaria_all_guru.pl line 58.
    Use of uninitialized value $_ in substitution (s///) at bavaria_all_guru.pl line 58.
    Use of uninitialized value $_ in substitution (s///) at bavaria_all_guru.pl line 58.
    Use of uninitialized value $_ in substitution (s///) at bavaria_all_guru.pl line 58.
    "lfd. N.",Schul-numme,Schul,"ame

  5.  

     

    Good evening, here I am back again!

    I ran into trouble... I guess that I made some mistakes while applying some code in the above-mentioned script.

     

    #!/usr/bin/perl
    use warnings;
    use strict;
    use LWP::Simple;
    use HTML::TableExtract;
    use Text::CSV;
    
    my $i_first = "0"; 
    my $i_last = "6100"; 
    my $i_interval = "50"; 
    
    for (my $i = $i_first; $i <= $i_last; $i += $i_interval) { 
         my $pageurl = "http://192.68.214.70/km/asps/schulsuche.asp?q=e&a=50&s=$i"; 
         #process pageurl 
    }
    
    my $html= get 'http://192.68.214.70/km/asps/schulsuche.asp?q=n&a=50';
    $html =~ tr/r//d;     # strip the carriage returns
    $html =~ s/ / /g; # expand the spaces
    
    my $te = new HTML::TableExtract();
    $te->parse($html);
    
    my @cols = qw(
        rownum
        number
        name
        phone
        type
        website
    );
    
    my @fields = qw(
        rownum
        number
        name
        street
        postal
        town
        phone
        fax
        type
        website
    );
    
    my $csv = Text::CSV->new({ binary => 1 });
    
    foreach my $ts ($te->table_states) {
        foreach my $row ($ts->rows) {
    
    trim leading/trailing whitespace from base fields
            s/^s+//, s/\s+$// for @$row;
    
    load the fields into the hash using a "hash slice"
            my %h;
            @h{@cols} = @$row;
    
    derive some fields from base fields, again using a hash slice
            @h{qw/name street postal town/} = split /n+/, $h{name};
            @h{qw/phone fax/} = split /n+/, $h{phone};
    
    trim leading/trailing whitespace from derived fields
            s/^s+//, s/\s+$// for @h{qw/name street postal town/};
    
            $csv->combine(@h{@fields});
            print $csv->string, "\n";
        }
    } 
    

     

     

    There have been some issues; I have made a mistake.

    I guess that the error is here:

     

    for (my $i = $i_first; $i <= $i_last; $i += $i_interval) { 
         my $pageurl = "http://192.68.214.70/km/asps/schulsuche.asp?q=e&a=50&s=$i"; 
         #process pageurl 
    }
    
    my $html= get 'http://192.68.214.70/km/asps/schulsuche.asp?q=n&a=50';
    $html =~ tr/r//d;     # strip the carriage returns
    $html =~ s/ / /g; # expand the spaces
    

     

    I have written down some kind of duplicate code. I need to leave out one part, this one here. What do you think about this?

     

    my $html= get 'http://192.68.214.70/km/asps/schulsuche.asp?q=n&a=50';
    $html =~ tr/r//d;     # strip the carriage returns
    $html =~ s/ / /g; # expand the spaces
    

     

    I get these kinds of errors. It looks very nasty!

     

    martin@suse-linux:~> cd perl
    martin@suse-linux:~/perl> perl bavaria_all_.pl
    Possible unintended interpolation of %h in string at bavaria_all_.pl line 52.
    Possible unintended interpolation of %h in string at bavaria_all_.pl line 52.
    Global symbol "%h" requires explicit package name at bavaria_all_.pl line 52.
    Global symbol "%h" requires explicit package name at bavaria_all_.pl line 52.
    syntax error at bavaria_all_.pl line 59, near "/,"
    Global symbol "%h" requires explicit package name at bavaria_all_.pl line 59.
    Global symbol "%h" requires explicit package name at bavaria_all_.pl line 60.
    Global symbol "%h" requires explicit package name at bavaria_all_.pl line 60.
    Substitution replacement not terminated at bavaria_all_.pl line 63.
    martin@suse-linux:~/perl> 

     

     

    what do you think!?
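
    A rough sketch of the loop structure, written in Python for brevity: the download and the processing both sit inside the loop, so the separate single fetch after the loop is no longer needed. The names `page_urls`, `crawl`, `fetch` and `handle` are made up for this sketch; the URL pattern and the 0 to 6100 range in steps of 50 come from the script in this post. Separately, the compile errors listed above may stem from the four description lines ("trim leading/trailing whitespace ..." and so on) having lost their leading `#`, in which case Perl tries to run them as code.

```python
from typing import Callable, Iterator, List, Optional

BASE_URL = "http://192.68.214.70/km/asps/schulsuche.asp?q=e&a=50"


def page_urls(first: int = 0, last: int = 6100, step: int = 50) -> Iterator[str]:
    """Yield one result-page URL per offset: s=0, 50, ..., 6100."""
    for offset in range(first, last + 1, step):
        yield f"{BASE_URL}&s={offset}"


def crawl(fetch: Callable[[str], Optional[str]],
          handle: Callable[[str], None]) -> int:
    """Fetch and process every page inside the loop; return pages handled."""
    handled = 0
    for url in page_urls():
        html = fetch(url)   # the download happens once per page
        if html is None:    # skip pages that could not be fetched
            continue
        handle(html)        # table extraction and CSV output would go here
        handled += 1
    return handled


# Offline demonstration with a stand-in fetcher (hypothetical, no network access).
seen: List[str] = []
pages = crawl(fetch=lambda url: f"<html>{url}</html>", handle=seen.append)
print(pages)  # 123
```

    Passing the fetcher in as a parameter is only a convenience for trying the loop without touching the real server; in a real run it would be replaced by an actual HTTP download.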

     

    I look forward to hearing from you!

     

  6. good day, hello dear community!

     

     

    I am currently ironing out a little parser script. I have some bits, but now I need improved spider logic. See the target URL http://192.68.214.70/km/asps/schulsuche.asp?q=e&a=50

    This page has more than 6000 results!

    How do I get all the results?

    I tried out several things, but they did not help. I always got bad results.
    I have good CSV data, but unfortunately no spider logic. I need some bits to get there! How do I get there?

    I use the module LWP::Simple and I need better arguments that I can use to get all 6150 records.

     

    
    
    #!/usr/bin/perl
    use warnings;
    use strict;
    use LWP::Simple;
    use HTML::TableExtract;
    use Text::CSV;
    
    my $html= get 'http://192.68.214.70/km/asps/schulsuche.asp?q=n&a=50';
    $html =~ tr/\r//d;        # strip the carriage returns
    $html =~ s/&nbsp;/ /g;    # expand the non-breaking spaces
    
    my $te = HTML::TableExtract->new;
    $te->parse($html);
    
    my @cols = qw(
        rownum
        number
        name
        phone
        type
        website
    );
    
    my @fields = qw(
        rownum
        number
        name
        street
        postal
        town
        phone
        fax
        type
        website
    );
    
    my $csv = Text::CSV->new({ binary => 1 });
    
    foreach my $ts ($te->table_states) {
        foreach my $row ($ts->rows) {
    
            # trim leading/trailing whitespace from base fields
            s/^\s+//, s/\s+$// for grep { defined } @$row;

            # load the fields into the hash using a "hash slice"
            my %h;
            @h{@cols} = @$row;

            # derive some fields from base fields, again using a hash slice
            @h{qw/name street postal town/} = split /\n+/, $h{name} // '';
            @h{qw/phone fax/} = split /\n+/, $h{phone} // '';

            # trim leading/trailing whitespace from derived fields
            s/^\s+//, s/\s+$// for grep { defined } @h{qw/name street postal town/};
    
            $csv->combine(@h{@fields});
            print $csv->string, "\n";
        }
    } 
    

    With this I have good CSV output, but unfortunately no spider logic.

    How do I add the spider logic here?

    I need some help.

    I would love to hear from you.

     

     

     

     

  7. hi dear Abracadaver,

     

    Many, many thanks. I am very happy to hear from you.

    Why not try a Perl board?

    I am pretty sure that this can be done in PHP as well, and CSV-formatted output is also well known in the PHP field. But the best argument is that I am a big fan of this site.

    And yes, you have helped me for years and years. Your code is a lifesaver! ;)
    I know you from AutoTheme, and I have been a user of your site from the very beginning in 2003.

    So I would be glad if you could help me here.

  8. hello good day dear community,

     

     

    I like this place. It is a great place for sharing ideas and knowledge! But by far the most impressive thing I have learned is that this community is so supportive. I am overwhelmed by this experience. This forum has so many great folks.

    I have a little parser that parses a site with 6150 records, but I need the output in CSV format. First of all, here is the target site: http://192.68.214.70/km/asps/schulsuche.asp?q=a&a=50&s=1750

    I need all the data, separated into the following fields:

        number
        school number
        school name
        address
        street
        postal code
        phone
        fax
        school type
        website

    BTW, see the target site http://192.68.214.70/km/asps/schulsuche.asp?q=a&a=50&s=1750 and compare!

    Well, I have a script. I am very interested in what you think about it. Not all the fields are captured yet; I need more of them!

     

        #!/usr/bin/perl
        use strict;
        use HTML::TableExtract;
        use LWP::Simple;
        use Cwd;
        use POSIX qw(strftime);
        
        my $total_records = 0;
        my $alpha = "x";
        my $results = 50;
        my $range = 0;
        my $url_to_process = "http://192.68.214.70/km/asps/schulsuche.asp?q=";
        my $processdir = "processing";
        my $counter = 50;
        my $percent = 0;
        
        workDir();
        chdir $processdir;
        processURL();
        print "\nPress <enter> to continue\n";
        <>;
        my $displaydate = strftime('%Y%m%d%H%M%S', localtime);
        open my $outfile, '>', "webdata_for_$alpha\_$displaydate.txt" or die 'Unable to create file';
        processData();
        close $outfile;
        print "Finished processing $total_records records...\n";
        print "Processed data saved to $ENV{HOME}/$processdir/webdata_for_$alpha\_$displaydate.txt\n";
        unlink 'processing.html';
        
        sub processURL() {
        print "\nProcessing $url_to_process$alpha&a=$results&s=$range\n";
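        # getstore returns the HTTP status code, which is true even for errors, so test it explicitly
        is_success(getstore("$url_to_process$alpha&a=$results&s=$range", 'tempfile.html')) or die 'Unable to get page';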
        
           # read the saved page and pull the total hit count out of the "Treffer" line
           open my $fh, '<', 'tempfile.html' or die "Unable to read tempfile.html: $!";
           while ( my $line = <$fh> ) {
              if ( $line =~ /Treffer <b>(\d+) - (\d+)<\/b> \w+ \w+ <b>(\d+)/ ) {
                 $total_records = $3;
                 print "Total records to process is $total_records\n";
              }
           }
           close $fh;
           unlink 'tempfile.html';
        }
        
        sub processData() {
           while ( $range <= $total_records) {
              my $te = HTML::TableExtract->new(headers => [qw(lfd Schul Schulname Telefon Schulart Webseite)]);
              is_success(getstore("$url_to_process$alpha&a=$results&s=$range", 'processing.html')) or die 'Unable to get page';
              $te->parse_file('processing.html');
              my ($table) = $te->tables;
              foreach my $ts ($te->table_states) {
                 foreach my $row ($ts->rows) {
                    cleanup(@$row);
        	    # Add a table column delimiter in this case ||
                    print $outfile join("||", @$row)."\n";
                    }
                 }
              $| = 1;  
              print "Processed records $range to $counter";
              print "\r";
              $counter = $counter + 50;
              $range = $range + 50;
           }
        }
        
        sub cleanup() {
           for ( @_ ) {
              s/\s+/ /g;
           }
        }
        
        sub workDir() {
        # Use home directory to process data
        chdir or die "$!";
        if ( ! -d $processdir ) {
           mkdir ("$ENV{HOME}/$processdir", 0755) or die "Cannot make directory $processdir: $!";
           }
        }
    
    

    output:

    
        1||9752||Deutsche Schule Alamogordo  USA  Alamogorde - New Mexico  || ||Deutschsprachige Auslandsschule|| 
        2||9931||Deutsche Schule der Borromäerinnen Alexandrien ET  Alexandrien - Ägypten  || ||Begegnungsschule (Auslandsschuldienst)|| 
        3||1940||Max-Keller-Schule, Berufsfachschule f.Musik Alt- ötting d.Berufsfachschule für Musik Altötting e.V. Kapellplatz 36 84503  Altötting  ||08671/1735 08671/84363||Berufsfachschulen f. Musik|| www.max-keller-schule.de 
        4||0006||Max-Reger-Gymnasium Amberg  Kaiser-Wilhelm-Ring 7 92224  Amberg  ||09621/4718-0 09621/4718-47||Gymnasien|| www.mrg-amberg.de
    

    With the || being the delimiter.

     

     

    My problem is that I need more fields. I need to have the following split out:

     

        name: Volksschule Abenberg (Grundschule)
        street: Güssübelstr. 2
        postal-code and town: 91183 Abenberg
        fax and telephone: 09178/215 09178/905060
        type of school: Volksschulen
        website: home.t-online.de/home/vs-abenberg 
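
    A small sketch of one way to get from the six extracted columns to the longer field list above, written in Python for brevity. It assumes that the name cell and the phone cell keep their parts on separate lines (newline-separated) before whitespace is collapsed; the function names and the sample row layout are made up for illustration.

```python
def split_lines(cell, width):
    """Split a multi-line cell into exactly `width` whitespace-normalized parts."""
    parts = [" ".join(line.split()) for line in cell.split("\n")]
    parts = [part for part in parts if part]   # drop empty lines
    parts += [""] * (width - len(parts))       # pad short cells with empty strings
    return parts[:width]


def expand_row(row):
    """Turn the six extracted columns into nine named fields."""
    rownum, number, name_cell, phone_cell, school_type, website = row
    name, street, postal_town = split_lines(name_cell, 3)
    phone, fax = split_lines(phone_cell, 2)
    return {
        "rownum": rownum, "number": number,
        "name": name, "street": street, "postal_town": postal_town,
        "phone": phone, "fax": fax,
        "type": school_type, "website": website,
    }


sample = [
    "2", "6581",
    "Volksschule Abenberg (Grundschule)\nGüssübelstr. 2\n91183  Abenberg",
    "09178/215\n09178/905060",
    "Volksschulen", "home.t-online.de/home/vs-abenberg",
]
for key, value in expand_row(sample).items():
    print(f"{key}: {value}")
```

    Padding short cells with empty strings means rows without a street or a fax number still produce the same number of fields, which keeps the delimited output aligned.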

     

    Well, how do I add more fields?

    Presumably this has to be done in this line here, doesn't it?

     

    my $te = HTML::TableExtract->new(headers => [qw(lfd Schul Schulname Telefon Schulart Webseite)]);
    

    But how? I tried out several things, but they did not help. I always got bad results. BTW, I played around and tried another solution, which gives me good CSV data but unfortunately has no spider logic:

     

        #!/usr/bin/perl
        use warnings;
        use strict;
        use LWP::Simple;
        use HTML::TableExtract;
        use Text::CSV;
        
        my $html= get 'http://192.68.214.70/km/asps/schulsuche.asp?q=n&a=50';
        $html =~ tr/\r//d;        # strip the carriage returns
        $html =~ s/&nbsp;/ /g;    # expand the non-breaking spaces

        my $te = HTML::TableExtract->new;
        $te->parse($html);
        
        my @cols = qw(
            rownum
            number
            name
            phone
            type
            website
        );
        
        my @fields = qw(
            rownum
            number
            name
            street
            postal
            town
            phone
            fax
            type
            website
        );
        
        my $csv = Text::CSV->new({ binary => 1 });
        
        foreach my $ts ($te->table_states) {
            foreach my $row ($ts->rows) {
        
                # trim leading/trailing whitespace from base fields
                s/^\s+//, s/\s+$// for grep { defined } @$row;

                # load the fields into the hash using a "hash slice"
                my %h;
                @h{@cols} = @$row;

                # derive some fields from base fields, again using a hash slice
                @h{qw/name street postal town/} = split /\n+/, $h{name} // '';
                @h{qw/phone fax/} = split /\n+/, $h{phone} // '';

                # trim leading/trailing whitespace from derived fields
                s/^\s+//, s/\s+$// for grep { defined } @h{qw/name street postal town/};
        
                $csv->combine(@h{@fields});
                print $csv->string, "\n";
            }
        }  
    

     

    With this other solution I have good CSV data, but unfortunately no spider logic.

    How do I add the spider logic here?

    I look forward to any and all help!

  9. hello dear all - hello all freaks of this great community,

     

     

    One question regarding a parser. Note that it is a Perl parser, but believe me, I need some help with it. And I guess that many experts here know Perl so well that this is no problem.

    Here we go! Is there any chance to get some separators between the fields of the table? The parser script already runs nicely. Note that I want to store the data in a MySQL database, so it would be great to have some separators (commas, tabs or something else). Tab-separated or comma-separated values are handy formats to work with.

    Here is the data from the following site: http://192.68.214.70/km/asps/schulsuche.asp?q=a&a=20

     

    lfd. Nr. Schul- nummer Schulname Straße PLZ Ort Telefon Fax Schulart Webseite

    1 0401 Mädchenrealschule Marienburg, Abenberg, der Diözese Eichstätt Marienburg 1 91183  Abenberg   09178/509210  Realschulen  mrs-marienburg.homepage.t-online.de

    2 6581 Volksschule Abenberg (Grundschule) Güssübelstr. 2 91183  Abenberg   09178/215 09178/905060 Volksschulen  home.t-online.de/home/vs-abenberg

    6 3074 Private Berufsschule zur sonderpäd. Förderung, Förderschwerpunkt Lernen, Abensberg Regensburger Straße 60 93326  Abensberg  09443/709191 09443/709193 Berufsschulen zur sonderpädog. Förderung www.berufsschule-abensberg.de

     

     

    Well, I need to have those lines divided into at least three columns. Take record 2, for example:

    name: Volksschule Abenberg (Grundschule)

    street: Güssübelstr. 2

    postal-code and town: 91183  Abenberg

    fax and telephone: 09178/215 09178/905060

    type of school: Volksschulen

    website: home.t-online.de/home/vs-abenberg

     

    Or even better, the postal code and town divided into two separate columns.
    Question: is this possible?

    By the way, look at records 1 and 6 (here I only show the names of the schools):

    1 0401 Mädchenrealschule Marienburg, Abenberg,
    6 3074 Private Berufsschule zur sonderpäd. Förderung, Förderschwerpunkt Lernen, Abensberg

    Note that those have commas inside the name. Does this make it difficult to create a parser that produces CSV format?
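
    On the comma question, a short Python sketch (the thread itself uses Perl, where Text::CSV plays the same role): a CSV writer wraps any field that contains the delimiter in double quotes, so commas inside a school name do not break the columns. The two rows are taken from the data above, reduced to four columns for brevity.

```python
import csv
import io

rows = [
    ["1", "0401", "Mädchenrealschule Marienburg, Abenberg, der Diözese Eichstätt", "Realschulen"],
    ["2", "6581", "Volksschule Abenberg (Grundschule)", "Volksschulen"],
]

# Write the rows to an in-memory buffer; fields with commas get quoted automatically.
buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator="\n")
writer.writerows(rows)
text = buffer.getvalue()
print(text)

# Reading the text back restores the original fields, commas included.
parsed = list(csv.reader(io.StringIO(text)))
print(parsed == rows)  # True
```

    The same idea applies to tab-separated output: as long as a proper CSV/TSV library writes and reads the file, embedded delimiters are handled by quoting rather than by hand.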

     

    Any idea how to do this in Perl? If it is possible, that would be just great!

    Many thanks for a hint regarding this little issue. Besides this, everything is great and fascinating!

     

     

    dilbertone...

     

     

    Here the code:

     

      #!/usr/bin/perl
        use strict;
        use warnings;
        use HTML::TableExtract;
        use LWP::Simple;
        use Cwd;
        use POSIX qw(strftime);
        my $te = HTML::TableExtract->new;
        my $total_records = 0;
        my $suchbegriffe = "e";
        my $treffer = 50;
        my $range = 0;
        my $url_to_process = "http://192.68.214.70/km/asps/schulsuche.asp?q=";
        my $processdir = "processing";
        my $counter = 50;
        my $displaydate = "";
        my $percent = 0;
    
        &workDir();
        chdir $processdir;
        &processURL();
        print "\nPress <enter> to continue\n";
        <>;
        $displaydate = strftime('%Y%m%d%H%M%S', localtime);
        open OUTFILE, '>', "webdata_for_$suchbegriffe\_$displaydate.txt" or die "Unable to create file: $!";
        &processData();
        close OUTFILE;
        print "Finished processing $total_records records...\n";
        print "Processed data saved to $ENV{HOME}/$processdir/webdata_for_$suchbegriffe\_$displaydate.txt\n";
        unlink 'processing.html';
        die "\n";
    
        sub processURL() {
        print "\nProcessing $url_to_process$suchbegriffe&a=$treffer&s=$range\n";
        is_success(getstore("$url_to_process$suchbegriffe&a=$treffer&s=$range", 'tempfile.html')) or die 'Unable to get page';
    
           while( <tempfile.html> ) {
              open( FH, "$_" ) or die;
              while( <FH> ) {
                 if( $_ =~ /^.*?(Treffer <b>)(\d+)( - )(\d+)(<\/b> \w+ \w+ <b>)(\d+).*/ ) {
                    $total_records = $6;
                    print "Total records to process is $total_records\n";
                    }
                 }
                 close FH;
           }
           unlink 'tempfile.html';
        }
    
        sub processData() {
           while ( $range <= $total_records) {
              is_success(getstore("$url_to_process$suchbegriffe&a=$treffer&s=$range", 'processing.html')) or die 'Unable to get page';
              $te->parse_file('processing.html');
              my ($table) = $te->tables;
              for my $row ( $table->rows ) {
                 cleanup(@$row);
                 print OUTFILE "@$row\n";
              }
              $| = 1; 
              print "Processed records $range to $counter";
              print "\r";
              $counter = $counter + 50;
              $range = $range + 50;
              $te = HTML::TableExtract->new;
           }
        }
    
        sub cleanup() {
           for ( @_ ) {
              s/\s+/ /g;
           }
        }
    
        sub workDir() {
        # Use home directory to process data
        chdir or die "$!";
        if ( ! -d $processdir ) {
           mkdir ("$ENV{HOME}/$processdir", 0755) or die "Cannot make directory $processdir: $!";
           }
        }  
    

  10. Hello TLG

     

    Yes, I want to split that information into three cells or columns (in MySQL).

    BTW, see the dataset; here you have an overview: http://192.68.214.70/km/asps/schulsuche.asp?q=a&a=50&s=1750

    I have loaded the data from the online table into a Calc spreadsheet, and from there I imported it into MySQL.

    In only one column (the third one!) I have the full address with

     

    1. name of the school

    2. name of the street

    3. postal code and town

     

    Well, I guess that your code hits the point. I will take all the (almost 6000) records and apply your code below.

     

     

    If I understand correctly, you would want something like this:

     

    $opt = explode("\n", $record); // Don't forget to mysql_real_escape_string
    mysql_query("INSERT INTO my_table (name, street, postal) VALUES ('{$opt[0]}','{$opt[1]}','{$opt[2]}')");

     

    I will have a closer look at what explode does exactly, but I am pretty sure that you have given exactly the right hint.
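
    For comparison, a sketch of the same split-and-insert idea in Python. It uses the standard-library `sqlite3` module as a stand-in for MySQL (an assumption made only so the example runs anywhere) and passes the values as query parameters, so no manual escaping is needed. The table and column names follow the quoted snippet; the sample record comes from the school data in this thread.

```python
import sqlite3

# One combined cell: name, street, and postal code with town, one per line.
record = "Volksschule Abenberg (Grundschule)\nGüssübelstr. 2\n91183 Abenberg"
name, street, postal = record.split("\n")   # same idea as explode("\n", $record)

conn = sqlite3.connect(":memory:")          # throwaway in-memory database
conn.execute("CREATE TABLE my_table (name TEXT, street TEXT, postal TEXT)")

# Placeholders keep the data separate from the SQL text.
conn.execute(
    "INSERT INTO my_table (name, street, postal) VALUES (?, ?, ?)",
    (name, street, postal),
)
conn.commit()

stored = conn.execute("SELECT name, street, postal FROM my_table").fetchall()
print(stored)
```

    With a MySQL driver the placeholder syntax may differ, but the principle of letting the driver bind the values stays the same.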

     

    Best regards

    db1

     

  11. hello dear The Little Guy

     

    many many thanks for the hints! GREAT!

     

    $opt = explode("\n", $record);
    var_dump($opt);

     

     

    Did I get you right: I take the information from the third column in this huge spreadsheet, which can be found here: http://192.68.214.70/km/asps/schulsuche.asp?q=a&a=50&s=1750

    (or the information derived from this dataset in a Calc spreadsheet) and apply your code to the third (!) column of the spreadsheet?

    Note that I need to have the information not in just one cell but in three or four.

    Question: should I take the information of the huge (note: the table contains almost 6000 records)

     

     

    The Little Guy, I would love to hear from you again.

     

    best regards

    db1  ;D

  12. good day dear community,

     

    I am in big trouble: I need some regex to solve a problem! Can you help me a bit? That would be great! I thought a lot about what to call the subject, and finally I came to: "Regex or explode to array: I need some help with a simple string!"

    I have a spreadsheet in Calc with some records. There is a column that contains the following information:

     

    Ecole Saint-Exupery

    Rue Saint-Malo 24

    67544 Paris

     

    Well i need to have those lines divided into at least three columns

     

    name: Ecole Saint-Exupery

    street: Rue Saint-Malo 24

    postal code and town: 67544 Paris

     

    Or even better: the postal code and town divided into two separate columns. Question: is this possible? Can (or should) I do this in Calc (OpenDocument format)? Do I need to use a regex and Perl, or can I solve this issue without a regex?

     

    Note: in the end I need to transfer the data into a MySQL database...

     

     

    I look forward to a tip...

     

    greetings

     

    BTW: you can see all of this in a real-world live demo: http://192.68.214.70/km/asps/schulsuche.asp?q=a&a=50&s=1750 - see the field

     

    Schulname

    Straße

    PLZ Ort

     

    This field contains three things: the name, the street, and the postal code with the town! Question: can this be divided into parts? If you copy and paste the information and drop it into Calc, you get all the information in only one cell. How can I divide and separate all this information into three cells, or even four?

     

    BTW - I tried to translate the information to hex code - see the following:

     

    Staatl. Realschule Grafenau

    Rachelweg 20

    94481  Grafenau

     

    00000000: 53 74 61 61 74 6C 2E 20  52 65 61 6C 73 63 68 75
    00000010: 6C 65 20 47 72 61 66 65  6E 61 75 20 0A 52 61 63
    00000020: 68 65 6C 77 65 67 20 32  30 0A 39 34 34 38 31 20
    00000030: 20 47 72 61 66 65 6E 61  75 20 20 

     

    But I do not know if this helps here.

     

    Can you help me to solve the problem? Do I need a regex?
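
    A minimal sketch of how the split could look in PHP, under a few assumptions: the hex dump shows the three parts separated by line feeds (0A), so explode can cut the record into lines, and a small pattern can then separate the postal code from the town. The function name splitAddress and the array keys are made up for illustration, and the five-digit postal code is an assumption that fits the examples in this post.

```php
<?php
// Hypothetical helper: split one three-line record into four fields.
function splitAddress($record)
{
    // Normalise Windows line endings, then cut at line feeds (0A in the hex dump).
    $lines = explode("\n", str_replace("\r\n", "\n", trim($record)));
    $lines = array_map('trim', $lines);

    $name   = isset($lines[0]) ? $lines[0] : '';
    $street = isset($lines[1]) ? $lines[1] : '';
    $last   = isset($lines[2]) ? $lines[2] : '';

    // Assumption: the last line looks like "<5-digit postal code> <town>".
    $postal = '';
    $town   = $last;
    if (preg_match('/^(\d{5})\s+(.+)$/', $last, $m)) {
        $postal = $m[1];
        $town   = $m[2];
    }

    return array(
        'name'   => $name,
        'street' => $street,
        'postal' => $postal,
        'town'   => $town,
    );
}

// The sample record from the hex dump, including its trailing spaces.
$record = "Staatl. Realschule Grafenau \nRachelweg 20\n94481  Grafenau  ";
print_r(splitAddress($record));
```

    If the last line does not match the pattern, the whole line is kept as the town and the postal code stays empty, so odd records can be found and fixed by hand later.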

     

    Many thanks in advance for any and all help!

  13. Hello Ignace

     

    Many, many thanks for the idea - that sounds very good.

     

    BTW: as I already have the addresses in tab-separated format, what about this: I can create 10 different tables (or fewer, depending on the different formats) and load them into the database using the LOAD DATA INFILE command (MySQL 5.1 Reference Manual, section 12.2.6, LOAD DATA INFILE Syntax). After this I can use the commands you posted to create a new table with your new address book format.

     

    What do you think about this?

    I look forward to hearing from you.

     

    best

    db1

     

    see also: http://dev.mysql.com/doc/refman/5.1/en/load-data.html

     

  14. Hi there - hello BlueSkyIS

     

    you can add more fields to any table you like. perhaps i don't understand the question?

     

     

    Well - what if I want to migrate 10 (address book) DBs into one?

     

    They look a bit different:

     

     

    Adressbook 1: 	name	adress 	eMail	tel		Telefax	       portrait	
    
    Adressbook 2: 	name	Company aresss: postalcode 	Telefon: 	Fax: 	E-Mail:	Internet: 
    
    Adressbook 3: 	name	address	tel	fax	email	homepage		
    

     

    All ten look a bit different. How should I handle this migration of ten tables into one big DB?

     

    I hope I was able to make clear what I want. If I have to be more precise, just let me know.
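
    One possible approach, sketched in PHP under stated assumptions: give each source address book a small map from its own column headers to the fields of the merged table, then run every row through the same function. The target field names (name, address, phone, fax, email, website), the names normalizeRow and $maps, and the sample row are all made up for illustration; source columns without a mapping (such as portrait) are simply dropped in this sketch.

```php
<?php
// Hypothetical column maps: source header => field in the merged table.
$maps = array(
    'book1' => array('name' => 'name', 'adress' => 'address', 'eMail' => 'email',
                     'tel' => 'phone', 'Telefax' => 'fax'),
    'book3' => array('name' => 'name', 'address' => 'address', 'tel' => 'phone',
                     'fax' => 'fax', 'email' => 'email', 'homepage' => 'website'),
);

// Assumed layout of the merged table; extend it as more fields turn up.
$targetFields = array('name', 'address', 'phone', 'fax', 'email', 'website');

function normalizeRow(array $row, array $map, array $targetFields)
{
    // Start with every target field empty so all rows share one shape.
    $out = array_fill_keys($targetFields, null);
    foreach ($row as $column => $value) {
        if (isset($map[$column])) {
            $out[$map[$column]] = trim($value);
        }
    }
    return $out;
}

// Invented sample row in the layout of address book 1.
$row = array('name' => 'Ecole Saint-Exupery', 'adress' => 'Rue Saint-Malo 24',
             'eMail' => 'info@example.org', 'tel' => '0123', 'Telefax' => '0124',
             'portrait' => 'some text');
print_r(normalizeRow($row, $maps['book1'], $targetFields));
```

    Each normalised row then has the same keys, so a single INSERT statement can serve all ten sources.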

     

    Many thanks in advance

     

    regards

    db1

  15. Hi dear freaks

     

     

    I want to create an address book with MySQL. At the moment I do not know how many fields I need.

     

    I want to be flexible with that, at least for the next few days, until I am sure how many fields I really need.

     

     

    I have found a dump on the internet that is already built for an address book:

     

     

    http://www.apachefriends.org/f/viewtopic.php?f=14&t=26305&start=0&sid=633a3f317b08dc6d8e555a81ed10538f&view=print

     

    # phpMyAdmin SQL Dump
    # version 2.5.7-pl1
    # http://www.phpmyadmin.net
    #
    # Host: localhost
    # Generation time: 04 September 2007 at 16:37
    # Server Version: 4.0.20
    # PHP-Version: 4.3.7
    #
    # Database: `joels`
    #
    # --------------------------------------------------------
    #
    # Table structure for table `address_book`
    #
    CREATE TABLE `address_book` (
      `address_book_id` int(11) NOT NULL auto_increment,
      `customers_id` int(11) NOT NULL default '0',
      `entry_gender` char(1) NOT NULL default '',
      `entry_company` varchar(32) default NULL,
      `entry_firstname` varchar(32) NOT NULL default '',
      `entry_lastname` varchar(32) NOT NULL default '',
      `entry_street_address` varchar(64) NOT NULL default '',
      `entry_suburb` varchar(32) default NULL,
      `entry_postcode` varchar(10) NOT NULL default '',
      `entry_city` varchar(32) NOT NULL default '',
      `entry_state` varchar(32) default NULL,
      `entry_country_id` int(11) NOT NULL default '0',
      `entry_zone_id` int(11) NOT NULL default '0',
      PRIMARY KEY (`address_book_id`),
      KEY `idx_address_book_customers_id` (`customers_id`)
    ) TYPE=MyISAM AUTO_INCREMENT=2 ;
    
    

    Can I use this, and can I easily add more fields?

     

    I look forward to hearing from you.

     

    Regards

    db1

     

     

     

  16. Hello BlueSkyIs,

     

    many thanks for your answer!!

     

    That function is so simple, i would just skip writing the function and put the function code within the loop.

     

    Hmm - how do I apply it?

     

    Hmmm - as I am new to PHP, I ask myself what the point of the function is. Does the $numbers loop go inside the function definition?

     

    I guess not. I guess the loop goes outside the function definition, where the function is called multiple times.

     

    function multiload() {
        /* Inside, define the function. */
    }
    multiload(); /* <-- Outside, call the function. */
    
    

     

    I would love to hear from you.

     

    best regards

    db1

  17. Hello dear community,

     

     

    The following code is a solution that returns the labels and values in a formatted array, ready for input to MySQL. Very nice ;-)

     

    <?php
    
    $dom = new DOMDocument();
    @$dom->loadHTMLFile('http://schulen.bildung-rp.de/gehezu/startseite/einzelanzeige.html?tx_wfqbe_pi1%5buid%5d=60119');
    $divElement = $dom->getElementById('wfqbeResults');
    
    $innerHTML= '';
    $children = $divElement->childNodes;
    foreach ($children as $child) {
    $innerHTML = $child->ownerDocument->saveXML( $child );
    
    $doc = new DOMDocument();
    $doc->loadHTML($innerHTML);
    //$divElementNew = $dom->getElementsByTagName('td');
    $divElementNew = $dom->getElementsByTagname('td');
    
        /*** the array to return ***/
        $out = array();
        foreach ($divElementNew as $item)
        {
            /*** add node value to the out array ***/
            $out[] = $item->nodeValue;
        }
    
    echo '<pre>';
    print_r($out);
    echo '</pre>';
    
    } 
    
    ?>
    

     

    That bit of code works fine, and it performs an operation that I intend to call multiple times. Therefore it makes sense to wrap it in a function. We can name it whatever we want, so let us just name it "multiload".

     

    I tried to do this with the following code, but it does not run. I am still not sure where to put the uid: inside or outside the function.

     

    <?php
    
    function multiload ($uid) {
    /*...*/
    //  $uid = '60119';
    
    $dom = new DOMDocument();
    
             $dom->loadHTMLFile('http://schulen.bildung-rp.de/gehezu/startseite/einzelanzeige.html?tx_wfqbe_pi1%5buid%5d=' . $uid);
    
    
         }
    
    multiload ('60089');
    multiload ('60152');
    multiload ('60242');
    /*...*/
    
    $divElement = $dom->getElementById('wfqbeResults');
    
    $innerHTML= '';
    $children = $divElement->childNodes;
    foreach ($children as $child) {
    $innerHTML = $child->ownerDocument->saveXML( $child );
    
    $doc = new DOMDocument();
    $doc->loadHTML($innerHTML);
    //$divElementNew = $dom->getElementsByTagName('td');
    $divElementNew = $dom->getElementsByTagname('td');
    
        /*** the array to return ***/
        $out = array();
        foreach ($divElementNew as $item)
        {
            /*** add node value to the out array ***/
            $out[] = $item->nodeValue;
        }
    
    echo '<pre>';
    print_r($out);
    echo '</pre>';
    
    }
    
    
    ?>
    
    

     

     

    Where should the following lines go?

     

    multiload('60089');
    multiload('60152');
    multiload('60242');
    /*...*/

     

    This is still repetitive, so we can put the numbers in an array, can't we?

    Then we can loop through the array.

     

     

    $numbers = array('60089', '60152', '60242' /*...*/);
    foreach ($numbers as $number) {
        multiload($number);
    }

     

    But the question is: how and where do I put the loop?

     

    Can anybody give me a starting point...
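
    A possible starting point, sketched under a few assumptions: the parsing goes into one function, the fetching into multiload($uid), and the loop stays outside and calls multiload once per number. The helper name extractCells is made up; the URL and the wfqbeResults id are taken from the script in this post. The demo markup is invented so that the sketch runs without network access.

```php
<?php
// Hypothetical helper: collect the text of every <td> inside #wfqbeResults.
function extractCells(DOMDocument $dom)
{
    $container = $dom->getElementById('wfqbeResults');
    if ($container === null) {
        return array();               // no result block on this page
    }
    $out = array();
    foreach ($container->getElementsByTagName('td') as $cell) {
        $out[] = trim($cell->nodeValue);
    }
    return $out;
}

// The uid is a parameter, so the same function serves every record.
function multiload($uid)
{
    $url = 'http://schulen.bildung-rp.de/gehezu/startseite/einzelanzeige.html?tx_wfqbe_pi1%5buid%5d=' . $uid;
    $dom = new DOMDocument();
    if (!@$dom->loadHTMLFile($url)) { // @ hides warnings about sloppy markup
        return array();
    }
    return extractCells($dom);
}

// Offline demo with invented markup, so nothing is fetched here.
$demo = new DOMDocument();
@$demo->loadHTML('<div id="wfqbeResults"><table><tr><td>Name</td><td> Schule A </td></tr></table></div>');
print_r(extractCells($demo));

// Real use: the loop sits outside the function definitions.
// foreach (array('60089', '60152', '60242') as $number) {
//     print_r(multiload($number));
// }
```

    The point of the design is that $dom only exists inside multiload, so everything that needs it has to happen inside the function (or be returned from it), while the list of numbers and the loop live outside.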

     

     

    BTW - if I have to be more descriptive, just let me know.

    It is no problem to explain more.

     

    greetings

     

  18. hello dear revraz, hello dear litebearer, good day

     

     

    Many, many thanks to you both! Great to hear from you.

    The idea with an array has convinced me!

     

    BTW: this uses the dot operator, doesn't it? Is that one of the two ways to do string or URL concatenation?

     

     

    Perhaps like this?

    $orig_string = "http://www.somesite.com?page=";
    $number_array = array("123", "43567", "9287", "3323");
    for ($i = 0; $i < count($number_array); $i++) {
        // The dot operator appends the current number to the base URL.
        $new_url = $orig_string . $number_array[$i];
        /* do something with the new url */
    }
    
    

     

    many many thanks for the hint!!

     

    BTW: this is an absolutely great forum - I love it! Many, many thanks for the ideas and hints.

     

    @ you both - you are very supportive. Great to have you here!

     

    Have a great seasonal break and a merry, merry Christmas

     

    greetings

    Dilbertone!

     

     

    Update: one last question: can I integrate the loop solution with the array into my basic script ...

     

    <?php
    
    $dom = new DOMDocument();
    @$dom->loadHTMLFile('http://schulen.bildung-rp.de/gehezu/startseite/einzelanzeige.html?tx_wfqbe_pi1%5buid%5d=60119');
    $divElement = $dom->getElementById('wfqbeResults');
    $innerHTML= '';
    $children = $divElement->childNodes;
    foreach ($children as $child) {
    $innerHTML = $child->ownerDocument->saveXML( $child );
    
    $doc = new DOMDocument();
    $doc->loadHTML($innerHTML);
    //$divElementNew = $dom->getElementsByTagName('td');
    $divElementNew = $dom->getElementsByTagname('td');
    
        /*** the array to return ***/
        $out = array();
        foreach ($divElementNew as $item)
        {
            /*** add node value to the out array ***/
            $out[] = $item->nodeValue;
        }
    
    echo '<pre>';
    print_r($out);
    echo '</pre>';
    
    } 
    

     

     

     

    .......like so:

     

     

    <?php
    
    $dom = new DOMDocument();
    $orig_string = "http://www.somesite.com?page=";
    
    @$dom->loadHTMLFile {
    
    $number_array = array ("123", "43567", "9287","3323");
    for($i=0; $i<$count($number_array); $i ++) {
    
    $new_url = $orig_url . $number_array[$i];
    
    /* do something with the new url */
    }
    
    $divElement = $dom->getElementById('wfqbeResults');
    $innerHTML= '';
    $children = $divElement->childNodes;
    foreach ($children as $child) {
    $innerHTML = $child->ownerDocument->saveXML( $child );
    
    $doc = new DOMDocument();
    $doc->loadHTML($innerHTML);
    //$divElementNew = $dom->getElementsByTagName('td');
    $divElementNew = $dom->getElementsByTagname('td');
    
        /*** the array to return ***/
        $out = array();
        foreach ($divElementNew as $item)
        {
            /*** add node value to the out array ***/
            $out[] = $item->nodeValue;
        }
    
    echo '<pre>';
    print_r($out);
    echo '</pre>';
    
    } 
    

  19. good evening dear Community,  8)

     

    Well, first of all: Feliz Navidad - I want to wish you a Merry Christmas!!

     

    Today I'm trying to debug a little DOMDocument object in PHP. Ideally I would like to get DOMDocument to output in an array-like format, so that I can store the data in a database!

     

    My example: head over to the target URL:

    http://dms-schule.bildung.hessen.de/suchen/suche_schul_db.html?show_school=8880

     

    I investigated the source code:

     

    I want to filter out the data that is inside the following element: <div class="floatbox">

     

    See the source code:

     

    <span class="grey"> <span style="font-size:x-small;">></span></span>
    <a class="navLink" href="http://dms-schule.bildung.hessen.de/suchen/index.html" title="Suchformulare zum hessischen schulischen Bildungssystem">suche</a>
                  </div>
                </div>
              <!-- begin of text -->
               <h3>Siegfried-Pickert Schule</h3>
    <div class="floatbox">
    

     

     

    See my approach: this is meant to return the labels and values in a formatted array, ready for input to MySQL!

    <?php
    
    $dom = new DOMDocument();
    @$dom->loadHTMLFile('http://dms-schule.bildung.hessen.de/suchen/suche_schul_db.html?show_school=8880');
    $divElement = $dom->getElementById('floatbox');
    
    $innerHTML= '';
    $children = $divElement->childNodes;
    foreach ($children as $child) {
    $innerHTML = $child->ownerDocument->saveXML( $child );
    
    $doc = new DOMDocument();
    $doc->loadHTML($innerHTML);
    //$divElementNew = $dom->getElementsByTagName('td');
    $divElementNew = $dom->getElementsByTagname('td');
    
        /*** the array to return ***/
        $out = array();
        foreach ($divElementNew as $item)
        {
            /*** add node value to the out array ***/
            $out[] = $item->nodeValue;
        }
    
    echo '<pre>';
    print_r($out);
    echo '</pre>';
    
    } 
    

     

     

    Well, duh: this outputs a lot of garbage. The code spits out a lot of HTML anyway.

    What can I do to get cleaner output?

     

    What is wrong with the idea of selecting the element like this:

     

     $dom->getElementById('floatbox');

     

    Any ideas?
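
    One likely cause: in the source shown above, floatbox is a class (class="floatbox"), not an id, and getElementById only looks at id attributes. A minimal sketch of selecting by class with DOMXPath instead follows. The function name textByClass is made up, the sample markup inside the div is invented (the real content of the page is not shown in this post), and the class name is assumed to contain no quote characters.

```php
<?php
// Return the cleaned-up text of each element that carries the given class.
function textByClass($html, $class)
{
    $dom = new DOMDocument();
    @$dom->loadHTML($html);           // @ hides warnings about sloppy markup
    $xpath = new DOMXPath($dom);

    // Match the class as a whole word inside the class attribute.
    $query = "//*[contains(concat(' ', normalize-space(@class), ' '), ' $class ')]";

    $out = array();
    foreach ($xpath->query($query) as $node) {
        // nodeValue is plain text; collapse whitespace runs before storing it.
        $out[] = trim(preg_replace('/\s+/', ' ', $node->nodeValue));
    }
    return $out;
}

// Invented sample markup in the shape of the page source quoted above.
$html = "<h3>Siegfried-Pickert Schule</h3>\n"
      . "<div class=\"floatbox\">\n  Example street 1,\n  12345 Example town\n</div>";
print_r(textByClass($html, 'floatbox'));
```

    Because nodeValue returns text only, the tags themselves never reach the output array, which should also cut down on the HTML garbage.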

     

    Any and all help will be greatly appreciated.

     

    season-greetings

    db1  ;)

  20. Hello dear community, good day!

     

    first of all: Merry Christmas to all of you!!

     

     

    How can I combine / concatenate a *divided* string so that I can use the concatenated string in a loop in which I run the following:

     

    $dom = new DOMDocument();
    @$dom->loadHTMLFile('<- path to the file-> =60119');
    

     

    and the following numbers - note: they replace the ending!

     

    60299

    64643

    62958

    63678

    60419

    60585

    60749

    60962

     

    and so on.

     

    Question: how can I combine the string (in fact the string is a URL) so that I am able to build the URLs automatically, and so that I am able to run all of that in a loop, e.g. with foreach (probably this is the right way to do it)?
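
    A minimal sketch of the idea, assuming the fixed part of the URL ends right before the number: keep that part in one string and append each number with the dot operator inside a foreach loop. The function name buildUrls and the example.com prefix are placeholders, not the real path.

```php
<?php
// Build one URL per id by appending the id to a common prefix.
function buildUrls($prefix, array $ids)
{
    $urls = array();
    foreach ($ids as $id) {
        // The dot operator concatenates; urlencode guards against odd characters.
        $urls[] = $prefix . urlencode($id);
    }
    return $urls;
}

// Placeholder prefix: everything up to and including the "=" goes here.
$prefix = 'http://www.example.com/einzelanzeige.html?uid=';
$ids    = array('60299', '64643', '62958', '63678', '60419', '60585', '60749', '60962');

foreach (buildUrls($prefix, $ids) as $url) {
    echo $url, "\n";
    // $dom = new DOMDocument();
    // @$dom->loadHTMLFile($url);   // fetch and parse each page here
}
```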

     

    I hope that I was able to explain the question so that you understand it. If I have to be more descriptive, just let me know!

     

    Many many thanks for a hint!

     

    db1  ;)

     

  21.  

    Hello, good evening!

     

     

    I was able to run the test:

     

    suse-linux:~ # php -r "echo class_exists('DOMDocument') ? 'It exists' : 'It Does NOT exist';"
    It existssuse-linux:~ # 
    

     

    Well, I am glad - now I can continue with the work on the parser.

    @DJ Kat: I will continue with some tests on the parser script (the one you gave me in another thread, further below).

  22. hi there - hello dear DJ Kat

     

    Well - I am a bit confused! What to say? ;-)

     

    Actually, what you showed doesn't mean DOMDocument doesn't exist. It means you ran the command wrong. The dollar sign just indicates it's on the command line; you shouldn't type that.

     

    Hmmm - I want to run the DOMDocument code you suggested to me, so I am trying my best to get the Linux box up and running with everything I need.

     

     

    Let me know if I did something wrong.

     

    Well, I am trying it on the shell:

     

     

    Question: should I run this:

     

    $ php -r 'echo (class_exists("DOMDocument")) ? "It exists \n" : "It Does NOT exist \n";'

     

    or this: on the

    -r 'echo (class_exists("DOMDocument")) ? "It exists \n" : "It Does NOT exist \n";'
    

     

    or so:

     

    -r 'echo (class_exists("DOMDocument")) ? "It exists \n" : "It Does NOT exist \n"
    

     

    Hmmm - I am a bit confused...

     

     

    I would love to hear from you!

     

    db1

     
