Html::tableExtract: how to optimize the CSV-Output?

dilbertone · February 24, 2011

hello good day dear community,

i like this place. It is a great place for idea and knowlege sharing! But by far the most impressive thing i learned is that this community here is so supportive. I am overwhelmed by this experience. This forum has so many many great folks.

i have a little parser that parses a site - with 6150 records. But i need to have this in a CSV-formate. First of all see here the target site: http://192.68.214.70/km/asps/schulsuche.asp?q=a&a=50&s=1750

i need all the data - with separation in the filed of

    number
    schoolnumber
    school-name
    Adress
    Street 
    Postal Code 
     phone
     fax 
    School-type
    website

BTW - see here the target site: http://192.68.214.70/km/asps/schulsuche.asp?q=a&a=50&s=1750 and compare!

Well - i have a script: i am very interested what you think about this .... not all the fields are gained yet - i need more of them!

    #!/usr/bin/perl
    use strict;
    use HTML::TableExtract;
    use LWP::Simple;
    use Cwd;
    use POSIX qw(strftime);
    
    my $total_records = 0;
    my $alpha = "x";
    my $results = 50;
    my $range = 0;
    my $url_to_process = "http://192.68.214.70/km/asps/schulsuche.asp?q=";
    my $processdir = "processing";
    my $counter = 50;
    my $percent = 0;
    
    workDir();
    chdir $processdir;
    processURL();
    print "\nPress <enter> to continue\n";
    <>;
    my $displaydate = strftime('%Y%m%d%H%M%S', localtime);
    open my $outfile, '>', "webdata_for_$alpha\_$displaydate.txt" or die 'Unable to create file';
    processData();
    close $outfile;
    print "Finished processing $total_records records...\n";
    print "Processed data saved to $ENV{HOME}/$processdir/webdata_for_$alpha\_$displaydate.txt\n";
    unlink 'processing.html';
    
    sub processURL() {
    print "\nProcessing $url_to_process$alpha&a=$results&s=$range\n";
    getstore("$url_to_process$alpha&a=$results&s=$range", 'tempfile.html') or die 'Unable to get page';
    
       while( <tempfile.html> ) {
          open( FH, "$_" ) or die;
          while( <FH> ) {
             if( $_ =~ /^.*?(Treffer \<b\>)(\d+)( - )(\d+)(<\/b> \w+ \w+ \<b\>)(\d+).*/ ) {
                $total_records = $6;
                print "Total records to process is $total_records\n";
                }
             }
             close FH;
       }
       unlink 'tempfile.html';
    }
    
    sub processData() {
       while ( $range <= $total_records) {
          my $te = HTML::TableExtract->new(headers => [qw(lfd Schul Schulname Telefon Schulart Webseite)]);
          getstore("$url_to_process$alpha&a=$results&s=$range", 'processing.html') or die 'Unable to get page';
          $te->parse_file('processing.html');
          my ($table) = $te->tables;
          foreach my $ts ($te->table_states) {
             foreach my $row ($ts->rows) {
                cleanup(@$row);
    	    # Add a table column delimiter in this case ||
                print $outfile join("||", @$row)."\n";
                }
             }
          $| = 1;  
          print "Processed records $range to $counter";
          print "\r";
          $counter = $counter + 50;
          $range = $range + 50;
       }
    }
    
    sub cleanup() {
       for ( @_ ) {
          s/\s+/ /g;
       }
    }
    
    sub workDir() {
    # Use home directory to process data
    chdir or die "$!";
    if ( ! -d $processdir ) {
       mkdir ("$ENV{HOME}/$processdir", 0755) or die "Cannot make directory $processdir: $!";
       }
    }

output:


    1||9752||Deutsche Schule Alamogordo  USA  Alamogorde - New Mexico  || ||Deutschsprachige Auslandsschule|| 
    2||9931||Deutsche Schule der Borromäerinnen Alexandrien ET  Alexandrien - Ägypten  || ||Begegnungsschule (Auslandsschuldienst)|| 
    3||1940||Max-Keller-Schule, Berufsfachschule f.Musik Alt- ötting d.Berufsfachschule für Musik Altötting e.V. Kapellplatz 36 84503  Altötting  ||08671/1735 08671/84363||Berufsfachschulen f. Musik|| www.max-keller-schule.de 
    4||0006||Max-Reger-Gymnasium Amberg  Kaiser-Wilhelm-Ring 7 92224  Amberg  ||09621/4718-0 09621/4718-47||Gymnasien|| www.mrg-amberg.de

With the || being the delimiter.

My problem is: i need to have more fields - i need to have the following divided:

    name: Volksschule Abenberg (Grundschule)
    street: Güssübelstr. 2
    postal-code and town: 91183 Abenberg
    fax and telephone: 09178/215 09178/905060
    type of school: Volksschulen
    website: home.t-online.de/home/vs-abenberg

well - how to add more fields?

This obviously has to be done in this line here, doesn t it!?

my $te = HTML::TableExtract->new(headers => [qw(lfd Schul Schulname Telefon Schulart Webseite)]);

But how. I tried out several things - but i dont helped. I allways got bad results. Btw: i played around - and tried another solution - but here i have good csv-data - but unfortunatley no spider logic...

    #!/usr/bin/perl
    use warnings;
    use strict;
    use LWP::Simple;
    use HTML::TableExtract;
    use Text::CSV;
    
    my $html= get 'http://192.68.214.70/km/asps/schulsuche.asp?q=n&a=50';
    $html =~ tr/r//d;     # strip the carriage returns
    $html =~ s/ / /g; # expand the spaces
    
    my $te = new HTML::TableExtract();
    $te->parse($html);
    
    my @cols = qw(
        rownum
        number
        name
        phone
        type
        website
    );
    
    my @fields = qw(
        rownum
        number
        name
        street
        postal
        town
        phone
        fax
        type
        website
    );
    
    my $csv = Text::CSV->new({ binary => 1 });
    
    foreach my $ts ($te->table_states) {
        foreach my $row ($ts->rows) {
    
            #  trim leading/trailing whitespace from base fields
            s/^s+//, s/\s+$// for @$row;
    
            # load the fields into the hash using a "hash slice"
            my %h;
            @h{@cols} = @$row;
    
            # derive some fields from base fields, again using a hash slice
            @h{qw/name street postal town/} = split /n+/, $h{name};
            @h{qw/phone fax/} = split /n+/, $h{phone};
    
            #  trim leading/trailing whitespace from derived fields
            s/^s+//, s/\s+$// for @h{qw/name street postal town/};
    
            $csv->combine(@h{@fields});
            print $csv->string, "\n";
        }
    }

Well - with this i tried another solution - but here i have good csv-data - but unfortunatley no spider logic.

How to add the spider-logic here... !?

look forward to any and all help!

AbraCadaver · February 24, 2011

Why not try a Perl board?

dilbertone · February 24, 2011

hi dear Abracadaver,

many many thanks - i am very very happy to hear from you.

Why not try a Perl board?

i am pretty sure that this can be done in php as well - and the usage of csv-formatted output is also known in php-fields.. But the best argument is - i am a big big fan of this site here.

And yes - you helped me years and years... your code is a live time saver..!!!

[ i know you from the AutoTheme and i am/was a user of your site from the early beginning in 2003....

So i would be glad if you can help me here...

Sign In

Html::tableExtract: how to optimize the CSV-Output?

Recommended Posts

dilbertone

Link to comment

Share on other sites

AbraCadaver

Link to comment

Share on other sites

dilbertone

Link to comment

Share on other sites

Archived

Browse

Activity

Important Information