Jump to content

Html::tableExtract: how to optimize the CSV-Output?


dilbertone

Recommended Posts

hello good day dear community,

 

 

i like this place. It is a great place for idea and knowlege sharing! But by far the most impressive thing i learned is that this community here is so supportive. I am overwhelmed by this experience. This forum has so many many great folks.

 

i have a little parser that parses a site - with 6150 records.  But i need to have this in a CSV-formate. First of all see here the  target site: http://192.68.214.70/km/asps/schulsuche.asp?q=a&a=50&s=1750

 

i need all the data - with separation in the filed of

 

    number
    schoolnumber
    school-name
    Adress
    Street 
    Postal Code 
     phone
     fax 
    School-type
    website

 

BTW - see here the  target site: http://192.68.214.70/km/asps/schulsuche.asp?q=a&a=50&s=1750 and compare!

 

Well - i have a script: i am very interested what you think about this .... not all the fields are gained yet - i need more of them!

 

    #!/usr/bin/perl
    use strict;
    use HTML::TableExtract;
    use LWP::Simple;
    use Cwd;
    use POSIX qw(strftime);
    
    my $total_records = 0;
    my $alpha = "x";
    my $results = 50;
    my $range = 0;
    my $url_to_process = "http://192.68.214.70/km/asps/schulsuche.asp?q=";
    my $processdir = "processing";
    my $counter = 50;
    my $percent = 0;
    
    workDir();
    chdir $processdir;
    processURL();
    print "\nPress <enter> to continue\n";
    <>;
    my $displaydate = strftime('%Y%m%d%H%M%S', localtime);
    open my $outfile, '>', "webdata_for_$alpha\_$displaydate.txt" or die 'Unable to create file';
    processData();
    close $outfile;
    print "Finished processing $total_records records...\n";
    print "Processed data saved to $ENV{HOME}/$processdir/webdata_for_$alpha\_$displaydate.txt\n";
    unlink 'processing.html';
    
    sub processURL() {
    print "\nProcessing $url_to_process$alpha&a=$results&s=$range\n";
    getstore("$url_to_process$alpha&a=$results&s=$range", 'tempfile.html') or die 'Unable to get page';
    
       while( <tempfile.html> ) {
          open( FH, "$_" ) or die;
          while( <FH> ) {
             if( $_ =~ /^.*?(Treffer \<b\>)(\d+)( - )(\d+)(<\/b> \w+ \w+ \<b\>)(\d+).*/ ) {
                $total_records = $6;
                print "Total records to process is $total_records\n";
                }
             }
             close FH;
       }
       unlink 'tempfile.html';
    }
    
    sub processData() {
       while ( $range <= $total_records) {
          my $te = HTML::TableExtract->new(headers => [qw(lfd Schul Schulname Telefon Schulart Webseite)]);
          getstore("$url_to_process$alpha&a=$results&s=$range", 'processing.html') or die 'Unable to get page';
          $te->parse_file('processing.html');
          my ($table) = $te->tables;
          foreach my $ts ($te->table_states) {
             foreach my $row ($ts->rows) {
                cleanup(@$row);
    	    # Add a table column delimiter in this case ||
                print $outfile join("||", @$row)."\n";
                }
             }
          $| = 1;  
          print "Processed records $range to $counter";
          print "\r";
          $counter = $counter + 50;
          $range = $range + 50;
       }
    }
    
    sub cleanup() {
       for ( @_ ) {
          s/\s+/ /g;
       }
    }
    
    sub workDir() {
    # Use home directory to process data
    chdir or die "$!";
    if ( ! -d $processdir ) {
       mkdir ("$ENV{HOME}/$processdir", 0755) or die "Cannot make directory $processdir: $!";
       }
    }

output:


    1||9752||Deutsche Schule Alamogordo  USA  Alamogorde - New Mexico  || ||Deutschsprachige Auslandsschule|| 
    2||9931||Deutsche Schule der Borromäerinnen Alexandrien ET  Alexandrien - Ägypten  || ||Begegnungsschule (Auslandsschuldienst)|| 
    3||1940||Max-Keller-Schule, Berufsfachschule f.Musik Alt- ötting d.Berufsfachschule für Musik Altötting e.V. Kapellplatz 36 84503  Altötting  ||08671/1735 08671/84363||Berufsfachschulen f. Musik|| www.max-keller-schule.de 
    4||0006||Max-Reger-Gymnasium Amberg  Kaiser-Wilhelm-Ring 7 92224  Amberg  ||09621/4718-0 09621/4718-47||Gymnasien|| www.mrg-amberg.de

With the || being the delimiter.

 

 

My problem is: i need to have more fields - i need to have the following divided:

 

    name: Volksschule Abenberg (Grundschule)
    street: Güssübelstr. 2
    postal-code and town: 91183 Abenberg
    fax and telephone: 09178/215 09178/905060
    type of school: Volksschulen
    website: home.t-online.de/home/vs-abenberg 

 

well - how to add more fields?

This obviously has to be done in this line here, doesn t it!?

 

my $te = HTML::TableExtract->new(headers => [qw(lfd Schul Schulname Telefon Schulart Webseite)]);

But how. I tried out several things - but i dont helped. I allways got bad results. Btw: i played around - and tried another solution - but here i have good csv-data - but unfortunatley no spider logic...

 

    #!/usr/bin/perl
    use warnings;
    use strict;
    use LWP::Simple;
    use HTML::TableExtract;
    use Text::CSV;
    
    my $html= get 'http://192.68.214.70/km/asps/schulsuche.asp?q=n&a=50';
    $html =~ tr/r//d;     # strip the carriage returns
    $html =~ s/ / /g; # expand the spaces
    
    my $te = new HTML::TableExtract();
    $te->parse($html);
    
    my @cols = qw(
        rownum
        number
        name
        phone
        type
        website
    );
    
    my @fields = qw(
        rownum
        number
        name
        street
        postal
        town
        phone
        fax
        type
        website
    );
    
    my $csv = Text::CSV->new({ binary => 1 });
    
    foreach my $ts ($te->table_states) {
        foreach my $row ($ts->rows) {
    
            #  trim leading/trailing whitespace from base fields
            s/^s+//, s/\s+$// for @$row;
    
            # load the fields into the hash using a "hash slice"
            my %h;
            @h{@cols} = @$row;
    
            # derive some fields from base fields, again using a hash slice
            @h{qw/name street postal town/} = split /n+/, $h{name};
            @h{qw/phone fax/} = split /n+/, $h{phone};
    
            #  trim leading/trailing whitespace from derived fields
            s/^s+//, s/\s+$// for @h{qw/name street postal town/};
    
            $csv->combine(@h{@fields});
            print $csv->string, "\n";
        }
    }  

 

Well - with this i tried another solution - but here i have good csv-data - but unfortunatley no spider logic.

How to add the spider-logic here... !?

 

look forward to any and all help!

hi dear Abracadaver,

 

many many thanks - i am very very happy to hear  from you.

 

Why not try a Perl board?

 

i am pretty sure that this can be done in php as well - and the usage of csv-formatted output is also known in php-fields.. But the best argument is - i am a big  big fan of this site here.

 

And yes - you helped me years and years... your code is a live  time saver..!!!  ;)

[ i know you from  the AutoTheme and i am/was a user of your site from the early beginning in 2003....

 

So i would  be glad if you can help me here...

Archived

This topic is now archived and is closed to further replies.

×
×
  • Create New...

Important Information

We have placed cookies on your device to help make this website better. You can adjust your cookie settings, otherwise we'll assume you're okay to continue.