dilbertone Posted February 24, 2011 Share Posted February 24, 2011 hello good day dear community, i like this place. It is a great place for idea and knowlege sharing! But by far the most impressive thing i learned is that this community here is so supportive. I am overwhelmed by this experience. This forum has so many many great folks. i have a little parser that parses a site - with 6150 records. But i need to have this in a CSV-formate. First of all see here the target site: http://192.68.214.70/km/asps/schulsuche.asp?q=a&a=50&s=1750 i need all the data - with separation in the filed of number schoolnumber school-name Adress Street Postal Code phone fax School-type website BTW - see here the target site: http://192.68.214.70/km/asps/schulsuche.asp?q=a&a=50&s=1750 and compare! Well - i have a script: i am very interested what you think about this .... not all the fields are gained yet - i need more of them! #!/usr/bin/perl use strict; use HTML::TableExtract; use LWP::Simple; use Cwd; use POSIX qw(strftime); my $total_records = 0; my $alpha = "x"; my $results = 50; my $range = 0; my $url_to_process = "http://192.68.214.70/km/asps/schulsuche.asp?q="; my $processdir = "processing"; my $counter = 50; my $percent = 0; workDir(); chdir $processdir; processURL(); print "\nPress <enter> to continue\n"; <>; my $displaydate = strftime('%Y%m%d%H%M%S', localtime); open my $outfile, '>', "webdata_for_$alpha\_$displaydate.txt" or die 'Unable to create file'; processData(); close $outfile; print "Finished processing $total_records records...\n"; print "Processed data saved to $ENV{HOME}/$processdir/webdata_for_$alpha\_$displaydate.txt\n"; unlink 'processing.html'; sub processURL() { print "\nProcessing $url_to_process$alpha&a=$results&s=$range\n"; getstore("$url_to_process$alpha&a=$results&s=$range", 'tempfile.html') or die 'Unable to get page'; while( <tempfile.html> ) { open( FH, "$_" ) or die; while( <FH> ) { if( $_ =~ /^.*?(Treffer \<b\>)(\d+)( - )(\d+)(<\/b> \w+ \w+ \<b\>)(\d+).*/ ) { $total_records = $6; print "Total records to process is $total_records\n"; } } close FH; } unlink 'tempfile.html'; } sub processData() { while ( $range <= $total_records) { my $te = HTML::TableExtract->new(headers => [qw(lfd Schul Schulname Telefon Schulart Webseite)]); getstore("$url_to_process$alpha&a=$results&s=$range", 'processing.html') or die 'Unable to get page'; $te->parse_file('processing.html'); my ($table) = $te->tables; foreach my $ts ($te->table_states) { foreach my $row ($ts->rows) { cleanup(@$row); # Add a table column delimiter in this case || print $outfile join("||", @$row)."\n"; } } $| = 1; print "Processed records $range to $counter"; print "\r"; $counter = $counter + 50; $range = $range + 50; } } sub cleanup() { for ( @_ ) { s/\s+/ /g; } } sub workDir() { # Use home directory to process data chdir or die "$!"; if ( ! -d $processdir ) { mkdir ("$ENV{HOME}/$processdir", 0755) or die "Cannot make directory $processdir: $!"; } } output: 1||9752||Deutsche Schule Alamogordo USA Alamogorde - New Mexico || ||Deutschsprachige Auslandsschule|| 2||9931||Deutsche Schule der Borromäerinnen Alexandrien ET Alexandrien - Ägypten || ||Begegnungsschule (Auslandsschuldienst)|| 3||1940||Max-Keller-Schule, Berufsfachschule f.Musik Alt- ötting d.Berufsfachschule für Musik Altötting e.V. Kapellplatz 36 84503 Altötting ||08671/1735 08671/84363||Berufsfachschulen f. Musik|| www.max-keller-schule.de 4||0006||Max-Reger-Gymnasium Amberg Kaiser-Wilhelm-Ring 7 92224 Amberg ||09621/4718-0 09621/4718-47||Gymnasien|| www.mrg-amberg.de With the || being the delimiter. My problem is: i need to have more fields - i need to have the following divided: name: Volksschule Abenberg (Grundschule) street: Güssübelstr. 2 postal-code and town: 91183 Abenberg fax and telephone: 09178/215 09178/905060 type of school: Volksschulen website: home.t-online.de/home/vs-abenberg well - how to add more fields? This obviously has to be done in this line here, doesn t it!? my $te = HTML::TableExtract->new(headers => [qw(lfd Schul Schulname Telefon Schulart Webseite)]); But how. I tried out several things - but i dont helped. I allways got bad results. Btw: i played around - and tried another solution - but here i have good csv-data - but unfortunatley no spider logic... #!/usr/bin/perl use warnings; use strict; use LWP::Simple; use HTML::TableExtract; use Text::CSV; my $html= get 'http://192.68.214.70/km/asps/schulsuche.asp?q=n&a=50'; $html =~ tr/r//d; # strip the carriage returns $html =~ s/ / /g; # expand the spaces my $te = new HTML::TableExtract(); $te->parse($html); my @cols = qw( rownum number name phone type website ); my @fields = qw( rownum number name street postal town phone fax type website ); my $csv = Text::CSV->new({ binary => 1 }); foreach my $ts ($te->table_states) { foreach my $row ($ts->rows) { # trim leading/trailing whitespace from base fields s/^s+//, s/\s+$// for @$row; # load the fields into the hash using a "hash slice" my %h; @h{@cols} = @$row; # derive some fields from base fields, again using a hash slice @h{qw/name street postal town/} = split /n+/, $h{name}; @h{qw/phone fax/} = split /n+/, $h{phone}; # trim leading/trailing whitespace from derived fields s/^s+//, s/\s+$// for @h{qw/name street postal town/}; $csv->combine(@h{@fields}); print $csv->string, "\n"; } } Well - with this i tried another solution - but here i have good csv-data - but unfortunatley no spider logic. How to add the spider-logic here... !? look forward to any and all help! Link to comment https://forums.phpfreaks.com/topic/228709-htmltableextract-how-to-optimize-the-csv-output/ Share on other sites More sharing options...
AbraCadaver Posted February 24, 2011 Share Posted February 24, 2011 Why not try a Perl board? Link to comment https://forums.phpfreaks.com/topic/228709-htmltableextract-how-to-optimize-the-csv-output/#findComment-1179151 Share on other sites More sharing options...
dilbertone Posted February 24, 2011 Author Share Posted February 24, 2011 hi dear Abracadaver, many many thanks - i am very very happy to hear from you. Why not try a Perl board? i am pretty sure that this can be done in php as well - and the usage of csv-formatted output is also known in php-fields.. But the best argument is - i am a big big fan of this site here. And yes - you helped me years and years... your code is a live time saver..!!! [ i know you from the AutoTheme and i am/was a user of your site from the early beginning in 2003.... So i would be glad if you can help me here... Link to comment https://forums.phpfreaks.com/topic/228709-htmltableextract-how-to-optimize-the-csv-output/#findComment-1179158 Share on other sites More sharing options...
Recommended Posts
Archived
This topic is now archived and is closed to further replies.