
how to fetch a page with a parser [live demo]


dilbertone


Good evening, dear community!

At the moment I am debugging some lines of code.

 

 

Purpose: I want to process multiple web pages, much as a web spider/crawler would. I have some parts working, but now I need better spider logic. The target URL is http://192.68.214.70/km/asps/schulsuche.asp?q=e&a=50

 

This search returns more than 6000 results! How do I get all of them? I use the module LWP::Simple, and I need to pass it the right arguments in order to fetch all 6150 records.

 

Attempt: Here are the first 5 page URLs:

http://192.68.214.70/km/asps/schulsuche.asp?q=e&a=50&s=0 
http://192.68.214.70/km/asps/schulsuche.asp?q=e&a=50&s=50 
http://192.68.214.70/km/asps/schulsuche.asp?q=e&a=50&s=100 
http://192.68.214.70/km/asps/schulsuche.asp?q=e&a=50&s=150 
http://192.68.214.70/km/asps/schulsuche.asp?q=e&a=50&s=200

We can see that the "s" parameter in the URL starts at 0 for page 1 and then increases by 50 for each page thereafter. We can use this information to build a loop over all pages.
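Before fetching anything, the paging arithmetic can be checked on its own. The following is only a sketch: it assumes the 6150 records and 50 records per page quoted above, and `page_urls` is a hypothetical helper name, not part of LWP::Simple.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Base URL taken from the post; "s" is the record offset.
my $base     = 'http://192.68.214.70/km/asps/schulsuche.asp?q=e&a=50';
my $per_page = 50;      # records per page (the "a" parameter)
my $total    = 6150;    # record count quoted in the post

# Hypothetical helper: build one URL per page (s = 0, 50, 100, ...).
sub page_urls {
    my ($base, $per_page, $total) = @_;
    my @urls;
    for (my $s = 0; $s < $total; $s += $per_page) {
        push @urls, "$base&s=$s";
    }
    return @urls;
}

my @urls = page_urls($base, $per_page, $total);
print scalar(@urls), " pages\n";       # prints "123 pages"
print "$_\n" for @urls[0 .. 4];        # the five URLs listed above
```

With these numbers the last offset is 6100, which matches the upper bound used in the script.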

 

 

 


#!/usr/bin/perl  
use warnings;  
use strict;  
use LWP::Simple;  
use HTML::TableExtract;  
use Text::CSV;  

my @cols = qw(  
    rownum  
    number  
    name  
    phone  
    type  
    website  
);  
  
my @fields = qw(  
    rownum  
    number  
    name  
    street  
    postal  
    town  
    phone  
    fax  
    type  
    website  
);  
  
my $i_first = "0";   
my $i_last = "6100";   
my $i_interval = "50";   
   
for (my $i = $i_first; $i <= $i_last; $i += $i_interval) {
    my $html = get("http://192.68.214.70/km/asps/schulsuche.asp?q=e&a=50&s=$i");
    $html =~ tr/r//d;     # strip the carriage returns
    $html =~ s/&nbsp;/ /g; # expand the spaces

    my $te = new HTML::TableExtract();
    $te->parse($html);

    my $csv = Text::CSV->new({ binary => 1 });

    foreach my $ts ($te->table_states) {
        foreach my $row ($ts->rows) {
            #trim leading/trailing whitespace from base fields
            s/^s+//, s/\s+$// for @$row;

            #load the fields into the hash using a "hash slice"
            my %h;
            @h{@cols} = @$row;

            #derive some fields from base fields, again using a hash slice
            @h{qw/name street postal town/} = split /n+/, $h{name};
            @h{qw/phone fax/} = split /n+/, $h{phone};

            #trim leading/trailing whitespace from derived fields
            s/^s+//, s/\s+$// for @h{qw/name street postal town/};

            $csv->combine(@h{@fields});
            print $csv->string, "\n";
        }
    }
}
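The hash-slice step can be tried without the network. This is a sketch on a hand-made row: the sample values are hypothetical, and it assumes the split pattern is meant to match newlines (`\n`) and the trim pattern whitespace (`\s`).

```perl
use strict;
use warnings;

# Hypothetical row shaped like one table row: the third cell holds
# name, street, postal code and town on separate lines.
my @row = (
    '1', '0001',
    "Example School\n  Main Street 1\n  12345\n  Sampletown",
    "0123 456\n  0123 457",
    'Gymnasium', 'www.example.org',
);

my @cols = qw(rownum number name phone type website);

# Hash slice: assign six cells to six keys in one statement.
my %h;
@h{@cols} = @row;

# Derive fields by splitting the multi-line cells on newlines.
@h{qw/name street postal town/} = split /\n+/, $h{name};
@h{qw/phone fax/}               = split /\n+/, $h{phone};

# Trim every value that is defined (values() yields aliases,
# so the substitutions change the hash in place).
for (values %h) {
    next unless defined;
    s/^\s+//;
    s/\s+$//;
}

print join('|', @h{qw/name street postal town phone fax/}), "\n";
```

The right-hand side of each slice assignment is evaluated first, so `$h{name}` can be split and overwritten in the same statement.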


 

 

I tested the code and got the following results; see below for the error messages shown on the command line.

 

By the way, here are lines 57 and 58:

	#trim leading/trailing whitespace from derived fields  
		s/^s+//, s/\s+$// for @h{qw/name street postal town/};  

 

What do you think?

 

 

 

Staße
    PLZ
    Ot",,,Telefo,Fax,Schulat,Webseite
Use of uninitialized value $_ in substitution (s///) at bavaria_all_guru.pl line 58.
Use of uninitialized value $_ in substitution (s///) at bavaria_all_guru.pl line 58.
Use of uninitialized value $_ in substitution (s///) at bavaria_all_guru.pl line 58.
Use of uninitialized value $_ in substitution (s///) at bavaria_all_guru.pl line 58.
"lfd. N.",Schul-numme,Schul,"ame
    Staße
    PLZ
    Ot",,,Telefo,Fax,Schulat,Webseite
Use of uninitialized value $_ in substitution (s///) at bavaria_all_guru.pl line 58.
Use of uninitialized value $_ in substitution (s///) at bavaria_all_guru.pl line 58.
Use of uninitialized value $_ in substitution (s///) at bavaria_all_guru.pl line 58.
Use of uninitialized value $_ in substitution (s///) at bavaria_all_guru.pl line 58.
"lfd. N.",Schul-numme,Schul,"ame
    Staße
    PLZ
    Ot",,,Telefo,Fax,Schulat,Webseite
Use of uninitialized value $_ in substitution (s///) at bavaria_all_guru.pl line 58.
Use of uninitialized value $_ in substitution (s///) at bavaria_all_guru.pl line 58.
Use of uninitialized value $_ in substitution (s///) at bavaria_all_guru.pl line 58.
Use of uninitialized value $_ in substitution (s///) at bavaria_all_guru.pl line 58.
"lfd. N.",Schul-numme,Schul,"ame
    Staße
    PLZ
    Ot",,,Telefo,Fax,Schulat,Webseite
Use of uninitialized value $_ in substitution (s///) at bavaria_all_guru.pl line 58.
Use of uninitialized value $_ in substitution (s///) at bavaria_all_guru.pl line 58.
Use of uninitialized value $_ in substitution (s///) at bavaria_all_guru.pl line 58.
Use of uninitialized value $_ in substitution (s///) at bavaria_all_guru.pl line 58.
"lfd. N.",Schul-numme,Schul,"ame

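One possible reading of the warnings, offered as a guess rather than a confirmed diagnosis: the output has lost the letters "r" and "n" ("Schul-numme", "Telefo"), which would fit patterns such as tr/r//d and split /n+/ matching literal letters because their backslashes went missing. The sketch below uses an ASCII stand-in for the header cell and a hypothetical helper `trim_defined` to show how that leaves two fields undefined.

```perl
use strict;
use warnings;

# ASCII stand-in for the multi-line header cell of the table.
my $cell = "Schulname\n    Strasse\n    PLZ\n    Ort";

# Without the backslash the pattern matches the letter "n".
my @by_letter  = split /n+/,  $cell;
# With the backslash it matches newlines, presumably the intent.
my @by_newline = split /\n+/, $cell;

print scalar(@by_letter),  " pieces when splitting on the letter n\n";
print scalar(@by_newline), " pieces when splitting on newlines\n";

# Hypothetical helper: trim in place, skipping undefined values
# so that s/// never sees undef.
sub trim_defined {
    for (@_) {
        next unless defined;
        s/^\s+//;
        s/\s+$//;
    }
    return;
}

my %h;
@h{qw/name street postal town/} = @by_letter;    # postal and town stay undef
trim_defined(@h{qw/name street postal town/});   # no "uninitialized" warning
```

This prints 2 and 4 pieces. Two undefined fields passing through two substitutions each would give four warnings per row, the same count as in the output above. If that reading is right, restoring the backslashes (`tr/\r//d`, `split /\n+/`, `s/^\s+//`) should address the cause, and the `defined` guard only covers rows that really have fewer lines.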
