dil_bert Posted November 9, 2017

Hello dear PHP experts,

I'm pretty new to programming, and to OO programming especially. Nonetheless, I'm trying to get a very simple spider for web crawling done. Here's what I can't get to work:

    #!C:\Perl\bin\perl
    use strict; # You always want to include both strict and warnings
    use warnings;
    use LWP::Simple;
    use LWP::UserAgent;
    use HTTP::Request;
    use HTTP::Response;
    use HTML::LinkExtor;

    # There was no reason for this to be in a BEGIN block (and there
    # are a few good reasons for it not to be)
    open my $file1,"+>>", ("links.txt");
    select($file1);

    #The Url I want it to start at;
    # Note that I've made this an array, @urls, rather than a scalar, $URL
    my @urls = ('http://www.computersecrets.eu.pn/');

    # I'm not positive, but this should only need to be set up once, not
    # on every pass through the loop
    my $browser = LWP::UserAgent->new('IE 6');
    $browser->timeout(10);

    #Request and receive contents of a web page;
    # Need to use a while loop instead of a for loop because @urls will
    # be changing as we go
    while (@urls) {
      my $url = shift @urls;
      my $request = HTTP::Request->new(GET => $URL);
      my $response = $browser->request($request);

      #Tell me if there is an error;
      if ($response->is_error()) {printf "%s\n", $response->status_line;}
      my $contents = $response->content();

      #Extract the links from the HTML;
      my ($page_parser) = HTML::LinkExtor->new(undef, $url);
      $page_parser->parse($contents)->eof;
      @links = $page_parser->links;

      #Print the link to a links.txt file;
      foreach $link (@links) {
        push @urls, $$link[2]; # Add link to list of urls before printing it
        print "$$link[2]\n";
      }

      # This next line is MANDATORY - spidering a site as fast as you can
      # will probably bog it down, may crash it, may get your IP address
      # blacklisted (I've written monitors in the past which do just that),
      # and is absolutely certain to piss the admins off.
      sleep 60;
    }

I get the following results:

    Global symbol "$URL" requires explicit package name at wc1.pl line 32.
    Global symbol "@links" requires explicit package name at wc1.pl line 42.
    Global symbol "$link" requires explicit package name at wc1.pl line 45.
    Global symbol "@links" requires explicit package name at wc1.pl line 45.
    Global symbol "$link" requires explicit package name at wc1.pl line 46.
    Global symbol "$link" requires explicit package name at wc1.pl line 47.
    Execution of wc1.pl aborted due to compilation errors.
    martin@linux-jnmx:~/perl> ^C
    martin@linux-jnmx:~/perl>
dil_bert Posted November 9, 2017 (Author)

Fixed this with the following:

    #!C:\Perl\bin\perl
    use strict; # You always want to include both strict and warnings
    use warnings;
    use LWP::Simple;
    use LWP::UserAgent;
    use HTTP::Request;
    use HTTP::Response;
    use HTML::LinkExtor;

    # There was no reason for this to be in a BEGIN block (and there
    # are a few good reasons for it not to be)
    open my $file1,"+>>", ("links.txt");
    select($file1);

    #The Url I want it to start at;
    # Note that I've made this an array, @urls, rather than a scalar, $URL
    my @urls = ('https://wordpress.org/support/plugin/participants-database/');
    my %visited; # The % sigil indicates it's a hash
    my $browser = LWP::UserAgent->new();
    $browser->timeout(5);

    while (@urls) {
      my $url = shift @urls;

      # Skip this URL and go on to the next one if we've
      # seen it before
      next if $visited{$url};

      my $request = HTTP::Request->new(GET => $url);
      my $response = $browser->request($request);

      # No real need to invoke printf if we're not doing
      # any formatting
      if ($response->is_error()) {print $response->status_line, "\n";}
      my $contents = $response->content();

      # Now that we've got the url's content, mark it as
      # visited
      $visited{$url} = 1;

      my ($page_parser) = HTML::LinkExtor->new(undef, $url);
      $page_parser->parse($contents)->eof;
      my @links = $page_parser->links;

      foreach my $link (@links) {
        print "$$link[2]\n";
        push @urls, $$link[2];
      }
      sleep 60;
    }
dil_bert Posted November 10, 2017 (Author)

Hello,

I want to modify the script a bit; tailoring and tinkering is the way to learn. I want to fetch only URLs with a certain term in the URL string, for example "http://www.foo.com/bar". In other words, I need to fetch all the URLs that contain the term "/bar", and then strip the "/bar" so that what remains is the URL http://www.foo.com. Is this doable? Love to hear from you.
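The filter-then-strip step asked about above could look roughly like the following. This is only a sketch: it assumes "contains /bar" means the URL ends in "/bar", and the @found list is made-up sample data standing in for the links the spider would collect.

```perl
use strict;
use warnings;

# Hypothetical sample data; in the spider these would be the URLs
# extracted from each fetched page.
my @found = (
    'http://www.foo.com/bar',
    'http://www.foo.com/baz',
    'http://www.example.org/bar',
);

# Keep only the URLs that end in "/bar" (assumption: the term is a suffix).
my @matching = grep { m{/bar$} } @found;

# Strip the "/bar" suffix from a copy of each URL, leaving @matching intact.
my @stripped = map { (my $copy = $_) =~ s{/bar$}{}; $copy } @matching;

print "$_\n" for @stripped;
```

grep selects the matching URLs and map produces the shortened copies, so neither the original list nor the matched list is modified in place.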
dil_bert Posted November 11, 2017 (Author)

Hello dear all,

I tried the following out:

    my $url =~s|/bar$||;

but then left out the "my". The "my" causes a new $url to be created, whereas what we want is to modify the old $url.

What I am aiming for: I want to find all URLs that contain the following term: /participants-database/

But unfortunately this does not work:

    #!C:\Perl\bin\perl
    use strict; # You always want to include both strict and warnings
    use warnings;
    use LWP::Simple;
    use LWP::UserAgent;
    use HTTP::Request;
    use HTTP::Response;
    use HTML::LinkExtor;

    # There was no reason for this to be in a BEGIN block (and there
    # are a few good reasons for it not to be)
    open my $file1,"+>>", ("links.txt");
    select($file1);

    #The Url I want it to start at;
    # Note that I've made this an array, @urls, rather than a scalar, $URL
    #my @urls = (' $url =~ s$||;');
    my $urls =~ ('s|/participants-database$||');
    my %visited; # The % sigil indicates it's a hash
    my $browser = LWP::UserAgent->new();
    $browser->timeout(5);

    while (@urls) {
      my $url = shift @urls;

      # Skip this URL and go on to the next one if we've
      # seen it before
      next if $visited{$url};

      my $request = HTTP::Request->new(GET => $url);
      my $response = $browser->request($request);

      # No real need to invoke printf if we're not doing
      # any formatting
      if ($response->is_error()) {print $response->status_line, "\n";}
      my $contents = $response->content();

      # Now that we've got the url's content, mark it as
      # visited
      $visited{$url} = 1;

      my ($page_parser) = HTML::LinkExtor->new(undef, $url);
      $page_parser->parse($contents)->eof;
      my @links = $page_parser->links;

      foreach my $link (@links) {
        print "$$link[2]\n";
        push @urls, $$link[2];
      }
      sleep 60;
    }
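One likely reason the attempt above fails is that the line declaring the start list was replaced by a substitution, so @urls is never declared and the loop has nothing to work on. A possible arrangement, sketched here under assumptions, keeps the start URL in @urls and applies the /participants-database/ test to each extracted link instead. The helper name matching_links and the sample @links data are hypothetical; the sample entries only mimic the [tag, attribute, url] shape that the script reads with $$link[2].

```perl
use strict;
use warnings;

# Hypothetical helper (not part of the original script): given a pattern
# and a list of [tag, attribute, url] entries, return the matching URLs.
sub matching_links {
    my ($pattern, @links) = @_;
    return grep { $_ =~ $pattern } map { $_->[2] } @links;
}

# Made-up stand-in for the list returned by $page_parser->links.
my @links = (
    [ 'a',   'href', 'https://wordpress.org/support/plugin/participants-database/' ],
    [ 'a',   'href', 'https://wordpress.org/support/plugin/participants-database/reviews/' ],
    [ 'img', 'src',  'https://wordpress.org/logo.png' ],
);

# The start URL stays a plain list; the pattern is applied only to new links.
my @urls = ('https://wordpress.org/support/plugin/participants-database/');
push @urls, matching_links(qr{/participants-database/}, @links);

print "$_\n" for @urls;
```

In the spider itself, the call to the helper would replace the unconditional push inside the foreach loop over @links; the %visited check already in the script takes care of the duplicate start URL.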
dil_bert Posted November 12, 2017 (Author)

Hello all,

I've messed up the code a bit: I made @urls the result of a grep on @urls, which was not yet defined at all. I need to define the following:

    my @in_urls = ('bob', 'joe' );

By the way, I'm not quite sure about the while loop. Do we need the while loop?

    foreach my $url ( @urls ){
      # do stuff with the URL
    }

I need to get it to work.
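On the while-versus-foreach question: the script adds to @urls while it is working through it, and perlsyn warns that foreach gets confused when elements are added to or removed from the array it is iterating over. A while loop that shifts one URL off the front on each pass does not have that problem. The sketch below shows the pattern with a made-up in-memory link table (%links_on) standing in for real page fetches; the names 'bob', 'joe' and 'sue' are placeholders only.

```perl
use strict;
use warnings;

# Hypothetical link table: each "page" maps to the links found on it.
# In the real spider this information comes from fetching and parsing.
my %links_on = (
    'bob' => [ 'joe', 'sue' ],
    'joe' => [ 'bob' ],
    'sue' => [],
);

my @queue = ('bob');   # start list, grows while the loop runs
my %visited;           # pages already processed
my @order;             # order in which pages were processed

while (@queue) {
    my $url = shift @queue;

    # Skip anything seen before; the ++ marks it as seen for next time.
    next if $visited{$url}++;

    push @order, $url;

    # Append newly discovered links to the end of the queue.
    push @queue, @{ $links_on{$url} || [] };
}

print "@order\n";
```

The loop ends on its own once the queue is empty, and the %visited check keeps the back-link from 'joe' to 'bob' from causing an endless cycle.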