dil_bert Posted November 9, 2017

Hello dear php-experts,

I'm pretty new to programming, and to OO programming especially. Nonetheless, I'm trying to write a very simple spider for web crawling. Here is what I cannot get to work:

    #!C:\Perl\bin\perl

    use strict; # You always want to include both strict and warnings
    use warnings;

    use LWP::Simple;
    use LWP::UserAgent;
    use HTTP::Request;
    use HTTP::Response;
    use HTML::LinkExtor;

    # There was no reason for this to be in a BEGIN block (and there
    # are a few good reasons for it not to be)
    open my $file1,"+>>", ("links.txt");
    select($file1);

    #The Url I want it to start at;
    # Note that I've made this an array, @urls, rather than a scalar, $URL
    my @urls = ('http://www.computersecrets.eu.pn/');

    # I'm not positive, but this should only need to be set up once, not
    # on every pass through the loop
    my $browser = LWP::UserAgent->new('IE 6');
    $browser->timeout(10);

    #Request and receive contents of a web page;
    # Need to use a while loop instead of a for loop because @urls will
    # be changing as we go
    while (@urls) {
      my $url = shift @urls;

      my $request = HTTP::Request->new(GET => $URL);
      my $response = $browser->request($request);

      #Tell me if there is an error;
      if ($response->is_error()) {printf "%s\n", $response->status_line;}
      my $contents = $response->content();

      #Extract the links from the HTML;
      my ($page_parser) = HTML::LinkExtor->new(undef, $url);
      $page_parser->parse($contents)->eof;
      @links = $page_parser->links;

      #Print the link to a links.txt file;
      foreach $link (@links) {
        push @urls, $$link[2]; # Add link to list of urls before printing it
        print "$$link[2]\n";
      }

      # This next line is MANDATORY - spidering a site as fast as you can
      # will probably bog it down, may crash it, may get your IP address
      # blacklisted (I've written monitors in the past which do just that),
      # and is absolutely certain to piss the admins off.
      sleep 60;
    }

I get the following results:

    Global symbol "$URL" requires explicit package name at wc1.pl line 32.
    Global symbol "@links" requires explicit package name at wc1.pl line 42.
    Global symbol "$link" requires explicit package name at wc1.pl line 45.
    Global symbol "@links" requires explicit package name at wc1.pl line 45.
    Global symbol "$link" requires explicit package name at wc1.pl line 46.
    Global symbol "$link" requires explicit package name at wc1.pl line 47.
    Execution of wc1.pl aborted due to compilation errors.
    martin@linux-jnmx:~/perl> ^C
    martin@linux-jnmx:~/perl>
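A note on the messages above: under "use strict", every variable has to be declared (usually with "my") before it is used, and Perl variable names are case-sensitive, so $URL is not the $url created by "my $url = shift @urls". The sketch below is a minimal illustration of the three declarations the messages ask for. The sample link data and the helper names sample_links and link_urls are made up for illustration; the data only mimics the [tag, attribute, URL] array references that HTML::LinkExtor's links method returns.

```perl
use strict;
use warnings;

# Hypothetical stand-in for $page_parser->links: each entry is an
# array reference of the form [tag, attribute name, URL].
sub sample_links {
    return (
        [ 'a',   'href', 'http://www.example.com/one' ],
        [ 'img', 'src',  'http://www.example.com/two.png' ],
    );
}

# Collect the URL (index 2) from each link entry.
sub link_urls {
    my @links = @_;               # declared with "my" (addresses "@links")
    my @found;
    foreach my $link (@links) {   # loop variable declared too (addresses "$link")
        push @found, $link->[2];  # same element as $$link[2]
    }
    return @found;
}

my $url = 'http://www.example.com/';   # lower-case name, used consistently
print "Links found on $url:\n";
print "$_\n" for link_urls(sample_links());
```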
dil_bert Posted November 9, 2017 Author

I fixed this with the following:

    #!C:\Perl\bin\perl
    use strict; # You always want to include both strict and warnings
    use warnings;
    use LWP::Simple;
    use LWP::UserAgent;
    use HTTP::Request;
    use HTTP::Response;
    use HTML::LinkExtor;

    # There was no reason for this to be in a BEGIN block (and there
    # are a few good reasons for it not to be)
    open my $file1, "+>>", "links.txt" or die "Cannot open links.txt: $!";
    select($file1);

    #The Url I want it to start at;
    # Note that I've made this an array, @urls, rather than a scalar, $URL
    my @urls = ('https://wordpress.org/support/plugin/participants-database/');
    my %visited; # The % sigil indicates it's a hash
    my $browser = LWP::UserAgent->new();
    $browser->timeout(5);

    while (@urls) {
      my $url = shift @urls;

      # Skip this URL and go on to the next one if we've
      # seen it before
      next if $visited{$url};

      my $request = HTTP::Request->new(GET => $url);
      my $response = $browser->request($request);

      # No real need to invoke printf if we're not doing
      # any formatting
      if ($response->is_error()) {print $response->status_line, "\n";}
      my $contents = $response->content();

      # Now that we've got the url's content, mark it as
      # visited
      $visited{$url} = 1;

      my ($page_parser) = HTML::LinkExtor->new(undef, $url);
      $page_parser->parse($contents)->eof;
      my @links = $page_parser->links;

      foreach my $link (@links) {
        print "$$link[2]\n";
        push @urls, $$link[2];
      }
      sleep 60;
    }
dil_bert Posted November 10, 2017 Author

Hello,

I want to modify the script a bit; tailoring and tinkering is the way to learn. I want to fetch URLs that have a certain term in the URL string, for example "http://www.foo.com/bar". In other words, I need to fetch all the URLs that contain the term "/bar", and then strip the "/bar" so that what remains is the URL http://www.foo.com. Is this doable? Love to hear from you.
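One way this could be approached, sketched with core Perl only: grep keeps the URLs that contain the term, and a substitution on a copy of each URL removes the trailing "/bar". The sample URLs and the helper names urls_containing and strip_trailing_segment are assumptions for illustration, not part of the original script.

```perl
use strict;
use warnings;

# Keep only the URLs that contain the given path segment, e.g. "/bar".
sub urls_containing {
    my ($segment, @urls) = @_;
    return grep { index($_, $segment) >= 0 } @urls;
}

# Return a copy of the URL with a trailing segment (and an optional
# final slash) removed; the string passed in is left untouched.
sub strip_trailing_segment {
    my ($segment, $url) = @_;
    (my $copy = $url) =~ s{\Q$segment\E/?\z}{};
    return $copy;
}

# Made-up sample data.
my @candidates = (
    'http://www.foo.com/bar',
    'http://www.foo.com/baz',
    'http://www.example.org/bar/',
);

for my $url (urls_containing('/bar', @candidates)) {
    print strip_trailing_segment('/bar', $url), "\n";
}
```

The \Q...\E pair quotes the segment so that characters such as "." in a search term are matched literally rather than as regex metacharacters.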
dil_bert Posted November 11, 2017 Author

Hello dear all,

I tried the following out:

    my $url =~ s|/bar$||;

but then I left out the "my". The "my" causes a new $url to be created, and what we want is to modify the existing $url.

What I am aiming for: I want to search for all URLs that contain the term /participants-database/. Unfortunately, this does not work:

    #!C:\Perl\bin\perl
    use strict; # You always want to include both strict and warnings
    use warnings;
    use LWP::Simple;
    use LWP::UserAgent;
    use HTTP::Request;
    use HTTP::Response;
    use HTML::LinkExtor;

    # There was no reason for this to be in a BEGIN block (and there
    # are a few good reasons for it not to be)
    open my $file1,"+>>", ("links.txt");
    select($file1);

    #The Url I want it to start at;
    # Note that I've made this an array, @urls, rather than a scalar, $URL
    #my @urls = (' $url =~ s$||;');
    my $urls =~ ('s|/participants-database$||');
    my %visited; # The % sigil indicates it's a hash
    my $browser = LWP::UserAgent->new();
    $browser->timeout(5);

    while (@urls) {
      my $url = shift @urls;

      # Skip this URL and go on to the next one if we've
      # seen it before
      next if $visited{$url};

      my $request = HTTP::Request->new(GET => $url);
      my $response = $browser->request($request);

      # No real need to invoke printf if we're not doing
      # any formatting
      if ($response->is_error()) {print $response->status_line, "\n";}
      my $contents = $response->content();

      # Now that we've got the url's content, mark it as
      # visited
      $visited{$url} = 1;

      my ($page_parser) = HTML::LinkExtor->new(undef, $url);
      $page_parser->parse($contents)->eof;
      my @links = $page_parser->links;

      foreach my $link (@links) {
        print "$$link[2]\n";
        push @urls, $$link[2];
      }
      sleep 60;
    }
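As far as I can tell, the line "my $urls =~ ('s|/participants-database$||');" declares a new, undefined scalar $urls and matches it against a quoted string, so no substitution happens and the @urls array that the while loop reads is never declared. A possible shape for what the post describes is to seed @urls with a start page and to filter the extracted links before queuing them. The sketch below uses made-up link strings instead of a live fetch, and enqueue_matching is a hypothetical helper name.

```perl
use strict;
use warnings;

# Push onto the queue only those links that contain the wanted term.
# $queue is an array reference; @links holds plain URL strings here.
# Returns the new length of the queue.
sub enqueue_matching {
    my ($queue, $term, @links) = @_;
    push @$queue, grep { index($_, $term) >= 0 } @links;
    return scalar @$queue;
}

# Seed the queue with the start page (taken from the earlier post).
my @urls = ('https://wordpress.org/support/plugin/participants-database/');

# Made-up links, standing in for the URLs HTML::LinkExtor would extract.
my @extracted = (
    'https://wordpress.org/support/plugin/participants-database/page/2/',
    'https://wordpress.org/about/',
);

enqueue_matching(\@urls, '/participants-database/', @extracted);
print "$_\n" for @urls;
```

In the crawler itself, the same grep could sit in front of the "push @urls, $$link[2];" line, so that only matching links are ever fetched.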
dil_bert Posted November 12, 2017 Author

Hello all,

I've messed up the code a bit: I made @urls the result of a grep on @urls, which was not yet defined at all. I need to define something like the following:

    my @in_urls = ('bob', 'joe');

By the way, I'm also not quite sure about the while loop. Do we need the while loop, or would this do?

    foreach my $url (@urls) {
      # do stuff with the URL
    }

I need to get this to work.
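On the loop question: the queue grows while it is being processed, which is why the earlier version uses "while (@urls)" with shift. The Perl documentation (perlsyn) advises against adding or removing elements of an array inside a foreach over that same array. The sketch below shows the while/shift pattern together with the %visited check on a made-up, in-memory link graph rather than real pages; crawl_order and %links_on are hypothetical names.

```perl
use strict;
use warnings;

# Made-up link graph standing in for real pages: page => links on it.
my %links_on = (
    'a' => [ 'b', 'c' ],
    'b' => [ 'a', 'c' ],
    'c' => [ 'd' ],
    'd' => [],
);

# Visit every page reachable from the start pages exactly once and
# return the pages in the order they were visited.
sub crawl_order {
    my ($graph, @queue) = @_;
    my (%visited, @order);
    while (@queue) {                  # condition is re-checked after every push
        my $page = shift @queue;
        next if $visited{$page}++;    # skip pages already seen
        push @order, $page;
        push @queue, @{ $graph->{$page} || [] };
    }
    return @order;
}

print join(' ', crawl_order(\%links_on, 'a')), "\n";
```

A foreach over a fixed list such as @in_urls is fine for the seed URLs themselves; the while loop matters once newly found links are pushed onto the same array.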