
Global symbol errors all along the way in a little parser - what should I do?



hello dear php-experts,

I'm pretty new to programming, and to OO programming especially. Nonetheless, I'm trying to get a very simple spider for web crawling working.

Here's what I cannot get to work:

#!C:\Perl\bin\perl

use strict; # You always want to include both strict and warnings
use warnings;

use LWP::Simple;
use LWP::UserAgent;
use HTTP::Request;
use HTTP::Response;
use HTML::LinkExtor;

# There was no reason for this to be in a BEGIN block (and there
# are a few good reasons for it not to be)
open my $file1,"+>>", ("links.txt");
select($file1);  

#The Url I want it to start at;
# Note that I've made this an array, @urls, rather than a scalar, $URL
my @urls = ('http://www.computersecrets.eu.pn/');

# I'm not positive, but this should only need to be set up once, not
# on every pass through the loop
my $browser = LWP::UserAgent->new('IE 6');
$browser->timeout(10);

#Request and receive contents of a web page;
# Need to use a while loop instead of a for loop because @urls will
# be changing as we go
while (@urls) {
  my $url = shift @urls;
  my $request = HTTP::Request->new(GET => $URL);
  my $response = $browser->request($request);

  #Tell me if there is an error;
  if ($response->is_error()) {printf "%s\n", $response->status_line;}
  my $contents = $response->content();

  #Extract the links from the HTML;
  my ($page_parser) = HTML::LinkExtor->new(undef, $url);
  $page_parser->parse($contents)->eof;
  @links = $page_parser->links;

  #Print the link to a links.txt file;
  foreach $link (@links) {
    push @urls, $$link[2];  # Add link to list of urls before printing it
    print "$$link[2]\n";
  }

  # This next line is MANDATORY - spidering a site as fast as you can
  # will probably bog it down, may crash it, may get your IP address
  # blacklisted (I've written monitors in the past which do just that),
  # and is absolutely certain to piss the admins off.
  sleep 60;
}

I get the following results:


Global symbol "$URL" requires explicit package name at wc1.pl line 32.
Global symbol "@links" requires explicit package name at wc1.pl line 42.
Global symbol "$link" requires explicit package name at wc1.pl line 45.
Global symbol "@links" requires explicit package name at wc1.pl line 45.
Global symbol "$link" requires explicit package name at wc1.pl line 46.
Global symbol "$link" requires explicit package name at wc1.pl line 47.
Execution of wc1.pl aborted due to compilation errors.
martin@linux-jnmx:~/perl> ^C
martin@linux-jnmx:~/perl>

I fixed this with the following:

#!C:\Perl\bin\perl

use strict; # You always want to include both strict and warnings
use warnings;

use LWP::Simple;
use LWP::UserAgent;
use HTTP::Request;
use HTTP::Response;
use HTML::LinkExtor;

# There was no reason for this to be in a BEGIN block (and there
# are a few good reasons for it not to be)
open my $file1,"+>>", ("links.txt");
select($file1);  

#The Url I want it to start at;
# Note that I've made this an array, @urls, rather than a scalar, $URL
my @urls = ('https://wordpress.org/support/plugin/participants-database/');
my %visited;  # The % sigil indicates it's a hash
my $browser = LWP::UserAgent->new();
$browser->timeout(5);

while (@urls) {
  my $url = shift @urls;

  # Skip this URL and go on to the next one if we've
  # seen it before
  next if $visited{$url};
    
  my $request = HTTP::Request->new(GET => $url);
  my $response = $browser->request($request);

  # No real need to invoke printf if we're not doing
  # any formatting
  if ($response->is_error()) {print $response->status_line, "\n";}
  my $contents = $response->content();

  # Now that we've got the url's content, mark it as
  # visited
  $visited{$url} = 1;

  my ($page_parser) = HTML::LinkExtor->new(undef, $url);
  $page_parser->parse($contents)->eof;
  my @links = $page_parser->links;

  foreach my $link (@links) {
    print "$$link[2]\n";
    push @urls, $$link[2];
  }
  sleep 60;
}

hello

 

I want to modify the script a bit - tailoring and tinkering is the way to learn. I want to fetch only the URLs whose URL string contains a certain term, e.g.
 

"http://www.foo.com/bar"


In other words, what I'm aiming for is to fetch all the URLs that contain the term "/bar", and then strip the "bar" part so that the URL http://www.foo.com remains.


is this doable?
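Roughly, this is what I have in mind - just a sketch reusing the foreach block from the working script above, with "/bar" standing in for the term I actually want and $target only a placeholder name of mine:

foreach my $link (@links) {
  my $target = "$$link[2]";   # stringify the URL part of the link

  # keep only the links whose URL contains "/bar"
  next unless $target =~ m|/bar|;

  # strip the trailing "/bar" so that only http://www.foo.com remains
  $target =~ s|/bar$||;

  print "$target\n";
  push @urls, $target;        # or maybe push the original link - not sure yet
}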

love to hear from you



 

hello dear all

I tried the following thing out:
 

my $url =~s|/bar$||;

but the "my" should be left out: the "my" causes a new $url to be created, while what we want is to modify the old $url.
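A tiny sketch to convince myself, using the http://www.foo.com/bar address from above:

my $url = 'http://www.foo.com/bar';

# without a leading "my" the substitution modifies this existing $url in place;
# with "my" in front it would act on a brand-new, empty $url instead
$url =~ s|/bar$||;

print "$url\n";   # prints http://www.foo.com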


What I'm aiming for: I want to search out all the URLs that contain the following term: /participants-database/

but unfortunately this does not work:
 

#!C:\Perl\bin\perl

use strict; # You always want to include both strict and warnings
use warnings;


use LWP::Simple;
use LWP::UserAgent;
use HTTP::Request;
use HTTP::Response;
use HTML::LinkExtor;

# There was no reason for this to be in a BEGIN block (and there
# are a few good reasons for it not to be)
open my $file1,"+>>", ("links.txt");
select($file1);  

#The Url I want it to start at;
# Note that I've made this an array, @urls, rather than a scalar, $URL
#my @urls = (' $url =~ s$||;');
my $urls =~ ('s|/participants-database$||');
my %visited;  # The % sigil indicates it's a hash
my $browser = LWP::UserAgent->new();
$browser->timeout(5);

while (@urls) {
  my $url = shift @urls;
  # Skip this URL and go on to the next one if we've
  # seen it before
  next if $visited{$url};
    
  my $request = HTTP::Request->new(GET => $url);
  my $response = $browser->request($request);

  # No real need to invoke printf if we're not doing
  # any formatting
  if ($response->is_error()) {print $response->status_line, "\n";}
  my $contents = $response->content();

  # Now that we've got the url's content, mark it as
  # visited
  $visited{$url} = 1;

  my ($page_parser) = HTML::LinkExtor->new(undef, $url);
  $page_parser->parse($contents)->eof;
  my @links = $page_parser->links;

  foreach my $link (@links) {
    print "$$link[2]\n";
    push @urls, $$link[2];
  }
  sleep 60;
}

hello all

I've messed up the code a bit:

I have made @urls the result of a grep on @urls, which was not yet defined at all.


I need to define the following:
 

my @in_urls = ('bob', 'joe');


By the way, I'm also not quite sure about the while loop - do we need the while loop at all, or would a foreach like this do?

foreach my $url (@urls) {
  # do stuff with the URL
}


   
I need to get it to work.
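For what it's worth, here is a rough, untested sketch of how I imagine the pieces fitting together: @urls is seeded with a real start URL again (the wordpress.org one from before), the while loop stays - perldoc warns against adding elements to an array while a foreach is iterating over it, and @urls keeps growing as we crawl - and the grep for /participants-database/ plus the substitution are applied to the extracted links rather than to @urls itself. The names @found, $target and $base are only placeholders of mine.

#!C:\Perl\bin\perl

use strict;
use warnings;

use LWP::UserAgent;
use HTTP::Request;
use HTML::LinkExtor;

open my $file1, "+>>", "links.txt" or die "Cannot open links.txt: $!";
select($file1);

# Seed the queue with a real start URL (the substitution does not belong here)
my @urls = ('https://wordpress.org/support/plugin/participants-database/');
my %visited;

my $browser = LWP::UserAgent->new();
$browser->timeout(5);

# while, not foreach: @urls keeps growing while we work through it
while (@urls) {
  my $url = shift @urls;
  next if $visited{$url};

  my $response = $browser->request(HTTP::Request->new(GET => $url));
  if ($response->is_error()) { print $response->status_line, "\n"; }
  my $contents = $response->content();
  $visited{$url} = 1;

  my $page_parser = HTML::LinkExtor->new(undef, $url);
  $page_parser->parse($contents)->eof;

  # keep only the extracted links whose URL contains the wanted term
  my @found = grep { m|/participants-database/| }
              map  { "$$_[2]" }          # stringify the URL part of each link
              $page_parser->links;

  foreach my $target (@found) {
    # strip the term (and anything after it), leaving the shortened address
    (my $base = $target) =~ s|/participants-database/.*$||;
    print "$base\n";

    # queue the full URL for further crawling; only the shortened one is printed
    push @urls, $target;
  }

  sleep 60;
}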
