
Global symbol errors all along the way in a little parser - what should I do?



hello dear php-experts,

I'm pretty new to programming, and to OO programming especially. Nonetheless, I'm trying to get a very simple spider for web crawling working.

Here's what I cannot get to work:

#!C:\Perl\bin\perl

use strict; # You always want to include both strict and warnings
use warnings;

use LWP::Simple;
use LWP::UserAgent;
use HTTP::Request;
use HTTP::Response;
use HTML::LinkExtor;

# There was no reason for this to be in a BEGIN block (and there
# are a few good reasons for it not to be)
open my $file1,"+>>", ("links.txt");
select($file1);  

#The Url I want it to start at;
# Note that I've made this an array, @urls, rather than a scalar, $URL
my @urls = ('http://www.computersecrets.eu.pn/');

# I'm not positive, but this should only need to be set up once, not
# on every pass through the loop
my $browser = LWP::UserAgent->new('IE 6');
$browser->timeout(10);

#Request and receive contents of a web page;
# Need to use a while loop instead of a for loop because @urls will
# be changing as we go
while (@urls) {
  my $url = shift @urls;
  my $request = HTTP::Request->new(GET => $URL);
  my $response = $browser->request($request);

  #Tell me if there is an error;
  if ($response->is_error()) {printf "%s\n", $response->status_line;}
  my $contents = $response->content();

  #Extract the links from the HTML;
  my ($page_parser) = HTML::LinkExtor->new(undef, $url);
  $page_parser->parse($contents)->eof;
  @links = $page_parser->links;

  #Print the link to a links.txt file;
  foreach $link (@links) {
    push @urls, $$link[2];  # Add link to list of urls before printing it
    print "$$link[2]\n";
  }

  # This next line is MANDATORY - spidering a site as fast as you can
  # will probably bog it down, may crash it, may get your IP address
  # blacklisted (I've written monitors in the past which do just that),
  # and is absolutely certain to piss the admins off.
  sleep 60;
}

I get the following results:


Global symbol "$URL" requires explicit package name at wc1.pl line 32.
Global symbol "@links" requires explicit package name at wc1.pl line 42.
Global symbol "$link" requires explicit package name at wc1.pl line 45.
Global symbol "@links" requires explicit package name at wc1.pl line 45.
Global symbol "$link" requires explicit package name at wc1.pl line 46.
Global symbol "$link" requires explicit package name at wc1.pl line 47.
Execution of wc1.pl aborted due to compilation errors.
martin@linux-jnmx:~/perl> ^C
martin@linux-jnmx:~/perl>

I fixed this with the following:

#!C:\Perl\bin\perl

use strict; # You always want to include both strict and warnings
use warnings;

use LWP::Simple;
use LWP::UserAgent;
use HTTP::Request;
use HTTP::Response;
use HTML::LinkExtor;

# There was no reason for this to be in a BEGIN block (and there
# are a few good reasons for it not to be)
open my $file1,"+>>", ("links.txt");
select($file1);  

#The Url I want it to start at;
# Note that I've made this an array, @urls, rather than a scalar, $URL
my @urls = ('https://wordpress.org/support/plugin/participants-database/');
my %visited;  # The % sigil indicates it's a hash
my $browser = LWP::UserAgent->new();
$browser->timeout(5);

while (@urls) {
  my $url = shift @urls;

  # Skip this URL and go on to the next one if we've
  # seen it before
  next if $visited{$url};
    
  my $request = HTTP::Request->new(GET => $url);
  my $response = $browser->request($request);

  # No real need to invoke printf if we're not doing
  # any formatting
  if ($response->is_error()) {print $response->status_line, "\n";}
  my $contents = $response->content();

  # Now that we've got the url's content, mark it as
  # visited
  $visited{$url} = 1;

  my ($page_parser) = HTML::LinkExtor->new(undef, $url);
  $page_parser->parse($contents)->eof;
  my @links = $page_parser->links;

  foreach my $link (@links) {
    print "$$link[2]\n";
    push @urls, $$link[2];
  }
  sleep 60;
}

hello

 

I want to modify the script a bit - tailoring and tinkering is the way to learn. I want to fetch only the URLs whose URL string contains a certain term, e.g.
 

"http://www.foo.com/bar"


In other words, what I'm aiming for is to fetch all the URLs that contain the term "/bar", and then strip the "bar" part so that the URL http://www.foo.com remains.


is this doable?
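Roughly, this is what I have in mind - just a sketch reusing the foreach block from the working script above, with "/bar" standing in for the term I actually want and $target only a placeholder name of mine:

foreach my $link (@links) {
  my $target = "$$link[2]";   # stringify the URL part of the link

  # keep only the links whose URL contains "/bar"
  next unless $target =~ m|/bar|;

  # strip the trailing "/bar" so that only http://www.foo.com remains
  $target =~ s|/bar$||;

  print "$target\n";
  push @urls, $target;        # or maybe push the original link - not sure yet
}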

love to hear from you



 

hello dear all

I tried the following thing out:
 

my $url =~s|/bar$||;

but the "my" should be left out: the "my" causes a new $url to be created, while what we want is to modify the old $url.
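A tiny sketch to convince myself, using the http://www.foo.com/bar address from above:

my $url = 'http://www.foo.com/bar';

# without a leading "my" the substitution modifies this existing $url in place;
# with "my" in front it would act on a brand-new, empty $url instead
$url =~ s|/bar$||;

print "$url\n";   # prints http://www.foo.com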


What I'm aiming for: I want to search out all the URLs that contain the following term: /participants-database/

but unfortunately this does not work:
 

#!C:\Perl\bin\perl

use strict; # You always want to include both strict and warnings
use warnings;


use LWP::Simple;
use LWP::UserAgent;
use HTTP::Request;
use HTTP::Response;
use HTML::LinkExtor;

# There was no reason for this to be in a BEGIN block (and there
# are a few good reasons for it not to be)
open my $file1,"+>>", ("links.txt");
select($file1);  

#The Url I want it to start at;
# Note that I've made this an array, @urls, rather than a scalar, $URL
#my @urls = (' $url =~ s$||;');
my $urls =~ ('s|/participants-database$||');
my %visited;  # The % sigil indicates it's a hash
my $browser = LWP::UserAgent->new();
$browser->timeout(5);

while (@urls) {
  my $url = shift @urls;
  # Skip this URL and go on to the next one if we've
  # seen it before
  next if $visited{$url};
    
  my $request = HTTP::Request->new(GET => $url);
  my $response = $browser->request($request);

  # No real need to invoke printf if we're not doing
  # any formatting
  if ($response->is_error()) {print $response->status_line, "\n";}
  my $contents = $response->content();

  # Now that we've got the url's content, mark it as
  # visited
  $visited{$url} = 1;

  my ($page_parser) = HTML::LinkExtor->new(undef, $url);
  $page_parser->parse($contents)->eof;
  my @links = $page_parser->links;

  foreach my $link (@links) {
    print "$$link[2]\n";
    push @urls, $$link[2];
  }
  sleep 60;
}

hello all

I've messed up the code a bit:

I have made @urls the result of a grep on @urls, which was not yet defined at all.


I need to define the following:
 

my @in_urls = ('bob', 'joe');


By the way, I'm also not quite sure about the while loop - do we need the while loop at all, or would a foreach like this do?

foreach my $url (@urls) {
  # do stuff with the URL
}


   
I need to get it to work.
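For what it's worth, here is a rough, untested sketch of how I imagine the pieces fitting together: @urls is seeded with a real start URL again (the wordpress.org one from before), the while loop stays - perldoc warns against adding elements to an array while a foreach is iterating over it, and @urls keeps growing as we crawl - and the grep for /participants-database/ plus the substitution are applied to the extracted links rather than to @urls itself. The names @found, $target and $base are only placeholders of mine.

#!C:\Perl\bin\perl

use strict;
use warnings;

use LWP::UserAgent;
use HTTP::Request;
use HTML::LinkExtor;

open my $file1, "+>>", "links.txt" or die "Cannot open links.txt: $!";
select($file1);

# Seed the queue with a real start URL (the substitution does not belong here)
my @urls = ('https://wordpress.org/support/plugin/participants-database/');
my %visited;

my $browser = LWP::UserAgent->new();
$browser->timeout(5);

# while, not foreach: @urls keeps growing while we work through it
while (@urls) {
  my $url = shift @urls;
  next if $visited{$url};

  my $response = $browser->request(HTTP::Request->new(GET => $url));
  if ($response->is_error()) { print $response->status_line, "\n"; }
  my $contents = $response->content();
  $visited{$url} = 1;

  my $page_parser = HTML::LinkExtor->new(undef, $url);
  $page_parser->parse($contents)->eof;

  # keep only the extracted links whose URL contains the wanted term
  my @found = grep { m|/participants-database/| }
              map  { "$$_[2]" }          # stringify the URL part of each link
              $page_parser->links;

  foreach my $target (@found) {
    # strip the term (and anything after it), leaving the shortened address
    (my $base = $target) =~ s|/participants-database/.*$||;
    print "$base\n";

    # queue the full URL for further crawling; only the shortened one is printed
    push @urls, $target;
  }

  sleep 60;
}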
