
Parse out all image tags and attributes in Perl


DarkWater


Just decided to experiment with it really quickly (kinda bored, to be honest).  Anyone have any advice for parts I could rewrite?

 

#!/usr/bin/perl

$_ = <<HTML; #set up test HTML, would be from file normally
<img src="foo" height="1" />
<img src="testing!" height="50" width="200" />
<img height="20" src="lol" />
HTML

my @imgs;

while (/<img (.+?) \/>/g) {
    push @imgs, {$1 =~ /(\w+)="([^"]+)"/g}; # @imgs becomes an array of anonymous hashes
}

for $loc (0 .. $#imgs) { # loop through indices for easy position tracking (might change to a plain for loop in the real script)
    print "Image at location $loc:\n";
    $hash = $imgs[$loc]; # get the hash ref out!
    for $key (keys %$hash) { # and dereference it out of the scalar.  made THAT mistake a few times
        print "$key = $hash->{$key}\n"; # print out the pairs
    }
}

 

Anyone have any suggestions?

Just changed it a tiny bit to use the strict and warnings pragmas, so I also needed to change that my declaration:

 

#!/usr/bin/perl
use strict;
use warnings;

$_ = <<HTML; #set up test HTML, would be from file normally
<img src="foo" height="1" />
<img src="testing!" height="50" width="200" />
<img height="20" src="lol" />
HTML

my (@imgs, $loc, $hash, $key);

while (/<img (.+?) \/>/g) {
    push @imgs, {$1 =~ /(\w+)="([^"]+)"/g}; # @imgs becomes an array of anonymous hashes
}

for $loc (0 .. $#imgs) { # loop through indices for easy position tracking (might change to a plain for loop in the real script)
    print "Image $loc:\n";
    $hash = $imgs[$loc]; # get the hash ref out!
    for $key (keys %$hash) { # and dereference it out of the scalar.  made THAT mistake a few times
        print "$key = $hash->{$key}\n"; # print out the pairs
    }
}

 

No big difference.

1. You're only going to catch double-quoted attributes.

2. You haven't accounted for rogue white space.

3. There are various modules on CPAN for things like this; my take is below:

 

use strict;
use warnings;
use XML::Twig;
use Data::Dumper;

my $data = <<HTML;
<img src="foo" height="1" />
<img src="testing!" height="50" width="200" />
<img height="20" src="lol" />
HTML

my @atts;
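my $twig = XML::Twig->new(
    twig_roots => {
        'img' => sub {
            push @atts, $_->atts(); # atts() returns a hash ref of the element's attributes
        }
    }
);
$twig->parse_html($data); # parse_html needs HTML::TreeBuilder to be installed
print Data::Dumper->Dump([\@atts]);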
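
For point 1, a regex-only version can still accept either quote style by capturing the opening quote and backreferencing it. This is only a minimal sketch, assuming attribute values never contain the quote character that delimits them; parse_attrs is a hypothetical helper name, not something from the scripts above:

```perl
use strict;
use warnings;

# Hypothetical helper: turn the inside of a tag into a hash ref of attributes.
# Accepts name="value" and name='value'; assumes no escaped quotes inside values.
sub parse_attrs {
    my ($text) = @_;
    my %attrs;
    # $1 = attribute name, $2 = opening quote, $3 = value up to the matching quote
    while ($text =~ /([\w:-]+)\s*=\s*(["'])(.*?)\2/g) {
        $attrs{$1} = $3;
    }
    return \%attrs;
}

my $attrs = parse_attrs(q{src='foo' height="1" alt=""});
print "$_ = $attrs->{$_}\n" for sort keys %$attrs;
```

Unquoted values (height=1) would still be missed, which is one more reason a real parser is the safer choice for arbitrary HTML.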


1. I was only planning on using this on valid XHTML anyway, so the double quoted attributes aren't a problem.

2. Do you mean like:

$1 =~ /(\w+)\s*=\s*"([^"]+)"/g

(notice the addition of \s*)
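
Whitespace can vary outside the attribute pairs too: the outer pattern /<img (.+?) \/>/ wants exactly one space after img and one before />, and . will not cross a newline without /s. A rough sketch that loosens both patterns, still assuming well-formed, double-quoted XHTML; extract_imgs is a hypothetical name:

```perl
use strict;
use warnings;

# Hypothetical helper: return a list of hash refs, one per <img ... /> tag.
sub extract_imgs {
    my ($html) = @_;
    my @imgs;
    # \s+ after "img", optional whitespace before "/>", and /s so the
    # attributes may span several lines.
    while ($html =~ /<img\s+(.*?)\s*\/>/gs) {
        my $inside = $1;
        # \s* on both sides of "=" tolerates spacing like height = "1"
        push @imgs, { $inside =~ /(\w+)\s*=\s*"([^"]*)"/g };
    }
    return @imgs;
}

my @imgs = extract_imgs(qq{<img  src = "foo"\n  height="1"/>\n<img height="20" src="lol" />\n});
print scalar(@imgs), " images found\n";
```

Copying $1 into $inside before running the inner match keeps the code readable, since the inner regex overwrites the capture variables.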

3. I know, and I love CPAN; I just wanted to try this on my own as an experiment.

