Parse out all image tags and attributes in Perl

DarkWater · September 30, 2008

Just decided to experiment with it really quickly (kinda bored, to be honest). Anyone have any advice for parts I could rewrite?

#!/usr/bin/perl

$_ = <<HTML; #set up test HTML, would be from file normally
<img src="foo" height="1" />
<img src="testing!" height="50" width="200" />
<img height="20" src="lol" />
HTML

my @imgs;

while (/<img (.+?) \/>/g) {
    push @imgs, {$1 =~ /(\w+)="([^"]+)"/g}; # @imgs becomes an array of anonymous hashes
}

for $loc (0 .. $#imgs) { # loop through indices for easy position tracking (might change to a plain for loop in the real script)
    print "Image at location $loc:\n";
    $hash = $imgs[$loc]; # get the hash reference out!
    for $key (keys %$hash) { # and dereference it out of the scalar. made THAT mistake a few times
        print "$key = $hash->{$key}\n"; # print out the pairs
    }
}

Anyone have any suggestions?

DarkWater · October 1, 2008

Just changed it a tiny bit to use the strict and warnings pragmas, so I also needed to change the my declaration:

#!/usr/bin/perl
use strict;
use warnings;

$_ = <<HTML; #set up test HTML, would be from file normally
<img src="foo" height="1" />
<img src="testing!" height="50" width="200" />
<img height="20" src="lol" />
HTML

my (@imgs, $loc, $hash, $key);

while (/<img (.+?) \/>/g) {
    push @imgs, {$1 =~ /(\w+)="([^"]+)"/g}; # @imgs becomes an array of anonymous hashes
}

for $loc (0 .. $#imgs) { # loop through indices for easy position tracking (might change to a plain for loop in the real script)
    print "Image $loc:\n";
    $hash = $imgs[$loc]; # get the hash reference out!
    for $key (keys %$hash) { # and dereference it out of the scalar. made THAT mistake a few times
        print "$key = $hash->{$key}\n"; # print out the pairs
    }
}

No big difference.

effigy · October 1, 2008

1. You're only going to catch double-quoted attributes.

2. You haven't accounted for rogue white space.

3. There are various modules on CPAN for things like this; my take is below:

use strict;
use warnings;
use XML::Twig;
use Data::Dumper;

my $data = <<HTML;
<img src="foo" height="1" />
<img src="testing!" height="50" width="200" />
<img height="20" src="lol" />
HTML

my @atts;
my $twig = XML::Twig->new(
    twig_roots => {
        'img' => sub {
            push @atts, $_->atts();
        }
    }
);
$twig->parse_html($data);
print Data::Dumper->Dump([\@atts]);
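
For anyone reading the dump: assuming each entry pushed onto @atts is a hash reference mapping attribute names to values, it has the same shape as the @imgs array from the earlier posts. The sketch below uses hand-written sample data (not actual XML::Twig output) and only shows how Data::Dumper's Sortkeys and Indent settings can make the dump repeatable and easier to compare between runs.

```perl
use strict;
use warnings;
use Data::Dumper;

# Hand-written sample data standing in for @atts; NOT real XML::Twig output.
my @atts = (
    { src => 'foo',      height => 1 },
    { src => 'testing!', height => 50, width => 200 },
);

$Data::Dumper::Sortkeys = 1;   # sort hash keys so the dump is repeatable
$Data::Dumper::Indent   = 1;   # fixed, compact indentation

# The second argument names the dumped variable ($atts instead of $VAR1).
my $dump = Data::Dumper->Dump([\@atts], ['atts']);
print $dump;
```

Without Sortkeys, the key order inside each hash can differ from run to run, which makes two dumps hard to diff.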

DarkWater · October 1, 2008

effigy wrote:
> 1. You're only going to catch double-quoted attributes.
> 2. You haven't accounted for rogue white space.
> 3. There are various modules on CPAN for things like this; my take is below:

1. I was only planning on using this on valid XHTML anyway, so the double quoted attributes aren't a problem.

2. Do you mean like:

$1 =~ /(\w+)\s*=\s*"([^"]+)"/g

(notice the addition of \s*)

3. I know, and I love CPAN; I just wanted to try this on my own as an experiment.

effigy · October 1, 2008

DarkWater wrote:
> 2. Do you mean like:
>
> $1 =~ /(\w+)\s*=\s*"([^"]+)"/g
>
> (notice the addition of \s*)

Yes.
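
Putting points 1 and 2 together, here is a minimal sketch of a regex-only version that tolerates stray whitespace and accepts either quote style. The sample input and the variable names ($html, $attr_text, %attrs) are made up for illustration, and it still assumes that no attribute value contains a ">" character; for real-world HTML a parser module remains the safer choice.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical test input: mixed quote styles and stray whitespace.
my $html = <<'HTML';
<img src="foo" height="1" />
<img  src = 'testing!'  height= "50" width ="200"/>
<img height='20' src="lol">
HTML

my @imgs;

# Match the tag loosely: any whitespace after "img", optional "/" before ">".
while ($html =~ /<img\s+([^>]*?)\s*\/?>/gi) {
    my $attr_text = $1;
    my %attrs;
    # (["']) captures the opening quote; \2 requires the same quote to close.
    while ($attr_text =~ /([\w-]+)\s*=\s*(["'])(.*?)\2/g) {
        $attrs{lc $1} = $3;    # lowercase the attribute name
    }
    push @imgs, \%attrs;       # fresh %attrs each pass, so each ref is distinct
}

for my $i (0 .. $#imgs) {
    print "Image $i:\n";
    for my $key (sort keys %{ $imgs[$i] }) {
        print "  $key = $imgs[$i]{$key}\n";
    }
}
```

The backreference is what keeps a value like "it's" from being cut short at the apostrophe, since only the quote character that opened the value can close it.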

DarkWater · October 1, 2008

Alright, thanks.
