
Parse out all image tags and attributes in Perl


DarkWater


Just decided to experiment with it really quickly (kinda bored, to be honest).  Anyone have any advice for parts I could rewrite?

 

#!/usr/bin/perl

$_ = <<HTML; #set up test HTML, would be from file normally
<img src="foo" height="1" />
<img src="testing!" height="50" width="200" />
<img height="20" src="lol" />
HTML

my @imgs;

while (/<img (.+?) \/>/g) {
    push @imgs, {$1 =~ /(\w+)="([^"]+)"/g}; # @imgs becomes an array of anonymous hashes
}

for $loc (0 .. $#imgs) { # loop through indices for easy position tracking (might change in the real script to a plain for loop)
    print "Image at location $loc:\n";
    $hash = $imgs[$loc]; # get the hash out!
    for $key (keys %$hash) { # and dereference it out of the scalar. made THAT mistake a few times
        print "$key = $hash->{$key}\n"; # print out the pairs
    }
}

 

Anyone have any suggestions?


Just changed it a tiny bit to use the strict and warnings pragmas, so I also needed to change the my declaration:

 

#!/usr/bin/perl
use strict;
use warnings;

$_ = <<HTML; #set up test HTML, would be from file normally
<img src="foo" height="1" />
<img src="testing!" height="50" width="200" />
<img height="20" src="lol" />
HTML

my (@imgs, $loc, $hash, $key);

while (/<img (.+?) \/>/g) {
    push @imgs, {$1 =~ /(\w+)="([^"]+)"/g}; # @imgs becomes an array of anonymous hashes
}

for $loc (0 .. $#imgs) { # loop through indices for easy position tracking (might change in the real script to a plain for loop)
    print "Image $loc:\n";
    $hash = $imgs[$loc]; # get the hash out!
    for $key (keys %$hash) { # and dereference it out of the scalar. made THAT mistake a few times
        print "$key = $hash->{$key}\n"; # print out the pairs
    }
}

 

No big difference.
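
As an aside, under strict the loop variables can also be declared with my right in the for statements, which scopes each one to its own loop instead of the whole file. A minimal sketch of that variant, assuming the same test HTML; the names $html and parse_imgs are made up for this illustration and are not part of the script above:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not from the original script): returns a list of
# hash references, one per <img ... /> tag found in the input string.
sub parse_imgs {
    my ($html) = @_;
    my @imgs;
    while ($html =~ /<img (.+?) \/>/g) {
        my $attrs = $1;    # copy $1 before running another match
        # In list context, a /g match returns every captured name/value
        # pair, which is exactly what an anonymous hash needs.
        push @imgs, { $attrs =~ /(\w+)="([^"]+)"/g };
    }
    return @imgs;
}

my $html = <<'HTML';
<img src="foo" height="1" />
<img src="testing!" height="50" width="200" />
<img height="20" src="lol" />
HTML

my @imgs = parse_imgs($html);

for my $loc (0 .. $#imgs) {              # $loc exists only inside this loop
    print "Image $loc:\n";
    my $hash = $imgs[$loc];
    for my $key (sort keys %$hash) {     # sorted so the output order is stable
        print "$key = $hash->{$key}\n";
    }
}
```

The behaviour should be the same as the version above; the only real changes are the variable scoping and the sorted key order.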


1. You're only going to catch double-quoted attributes.

2. You haven't accounted for rogue white space.

3. There are various modules on CPAN for things like this; my take is below:

 

use strict;
use warnings;
use XML::Twig;
use Data::Dumper;

my $data = <<HTML;
<img src="foo" height="1" />
<img src="testing!" height="50" width="200" />
<img height="20" src="lol" />
HTML

my @atts;
my $twig = XML::Twig->new(
    twig_roots => {
        'img' => sub {
            push @atts, $_->atts();
        }
    }
);
$twig->parse_html($data);
print Data::Dumper->Dump([\@atts]);


1. I was only planning on using this on valid XHTML anyway, so the double quoted attributes aren't a problem.

2. Do you mean like:

$1 =~ /(\w+)\s*=\s*"([^"]+)"/g

(notice the addition of \s*)

3. I know, and I love CPAN. I just wanted to try this on my own as an experiment.
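
For anyone who wants to stay with regexes, here is a rough sketch of how points 1 and 2 might both be handled: it accepts single or double quotes, tolerates white space around the equals sign, and does not require the closing slash. It is only an illustration under those assumptions, not a real HTML parser; an attribute value containing > will still trip it up. The names $html, @imgs and %img are made up for this example.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Test input with the awkward cases: stray white space, single quotes,
# an upper-case tag name, a tag split over two lines, no closing slash.
my $html = <<'HTML';
<img src="foo" height="1" />
<img  src = 'testing!'   height= "50" width ="200"/>
<IMG height="20"
     src="lol">
HTML

my @imgs;

# /i allows <IMG>, /s lets . match the newline inside a tag.
while ($html =~ /<img\s+(.*?)\s*\/?>/gis) {
    my $attrs = $1;
    my %img;
    # (["']) captures the opening quote and \2 requires the same one to
    # close the value, so "..." and '...' are both accepted.
    while ($attrs =~ /([\w:-]+)\s*=\s*(["'])(.*?)\2/gs) {
        $img{ lc $1 } = $3;    # lower-case the name so SRC and src agree
    }
    push @imgs, \%img;
}

for my $loc (0 .. $#imgs) {
    print "Image $loc:\n";
    for my $key (sort keys %{ $imgs[$loc] }) {
        print "$key = $imgs[$loc]{$key}\n";
    }
}
```

The inner loop fills the hash one pair at a time rather than building it straight from a list-context match, because the quote character now takes up a capture group of its own and would otherwise end up in the hash.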
