Now that you know the composition of the various types of tokens, let's see how to use HTML::TokeParser to write useful programs. Many problems are quite simple and require only one token at a time. Programs to solve these problems consist of a loop over all the tokens, with an if statement in the body of the loop identifying the interesting parts of the HTML:
    use HTML::TokeParser;
    my $stream = HTML::TokeParser->new($filename)
      || die "Couldn't read HTML file $filename: $!";
    # For a string: HTML::TokeParser->new( \$string_of_html );

    while (my $token = $stream->get_token) {
      if ($token->[0] eq 'T') { # text
        # process the text in $token->[1]

      } elsif ($token->[0] eq 'S') { # start-tag
        my($tagname, $attr) = @$token[1,2];
        # consider this start-tag...

      } elsif ($token->[0] eq 'E') { # end-tag
        my $tagname = $token->[1];
        # consider this end-tag
      }

      # ignoring comments, declarations, and PIs
    }
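To make the loop concrete, here is a minimal sketch that runs it over a string of HTML and tallies how many tokens of each type turn up. The helper name count_token_types and the sample markup are invented for illustration; only the new and get_token calls come from the skeleton above.

```perl
use strict;
use warnings;
use HTML::TokeParser;

# Hypothetical helper: tally the token types found in a string of HTML.
sub count_token_types {
    my ($html) = @_;
    my $stream = HTML::TokeParser->new(\$html)
        or die "Couldn't parse HTML string";
    my %count;
    while (my $token = $stream->get_token) {
        # $token->[0] is the type code: S, E, T, C, D, or PI
        $count{ $token->[0] }++;
    }
    return \%count;
}

# Sample markup, made up for this sketch
my $counts = count_token_types('<p>Hi <b>there</b>!</p><!-- done -->');
print "$_: $counts->{$_}\n" for sort keys %$counts;
```

A tally like this is a quick way to check that the parser sees a document the way you expect before you write the if statements that pick out the interesting tokens.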
Example 7-1 complains about any img tags in a document that are missing alt, height, or width attributes:
    while(my $token = $stream->get_token) {
      if($token->[0] eq 'S' and $token->[1] eq 'img') {
        my $i = $token->[2]; # attributes of this img tag
        my @lack = grep !exists $i->{$_}, qw(alt height width);
        print "Missing for ", $i->{'src'} || "????", ": @lack\n" if @lack;
      }
    }
When run on an HTML stream (whether from a file or a string), this prints one line for each img tag that lacks any of those attributes, such as:
    Missing for liza.jpg: height width
    Missing for aimee.jpg: alt
    Missing for laurie.jpg: alt height width
Identifying images has many applications: making HEAD requests to ensure the URLs are valid, or making a GET request to fetch the image and using Image::Size from CPAN to check or insert the height and width attributes.
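As one way of preparing for such checks, the following sketch gathers the src of every img tag and resolves it against a base URL. It assumes the URI module is installed (LWP depends on it). The helper name img_urls and the example.com URLs are hypothetical.

```perl
use strict;
use warnings;
use HTML::TokeParser;
use URI;

# Hypothetical helper: return the absolute URL of every img tag's src.
sub img_urls {
    my ($html, $base) = @_;
    my $stream = HTML::TokeParser->new(\$html)
        or die "Couldn't parse HTML string";
    my @urls;
    while (my $token = $stream->get_token) {
        next unless $token->[0] eq 'S' and $token->[1] eq 'img';
        my $src = $token->[2]{'src'};
        next unless defined $src and length $src;   # skip img tags with no src
        push @urls, URI->new_abs($src, $base)->as_string;
    }
    return @urls;
}

# Sample markup and base URL, made up for this sketch
my $html = '<p><img src="liza.jpg"> <img src="/pix/aimee.jpg" alt="Aimee"></p>';
print "$_\n" for img_urls($html, 'http://www.example.com/family/index.html');
```

Each resulting URL could then be handed to something like LWP::UserAgent's head method to see whether the image is really there.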
A similar while loop can use HTML::TokeParser as a simple code filter. You just pass through the $source from each token you don't mean to alter. Here's one that passes through every token that it sees, whether tag, text, comment, declaration, or PI (by just printing its source as HTML::TokeParser passes it in), except for img start-tags, which get replaced with the content of their alt attributes:
    while (my $token = $stream->get_token) {
      if ($token->[0] eq 'S') {
        if ($token->[1] eq 'img') {
          print $token->[2]{'alt'} || '';
        } else {
          print $token->[4];
        }
      }
      elsif($token->[0] eq 'E' ) { print $token->[2] }
      elsif($token->[0] eq 'T' ) { print $token->[1] }
      elsif($token->[0] eq 'C' ) { print $token->[1] }
      elsif($token->[0] eq 'D' ) { print $token->[1] }
      elsif($token->[0] eq 'PI') { print $token->[2] }
    }
So, for example, a document consisting just of this:
    <!-- new entry -->
    <p>Dear Diary,
    <br>This is me & my balalaika, at BalalaikaCon 1998:
    <img src="mybc1998.jpg" alt="BC1998! WHOOO!"> Rock on!</p>
is then spat out as this:
    <!-- new entry -->
    <p>Dear Diary,
    <br>This is me & my balalaika, at BalalaikaCon 1998:
    BC1998! WHOOO! Rock on!</p>
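One thing to watch for in a filter like this: the attribute values in $token->[2] have already had their entities decoded, so an alt value written as "Tom &amp;amp; Jerry" in the source comes out of the hash as "Tom &amp; Jerry". The sketch below is a variation on the filter, not a replacement for it. It re-encodes the alt text with encode_entities from HTML::Entities and returns the result as a string instead of printing it. The helper name imgs_to_alt and the %source_index table are made up for this example.

```perl
use strict;
use warnings;
use HTML::TokeParser;
use HTML::Entities qw(encode_entities);

# Where each token type keeps its original source text
# (the same indexes the filter above prints from).
my %source_index = (S => 4, E => 2, T => 1, C => 1, D => 1, PI => 2);

# Hypothetical helper: return $html with each img tag replaced by its
# alt text, re-encoded so that characters like & stay escaped.
sub imgs_to_alt {
    my ($html) = @_;
    my $stream = HTML::TokeParser->new(\$html)
        or die "Couldn't parse HTML string";
    my $out = '';
    while (my $token = $stream->get_token) {
        my $type = $token->[0];
        if ($type eq 'S' and $token->[1] eq 'img') {
            my $alt = $token->[2]{'alt'};
            $out .= encode_entities($alt) if defined $alt;
        } else {
            # pass every other token through untouched
            $out .= $token->[ $source_index{$type} ];
        }
    }
    return $out;
}

# Sample markup, made up for this sketch
print imgs_to_alt('<p>Me: <img src="tj.jpg" alt="Tom &amp; Jerry"> Rock on!</p>'), "\n";
```

Returning a string rather than printing makes the filter easy to test and to chain with other processing, at the cost of holding the whole output in memory.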