7.3. Individual Tokens

Now that you know the composition of the various types of tokens, let's see how to use HTML::TokeParser to write useful programs. Many problems are quite simple and require only one token at a time. Programs to solve these problems consist of a loop over all the tokens, with an if statement in the body of the loop identifying the interesting parts of the HTML:

use HTML::TokeParser;
my $stream = HTML::TokeParser->new($filename)
  || die "Couldn't read HTML file $filename: $!";
# For a string: HTML::TokeParser->new( \$string_of_html );

while (my $token = $stream->get_token) {
   if ($token->[0] eq 'T') { # text
     # process the text in $token->[1]

   } elsif ($token->[0] eq 'S') { # start-tag
     my($tagname, $attr) = @$token[1,2];
     # consider this start-tag...

   } elsif ($token->[0] eq 'E') { # end-tag
     my $tagname = $token->[1];
     # consider this end-tag
   }

   # ignoring comments, declarations, and PIs
}
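
To see the loop run end to end, here is a minimal sketch that reads from a string instead of a file and tallies the tokens by type. The sample markup and the %count hash are made up for illustration; only new and get_token come from HTML::TokeParser itself.

```perl
use strict;
use warnings;
use HTML::TokeParser;

# Sample markup, invented for this sketch.
my $html = '<!-- note --><p>Hello, <b>world</b>!</p>';

# Passing a reference tells the parser to read from the string itself.
my $stream = HTML::TokeParser->new(\$html)
  || die "Couldn't parse the HTML string";

my %count;
while (my $token = $stream->get_token) {
  $count{ $token->[0] }++;   # the first element is always the token type
}

# One line per token type seen: C, E, S, T, and so on.
print "$_: $count{$_}\n" for sort keys %count;
```

A tally like this is a quick way to check which token types a document actually contains before deciding which branches of the if statement you need.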

7.3.1. Checking Image Tags

Example 7-1 complains about any img tags in a document that are missing alt, height, or width attributes:

Example 7-1. Check <img> tags

while(my $token = $stream->get_token) {
  if($token->[0] eq 'S' and $token->[1] eq 'img') {
    my $i = $token->[2]; # attributes of this img tag
    my @lack = grep !exists $i->{$_}, qw(alt height width);
    print "Missing for ", $i->{'src'} || "????", ": @lack\n" if @lack;
  }
}

When run on an HTML stream (whether from a file or a string), this prints one line for each img tag that lacks any of those attributes, for example:

Missing for liza.jpg: height width
Missing for aimee.jpg: alt
Missing for laurie.jpg: alt height width

Identifying images has many applications: making HEAD requests to ensure the URLs are valid, or making a GET request to fetch the image and using Image::Size from CPAN to check or insert the height and width attributes.
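
As a first step toward such checks, the following sketch collects the src of every img tag and resolves it against the page's address with the URI module. The markup, the base URL, and the @image_urls array are assumptions made for this example, not part of Example 7-1.

```perl
use strict;
use warnings;
use HTML::TokeParser;
use URI;

# Hypothetical page content and page address, invented for illustration.
my $html = '<p><img src="liza.jpg" alt="Liza"> '
         . '<img src="/pics/aimee.jpg"> <img alt="no source"></p>';
my $base = 'http://www.example.com/family/index.html';

my $stream = HTML::TokeParser->new(\$html)
  || die "Couldn't parse the HTML string";

my @image_urls;
while (my $token = $stream->get_token) {
  next unless $token->[0] eq 'S' and $token->[1] eq 'img';
  my $src = $token->[2]{'src'};
  next unless defined $src and length $src;   # skip img tags with no src
  # Turn a relative src into an absolute URL.
  push @image_urls, URI->new_abs($src, $base)->as_string;
}

print "$_\n" for @image_urls;
# Prints:
#   http://www.example.com/family/liza.jpg
#   http://www.example.com/pics/aimee.jpg
```

Each URL in the list could then be handed to the head method of an LWP::UserAgent object, with is_success on the response telling you whether the image is reachable.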

7.3.2. HTML Filters

A similar while loop can use HTML::TokeParser as a simple HTML filter. You just pass through the original source text of each token you don't mean to alter. Here's one that passes through every token that it sees (by just printing its source exactly as HTML::TokeParser passes it in), except for img start-tags, which get replaced with the content of their alt attributes:

while (my $token = $stream->get_token) {
  if ($token->[0] eq 'S') {
    if ($token->[1] eq 'img') {
      print $token->[2]{'alt'} || '';
    } else {
      print $token->[4];
    }
  }
  elsif($token->[0] eq 'E' ) { print $token->[2] }
  elsif($token->[0] eq 'T' ) { print $token->[1] }
  elsif($token->[0] eq 'C' ) { print $token->[1] }
  elsif($token->[0] eq 'D' ) { print $token->[1] }
  elsif($token->[0] eq 'PI') { print $token->[2] }
}

So, for example, a document consisting just of this:

<!-- new entry -->
<p>Dear Diary,
<br>This is me &amp; my balalaika, at BalalaikaCon 1998:
<img src="mybc1998.jpg" alt="BC1998!  WHOOO!"> Rock on!</p>

is then spat out as this:

<!-- new entry -->
<p>Dear Diary,
<br>This is me &amp; my balalaika, at BalalaikaCon 1998:
BC1998!  WHOOO! Rock on!</p>
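
One detail is easy to miss: the parser decodes entities in attribute values, whereas the source text of the other tokens passes through untouched. An alt value written as "Me &amp; my balalaika" therefore arrives as "Me & my balalaika", and printing it as-is would put a bare ampersand into the output. The sketch below is one possible variation, not the filter shown above: it wraps the loop in a hypothetical img_to_alt function that returns a string, and re-encodes the alt text with encode_entities from HTML::Entities. The %source_index table is also an assumption of this sketch, recording where each token type keeps its source text.

```perl
use strict;
use warnings;
use HTML::TokeParser;
use HTML::Entities qw(encode_entities);

# Position of the original source text within each token type.
my %source_index = (S => 4, E => 2, T => 1, C => 1, D => 1, PI => 2);

# Hypothetical helper: returns a copy of $html with each img tag
# replaced by its alt text.
sub img_to_alt {
  my ($html) = @_;
  my $stream = HTML::TokeParser->new(\$html)
    || die "Couldn't parse the HTML string";
  my $out = '';
  while (my $token = $stream->get_token) {
    my $type = $token->[0];
    if ($type eq 'S' and $token->[1] eq 'img') {
      my $alt = $token->[2]{'alt'};
      # Attribute values arrive entity-decoded, so encode them again.
      $out .= encode_entities($alt) if defined $alt;
    }
    elsif (exists $source_index{$type}) {
      # Everything else passes through unchanged.
      $out .= $token->[ $source_index{$type} ];
    }
  }
  return $out;
}

print img_to_alt(
  '<p>Us: <img src="us.jpg" alt="Me &amp; my balalaika"> Rock on!</p>'
), "\n";
# Prints: <p>Us: Me &amp; my balalaika Rock on!</p>
```

Building the result in a string rather than printing it directly also makes the filter easier to reuse, since the caller decides where the output goes.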