The HTML::TokeParser module is a class for accessing HTML as tokens. An HTML::TokeParser object gives you one token at a time, much as a filehandle gives you one line at a time from a file. The HTML can be tokenized from a file or string. The tokenizer decodes entities in attributes, but not entities in text.
Create a token stream object using one of these two constructors:
my $stream = HTML::TokeParser->new($filename) || die "Couldn't read HTML file $filename: $!";
or:
my $stream = HTML::TokeParser->new( \$string_of_html );
Once you have that stream object, you get the next token by calling:
my $token = $stream->get_token( );
The $token variable then holds an array reference, or undef if there's nothing left in the stream's file or string. This code processes every token in a document:
my $stream = HTML::TokeParser->new($filename)
  || die "Couldn't read HTML file $filename: $!";
while (my $token = $stream->get_token) {
  # ... consider $token ...
}
The $token can have one of six kinds of values, distinguished first by the value of $token->[0], as shown in Table 7-1.
Token | Values |
---|---|
Start-tag | ["S", $tag, $attribute_hashref, $attribute_order_arrayref, $source] |
End-tag | ["E", $tag, $source] |
Text | ["T", $text, $should_not_decode] |
Comment | ["C", $source] |
Declaration | ["D", $source] |
Processing instruction | ["PI", $content, $source] |
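As a minimal sketch of how the first element drives processing, the following loop tallies tokens by type. The sample HTML string and the %count hash are illustrative assumptions, not part of the module.

```perl
use strict;
use warnings;
use HTML::TokeParser;

# Hypothetical sample document, held in a string.
my $html = '<p>Hello <b>world</b>!</p><!-- note -->';

my $stream = HTML::TokeParser->new(\$html)
  or die "Couldn't create token stream";

# Tally tokens by their type code: S, E, T, C, D, or PI.
my %count;
while (my $token = $stream->get_token) {
  $count{ $token->[0] }++;
}

print "$_: $count{$_}\n" for sort keys %count;
```

For this input the tally should show two start-tags, two end-tags, three text tokens, and one comment.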
If $token->[0] is "S", the token represents a start-tag:
["S", $tag, $attribute_hash, $attribute_order_arrayref, $source]
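The components of this token are:
$tag: the tag name, in lowercase.
$attribute_hash: a reference to a hash of the tag's attributes, keyed by lowercased attribute name.
$attribute_order_arrayref: a reference to an array of the lowercased attribute names, in the order they appeared in the tag.
$source: the original HTML of the tag.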
The first three values are the most interesting ones, for most purposes.
For example, parsing this HTML:
<IMG SRC="kirk.jpg" alt="Shatner in r&ocirc;le of Kirk" WIDTH=352 height=522>
gives this token:
[
  'S',
  'img',
  {
    'alt'    => 'Shatner in rôle of Kirk',
    'height' => '522',
    'src'    => 'kirk.jpg',
    'width'  => '352'
  },
  [ 'src', 'alt', 'width', 'height' ],
  '<IMG SRC="kirk.jpg" alt="Shatner in r&ocirc;le of Kirk" WIDTH=352 height=522>'
]
Notice that the tag and attribute names have been lowercased, and that the &ocirc; entity has been decoded within the alt attribute value, while the $source element keeps the original HTML unchanged.
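A short sketch of pulling attributes out of start-tag tokens follows. The @images array and its field names (src, alt, order) are hypothetical, chosen only for illustration.

```perl
use strict;
use warnings;
use HTML::TokeParser;

my $html = '<IMG SRC="kirk.jpg" alt="Shatner as Kirk" WIDTH=352 height=522>';

my $stream = HTML::TokeParser->new(\$html)
  or die "Couldn't create token stream";

my @images;
while (my $token = $stream->get_token) {
  my ($type, $tag, $attr, $attr_order) = @$token;

  # Only start-tags carry attributes; skip everything else.
  next unless $type eq 'S' and $tag eq 'img';

  push @images, {
    src   => $attr->{src},      # keys are lowercased attribute names
    alt   => $attr->{alt},
    order => [ @$attr_order ],  # names in their original order
  };
}

print "$_->{src} ($_->{alt})\n" for @images;
```

Because attribute names arrive lowercased, the lookup uses 'src' even though the source HTML spelled it SRC.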
When $token->[0] is "E", the token represents an end-tag:
[ "E", $tag, $source ]
The components of this token are:
$tag: the tag name, in lowercase.
$source: the original HTML of the end-tag.
Parsing this HTML:
</A>
gives this token:
[ 'E', 'a', '</A>' ]
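One way to use start-tag and end-tag tokens together is to keep a stack of open elements. This is only a sketch: it assumes well-nested HTML with explicit end-tags, which real-world pages often lack, and the names @open and $max_depth are made up for the example.

```perl
use strict;
use warnings;
use HTML::TokeParser;

my $html = '<ul><li><a href="a.html">A</a></li></ul>';

my $stream = HTML::TokeParser->new(\$html)
  or die "Couldn't create token stream";

my @open;           # names of currently open elements
my $max_depth = 0;  # deepest nesting seen so far

while (my $token = $stream->get_token) {
  my ($type, $tag) = @$token;
  if ($type eq 'S') {
    push @open, $tag;
    $max_depth = @open if @open > $max_depth;
  }
  elsif ($type eq 'E') {
    # Pop only when the end-tag matches the innermost open element.
    pop @open if @open and $open[-1] eq $tag;
  }
}

print "Maximum nesting depth: $max_depth\n";
```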
When $token->[0] is "T", the token represents text:
["T", $text, $should_not_decode]
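The elements of this array are:
$text: the text itself, with any entities left undecoded.
$should_not_decode: a Boolean value that is true if the entities in $text should not be decoded.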
Tokenizing this HTML:
 &amp; the
gives this token:
[ 'T', ' &amp; the', '' ]
The empty string is a false value, indicating that there's nothing stopping us from decoding $text with decode_entities( ) from HTML::Entities:
decode_entities($token->[1]) unless $token->[2];
Text inside <script>, <style>, <xmp>, <listing>, and <plaintext> tags is not supposed to be entity-decoded. It is for such text that $should_not_decode is true.
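The following sketch applies that rule to a small, made-up document containing both ordinary text and a script element. The @texts array is an assumption for illustration.

```perl
use strict;
use warnings;
use HTML::TokeParser;
use HTML::Entities qw(decode_entities);

my $html = '<p>Kirk &amp; Spock</p><script>if (a &amp;&amp; b) {}</script>';

my $stream = HTML::TokeParser->new(\$html)
  or die "Couldn't create token stream";

my @texts;
while (my $token = $stream->get_token) {
  next unless $token->[0] eq 'T';
  my ($text, $should_not_decode) = @$token[1, 2];

  # Decode entities only in ordinary text, not in script/style content.
  $text = decode_entities($text) unless $should_not_decode;
  push @texts, $text;
}

print "$_\n" for @texts;
```

The paragraph text comes out with a literal ampersand, while the script content is left exactly as written.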
When $token->[0] is "C", you have a comment token:
["C", $source]
The $source component of the token holds the original HTML of the comment. Most programs that process HTML simply ignore comments.
Parsing this HTML:
<!-- Shatner's best known rôle -->
gives us this $token value:
[
  'C',                                   #0: we're a comment
  "<!-- Shatner's best known rôle -->"   #1: source
]
When $token->[0] is "D", you have a declaration token:
["D", $source]
The $source element of the array is the HTML of the declaration. Declarations rarely occur in HTML, and when they do, they are rarely of any interest. Almost all programs that process HTML ignore declarations.
This HTML:
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2 Final//EN">
gives this token:
[ 'D', '<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2 Final//EN">' ]
When $token->[0] is "PI", the token represents a processing instruction:
[ "PI", $instruction, $source ]
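The components are:
$instruction: the content of the processing instruction, without the surrounding delimiters.
$source: the original HTML of the processing instruction.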
A processing instruction is an SGML construct rarely used in HTML. Most programs extracting information from HTML ignore processing instructions. If you do handle processing instructions, be warned that in SGML (and thus HTML) a processing instruction ends with a greater-than (>), but in XML (and thus XHTML), a processing instruction ends with a question mark and a greater-than sign (?>).
Tokenizing:
<?subliminal message>
gives:
[ 'PI', 'subliminal message', '<?subliminal message>' ]
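Since the original source sits at a different index for each token type, a small lookup table can help when you need to reproduce or filter the input. The sketch below rebuilds a made-up document from its tokens and also builds a copy with comments, declarations, and processing instructions dropped. The %source_index hash and both output variables are hypothetical names.

```perl
use strict;
use warnings;
use HTML::TokeParser;

my $html = '<!DOCTYPE html><?subliminal message><!-- hi -->'
         . '<a href="x.html">Kirk &amp; Spock</a>';

# Position of the original source text within each kind of token.
my %source_index = (S => 4, E => 2, T => 1, C => 1, D => 1, PI => 2);

my $stream = HTML::TokeParser->new(\$html)
  or die "Couldn't create token stream";

my ($rebuilt, $stripped) = ('', '');
while (my $token = $stream->get_token) {
  my $type   = $token->[0];
  my $source = $token->[ $source_index{$type} ];

  $rebuilt .= $source;

  # Most programs ignore comments, declarations, and PIs.
  $stripped .= $source unless $type =~ /^(?:C|D|PI)$/;
}

print "$stripped\n";
```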