Chapter 7. HTML Processing with Tokens

Contents:

HTML as Tokens
Basic HTML::TokeParser Use
Individual Tokens
Token Sequences
More HTML::TokeParser Methods
Using Extracted Text

Regular expressions are powerful, but they're a painfully low-level way of dealing with HTML. You're forced to worry about spaces and newlines, single and double quotes, HTML comments, and a lot more. The next step up from a regular expression is an HTML tokenizer. In this chapter, we'll use HTML::TokeParser to extract information from HTML files. Using these techniques, you can pull data out of any HTML file and never again have to worry about the character-level trivia of HTML markup.

7.1. HTML as Tokens

Your experience with HTML code probably involves seeing raw text such as this:

<p>Dear Diary,
<br>I'm gonna be a superstar, because I'm learning to play
the <a href="http://MyBalalaika.com">balalaika</a> &amp; the <a
href='http://MyBazouki.com'>bazouki</a>!!!

The HTML::TokeParser module divides the HTML into tokens, the basic units of parsing. The source code above is parsed as this series of tokens:

start-tag token
    p, with no attributes
text token
    "Dear Diary,\n"
start-tag token
    br, with no attributes
text token
    "I'm gonna be a superstar, because I'm learning to play\nthe "
start-tag token
    a, with attribute href whose value is http://MyBalalaika.com
text token
    "balalaika"
end-tag token
    a
text token
    " &amp; the ", which means " & the "
start-tag token
    a, with attribute href whose value is http://MyBazouki.com
text token
    "bazouki"
end-tag token
    a
text token
    "!!!\n"

This representation is more abstract: it focuses on markup concepts rather than individual characters. So although the two <a> tags use different kinds of quotes around their attribute values in the raw HTML, as tokens each is simply a start-tag of type a with an href attribute of a particular value. A program that extracts information from a stream of tokens doesn't have to worry about the idiosyncrasies of entity encoding, whitespace, and quoting, or work out where a tag ends.
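To make the token view concrete, here is a minimal sketch that feeds the diary snippet to HTML::TokeParser and walks the resulting tokens. The variable names ($html, $stream, @tags, @links, $text) are chosen for this illustration only, and the snippet is held in a string rather than read from a file. The sketch assumes the usual token layout, in which a start-tag token holds 'S', the tag name, and a hash of attributes, and a text token holds 'T' and the text. Text tokens are assumed to arrive with entities still encoded, so the sketch decodes them with decode_entities from HTML::Entities.

```perl
use strict;
use warnings;
use HTML::TokeParser;
use HTML::Entities qw(decode_entities);

# The diary snippet from above, held in a string for this sketch.
my $html = <<'END_HTML';
<p>Dear Diary,
<br>I'm gonna be a superstar, because I'm learning to play
the <a href="http://MyBalalaika.com">balalaika</a> &amp; the <a
href='http://MyBazouki.com'>bazouki</a>!!!
END_HTML

# Passing a reference to a string makes the parser read from that string.
my $stream = HTML::TokeParser->new(\$html)
  or die "Couldn't create a token stream";

my @tags;        # names of the start-tags, in document order
my @links;       # href values of the <a> start-tags
my $text = '';   # all text tokens, with entities decoded

while (my $token = $stream->get_token) {
    my $type = $token->[0];
    if ($type eq 'S') {
        # Start-tag token: type, tag name, attribute hash, ...
        my ($tag, $attr) = @{$token}[1, 2];
        push @tags, $tag;
        # Both quote styles produce the same kind of attribute value.
        push @links, $attr->{href}
          if $tag eq 'a' and defined $attr->{href};
    }
    elsif ($type eq 'T') {
        # Text token: type, raw text (entities not yet decoded).
        $text .= decode_entities($token->[1]);
    }
    # End-tag and other token types are ignored in this sketch.
}

print "Start-tags: @tags\n";
print "Link: $_\n" for @links;
print "Text: $text";
```

Run against the snippet, this should report both links in the same form, even though one href is double-quoted and the other is single-quoted and split across a line break. Nothing in the loop looks at quotes, angle brackets, or line breaks inside tags; the only character-level step left is the explicit call to decode_entities.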