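There are five steps to an HTML::TreeBuilder program:
1. Create the HTML::TreeBuilder object.
2. Set the parse options.
3. Parse the HTML.
4. Process the tree.
5. Delete the HTML::TreeBuilder object.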
Example 9-2 is a simple HTML::TreeBuilder program.
#!/usr/bin/perl -w
use strict;
use HTML::TreeBuilder 3;  # make sure our version isn't ancient
my $root = HTML::TreeBuilder->new;
$root->parse(  # parse a string...
q{
   <ul>
     <li>Ice cream.</li>
     <li>Whipped cream.
     <li>Hot apple pie <br>(mmm pie)</li>
   </ul>
});
$root->eof( );  # done parsing for this tree
$root->dump;    # print( ) a representation of the tree
$root->delete;  # erase this tree because we're done with it
Four of the five steps are shown here. The HTML::TreeBuilder class's new( ) constructor creates a new object. We don't set parse options, preferring instead to use the defaults. The parse( ) method parses HTML from a string. It's designed to let you supply HTML in chunks, so you use the eof( ) method to tell the parser when there's no more HTML. The dump( ) method is our processing here, printing a string form of the tree (the output is given in Example 9-3). And finally we delete( ) the tree to free the memory it used.
<html> @0 (IMPLICIT)
  <head> @0.0 (IMPLICIT)
  <body> @0.1 (IMPLICIT)
    <ul> @0.1.0
      <li> @0.1.0.0
        "Ice cream."
      <li> @0.1.0.1
        "Whipped cream. "
      <li> @0.1.0.2
        "Hot apple pie "
        <br> @0.1.0.2.1
        "(mmm pie)"
Each line in the dump represents either an element or a piece of text. Each element is identified by a dotted sequence of numbers (e.g., 0.1.0.2) that gives its position in the tree. The leading 0 stands for the root, and each later number is a child index counted from zero, so 0.1.0.2 is child 2 of child 0 of child 1 of the root: the third <li> in the <ul> in the <body>. The dump also marks some nodes as (IMPLICIT), meaning they weren't present in the HTML fragment but were inferred to make a valid document parse tree.
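The addresses in the dump can also be used from code. The following is a minimal sketch, assuming the same HTML fragment (written without whitespace between tags) and using the HTML::Element address( ) method, which returns a node's address when called without arguments and looks up a node when given an absolute address. The variable names are illustrative only.

```perl
use strict;
use warnings;
use HTML::TreeBuilder 3;

my $root = HTML::TreeBuilder->new;
$root->parse(
    q{<ul><li>Ice cream.</li><li>Whipped cream.<li>Hot apple pie <br>(mmm pie)</li></ul>}
);
$root->eof( );

# Look up the node at an absolute address, then ask it where it lives.
my $node      = $root->address('0.1.0.2');
my $node_tag  = $node->tag;      # tag name of the element found
my $node_addr = $node->address;  # should round-trip to the address we asked for
my $node_text = $node->as_text;  # text of this element and its descendants

print "$node_addr is <$node_tag>: $node_text\n";

$root->delete;
```

Addresses depend on the exact shape of the parsed tree, so they are better suited to debugging against a dump than to long-lived code.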
To create a new empty tree, use the new( ) method:
$root = HTML::TreeBuilder->new( );
To create a new tree and parse the HTML in one go, pass one or more strings to the new_from_content( ) method:
$root = HTML::TreeBuilder->new_from_content([string, ...]);
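As a small illustration, the sketch below passes two strings to new_from_content( ); the strings themselves are made up for the example. The constructor parses each string in order and finishes the parse itself, so no separate eof( ) call is needed.

```perl
use strict;
use warnings;
use HTML::TreeBuilder 3;

# The HTML arrives as two strings that together form one paragraph.
my $root = HTML::TreeBuilder->new_from_content(
    '<p>First <b>chunk</b>,',
    ' then the second.</p>',
);

my $text = $root->as_text;  # all the text in the tree, markup removed
print "$text\n";

$root->delete;
```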
To create a new HTML::TreeBuilder object and parse HTML from a file, pass the filename or a filehandle to the new_from_file( ) method:
$root = HTML::TreeBuilder->new_from_file(filename);
$root = HTML::TreeBuilder->new_from_file(filehandle);
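The filename form might be used as in the following sketch. It writes a throwaway file with File::Temp so that the example is self-contained; the file contents and variable names are assumptions made for illustration.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);
use HTML::TreeBuilder 3;

# Create a small HTML file to read back (a stand-in for a real document).
my ($fh, $filename) = tempfile(SUFFIX => '.html', UNLINK => 1);
print {$fh} "<html><head><title>Dessert menu</title></head>",
            "<body><p>Hot apple pie</p></body></html>\n";
close $fh or die "Can't close $filename: $!";

# Build the tree straight from the file, using the default parse options.
my $root = HTML::TreeBuilder->new_from_file($filename);

my $title_elem = $root->look_down('_tag', 'title');
my $title = $title_elem ? $title_elem->as_text : '(no title)';
print "Title: $title\n";

$root->delete;
```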
If you use new_from_file( ) or new_from_content( ), the parse is carried out with the default parsing options. To parse with any nondefault options, you must use the new( ) constructor and call parse_file( ) or parse( ).
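The sketch below shows that longer route with one nondefault option. It assumes the store_comments( ) option, which is off by default, and relies on stored comments appearing in the tree as pseudo-elements tagged ~comment with the comment body in their text attribute. The HTML string is invented for the example.

```perl
use strict;
use warnings;
use HTML::TreeBuilder 3;

my $root = HTML::TreeBuilder->new;
$root->store_comments(1);  # nondefault: keep comments in the tree
$root->parse('<p>Visible text<!-- a hidden note --></p>');
$root->eof( );

# Collect the body of every comment that was stored during the parse.
my @comments = $root->look_down('_tag', '~comment');
my @notes    = map { $_->attr('text') } @comments;
print "Comment:$_\n" for @notes;

$root->delete;
```

The option has to be set before parse( ) is called, which is why the shortcut constructors can't be used here.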
Set options for the parse by calling methods on the HTML::TreeBuilder object. These methods return the old value for the option and set the value if passed a parameter. For example:
$comments = $root->strict_comment( );
print "Strict comment processing is ";
print $comments ? "on\n" : "off\n";
$root->strict_comment(0);  # disable
Some options affect the way the HTML standard is ignored or obeyed, while others affect the internal behavior of the parser. The full list of parser options follows.
There are two ways of parsing HTML: from a file or from strings.
Pass the parse_file( ) method a filename or filehandle to parse the HTML in that file:
$success = $root->parse_file(filename);
$success = $root->parse_file(filehandle);
For example, to parse HTML from STDIN:
$root->parse_file(*STDIN) or die "Can't parse STDIN";
The parse_file( ) method returns the HTML::TreeBuilder object if successful or undef if an error occurred.
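To see the failure case, the sketch below asks parse_file( ) for a file that is never created. The path is hypothetical and is built inside a fresh temporary directory so that it cannot exist; the error text is whatever Perl leaves in $! after the failed open.

```perl
use strict;
use warnings;
use File::Spec;
use File::Temp qw(tempdir);
use HTML::TreeBuilder 3;

# A path inside a brand-new directory: guaranteed not to exist.
my $dir     = tempdir(CLEANUP => 1);
my $missing = File::Spec->catfile($dir, 'missing.html');

my $root    = HTML::TreeBuilder->new;
my $success = $root->parse_file($missing);

if ($success) {
    print "Parsed $missing\n";
}
else {
    print "Can't parse $missing: $!\n";  # undef came back, so report the error
}

$root->delete;
```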
The parse( ) method takes a chunk of HTML and parses it. Call parse( ) on each chunk, then call the eof( ) method when there's no more HTML to come.
$success = $root->parse(chunk);
$success = $root->eof( );
This method is designed for situations where you are acquiring your HTML one chunk at a time. It's also useful when you're extracting HTML from a larger file and can't simply parse the entire file with parse_file( ). In many cases, you could use new_from_content( ), but recall that new_from_content( ) doesn't give you an opportunity to set nondefault parsing options.
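A chunked parse might look like the following sketch. It reads from an in-memory filehandle in 16-byte pieces purely to force the HTML to arrive in fragments; the chunk size and the HTML are arbitrary choices for the example, and a real program would more likely read from a socket or a larger file.

```perl
use strict;
use warnings;
use HTML::TreeBuilder 3;

my $html = '<ul><li>Ice cream.</li><li>Whipped cream.</li></ul>';
open my $in, '<', \$html or die "Can't open in-memory handle: $!";

my $root   = HTML::TreeBuilder->new;
my $chunks = 0;

# Feed the parser one small chunk at a time; tags may be split across chunks.
while (read($in, my $chunk, 16)) {
    $root->parse($chunk);
    $chunks++;
}
close $in;
$root->eof( );  # no more HTML is coming

my @items = map { $_->as_text } $root->look_down('_tag', 'li');
print "Parsed $chunks chunks\n";
print "Item: $_\n" for @items;

$root->delete;
```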
The delete( ) method frees the tree and its elements, giving the memory it used back to Perl:
$root->delete( );
Use this method in persistent environments such as mod_perl or when your program will parse a lot of HTML files. It's not enough to simply have $root be a private variable that goes out of scope, or to assign a new value to $root. Perl's memory management is based on reference counting, which can't reclaim the circular data structures that HTML::Element builds: each element refers to its children, and each child refers back to its parent.
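In a program that handles many documents, the pattern might look like this sketch, which deletes each tree before building the next. The two inline documents stand in for a longer list of files and are invented for the example.

```perl
use strict;
use warnings;
use HTML::TreeBuilder 3;

# Stand-ins for a long series of documents.
my @documents = (
    '<title>First</title><p>one</p>',
    '<title>Second</title><p>two</p>',
);

my @titles;
for my $html (@documents) {
    my $root  = HTML::TreeBuilder->new_from_content($html);
    my $title = $root->look_down('_tag', 'title');
    push @titles, $title ? $title->as_text : '';
    $root->delete;  # free this tree before building the next one
}

print join(', ', @titles), "\n";
```

Anything needed from a tree, such as the title strings here, should be copied out before delete( ) is called.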