Once you have parsed some HTML, you need to process it. Exactly what you do will depend on the nature of your problem. Two common models are extracting information and producing a transformed version of the HTML (for example, to remove banner advertisements).
Whether extracting or transforming, you'll probably want to find the bits of the document you're interested in. They might be all headings, all bold italic regions, or all paragraphs with class="blinking". HTML::Element provides several functions for searching the tree.
In scalar context, these methods return the first node that satisfies the criteria. In list context, all such nodes are returned. The methods can be called on the root of the tree or any node in it.
@headings = $root->find_by_tag_name('h1', 'h2');
@blinkers = $root->find_by_attribute("class", "blinking");
For example, to find all h2 nodes in the tree with class="blinking":
@blinkers = $root->look_down(_tag => 'h2', class => 'blinking');
We'll discuss look_down in greater detail later.
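The following is a minimal sketch of the search methods above, showing the difference between list and scalar context. The sample HTML and the variable names are invented for illustration; only find_by_tag_name, find_by_attribute, look_down, and as_text come from the text.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Sample document (hypothetical content, for illustration only)
my $root = HTML::TreeBuilder->new_from_content(<<'EOHTML');
<h1>News</h1>
<h2 class="blinking">Late edition</h2>
<p class="blinking">Updated hourly</p>
<h2>Archive</h2>
EOHTML

# List context: every matching node at or under $root
my @headings = $root->find_by_tag_name('h1', 'h2');
my @blinkers = $root->find_by_attribute('class', 'blinking');

# Scalar context: only the first matching node
my $first_h2 = $root->find_by_tag_name('h2');

# look_down combines criteria: tag name and attribute must both match
my @blinking_h2 = $root->look_down(_tag => 'h2', class => 'blinking');

print scalar(@headings), " headings, ", scalar(@blinkers), " blinkers\n";
print "first h2: ", $first_h2->as_text, "\n";
print "blinking h2: ", $_->as_text, "\n" for @blinking_h2;
```

Calling the same methods on a node deeper in the tree restricts the search to that node and its descendants.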
Four methods give access to the basic information in a node:
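As a rough sketch of node accessors, the snippet below uses tag, attr, parent, and content_list, all of which HTML::Element provides. Treating these as the four methods in question is an assumption, and the sample markup is invented.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Sample markup (hypothetical)
my $root = HTML::TreeBuilder->new_from_content(
    '<p class="intro">Hello <b>world</b>!</p>'
);
my $para = $root->find_by_tag_name('p');

my $tag        = $para->tag;            # the element's tag name
my $class      = $para->attr('class');  # the value of one attribute
my $parent_tag = $para->parent->tag;    # the enclosing element's tag name
my @kids       = $para->content_list;   # children: elements and text segments

print "tag=$tag class=$class parent=$parent_tag\n";
foreach my $kid (@kids) {
    if (ref $kid) {
        print "element: ", $kid->tag, "\n";   # child elements are objects
    }
    else {
        print "text: $kid\n";                 # text segments are plain strings
    }
}
```

The ref test in the loop is the usual way to tell an element from a text segment, since text children are ordinary strings rather than objects.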
Four more methods convert a tree or part of a tree into another format, such as HTML or text.
$html = $node->as_HTML("", "", {});
For example, this will emit </li> tags for any li nodes under $node, even though </li> tags are technically optional, according to the HTML specification.
Using $node->as_HTML( ) with no parameters should be fine for most purposes.
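The conversion methods can be sketched as follows, assuming the four are as_HTML, as_text, starttag, and endtag (HTML::Element provides all of them, but the original list is not shown here). The sample markup deliberately leaves out the optional </li> tags.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Sample list with the optional </li> end tags omitted (hypothetical content)
my $root = HTML::TreeBuilder->new_from_content(
    '<ul><li class="first">One <b>fish</b><li>Two fish</ul>'
);
my $item = $root->look_down(_tag => 'li', class => 'first');

my $text  = $item->as_text;    # text content with the markup removed
my $open  = $item->starttag;   # the start tag, including attributes
my $close = $item->endtag;     # the end tag

my $html_default = $root->as_HTML;               # default settings
my $html_all     = $root->as_HTML("", "", {});   # emit every end tag

print "$text\n$open ... $close\n";
print $html_all;
```

Comparing $html_default with $html_all on your own documents is a quick way to see which end tags the default settings leave out.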
These methods are useful once you've found the desired content. Example 9-4 prints all the bold italic text in a document.
#!/usr/bin/perl -w

use HTML::TreeBuilder;
use strict;

my $root = HTML::TreeBuilder->new_from_content(<<"EOHTML");
<b><i>Shatner wins Award!</i></b>
Today in <b>Hollywood</b> ...
<b><i>End of World Predicted!</i></b>
Today in <b>Washington</b> ...
EOHTML
$root->eof( );

# print contents of <b><i>...</i></b>
my @bolds = $root->find_by_tag_name('b');
foreach my $node (@bolds) {
  my @kids = $node->content_list( );
  if (@kids and ref $kids[0] and $kids[0]->tag( ) eq 'i') {
    print $kids[0]->as_text( ), "\n";
  }
}
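Example 9-4 is fairly straightforward. Having parsed the string into a new tree, we get a list of all the bold nodes. Some of these will be the headlines we want, while others will simply be bolded text. In this case, we can identify a headline by checking that the first thing the bold node contains is an <i>...</i> element. If it is, we print its text content.
The only complicated part of Example 9-4 is the test to see whether it's an interesting node. This test has three parts:
@kids
True if the node has any children. An empty <b></b> fails this test.
ref $kids[0]
True if the first child is an element rather than a text segment. This is false for <b>Hollywood</b>, whose first child is plain text.
$kids[0]->tag( ) eq 'i'
True if that first child element is an <i>...</i> element.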
For many tasks, you can use the built-in search functions. Sometimes, though, you'd like to visit every node of the tree. You have two choices: you can use the existing traverse( ) function or write your own using either recursion or your own stack.
The act of visiting every node in a tree is called a traversal. Traversals can either be preorder (where you process the current node before processing its children) or postorder (where you process the current node after processing its children). The traverse( ) method lets you do both:
$node->traverse(callbacks [, ignore_text]);
The traverse( ) method calls a callback before processing a node's children and again afterward. If the callbacks parameter is a single function reference, the same function is called both times. If the callbacks parameter is an array reference, the first element is a reference to a function called before the children are processed, and the second element is a reference to a function called after the children are processed. The second function is not called when the node is a text segment or an element that is prototypically empty, such as br or hr. (This last quirk of the traverse( ) method is one of the reasons that I discourage its use.)
Callbacks get called with up to five values:
sub callback {
  my ($node, $startflag, $depth, $parent, $my_index) = @_;
  # ...
}
The current node is the first parameter. The next is a Boolean value indicating whether we're being called before (true) or after (false) the children, and the third is a number indicating how deep into the traversal we are. The fourth and fifth parameters are supplied only for text elements: the parent node object and the index of the current node in its parent's list of children.
A callback can return any of the following values:
For example, to extract text from a node but not go into table elements:
my $text;
sub text_no_tables {
  return if ref $_[0] && $_[0]->tag eq 'table';  # false value: skip tables
  $text .= $_[0] unless ref $_[0];  # only append text nodes
  return 1;                         # all is copacetic
}
$root->traverse([\&text_no_tables]);
This prevents descent into the contents of tables, while accumulating the text nodes in $text.
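To show the callback parameters and return values working together, here is a small sketch that passes both a pre-order and a post-order callback to traverse( ). The names pre_visit, post_visit, and @events are hypothetical, as is the sample markup. The sketch relies on a false return value from the pre-order callback skipping both the children and the post-order call for that node.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Sample markup (hypothetical)
my $root = HTML::TreeBuilder->new_from_content(
    '<p>Hello <b>world</b></p><table><tr><td>skipped</td></tr></table>'
);

my @events;    # records the order in which nodes are visited

# Pre-order callback: runs before a node's children are visited
sub pre_visit {
    my ($node, $startflag, $depth) = @_;
    if (ref $node) {
        return 0 if $node->tag eq 'table';    # false value: skip this subtree
        push @events, ('  ' x $depth) . 'enter ' . $node->tag;
    }
    else {
        push @events, ('  ' x $depth) . "text '$node'";
    }
    return 1;                                 # true value: keep going
}

# Post-order callback: runs after a node's children have been visited
sub post_visit {
    my ($node, $startflag, $depth) = @_;
    return 1 unless ref $node;                # defensive: ignore text segments
    push @events, ('  ' x $depth) . 'leave ' . $node->tag;
    return 1;
}

$root->traverse([ \&pre_visit, \&post_visit ]);
print "$_\n" for @events;
```

The printed outline indents each line by its depth, which makes the before-and-after pairing of the two callbacks easy to see.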
It can be hard to think in terms of callbacks, though, and the multiplicity of return values and calling parameters you get with traverse( ) makes for confusing code, as you will likely note when you come across its use in existing programs that use HTML::TreeBuilder.
Instead, it's usually easier and clearer to simply write your own recursive subroutine, like this one:
my $text = '';
sub scan_for_non_table_text {
  my $element = $_[0];
  return if $element->tag eq 'table';   # prune!
  foreach my $child ($element->content_list) {
    if (ref $child) {                   # it's an element
      scan_for_non_table_text($child);  # recurse!
    } else {                            # it's a text node!
      $text .= $child;
    }
  }
  return;
}
scan_for_non_table_text($root);
Alternatively, implement it using a stack, doing the same work:
my $text = '';
my @stack = ($root);  # where to start
while (@stack) {
  my $node = shift @stack;
  next if ref $node and $node->tag eq 'table';  # skip tables
  if (ref $node) {
    unshift @stack, $node->content_list;  # add children
  } else {
    $text .= $node;  # add text
  }
}
The while( ) loop version can be faster than the recursive version, but at the cost of being much less clear to people who are unfamiliar with this technique. If speed is a concern, you should always benchmark the two versions to make sure you really need the speedup and that the while( ) loop version actually delivers. The speed difference is sometimes insignificant. The manual page perldoc HTML::Element::traverse discusses writing more complex traverser routines, in the rare cases where you might find this necessary.
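One way to run such a comparison is with the standard Benchmark module. This is only a sketch: text_recursive and text_stack are hypothetical variants of the two routines above, rewritten to return the text instead of appending to a shared variable, and the repeated sample markup stands in for a real document.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);
use HTML::TreeBuilder;

# Repeated sample markup (hypothetical) to give the traversals some work
my $root = HTML::TreeBuilder->new_from_content(
    '<p>Keep <b>this</b></p><table><tr><td>not this</td></tr></table>' x 50
);

# Recursive version: returns the non-table text under $element
sub text_recursive {
    my ($element) = @_;
    return '' if $element->tag eq 'table';    # prune tables
    my $text = '';
    foreach my $child ($element->content_list) {
        $text .= ref($child) ? text_recursive($child) : $child;
    }
    return $text;
}

# Stack version: same result, using an explicit to-do list
sub text_stack {
    my ($start) = @_;
    my $text  = '';
    my @stack = ($start);
    while (@stack) {
        my $node = shift @stack;
        if (ref $node) {
            next if $node->tag eq 'table';          # skip tables
            unshift @stack, $node->content_list;    # add children
        }
        else {
            $text .= $node;                         # add text
        }
    }
    return $text;
}

# Check that both versions agree before timing them
die "versions disagree" unless text_recursive($root) eq text_stack($root);

# Run each version for at least one CPU second and print a comparison table
cmpthese(-1, {
    recursive => sub { text_recursive($root) },
    stack     => sub { text_stack($root) },
});
```

The results depend heavily on the shape of the document and the Perl version, so it is worth timing against trees that resemble your real input.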