Modifying HTML with Trees
Instead of altering nodes or extracting data from them, you will often want to simply delete them. For example, suppose our task is to take complex, image-rich web pages and produce unadorned text-only versions of them, such as one would print out or paste into email. Each document in question has one big table with three rows, like this:
    <html>
    <head><title>Shatner and Kunis Sweep the Oscars</title></head>
    <body>
      <table>
        <tr class="top_button_bar">
          ...appalling amounts of ad banners and button bars...
        </tr>
        <tr class="main">
          <td class="left_geegaws">
            ...yet more ads and button bars...
          </td>
          <td class="story">
            <h1>Shatner and Kunis Sweep the Oscars</h1>
            <img src="shatner_kunis_awards.jpg" align=left>
            <p>Stars of <cite>American Psycho II</cite> walked away with
            four Academy Awards...
          </td>
          <td class="right_geegaws">
            ...even more ads...
          </td>
        </tr>
        <tr class="bottom_button_bar">
          ...ads, always ads...
        </tr>
      </table>
      <hr>Copyright 2002, United Lies Syndicate
    </html>
The simplified version of such a page should omit all images, as well as every element whose class is top_button_bar, bottom_button_bar, left_geegaws, or right_geegaws. This can be implemented with a simple call to look_down:
    use HTML::TreeBuilder;
    my $root = HTML::TreeBuilder->new;
    $root->parse_file('rewriters1/in002.html') || die $!;

    foreach my $d ($root->look_down(
      sub {
        return 1 if $_[0]->tag eq 'img';   # we're looking for images

        # no class means ignore it
        my $class = $_[0]->attr('class') || return 0;
        return 1 if $class eq 'top_button_bar'    or $class eq 'right_geegaws'
                 or $class eq 'bottom_button_bar' or $class eq 'left_geegaws';
        return 0;
      }
    )) {
      $d->delete;
    }

    open(OUT, ">rewriters1/out002.html") || die "Can't write: $!";
    print OUT $root->as_HTML(undef, '  ');   # two-space indent in output
    close(OUT);
    $root->delete;   # done with it, so delete it
The call to $d->delete detaches the node in $d from its parent, then destroys it along with all its descendant nodes. The resulting file looks like this:
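To make the effect of delete concrete, here is a minimal sketch that counts a row's children before and after one of its cells is deleted. The markup is a small invented stand-in for the page above, not the book's in002.html, and the variable names are only illustrative.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# A tiny stand-in document (hypothetical sample markup).
my $root = HTML::TreeBuilder->new;
$root->parse(
    '<html><body><table><tr>'
  . '<td class="left_geegaws">ads <img src="ad.gif"></td>'
  . '<td class="story"><p>Text</p></td>'
  . '</tr></table></body></html>'
);
$root->eof;

# In scalar context, look_down returns the first matching element.
my $tr = $root->look_down('_tag', 'tr');

my @kids_before = $tr->content_list;   # the row's children before deletion
my $before      = scalar @kids_before;

my $ad = $root->look_down('class', 'left_geegaws');
$ad->delete;                           # detach from $tr, then destroy the subtree

my @kids_after = $tr->content_list;    # the row's children afterward
my $after      = scalar @kids_after;

# The img inside the deleted cell went away with it.
my @imgs = $root->look_down('_tag', 'img');

print "children of tr: $before before, $after after\n";
print "images left in tree: ", scalar(@imgs), "\n";

$root->delete;
```

The row loses the deleted cell, and the image inside that cell is no longer reachable from the tree, since it was destroyed along with its parent.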
    <html>
      <head>
        <title>Shatner and Kunis Sweep the Oscars</title>
      </head>
      <body>
        <table>
          <tr class="main">
            <td class="story">
              <h1>Shatner and Kunis Sweep the Oscars</h1>
              <p>Stars of <cite>American Psycho II</cite> walked [...]
            </td>
          </tr>
        </table>
        <hr>Copyright 2002, United Lies Syndicate
      </body>
    </html>
One pragmatic point here: the list returned by the look_down( ) call will contain the two unwanted tr elements and the two unwanted td elements, any images they contain, and also images elsewhere in the document. When we delete one of those tr or td nodes, we are also implicitly deleting every one of its descendant nodes, including some img elements that we are about to hit in a subsequent iteration through look_down( )'s return list.
This isn't a problem in this case, because deleting an already deleted node is a harmless no-operation. The larger point here is that when look_down( ) finds a matching node (as with a left_geegaws td node, in our example), that doesn't stop it from looking below that node for more matches. If you need that kind of behavior, you'll need to implement it in your own traverser, as discussed in Chapter 9, "HTML Processing with Trees".
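As a rough illustration of such a traverser, the sketch below returns only the topmost matches: once a node satisfies the criterion, nothing beneath it is examined. The helper name topmost_matches is hypothetical (it is not part of HTML::Element), and the sample markup is invented for the demonstration.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical helper: like look_down() in list context, except that once
# a node matches, its descendants are not examined.
sub topmost_matches {
    my ($node, $criterion) = @_;
    return $node if $criterion->($node);    # match: report it and stop here
    return map  { topmost_matches($_, $criterion) }
           grep { ref $_ }                  # skip text segments (plain strings)
           $node->content_list;
}

# Same test as the look_down() callback in the program above.
my $wanted = sub {
    my $elem = shift;
    return 1 if $elem->tag eq 'img';
    my $class = $elem->attr('class');
    return 0 unless defined $class;
    return 1 if $class eq 'top_button_bar'    or $class eq 'right_geegaws'
             or $class eq 'bottom_button_bar' or $class eq 'left_geegaws';
    return 0;
};

# Invented sample markup, a cut-down version of the page above.
my $root = HTML::TreeBuilder->new;
$root->parse(<<'END_HTML');
<html><body><table>
<tr class="top_button_bar"><td><img src="banner.gif"></td></tr>
<tr class="main">
  <td class="left_geegaws"><img src="ad.gif"></td>
  <td class="story"><img src="photo.jpg"><p>Story text</p></td>
</tr>
</table></body></html>
END_HTML
$root->eof;

my @all     = $root->look_down($wanted);        # includes nested matches
my @topmost = topmost_matches($root, $wanted);  # nested matches pruned

print "look_down found ", scalar(@all), " nodes\n";
print "topmost_matches found ", scalar(@topmost), " nodes\n";

$_->delete for @topmost;                        # each node is deleted only once
my @left = $root->look_down($wanted);
print "matches remaining: ", scalar(@left), "\n";

$root->delete;
```

With this approach every node in the returned list is still attached when its turn comes, so the loop never touches an already-deleted node. The cost is a recursive function of your own instead of a single look_down( ) call.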