Suppose that the output of our above rewriter is not satisfactory. While its output contains an apparently harmless one-cell one-row table, this is somehow troublesome when the president of the company tries viewing that web page on his cellphone/PDA, which has a typically limited understanding of HTML. Some experimentation shows that any web pages with tables in them will deeply confuse the boss's PDA.
So your task should be changed to this: find the one interesting cell in the table (the td with class="story"), detach it, then replace the table with the td, and delete the table. This is a complex series of actions, but luckily every one of them is directly translatable into an HTML::Element method. The result is Example 10-2.
    use strict;
    use HTML::TreeBuilder;
    my $root = HTML::TreeBuilder->new;
    $root->parse_file('rewriters1/in002.html') || die $!;

    my $good_td = $root->look_down( '_tag', 'td',  'class', 'story', );
    die "No good td?!" unless $good_td;       # sanity checking
    my $big_table = $root->look_down( '_tag', 'table' );
    die "No big table?!" unless $big_table;   # sanity checking

    $good_td->detach;
    $big_table->replace_with($good_td);
      # Yes, there's even a method for replacing one node with another!
    $big_table->delete;   # the table is no longer in the tree, so delete it

    open(OUT, ">rewriters1/out002b.html") || die "Can't write: $!";
    print OUT $root->as_HTML(undef, '  ');   # two-space indent in output
    close(OUT);
    $root->delete;   # done with it, so delete it
The resulting document looks like this:
    <html>
      <head>
        <title>Shatner and Kunis Sweep the Oscars</title>
      </head>
      <body>
        <td class="story">
          <h1>Shatner and Kunis Sweep the Oscars</h1>
          <p>Stars of <cite>American Psycho II</cite> walked [...]
        </td>
        <hr>Copyright 2002, United Lies Syndicate
      </body>
    </html>
One problem, though: we have a td outside of a table. Simply change it from a td element into something innocuous, such as a div, and while we're at it, delete that class attribute:
    $good_td->tag('div');
    $good_td->attr('class', undef);
That makes the output look like this:
    <html>
      <head>
        <title>Shatner and Kunis Sweep the Oscars</title>
      </head>
      <body>
        <div>
          <h1>Shatner and Kunis Sweep the Oscars</h1>
          <p>Stars of <cite>American Psycho II</cite> walked [...]
        </div>
        <hr>Copyright 2002, United Lies Syndicate
      </body>
    </html>
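The two calls that rename the element and drop its attribute can be tried in isolation. The following is only a minimal sketch: it builds a stand-alone td by hand instead of parsing the chapter's input file, and the variable names ($cell, $tag_now, $has_class) are made up for illustration.

```perl
use strict;
use warnings;
use HTML::Element;

# A stand-alone td, built by hand for illustration only
my $cell = HTML::Element->new('td', 'class' => 'story');
$cell->push_content('Some story text');

$cell->tag('div');             # rename the element
$cell->attr('class', undef);   # an undef value removes the attribute

my $tag_now   = $cell->tag;
my $has_class = defined $cell->attr('class') ? 1 : 0;
print "tag: $tag_now, class present: $has_class\n";
print $cell->as_HTML, "\n";

$cell->delete;   # done with it, so delete it
```

Passing undef as the value is how attr( ) is told to remove an attribute rather than set it to an empty string.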
An alternative is not to detach and save the td in the first place, but to detach and save only its content. That's simple enough:
    my @good_content = $good_td->content_list;
    foreach my $c (@good_content) {
      $c->detach if ref $c;
       # text nodes aren't objects, so aren't really "attached" anyhow
    }
The above task is so common that there's a method for it, called detach_content( ), which detaches and returns the content of the node on which it's called. So we can simply modify our program to read:
    my @good_content = $good_td->detach_content;
    $big_table->replace_with(@good_content);
    $big_table->delete;
However you choose to express the node-moving operations, the resulting document looks like this:
    <html>
      <head>
        <title>Shatner and Kunis Sweep the Oscars</title>
      </head>
      <body>
        <h1>Shatner and Kunis Sweep the Oscars</h1>
        <p>Stars of <cite>American Psycho II</cite> walked [...]
        <hr>Copyright 2002, United Lies Syndicate
      </body>
    </html>
In fact, every HTML::Element method that allows you to attach a node someplace (as replace_with( ) does) will first detach that node if it's already attached elsewhere. So you could actually skip the detach_content( ) step entirely and just write this:
    $big_table->replace_with( $good_td->content_list );
    $big_table->delete;
It does the same thing and results in the same output.
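To see the automatic detaching at work, one can count the td's children before and after the replace_with( ) call. This is only a sketch: the HTML string is a made-up stand-in for the chapter's input file, and $before, $after, and $h1_parent are names invented for the demonstration.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical stand-in for the chapter's input file
my $html = <<'END_HTML';
<html><head><title>Demo</title></head>
<body>
<table><tr><td class="story"><h1>Headline</h1><p>Story text</p></td></tr></table>
<hr>Footer
</body></html>
END_HTML

my $root = HTML::TreeBuilder->new;
$root->parse($html);
$root->eof;

my $good_td = $root->look_down('_tag', 'td', 'class', 'story');
die "No good td?!" unless $good_td;
my $big_table = $root->look_down('_tag', 'table');
die "No big table?!" unless $big_table;

my $before = $good_td->content_list;   # scalar context: number of children
$big_table->replace_with( $good_td->content_list );
my $after  = $good_td->content_list;   # replace_with( ) detached them for us

my $h1_parent = $root->look_down('_tag', 'h1')->parent->tag;
print "td children before: $before, after: $after\n";
print "h1 now lives under: $h1_parent\n";

$big_table->delete;
$root->delete;
```

The td ends up empty without any explicit detach call, because the nodes were pulled out of it as they were attached in the table's old position.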
There are some constraints on what you can expect replace_with( ) to do, but these are just three constraints against fairly odd things that you would probably not try anyway. Namely, the documentation says you can't replace an element with multiple instances of itself; you can't replace an element with one (or more) of its siblings; and you can't replace an element that has no parent, because replacing an element inherently means altering the content list of its parent.
Many methods in the HTML::Element documentation have similar constraints spelled out, although the typical programmer will never find them to be an obstacle in and of themselves. If one of those constraints is violated, it is typically a sign that something is conceptually wrong elsewhere in the program.
For example, if you try $element->replace_with(...) and are surprised by an error message that "the target node has no parent," it is almost certainly because you either already replaced the element with something (leaving it parentless) or deleted it (leaving it parentless, contentless, and attributeless). That error message would result, for instance, if our program had this:
    $big_table->delete;
    $big_table->replace_with( $good_td->content_list );   # Wrong!
instead of this:
    $big_table->replace_with( $good_td->content_list );
    $big_table->delete;   # Right.
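The wrong ordering can be reproduced safely by wrapping the call in eval, which traps the fatal error so it can be inspected. What follows is a sketch under assumptions: the one-line HTML string is invented for the demonstration, and $ok and $error are illustrative names, not part of the chapter's program.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Made-up input, just enough to have a table with a story cell
my $root = HTML::TreeBuilder->new;
$root->parse('<html><body><table><tr><td class="story">'
           . '<p>Story</p></td></tr></table></body></html>');
$root->eof;

my $good_td = $root->look_down('_tag', 'td', 'class', 'story');
die "No good td?!" unless $good_td;
my $big_table = $root->look_down('_tag', 'table');
die "No big table?!" unless $big_table;

# Deliberately wrong order: delete first, then try to replace
$big_table->delete;
my $ok = eval {
  $big_table->replace_with( $good_td->content_list );
  1;
};
my $error = $ok ? '' : $@;
print "Caught: $error" unless $ok;

# A deleted (or already replaced) element has no parent,
# so checking for one is a cheap test before calling replace_with( )
print "big_table has no parent\n" unless $big_table->parent;

$root->delete;
```

Note that deleting the table also deletes the td inside it, which is one more reason to do the replacing first and the deleting last.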