Attaching in Another Tree (Perl & LWP)

So far we've detached elements from one part of a tree and attached them elsewhere in the same tree. But there's nothing stopping you from attaching them in other trees.

For example, consider a case like the above example, where we extract the text in the <td class="story"> ... </td> element, but this time, instead of attaching it elsewhere in the same document tree, we're attaching it at a certain point in a different tree that we're using as a template. The template document looks like this:

You'll note that the web designers have helpfully inserted comments to denote where the inserted content should start and end. But when you have HTML::TreeBuilder parse the document with default parse options and dump the tree, you don't see any sign of the comments:

10.4.1. Retaining Comments

However, storing comments is controlled by an HTML::TreeBuilder parse option, store_comments( ), which is off by default. If we parse the file like so:

use strict;
use HTML::TreeBuilder;
my $template_root = HTML::TreeBuilder->new;
$template_root->store_comments(1);
$template_root->parse_file('rewriters1/template1.html')
 || die "Can't read template file: $!";
 
$template_root->dump;

the comments now show up in the parse tree:

<html> @0
  <head> @0.0
    <title> @0.0.0
      "Put the title here"
  <body> @0.1
    <!-- printable version --> @0.1.0
    <blockquote> @0.1.1
      <font size="-1"> @0.1.1.0
        <!-- start --> @0.1.1.0.0
        " ...put the content here... "
        <!-- end --> @0.1.1.0.2
        <hr> @0.1.1.0.3
        "Copyright 2002. Printed from the United Lies Syndicate web site. "

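To see the option's effect without a template file on disk, here is a minimal sketch that parses a small inline string with store_comments( ) off and then on, and counts the comment nodes that survive. The inline HTML is only a stand-in for the real template, and count_comments( ) is a hypothetical helper written for this illustration, not part of HTML::TreeBuilder. Comments are looked up under the tag name ~comment, which the next section explains.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Inline stand-in for the template file (an assumption of this sketch).
my $html = '<html><head><title>Put the title here</title></head>'
         . '<body><!-- start -->...put the content here...<!-- end -->'
         . '</body></html>';

# Hypothetical helper: parse $source with store_comments set to $store,
# and return how many comment nodes ended up in the tree.
sub count_comments {
  my($source, $store) = @_;
  my $root = HTML::TreeBuilder->new;
  $root->store_comments($store);   # has to be set before parsing
  $root->parse($source);
  $root->eof;                      # signal that there's no more input
  my @comments = $root->find_by_tag_name('~comment');
  my $count = @comments;
  $root->delete;                   # we only needed the count
  return $count;
}

my $without = count_comments($html, 0);
my $with    = count_comments($html, 1);
print "Comments kept with store_comments(0): $without\n";
print "Comments kept with store_comments(1): $with\n";
```

With the option off, the tree should hold no comment nodes at all; with it on, both comments from the inline string should be there.
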
10.4.2. Accessing Comments

What's left is to figure out how to take out what's between the <!-- start --> and <!-- end --> comments, to insert whatever content needs to be put in there, then to write out the document. First we need to find the comments, and to do that we need to figure out how comments are stored in the tree, because so far we've only dealt with elements and bits of text.

Mercifully, what we know about element objects in trees still applies, because that's how comments are stored: as element objects. But because comments aren't actual elements, the HTML::Element documentation refers to them as pseudoelements, and they are given a tag name that no real element could have: ~comment. The actual content of the comment (" start ") is stored as the value of the text attribute. In other words, <!-- start --> is stored as if it were <~comment text=' start '></~comment>. So finding comments is straightforward:

foreach my $c ($template_root->find_by_tag_name('~comment')) {
  print "A comment has text [", $c->attr('text'), "].\n";
}

That prints this:

A comment has text [ printable version ].
A comment has text [ start ].
A comment has text [ end ].

Finding the start and end comments is a matter of filtering those comments:

use strict;
use HTML::TreeBuilder;
my $template_root = HTML::TreeBuilder->new;
$template_root->store_comments(1);
$template_root->parse_file('rewriters1/template1.html')
 || die "Can't read template file: $!";
 
my($start_comment, $end_comment);
foreach my $c ($template_root->find_by_tag_name('~comment')) {
  if($c->attr('text') =~ m/^\s*start\s*$/) {
    $start_comment = $c;
  } elsif($c->attr('text') =~ m/^\s*end\s*$/) {
    $end_comment = $c;
  }
}
die "Couldn't find template's 'start' comment!" unless $start_comment;
die "Couldn't find template's 'end' comment!"   unless $end_comment;
 
die "start and end comments don't have the same parent?!"
  unless $start_comment->parent eq $end_comment->parent;
# Make sure things are sane.
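
Because comments are stored as element objects, the same filtering can also be phrased with look_down( ), which accepts a regular expression as an attribute criterion. This is only a sketch of that alternative, not a replacement for the loop above; it parses a cut-down inline stand-in for the template (an assumption of this example) so that it can run on its own.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Inline stand-in for the template file (an assumption of this sketch).
my $template_root = HTML::TreeBuilder->new;
$template_root->store_comments(1);
$template_root->parse(
  '<html><body><!-- printable version -->'
  . '<blockquote><!-- start -->...put the content here...<!-- end -->'
  . '<hr></blockquote></body></html>'
);
$template_root->eof;

# In scalar context, look_down returns the first node that satisfies
# every criterion: here, a ~comment whose text attribute matches.
my $start_comment = $template_root->look_down(
  '_tag', '~comment',  'text', qr/^\s*start\s*$/ );
my $end_comment = $template_root->look_down(
  '_tag', '~comment',  'text', qr/^\s*end\s*$/ );

die "Couldn't find template's 'start' comment!" unless $start_comment;
die "Couldn't find template's 'end' comment!"   unless $end_comment;
die "start and end comments don't have the same parent?!"
  unless $start_comment->parent eq $end_comment->parent;

print "Found [", $start_comment->attr('text'), "] and [",
  $end_comment->attr('text'), "] under <",
  $start_comment->parent->tag, ">\n";
```

The loop form makes a single pass over all the comments, whereas this form walks the tree once per comment sought. For a template this small, either is fine.
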

10.4.3. Attaching Content

Once that's done, we need some way of taking some new content (which we'll get elsewhere) and putting that in place of what's between the "start" comment and the "end" comment. There are many ways of doing this, but this is the most straightforward in terms of the methods we've already seen in this chapter:

sub put_into_template {
  my @to_insert = @_;
  my $parent = $start_comment->parent;
  my @old_content = $parent->detach_content;
  my @new_content;

  # Copy everything up to the $start_comment into @new_content,
  # and then everything starting at $end_comment, and ignore
  # everything in between and instead drop in things from @to_insert.

  my $am_saving = 1;
  foreach my $node (@old_content) {
    if($am_saving) {
      push @new_content, $node;
      if($node eq $start_comment) {
        push @new_content, @to_insert;
        $am_saving = 0;   # and start ignoring nodes.
      }
    } else {  # I'm snipping out things to ignore
      if($node eq $end_comment) {
        push @new_content, $node;
        $am_saving = 1;
      } else {  # It's an element to ignore, and to destroy.
        $node->delete if ref $node;
      }
    }
  }
  $parent->push_content(@new_content);  # attach new children
  return;
}

This seems a bit long, but it's mostly the work of just tracking whether we're in the mode of saving things from the old content list or ignoring (and in fact deleting) things from the old content list. With that subroutine in our program, we can test whether it works:

put_into_template("Testing 1 2 3.");
$template_root->dump;
put_into_template("Is this mic on?");
$template_root->dump;

That prints this:

<html> @0
  <head> @0.0
    <title> @0.0.0
      "Put the title here"
  <body> @0.1
    <!-- printable version --> @0.1.0
    <blockquote> @0.1.1
      <font size="-1"> @0.1.1.0
        <!-- start --> @0.1.1.0.0
        "Testing 1 2 3."
        <!-- end --> @0.1.1.0.2
        <hr> @0.1.1.0.3
        "Copyright 2002. Printed from the United Lies Syndicate web site. "
<html> @0
  <head> @0.0
    <title> @0.0.0
      "Put the title here"
  <body> @0.1
    <!-- printable version --> @0.1.0
    <blockquote> @0.1.1
      <font size="-1"> @0.1.1.0
        <!-- start --> @0.1.1.0.0
        "Is this mic on?"
        <!-- end --> @0.1.1.0.2
        <hr> @0.1.1.0.3
        "Copyright 2002. Printed from the United Lies Syndicate web site. "

This shows that not only did we manage to replace the template's original "...put the content here..." text node with a "Testing 1 2 3." node, but a second call, replacing that with "Is this mic on?", worked too. From there, it's just a matter of adapting the code from the last section, which found the content in a file, except that this time we use our new put_into_template( ) function on that content:

# Read an individual file for its content now.
my $content_file_root = HTML::TreeBuilder->new;
my $input_filespec = 'rewriters1/in002.html';   # or whatever input file
$content_file_root->parse_file($input_filespec)
 || die "Can't read input file $input_filespec: $!";
 
# Find its real content:
my $good_td = $content_file_root->look_down( '_tag', 'td',  'class', 'story', );
die "No good td?!" unless $good_td;
 
put_into_template( $good_td->content_list );
$content_file_root->delete;  # We don't need it anymore.
 
open(OUT, ">rewriters1/out003a.html") || die "Can't write: $!";
  # or whatever output filespec
print OUT $template_root->as_HTML(undef, '  '); # two-space indent in output
close(OUT);

When this runs, we can see in the output file that the content was successfully inserted into the template and written out:

<html>
  <head>
    <title>Put the title here</title>
  </head>
  <body>
    <!-- printable version -->
    <blockquote><font size="-1">
        <!-- start -->
        <h1>Shatner and Kunis Sweep the Oscars</h1>
        <p>Stars of <cite>American Psycho II</cite> walked away with four Academy
           Awards...
        <!-- end -->
        <hr>Copyright 2002. Printed from the United Lies Syndicate web site.
        </font></blockquote>
  </body>
</html>

All is well, except the title is no good. It still says "Put the title here". All that's left is to replace the content of the template's title with the content of the current file's title. We just find the title element in each, and swap content:

my $template_title = $template_root->find_by_tag_name('title')
  || die "No title in template?!";
$template_title->delete_content;
my $content_title = $content_file_root->find_by_tag_name('title');
if($content_title) {
  $template_title->push_content( $content_title->content_list );
    # push_content, like the other content-attaching methods,
    #  automatically detaches element objects from wherever they
    #  currently are, as necessary.
} else {
  $template_title->push_content( 'No title' );
}

We put that code in our program anywhere between when we read the file into $content_file_root and when we destroy it; it works happily and puts the right content into the output file's title element:

<html>
  <head>
    <title>Shatner and Kunis Sweep the Oscars</title>
  </head>
[...]
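
The automatic detaching mentioned in the title-swapping code applies to element objects. Text nodes are plain strings, so pushing them into another tree copies them instead. The sketch below illustrates the difference with two small inline documents, which are assumptions of this example and not the chapter's files.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Two inline stand-ins for the content file and the template.
my $content_root = HTML::TreeBuilder->new_from_content(
  '<html><head><title>Real title</title></head>'
  . '<body><p>Some <b>bold</b> text</p></body></html>');
my $template_root = HTML::TreeBuilder->new_from_content(
  '<html><head><title>Put the title here</title></head>'
  . '<body></body></html>');

# A title holds only text, and text nodes are plain strings,
# so this copies the title text into the template.
my $template_title = $template_root->find_by_tag_name('title');
$template_title->delete_content;
$template_title->push_content(
  $content_root->find_by_tag_name('title')->content_list );

# An element object has only one parent at a time, so attaching
# the <b> to the template detaches it from the content tree.
my $bold          = $content_root->find_by_tag_name('b');
my $template_body = $template_root->find_by_tag_name('body');
$template_body->push_content($bold);

print "Template title: ", $template_title->as_text, "\n";
print "Source title:   ",
  $content_root->find_by_tag_name('title')->as_text, "\n";
print "The <b> element is now under <", $bold->parent->tag,
  "> in the template\n";
print "Source tree still has a <b>? ",
  ($content_root->find_by_tag_name('b') ? "yes" : "no"), "\n";
```

This is why the order of operations in the main program matters only for the tree we delete: the elements taken from the story cell already belong to the template by the time $content_file_root is destroyed.
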

Because this works for a single given input file, and because we tested earlier that our put_into_template( ) routine works for subsequent invocations as well as for the first, we have the main building block for a system that does template extraction and insertion for any number of files. All we have to do is turn that into a function, and call it as many times as needed. For example:

# ...read in $template_root...
# ...get names of files to change into @input_files...
foreach my $input_filespec (@input_files) {
  template_redo($input_filespec, "../printables/$input_filespec");
}

sub template_redo {
  my($input_filespec, $output_filespec) = @_;
  my $content_file_root = HTML::TreeBuilder->new;
  $content_file_root->parse_file($input_filespec)
   || die "Can't read input file $input_filespec: $!";

  #  ...then extract content and put into the template tree, as above...

  $content_file_root->delete;  # We don't need it anymore.
  open(OUT, ">$output_filespec") || die "Can't write $output_filespec: $!";
  print OUT $template_root->as_HTML(undef, '  ');
  close(OUT);
}
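
As noted when we wrote put_into_template( ), there are many ways of replacing what's between the two comments. One shorter variant, sketched here, uses HTML::Element's pindex( ) method to find each comment's position in its parent's content list, and splice_content( ) to swap out everything in between. The name splice_into_template( ) and the inline template string are assumptions of this sketch, and it presumes the same sanity checks as before: both comments exist, share a parent, and the start comment comes first.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Inline stand-in for the template file (an assumption of this sketch).
my $template_root = HTML::TreeBuilder->new;
$template_root->store_comments(1);
$template_root->parse(
  '<html><body><font size="-1"><!-- start -->...put the content here...'
  . '<!-- end --><hr>Copyright 2002.</font></body></html>'
);
$template_root->eof;

my $start_comment = $template_root->look_down(
  '_tag', '~comment',  'text', qr/^\s*start\s*$/ )
  || die "Couldn't find template's 'start' comment!";
my $end_comment = $template_root->look_down(
  '_tag', '~comment',  'text', qr/^\s*end\s*$/ )
  || die "Couldn't find template's 'end' comment!";
die "start and end comments don't have the same parent?!"
  unless $start_comment->parent eq $end_comment->parent;

# Hypothetical alternative to put_into_template( ).
sub splice_into_template {
  my @to_insert = @_;
  my $parent = $start_comment->parent;
  my $first  = $start_comment->pindex + 1;     # just after "start"
  my $length = $end_comment->pindex - $first;  # nodes up to "end"
  die "'end' comment comes before 'start' comment?!" if $length < 0;

  # Replace the old nodes with the new ones in a single step.
  my @removed = $parent->splice_content($first, $length, @to_insert);
  foreach my $node (@removed) {
    $node->delete if ref $node;   # text nodes are plain strings
  }
  return;
}

splice_into_template("Testing 1 2 3.");
splice_into_template("Is this mic on?");
my @content = $start_comment->parent->content_list;
$template_root->dump;
```

Like the original routine, this can be called repeatedly, because the positions are recomputed from the comments on every call.
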
