Example: BBC News (Perl & LWP)

9.4. Example: BBC News

In Chapter 7, "HTML Processing with Tokens", we considered the task of extracting the headline link URLs from the BBC News main page, and we implemented it in terms of HTML::TokeParser. Here, we'll consider the same problem from the perspective of HTML::TreeBuilder.

To review the problem: when you look at the source of http://news.bbc.co.uk, you discover that each headline link is wrapped in one of two kinds of code. There are a lot of headlines expressed with code like this:

<B CLASS="h3"><A href="/hi/english/business/newsid_1576000/1576290.stm">Bank
of England mulls rate cut</A></B><BR>
  
<B CLASS="h3"><A href="/hi/english/uk_politics/newsid_1576000/1576541.stm">Euro
battle revived by Blair speech</A></B><BR>

and some headlines expressed with code like this:

<A href="/hi/english/business/newsid_1576000/1576636.stm">
  <B class="h2"> Swissair shares wiped out</B><BR>
</A>

<A href="/hi/english/world/middle_east/newsid_1576000/1576113.stm">
  <B class="h1">Mid-East blow to US anti-terror drive</B><BR>
</A>

(Note that in this second case, the B element's class value can be h1 or h2.)

In both cases, we can find what we want by first looking for B elements. We then look for the href attribute either on the A element that's a child of this B element, or on the A element that's this B element's parent. Whether we look for a parent A node or a child A node depends on the class attribute of the B element. To make sure we're on the right track, we can code up something to formalize our idea of what sorts of nodes we want, and call the dump method on each of them.

use strict;
use HTML::TreeBuilder 3;

my $tree = HTML::TreeBuilder->new( );
$tree->parse_file('bbc.html') || die $!;  # the saved source from BBC News
scan_bbc_tree( $tree, 'http://news.bbc.co.uk/' );
$tree->delete( );

sub scan_bbc_tree {
  my($root, $docbase) = @_;
    # $docbase will be needed if we want to absolutize the URL
  foreach my $b ($root->find_by_tag_name('b')) {
    my $class = $b->attr('class') || next;
    if($class eq 'h3') {
      # expect one 'a' element as a child
      print "Found a b-h3.  Dumping it:\n";
      $b->dump;
    } elsif($class eq 'h1' or $class eq 'h2') {
      # expect the parent to be an 'a'
      print "Found a b-h[1-2].  Dumping its parent:\n";
      $b->parent->dump;
    }
  }
  return;  
}

When run on the full file, that program produces this output:

Found a b-h3.  Dumping it:
<b class="h3"> @0.1.2.2.0.0.3.2.0.3.0.0.0.0.6
  <a href="/sport/hi/english/in_depth/2001/england_in_zimbabwe/newsid_1574000/
1574824.stm"> @0.1.2.2.0.0.3.2.0.3.0.0.0.0.6.0
    "Zimbabwe suffer treble blow"

Found a b-h3.  Dumping it:
<b class="h3"> @0.1.2.2.0.0.3.2.0.6.1.0
  <a href="/hi/english/business/newsid_1576000/1576546.stm"> @0.1.2.2.0.0.3.2.0.6.1.0.0
    "UK housing market stalls"

Found a b-h[1-2].  Dumping its parent:
<a href="/hi/english/uk_politics/newsid_1576000/1576051.stm"> @0.1.2.2.0.0.1.2.0.14.2
  " "
  <b class="h1"> @0.1.2.2.0.0.1.2.0.14.2.1
    "UK hate crime laws to be tightened"
  <br> @0.1.2.2.0.0.1.2.0.14.2.2

Found a b-h[1-2].  Dumping its parent:
<a href="/hi/english/uk/newsid_1576000/1576379.stm"> @0.1.2.2.0.0.1.2.0.18.2
  " "
  <b class="h2"> @0.1.2.2.0.0.1.2.0.18.2.1
    "Leeds footballers' trial begins"
  <br> @0.1.2.2.0.0.1.2.0.18.2.2

[...and others just like those...]

This output shows all the sorts of nodes from which we'll want to extract data and contains no other kinds of nodes. With the situation we see in the first two cases, the b element with the class="h3" attribute indeed has only one child node, which is an a element whose href we want, and in the latter two cases, we need only look to the href attribute on the parent of the b element (which has a class="h1" or class="h2" attribute). So because we're identifying things correctly, we can go ahead and change our code so that instead of dumping nodes, it will actually pull the hrefs out, absolutize them, and print them:

use URI;
sub scan_bbc_tree {
  my($root, $docbase) = @_;
  foreach my $b ($root->find_by_tag_name('b')) {
    my $class = $b->attr('class') || next;
    if($class eq 'h3') {
      # Expect one 'a' element as a child
      my @children = $b->content_list;
      if(@children == 1 and ref $children[0] and $children[0]->tag eq 'a') {
        print URI->new_abs(
          $children[0]->attr('href') || next,
          $docbase
        ), "\n";
      }
    } elsif($class eq 'h1' or $class eq 'h2') {
      # Expect an 'a' element as a parent
      my $parent = $b->parent;
      if($parent and $parent->tag eq 'a') {
        print URI->new_abs(
          $parent->attr('href') || next,
          $docbase
        ), "\n";
      }
    }
  }
  return;  
}

When run, this correctly reports all the URLs in the document:

http://news.bbc.co.uk/sport/hi/english/in_depth/2001/england_in_zimbabwe/newsid_
1574000/1574824.stm
http://news.bbc.co.uk/hi/english/business/newsid_1576000/1576546.stm
http://news.bbc.co.uk/hi/english/uk_politics/newsid_1576000/1576051.stm
http://news.bbc.co.uk/hi/english/uk/newsid_1576000/1576379.stm
[...etc...]

If we want to make our program also capture the text inside the link, that's straightforward too; we need only change each occurrence of:

print URI->new_abs(...

to:

print $b->as_text( ), "\n  ", URI->new_abs(...

Then you'll get output like this:

UK housing market stalls
  http://news.bbc.co.uk/hi/english/business/newsid_1576000/1576546.stm
UK hate crime laws to be tightened
  http://news.bbc.co.uk/hi/english/uk_politics/newsid_1576000/1576051.stm
Leeds footballers' trial begins
  http://news.bbc.co.uk/hi/english/uk/newsid_1576000/1576379.stm
 Swissair shares wiped out
  http://news.bbc.co.uk/hi/english/business/newsid_1576000/1576636.stm
[...]

Notice that in the fourth link there, we have a space at the start. Wanting to not have whitespace at the start or end of as_text( ) is common enough that there's a method just for that: as_trimmed_text( ), which wraps around as_text( ), removes any whitespace at the start or end, and collapses any whitespace nodes on the inside.[3] When we replace our calls to get_text( ) with calls to get_trimmed_text( ), that last link changes to this:

[3] This is exactly the same as the $stream->get_text( ) versus $stream->get_trimmed_text( ) distinction in HTML::TokeParser.

[...]
Swissair shares wiped out
  http://news.bbc.co.uk/hi/english/business/newsid_1576000/1576636.stm
[...]

that is, without the space at the start of the line.