Token Sequences (Perl & LWP)

Some problems cannot be solved with a single-token approach. Often you need to scan for a sequence of tokens. For example, in Chapter 4, "URLs", we extracted the Amazon sales rank from HTML like this:

Here we're looking for the text "Amazon.com Sales Rank: ", then an end-tag for b, and then a text token holding the sales rank. To solve this, we need to check the next few tokens while being able to put them back if they're not what we expect.

Calling $stream->unget_token(@next) returns the tokens stored in @next to the stream. For example, to solve our Amazon problem:

If it's the text we're looking for, we cautiously explore the next tokens. If the next one is a </b> end-tag, check the next token to ensure that it's text. If it is, then that's the sales rank. If any of the tests fail, put the tokens back on the stream and go back to processing.
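A minimal sketch of that lookahead, assuming HTML::TokeParser tokens. The name find_sales_rank and the sample markup string are hypothetical, chosen only to illustrate the token sequence described above:

```perl
use strict;
use warnings;
use HTML::TokeParser;

# Hypothetical helper: return the text that follows
# "Amazon.com Sales Rank:" and a </b> end-tag, or nothing.
sub find_sales_rank {
  my($stream) = @_;
  while (my $token = $stream->get_token) {
    next unless $token->[0] eq 'T'
            and $token->[1] =~ m/Amazon\.com Sales Rank:/;

    # Promising. Look ahead, remembering everything we pull off.
    my @next = ($stream->get_token);
    if ($next[0] and $next[0][0] eq 'E' and $next[0][1] eq 'b') {
      push @next, $stream->get_token;
      if ($next[1] and $next[1][0] eq 'T') {
        (my $rank = $next[1][1]) =~ s/^\s+|\s+$//g;  # trim whitespace
        return $rank;
      }
    }
    # Not the pattern after all: put the tokens back and keep scanning.
    $stream->unget_token(@next);
  }
  return;
}

# Assumed sample markup, for illustration only.
my $html = '<font><b>Amazon.com Sales Rank: </b> 4,070 </font><br>';
my $stream = HTML::TokeParser->new(\$html) or die "Can't parse sample";
my $rank = find_sales_rank($stream);
print defined $rank ? "Rank: $rank\n" : "No rank found\n";
```

Returning from the function on a match, rather than printing, keeps the scanning logic easy to test against saved pages.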

7.4.1. Example: BBC Headlines

Suppose, for example, that your morning ritual is to have the help come and wake you at about 11 a.m. as they bring two serving trays to your bed. On one tray there's a croissant, some pain au chocolat, and of course some café au lait, and on the other tray, your laptop with a browser window already open on each story from BBC News's front page (http://news.bbc.co.uk). However, the help have been getting mixed up lately and opening the stories on The Guardian's web site, and that's a bit awkward, since clearly The Guardian is an after-lunch paper. You'd say something about it, but one doesn't want to make a scene, so you just decide to write a program that the help can run on the laptop to find all the BBC story URLs.

So you look at the source of http://news.bbc.co.uk and discover that each headline link is wrapped in one of two kinds of code. There are lots of headlines in code such as this:

<B CLASS="h3"><A href="/hi/english/business/newsid_1576000/1576290.stm">Bank
of England mulls rate cut</A></B><BR>
  
<B CLASS="h3"><A href="/hi/english/uk_politics/newsid_1576000/1576541.stm">Euro
battle revived by Blair speech</A></B><BR>

and also some headlines in code like this:

<A href="/hi/english/business/newsid_1576000/1576636.stm">
  <B class="h2"> Swissair shares wiped out</B><BR>
</A>

<A href="/hi/english/world/middle_east/newsid_1576000/1576113.stm">
  <B class="h1">Mid-East blow to US anti-terror drive</B><BR>
</A>

(Note that the B start-tag's class value can be h1 or h2.)

Studying this, you realize that this is how you find the story URLs:

Every time there's a B start-tag with class value of h3, and then an A start-tag with an href value, save that href.
Every time there's an A start-tag with an href value, a text token consisting of just whitespace, and then a B start-tag with a class value of h1 or h2, save the first token's href value.

7.4.2. Translating the Problem into Code

We can take some shortcuts when translating this into code that uses $stream->unget_token($token). The following HTML is typical:

<B CLASS="h3">Top Stories</B><BR>
...
<B CLASS="h3"><A href="/hi/english/business/newsid_1576000/1576290.stm">Bank
of England mulls rate cut</A></B><BR>

When we see the first B-h3 start-tag token, we think it might be the start of a B-h3-A-href pattern. So we get another token and see if it's an A-href token. It's not (it's the text token Top Stories), so we put it back into the stream (useful in case some other pattern we're looking for involves that being the first token), and we keep looping. Later, we see another B-h3, we get another token, and we inspect it to see if it's an A-href token. This time it is, so we process its href value and resume looping. There's no reason for us to put that A-href back, so the next iteration of the loop will resume with the next token being Bank of England mulls rate cut.

sub scan_bbc_stream {
  my($stream, $docbase) = @_;

 Token:
  while(my $token = $stream->get_token) {

    if ($token->[0] eq 'S'  and  $token->[1] eq 'b'  and  
       ($token->[2]{'class'} || '') eq 'h3') {
      # The href we want is in the NEXT token... probably.
      # Like: <B CLASS="h3"><A href="magic_url_here">

      my(@next) = ($stream->get_token);

      if ($next[0] and $next[0][0] eq 'S'  and  $next[0][1] eq 'a'  and
          defined $next[0][2]{'href'} ) {
         # We found <a href="...">!  This rule matches!
         print URI->new_abs($next[0][2]{'href'}, $docbase), "\n";
         next Token;
      }
      # We get here only if we've given up on this rule:
      $stream->unget_token(@next);
    }

    # fall thru to subsequent rules here...

  }
  return;
}

The general form of the rule above is this: if the current token looks promising, pull off a token and see if that looks promising too. If, at any point, we see an unexpected token or hit the end of the stream, we restore what we've pulled off (held in the temporary array @next), and continue to try other rules. But if all the expectations in this rule are met, we make it to the part that processes this bunch of tokens (here it's just a single line, which prints the URL), and then call next Token to start another iteration of this loop without restoring the tokens that have matched this pattern. (If you are disturbed by this use of a named block and last-ing and next-ing around, consider that this could be written as a giant if/else statement at the risk of potentially greater damage to what's left of your sanity.)

Each such rule, then, can pull from the stream however many tokens it needs to either match or reject the pattern it's after. Either it matches and starts another iteration of this loop, or it restores the stream to exactly the way it was before this rule started pulling from it. This business of a temporary @next list may seem like overkill when we only have to look one token ahead, only ever looking at $next[0]. However, the if block for the next pattern (which requires looking two tokens ahead) shows how the same framework can be accommodating:

# Add this right after the first if-block ends.
if($token->[0] eq 'S'  and  $token->[1] eq 'a'  and
   defined $token->[2]{'href'} ) {
  # Like: <A href="magic_url_here"> <B class="h2">

  my(@next) = ($stream->get_token);
  if ($next[0] and $next[0][0] eq 'T' and $next[0][1] =~ m/^\s+$/s ) {
    # We found whitespace.
    push @next, $stream->get_token;
    if ($next[1] and $next[1][0] eq 'S'  and  $next[1][1] eq 'b'  and
       ($next[1][2]{'class'} || '') =~ m/^h[12]$/s ) {
      # We found <b class="h2">!  This rule matches!
      print URI->new_abs( $token->[2]{'href'}, $docbase ), "\n";
      next Token;
    }
  }
  # We get here only if we've given up on this rule:
  $stream->unget_token(@next);
}
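Since both rules follow the same pull-test-restore shape, the bookkeeping could be factored into one routine. The following is only a sketch of that idea: match_ahead( ) and scan_hrefs( ) are hypothetical names, not part of HTML::TokeParser, and the scanner here returns raw href values instead of printing absolute URLs.

```perl
use strict;
use warnings;
use HTML::TokeParser;

# Hypothetical helper: test the upcoming tokens against a list of
# predicate subs. On success, return the matched tokens (they stay
# consumed). On failure, put back everything pulled off and return
# an empty list.
sub match_ahead {
  my($stream, @tests) = @_;
  my @next;
  foreach my $test (@tests) {
    my $token = $stream->get_token;
    push @next, $token if $token;
    next if $token and $test->($token);
    $stream->unget_token(@next);   # restore the stream
    return;
  }
  return @next;
}

# Predicates for the token shapes the two rules care about.
my $is_a_href = sub { $_[0][0] eq 'S' and $_[0][1] eq 'a'
                      and defined $_[0][2]{'href'} };
my $is_b_h3   = sub { $_[0][0] eq 'S' and $_[0][1] eq 'b'
                      and ($_[0][2]{'class'} || '') eq 'h3' };
my $is_b_h12  = sub { $_[0][0] eq 'S' and $_[0][1] eq 'b'
                      and ($_[0][2]{'class'} || '') =~ m/^h[12]$/ };
my $is_space  = sub { $_[0][0] eq 'T' and $_[0][1] =~ m/^\s+$/ };

sub scan_hrefs {
  my($stream) = @_;
  my @hrefs;
  while (my $token = $stream->get_token) {
    # Rule 1: <B CLASS="h3"><A href="...">
    if ($is_b_h3->($token)) {
      if (my @m = match_ahead($stream, $is_a_href)) {
        push @hrefs, $m[0][2]{'href'};
        next;
      }
    }
    # Rule 2: <A href="..."> whitespace <B class="h1" or "h2">
    if ($is_a_href->($token)
        and match_ahead($stream, $is_space, $is_b_h12)) {
      push @hrefs, $token->[2]{'href'};
    }
  }
  return @hrefs;
}

my $html = <<'END_HTML';
<B CLASS="h3">Top Stories</B><BR>
<B CLASS="h3"><A href="/hi/english/business/newsid_1576000/1576290.stm">Bank
of England mulls rate cut</A></B><BR>
<A href="/hi/english/business/newsid_1576000/1576636.stm">
  <B class="h2"> Swissair shares wiped out</B><BR>
</A>
END_HTML
print "$_\n" for scan_hrefs(HTML::TokeParser->new(\$html));
```

The trade-off is that each rule becomes a one-line list of predicates, at the cost of hiding the explicit @next handling that the hand-written rules make visible.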

7.4.3. Bundling into a Program

With all that wrapped up in a single function scan_bbc_stream( ), we can test it by first saving the contents of http://news.bbc.co.uk locally as bbc.html (which we probably already did to scrutinize its source code and figure out what HTML patterns surround headlines), and then calling this:

use strict;
use HTML::TokeParser;
use URI;

scan_bbc_stream(
  HTML::TokeParser->new('bbc.html') || die($!),
  'http://news.bbc.co.uk/' # base URL
);

When run, this merrily scans the local copy and says:

http://news.bbc.co.uk/hi/english/world/middle_east/newsid_1576000/1576113.stm
http://news.bbc.co.uk/hi/english/world/south_asia/newsid_1576000/1576186.stm
http://news.bbc.co.uk/hi/english/uk_politics/newsid_1576000/1576051.stm
http://news.bbc.co.uk/hi/english/uk/newsid_1576000/1576379.stm
http://news.bbc.co.uk/hi/english/business/newsid_1576000/1576636.stm
http://news.bbc.co.uk/sport/hi/english/in_depth/2001/england_in_zimbabwe/newsid_1574000/1574824.stm
http://news.bbc.co.uk/hi/english/business/newsid_1576000/1576546.stm
http://news.bbc.co.uk/hi/english/uk/newsid_1576000/1576313.stm
http://news.bbc.co.uk/hi/english/uk_politics/newsid_1576000/1576541.stm
http://news.bbc.co.uk/hi/english/business/newsid_1576000/1576290.stm
http://news.bbc.co.uk/hi/english/entertainment/music/newsid_1576000/1576599.stm
http://news.bbc.co.uk/hi/english/sci/tech/newsid_1574000/1574048.stm
http://news.bbc.co.uk/hi/english/health/newsid_1576000/1576776.stm
http://news.bbc.co.uk/hi/english/in_depth/uk_politics/2001/conferences_2001/labour/newsid_1576000/1576086.stm

At least that's what the program said once I got scan_bbc_stream( ) in its final working state shown above. As I was writing it and testing bits of it, I could run and re-run the program, scanning the same local file. Then once it's working on the local file (or files, depending on how many test cases you have), you can write the routine that gets what's at a URL, makes a stream pointing to its content, and runs a given scanner routine (such as scan_bbc_stream( )) on it:

my $browser;
BEGIN {
  use LWP::UserAgent;
  $browser = LWP::UserAgent->new;
  # and any other $browser initialization code here
}

sub url_scan {
  my($scanner, $url) = @_;
  die "What scanner function?" unless $scanner and ref($scanner) eq 'CODE';
  die "What URL?" unless $url;
  my $resp = $browser->get( $url );
  die "Error getting $url: ", $resp->status_line
    unless $resp->is_success;
  die "It's not HTML, it's ", $resp->content_type
    unless $resp->content_type eq 'text/html';

  my $stream = HTML::TokeParser->new( $resp->content_ref )
    || die "Couldn't make a stream from $url\'s content!?";
  # new( ) on a string wants a reference, and so that's what
  #  we give it!  HTTP::Response objects just happen to
  #  offer a method that returns a reference to the content.
  $scanner->($stream, $resp->base);
}

If you thought the contents of $resp could be very large, you could save the contents to a temporary file, and start the stream off with HTML::TokeParser->new($tempfile). With the above url_scan( ), to retrieve the BBC main page and scan it, you need only replace our test statement, which scans the local file, with this:

url_scan(\&scan_bbc_stream, 'http://news.bbc.co.uk/');
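As for the temporary-file approach mentioned above, one way it might look is sketched here. It assumes LWP::UserAgent's ':content_file' option to get( ), which writes the response body to a named file, plus the core File::Temp module. The name url_scan_via_file( ) and the choice to pass the browser object as an argument are illustrative assumptions.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);
use HTML::TokeParser;
use LWP::UserAgent;

# Hypothetical variant of url_scan( ): the response body goes straight
# to a temporary file instead of being held in memory.
sub url_scan_via_file {
  my($browser, $scanner, $url) = @_;
  die "What scanner function?" unless $scanner and ref($scanner) eq 'CODE';
  die "What URL?" unless $url;

  # The file is removed automatically when the program exits.
  my($fh, $tempfile) = tempfile(SUFFIX => '.html', UNLINK => 1);
  close $fh;   # LWP reopens the file by name

  my $resp = $browser->get($url, ':content_file' => $tempfile);
  die "Error getting $url: ", $resp->status_line
    unless $resp->is_success;
  die "It's not HTML, it's ", $resp->content_type
    unless $resp->content_type eq 'text/html';

  # new( ) on a plain string treats it as a filename to open.
  my $stream = HTML::TokeParser->new($tempfile)
    || die "Couldn't open $tempfile: $!";
  return $scanner->($stream, $resp->base);
}

# Usage, with the scanner from earlier in this section:
#   url_scan_via_file(LWP::UserAgent->new,
#                     \&scan_bbc_stream, 'http://news.bbc.co.uk/');
```

Either way, the call has the same shape: a scanner routine plus a URL.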

And then the program outputs the URLs from the live BBC main page (or will die with an error message if it can't get it). To actually complete the task of getting the printed URLs to each open a new browser instance, well, this depends on your browser and OS, but for my MS Windows laptop and Netscape, this Perl program will do it:

my $ns = "c:\\program files\\netscape\\communicator\\program\\netscape.exe";
die "$ns doesn't exist" unless -e $ns;
die "$ns isn't executable" unless -x $ns;
while (<>) { chomp; m/\S/ and system($ns, $_) and die $!; }

This is then called as:

C:\perlstuff> perl bbc_urls.pl | perl urls2ns.pl

Under Unix, the correct system( ) command is:

system("netscape '$url' &")
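One caveat with the single-string form is that the URL passes through the shell, so a URL containing a single quote would break the quoting. A possible alternative, sketched here under the assumption of a Unix-like system with fork( ), skips the shell entirely. The name launch_in_background( ) is hypothetical.

```perl
use strict;
use warnings;
use POSIX ();

# Hypothetical helper: start $program with @args in the background
# without involving a shell, and return the child's process ID.
sub launch_in_background {
  my($program, @args) = @_;
  my $pid = fork;
  die "Can't fork: $!" unless defined $pid;
  if ($pid == 0) {
    # Child: the list form of exec passes each argument verbatim,
    # so no quoting of the URL is needed.
    exec { $program } $program, @args;
    warn "Can't exec $program: $!\n";
    POSIX::_exit(127);
  }
  return $pid;   # parent carries on without waiting
}

# Usage, reading URLs one per line as urls2ns.pl does:
#   while (<>) { chomp; m/\S/ and launch_in_background('netscape', $_); }
```

The parent does not wait for its children here, which is fine for a short-lived script but would leave zombie processes behind in a long-running one.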
