Using Extracted Text (Perl & LWP)

7.6. Using Extracted Text

Consider the BBC story-link extractor introduced earlier. Its task was to find links to stories, in either of these kinds of patterns:

<B CLASS="h3"><A href="/hi/english/business/newsid_1576000/1576290.stm">Bank
  of England mulls rate cut</A></B><BR>

<A href="/hi/english/world/middle_east/newsid_1576000/1576113.stm">
  <B class="h1">Mid-East blow to US anti-terror drive</B><BR>
</A>

and then to isolate the URL, absolutize it, and print it. But it ignores the actual link text, which starts with the next token in the stream. If we want that text, we could get the next token by calling get_text( ):

print $stream->get_text( ), "\n  ",
  URI->new_abs($next[0][2]{'href'}, $docbase), "\n";

That prints the text like this:

Bank
of England mulls rate cut
  http://news.bbc.co.uk/hi/english/business/newsid_1576000/1576290.stm

Note that the newline (and any indenting, if there was any) in the source hasn't been filtered out. For some applications, this makes no difference, but for neatness' sake, let's keep headlines to one line each. Changing get_text( ) to get_trimmed_text( ) makes that happen:

print $stream->get_trimmed_text( ), "\n  ",
  URI->new_abs($next[0][2]{'href'}, $docbase), "\n";

Now the headline comes out on one line:

Bank of England mulls rate cut
  http://news.bbc.co.uk/hi/english/business/newsid_1576000/1576290.stm
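
To see the two methods side by side, here is a minimal standalone sketch. It assumes the HTML::TokeParser and URI modules are installed, parses the first sample link from a string instead of a fetched page, and uses get_tag( ) to find the link where the original program walks tokens by hand. The $docbase value and the first_link( ) helper are made up for illustration.

```perl
use strict;
use warnings;
use HTML::TokeParser;
use URI;

# Assumed base URL; the real program would take it from the fetched page.
my $docbase = 'http://news.bbc.co.uk/';

my $html = <<'END_HTML';
<B CLASS="h3"><A href="/hi/english/business/newsid_1576000/1576290.stm">Bank
  of England mulls rate cut</A></B><BR>
END_HTML

# Hypothetical helper: parse $html afresh, skip to the first <a> tag,
# read the following text with the named method, and absolutize the href.
sub first_link {
  my ($method) = @_;
  my $stream = HTML::TokeParser->new(\$html) or die "Couldn't parse HTML";
  my $a_tag  = $stream->get_tag('a') or die "No link found";
  my $text   = $stream->$method();
  my $url    = URI->new_abs($a_tag->[1]{'href'}, $docbase);
  return ($text, "$url");
}

my ($raw, $url) = first_link('get_text');
my ($trimmed)   = first_link('get_trimmed_text');

print "get_text:         [$raw]\n";
print "get_trimmed_text: [$trimmed]\n  $url\n";
```

The brackets in the output make it easy to see that only the first form keeps the newline and indenting from the source.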

If the headlines are potentially quite long, we can pass them through Text::Wrap, to wrap them at 72 columns.
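
A minimal sketch of that wrapping step follows. The headline string is invented for illustration; in the real program it would be the return value of get_trimmed_text( ). Note that Text::Wrap keeps every line shorter than $Text::Wrap::columns, so a setting of 72 yields lines of at most 71 characters.

```perl
use strict;
use warnings;
use Text::Wrap qw(wrap);

# Every wrapped line will be shorter than this many characters.
$Text::Wrap::columns = 72;

# Made-up long headline, standing in for the output of get_trimmed_text( ).
my $headline = 'Bank of England mulls rate cut while markets, analysts, '
             . 'and assorted onlookers wait for an announcement';

# No extra indent on the first line or on continuation lines.
my $wrapped = wrap('', '', $headline);
print $wrapped, "\n";
```

The first two arguments to wrap( ) are the indents for the first line and for later lines; passing '  ' as the second would line continuation lines up with the indented URL.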

There's a trickier problem that occurs often with get_text( ) or get_trimmed_text( ). What if the HTML we're parsing looks like this?

<B CLASS="h3"><A href="/unlikely/2468.stm">Shatner &amp; Kunis win Oscars
  for <cite>American Psycho II</cite> r&ocirc;les</A></B><BR>

If we've just parsed the b and the a, the next token in the stream is a text token, "Shatner & Kunis win Oscars for ", and that's what get_text( ) returns (get_trimmed_text( ) returns the same thing, minus the final space and with the line break collapsed to a single space). But we don't want only the first text token in the headline; we want the whole headline. So instead of defining the headline as "the next text token," we could define it as "all the text tokens until the next </a>." So the program changes to:

print $stream->get_trimmed_text('/a'), "\n  ",
  URI->new_abs($next[0][2]{'href'}, $docbase), "\n";

That happily prints:

Shatner & Kunis win Oscars for American Psycho II rôles
  http://news.bbc.co.uk/unlikely/2468.stm
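
The difference the '/a' argument makes can be checked with a short standalone sketch. It assumes HTML::TokeParser is installed and parses the sample markup from a string; the variable names are ours, not part of the original program.

```perl
use strict;
use warnings;
use HTML::TokeParser;

my $html = <<'END_HTML';
<B CLASS="h3"><A href="/unlikely/2468.stm">Shatner &amp; Kunis win Oscars
  for <cite>American Psycho II</cite> r&ocirc;les</A></B><BR>
END_HTML

# Without an argument: only the text up to the next tag (here, <cite>).
my $stream = HTML::TokeParser->new(\$html) or die "Couldn't parse HTML";
$stream->get_tag('a') or die "No link found";
my $first_only = $stream->get_trimmed_text();

# With '/a': all the text up to the closing </a>, skipping inner tags.
$stream = HTML::TokeParser->new(\$html) or die "Couldn't parse HTML";
$stream->get_tag('a') or die "No link found";
my $whole = $stream->get_trimmed_text('/a');

# Assumes a UTF-8 terminal, so the o-circumflex prints correctly.
binmode STDOUT, ':encoding(UTF-8)';
print "first token: [$first_only]\n";
print "whole link:  [$whole]\n";
```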

Note that the &amp; and &ocirc; entity references were resolved to & and ô. If you were using such a program to spit out something other than plain text (such as XML or RTF), a bare & and/or a bare high-bit character such as ô might be unacceptable, and might need escaping in some fashion. Even if you are emitting plain text, the \xA0 (nonbreaking space) or \xAD (soft hyphen) characters may not be happily interpreted by whatever application you're reading the text with, in which case a tr/\xA0/ / and a tr/\xAD//d are called for. If you're taking the output of get_text( ) or get_trimmed_text( ) and sending it to a system that understands only U.S. ASCII, then passing the text through a module such as Text::Unidecode might be called for, to turn the ô into an o. This is not really an HTML or HTML::TokeParser matter at all, but is the sort of problem that commonly arises when extracting content from HTML and putting it into other formats.
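
As a rough sketch of that kind of post-processing, using only core Perl: the two helper subroutines below are hypothetical names of our own, the sample string is made up, and the XML escaping shown is a bare minimum for element content, not a complete treatment.

```perl
use strict;
use warnings;

# Hypothetical helper: make extracted text safer for plain-text output.
sub tidy_plain {
  my ($text) = @_;
  $text =~ tr/\xA0/ /;   # nonbreaking space becomes an ordinary space
  $text =~ tr/\xAD//d;   # soft hyphens are deleted
  return $text;
}

# Hypothetical helper: minimal escaping for XML element content.
sub escape_xml {
  my ($text) = @_;
  $text =~ s/&/&amp;/g;   # must come first, before other entities are added
  $text =~ s/</&lt;/g;
  $text =~ s/>/&gt;/g;
  # Turn each non-ASCII character into a numeric character reference.
  $text =~ s/([^\x00-\x7F])/sprintf('&#%d;', ord($1))/ge;
  return $text;
}

# Made-up sample containing a nonbreaking space and a soft hyphen.
my $extracted = "Shatner & Kunis win\xA0Oscars for co\xADstarring r\xF4les";

my $plain = tidy_plain($extracted);
my $xml   = escape_xml($plain);

print "$xml\n";
```

For an ASCII-only target, the unidecode( ) function from Text::Unidecode (a CPAN module, not part of core Perl) could be applied to $plain instead of escape_xml( ).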