11.4. An HTTP Authentication Example:The Unicode Mailing Archive

Most password-protected sites (whether protected via HTTP Basic Authentication or otherwise) are that way because the sites' owners don't want just anyone to look at the content. And it would be a bit odd if I gave away such a username and password by mentioning it in this book! However, there is one well-known site whose content is password-protected without being secret: the mailing list archive of the Unicode mailing lists.

In an effort to keep email-harvesting bots from finding the Unicode mailing list archive while spidering the Web for fresh email addresses, the Unicode.org sysadmins have put a password on that part of their site. But to allow people (actual not-bot humans) to access the site, the site administrators publicly state the password, on an unprotected page, at http://www.unicode.org/mail-arch/, which links to the protected part, but also states the username and password you should use.

The main Unicode mailing list (called unicode) once in a while has a thread that is really very interesting and you really must read, but it's buried in a thousand other messages that are not even worth downloading, even in digest form. Luckily, this problem meets a tidy solution with LWP: I've written a short program that, on the first of every month, downloads the index of all the previous month's messages and reports the number of messages that has each topic as its subject.

The trick is that the web pages that list this information are password-protected. Moreover, the URL for the index of last month's posts is different every month, but in a fairly obvious way. The URL for March 2002, for example, is:

http://www.unicode.org/mail-arch/unicode-ml/y2002-m03/

Deducing the URL for the month that has just ended is simple enough:

# To be run on the first of every month...
use POSIX ('strftime');
my $last_month = strftime("y%Y-m%m", localtime(time - 24 * 60 * 60));
# Since today is the first, one day ago (24*60*60 seconds) is in
#  last month.
my $url = "http://www.unicode.org/mail-arch/unicode-ml/$last_month/";

But getting the contents of that URL involves first providing the username and password and realm name. The Unicode web site doesn't publicly declare the realm name, because it's an irrelevant detail for users with interactive browsers, but we need to know it for our call to the credential method. To find out the realm name, try accessing the URL in an interactive browser. The realm will be shown in the authentication dialog box, as shown in Figure 11-1.

In this case, it's "Unicode-MailList-Archives," which is all we needed to make our request:

my $browser = LWP::UserAgent->new;
$browser->credentials(
  'www.unicode.org:80',  # Don't forget the ":80"!
  # This is no secret...
  'Unicode-MailList-Archives',
  'unicode-ml' => 'unicode'
);
print "Getting topics for last month, $last_month\n",
      " from $url\n";
my $response = $browser->get($url);
die "Error getting $url: ", $response->status_line
  if $response->is_error;

If this fails (if the Unicode site's admins have changed the username or password or even the realm name), that will die with this error message:

Error getting http://www.unicode.org/mail-arch/unicode-ml/y2002-m03/:
401 Authorization Required at unicode_list001.pl line 21.

But assuming the authorization data is correct, the page is retrieved as if it were a normal, unprotected page. From there, counting the topics and noting the absolute URL of the first message of each thread is a matter of extracting data from the HTML source and reporting it concisely.

my(%posts, %first_url);
while( ${ $response->content_ref }
 =~ m{<li><a href="(\d+.html)"><strong>(.*?)</strong>}g
   # Like: <li><a href="0127.html"><strong>Klingon</strong>
) {
  my($url, $topic) = ($1,$2);
 
  # Strip any number of "Re:" prefixes.
  while( $topic =~ s/^Re:\s+//i ) {}
 
  ++$posts{$topic};
  use URI;   # For absolutizing URLs...
  $first_url{$topic} ||= URI->new_abs($url, $response->base);
}
 
print "Topics:\n", reverse sort map   # Most common first:
  sprintf("% 5s %s\n       %s\n",
          $posts{$_}, $_, $first_url{$_}
  ), keys %posts;

Typical output starts out like this:

Getting topics for last month, y2002-m02
 from http://www.unicode.org/mail-arch/unicode-ml/y2002-m02/
Topics:
   86 Unicode and Security
       http://www.unicode.org/mail-arch/unicode-ml/y2002-m02/0021.html
   47 ISO 3166 (country codes) Maintenance Agency Web pages move
       http://www.unicode.org/mail-arch/unicode-ml/y2002-m02/0390.html
   41 Unicode and end users
       http://www.unicode.org/mail-arch/unicode-ml/y2002-m02/0260.html
   27 Unicode Search Engines
       http://www.unicode.org/mail-arch/unicode-ml/y2002-m02/0360.html
   22 Smiles, faces, etc
       http://www.unicode.org/mail-arch/unicode-ml/y2002-m02/0275.html
   18 This spoofing and security thread
       http://www.unicode.org/mail-arch/unicode-ml/y2002-m02/0216.html
   16 Standard Conventions and euro
       http://www.unicode.org/mail-arch/unicode-ml/y2002-m02/0418.html

This continues for a few pages.