2.5. Example: AltaVista

Every so often, two people, somewhere, somehow, will come to argue over a point of English spelling—one of them will hold up a dictionary recommending one spelling, and the other will hold up a dictionary recommending something else. In olden times, such conflicts were tidily settled with a fight to the death, but in these days of overspecialization, it is common for one of the spelling combatants to say "Let's ask a linguist. He'll know I'm right and you're wrong!" And so I am contacted, and my supposedly expert opinion is requested. And if I happen to be answering mail that month, my response is often something like:

Dear Mr. Hing:

I have read with intense interest your letter detailing your struggle with the question of whether your favorite savory spice should be spelled in English as "asafoetida" or whether you should heed your secretary's admonishment that all the kids today are spelling it "asafetida."

I could note various factors potentially involved here; notably, the fact that in many cases, British/Commonwealth spelling retains many "ae"/"oe" digraphs whereas U.S./Canadian spelling strongly prefers an "e" ("foetus"/"fetus," etc.). But I will instead be (merely) democratic about this and note that if you use AltaVista (http://altavista.com, a well-known search engine) to run a search on "asafetida," it will say that across all the pages that AltaVista has indexed, there are "about 4,170" matched; whereas for "asafoetida" there are many more, "about 8,720."

So you, with the "oe", are apparently in the majority.

To automate the task of producing such reports, I've written a small program called alta_count, which queries AltaVista for each term given and reports the count of documents matched:

% alta_count asafetida asafoetida
asafetida: 4,170 matches
asafoetida: 8,720 matches

At the time of this writing, going to http://altavista.com, putting a word or phrase in the search box, and hitting the Submit button yields a result page with a URL that looks like this:


Now, you could construct these URLs for any phrase with something like:

$url = 'http://www.altavista.com/sites/search/web?q=%22'
       . $phrase
       . '%22&kl=XX'  ;

But that doesn't take into account the need to encode characters such as spaces in URLs. If I want to run a search on the frequency of "boy toy" (as compared to the alternate spelling "boytoy"), the space in that phrase needs to be encoded as %20, and if I want to run a search on the frequency of "résumé," each "é" needs to be encoded as %E9.

The correct way to generate the query strings is to use the URI::Escape module:

use URI::Escape;    # That gives us the uri_escape function
$url = 'http://www.altavista.com/sites/search/web?q=%22'
       . uri_escape($phrase)
       . '%22&kl=XX'  ;
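
As a quick check of what uri_escape actually produces, here is a minimal sketch that escapes the phrases discussed above. It assumes the "é" characters are supplied as single Latin-1 characters (written here as \xE9); other encodings would be escaped differently.

```perl
use strict;
use warnings;
use URI::Escape;   # provides uri_escape

# "r\xE9sum\xE9" is "résumé" spelled with Latin-1 character escapes
# (an assumption about the input encoding).
foreach my $phrase ('boytoy', 'boy toy', "r\xE9sum\xE9") {
  my $escaped = uri_escape($phrase);
  print "$escaped\n";
}
# Prints:
#   boytoy
#   boy%20toy
#   r%E9sum%E9
```

Letters and digits pass through untouched; only the characters that are unsafe in a URL are replaced by their %XX form.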

Now we just have to request that URL and skim the returned content for AltaVista's standard phrase "We found [number] results." (That's assuming the response comes with an okay status code, as we should get unless AltaVista is somehow down or inaccessible.)
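
The skimming step can be tried on its own, without touching the network. The following is only a sketch: the $content string is a made-up fragment standing in for a real result page, and extract_count is a hypothetical helper name, not part of alta_count. It also strips the commas, which is handy if you want to compare counts numerically.

```perl
use strict;
use warnings;

# Hypothetical fragment of a result page; the real markup may differ.
my $content = '<b>We found 8,720 results:</b>';

# Hypothetical helper: returns the count with commas removed,
# or undef (in scalar context) if the standard phrase isn't found.
sub extract_count {
  my ($html) = @_;
  return unless $html =~ m/>We found ([0-9,]+) results?/;
  (my $count = $1) =~ tr/,//d;   # "8,720" becomes "8720"
  return $count;
}

my $count = extract_count($content);
print defined $count ? "count: $count\n" : "no count found\n";
# Prints: count: 8720
```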

Example 2-6 is the complete alta_count program.

Example 2-6. The alta_count program

#!/usr/bin/perl -w
use strict;
use URI::Escape;
foreach my $word (@ARGV) {
  next unless length $word; # sanity-checking
  my $url = 'http://www.altavista.com/sites/search/web?q=%22'
    . uri_escape($word) . '%22&kl=XX';
  my ($content, $status, $is_success) = do_GET($url);
  if (!$is_success) {
    print "Sorry, failed: $status\n";
  } elsif ($content =~ m/>We found ([0-9,]+) results?/) { # like "1,952"
    print "$word: $1 matches\n";
  } else {
    print "$word: Page not processable, at $url\n";
  }
  sleep 2; # Be nice to AltaVista's servers!!!
}

# And then my favorite do_GET routine:
use LWP; # loads lots of necessary classes.
my $browser;
sub do_GET {
  $browser = LWP::UserAgent->new unless $browser;
  my $resp = $browser->get(@_);
  return ($resp->content, $resp->status_line, $resp->is_success, $resp)
    if wantarray;
  return unless $resp->is_success;
  return $resp->content;
}

With that, I can run:

% alta_count boytoy 'boy toy'
boytoy: 6,290 matches
boy toy: 26,100 matches

knowing that when it searches for the frequency of "boy toy," it is duly URL-encoding the space character.

This approach to HTTP GET query parameters, where we insert one or two values into an otherwise precooked URL, works fine for most cases. For a more general approach (where we produce the part after the ? completely from scratch in the URL), see Chapter 5, "Forms".
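
As a rough preview of that more general approach, here is a sketch that uses the URI class's query_form method to build the whole query string from key/value pairs. The parameter names q and kl are simply the ones seen in the AltaVista URL above; this is one possible way to do it, not necessarily how Chapter 5 presents it.

```perl
use strict;
use warnings;
use URI;   # comes in the same distribution as URI::Escape

# Start from the URL without any query part...
my $url = URI->new('http://www.altavista.com/sites/search/web');

# ...and let query_form do the escaping of each key and value.
$url->query_form(
  'q'  => '"boy toy"',   # the double quotes are part of the value
  'kl' => 'XX',
);

print "$url\n";
```

Note that query_form encodes a space as "+" rather than as "%20"; both forms are conventionally accepted in query strings.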