2.3. LWP::Simple

GET is the simplest and most common type of HTTP request. Form parameters may be supplied in the URL, but there is never a body to the request. The LWP::Simple module has several functions for quickly fetching a document with a GET request. Some functions return the document, others save or print the document.

2.3.1. Basic Document Fetch

The LWP::Simple module's get( ) function takes a URL and returns the body of the document:

$document = get("http://www.suck.com/daily/2001/01/05/1.html");

If the document can't be fetched, get( ) returns undef. Incidentally, if LWP requests that URL and the server replies that it has moved to some other URL, LWP requests that other URL and returns the document it finds there.
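
Because get( ) signals failure only by returning undef, it is worth testing the return value before using it. The following is a minimal sketch of that check, not a recommended library routine: the fetch_or_die( ) name is made up for illustration, and the URLs to fetch are assumed to come from the command line.

```perl
use strict;
use warnings;
use LWP::Simple;

# Hypothetical helper: return the body of $url, or die if get() gave us undef.
sub fetch_or_die {
  my ($url) = @_;
  my $document = get($url);
  die "Couldn't fetch $url\n" unless defined $document;
  return $document;
}

# Fetch every URL named on the command line and report its size.
foreach my $url (@ARGV) {
  my $document = fetch_or_die($url);
  print "$url: ", length($document), " characters\n";
}
```

Note that an empty document is a defined (but false) value, which is why the test uses defined rather than plain truth.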

With LWP::Simple's get( ) function, there's no way to set headers to be sent with the GET request or to get more information about the response, such as the status code. These are important things, because some web servers have copies of documents in different languages and use the HTTP Accept-Language request header to determine which document to return. Likewise, the HTTP response code can let us distinguish between permanent failures (e.g., "404 Not Found") and temporary failures ("503 Service [Temporarily] Unavailable").

Even the most common type of nontrivial web robot, a link checker, benefits from access to response codes. A 403 ("Forbidden," usually because of file permissions) could be automatically corrected, whereas a 404 ("Not Found") error implies an out-of-date link that requires fixing. But if you want access to these codes or other parts of the response besides just the main content, your task is no longer a simple one, and so you shouldn't use LWP::Simple for it. The "simple" in LWP::Simple refers not just to the style of its interface, but also to the kind of tasks for which it's meant.

2.3.2. Fetch and Store

One way to get the status code is to use LWP::Simple's getstore( ) function, which writes the document to a file and returns the status code from the response:

$status = getstore("http://www.suck.com/daily/2001/01/05/1.html",
                   "/tmp/web.html");

There are two problems with this. The first is that the document is now stored in a file instead of in a variable where you can process it (extract information, convert to another format, etc.). This is readily solved by reading the file using Perl's built-in open( ) and <FH> operators; see below for an example.

The other problem is that a status code by itself isn't very useful: how do you know whether it was successful? That is, does the file contain a document? LWP::Simple offers the is_success( ) and is_error( ) functions to answer that question:

$successful = is_success(status);
$failed     = is_error(status);

If the status code you pass in (shown here as status) indicates a successful request (that is, it is in the 200-299 range), is_success( ) returns true. If it indicates an error (400-599), is_error( ) returns true. For example, this bit of code saves the BookTV (CSPAN2) listings schedule and emits a message if Gore Vidal is mentioned:

use strict;
use warnings;
use LWP::Simple;
my $url  = 'http://www.booktv.org/schedule/';
my $file = 'booktv.html';
my $status = getstore($url, $file);
die "Error $status on $url" unless is_success($status);
# Three-argument open with a lexical filehandle
open(my $in, '<', $file) || die "Can't open $file: $!";
while (<$in>) {
  if (m/Gore\s+Vidal/) {
    print "Look!  Gore Vidal!  $url\n";
    last;
  }
}
close($in);
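
The status code that getstore( ) returns can also be sorted into broader classes, which is a first step toward telling one kind of failure from another. Here is a minimal sketch using only is_success( ) and is_error( ); the describe_status( ) name and its labels are made up for illustration, and treating every code of 500 or above as a "server error" is a simplifying assumption.

```perl
use strict;
use warnings;
use LWP::Simple;   # also exports is_success() and is_error()

# Hypothetical helper: turn a numeric status code into a rough description.
sub describe_status {
  my ($status) = @_;
  return 'success'      if is_success($status);
  return 'server error' if is_error($status) && $status >= 500;  # assumption: 5xx
  return 'client error' if is_error($status);                    # the remaining 4xx
  return 'neither success nor error';                            # e.g., redirects
}

# Try it on a few sample codes; no network access is needed.
foreach my $status (200, 301, 404, 503) {
  print "$status: ", describe_status($status), "\n";
}
```

In a real program, the argument would be the value returned by getstore( ) rather than a hardcoded number.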

2.3.3. Fetch and Print

LWP::Simple also exports the getprint( ) function:

$status = getprint(url);

The document is printed to the currently selected output filehandle (usually STDOUT). In other respects, it behaves like getstore( ). This can be very handy in one-liners such as:

% perl -MLWP::Simple -e "is_success(getprint('http://cpan.org/RECENT')) or die" | grep Apache

That retrieves http://cpan.org/RECENT, which lists the past week's uploads in CPAN (it's a plain text file, not HTML), then sends it to STDOUT, where grep passes through the lines that contain "Apache."
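
Keep in mind that the value getprint( ) returns is the HTTP status code, which is a true value even for a failure such as 404, so a program that cares about failure should pass it to is_success( ). The following is a minimal sketch of that; the print_checked( ) name is made up for illustration, and the URLs are assumed to come from the command line.

```perl
use strict;
use warnings;
use LWP::Simple;

# Hypothetical helper: print the document at $url to the currently selected
# filehandle, and return true only if the status code indicates success.
sub print_checked {
  my ($url) = @_;
  my $status = getprint($url);
  return is_success($status);
}

# Print every URL named on the command line, complaining about failures.
foreach my $url (@ARGV) {
  print_checked($url) or warn "Error fetching $url\n";
}
```

Saved as a script, this can be piped through grep just as the one-liner is.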

2.3.4. Previewing with HEAD

LWP::Simple also exports the head( ) function, which asks the server, "If I were to request this item with GET, what headers would it have?" This is useful when you are checking links. Although not all servers support HEAD requests properly, if head( ) says the document is retrievable, then it almost definitely is. (However, if head( ) says it's not, that might just be because the server doesn't support HEAD requests.)

The return value of head( ) depends on whether you call it in scalar context or list context. In scalar context, it is simply:

$is_success = head(url);

If the server answers the HEAD request with a successful status code, this returns a true value. Otherwise, it returns a false value. You can use this like so:

die "I don't think I'll be able to get $url" unless head($url);

Regrettably, however, some old servers, and most CGIs running on newer servers, do not understand HEAD requests. In that case, they should reply with a "405 Method Not Allowed" message, but some actually respond as if you had performed a GET request. With the minimal interface that head( ) provides, you can't really deal with either of those cases, because you can't get the status code on unsuccessful requests, nor can you get the content (of which, in theory, there should never be any).
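
One workaround that stays within LWP::Simple is to treat a failed HEAD as inconclusive and retry with a full GET. This is only a sketch under that assumption, not something LWP::Simple does for you: the is_reachable( ) name is made up for illustration, and the fallback costs a complete download of the document whenever HEAD fails.

```perl
use strict;
use warnings;
use LWP::Simple;

# Hypothetical helper: true if the document at $url appears to be retrievable.
sub is_reachable {
  my ($url) = @_;
  return 1 if head($url);               # scalar context: true on success
  return defined(get($url)) ? 1 : 0;    # HEAD failed; a GET may still work
}

# Check every URL named on the command line.
foreach my $url (@ARGV) {
  print "$url: ", (is_reachable($url) ? "ok" : "not retrievable"), "\n";
}
```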

In list context, head( ) returns a list of five values, if the request is successful:

(content_type, document_length, modified_time, expires, server)
    = head(url);

The content_type value is the MIME type string of the form type/subtype; the most common MIME types are listed in Appendix C, "Common MIME Types". The document_length value is whatever is in the Content-Length header, which, if present, should be the number of bytes in the document that you would have gotten if you'd performed a GET request. The modified_time value is the contents of the Last-Modified header converted to a number like you would get from Perl's time( ) function. For normal files (GIFs, HTML files, etc.), the Last-Modified value is just the modification time of that file, but dynamically generated content will not typically have a Last-Modified header.

The last two values are rarely useful; the expires value is a time (expressed as a number like you would get from Perl's time( ) function) from the seldom used Expires header, indicating when the data should no longer be considered valid. The server value is the contents of the Server header line that the server can send, to tell you what kind of software it's running. A typical value is Apache/1.3.22 (Unix).

An unsuccessful request, in list context, returns an empty list. So when you're copying the return list into a bunch of scalars, they will each get assigned undef. Note also that you don't need to save all the values—you can save just the first few, as in Example 2-4.

Example 2-4. Link checking with HEAD

use strict;
use LWP::Simple;
foreach my $url (
  'http://us.a1.yimg.com/us.yimg.com/i/ww/m5v9.gif',
  'http://hooboy.no-such-host.int/',
  'http://www.yahoo.com',
  'http://www.ora.com/ask_tim/graphics/asktim_header_main.gif',
  'http://www.guardian.co.uk/',
  'http://www.pixunlimited.co.uk/siteheaders/Guardian.gif',
) {
  print "\n$url\n";

  my ($type, $length, $mod) = head($url);
  # so we don't even save the expires or server values!

  unless (defined $type) {
    print "Couldn't get $url\n";
    next;
  }
  print "That $type document is ", $length || "???", " bytes long.\n";
  if ($mod) {
    my $ago = time( ) - $mod;
    print "It was modified $ago seconds ago; that's about ",
      int(.5 + $ago / (24 * 60 * 60)), " days ago, at ",
      scalar(localtime($mod)), "!\n";
  } else {
    print "I don't know when it was last modified.\n";
  }
}

Currently, that program prints the following, when run:

http://us.a1.yimg.com/us.yimg.com/i/ww/m5v9.gif
That image/gif document is 5611 bytes long.
It was modified 251207569 seconds ago; that's about 2907 days ago, at Thu Apr 14 18:00:00 1994!

http://hooboy.no-such-host.int/
Couldn't get http://hooboy.no-such-host.int/

http://www.yahoo.com
That text/html document is ??? bytes long.
I don't know when it was last modified.

http://www.ora.com/ask_tim/graphics/asktim_header_main.gif
That image/gif document is 8588 bytes long.
It was modified 62185120 seconds ago; that's about 720 days ago, at Mon Apr 10 12:14:13 2000!

http://www.guardian.co.uk/
That text/html document is ??? bytes long.
I don't know when it was last modified.

http://www.pixunlimited.co.uk/siteheaders/Guardian.gif
That image/gif document is 4659 bytes long.
It was modified 24518302 seconds ago; that's about 284 days ago, at Wed Jun 20 11:14:33 2001!

Incidentally, if you are using the very popular CGI.pm module, be aware that it exports a function called head( ) too. To avoid a clash, you can just tell LWP::Simple to export every function it normally would except for head( ):

use LWP::Simple qw(!head);
use CGI qw(:standard);

If not for that qw(!head), LWP::Simple would export head( ), then CGI would export head( ) (as it's in that module's :standard group), which would clash, producing a mildly cryptic warning such as "Prototype mismatch: sub main::head ($) vs none." Because any program using the CGI library is almost definitely a CGI script, any such warning (or, in fact, any message to STDERR) is usually enough to abort that CGI with a "500 Internal Server Error" message.
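
Excluding head( ) from the import list doesn't remove it from LWP::Simple; it can still be called by its fully qualified name. The following minimal sketch shows that. It leaves out the use CGI line so that it depends only on LWP (in a real CGI script, that line would follow the use LWP::Simple line), and it assumes the URLs come from the command line.

```perl
use strict;
use warnings;
use LWP::Simple qw(!head);   # everything LWP::Simple normally exports, minus head()

foreach my $url (@ARGV) {
  # get() and friends were imported as usual, but head() was not,
  # so we call it by its full name.
  my ($type, $length) = LWP::Simple::head($url);
  if (defined $type) {
    print "$url is $type, ", $length || "???", " bytes long\n";
  } else {
    print "Couldn't get headers for $url\n";
  }
}
```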

So, the example URL in get("http://www.suck.com/daily/2001/01/05/1.html") was just some random fun, but it generated a flurry of angry letters to the publisher!  Well, not exactly.  It was a single letter from a reader.  Or, really, it was one short email message from someone with a bellsouth.net address.  He said:

From: [some guy]@bellsouth.net
Subject: Perl LWP: Poor quality web page selection

I recently purchased Perl & LWP, written by Sean Burke. The technical
content is fine.
However, I feel there was a poor selection with web pages in perl lwp
examples in the book.
Page 20 and several other pages give technical examples with the web page http://www.suck.com.
I was dissapointed in Orielly books when I saw the content of www.suck.com.

Please keep you books your clean.


Presumably his poor grammar, spelling, and formatting are due simply to the supreme shock that he had just suffered in seeing such things as appear at suck.com.  We must imagine him barely able to type at all, even once his valet had carried him to the fainting-couch, loosened his ascot, and fanned him nervously.

And I confess that I have failed to, uh, keep me books my clean -- my RTF book contains a dirty limerick... in Latin!  Oh, I'm incorrigible!!