12.2. A User Agent for Robots

So far in this book, we've been using one type of user-agent object: objects of the class LWP::UserAgent. This is generally appropriate for a program that makes only a few undemanding requests of a remote server. But for cases in which we want to be quite sure that the robot behaves itself, the best way to start is by using LWP::RobotUA instead of LWP::UserAgent.

An LWP::RobotUA object is like an LWP::UserAgent object, with these exceptions:

Besides having all the attributes of an LWP::UserAgent object, an LWP::RobotUA object has one additional interesting attribute, $robot->delay($minutes), which controls how long this object should wait between requests to the same host. The current default value is one minute. Note that you can set it to a non-integer number of minutes. For example, to set the delay to seven seconds, use $robot->delay(7/60).
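
As a minimal sketch of the delay attribute, the following sets the delay from a number of seconds and reads it back. The bot name and contact address are hypothetical placeholders, not a real robot, and no request is made, so nothing is fetched:

```perl
use strict;
use warnings;
use LWP::RobotUA;

# Hypothetical bot name and contact address, for illustration only.
my $robot = LWP::RobotUA->new('ExampleBot/0.1', 'me@example.int');

# delay() takes minutes, so divide a number of seconds by 60.
my $seconds = 7;
$robot->delay($seconds / 60);

# Called with no argument, delay() returns the current setting in minutes.
printf "Delay is %.1f seconds\n", $robot->delay() * 60;
```

Keeping the value in a $seconds variable and dividing at the call site is only one way to make the units obvious; writing 7/60 directly works just as well.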

So we can take our New York Times program from Chapter 11, "Cookies, Authentication, and Advanced Requests," and make it into a scrupulously well-behaved robot by changing this one line:

my $browser = LWP::UserAgent->new( );

to this:

use LWP::RobotUA;
my $browser = LWP::RobotUA->new(
  'JamiesNYTBot/1.0',
  'jamie@newsjunkie.int' # my address
);
$browser->delay(5/60); # 5-second delay between requests

We may not notice any particular effect on how the program behaves, but it makes quite sure that the $browser object won't perform its requests too quickly, nor request anything the Times's webmaster thinks robots shouldn't request.

In new programs, I typically use $robot as the variable for holding LWP::RobotUA objects instead of $browser. But this is a merely cosmetic difference; nothing requires us to replace every $browser with $robot in the Times program when we change it from using an LWP::UserAgent object to an LWP::RobotUA object.

You can freely use LWP::RobotUA anywhere you could use LWP::UserAgent, in a Type One or Type Two spider. And you really should use LWP::RobotUA as the basis for any Type Three or Type Four spiders. You should use it not just so you can effortlessly abide by robots.txt rules, but also so that you don't have to remember to write sleep statements all over your programs to keep them from using too much of the remote server's bandwidth, or yours!
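
To illustrate the point about sleep statements, here is a rough sketch of a loop that requests several pages from one host. The bot name, contact address, and URLs are placeholders invented for this example, not taken from the Times program. With a plain LWP::UserAgent object, we would need a sleep call inside the loop; with an LWP::RobotUA object, the loop body needs none:

```perl
use strict;
use warnings;
use LWP::RobotUA;

# Placeholder identity and URLs, for illustration only.
my $robot = LWP::RobotUA->new('ExampleBot/0.1', 'me@example.int');
$robot->delay(5/60);   # 5-second delay between requests to the same host

my @urls = map { "http://www.example.int/page$_.html" } 1 .. 3;

foreach my $url (@urls) {
    # No sleep() here: the robot object waits as needed before each
    # request, and declines URLs that the host's robots.txt forbids.
    my $response = $robot->get($url);
    print $url, ": ", $response->status_line, "\n";
}
```

Because the placeholder host does not exist, each request here should simply report a failure status; substitute real URLs that you have permission to fetch.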