Chapter 12. Spiders

Contents:

Types of Web-Querying Programs
A User Agent for Robots
Example: A Link-Checking Spider
Ideas for Further Expansion

So far we have focused on the mechanics of getting and parsing data off the Web, just a page here and a page there, without much attention to the ramifications. In this chapter, we consider the issues that arise when you write programs that send more than a few requests to a given web site. Then we move on to how to write recursive web user agents, or spiders. With these skills, you'll be able to write programs that automatically navigate web sites, from simple link checkers to powerful bulk-download tools.

12.1. Types of Web-Querying Programs

Let's say your boss comes to you and says "I need you to write a spider." What does he mean by "spider"? Is he talking about the simple one-page screen scrapers we wrote in earlier chapters? Or does he want to extract many pages from a single server? Or maybe he wants you to write a new Google, which attempts to find and download every page on the Web. Roughly speaking, there are four kinds of programs that make requests to web servers:

Type One Requester
This program requests a couple items from a server, knowing ahead of time the URL of each. An example of this is our program in Chapter 7, "HTML Processing with Tokens" that requested just the front page of the BBC News web site.
Type Two Requester
This program requests a few items from a server, then requests the pages to which those items link (or possibly just a subset of them). An example of this is the program we alluded to in Chapter 11, "Cookies, Authentication, and Advanced Requests" that would download the front page of the New York Times web site, then download every story URL that appeared there.
Type Three Requester
This single-site spider requests what's at a given URL, finds links on that page that are on the same host, and requests those. Then, for each of those, it finds links to things on the same host, and so on, until potentially it visits every URL on the host. (A minimal sketch of this kind of traversal follows this list.)
Type Four Requester
This host-spanning spider requests what's at a given URL, finds links on that page that are anywhere on the Web, and requests those. Then, for each of those, it finds links to things anywhere on the Web (or at least things that are accessed via HTTP) and so on, until it visits every URL on the Web, in theory.
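
To make the difference between the last two types concrete, here is a minimal sketch of the core loop of a single-site spider, using LWP::UserAgent, HTML::LinkExtor, and URI. It is not the robot-aware spider we develop later in this chapter, and the starting URL and user-agent name are just placeholders. The only thing that makes it a Type Three Requester instead of a Type Four is the one-line same-host test near the bottom.

    use strict;
    use warnings;
    use LWP::UserAgent;
    use HTML::LinkExtor;
    use URI;

    my $start = URI->new('http://www.example.com/');   # placeholder starting URL
    my $ua = LWP::UserAgent->new;
    $ua->agent('MyTestSpider/0.1');                     # hypothetical agent name

    my @queue = ($start);
    my %seen  = ($start => 1);

    while (my $url = shift @queue) {
      my $response = $ua->get($url);
      next unless $response->is_success
              and $response->content_type eq 'text/html';

      # Pull every link out of the page we just fetched.
      my $extor = HTML::LinkExtor->new;
      $extor->parse($response->content);

      for my $link ($extor->links) {
        my ($tag, %attr) = @$link;
        next unless $tag eq 'a' and defined $attr{href};

        # Resolve relative links against the page's base URL.
        my $abs = URI->new_abs($attr{href}, $response->base)->canonical;
        next unless $abs->scheme eq 'http';
        $abs->fragment(undef);          # ignore #fragment differences
        next if $seen{$abs}++;

        # This same-host test is what makes it a single-site (Type Three)
        # spider; remove it and you have a host-spanning (Type Four) spider.
        push @queue, $abs if $abs->host eq $start->host;
      }
      sleep 1;    # pause between requests so we don't hammer the server
    }

The %seen hash keeps the spider from requesting the same page twice, and the queue gives a breadth-first traversal; any real spider needs both ideas in some form.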

From each of the above types to the next, there is an added bit of logic that radically changes the scope and nature of the program.

A Type One Requester makes only a few requests. This is not normally a noticeable imposition on the remote server, unless one of these requests is for a document that's very large or that has to be dynamically generated with great difficulty.
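
A Type One Requester can be as small as the following sketch, which fetches two documents whose URLs are known ahead of time; the URLs here are placeholders, not the addresses used in the earlier chapters.

    use strict;
    use warnings;
    use LWP::Simple qw(get);

    # Two documents whose URLs we know ahead of time (placeholders).
    my @known_urls = (
      'http://www.example.com/index.html',
      'http://www.example.com/weather.html',
    );

    for my $url (@known_urls) {
      my $content = get($url);
      if (defined $content) {
        printf "%s is %d bytes long\n", $url, length $content;
      } else {
        warn "Couldn't get $url\n";
      }
    }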

A Type Two Requester places rather more burden on the remote server, simply because it generates many more requests. For example, our New York Times story downloader in Chapter 11, "Cookies, Authentication, and Advanced Requests" downloads not one or two pages, but several dozen. Because we don't want this to burden the Times's servers, we considerately call sleep(2) after every request.
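
That fetch-then-pause pattern is simple to write. The sketch below assumes a hypothetical @story_urls array that has already been scraped off a front page; the URLs and the user-agent name are placeholders, not the actual Chapter 11 program.

    use strict;
    use warnings;
    use LWP::UserAgent;

    my $ua = LWP::UserAgent->new;
    $ua->agent('MyStoryFetcher/0.1');   # hypothetical user-agent name

    # @story_urls stands in for the absolute story URLs already pulled
    # off the front page.
    my @story_urls = (
      'http://www.example.com/stories/001.html',
      'http://www.example.com/stories/002.html',
    );

    for my $url (@story_urls) {
      my $response = $ua->get($url);
      if ($response->is_success) {
        printf "%s: %d bytes\n", $url, length $response->content;
      } else {
        warn "$url failed: " . $response->status_line . "\n";
      }
      sleep 2;    # wait two seconds between requests, out of politeness
    }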

In fact, that probably makes our program much kinder to the remote server than a typical browser would be. Typically, browsers create several simultaneous connections when downloading all the various images, stylesheets, and applets they need to render a given web page. However, a typical session with a graphical browser doesn't involve downloading so many different pages.

Note that with this sort of program, the scope is clearly finite; it processes only the presumably small number of links that appear on a few pages. So there is no real chance of the program surprising you by requesting vastly more pages than you'd expect. For example, if you run your program that downloads links off the New York Times's front page, it downloads just those and that's it. If you run it today and the total count of downloaded pages is 45, you can assume that when you run it tomorrow, it will be about that many: maybe 30, 60, maybe even 70, but not 700 or 70,000. Moreover, when you see that the average length of each story downloaded is 30 KB, you can assume that it's unlikely for any future story to be 100 KB, and extremely unlikely for any to be 10 MB.

But a Type Three Requester is the first kind that could potentially go seriously awry. Previously, we could make safe assumptions about the nature of the pages whose links we were downloading. But when a program (or, specifically, a spider, as we can freely call these sorts of recursive programs) can request anything and everything on the server, it will be visiting pages we know nothing about and about which we can't make any assumptions. For example, suppose we request the main page of our local paper's web site, and suppose that it links to a local events calendar for this month. If the events calendar is dynamically generated from a database, this month's page probably has a link to next month's page, and next month's to the month after, and so on forever, probably regardless of whether each "next month" actually has any events in it. So if you wrote a spider that wouldn't stop until it had requested every object on the server, then on a server like this it would never stop, because the number of pages is effectively infinite. In webmaster jargon, these are referred to as "infinite URL spaces."
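
There is no foolproof cure for an infinite URL space, but a spider can at least protect itself with a few crude limits. The sketch below shows two such guards, a ceiling on the number of requests per run and a limit on how deeply a URL's path may nest; the particular numbers and the ok_to_request name are arbitrary choices for illustration, not part of any standard interface.

    use strict;
    use warnings;
    use URI;

    my $MAX_REQUESTS = 500;   # arbitrary per-run ceiling
    my $MAX_SEGMENTS = 6;     # arbitrary limit on path depth
    my $requests_so_far = 0;

    # A spider's main loop would call this before each request.
    sub ok_to_request {
      my ($url) = @_;
      return 0 if $requests_so_far >= $MAX_REQUESTS;   # out of budget
      my @segments = URI->new($url)->path_segments;
      return 0 if @segments > $MAX_SEGMENTS;           # suspiciously deep path
      $requests_so_far++;
      return 1;
    }

    if (ok_to_request('http://www.example.com/calendar/2002/06/')) {
      print "would fetch\n";
    } else {
      print "would skip\n";
    }

Neither guard is clever, but either one keeps a single run of the spider from going on forever.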

A Type Four Requester has all the problems of Type Threes, except that instead of running the risk of annoying just the webmaster of the local paper, it can annoy any number of webmasters all over the world. Just one of the many things that can go wrong with this kind of host-spanning spider is that it sees a link to Yahoo!. It will follow that link, then start recursing through all of Yahoo! and visiting every site to which Yahoo! links. Because these sorts of spiders typically demand immense resources and are not "general purpose" by any means, we will not be discussing them.

If you are interested in this type of spider, you should read this chapter to understand the basic ideas of single-site spiders, then read Totty et al.'s HTTP: The Definitive Guide (O'Reilly), which goes into great detail on the special problems that await large-scale spiders.