Chapter 1. Introduction to Web Automation

Contents:

The Web as Data Source
History of LWP
Installing LWP
Words of Caution
LWP in Action

LWP (short for "Library for World Wide Web in Perl") is a set of Perl modules and object-oriented classes for getting data from the Web and for extracting information from HTML. This chapter provides essential background on the LWP suite. It describes the nature and history of LWP, which platforms it runs on, and how to download and install it. This chapter ends with a quick walkthrough of several LWP programs that illustrate common tasks, such as fetching web pages, extracting information using regular expressions, and submitting forms.

1.1. The Web as Data Source

Most web sites are designed for people. User Interface gurus consult for large sums of money to build HTML code that is easy to use and displays correctly on all browsers. User Experience gurus wag their fingers and tell web designers to study their users, so they know the human foibles and desires of the ape descendants who will be viewing the web site.

Fundamentally, though, a web site is home to data and services. A stockbroker has stock prices and the value of your portfolio (data) and forms that let you buy and sell stock (services). Amazon has book ISBNs, titles, authors, reviews, prices, and rankings (data) and forms that let you order those books (services).

It's assumed that the data and services will be accessed by people viewing the rendered HTML. But many a programmer has eyed those data sources and services on the Web and thought "I'd like to use those in a program!" For example, they could page you when your portfolio falls below a certain point or could calculate the "best" book on Perl based on the ratio of its price to its average reader review.

LWP lets you do this kind of web automation. With it, you can fetch web pages, submit forms, authenticate, and extract information from HTML. Once you've used it to grab news headlines or check links, you'll never view the Web in the same way again.

As with everything in Perl, there's more than one way to automate accessing the Web. In this book, we'll show you everything from the basic way to access the Web (via the LWP::Simple module), through forms, all the way to the gory details of cookies, authentication, and other types of complex requests.
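
Just to give you the flavor, here's a minimal sketch of that basic way. The LWP::Simple calls are real; the URL is only a placeholder, so substitute any page you like:

    #!/usr/bin/perl
    # Fetch a page with LWP::Simple and print its HTML source.
    use strict;
    use warnings;
    use LWP::Simple;

    my $url = 'http://www.example.com/';    # placeholder URL
    my $content = get($url);
    die "Couldn't fetch $url\n" unless defined $content;
    print $content;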

1.1.1. Screen Scraping

Once you've tackled the fundamentals of how to ask a web server for a particular page, you still have to find the information you want, buried in the HTML response. Most often you won't need more than regular expressions to achieve this. Chapter 6, "Simple HTML Processing with Regular Expressions", describes the art of extracting information from HTML using regular expressions, although you'll see the beginnings of it as early as Chapter 2, "Web Basics", where we query AltaVista for a word and use a regexp to match the number in the response that says "We found [number] results."
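
As a sketch of that tactic, here's the shape such a program takes. The URL and the exact "We found ... results" wording are assumptions about one particular results page; you'd adjust both for whatever site you're actually querying:

    #!/usr/bin/perl
    # Fetch a search-results page and pull out the hit count with a regexp.
    use strict;
    use warnings;
    use LWP::Simple;

    my $content = get('http://www.altavista.com/web/results?q=perl');
    die "Couldn't fetch the results page\n" unless defined $content;

    if ($content =~ m{We found ([0-9,]+) results}) {
        print "Number of results: $1\n";
    } else {
        print "Couldn't find a hit count in the page.\n";
    }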

The more discerning LWP connoisseur, however, treats the HTML document as a stream of tokens (Chapter 7, "HTML Processing with Tokens", with an extended example in Chapter 8, "Tokenizing Walkthrough") or as a parse tree (Chapter 9, "HTML Processing with Trees"). For example, you'll use a token view and a tree view to consider such tasks as how to catch <img...> tags that are missing some of their attributes, how to get the absolute URLs of all the headlines on the BBC News main page, and how to extract content from one web page and insert it into a different template.
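
Here's a minimal sketch of the token view, reporting <img> tags that lack an alt attribute. Again, the URL is only a placeholder:

    #!/usr/bin/perl
    # Scan a page's token stream for <img> tags missing an alt attribute.
    use strict;
    use warnings;
    use LWP::Simple;
    use HTML::TokeParser;

    my $url = 'http://www.example.com/';    # placeholder URL
    my $content = get($url);
    die "Couldn't fetch $url\n" unless defined $content;

    my $stream = HTML::TokeParser->new(\$content);
    while (my $token = $stream->get_token) {
        # A start-tag token looks like ["S", $tag, \%attr, \@attrseq, $text].
        next unless $token->[0] eq 'S' and $token->[1] eq 'img';
        print "Missing alt: $token->[4]\n" unless exists $token->[2]{alt};
    }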

In the old days of 80x24 terminals, "screen scraping" referred to the art of programmatically extracting information from the screens of interactive applications. That term has been carried over to mean the act of automatically extracting data from the output of any system that was basically designed for interactive use. That's the term used for getting data out of HTML that was meant to be looked at in a browser, not necessarily extracted for your programs' use.

1.1.2. Brittleness

In some lucky cases, your LWP-related task consists of downloading a file without requiring your program to parse it in any way. But most tasks involve having to extract a piece of data from some part of the returned document, using the screen-scraping tactics mentioned earlier. An unavoidable problem is that the format of most web content can change at any time. For example, in Chapter 8, "Tokenizing Walkthrough", I discuss the task of extracting data from the program listings at the web site for the radio show Fresh Air. The principle I demonstrate for that specific case is true for all extraction tasks: no pattern in the data is permanent, so any data-parsing program will be "brittle."

For example, if you want to match text in section headings, you can write your program to depend on them being inside <h2>...</h2> tags, but tomorrow the site's template could be redesigned, and headings could then be in <h3 class='hdln'>...</h3> tags, at which point your program won't see anything it considers a section heading. In practice, any given site's template won't change on a daily basis (nor even yearly, for most sites), but as you read this book and see examples of data extraction, bear in mind that each solution can't be the solution, but is just a solution, and a temporary and brittle one at that.
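
To see the failure mode concretely, consider this little sketch (the sample HTML is invented):

    #!/usr/bin/perl
    # A pattern that silently breaks when the site's markup changes.
    use strict;
    use warnings;

    my $html = "<h2>First Section</h2> <p>...</p> <h2>Second Section</h2>";
    my @headings = $html =~ m{<h2>(.*?)</h2>}gis;
    print "Heading: $_\n" for @headings;

    # After a redesign to <h3 class='hdln'>...</h3> headings, this same
    # pattern matches nothing, and the program finds no headings at all.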

As somewhat of a lesson in brittleness, in this book I show you data from various web sites (Amazon.com, the BBC News web site, and many others) and show how to write programs to extract data from them. However, that code is fragile. Some sites get redesigned only every few years; Amazon.com seems to change something every few weeks. So while I've made every effort to provide accurate code for the web sites as they exist at the time of this writing, I hope you will consider the programs in this book valuable as learning tools even after the sites have changed beyond recognition.

1.1.3. Web Services

Programmers have begun to realize the great value in automating transactions over the Web. There is now a booming industry in web services, the buzzword for data or services offered over the Web. What differentiates web services from web sites is that web services don't emit HTML for the ultimate reading pleasure of humans; they emit XML for programs.

This removes the need to scrape information out of HTML, neatly solving the problem of ever-changing web sites made brittle by the fickle tastes of the web-browsing public. Some web services standards (SOAP and XML-RPC) even make the remote web service appear to be a set of functions you call from within your program—if you use a SOAP or XML-RPC toolkit, you don't even have to parse XML!
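
For instance, here's a sketch with XMLRPC::Lite (part of the SOAP::Lite distribution). The endpoint and method are the stock examples from that module's documentation, so don't count on the service still being there:

    #!/usr/bin/perl
    # Call a remote procedure as if it were a local function; the toolkit
    # builds the XML request and parses the XML response for you.
    use strict;
    use warnings;
    use XMLRPC::Lite;

    my $name = XMLRPC::Lite
        ->proxy('http://betty.userland.com/RPC2')    # example server from the docs
        ->call('examples.getStateName', 41)
        ->result;
    print "State #41 is $name\n";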

However, there will always be information on the Web that isn't accessible as a web service. For that information, screen scraping is the only choice.

When I wrote the above, in 2001 or so, RSS had been around for years, but hardly anyone actually used it for anything. It was a sort of Catch-22: why generate an RSS feed when almost nobody had an RSS reader, and why write an RSS reader when there was almost nothing interesting available as RSS?

Then, about 2001, web services appeared and looked like they were going to be one of those "this time for sure!" solutions, in which we'd all get a glorious Semantic Web, where you'd request and receive semantically marked-up information over some web service protocol like XML-RPC.

Well, it so happened that XML-RPC went on to be a handy way for big programs to talk about their big data, but RSS leapt ahead as the way for normal folks to get web data. RSS might not be an ideal of semantic-ness, but it's still much less encumbered than HTML.

Of course, solving one problem creates two more:

First, sometimes you then need to extract information from an RSS feed -- and at that point, just about everything I say in this book about extracting data from HTML applies just as well to RSS; there's a sketch of this below. (You may be tempted to use XSL; it's worth a try, but I find the language an odd mix of concise and exasperating.)

And secondly, sometimes you need to get data into RSS, from an HTML source that for one reason or another doesn't have a CMS that can be goaded into "easily" producing an RSS feed. At that point, you're right back into the topic of this book, harvesting and screen-scraping web pages, with just an extra step at the end where you save your data as RSS.
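
Here's the sketch promised above: pulling titles and links out of a feed with the XML::RSS module. The feed URL is a placeholder; any RSS feed should do:

    #!/usr/bin/perl
    # Parse an RSS feed and print each item's title and link.
    use strict;
    use warnings;
    use LWP::Simple;
    use XML::RSS;

    my $content = get('http://www.example.com/index.rss');    # placeholder feed
    die "Couldn't fetch the feed\n" unless defined $content;

    my $rss = XML::RSS->new;
    $rss->parse($content);
    print "Channel: $rss->{channel}{title}\n";
    for my $item (@{ $rss->{items} }) {
        print "  $item->{title}\n    $item->{link}\n";
    }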

To abuse an expression, the more things stay the same, the more they stay the same.