Elements of an HTML Form
LWP and GET Requests
Automating Form Analysis
Idiosyncrasies of HTML Forms
POST Example: License Plates
POST Example: ABEBooks.com
File Uploads
Limits on Forms
Much of the interesting data of the Web is accessible only through HTML forms. This chapter shows you how to write programs to submit form data and get the resulting page. In covering this unavoidably complex topic, we consider packing form data into GET and POST requests, how each type of HTML form element produces form data, and how to automate the process of submitting form data and processing the responses.
The basic model for the Web is that the typical item is a "document" with a known URL, and when you want to access it (whether it's the Rhoda episode guide, or the front page of today's Boston Globe), you just get it, no questions asked. Even when there are cookies or HTTP authentication involved, these are basically just addenda to the process of requesting the known URL from the appropriate server. But some web resources require parameters beyond just their URL, parameters that are generally fed in by the user through HTML forms, and that the browser then sends either as dynamic parts of a URL (in the case of a GET request) or as content of a POST request.
A program on the receiving end of form data may simply use it as a query for searching other data, such as scanning all the RFCs and listing the ones by specific authors. Or a program may store the data, as with taking the user's data and saving it as a new post to a message base. Or a program may do grander things with the user-provided data, such as debiting the credit card number provided, logging the products being ordered, and putting them on the roster of items to be sent out. The details of writing those kinds of programs are covered in uncountable books on CGI, mod_perl, ASP, and the like. You are probably familiar with writing server-side programs in at least one of these frameworks, probably through having written CGIs in Perl, maybe with the huge and hugely popular Perl library, CGI.pm.
But what we are interested in here is the process of data getting from HTML forms into those server-side programs. Once you understand that process, you can write LWP programs that simulate it, by providing the same kind of data that a real live user would provide by keying it into a real live browser.
A good example of a straightforward form is the U.S. Census Bureau's Gazetteer (geographical index) system. The search form, at http://www.census.gov/cgi-bin/gazetteer, consists of:
<form method=get action=/cgi-bin/gazetteer>
<hr noshade>
<h3>
<font size=+2>S</font>earch for a <font size=+2>P</font>lace in the
<font size=+2>US</font>
</h3>
<p>
Name: <input name="city" size=15>
State (optional): <input name="state" size=3><br>
or a 5-digit zip code: <input name="zip" size=8>
<p>
<input type="submit" value="Search">
</form>
The interesting bits are the <form> tag, with its method and action attributes, and the <input> tags. The method attribute of the <form> tag says whether to use GET or POST to submit the form data. The action attribute gives the URL to receive the form data. The components of a form are text fields, drop-down lists, checkboxes, and so on, each identified by a name. Here the <input> tags define text fields with the names city, state, and zip, and a submit button labeled Search. The browser submits the state of the form components (what's been typed into the text boxes, which checkboxes are checked, which submit button you pressed) as a set of name=value pairs. If you typed "Dulce" into the city field, part of the browser's request for /cgi-bin/gazetteer would be city=Dulce.
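To make the name=value idea concrete, here is a minimal sketch of how the state of this form could be turned into encoded pairs with the URI module's query_form method. The field values are invented for illustration, and the scratch `http:` URI exists only as a convenient place to build the string; it is not a real address.

```perl
use strict;
use warnings;
use URI;

# Hypothetical field values, as a user might type them into the form.
# The submit button has no name attribute, so it contributes no pair.
my @form_state = (
    city  => 'Dulce',
    state => 'NM',
    zip   => '',     # left blank, but a text field is still submitted
);

# A scratch URI object used only to build the encoded pairs.
my $scratch = URI->new('http:');
$scratch->query_form(@form_state);

my $pairs = $scratch->query;
print "$pairs\n";
```

Passing the fields as a list rather than a hash keeps them in the order they appear in the form.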
Which part of the request contains the submitted name=value pairs depends on whether it's a GET or POST request. GET requests encode the pairs in the URL being requested, after a question mark, with each pair separated from the next by an ampersand (&) character. POST requests encode them in the body of the request instead; with the default form encoding (application/x-www-form-urlencoded), the body uses the same ampersand-separated format. In both cases the names and values are URL encoded.
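The difference in where the pairs travel can be seen by building both kinds of request without sending either. This is only a sketch: it uses HTTP::Request::Common to construct the request objects, the field values are invented, and the POST is purely hypothetical, since the gazetteer form itself specifies method=get.

```perl
use strict;
use warnings;
use HTTP::Request::Common qw(GET POST);
use URI;

# Hypothetical form data; the space in the city name needs URL encoding.
my @pairs  = (city => 'Las Cruces', state => 'NM');
my $action = 'http://www.census.gov/cgi-bin/gazetteer';

# GET: the encoded pairs become the query part of the URL.
my $url = URI->new($action);
$url->query_form(@pairs);
my $get = GET $url;

# POST: the URL stays bare, and the encoded pairs go in the request body.
my $post = POST $action, \@pairs;

# Nothing is sent over the network; we only inspect the request objects.
print $get->method,  ' ', $get->uri,  "\n";
print $post->method, ' ', $post->uri, "\n";
print 'Content-Type: ', $post->header('Content-Type'), "\n";
print 'Body: ', $post->content, "\n";
```

Either request object could then be handed to an LWP::UserAgent object's request method to actually submit it.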