The Hypertext Transfer Protocol (HTTP) is used to fetch most documents on the Web. It is formally specified in RFC 2616, but this section explains everything you need to know to use LWP.
HTTP is a client/server protocol: the server has the file, and the client wants it. In regular web surfing, the client is a web browser such as Mozilla or Internet Explorer. The URL for a document identifies the server; the browser contacts that server and requests the document from it. The server responds with either an error ("file not found") or success (in which case the document is attached).
Example 2-1 contains a sample request from a client.
GET /daily/2001/01/05/1.html HTTP/1.1
Host: www.suck.com
User-Agent: Super Duper Browser 14.6
[blank line]
A successful response is given in Example 2-2.
HTTP/1.1 200 OK
Content-type: text/html
Content-length: 24204
[blank line]
[and then 24,204 bytes of HTML code]
A response indicating failure is given in Example 2-3.
HTTP/1.1 404 Not Found
Content-type: text/html
Content-length: 135
[blank line]
<html><head><title>Not Found</title></head><body>
Sorry, the object you requested was not found.
</body></html>
[and then the server closes the connection]
The request line says what the client wants to do (the method), what it wants to do it to (the path), and what protocol it's speaking. Although the HTTP standard defines several methods, the most common are GET and POST. The path is part of the URL being requested (in Example 2-1 the path is /daily/2001/01/05/1.html). The protocol version is generally HTTP/1.1.
Each header line consists of a key and a value (for example, User-Agent: Super Duper Browser 14.6). In versions of HTTP before 1.1, header lines were optional. In HTTP 1.1, the Host: header must be present, to name the server to which the browser is talking. This is the "server" part of the URL being requested (e.g., www.suck.com). The headers are terminated with a blank line, which must be present regardless of whether there are any headers.
The optional message body can contain arbitrary data. If a body is sent, the request's Content-Type and Content-Length headers help the server decode the data. GET queries don't have any attached data, so this area is blank (that is, nothing is sent by the browser). For our purposes, only POST queries use this third part of the HTTP request.
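The three parts of a request described above can be sketched in a few lines of Python. This is a minimal illustration, not how LWP builds requests internally: the helper name build_request, the /search path, and the www.example.com host in the POST example are assumptions made up for the sketch. It also assumes the standard HTTP line ending (carriage return plus line feed), which the listings above do not show.

```python
CRLF = "\r\n"  # HTTP lines end with carriage return + line feed


def build_request(method, path, host, headers=None, body=b""):
    """Assemble a raw HTTP/1.1 request: request line, headers, blank line, body.

    Hypothetical helper for illustration only.
    """
    # Request line (method, path, protocol), then the mandatory Host header.
    lines = [f"{method} {path} HTTP/1.1", f"Host: {host}"]
    for name, value in (headers or {}).items():
        lines.append(f"{name}: {value}")
    if body:
        # A body needs Content-Length so the server knows where it ends.
        lines.append(f"Content-Length: {len(body)}")
    # The extra CRLF is the blank line that terminates the headers.
    head = CRLF.join(lines) + CRLF + CRLF
    return head.encode("ascii") + body


# The GET request from Example 2-1: no body, so nothing follows the blank line.
get_request = build_request(
    "GET", "/daily/2001/01/05/1.html", "www.suck.com",
    {"User-Agent": "Super Duper Browser 14.6"},
)

# A POST request with form data in the body (path and host are made up).
post_request = build_request(
    "POST", "/search", "www.example.com",
    {"Content-Type": "application/x-www-form-urlencoded"},
    body=b"q=lwp",
)
```

Printing get_request.decode() shows the same lines as Example 2-1, ending with the blank line; post_request carries its five bytes of form data after that blank line.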
The following are the most useful headers sent in an HTTP request.
"Referrer" would be a more correct English spelling of the word, but "Referer" got frozen into the spec years ago. Maybe the blame lies with a UK (or Irish, Indian, etc.) person mistakenly assuming that "referer" would be a correct US spelling, the way that UK "traveller" does become "traveler" in the US. Admittedly, it is a confusing enough issue.
"www.youthere.int"? Yes, there's a ".int" TLD. It's for international treaty organizations (like the World Health Organization or NATO), which means that it will likely be permanently free of clever (or even non-acronymic) domain names. So I use it extensively as my suffix for nonsense hostnames throughout this book.
But little did I know when I wrote the book that RFC 2606 had already reserved example.com/.net/.org for just the purpose of having an example domain name for use in, well, books about the Web, and other such documentation.
The status line states which protocol the server is speaking, then gives a numeric status code and a short message. For example, "HTTP/1.1 404 Not Found." The numeric status codes are grouped—200-299 are success, 400-499 are permanent failures, and so on. A full list of HTTP status codes is given in Appendix B, "HTTP Status Codes".
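To make the grouping concrete, here is a small Python sketch that splits a status line into its three fields and maps the code to its class by the hundreds digit. The function names are made up for this illustration, and the class labels follow the names used in the HTTP specification (the 400-499 range, described above as permanent failures, is labeled "client error" there).

```python
def parse_status_line(line):
    """Split a status line such as 'HTTP/1.1 404 Not Found' into three fields."""
    # Split on the first two spaces only: the message may itself contain spaces.
    protocol, code, message = line.split(" ", 2)
    return protocol, int(code), message


def status_class(code):
    """Map a numeric status code to its broad class (hypothetical helper)."""
    classes = {
        1: "informational",
        2: "success",
        3: "redirection",
        4: "client error",
        5: "server error",
    }
    # Integer division by 100 keeps only the hundreds digit.
    return classes.get(code // 100, "unknown")


protocol, code, message = parse_status_line("HTTP/1.1 404 Not Found")
print(protocol, code, message, "->", status_class(code))
```

A program usually only needs the class to decide what to do next; the short message is there for human readers.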
The header lines let the server send additional information about the response. For example, if authentication is required, the server uses headers to indicate the type of authentication. The most common header—almost always present for both successful and unsuccessful requests—is Content-Type, which helps the browser interpret the body. Headers are terminated with a blank line, which must be present even if no headers are sent.
Many responses contain a Content-Length line that specifies the length, in bytes, of the body. However, this line is rarely present on dynamically generated pages, and because you never know which pages are dynamically generated, you can't rely on that header line being there.
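A client therefore has to cope with both cases. The sketch below, in Python, shows one simple way to do so, assuming the headers have already been parsed into a dictionary with lowercased names and that the body is read from a file-like stream. The function name read_body is hypothetical, and the sketch deliberately ignores other ways of delimiting a body (such as HTTP/1.1 chunked transfer encoding), which a real client library handles for you.

```python
import io


def read_body(stream, headers):
    """Read a response body from a file-like stream (hypothetical helper).

    Assumes `headers` maps lowercased header names to string values.
    """
    length = headers.get("content-length")
    if length is not None:
        # The server told us exactly how many bytes to expect.
        return stream.read(int(length))
    # No Content-Length: read until the server closes the connection,
    # which a file-like object reports as end of file.
    return stream.read()


# io.BytesIO stands in for a network connection in this sketch.
with_length = read_body(io.BytesIO(b"hello world"), {"content-length": "5"})
without_length = read_body(io.BytesIO(b"<html>dynamic page</html>"), {})
```

In the first call only the five announced bytes are read; in the second, everything up to the end of the stream is taken as the body.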
(Other, rarer header lines are used for specifying that the content has moved to a given URL, or that the server wants the browser to send HTTP cookies, and so on; however, these things are generally handled for you automatically by LWP.)
The body of the response follows the blank line and can be any arbitrary data. In the case of a typical web request, this is the HTML document to be displayed. If an error occurs, the message body doesn't contain the document that was requested but usually consists of a server-generated error message (generally in HTML, but sometimes not) explaining the error.
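Putting the pieces together, a raw response can be split into its status line, headers, and body at the first blank line. The following Python sketch shows the idea under simplifying assumptions: the whole response is already in memory as bytes, lines end with carriage return plus line feed, and no header is repeated or continued across lines. The name parse_response is made up for this example.

```python
def parse_response(raw):
    """Split raw response bytes into (status_line, headers, body).

    Hypothetical helper; assumes CRLF line endings and one line per header.
    """
    # Everything before the first blank line is the status line plus headers.
    head, _, body = raw.partition(b"\r\n\r\n")
    lines = head.decode("iso-8859-1").split("\r\n")
    status_line = lines[0]
    headers = {}
    for line in lines[1:]:
        name, _, value = line.partition(":")
        # Header names are case-insensitive, so store them lowercased.
        headers[name.strip().lower()] = value.strip()
    return status_line, headers, body


raw = (
    b"HTTP/1.1 404 Not Found\r\n"
    b"Content-type: text/html\r\n"
    b"\r\n"
    b"<html><body>Sorry, the object you requested was not found.</body></html>"
)
status_line, headers, body = parse_response(raw)
print(status_line)
print(headers["content-type"])
```

Note that the body is left as bytes: it is up to the caller to interpret it according to the Content-Type header.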