Automating Data Extraction
Regular Expression Techniques
Troubleshooting
When Regular Expressions Aren't Enough
Example: Extracting Links from a Bookmark File
Example: Extracting Links from Arbitrary HTML
Example: Extracting Temperatures from Weather Underground
The preceding chapters have been about getting things from the Web. But once you get a file, you have to process it. If you get a GIF, you'll use some module or external program that reads GIFs; likewise if you get a PNG, an RSS file, an MP3, or whatever. However, most of the interesting processable information on the Web is in HTML, so much of the rest of this book will focus on getting information out of HTML specifically.
In this chapter, we will use a rudimentary approach to processing HTML source: Perl regular expressions. This technique is powerful and most web sites can be mined in this fashion. We present the techniques of using regular expressions to extract data and show you how to debug those regular expressions. Examples from Amazon, the O'Reilly Network, Netscape bookmark files, and the Weather Underground web site demonstrate the techniques.
Suppose we want to extract information from an Amazon book page. The first problem is getting the HTML. Browsing Amazon shows that the URL for a book page is http://www.amazon.com/exec/obidos/ASIN/ISBN, where ISBN is the book's unique International Standard Book Number. So to fetch the Perl Cookbook's page, for example:
#!/usr/bin/perl -w
use strict;
use LWP::Simple;
my $html = get("http://www.amazon.com/exec/obidos/ASIN/1565922433")
  or die "Couldn't fetch the Perl Cookbook's page.";
The relevant piece of HTML looks like this:
<br clear="left">
<FONT FACE="Arial,Helvetica" size=2>
<b>Paperback</b> - 794 pages (August 1998)
<br></font>
<font face="Arial,Helvetica" size=-2>
O'Reilly & Associates;
</font>
<font face="Arial,Helvetica" size=-2>
ISBN: 1565922433 ;
Dimensions (in inches): 1.55 x 9.22 x 7.08
<br>
<FONT FACE="Arial,Helvetica" size=2>
</font><br>
</font>
</span>
<font face=verdana,arial,helvetica size=-1>
<b>Amazon.com Sales Rank: </b>
4,070
</font><br>
<font face=verdana,arial,helvetica size=-1>
The easiest way to extract information here is to use regular expressions. For example:
$html =~ m{Amazon\.com Sales Rank: </b>\s*([\d,]+)\s*</font><br>};
$sales_rank = $1;
$sales_rank =~ tr[,][]d;  # 4,070 becomes 4070
This regular expression describes the information we want (a string of digits and commas), as well as the text around it (Amazon.com Sales Rank: </b> before and </font><br> after). Because the number sits on its own line in the HTML, we match the surrounding whitespace with \s* rather than literal spaces. We use curly braces to delimit the regular expression to avoid problems with the slash in </font>, and we use parentheses to capture the desired information. We save that information to $sales_rank, then modify the variable's value to clean up the data we extracted.
The final program appears in Example 6-1.
Example 6-1. cookbook-rank

#!/usr/bin/perl -w
# cookbook-rank - find rank of Perl Cookbook on Amazon
use strict;
use LWP::Simple;
my $html = get("http://www.amazon.com/exec/obidos/ASIN/1565922433")
  or die "Couldn't fetch the Perl Cookbook's page.";

$html =~ m{Amazon\.com Sales Rank: </b>\s*([\d,]+)\s*</font><br>}
  || die "Couldn't find the sales rank.";
my $sales_rank = $1;
$sales_rank =~ tr[,][]d;  # 4,070 becomes 4070
print "$sales_rank\n";
It's then straightforward to generalize the program by allowing the user to provide the ISBN on the command line, as shown in Example 6-2.
Example 6-2. amazon-rank

#!/usr/bin/perl -w
# amazon-rank: fetch Amazon rank given ISBN on cmdline
use strict;
use LWP::Simple;
my $isbn = shift or die "usage:\n$0 ISBN\n";
my $html = get("http://www.amazon.com/exec/obidos/ASIN/$isbn")
  or die "Couldn't fetch the page for ISBN $isbn.";

$html =~ m{Amazon\.com Sales Rank: </b>\s*([\d,]+)\s*</font><br>}
  || die "Couldn't find the sales rank.";
my $sales_rank = $1;
$sales_rank =~ tr[,][]d;  # 4,070 becomes 4070
print "$sales_rank\n";
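For example, if you save this as amazon-rank and make it executable (the filename is our choice), a run might look like the following, assuming Amazon's markup still matches the pattern:

% amazon-rank 1565922433
4070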
We could take this program in any direction we wanted. For example, it would be a simple enhancement to take a list of ISBNs from the command line or from STDIN, if none were given on the command line. It would be trickier, but more useful, to have the program accept book titles instead of just ISBNs. A more elaborate version of this basic program is one of O'Reilly's actual market research tools.
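Here is a minimal sketch of that first enhancement, taking ISBNs from the command line and falling back to STDIN when none are given. The script name and messages are ours, and it assumes Amazon's URL scheme and markup are unchanged from the examples above:

#!/usr/bin/perl -w
# amazon-ranks - report Amazon sales ranks for a list of ISBNs (sketch)
use strict;
use LWP::Simple;

my @isbns = @ARGV;               # ISBNs given as arguments...
@isbns = <STDIN> unless @isbns;  # ...or, failing that, read from STDIN
chomp @isbns;
die "usage:\n$0 ISBN [ISBN ...]\n" unless @isbns;

foreach my $isbn (@isbns) {
  my $html = get("http://www.amazon.com/exec/obidos/ASIN/$isbn");
  unless (defined $html) {
    warn "Couldn't fetch the page for $isbn.\n";
    next;
  }
  unless ($html =~ m{Amazon\.com Sales Rank: </b>\s*([\d,]+)\s*</font><br>}) {
    warn "Couldn't find a sales rank for $isbn.\n";
    next;
  }
  my $sales_rank = $1;
  $sales_rank =~ tr[,][]d;  # 4,070 becomes 4070
  print "$isbn: $sales_rank\n";
}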