6.3. Troubleshooting

Things can go wrong both while you develop a data extraction program and while you maintain it. Suddenly, instead of an article summary, you see a huge mass of HTML, or you get no output at all. Several things might cause this. For example, the web site's HTML may have changed, or your program may not be flexible enough to deal with all the naturally occurring variations in the HTML.

There are two basic types of problems: false positives and false negatives. A false positive is when your regular expression identifies something it thinks is the information you're after, but it isn't really. For example, if the O'Reilly Network used the itemtemplate and summary format for things that aren't articles, the summary extraction program in Example 6-3 would report headlines that aren't really headlines.

There are two ways to deal with false positives. You can tighten your regular expression to prevent the uninteresting piece of HTML from matching. For example, matching text with /[^<]*/ instead of /.*?/ ensures the text has no HTML. The other way to prevent a false positive is to inspect the results of the match to ensure they're relevant to your search. For example, in Example 6-3, we checked that the URL, title, and summary were found when we decomposed the chunk.
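Both approaches can be sketched in a few lines. The HTML fragment and variable names below are hypothetical, made up only to show the difference between a loose and a tight pattern; they are not taken from Example 6-3.

```perl
use strict;
use warnings;

# Hypothetical fragment: the first link's text contains markup, the second is plain.
my $html = '<a href="/a.html"><b>Bold</b> title</a> '
         . '<a href="/b.html">Plain title</a>';

# Loose: a minimal dot match happily swallows the embedded <b> tags.
my @loose = $html =~ m{<a href="[^"]*">(.*?)</a>}g;

# Tight: [^<]* cannot cross a tag, so only plain-text link text matches.
my @tight = $html =~ m{<a href="[^"]*">([^<]*)</a>}g;

# Alternative: keep the loose pattern, then inspect each result and
# throw away anything that is empty or still contains markup.
my @checked = grep { length($_) && $_ !~ /</ } @loose;

print "loose:   $_\n" for @loose;
print "tight:   $_\n" for @tight;
print "checked: $_\n" for @checked;
```

Which approach is better depends on the page. Tightening the pattern is usually simpler, while checking the results afterward lets you apply tests that are awkward to express inside a regular expression.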

A false negative is when your program fails to find information it is looking for. There are also two ways to fix this. The first is to relax your regular expression. For example, replace a single space with /\s*/ to allow for any amount of whitespace. The second way is to make another pass through the document with a separate regular expression or processing technique, to catch the data you missed the first time around. A second pass is also useful for cleaning up after a deliberately relaxed first pass. For example, extract into an array all the things that look like news headlines, then remove the first element from the array if you know it's always going to be an advertisement instead of an actual headline.
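As a small illustration of relaxing a pattern, the sketch below runs a strict and a relaxed regular expression over three hypothetical variations of the same markup. The sample strings are invented for this example.

```perl
use strict;
use warnings;

# Hypothetical variations of the same markup, differing only in whitespace.
my @variants = (
    '<b>Title:</b> Perl News',
    "<b>Title:</b>\n   Perl News",
    '<b>Title:</b>Perl News',
);

my (@strict, @relaxed);
for my $text (@variants) {
    # Strict: exactly one space is required between the tag and the text.
    push @strict,  ($text =~ m{<b>Title:</b> Perl News}   ? 'match' : 'miss');
    # Relaxed: \s* allows any amount of whitespace, including none.
    push @relaxed, ($text =~ m{<b>Title:</b>\s*Perl News} ? 'match' : 'miss');
}

print "strict:  @strict\n";
print "relaxed: @relaxed\n";
```

The strict pattern finds only the first variant, while the relaxed pattern finds all three.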

Often the hardest part of debugging a regular expression is locating which part isn't matching or is matching too much. There are some simple steps you can take to identify where your regular expression is going wrong.

First, print the text you're matching against. Print it immediately before the match, so you are totally certain what the regular expression is being applied to. You'd be surprised at the number of subtle ways the page your program fetches can differ from the page for which you designed the regular expression.
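One way to do this is with a tiny helper like the one sketched here. The name show_text and the sample chunk are our own inventions. The brackets and character count are there to make trailing whitespace or an unexpectedly empty string easy to spot.

```perl
use strict;
use warnings;

# Hypothetical helper: format exactly what the regex is about to see,
# wrapped in delimiters and prefixed with its length.
sub show_text {
    my ($label, $text) = @_;
    $text = '' unless defined $text;
    return sprintf("%s (%d chars): [[%s]]\n", $label, length($text), $text);
}

my $chunk = "<b>Title</b> \n";    # stand-in for content your program fetched

# Print immediately before the match, to STDERR so normal output stays clean.
print STDERR show_text('chunk', $chunk);

my ($title) = $chunk =~ m{<b>(.*?)</b>};
print "title: $title\n" if defined $title;
```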

Second, put capturing parentheses around every chunk of the regular expression to see what's matching. This lets you find runaway matches, i.e., places where a quantifier matches too much. For example, the /.*/ intended to skip just the formatting HTML might instead skip the formatting HTML, three entries, and another piece of formatting HTML. In such situations, it's typically because either the thing being quantified was too general (e.g., instead of the dot, we should have had /[^<]/ to avoid matching HTML), or because the literal text after the quantifier wasn't enough to identify the stop point. For example, /<font/ instead of /<font size=-1/ might make a minimal quantifier stop too soon (at the first font tag, instead of the correct font tag) or a greedy quantifier match too much (at the last font tag, instead of the last size=-1 font tag).
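The following sketch shows the idea on a made-up listing with two entries. Every chunk of the pattern gets its own pair of parentheses, so printing the captures shows which piece ran away. The HTML here is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical listing: two entries, each followed by a <font> date.
my $html = '<b>One</b><font size=-1>Mon</font>'
         . '<b>Two</b><font size=-1>Tue</font>';

# Every chunk is captured, so each piece of the match can be printed.
# The greedy (.*) in the second chunk is the deliberate mistake.
my @parts = $html =~ m{(<b>)(.*)(</b>)(<font size=-1>)(.*?)(</font>)};

if (@parts) {
    printf "chunk %d: [%s]\n", $_ + 1, $parts[$_] for 0 .. $#parts;
}
else {
    print "no match\n";
}
```

Here the second chunk captures everything from One through Two, including the first entry's font tag, which immediately points to the greedy /.*/ as the culprit.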

If the regular expression you've created isn't matching at all, repeatedly take the last chunk off the regular expression until it does match. The last bit you removed was causing the match to fail, so inspect it to see why.

For example, let's find out why this isn't matching:

$text = qq(<a href="file.html"><b>Dog</b></a>Woof\nWoof</p>);
($file, $title, $summary) = 
    $text =~ m{<a href="(.*?)"><b>(.*?)</b></a>\s*(.*?)</p>};

Taking the last piece off yields this regular expression:

<a href="(.*?)"><b>(.*?)</b></a>\s*(.*?)

This matches, which tells us that the literal </p> wasn't being found after /(.*?)/ matched. We're not going to see much if we print $3 at this point, as we're matching minimally, and without something forcing the quantifier to match more than zero characters, it'll be happy to match nothing.

The way around this is to remove the minimal matching and see how much the quantifier could match:

<a href="(.*?)"><b>(.*?)</b></a>\s*(.*)

Printing $3 now shows us that /.*/ is matching only Woof, instead of Woof\nWoof. The newline should be the giveaway: we need to add the /s modifier to the original regular expression (be sure to change the /.*/ back to /.*?/!) to ensure that summaries with embedded newlines are correctly located.
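The whole walk-through can be replayed as a short program. This is only a sketch: the way the regular expression is split into chunks is our own choice, and the loop simply automates taking the last chunk off until the rest matches.

```perl
use strict;
use warnings;

my $text = qq(<a href="file.html"><b>Dog</b></a>Woof\nWoof</p>);

# The regular expression split into chunks (the split points are arbitrary).
my @chunks = ('<a href="(.*?)">', '<b>(.*?)</b></a>', '\s*(.*?)', '</p>');

# Drop chunks from the end until what remains matches.
my @try = @chunks;
my $culprit;
while (@try) {
    my $re = join '', @try;
    last if $text =~ m{$re};
    $culprit = pop @try;    # the most recently removed chunk
}
print "first failing chunk: $culprit\n" if defined $culprit;

# Make the last quantifier greedy to see how far it can reach without /s.
my (undef, undef, $reach) = $text =~ m{<a href="(.*?)"><b>(.*?)</b></a>\s*(.*)};
print "greedy reach: [$reach]\n";

# The fix: /s lets the dot match newlines, so the original pattern succeeds.
my ($file, $title, $summary) =
    $text =~ m{<a href="(.*?)"><b>(.*?)</b></a>\s*(.*?)</p>}s;
print "file=$file title=$title\n";
print "summary=[$summary]\n";
```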