My core approach in these cases is to pick some set of assumptions and stick with it, while also assuming that those assumptions will eventually fail. So I write the code so that when they do fail, the point of failure is easy to isolate. I do this with debug levels, also called trace levels. Consider this expanded version of our code:
use strict;
use constant DEBUG => 0;
use HTML::TokeParser;

parse_fresh_stream(
  HTML::TokeParser->new('fresh1.html') || die($!),
  'http://freshair.npr.org/dayFA.cfm?todayDate=07%2F02%2F2001'
);

sub parse_fresh_stream {
  use URI;
  my($stream, $base_url) = @_;
  DEBUG and print "About to parse stream with base $base_url\n";

  while(my $a_tag = $stream->get_tag('a')) {
    DEBUG > 1 and printf "Considering {%s}\n", $a_tag->[3];
    my $url = URI->new_abs( ($a_tag->[1]{'href'} || next), $base_url );
    unless($url->scheme eq 'http') {
      DEBUG > 1 and print "Scheme is no good in $url\n";
      next;
    }
    unless($url->host =~ m/www\.npr\.org/) {
      DEBUG > 1 and print "Host is no good in $url\n";
      next;
    }
    unless($url->path =~ m{/ramfiles/.*\.ram$}) {
      DEBUG > 1 and print "Path is no good in $url\n";
      next;
    }
    DEBUG > 1 and print "IT'S GOOD!\n";
    my $text = $stream->get_trimmed_text('/a') || "??";
    printf "%s\n %s\n", $text, $url;
  }
  DEBUG and print "End of stream\n";
  return;
}
Among the notable changes here, I'm making a URI object for each URL I'm scrutinizing, and to make a new absolute URI object out of each potentially relative URL, I have to pass the base URL as a parameter to the parse_fresh_stream( ) function. Once I do that, I get to isolate parts of URLs the proper way, using URI methods such as host( ) and path( ), instead of applying regexp matches to the bare URL.
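For instance, here's a minimal sketch (with a made-up relative href, purely for illustration) of what those URI calls give you:

use URI;

my $base = 'http://freshair.npr.org/dayFA.cfm?todayDate=07%2F02%2F2001';
my $url  = URI->new_abs('dayFA.cfm?todayDate=current', $base);

print $url->scheme, "\n";   # "http"
print $url->host,   "\n";   # "freshair.npr.org"
print $url->path,   "\n";   # "/dayFA.cfm" -- the query string is not part of the path
print "$url\n";             # the whole absolute URL, thanks to string overloading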
The greatest change is the introduction of all the lines with "DEBUG" in them. Because the DEBUG constant is declared with value 0, all the tests of whether DEBUG is nonzero are obviously always false, and so all these lines are never run; in fact, the Perl compiler removes them from the parse tree of this program, so they're discarded the moment they're parsed. (Incidentally, there's nothing magic about the name "DEBUG"; you can call it "TRACE" or "Talkytalky" or "_mumbles" or whatever you want. However, using all caps is a matter of convention.) So, with a DEBUG value of 0, when you run this program, it simply prints this:
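If you want to convince yourself that those dead DEBUG branches really are discarded, the B::Deparse module that comes with Perl will show you what survived compilation. A minimal sketch (debug_demo.pl is a hypothetical file name, used only for this demonstration):

use strict;
use constant DEBUG => 0;

DEBUG     and print "never compiled in\n";
DEBUG > 1 and print "not compiled in either\n";
print "this one stays\n";

Run it through the deparser with perl -MO=Deparse debug_demo.pl and the two DEBUG-guarded prints are nowhere in the deparsed code (at most a '???' placeholder marks where each folded-away statement used to be), while the final print survives.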
Listen to Current Show
 http://www.npr.org/ramfiles/fa/20011011.fa.ram
Listen to Monday - July 2, 2001
 http://www.npr.org/ramfiles/fa/20010702.fa.ram
Listen to Editor and writer Walter Kirn
 http://www.npr.org/ramfiles/fa/20010702.fa.01.ram
Listen to Casting director and actress Joanna Merlin
 http://www.npr.org/ramfiles/fa/20010702.fa.02.ram
(That first link is superfluous, but we'll deal with that in a bit; otherwise, it all works okay.) So these DEBUG lines do nothing. And when we deploy the above program with some code that harvests the pages instead of working from the local test page, the DEBUG lines will continue to do nothing. But suppose that, months later, the program just stops working. That is, it runs, but prints nothing, and we don't know why. Did NPR change the Fresh Air site so much that the old program listings' URLs no longer serve any content? Or has some part of the format changed? If we just change DEBUG => 0 to DEBUG => 1 and rerun the program, we can see that parse_fresh_stream( ) is definitely being called on a stream from an HTML page, because we see the messages from the print statements in that routine:
About to parse stream with base http://freshair.npr.org/dayFA.cfm?todayDate=07%2F02%2F2001
End of stream
Change the DEBUG level to 2, and we get more detailed output:
About to parse stream with base http://freshair.npr.org/dayFA.cfm?todayDate=07%2F02%2F2001
Considering {<A HREF="index.cfm">}
Host is no good in http://freshair.npr.org/index.cfm
Considering {<A HREF="http://www.npr.org/ramfiles/fa/20011011.fa.prok">}
Path is no good in http://www.npr.org/ramfiles/fa/20011011.fa.prok
Considering {<A HREF="dayFA.cfm?todayDate=current">}
[...]
Considering {<A HREF="http://www.npr.org/ramfiles/fa/20010702.fa.prok">}
Path is no good in http://www.npr.org/ramfiles/fa/20010702.fa.prok
Considering {<A HREF="http://www.npr.org/ramfiles/fa/20010702.fa.01.prok">}
Path is no good in http://www.npr.org/ramfiles/fa/20010702.fa.01.prok
Considering {<A HREF="http://freshair.npr.org/guestInfoFA.cfm?name=walterkirn">}
Host is no good in http://freshair.npr.org/guestInfoFA.cfm?name=walterkirn
Considering {<A HREF="http://www.npr.org/ramfiles/fa/20010702.fa.02.prok">}
Path is no good in http://www.npr.org/ramfiles/fa/20010702.fa.02.prok
Considering {<A HREF="http://freshair.npr.org/guestInfoFA.cfm?name=joannamerlin">}
Host is no good in http://freshair.npr.org/guestInfoFA.cfm?name=joannamerlin
Considering {<A HREF="dayFA.cfm?todayDate=06%2F29%2F2001">}
Host is no good in http://freshair.npr.org/dayFA.cfm?todayDate=06%2F29%2F2001
Considering {<A HREF="dayFA.cfm?todayDate=07%2F03%2F2001">}
Host is no good in http://freshair.npr.org/dayFA.cfm?todayDate=07%2F03%2F2001
End of stream
Our parse_fresh_stream( ) routine is still correctly rejecting index.cfm and the like for having a "no good" host (i.e., not www.npr.org). And we can see what's happening with those "ramfiles" links: the routine isn't rejecting their host, because they are on www.npr.org, but it is rejecting their paths. When we look back at the code that triggers rejection based on the path, it kicks in only when the path fails to match m{/ramfiles/.*\.ram$}. Why don't our ramfiles paths match that regexp anymore? Ah ha, because they don't end in .ram anymore; they end in .prok, some new audio format that NPR has switched to. That's evident at the end of the lines beginning "Path is no good." Change our regexp to accept .prok (as sketched below), rerun the program, and go about our business. Similarly, if the audio files moved to a different server, we'd be alerted to their host being "no good," and we could adjust the regexp that checks the host.
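Here's roughly what that one-line fix to the path test looks like, as a sketch; the alternation keeps accepting the old .ram extension as well, on the assumption that .prok is the only new extension in play:

unless($url->path =~ m{/ramfiles/.*\.(?:ram|prok)$}) {
  DEBUG > 1 and print "Path is no good in $url\n";
  next;
}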
We had to make some fragile assumptions to tell interesting links apart from uninteresting ones, but having all these DEBUG statements means that when the assumptions no longer hold, we can quickly isolate the problem.
Speaking of assumptions, what about the fact that (back to our pre-.prok local test file and setting DEBUG back to 0) we get an extra link at the start of the output here?
Listen to Current Show
 http://www.npr.org/ramfiles/fa/20011011.fa.ram
Listen to Monday - July 2, 2001
 http://www.npr.org/ramfiles/fa/20010702.fa.ram
Listen to Editor and writer Walter Kirn
 http://www.npr.org/ramfiles/fa/20010702.fa.01.ram
Listen to Casting director and actress Joanna Merlin
 http://www.npr.org/ramfiles/fa/20010702.fa.02.ram
If we go to our browser and use the "Find in Page" function to see where "Listen to Current Show" appears in the rendered page, we'll probably find no match. So where's it coming from? Try the same search on the source, and you'll see:
<A HREF="http://www.npr.org/ramfiles/fa/20011011.fa.ram">
<IMG SRC="images/listen.gif" ALT="Listen to Current Show"
 WIDTH="124" HEIGHT="47" BORDER="0" HSPACE="0" VSPACE="0">
</A>
Recall that get_text( ) and get_trimmed_text( ) give special treatment to img and applet elements; they treat them as virtual text tags whose content comes from their alt values (or, in the absence of any alt value, the strings [IMG] or [APPLET]). That might be a useful feature normally, but it's bothersome now. So we turn it off by adding this line just before our while loop starts reading from the stream:
$stream->{'textify'} = {};
We know that's the line to use partly because I mentioned it as an aside much earlier, and partly because it's in the HTML::TokeParser manpage (where you can also read about how to do things with the textify feature other than just turn it off). With that change made, our program prints this:
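Incidentally, the same hash can remap the feature rather than just disable it. For example, based on what the manpage describes, the default behavior corresponds to mapping img and applet to their alt attributes, and you can point img at a different attribute instead. A sketch, not something this program needs:

# The default behavior is equivalent to:
#   $stream->{'textify'} = { img => 'alt', applet => 'alt' };

# Use each image's SRC attribute as its stand-in text instead:
$stream->{'textify'} = { img => 'src' };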
??
 http://www.npr.org/ramfiles/fa/20011011.fa.ram
Listen to Monday - July 2, 2001
 http://www.npr.org/ramfiles/fa/20010702.fa.ram
Listen to Editor and writer Walter Kirn
 http://www.npr.org/ramfiles/fa/20010702.fa.01.ram
Listen to Casting director and actress Joanna Merlin
 http://www.npr.org/ramfiles/fa/20010702.fa.02.ram
That ?? is there because the first link has no link text (now that we're no longer counting alt text), so get_trimmed_text( ) returns an empty string. An empty string is a false value in Perl, so the || falls through to the "??" here:
my $text = $stream->get_trimmed_text('/a') || "??";
If we want to explicitly skip things with no link text, we change that to:
my $text = $stream->get_trimmed_text('/a');
unless(length $text) {
  DEBUG > 1 and print "Skipping link with no link-text\n";
  next;
}
That makes the program give the output we wanted:
Listen to Monday - July 2, 2001
 http://www.npr.org/ramfiles/fa/20010702.fa.ram
Listen to Editor and writer Walter Kirn
 http://www.npr.org/ramfiles/fa/20010702.fa.01.ram
Listen to Casting director and actress Joanna Merlin
 http://www.npr.org/ramfiles/fa/20010702.fa.02.ram
Now that everything else is working, remember that we didn't want all this "Listen to" stuff starting every single link. Moreover, remember that the presence of a "Listen to" at the start of the link text was one of our prospective criteria for whether it's an interesting link. We didn't implement that, but we can implement it now:
unless($text =~ s/^Listen to //) {
  DEBUG > 1 and print "Odd, \"$text\" doesn't start with \"Listen to\"...\n";
  next;
}
In other words, unless the link text starts with a "Listen to" that we can strip off, the link is rejected. With that change in place, the program prints:
Monday - July 2, 2001
 http://www.npr.org/ramfiles/fa/20010702.fa.ram
Editor and writer Walter Kirn
 http://www.npr.org/ramfiles/fa/20010702.fa.01.ram
Casting director and actress Joanna Merlin
 http://www.npr.org/ramfiles/fa/20010702.fa.02.ram
And incidentally, you might notice that with all these little changes we've made, our program now works perfectly!
All it needs to actually pull data from the Fresh Air web site is to comment out the code that calls the local test file and substitute some simple code to get the data for a block of days. Here's the whole program source, with those changes and additions:
use strict;
use constant DEBUG => 0;
use HTML::TokeParser;

#parse_fresh_stream(
#  HTML::TokeParser->new('fresh1.html') || die($!),
#  'http://freshair.npr.org/dayFA.cfm?todayDate=07%2F02%2F2001'
#);
scan_last_month( );

sub scan_last_month {
  use LWP::UserAgent;
  my $browser = LWP::UserAgent->new( );
  foreach my $date_mdy (weekdays_last_month( )) {
    my $url = sprintf(
      'http://freshair.npr.org/dayFA.cfm?todayDate=%02d%%2f%02d%%2f%04d',
      @$date_mdy
    );
    DEBUG and print "Getting @$date_mdy URL $url\n";
    sleep 3;   # Don't hammer the NPR server!
    my $response = $browser->get($url);
    unless($response->is_success) {
      print "Error getting $url: ", $response->status_line, "\n";
      next;
    }
    my $stream = HTML::TokeParser->new($response->content_ref)
      || die "What, couldn't make a stream?!";
    parse_fresh_stream($stream, $response->base);
  }
}

sub weekdays_last_month {
  # Boring date handling.  Feel free to skip.
  my($now) = time;
  my $this_month = (gmtime $now)[4];
  my(@out, $last_month, $that_month);

  do {   # Get to end of last month.
    $now -= (24 * 60 * 60);   # go back a day
    $that_month = (gmtime $now)[4];
  } while($that_month == $this_month);
  $last_month = $that_month;

  do {   # Go backwards thru last month
    my(@then) = (gmtime $now);
    unshift @out, [$then[4] + 1, $then[3], $then[5] + 1900]   # m,d,yyyy
      unless $then[6] == 0 or $then[6] == 6;
    $now -= (24 * 60 * 60);   # go back one day
    $that_month = (gmtime $now)[4];
  } while($that_month == $last_month);

  return @out;
}

# Unchanged since you last saw it:
sub parse_fresh_stream {
  use URI;
  my($stream, $base_url) = @_;
  DEBUG and print "About to parse stream with base $base_url\n";

  while(my $a_tag = $stream->get_tag('a')) {
    DEBUG > 1 and printf "Considering {%s}\n", $a_tag->[3];
    my $url = URI->new_abs( ($a_tag->[1]{'href'} || next), $base_url );
    unless($url->scheme eq 'http') {
      DEBUG > 1 and print "Scheme is no good in $url\n";
      next;
    }
    unless($url->host =~ m/www\.npr\.org/) {
      DEBUG > 1 and print "Host is no good in $url\n";
      next;
    }
    unless($url->path =~ m{/ramfiles/.*\.ram$}) {
      DEBUG > 1 and print "Path is no good in $url\n";
      next;
    }
    DEBUG > 1 and print "IT'S GOOD!\n";
    my $text = $stream->get_trimmed_text('/a') || "??";
    printf "%s\n %s\n", $text, $url;
  }
  DEBUG and print "End of stream\n";
  return;
}
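If you'd like to sanity-check the date logic before letting the program loose on the network, a throwaway loop like this (just a sketch, run with the weekdays_last_month( ) subroutine above loaded) prints the m/d/yyyy dates that scan_last_month( ) will ask for:

foreach my $date_mdy (weekdays_last_month( )) {
  printf "Would fetch listings for %02d/%02d/%04d\n", @$date_mdy;
}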