Example: Extracting Links from Arbitrary HTML (Perl & LWP)

6.6. Example: Extracting Links from Arbitrary HTML

Suppose that the links we want to check are in a remote HTML file that's not quite as rigidly formatted as my local bookmark file. Suppose, in fact, that a representative section looks like this:

<p>Dear Diary,
<br>I was listening to <a href="http://www.freshair.com">Fresh
Air</a> the other day and they had <a href
="http://www.cs.Helsinki.FI/u/torvalds/">Linus Torvalds</a> on,
and he was going on about how he wrote some kinda
<a href="http://www.linux.org/">program</a> or something.  If
he's so smart, why didn't he write something useful, like <a
href="why_I_love_tetris.html">Tetris</a> or <a href="../minesweeper_hints/"
>Minesweeper</a>, huh?

In the case of the bookmarks, we noted that links were each alone on a line, all absolute, and each capturable with m/ HREF="([^"\s]+)" /. But none of those things are true here! Some links (such as href="why_I_love_tetris.html") are relative, some lines have more than one link in them, and one link even has a newline between its href attribute name and its ="..." attribute value.

Regexps are still usable, though—it's just a matter of applying them to a whole document (instead of to individual lines) and also making the regexp a bit more permissive:

while ( $document =~ m/\s+href\s*=\s*"([^"\s]+)"/gi ) {
  my $url = $1;
  ...
}

(The /g modifier ("g" originally for "globally") on the regexp tries to match the pattern as many times as it can, each time picking up where the last match left off.)
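To see the difference concretely, here is a minimal sketch that runs both approaches over the diary snippet above, with no network access needed. The routine names links_per_line and links_whole_document are made up for this illustration, and the line-by-line pattern is the bookmark-style one quoted earlier:

```perl
use strict;
use warnings;

# The sample diary markup from above, held in one string.
my $document = <<'END_HTML';
<p>Dear Diary,
<br>I was listening to <a href="http://www.freshair.com">Fresh
Air</a> the other day and they had <a href
="http://www.cs.Helsinki.FI/u/torvalds/">Linus Torvalds</a> on,
and he was going on about how he wrote some kinda
<a href="http://www.linux.org/">program</a> or something.  If
he's so smart, why didn't he write something useful, like <a
href="why_I_love_tetris.html">Tetris</a> or <a href="../minesweeper_hints/"
>Minesweeper</a>, huh?
END_HTML

# Hypothetical helper: the bookmark-file approach, one line at a time,
# with the strict pattern (uppercase HREF, a literal space on each side).
sub links_per_line {
  my($text) = @_;
  my @links;
  for my $line ( split /\n/, $text ) {
    push @links, $1 if $line =~ m/ HREF="([^"\s]+)" /;
  }
  return @links;
}

# Hypothetical helper: the permissive pattern applied to the whole document.
# With /g in a while loop, each match resumes where the previous one ended.
sub links_whole_document {
  my($text) = @_;
  my @links;
  while ( $text =~ m/\s+href\s*=\s*"([^"\s]+)"/gi ) {
    push @links, $1;
  }
  return @links;
}

my @strict = links_per_line($document);
my @loose  = links_whole_document($document);

print "Line-by-line pattern found ", scalar(@strict), " link(s)\n";
print "Whole-document pattern found ", scalar(@loose), " link(s):\n";
print "  $_\n" for @loose;
```

On this snippet the strict pattern finds nothing, because the attribute names are lowercase and no value is followed by a space. The permissive pattern picks up all five values, including the one whose href and ="..." sit on different lines.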

Example 6-5 shows this basic idea fleshed out to include support for fetching a remote document, matching each link in it, making each absolute, and calling a checker routine (currently a placeholder) on it.

Example 6-5. diary-link-checker

#!/usr/bin/perl -w
# diary-link-checker - check links from diary page

use strict;
use LWP;

my $doc_url = "http://chichi.diaries.int/stuff/diary.html";
my $document;
my $browser;
init_browser( );

{  # Get the page whose links we want to check:
  my $response = $browser->get($doc_url);
  die "Couldn't get $doc_url: ", $response->status_line
    unless $response->is_success;
  $document = $response->content;
  $doc_url = $response->base;
  # In case we need to resolve relative URLs later
}

while ($document =~ m/href\s*=\s*"([^"\s]+)"/gi) {
  my $absolute_url = absolutize($1, $doc_url);
  check_url($absolute_url);
}

sub absolutize {
  my($url, $base) = @_;
  use URI;
  return URI->new_abs($url, $base)->canonical;
}

sub init_browser {
  $browser = LWP::UserAgent->new;
  # ...And any other initialization we might need to do...
  return $browser;
}

sub check_url {
  # A temporary placeholder...
  print "I should check $_[0]\n";
}

When run, this prints:

I should check http://www.freshair.com/
I should check http://www.cs.helsinki.fi/u/torvalds/
I should check http://www.linux.org/
I should check http://chichi.diaries.int/stuff/why_I_love_tetris.html
I should check http://chichi.diaries.int/minesweeper_hints/

So our while (regexp) loop is indeed matching all five links in the document. (Note that our absolutize routine is correctly making the URLs absolute: it turns why_I_love_tetris.html into http://chichi.diaries.int/stuff/why_I_love_tetris.html and ../minesweeper_hints/ into http://chichi.diaries.int/minesweeper_hints/, using the URI class that we explained in Chapter 4, "URLs".)
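If you want to try the absolutizing step on its own, without fetching anything, a small sketch along these lines should do. It assumes the URI module is installed and reuses the absolutize routine and the base URL from Example 6-5; the list of test links is just for illustration:

```perl
use strict;
use warnings;
use URI;

# Same base URL as the diary page in Example 6-5.
my $base = "http://chichi.diaries.int/stuff/diary.html";

# Same routine as in Example 6-5: resolve $url against $base, then normalize.
sub absolutize {
  my($url, $base) = @_;
  return URI->new_abs($url, $base)->canonical;
}

for my $link (
  "why_I_love_tetris.html",                 # relative, same directory
  "../minesweeper_hints/",                  # relative, parent directory
  "http://www.freshair.com",                # already absolute, empty path
  "http://www.cs.Helsinki.FI/u/torvalds/",  # absolute, mixed-case hostname
) {
  print "$link\n  => ", absolutize($link, $base), "\n";
}
```

The relative links resolve against the directory of the base URL. An already-absolute link is left pointing where it was, although canonical still tidies it up by adding the trailing slash to an empty path and lowercasing the hostname.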

Now that we're satisfied that our program is matching and absolutizing links correctly, we can drop in the check_url routine from Example 6-4, and it will actually check the URLs that our placeholder check_url routine promised we'd check.