Regular expressions are powerful, but they can't describe everything. In particular, nested structures (for example, lists containing lists, with any amount of nesting possible) and comments are tricky. While you can use regular expressions to extract the components of the HTML and then attempt to keep track of whether you're in a comment or to which nested array you're adding elements, these types of programs rapidly balloon in complexity and become maintenance nightmares.
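To make the comment problem concrete, here is a minimal sketch using only core Perl. The sample page, its URLs, and the variable names are hypothetical, invented for illustration. It shows a naive link-extracting pattern picking up a link that has been commented out, followed by a comment-stripping workaround.

```perl
use strict;
use warnings;

# Hypothetical sample page: one live link and one commented-out link.
my $html = <<'END_HTML';
<p><a href="/live.html">Live</a></p>
<!-- <a href="/retired.html">Retired</a> -->
END_HTML

# A naive pattern that grabs every href value it sees,
# with no idea whether the match sits inside a comment.
my @naive = $html =~ /<a\s+href="([^"]*)"/g;
print "naive: @naive\n";        # includes /retired.html

# Workaround: strip comments first, then apply the same pattern.
(my $stripped = $html) =~ s/<!--.*?-->//gs;
my @filtered = $stripped =~ /<a\s+href="([^"]*)"/g;
print "filtered: @filtered\n";  # only /live.html
```

The workaround handles this sample, but it is itself fragile: a `-->` sequence inside a script or an attribute value would end the "comment" early. Each such patch adds another special case, which is how these programs grow out of control.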
The best thing to do in these situations is to forgo regular expressions and use a real HTML tokenizer or parser such as HTML::Parser, HTML::TokeParser, or HTML::TreeBuilder (all demonstrated in the next chapter).
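As a brief taste of the tokenizer approach, the sketch below uses HTML::TokeParser's `get_tag` method to collect link targets. The sample document and variable names are hypothetical, and the example assumes the HTML::TokeParser module (not part of core Perl) is installed. Because the tokenizer treats a comment as a single token of its own, the commented-out link is never reported as a tag.

```perl
use strict;
use warnings;
use HTML::TokeParser;

# Hypothetical sample page: one live link and one commented-out link.
my $html = <<'END_HTML';
<p><a href="/live.html">Live</a></p>
<!-- <a href="/retired.html">Retired</a> -->
END_HTML

# Passing a scalar reference tells the parser to treat the
# string as the document itself rather than as a filename.
my $parser = HTML::TokeParser->new(\$html)
    or die "Can't create parser for the sample document";

my @links;
# get_tag('a') skips ahead to the next <a> start-tag and returns
# an array reference whose second element is the attribute hash.
while (my $tag = $parser->get_tag('a')) {
    my $attr = $tag->[1];
    push @links, $attr->{href} if defined $attr->{href};
}
print "links: @links\n";
```

No comment-tracking logic appears anywhere in this program; the tokenizer does that bookkeeping itself.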