
How to parse HTML response with HTML::TreeBuilder in Perl?

Question

How to parse HTML response with HTML::TreeBuilder in Perl?

Parsing HTML responses in Perl is commonly done with the HTML::TreeBuilder module, which is part of the HTML::Tree distribution. This module allows you to build a parse tree from an HTML string, similar to how the DOM works in browsers, enabling easy traversal and extraction of HTML elements.

When you receive an HTML response (for example, from a web request made with LWP::UserAgent), you can feed the HTML content into HTML::TreeBuilder to parse it. The resulting tree can then be searched, traversed, or manipulated using a rich set of methods.
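
Before the full LWP example below, here is a minimal sketch of that flow, using a hardcoded HTML string in place of a fetched response (the markup and element names are purely illustrative):

use strict;
use warnings;
use HTML::TreeBuilder;

# A small HTML document standing in for a fetched response body
my $html = '<html><head><title>Demo</title></head>'
         . '<body><p class="intro">Hello, <b>world</b>!</p></body></html>';

# Build the parse tree directly from the string
my $tree = HTML::TreeBuilder->new_from_content($html);

# look_down in scalar context returns the first matching element
my $title = $tree->look_down(_tag => 'title');
print 'Title: ', $title->as_text, "\n";        # Title: Demo

my $para = $tree->look_down(_tag => 'p', class => 'intro');
print 'Paragraph: ', $para->as_text, "\n";     # Paragraph: Hello, world!

# Free the tree when finished with it
$tree->delete;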

Key concepts

  • HTML::TreeBuilder: builds the parse tree by parsing HTML text.
  • Sigils: use arrays (e.g. @nodes) when multiple elements are expected; scalars ($node) for single elements.
  • Context: many methods behave differently in scalar vs. list context, e.g. find_by_tag_name returns either the first matching element (scalar) or all matches (list).
  • TMTOWTDI (“There’s More Than One Way To Do It”): HTML::TreeBuilder offers multiple ways to navigate the tree, such as look_down, find_by_tag_name, and direct child traversal (compared in the sketch after this list).
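
To make the TMTOWTDI point concrete, here is a rough sketch of three ways to reach the same elements. It assumes a $tree already built with HTML::TreeBuilder, as in the full example below; the variable names are illustrative.

# Assumes $tree is an HTML::TreeBuilder parse tree (see the full example below).

# 1. look_down: flexible search by tag name, attributes, or code refs
my @anchors = $tree->look_down(_tag => 'a');

# 2. find_by_tag_name: all matches in list context ...
my @all_links  = $tree->find_by_tag_name('a');
# ... but only the first match in scalar context
my $first_link = $tree->find_by_tag_name('a');

# 3. Direct child traversal: content_list returns a node's children;
#    element children are HTML::Element objects, text children are plain strings
for my $child ($tree->content_list) {
    next unless ref $child;              # skip bare text nodes
    print 'Top-level child element: ', $child->tag, "\n";
}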

Basic usage example

The example below demonstrates:

  • Retrieving an HTML response using LWP::UserAgent.
  • Parsing the HTML content with HTML::TreeBuilder.
  • Extracting all the links (<a> tags) and printing their URLs and text.
  • Proper tree cleanup using delete to avoid memory leaks.

use strict;
use warnings;
use LWP::UserAgent;
use HTML::TreeBuilder;

# Create a user agent and fetch a web page
my $ua = LWP::UserAgent->new;
my $url = 'http://example.com/';
my $response = $ua->get($url);

if ($response->is_success) {
    my $html = $response->decoded_content;

    # Parse the HTML content with HTML::TreeBuilder
    my $tree = HTML::TreeBuilder->new_from_content($html);

    # Find all <a> tags
    my @links = $tree->look_down(_tag => 'a');

    print "Links found on $url:\n";
    foreach my $link (@links) {
        # Extract the href attribute and text inside the link
        my $href = $link->attr('href') // '[no href]';
        my $text = $link->as_text;
        print "  Text: '$text'  URL: $href\n";
    }

    # Always delete the tree to free memory
    $tree->delete;
} else {
    die "Failed to fetch $url: ", $response->status_line, "\n";
}

Explanation:

  • LWP::UserAgent performs the HTTP GET request; the response object’s decoded_content method returns the body as a character string, with any Content-Encoding and charset decoding already applied.
  • HTML::TreeBuilder->new_from_content($html) creates a tree from this HTML.
  • look_down is a powerful method for searching nodes by tag name, attributes, or content. Here, (_tag => 'a') finds all anchor elements (a further sketch follows this list).
  • We iterate over each anchor and extract the href attribute and the textual content inside the tag via as_text (use as_trimmed_text if you want surrounding whitespace collapsed).
  • $tree->delete is important: it frees all allocated memory inside the parse tree. Forgetting this can lead to memory leaks in long-running programs.
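
look_down accepts richer criteria than a single tag name. The following sketch filters by a regexp on an attribute and by an arbitrary code reference; the criteria and variable names are illustrative and not part of the verified example above.

# Assumes the same $tree as in the main example; criteria here are illustrative.

# Regexp criterion: only anchors whose href starts with https://
my @secure_links = $tree->look_down(
    _tag => 'a',
    href => qr{^https://},
);

# Coderef criterion: anchors whose visible (trimmed) text is non-empty
my @labelled_links = $tree->look_down(
    _tag => 'a',
    sub { length( $_[0]->as_trimmed_text ) > 0 },
);

print "Secure links:\n";
print '  ', $_->attr('href'), "\n" for @secure_links;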

Common pitfalls:

  • Forgetting to call delete on the tree after use.
  • Assuming the href attribute always exists—it may be missing or empty.
  • Not handling character encoding properly (use decoded_content, not content, on the HTTP::Response).
  • Using find_by_tag_name in scalar context returns only the first match; use list context or look_down for complete results (demonstrated in the sketch after this list).
  • HTML::TreeBuilder is tolerant of imperfect markup, but severely malformed HTML can still produce a tree that differs from what you expect.
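
The context pitfall in particular is easy to trip over. A short sketch of the difference, reusing the $tree from the main example:

# Reuses the $tree from the main example above.
my @every_anchor = $tree->find_by_tag_name('a');   # list context: all <a> elements
my $first_anchor = $tree->find_by_tag_name('a');   # scalar context: only the first

printf "Found %d anchor(s); first is '%s'\n",
    scalar @every_anchor,
    defined $first_anchor ? $first_anchor->as_trimmed_text : '(none)';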

Version notes:

  • HTML::TreeBuilder (part of the HTML::Tree distribution) has been stable for many years and runs on Perl 5.8 and later; the defined-or operator (//) used in the example requires Perl 5.10+.
  • Newer syntax such as postfix dereferencing (non-experimental from Perl 5.24) can tidy up reference handling, as in the sketch after this list, but the example above deliberately stays simple and broadly compatible. Smartmatch, by contrast, has been marked experimental since Perl 5.18 and is best avoided.
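
As a small sketch of how postfix dereferencing can read, extract_links returns an array reference that can be unpacked either way. This assumes the $tree from the main example and Perl 5.24+ (on 5.20/5.22 the syntax was experimental and needed a feature pragma).

# Assumes the $tree from the main example; requires Perl 5.24+ for postfix
# dereferencing without a feature pragma.

# extract_links('a') returns a reference to an array of
# [$url, $element, $attr_name, $tag] tuples for every <a> link.
my $links = $tree->extract_links('a');

# Classic dereference
for my $l (@{$links}) {
    print "$l->[0]\n";
}

# Postfix dereference: same loop, read left to right
for my $l ($links->@*) {
    print "$l->[0]\n";
}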

In summary, HTML::TreeBuilder provides a powerful, flexible way to parse and navigate HTML directly from your Perl scripts. Combined with LWP::UserAgent or other HTTP clients for fetching content, it lets you extract information cleanly and reliably.

Verified output

Running the script above against http://example.com/ (Perl v5.34.1) printed:

Links found on http://example.com/:
  Text: 'Learn more'  URL: https://iana.org/domains/example
