How to parse HTML response with HTML::TreeBuilder in Perl?
Parsing HTML responses in Perl is commonly done with the HTML::TreeBuilder module, which is part of the HTML::Tree distribution. This module allows you to build a parse tree from an HTML string, similar to how the DOM works in browsers, enabling easy traversal and extraction of HTML elements.
When you receive an HTML response (for example, from a web request made with LWP::UserAgent), you can feed the HTML content into HTML::TreeBuilder to parse it. The resulting tree can then be searched, traversed, or manipulated using a rich set of methods.
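As a rough illustration of feeding HTML into the parser and then walking the result, the sketch below builds a tree from two chunks of text with `new`, `parse`, and `eof`, which is what `new_from_content` does in one step. The markup is made-up sample data, not something from a real response.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Build the tree incrementally; the chunks are hypothetical sample data
my $tree = HTML::TreeBuilder->new;
$tree->parse('<html><head><title>Demo page</title></head>');
$tree->parse('<body><h1>Hello</h1><p>World</p></body></html>');
$tree->eof;    # signal end of input so the tree is finalized

# The tree object itself is the root <html> element
my $root_tag = $tree->tag;

# Search: scalar context returns the first matching element
my $title_el = $tree->find_by_tag_name('title');
my $title    = $title_el->as_text;

# Traverse: content_list returns child elements and plain text strings
my $body       = $tree->find_by_tag_name('body');
my @child_tags = map { ref $_ ? $_->tag : '#text' } $body->content_list;

print "Root element: $root_tag\n";
print "Title: $title\n";
print "Children of body: @child_tags\n";

$tree->delete;
```

Feeding the parser in chunks can be handy when the content arrives piece by piece; for a complete string, `new_from_content` is shorter.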
Key concepts
- HTML::TreeBuilder: builds the parse tree by parsing HTML text.
- Sigils: use arrays (e.g. `@nodes`) when multiple elements are expected; scalars (`$node`) for single elements.
- Context: many methods behave differently in scalar vs. list context, e.g. `find_by_tag_name` returns either the first matching element (scalar) or all matches (list).
- TMTOWTDI (“There’s More Than One Way To Do It”): HTML::TreeBuilder offers multiple ways to navigate the tree, such as `look_down`, `find_by_tag_name`, and direct child traversal.
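The context-dependent behaviour described above can be sketched on a small inline document. The HTML string and variable names here are illustrative assumptions only.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical sample markup, used only for illustration
my $html = '<html><body><p>First</p><p>Second</p><a href="/x">X</a></body></html>';

my $tree = HTML::TreeBuilder->new_from_content($html);

# List context: every matching element
my @paras = $tree->find_by_tag_name('p');

# Scalar context: only the first matching element (or undef if none)
my $first = $tree->find_by_tag_name('p');

# look_down follows the same scalar/list convention
my @links = $tree->look_down(_tag => 'a');

# Direct child traversal: the immediate children of <body>
my $body     = $tree->find_by_tag_name('body');
my @children = $body->content_list;

# Copy out what we need before the tree is deleted
my $para_count  = scalar @paras;
my $first_text  = $first->as_text;
my $link_count  = scalar @links;
my $child_count = scalar @children;

print "Paragraphs: $para_count\n";
print "First paragraph: $first_text\n";
print "Links: $link_count\n";
print "Children of body: $child_count\n";

$tree->delete;
```

Assigning to an array (`@paras`) forces list context, while assigning to a scalar (`$first`) forces scalar context, so the same method call yields different results.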
Basic usage example
The example below demonstrates:
- Retrieving an HTML response using `LWP::UserAgent`.
- Parsing the HTML content with `HTML::TreeBuilder`.
- Extracting all the links (`<a>` tags) and printing their URLs and text.
- Proper tree cleanup using `delete` to avoid memory leaks.
use strict;
use warnings;
use LWP::UserAgent;
use HTML::TreeBuilder;

# Create a user agent and fetch a web page
my $ua = LWP::UserAgent->new;
my $url = 'http://example.com/';
my $response = $ua->get($url);

if ($response->is_success) {
    my $html = $response->decoded_content;

    # Parse the HTML content with HTML::TreeBuilder
    my $tree = HTML::TreeBuilder->new_from_content($html);

    # Find all <a> tags
    my @links = $tree->look_down(_tag => 'a');

    print "Links found on $url:\n";
    foreach my $link (@links) {
        # Extract the href attribute and text inside the link
        my $href = $link->attr('href') // '[no href]';
        my $text = $link->as_text;
        print " Text: '$text' URL: $href\n";
    }

    # Always delete the tree to free memory
    $tree->delete;
} else {
    die "Failed to fetch $url: ", $response->status_line, "\n";
}
Explanation:
- `LWP::UserAgent` performs the HTTP GET request. Its `decoded_content` method returns the HTML as a Perl character string, decoded according to the response's charset.
- `HTML::TreeBuilder->new_from_content($html)` creates a tree from this HTML.
- `look_down` is a powerful method for searching nodes by tag name, attribute values, or custom criteria. Here, `(_tag => 'a')` finds all anchor elements.
- We iterate over each anchor and extract the `href` attribute and the text content inside the tag, with markup removed (`as_text` method).
- `$tree->delete` is important: it breaks the circular references inside the parse tree so that Perl can free its memory. Forgetting this can lead to memory leaks in long-running programs.
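To illustrate the kinds of criteria `look_down` accepts, here is a minimal sketch that matches on an exact attribute value, on a regular expression, and with a code reference. The sample markup and the `ext` class name are assumptions made for the example.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical sample markup
my $html = <<'HTML';
<html><body>
<a href="https://example.com/docs" class="ext">Docs</a>
<a href="/local">Local</a>
<a name="anchor-only">No href</a>
</body></html>
HTML

my $tree = HTML::TreeBuilder->new_from_content($html);

# Exact match on an attribute value
my @ext = $tree->look_down(_tag => 'a', class => 'ext');

# Regular expression match on an attribute value
my @absolute = $tree->look_down(_tag => 'a', href => qr{^https?://});

# Code reference: called with each candidate element as its first argument
my @no_href = $tree->look_down(
    _tag => 'a',
    sub { !defined $_[0]->attr('href') },
);

# Copy out results before deleting the tree
my $ext_count     = scalar @ext;
my $absolute_href = $absolute[0]->attr('href');
my $no_href_count = scalar @no_href;

print "Links with class 'ext': $ext_count\n";
print "First absolute link: $absolute_href\n";
print "Anchors without href: $no_href_count\n";

$tree->delete;
```

All criteria passed to a single `look_down` call must hold for an element to match, so the tag test and the attribute test combine as a logical AND.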
Common pitfalls:
- Forgetting to call `delete` on the tree after use.
- Assuming the `href` attribute always exists; it may be missing or empty.
- Not handling character encoding properly (use `decoded_content` for HTTP responses).
- Calling `find_by_tag_name` in scalar context, which returns only the first match. Use list context (with either `find_by_tag_name` or `look_down`) for complete results.
- HTML::TreeBuilder tolerates imperfect markup, but very broken HTML may produce an incomplete or unexpected tree.
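One way to guard against the first two pitfalls is to wrap link extraction in a helper that skips anchors without a usable `href` and always deletes the tree before returning. This is only a sketch: `extract_links` is a hypothetical helper name, and the sample HTML is invented for the example.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical helper: returns a list of { href, text } hashes
sub extract_links {
    my ($html) = @_;
    my $tree = HTML::TreeBuilder->new_from_content($html);

    my @found;
    for my $link ($tree->look_down(_tag => 'a')) {
        my $href = $link->attr('href');

        # Skip anchors whose href is missing or empty
        next unless defined $href && length $href;

        push @found, { href => $href, text => $link->as_trimmed_text };
    }

    # Plain strings were copied out above, so the tree can be freed now
    $tree->delete;
    return @found;
}

# Hypothetical sample markup with one usable link
my $sample = '<html><body>'
           . '<a href="/home">Home</a>'
           . '<a href="">Empty</a>'
           . '<a>Missing</a>'
           . '</body></html>';

my @links = extract_links($sample);
print "$_->{text} => $_->{href}\n" for @links;
```

Copying the attribute values and text into plain Perl data before calling `delete` means the caller never holds references to elements of a tree that has already been torn down.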
Version notes:
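- `HTML::TreeBuilder` has been stable since the Perl 5.8.x era.
- The example above uses the defined-or operator (`//`), which requires Perl 5.10 or later. Newer syntax such as postfix dereferencing (stable from Perl 5.24) is optional and not needed here, and smart match is experimental and best avoided, so the example stays simple and broadly compatible.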
In summary, HTML::TreeBuilder provides a powerful, flexible way to parse and navigate HTML directly from your Perl scripts. Combined with LWP::UserAgent or other HTTP clients for fetching content, it lets you extract information cleanly and reliably.
Verified Code
Executed in a sandbox to capture real output. • v5.34.1 • 1707ms
Links found on http://example.com/:
Text: 'Learn more' URL: https://iana.org/domains/example