How to extract URLs from HTML content in Perl?
Extracting URLs from HTML Content in Perl
Extracting URLs from HTML content is a common task in web scraping or text processing. In Perl, you can approach this in multiple ways, but the key consideration is whether to use a proper HTML parser or rely on regular expressions.
While it's tempting to use regular expressions because they are quick and simple, parsing HTML with regex alone is error-prone because of HTML’s nested and varied structure. For reliable extraction, especially with complex or malformed HTML, a dedicated parser such as HTML::Parser or HTML::TreeBuilder is preferable; both are CPAN modules rather than part of core Perl, though they are widely packaged. However, because the question is focused on a standalone Perl script without external dependencies, this example uses a regex approach suitable for simple or well-formed HTML.
Key Perl Concepts in this Task
- Sigils: $ for scalars (strings), @ for arrays (lists of matches).
- Context: A regular expression match with the g modifier returns all matches (or all captured groups) when evaluated in list context.
- TMTOWTDI: In Perl style, you can extract using regex, parsers, or combinations: "There's more than one way to do it."
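The effect of context on a g match can be seen in a short sketch. The sample string and the variable names here are made up for illustration; only the behaviour of the match operator is the point.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical sample input, used only for illustration.
my $html = q{<a href="/one">1</a> <a href="/two">2</a>};

# List context: a /g match returns every captured group at once.
my @all = $html =~ /href="([^"]*)"/g;

# Scalar context: each evaluation finds one match and resumes
# where the previous one ended, so it is usually driven by a loop.
my @stepwise;
while ($html =~ /href="([^"]*)"/g) {
    push @stepwise, $1;
}

# Counting idiom: the empty list forces list context, and assigning
# that list assignment to a scalar yields the number of matches.
my $count = () = $html =~ /href="([^"]*)"/g;

print "list:     @all\n";
print "stepwise: @stepwise\n";
print "count:    $count\n";
```

Both approaches collect the same values; the loop form is handy when each match needs extra processing or when the pattern has more than one capture group.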
Regex-Based URL Extraction Example
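This example extracts the values of href attributes and prints them. The pattern does not check which tag an attribute belongs to, so it matches href anywhere in the text, including inside comments. It assumes the HTML is UTF-8 and that attribute values are enclosed in double or single quotes.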
#!/usr/bin/perl
use strict;
use warnings;
my $html = q{
<html>
<body>
<a href="https://example.com">Example</a>
<a href='http://www.test.com/page'>Test Page</a>
<a href = "https://perl.org" someattr>Perl</a>
<!-- A comment with a fake url: href="fake" -->
</body>
</html>
};
# Regex to match href="URL" or href='URL'
my @urls = $html =~ /href\s*=\s*["'](.*?)["']/g;
print "Extracted URLs:\n";
print "$_\n" for @urls;
Explanation
- The regex /href\s*=\s*["'](.*?)["']/g matches href, optional whitespace around the = sign, then either a double or a single quote enclosing the URL.
- The (.*?) non-greedy capture group extracts the URL without including the quotes.
- The g modifier returns all captured URLs in list context, which are stored in @urls.
- The loop prints each extracted URL on a new line.
Common Pitfalls
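- Malformed HTML: Regex may fail if tags or attributes are broken or incomplete.
- Tags and attribute names: The regex does not check that href belongs to an anchor tag. It also matches href in other tags (such as link), attribute names that merely end in href (such as data-href), and text inside comments, which is why the sample output includes fake.
- JavaScript and encoding: URLs generated via JavaScript are not captured, and HTML entities inside a URL (such as &amp;) are returned undecoded.
- Quoted attributes: This regex handles single and double quotes but not unquoted attribute values, which HTML5 permits. It also does not require the closing quote to match the opening one, so a value containing the other quote character is cut short.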
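The pattern in the example also matches href text inside HTML comments and accepts mismatched quotes. A minimal sketch of one way to reduce both problems while staying dependency-free is shown below. The sample markup and the variable names are hypothetical, and this is still a regex heuristic rather than a parser: for instance, it assumes no attribute value before href contains a > character.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical sample input, used only for illustration.
my $html = q{
<a href="https://example.com">Example</a>
<a href='/search?q="perl"'>Search</a>
<!-- A comment with a fake url: href="fake" -->
<a data-href="ignored" href = "https://perl.org">Perl</a>
};

# Work on a copy with comments removed so commented-out markup is skipped.
(my $visible = $html) =~ s/<!--.*?-->//gs;

my @urls;
# (["']) captures the opening quote and \1 requires the same closing quote.
# <a\b restricts matches to anchor tags; \s before href rejects data-href.
while ($visible =~ /<a\b[^>]*?\shref\s*=\s*(["'])(.*?)\1/gis) {
    push @urls, $2;
}

print "$_\n" for @urls;
```

A while loop is used instead of a list-context match because the pattern now has two capture groups, and a list-context g match would return the quote characters interleaved with the URLs.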
Using a Module for More Robust Extraction (optional)
If you are allowed to use external modules, HTML::Parser or HTML::LinkExtor can parse HTML correctly. Both ship in the HTML-Parser distribution on CPAN and are not part of core Perl, so they may need to be installed first. The example above stays within core Perl and a simple regex for ease of use.
In summary, extracting URLs from HTML in Perl can be done easily with regex for simple cases, but proper HTML parsers are the better choice for robustness in real-world scenarios.
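For comparison, here is a minimal sketch of the parser-based route using HTML::LinkExtor. It assumes the HTML-Parser distribution from CPAN is installed; the sample markup repeats the earlier example, and the @urls variable and the choice to keep only anchor tags are illustrative decisions, not requirements of the module.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::LinkExtor;    # CPAN module, part of the HTML-Parser distribution

my $html = q{
<html>
<body>
<a href="https://example.com">Example</a>
<a href='http://www.test.com/page'>Test Page</a>
<a href = "https://perl.org" someattr>Perl</a>
<!-- A comment with a fake url: href="fake" -->
</body>
</html>
};

my @urls;

# The callback receives the lowercased tag name followed by the
# link attributes of that tag as key/value pairs.
my $parser = HTML::LinkExtor->new(sub {
    my ($tag, %attr) = @_;
    push @urls, $attr{href} if $tag eq 'a' && defined $attr{href};
});

$parser->parse($html);
$parser->eof;    # signal end of input so buffered text is processed

print "$_\n" for @urls;
```

Because the input is tokenized as HTML, the href text inside the comment is not reported. No base URL is passed to the constructor here, so the links come back as the strings found in the markup.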
Verified Code
Executed in a sandbox to capture real output. • v5.34.1 • 9ms
Extracted URLs:
https://example.com
http://www.test.com/page
https://perl.org
fake