How to scrape web page content in Perl?

How to Scrape Web Page Content in Perl

Scraping web pages in Perl typically involves two main steps: fetching the HTML content from a URL, and then parsing that content to extract the information you need. Perl's flexibility and rich module ecosystem make web scraping straightforward, especially using core or widely available modules.

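Here’s a breakdown of the process, along with a runnable example. Its only dependency is LWP::Simple, which comes from the libwww-perl distribution on CPAN rather than the Perl core, so install that first if your system does not already provide it.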

Step 1: Fetch the Web Page

To retrieve a page, a simple option is LWP::Simple, whose get() function returns the full HTML as a string. It is not a core module; it ships with the libwww-perl distribution on CPAN.

LWP::Simple is built on the larger LWP::UserAgent module, which offers finer control if you need it (e.g., headers, cookies).
It handles HTTP out of the box; HTTPS additionally requires LWP::Protocol::https.
If you want a core-only alternative, HTTP::Tiny has shipped with Perl since 5.14. For form handling and session-style browsing, consider WWW::Mechanize (non-core).
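
As a sketch of the core-only route, the fetch step could be written with HTTP::Tiny along these lines. The helper name fetch_page and the 10-second timeout are illustrative choices, not part of the original example.

```perl
use strict;
use warnings;
use HTTP::Tiny;

# Hypothetical helper: return the page body, or undef on any failure.
sub fetch_page {
    my ($url) = @_;

    my $http     = HTTP::Tiny->new( timeout => 10 );
    my $response = $http->get($url);

    # get() does not die on errors; internal failures (bad URL,
    # connection problems) come back as a response with status 599.
    unless ( $response->{success} ) {
        warn "Request for $url failed: $response->{status} $response->{reason}\n";
        return;
    }
    return $response->{content};
}

# Fetch the URL given on the command line, if any.
if ( my $url = shift @ARGV ) {
    my $html = fetch_page($url);
    print length($html), " bytes fetched from $url\n" if defined $html;
}
```

Saved under a name of your choosing, the script takes the URL as its first argument. For https:// URLs, HTTP::Tiny additionally needs IO::Socket::SSL and Net::SSLeay to be installed.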

Step 2: Parse the HTML

Parsing HTML can range from a simple regex for basic tasks (not recommended due to HTML complexity) to dedicated parsers like HTML::TreeBuilder or HTML::Parser. None of these parsers are core modules.

HTML::Parser is often already present because libwww-perl depends on it, but its event-driven interface requires more code. For this example, we’ll keep it simple by extracting something straightforward with a regex, for demonstration only. For robust parsing, consider installing Mojo::DOM or HTML::TreeBuilder from CPAN.
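
For comparison, here is a minimal sketch of link extraction with HTML::Parser's event-handler interface. It assumes HTML::Parser is installed from CPAN and falls back to a message if it is not; the helper name extract_links and the sample markup are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: return an array ref of href values found in <a> tags,
# or undef if HTML::Parser (a CPAN module, not core) is not installed.
sub extract_links {
    my ($html) = @_;
    return unless eval { require HTML::Parser; 1 };

    my @links;
    my $parser = HTML::Parser->new(
        api_version => 3,
        # Called for every start tag with the lowercased tag name
        # and a hash ref of its attributes.
        start_h => [
            sub {
                my ( $tagname, $attr ) = @_;
                push @links, $attr->{href}
                    if $tagname eq 'a' && defined $attr->{href};
            },
            'tagname, attr',
        ],
    );
    $parser->parse($html);
    $parser->eof;    # signal the end of the document
    return \@links;
}

# Sample markup (made up for this sketch); note the unquoted second href.
my $html = '<p><a href="/docs">Docs</a> <A HREF=about.html>About</A></p>';

my $links = extract_links($html);
if ($links) {
    print "$_\n" for @$links;
}
else {
    print "HTML::Parser is not installed\n";
}
```

Unlike a regex, the parser normalizes tag and attribute case and copes with unquoted attribute values, at the cost of the extra handler setup.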

Perl Concepts in the Example

$url: a scalar variable (scalar sigil $) holds a string URL.
get($url): function call returning the HTML content or undef if unavailable.
Regular expressions use =~ to apply matching.
Context plays a role: using =~ /pattern/g in list context extracts all matches.
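
The effect of context on a /g match can be seen in a small, offline sketch. The sample string and variable names are made up for illustration.

```perl
use strict;
use warnings;

# Sample text (made up for this sketch).
my $text = 'id=10 id=25 id=31';

# List context: /g returns every captured value at once.
my @ids = ( $text =~ /id=(\d+)/g );

# Scalar context: /g finds one match per call and remembers its position,
# so a while loop walks through the matches in order.
my @seen;
while ( $text =~ /id=(\d+)/g ) {
    push @seen, $1;
}

# Counting idiom: assign to an empty list, then read that in scalar context.
my $count = () = $text =~ /id=(\d+)/g;

print "list context:   @ids\n";
print "scalar context: @seen\n";
print "match count:    $count\n";
```

Both approaches collect the same three values; the list-context form is simply the shorter one when you want all matches at once.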

Complete Runnable Example: Extracting All Links from a Web Page

use strict;
use warnings;
use LWP::Simple;    # from libwww-perl (CPAN); not a core module

# URL to scrape
my $url = 'http://example.com';

# Fetch HTML content
my $html = get($url);

if (defined $html) {
    print "Fetched content from $url\n\n";

    # Extract all href links (basic regex, not perfect but illustrative)
    my @links = ($html =~ /<a\s+[^>]*href=['"]([^'"]+)['"]/gi);

    if (@links) {
        print "Found ", scalar(@links), " links:\n";
        foreach my $link (@links) {
            print " - $link\n";
        }
    } else {
        print "No links found.\n";
    }
} else {
    # Report the failure on STDERR and exit with a non-zero status
    die "Failed to fetch $url\n";
}

How this works:

LWP::Simple::get returns the HTML content as a single string.
The regex /<a\s+[^>]*href=['"]([^'"]+)['"]/gi captures quoted href attribute values from anchor tags; unquoted values are not matched.
The use of the global /g modifier and list context extracts every match into @links.
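
To see what the pattern does and does not catch, it can be run against a small inline document without touching the network. The sample markup below is made up for illustration.

```perl
use strict;
use warnings;

# Sample markup (made up for this sketch).
my $html = <<'HTML';
<a href="/one">quoted, double</a>
<a class="nav" href='/two'>quoted, single</a>
<a href=/three>unquoted</a>
<!-- <a href="/four">commented out</a> -->
HTML

# Same pattern as in the example above.
my @links = ( $html =~ /<a\s+[^>]*href=['"]([^'"]+)['"]/gi );

print "$_\n" for @links;
```

This prints /one, /two and /four: the unquoted /three is missed, while the commented-out /four is picked up even though a browser would ignore it.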

Common Pitfalls and Tips

Regex is brittle for HTML parsing: Real-world HTML can be messy, so use an HTML parser like HTML::TreeBuilder or Mojo::DOM for production-quality scrapers.
Respect robots.txt and site policies: Always check a website’s scraping rules and be respectful by limiting request frequency.
User-Agent strings: Some sites reject requests with missing or default user agents. LWP::UserAgent and HTTP::Tiny let you set the agent string and other headers; LWP::Simple's get() takes only a URL.
HTTPS support: LWP needs LWP::Protocol::https for https:// URLs; HTTP::Tiny needs IO::Socket::SSL and Net::SSLeay. These are often already installed.
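
Two of the tips above, a custom User-Agent and a limited request rate, could be combined roughly as follows. This is a sketch using the core HTTP::Tiny module; the agent string, the contact address, the 2-second delay and the helper name fetch_politely are all placeholders.

```perl
use strict;
use warnings;
use HTTP::Tiny;

# Hypothetical agent string and contact address; replace with your own.
my $http = HTTP::Tiny->new(
    agent   => 'example-scraper/0.1 (contact: you@example.com)',
    timeout => 10,
);

# Hypothetical helper: fetch each URL in turn, pausing between requests.
# Returns a hash ref mapping URL => body (undef where the request failed).
sub fetch_politely {
    my ( $client, $delay_seconds, @urls ) = @_;
    my %body_for;
    for my $i ( 0 .. $#urls ) {
        sleep $delay_seconds if $i > 0 && $delay_seconds;
        my $response = $client->get( $urls[$i] );
        $body_for{ $urls[$i] } =
            $response->{success} ? $response->{content} : undef;
    }
    return \%body_for;
}

# URLs come from the command line; nothing is fetched without them.
if (@ARGV) {
    my $pages = fetch_politely( $http, 2, @ARGV );
    for my $url ( sort keys %$pages ) {
        my $body = $pages->{$url};
        printf "%s: %s\n", $url,
            defined $body ? length($body) . ' bytes' : 'failed';
    }
}
```

A fixed sleep is the simplest form of rate limiting. It does not read robots.txt, so checking the site's rules remains a separate step.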

Summary

In Perl, LWP::Simple makes fetching HTML easy, and you can extract page content with regular expressions for simple needs. For robust web scraping, consider CPAN modules dedicated to HTML parsing. Understanding Perl's flexible context and regex features is key to writing effective scrapers. The example above shows a lightweight, runnable starting point to extract links from a web page.
