How to parse Apache log format in Perl?

Parsing Apache log files is a common text-processing task in Perl. Each line in Apache's combined log format is a structured record whose fields include the client IP address, timestamp, request line (method, URL, protocol), status code, and more. Perl's regular expressions, whose capture groups can be returned directly as a list, make it well suited to extracting and manipulating this data.

Understanding Apache Log Format

A line in the Apache combined log format looks like this:

127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "GET /apache_pb.gif HTTP/1.0" 200 2326 "http://example.com/start.html" "Mozilla/4.08 [en] (Win98; I ;Nav)"

It contains several fields:

  • 127.0.0.1: Client IP address
  • -: Remote logname (often - if unused)
  • frank: Authenticated user (if any)
  • [10/Oct/2000:13:55:36 -0700]: Date/time (in square brackets)
  • "GET /apache_pb.gif HTTP/1.0": Request line (method, URI, protocol)
  • 200: HTTP status code
  • 2326: Size of the response in bytes, excluding headers (- if no content was returned)
  • "http://example.com/start.html": Referrer URL
  • "Mozilla/4.08 [en] (Win98; I ;Nav)": User agent string

Parsing with Perl

You can use the =~ binding operator with capture groups (parentheses) to extract the fields. Two things to keep in mind:

  • After a successful match, the captured groups are available in the scalar variables $1, $2, and so on.
  • Context matters: in scalar context a match returns only true or false; in list context it returns the list of captured groups.
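
As a minimal sketch of how context changes what a match returns (the shortened sample string and the variable names here are illustrative, not part of the original example):

```perl
use strict;
use warnings;

# Illustrative input: only the first three fields of a log line.
my $line = '127.0.0.1 - frank';

# Scalar context: the match only reports success (1) or failure.
my $matched = $line =~ /^(\S+) (\S+) (\S+)/;

# List context (note the parentheses on the left): the match
# returns the captured groups themselves.
my ($host, $logname, $user) = $line =~ /^(\S+) (\S+) (\S+)/;

# Easy mistake: without parentheses this is scalar context again,
# so $oops holds 1 rather than the host.
my $oops = $line =~ /^(\S+)/;

print "matched=$matched host=$host logname=$logname user=$user oops=$oops\n";
```

Writing my ($host) = ... with the parentheses is what forces list context when only one capture is wanted.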

A typical regex to parse combined log lines might look like this:

^(\S+) (\S+) (\S+) \[([^\]]+)\] "([^"]*)" (\d{3}) (\S+) "([^"]*)" "([^"]*)"

Where \S+ matches a non-whitespace sequence, \[([^\]]+)\] captures the date/time inside square brackets, and "([^"]*)" captures quoted strings.

Runnable Perl Example

use strict;
use warnings;

# Sample Apache combined log entry
my $log_line = q{127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "GET /apache_pb.gif HTTP/1.0" 200 2326 "http://example.com/start.html" "Mozilla/4.08 [en] (Win98; I ;Nav)"};

if ( $log_line =~ m/^(\S+)             # Remote Host
                   \s+(\S+)            # Remote logname
                   \s+(\S+)            # Remote user
                   \s+\[([^\]]+)\]     # Date
                   \s+"([^"]*)"        # Request line
                   \s+(\d{3})          # Status
                   \s+(\S+)            # Bytes
                   \s+"([^"]*)"        # Referer
                   \s+"([^"]*)"        # User agent
                   $/x
) {
    my ($remote_host, $remote_logname, $remote_user, $date_time, $request_line,
        $status, $bytes, $referer, $user_agent) = ($1, $2, $3, $4, $5, $6, $7, $8, $9);

    print "Remote Host: $remote_host\n";
    print "Remote Logname: $remote_logname\n";
    print "Remote User: $remote_user\n";
    print "Date/Time: $date_time\n";
    print "Request Line: $request_line\n";
    print "Status: $status\n";
    print "Bytes: $bytes\n";
    print "Referer: $referer\n";
    print "User Agent: $user_agent\n";
} else {
    print "Failed to parse log line\n";
}
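
The request line captured above still bundles three values. One possible way to separate them, assuming the parts are whitespace-delimited as in the sample entry, is a limited split; the '-' default for missing parts is a choice made for this sketch, not an Apache convention:

```perl
use strict;
use warnings;

# The request line as captured from the sample entry.
my $request_line = 'GET /apache_pb.gif HTTP/1.0';

# Split on whitespace into at most three parts.
my ($method, $uri, $protocol) = split ' ', $request_line, 3;

# Malformed requests may lack some parts; fall back to '-' so that
# later string operations do not hit undef (an assumption of this sketch).
$_ //= '-' for $method, $uri, $protocol;

print "Method: $method\n";
print "URI: $uri\n";
print "Protocol: $protocol\n";
```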

Explanation and Best Practices

  • Regex with Comments: The /x modifier allows whitespace and comments inside the regex for readability.
  • TMTOWTDI ("There's More Than One Way To Do It"): You could also parse the log using split, Text::ParseWords, or a specialized CPAN log parsing module. A regex is usually the most direct solution, though, and needs nothing beyond core Perl.
  • Handling Optional/Empty Fields: Fields such as bytes or referer may contain - instead of actual data, so your code should check for that before using the value.
  • Performance: For large log files, read line by line with a while (my $line = <$fh>) loop instead of loading the whole file, and compile the regex once with qr//.
  • Time Parsing: The captured date/time is a plain string; to compare timestamps or do date arithmetic, parse it further, for example with Time::Piece (a core module since Perl 5.10).
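
For the time-parsing point, here is a hedged sketch using Time::Piece. It parses the wall-clock part with strptime and applies the UTC offset by hand, which avoids relying on %z support; the variable names are illustrative:

```perl
use strict;
use warnings;
use Time::Piece;

# The date/time string as captured from the sample entry.
my $date_time = '10/Oct/2000:13:55:36 -0700';

# Separate the wall-clock timestamp from the UTC offset.
my ($stamp, $offset) = split ' ', $date_time;

# strptime treats the wall-clock time as if it were UTC.
my $wall = Time::Piece->strptime($stamp, '%d/%b/%Y:%H:%M:%S');

# Convert the +hhmm / -hhmm offset to seconds.
my ($sign, $hh, $mm) = $offset =~ /^([+-])(\d{2})(\d{2})$/
    or die "Unexpected offset: $offset";
my $offset_seconds = ($sign eq '-' ? -1 : 1) * ($hh * 3600 + $mm * 60);

# wall-clock = UTC + offset, so UTC = wall-clock - offset.
my $utc = $wall - $offset_seconds;

print "UTC: ", $utc->datetime, "\n";   # 2000-10-10T20:55:36
print "Epoch: ", $utc->epoch, "\n";    # 971211336
```

Once the value is an epoch or a Time::Piece object, entries can be sorted or filtered by time range with ordinary numeric comparisons.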

Common Pitfalls

  • Not anchoring the regex with ^ and $, which can let malformed lines match partially.
  • Failing to escape regex metacharacters that should match literally (like the square brackets around the timestamp).
  • Mixing up scalar and list context when matching: in scalar context the match returns only true or false, not the captured groups.
  • Assuming fields like bytes or referer always hold real data, when they may contain - instead.
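
The sketch below pulls several of these points together: an anchored pattern compiled once with qr//, a line-by-line loop, and explicit handling of - fields and unparseable lines. The sample data and the in-memory filehandle are assumptions made to keep the example self-contained; in real use you would open the actual log file.

```perl
use strict;
use warnings;

# Compile the pattern once with qr// and reuse it for every line.
my $combined = qr/^(\S+) \s+ (\S+) \s+ (\S+)   # host, logname, user
                  \s+ \[([^\]]+)\]             # timestamp
                  \s+ "([^"]*)"                # request line
                  \s+ (\d{3}) \s+ (\S+)        # status, bytes
                  \s+ "([^"]*)" \s+ "([^"]*)"  # referer, user agent
                  $/x;

# Hypothetical sample data standing in for a real log file.
my $sample = <<'LOG';
127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "GET /apache_pb.gif HTTP/1.0" 200 2326 "http://example.com/start.html" "Mozilla/4.08 [en] (Win98; I ;Nav)"
127.0.0.1 - - [10/Oct/2000:13:55:40 -0700] "GET /missing.html HTTP/1.0" 304 - "-" "-"
not a log line
LOG

# In-memory filehandle; replace \$sample with a file path in real use.
open my $fh, '<', \$sample or die "Cannot open in-memory handle: $!";

my ($parsed, $skipped, $total_bytes, $no_referer) = (0, 0, 0, 0);
while (my $line = <$fh>) {
    chomp $line;
    my @fields = $line =~ $combined;
    if (!@fields) { $skipped++; next; }   # count lines that do not parse

    my ($bytes, $referer) = @fields[6, 7];
    $bytes = 0 if $bytes eq '-';          # '-' means no body was sent
    $no_referer++ if $referer eq '-';     # '-' means no Referer header

    $parsed++;
    $total_bytes += $bytes;
}
close $fh;

print "Parsed: $parsed, skipped: $skipped, ",
      "total bytes: $total_bytes, no referer: $no_referer\n";
```

Counting skipped lines rather than silently dropping them makes it easier to notice when the log format differs from what the pattern expects.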

In summary, parsing Apache logs in Perl involves carefully crafting a regex to match the log structure, extracting the fields into scalars using capture groups, and then processing or printing them as needed. This approach is very flexible and leverages Perl’s text-processing strengths without needing external modules.

Verified Code

Executed in a sandbox to capture real output. • v5.34.1 • 10ms

STDOUT
Remote Host: 127.0.0.1
Remote Logname: -
Remote User: frank
Date/Time: 10/Oct/2000:13:55:36 -0700
Request Line: GET /apache_pb.gif HTTP/1.0
Status: 200
Bytes: 2326
Referer: http://example.com/start.html
User Agent: Mozilla/4.08 [en] (Win98; I ;Nav)
STDERR
(empty)
