How to parse Apache log format in Perl?

Parsing Apache log files is a common Perl text-processing task. Each line in Apache’s combined log format is structured text with fields such as the client IP address, timestamp, request line (HTTP method, URL, protocol), status code, and response size. Perl’s regular expressions and its context-sensitive match operator make it well suited to extracting and manipulating this data.

Understanding Apache Log Format

A line in the Apache combined log format looks like this:

127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "GET /apache_pb.gif HTTP/1.0" 200 2326 "http://example.com/start.html" "Mozilla/4.08 [en] (Win98; I ;Nav)"

It contains several fields:

127.0.0.1: Client IP address
-: Remote logname (often - if unused)
frank: Authenticated user (if any)
[10/Oct/2000:13:55:36 -0700]: Date/time (in square brackets)
"GET /apache_pb.gif HTTP/1.0": Request line (method, URI, protocol)
200: HTTP status code
2326: Size of the response body in bytes, excluding headers (- if no body was sent)
"http://example.com/start.html": Referrer URL
"Mozilla/4.08 [en] (Win98; I ;Nav)": User agent string

Parsing with Perl

You can use the =~ binding operator to apply a regex to a log line and capture fields with parentheses. Two details matter when reading the captures:

After a successful match, the captured groups are available in the scalar variables $1, $2, and so on.
Context matters: in scalar context the match returns true or false; in list context (without /g) it returns the list of captured groups.
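
To make the context rule concrete, here is a minimal sketch that runs the same match three ways. The shortened sample line and the variable names are chosen here purely for illustration.

```perl
use strict;
use warnings;

# Shortened sample: just the first three fields of a log line.
my $line = q{127.0.0.1 - frank};

# Scalar context: the match returns true (1) or false, not the captures.
my $matched = $line =~ /^(\S+) (\S+) (\S+)/;

# List context: the same match returns the captured groups.
my ($host, $logname, $user) = $line =~ /^(\S+) (\S+) (\S+)/;

# Counting idiom: assign to an empty list, then read the count in scalar context.
my $count = () = $line =~ /^(\S+) (\S+) (\S+)/;

print "matched=$matched host=$host logname=$logname user=$user count=$count\n";
```

The parentheses around the variables on the left-hand side are what put the match into list context.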

A typical regex to parse combined log lines might look like this:

^(\S+) (\S+) (\S+) \[([^\]]+)\] "([^"]*)" (\d{3}) (\S+) "([^"]*)" "([^"]*)"

Here \S+ matches a run of non-whitespace characters, \[([^\]]+)\] captures the date/time inside the square brackets, "([^"]*)" captures the contents of a quoted string, and (\d{3}) captures the three-digit status code.
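
As a compact variant, the one-line pattern can be compiled once with qr// and matched in list context. This is only a sketch: parse_line and the hash keys are names invented here, and a trailing $ has been added to anchor the end of the line.

```perl
use strict;
use warnings;

# The one-line pattern from above, compiled once and anchored at both ends.
my $combined_re = qr/^(\S+) (\S+) (\S+) \[([^\]]+)\] "([^"]*)" (\d{3}) (\S+) "([^"]*)" "([^"]*)"$/;

# parse_line is a hypothetical helper: it returns a hash ref of fields,
# or nothing (undef in scalar context) when the line does not match.
sub parse_line {
    my ($line) = @_;
    my @fields = $line =~ $combined_re or return;
    my %entry;
    @entry{qw(host logname user time request status bytes referer agent)} = @fields;
    return \%entry;
}

my $entry = parse_line(q{127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "GET /apache_pb.gif HTTP/1.0" 200 2326 "http://example.com/start.html" "Mozilla/4.08 [en] (Win98; I ;Nav)"});
print "$entry->{host} $entry->{status} $entry->{request}\n" if $entry;
```

Returning a hash ref keeps the field names attached to the values, which tends to be easier to work with than nine positional variables.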

Runnable Perl Example

use strict;
use warnings;

# Sample Apache combined log entry
my $log_line = q{127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "GET /apache_pb.gif HTTP/1.0" 200 2326 "http://example.com/start.html" "Mozilla/4.08 [en] (Win98; I ;Nav)"};

if ( $log_line =~ m/^(\S+)             # Remote Host
                   \s+(\S+)            # Remote logname
                   \s+(\S+)            # Remote user
                   \s+\[([^\]]+)\]     # Date
                   \s+"([^"]*)"        # Request line
                   \s+(\d{3})          # Status
                   \s+(\S+)            # Bytes
                   \s+"([^"]*)"        # Referer
                   \s+"([^"]*)"        # User agent
                   $/x
) {
    my ($remote_host, $remote_logname, $remote_user, $date_time, $request_line,
        $status, $bytes, $referer, $user_agent) = ($1, $2, $3, $4, $5, $6, $7, $8, $9);

    print "Remote Host: $remote_host\n";
    print "Remote Logname: $remote_logname\n";
    print "Remote User: $remote_user\n";
    print "Date/Time: $date_time\n";
    print "Request Line: $request_line\n";
    print "Status: $status\n";
    print "Bytes: $bytes\n";
    print "Referer: $referer\n";
    print "User Agent: $user_agent\n";
} else {
    print "Failed to parse log line\n";
}
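
The request line itself holds three parts (method, URI, protocol), so a natural next step is to split it. The sketch below assumes the parts are separated by whitespace; split_request is a hypothetical helper name, and malformed requests may yield fewer than three parts.

```perl
use strict;
use warnings;

# split_request is a hypothetical helper: it breaks a request line such as
# "GET /apache_pb.gif HTTP/1.0" into method, URI, and protocol.
# Missing parts come back as undef (for example when the request is "-").
sub split_request {
    my ($request_line) = @_;
    my ($method, $uri, $protocol) = split ' ', $request_line, 3;
    return ($method, $uri, $protocol);
}

my ($method, $uri, $protocol) = split_request('GET /apache_pb.gif HTTP/1.0');
print "Method: $method\nURI: $uri\nProtocol: $protocol\n";
```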

Explanation and Best Practices

Regex with Comments: The /x modifier allows whitespace and comments inside the regex for readability.
TMTOWTDI ("There's More Than One Way To Do It"): You could also parse the log with split, Text::ParseWords, or a dedicated log-parsing module from CPAN, but a regex is usually the most direct approach and needs only core Perl.
Handling Optional/Empty Fields: Apache writes - in place of a missing value (commonly for bytes, referer, remote logname, and remote user), so check for it before treating a field as a number or a URL.
Performance: For large log files, read line by line with a while (my $line = <$fh>) loop on a lexical filehandle rather than slurping the whole file, and compile the pattern once with qr//.
Time Parsing: The captured date/time is a plain string; to work with it as a Time::Piece object (a core module since Perl 5.10), you need to parse it further, for example with strptime.
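
The performance and time-parsing points can be sketched together as follows. The sample data is read from an in-memory filehandle standing in for a real log file, the shortened pattern and variable names are illustrative, and the strptime format assumes your Time::Piece version handles the %z offset.

```perl
use strict;
use warnings;
use Time::Piece;

# Compile the pattern once, outside the loop. Only host, timestamp,
# and status are captured in this shortened version.
my $re = qr/^(\S+) \S+ \S+ \[([^\]]+)\] "[^"]*" (\d{3})\s/;

# In-memory sample standing in for a real log file. For a file on disk,
# open its path instead of the scalar reference.
my $sample = <<'END';
127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "GET /apache_pb.gif HTTP/1.0" 200 2326 "-" "-"
127.0.0.1 - - [10/Oct/2000:13:56:01 -0700] "GET /missing HTTP/1.0" 404 - "-" "-"
END
open my $fh, '<', \$sample or die "Cannot open sample data: $!";

my %hits_by_status;
my @times;
while (my $line = <$fh>) {
    chomp $line;
    # Skip lines that do not match instead of dying.
    my ($host, $stamp, $status) = $line =~ $re or next;
    $hits_by_status{$status}++;
    # Turn "10/Oct/2000:13:55:36 -0700" into a Time::Piece object.
    push @times, Time::Piece->strptime($stamp, '%d/%b/%Y:%H:%M:%S %z');
}
close $fh;

print "$_: $hits_by_status{$_}\n" for sort keys %hits_by_status;
print $_->datetime, "\n" for @times;
```

Reading one line at a time keeps memory use flat regardless of file size, which is the main reason to prefer the loop over reading the whole file into an array.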

Common Pitfalls

Not anchoring the regex with ^ and $, which can let it match only part of a line.
Failing to escape regex metacharacters that appear literally in the log line, such as the square brackets around the timestamp.
Confusing scalar and list context when matching: a scalar assignment receives only the true/false result, not the captured groups.
Assuming fields like bytes or referer always contain real data, when they may hold only -.
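
One way to guard against the last pitfall is to normalize the - placeholder right after parsing. The helpers below are hypothetical names introduced for this sketch, not part of any module.

```perl
use strict;
use warnings;

# normalize_field is a hypothetical helper: it maps Apache's "-"
# placeholder (or an undefined value) to undef and passes real values through.
sub normalize_field {
    my ($value) = @_;
    return defined($value) && $value ne '-' ? $value : undef;
}

# bytes_or_zero is a hypothetical helper: it treats a non-numeric
# bytes field such as "-" as 0 so it can be used safely in arithmetic.
sub bytes_or_zero {
    my ($bytes) = @_;
    return defined($bytes) && $bytes =~ /^\d+$/ ? $bytes + 0 : 0;
}

my $referer = normalize_field('-');
print defined($referer) ? "Referer: $referer\n" : "Referer: (none)\n";
print "Bytes: ", bytes_or_zero('-'), "\n";
```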

In summary, parsing Apache logs in Perl involves carefully crafting a regex to match the log structure, extracting the fields into scalars using capture groups, and then processing or printing them as needed. This approach is very flexible and leverages Perl’s text-processing strengths without needing external modules.
