How to parse fixed-width column data in Perl?

Parsing Fixed-Width Column Data in Perl

Fixed-width column data is common in legacy text files, where each field occupies a predetermined number of characters and no delimiters such as commas or tabs separate the values. Perl’s built-in unpack function handles this well: it extracts fields according to the widths given in a format string. In keeping with Perl’s “There’s More Than One Way To Do It” (TMTOWTDI) philosophy, substr and regular expressions also work, but unpack is usually the most concise and readable choice for fixed-width parsing.

How unpack Works

unpack takes a format string (the template) describing the structure of the data and a scalar containing the line to parse. Each element of the format string corresponds to one field and specifies its type and length. For fixed-width text columns, the most common directive is A (a text string whose trailing space padding is stripped) followed by the field width, e.g., A10 for a 10-character field.

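  • A10: Extract a 10-character string and strip its trailing spaces.
  • a10: Extract a 10-character string exactly as it appears, padding included.
  • I, N, L: Decode binary integers from raw bytes. They do not read digits stored as text, so numeric columns in fixed-width text are normally extracted with A and then used as numbers.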
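
The difference between A and a is easiest to see side by side. The following is a minimal sketch using a made-up 10-character field; the variable names are illustrative only.

```perl
use strict;
use warnings;

# Hypothetical 10-character field: "Bob" followed by seven spaces of padding.
my $field = "Bob       ";

my ($stripped) = unpack("A10", $field);   # trailing spaces removed
my ($raw)      = unpack("a10", $field);   # padding kept as-is

printf "A10 -> '%s' (length %d)\n", $stripped, length($stripped);
printf "a10 -> '%s' (length %d)\n", $raw,      length($raw);
```

A is usually the better fit when the padding carries no meaning; a is useful when the exact column contents must be preserved, for example when writing the record back out unchanged.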

Here’s a simple example extracting three fixed-width fields: 5 chars + 8 chars + 3 chars from each line.
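
A minimal sketch of that 5 + 8 + 3 layout, assuming hypothetical records made of a 5-character code, an 8-character item name, and a 3-character quantity (the field names and sample values are invented for illustration):

```perl
use strict;
use warnings;

# Hypothetical records: code (5 chars), item (8 chars), quantity (3 chars).
my @records = (
    "A1001Widget  012",
    "B2   Gadget  007",
);

# One array reference of three fields per record.
my @parsed = map { [ unpack("A5 A8 A3", $_) ] } @records;

for my $row (@parsed) {
    my ($code, $item, $qty) = @$row;
    print "$code|$item|$qty\n";
}
```

Spaces inside the format string are ignored, so "A5 A8 A3" and "A5A8A3" describe the same layout; the spaced form is simply easier to read.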

Common Pitfalls

  • Line length mismatch: If an input line is shorter than the total expected width, unpack returns truncated or empty trailing fields rather than raising an error. Check the line length, or pad the line, before unpacking.
  • Trailing spaces: The A format strips trailing spaces automatically, but a doesn’t. Neither format strips leading spaces.
  • Character encoding: Field widths count characters only when the input has been decoded; in undecoded UTF-8, a multibyte character occupies several positions and shifts the columns that follow. Decode the input first (e.g., using binmode(STDIN, ":encoding(UTF-8)")).
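
One way to guard against the line length mismatch is to pad short lines to the full record width before unpacking. This is a sketch under assumptions: parse_line is a hypothetical helper, and the 20-character layout (10 + 3 + 7) is borrowed from the runnable example in this answer.

```perl
use strict;
use warnings;

my $format = "A10 A3 A7";
my $width  = 20;    # 10 + 3 + 7

# Hypothetical helper: right-pad short lines with spaces so that
# every field is present, then unpack as usual.
sub parse_line {
    my ($line) = @_;
    chomp $line;
    my $padded = sprintf("%-${width}s", $line);
    return unpack($format, $padded);
}

# This line stops after the age column; the country comes back empty.
my ($name, $age, $country) = parse_line("Bob       045");
print "Name: '$name', Age: '$age', Country: '$country'\n";
```

Padding keeps the number of returned fields constant. If a short line signals corrupt data rather than an omitted trailing field, comparing length($line) against the expected width and rejecting the line may be the safer choice.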

Runnable Example

use strict;
use warnings;

# Sample fixed-width lines (total width 20 chars):
# Fields: Name (10 chars), Age (3 chars), Country (7 chars)
my @lines = (
    "John Doe  025USA    ",
    "Jane Smith030Canada ",
    "Bob       045UK     ",
);

# Format: A10 (name), A3 (age), A7 (country)
my $format = "A10 A3 A7";

foreach my $line (@lines) {
    # Unpack fields using the specified widths
    my ($name, $age, $country) = unpack($format, $line);
    
    print "Name: '$name', Age: '$age', Country: '$country'\n";
}

This script defines a simple fixed-width data format and uses unpack to extract each field. Notice how the line length is exactly 20 characters (10 + 3 + 7). The output demonstrates the parsed data with trailing spaces removed thanks to A.

Output

When you run the script, it will print:


Name: 'John Doe', Age: '025', Country: 'USA'
Name: 'Jane Smith', Age: '030', Country: 'Canada'
Name: 'Bob', Age: '045', Country: 'UK'
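
Because unpack returns plain strings, the age above still carries its leading zero. As a possible next step, the sketch below stores each record in a hash and converts the age to a number. The parse_record helper and the field names are assumptions made for illustration, not part of the original script.

```perl
use strict;
use warnings;

my @fields = qw(name age country);    # illustrative field names
my $format = "A10 A3 A7";

# Hypothetical helper: return one hash reference per line.
sub parse_record {
    my ($line) = @_;
    my %record;
    @record{@fields} = unpack($format, $line);   # hash slice: one key per field
    $record{age} += 0;                           # the string "025" becomes the number 25
    return \%record;
}

my $person = parse_record("John Doe  025USA    ");
print "$person->{name} is $person->{age} and lives in $person->{country}\n";
```

Keeping the field names and the format string next to each other makes it easier to keep them in sync when the layout changes.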

Summary

The unpack function is your go-to tool for parsing fixed-width column data in Perl. Define a format string representing your fields' widths and types, then call unpack once per line to extract the structured data. This method is efficient and concise, and because unpack returns one value per field in list context, the results can be assigned directly to a list of variables.

Verified Code

Executed in a sandbox on Perl v5.34.1 (9 ms). STDOUT matched the output shown above; STDERR was empty.
