How to parse fixed-width column data in Perl?
Parsing Fixed-Width Column Data in Perl
Fixed-width column data is common in legacy text files, where each field occupies a predetermined number of characters and no delimiter such as a comma or tab separates the values. Perl’s built-in unpack function parses such data directly: given a template of field widths, it returns the corresponding substrings. In keeping with Perl’s “There’s More Than One Way To Do It” (TMTOWTDI) philosophy, substr and regular expressions can do the same job, but unpack is usually the most concise and readable choice for fixed-width records.
How unpack Works
unpack takes a format string (the template) describing the structure of the data and a scalar containing the line to parse. Each element of the template corresponds to one field and specifies its type and length. For fixed-width columns, the most common directive is A (a text string, space-padded when packed and stripped of trailing whitespace when unpacked) followed by a number giving the field width, e.g., A10 for a 10-character field. Whitespace between template elements is ignored, so "A10 A3 A7" and "A10A3A7" are equivalent.
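- A10: Extract a 10-character string with trailing whitespace (and null padding) stripped.
- a10: Extract a 10-character string exactly as stored, including any padding.
- I, N, L: Unpack binary integers. These read raw bytes rather than digit characters, so they rarely apply to fixed-width text; numeric columns are normally extracted with A and then used as numbers.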
Here’s a simple example extracting three fixed-width fields: 5 chars + 8 chars + 3 chars from each line.
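A minimal sketch of such a parse is shown here. The field names ($code, $label, $qty) and the sample record are assumptions chosen for illustration, not taken from a real data set.

```perl
use strict;
use warnings;

# Hypothetical record layout: code (5) + label (8) + qty (3) = 16 characters.
# "Widget" is followed by two spaces of padding to fill its 8-character column.
my $line = "AB123Widget  042";

# One template element per field; A strips the trailing padding.
my ($code, $label, $qty) = unpack("A5 A8 A3", $line);

print "code='$code' label='$label' qty='$qty'\n";
```

Because A strips trailing padding, $label comes back as 'Widget' rather than the full 8-character column.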
Common Pitfalls
- Line length mismatch: If input lines are shorter than the total expected width, unpack may return empty or truncated fields. Always ensure your input meets the width expectation or handle exceptions.
- Trailing spaces: The A format strips trailing spaces automatically, but a doesn’t.
- Character encoding: When dealing with Unicode input, be sure your input is decoded properly (e.g., using binmode(STDIN, ":encoding(UTF-8)")).
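The three pitfalls above can be checked with a short script. This is only a sketch: parse_record is a hypothetical helper, and the 20-character layout is assumed to match the Name/Age/Country format used later in this answer.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

my $format = "A10 A3 A7";
my $width  = 20;    # 10 + 3 + 7

# Hypothetical helper: refuse lines that do not match the expected width
# instead of silently returning empty or truncated fields.
sub parse_record {
    my ($line) = @_;
    if (length($line) != $width) {
        warn sprintf("expected %d characters, got %d\n", $width, length $line);
        return;
    }
    return unpack($format, $line);
}

my @ok    = parse_record("Jane Smith030Canada ");   # exactly 20 characters
my @short = parse_record("Jane Smith030");          # country column missing
print scalar(@ok), " fields vs ", scalar(@short), " fields\n";

# A strips trailing whitespace; a returns the field untouched.
my $padded    = "ab" . (" " x 3);                   # "ab" plus three spaces
my ($trimmed) = unpack("A5", $padded);
my ($raw)     = unpack("a5", $padded);
print "A5 length: ", length($trimmed), ", a5 length: ", length($raw), "\n";

# Field widths count characters only once the input has been decoded;
# on raw UTF-8 bytes the same text is longer, so columns would shift.
my $bytes = encode("UTF-8", "\x{e9}t\x{e9}");       # three characters as UTF-8
my $chars = decode("UTF-8", $bytes);
print "bytes: ", length($bytes), ", characters: ", length($chars), "\n";
```

Whether a short line should be rejected, padded, or parsed as-is depends on the data source; the warning-and-skip policy here is just one option.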
Runnable Example
use strict;
use warnings;
# Sample fixed-width lines (total width 20 chars):
# Fields: Name (10 chars), Age (3 chars), Country (7 chars)
my @lines = (
    "John Doe  025USA    ",
    "Jane Smith030Canada ",
    "Bob       045UK     ",
);
# Format: A10 (name), A3 (age), A7 (country)
my $format = "A10 A3 A7";
foreach my $line (@lines) {
    # Unpack fields using the specified widths
    my ($name, $age, $country) = unpack($format, $line);
    print "Name: '$name', Age: '$age', Country: '$country'\n";
}
This script defines a simple fixed-width data format and uses unpack to extract each field. Notice how the line length is exactly 20 characters (10 + 3 + 7). The output demonstrates the parsed data with trailing spaces removed thanks to A.
Output
When you run the script, it will print:
Name: 'John Doe', Age: '025', Country: 'USA'
Name: 'Jane Smith', Age: '030', Country: 'Canada'
Name: 'Bob', Age: '045', Country: 'UK'
Summary
The unpack function is your go-to tool for parsing fixed-width column data in Perl. Define a format string representing your fields' widths and types, then call unpack once per line to extract the structured data. The method is concise and efficient, and because unpack returns one value per field in list context, the result can be assigned directly to a list of variables.
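The once-per-line pattern can be sketched as a loop over a filehandle that stores each record in a hash. The field names, the sample rows, and the in-memory filehandle (standing in for a real file) are assumptions for illustration.

```perl
use strict;
use warnings;

my @names  = qw(name age country);   # hypothetical field names
my $format = "A10 A3 A7";

# Build sample input with pack, which space-pads each A field to its width.
my $data = "";
$data .= pack($format, @$_) . "\n"
    for ["John Doe", "025", "USA"], ["Bob", "045", "UK"];

# Read from an in-memory filehandle; a real script would open a file instead.
open my $fh, '<', \$data or die "cannot open in-memory file: $!";

my @records;
while (my $line = <$fh>) {
    chomp $line;
    my %rec;
    @rec{@names} = unpack($format, $line);   # hash slice: one key per field
    $rec{age} += 0;                          # numeric context turns "025" into 25
    push @records, \%rec;
}
close $fh;

printf "%s is %d (%s)\n", @{$_}{@names} for @records;
```

Keeping the field names and the template side by side makes it easier to keep them in sync when the record layout changes.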
Verified Code
Executed in a sandbox to capture real output. • v5.34.1 • 9ms