How to process and deduplicate large text files

Question

How can I remove duplicate lines from a large text file efficiently in Perl?

Use a hash to track the lines you have already seen. Memory use then grows with the number of unique lines rather than with the total size of the file:

#!/usr/bin/perl
use strict;
use warnings;

my $input_file = 'large-file.txt';
my $output_file = 'deduplicated.txt';

open my $in, '<', $input_file or die "Cannot open input: $!";
open my $out, '>', $output_file or die "Cannot open output: $!";

my %seen;           # keys are lines that have already been written out
my $duplicates = 0;
my $unique     = 0;

while (my $line = <$in>) {
    if (!$seen{$line}++) {     # first occurrence of this exact line
        print $out $line;
        $unique++;
    }
    else {
        $duplicates++;
    }
}

close $in;
close $out;

print "Deduplication complete:\n";
print "Unique lines: $unique\n";
print "Duplicates removed: $duplicates\n";

This script reads the file line by line, uses a hash to record which lines have already appeared, and writes each line to the output only the first time it occurs. The comparison is exact, so lines that differ only in trailing whitespace or capitalization are treated as distinct.
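A shell-style shortcut for the same technique is the one-liner perl -ne 'print unless $seen{$_}++' large-file.txt > deduplicated.txt, which is handy when you do not need the counts.

If the unique lines are themselves too long to hold comfortably in memory, one common variation is to store a fixed-size digest of each line in the hash instead of the line itself. The sketch below shows one way to do that, assuming the core Digest::MD5 module and the same file names as above; it accepts a vanishingly small risk of digest collisions in exchange for memory use that no longer depends on line length.

#!/usr/bin/perl
use strict;
use warnings;
use Digest::MD5 qw(md5);

my $input_file  = 'large-file.txt';
my $output_file = 'deduplicated.txt';

open my $in,  '<', $input_file  or die "Cannot open input: $!";
open my $out, '>', $output_file or die "Cannot open output: $!";

my %seen;
while (my $line = <$in>) {
    my $key = md5($line);      # 16-byte binary digest stands in for the full line
    next if $seen{$key}++;     # skip lines whose digest has been seen before
    print $out $line;
}

close $in;
close $out;

If even the digests will not fit in memory, an external sort (for example sort -u) deduplicates without holding anything in RAM, at the cost of losing the original line order.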
