text-processing beginner
How to process and deduplicate large text files
Question
How can I remove duplicate lines from a large text file efficiently in Perl?
Answer
Use a hash to track lines you have already seen. This makes a single pass over the file and only keeps one copy of each unique line in memory:
#!/usr/bin/perl
use strict;
use warnings;

my $input_file  = 'large-file.txt';
my $output_file = 'deduplicated.txt';

open my $in,  '<', $input_file  or die "Cannot open input: $!";
open my $out, '>', $output_file or die "Cannot open output: $!";

my %seen;
my $duplicates = 0;
my $unique     = 0;

while (my $line = <$in>) {
    if (!$seen{$line}++) {    # first time this exact line has appeared
        print $out $line;
        $unique++;
    }
    else {
        $duplicates++;
    }
}

close $in;
close $out;

print "Deduplication complete:\n";
print "Unique lines: $unique\n";
print "Duplicates removed: $duplicates\n";
This script reads the file line by line, records each line it has seen in a hash, and writes only the first occurrence of each line to the output file. Memory use grows with the number of unique lines rather than the total file size, so it handles large files with many repeats well. If the file has mostly unique lines and memory becomes a problem, you can key the hash on a fixed-size digest of each line instead of the line itself, as in the sketch below.
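A minimal sketch of that variant, assuming the core Digest::MD5 module and the same hypothetical file names as above; each unique line costs a 16-byte digest key instead of the full line (MD5 collisions are theoretically possible but vanishingly unlikely for ordinary text):

#!/usr/bin/perl
use strict;
use warnings;
use Digest::MD5 qw(md5);

open my $in,  '<', 'large-file.txt'   or die "Cannot open input: $!";
open my $out, '>', 'deduplicated.txt' or die "Cannot open output: $!";

my %seen;
while (my $line = <$in>) {
    # Key on a 16-byte binary digest rather than the full line
    # to cap per-entry memory while still detecting exact duplicates.
    print $out $line unless $seen{ md5($line) }++;
}

close $in;
close $out;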