How to remove HTML tags from text in Perl?
Removing HTML tags from text in Perl is a common task when extracting readable content from HTML markup. The simplest approach uses a regular expression to match and remove tags. While regex isn't perfect for complex or malformed HTML, it works well for straightforward cases.
Using Regex to Remove HTML Tags
HTML tags are enclosed in angle brackets like <tag> or </tag>. A regex pattern can match these tags and remove them using Perl's substitution operator s///g.
Runnable Perl Example
#!/usr/bin/perl
use strict;
use warnings;
my $html = q{<h1>Welcome to PerlCode</h1>
<p>This <strong>example</strong> shows how to <em>remove</em> HTML tags.</p>
<div class="content">Clean text extraction!</div>};
# Create a copy and remove all HTML tags
(my $text = $html) =~ s/<[^>]+>//g;
print "Original HTML:\n$html\n\n";
print "Text without tags:\n$text\n";
How It Works
- The regex <[^>]+> matches a <, followed by one or more non-> characters, then a >
- The s///g operator replaces all matches globally with nothing (empty string)
- The parentheses in (my $text = $html) =~ s///g copy $html to $text before modification, preserving the original
- The q{...} operator creates a string without needing to escape quotes
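The copy-then-substitute idiom can also be written with the /r modifier (available since Perl 5.14), which returns the modified copy and leaves the original string alone. The following is a minimal sketch under that assumption; the strip_tags helper name is hypothetical and not part of the original example.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: returns a copy of the input with tags removed.
# The /r modifier (Perl 5.14+) returns the result instead of modifying $html.
sub strip_tags {
    my ($html) = @_;
    return $html =~ s/<[^>]+>//gr;
}

my $html = '<p>Hello, <em>world</em>!</p>';
my $text = strip_tags($html);

print "$text\n";   # Hello, world!
print "$html\n";   # the original string is unchanged
```

Wrapping the substitution in a sub keeps the pattern in one place, which helps if it later needs to be swapped for a parser-based version.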
Common Pitfalls
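- This regex cannot reliably handle malformed HTML, and literal text that contains a < followed later by a > is removed as if it were a tag
- Attributes containing > inside quotes may cause incorrect matching
- Content inside <script> or <style> elements is not removed; only the tags themselves are stripped, so the script or CSS text remains in the output
- HTML entities like &amp; or &nbsp; are not decoded to their character equivalents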
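A short sketch of how some of these cases behave with the same pattern may make the limits easier to see. The strip_tags helper and the sample inputs are hypothetical and chosen only for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper using the same pattern as the main example.
sub strip_tags {
    my ($html) = @_;
    (my $text = $html) =~ s/<[^>]+>//g;
    return $text;
}

# A quoted > inside an attribute ends the match early,
# leaving part of the attribute behind in the text.
print strip_tags('<a title="x > y">link</a>'), "\n";       # prints: ' y">link' (with a leading space)

# Only the <script> tags are removed; the code between them stays.
print strip_tags('<script>alert(1);</script>Done'), "\n";  # prints: alert(1);Done

# Entities pass through undecoded.
print strip_tags('<p>Fish &amp; chips</p>'), "\n";         # prints: Fish &amp; chips
```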
When to Use a Parser Instead
For production code or complex HTML, use proper parsing modules from CPAN such as HTML::Strip, HTML::TreeBuilder, or HTML::Parser. None of these ship with core Perl, so they must be installed separately. They handle edge cases, nested structures, and malformed markup far more reliably than a regex. However, for quick scripts with simple, well-formed HTML, this regex approach provides a lightweight, dependency-free solution.
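As a rough illustration of the parser route, the sketch below uses HTML::Parser's version 3 API to collect text. It assumes HTML::Parser is installed from CPAN; the html_to_text helper name and the sample input are hypothetical.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::Parser;   # from CPAN; assumed to be installed

# Hypothetical helper: collects text content, decodes entities,
# and skips everything inside <script> and <style> elements.
sub html_to_text {
    my ($html) = @_;
    my $text = '';
    my $parser = HTML::Parser->new(
        api_version => 3,
        # 'dtext' hands the callback text with entities already decoded
        text_h => [ sub { $text .= shift }, 'dtext' ],
    );
    $parser->ignore_elements(qw(script style));
    $parser->parse($html);
    $parser->eof;
    return $text;
}

print html_to_text(
    '<p title="x > y">Fish &amp; chips</p><script>alert(1);</script>'
), "\n";   # prints: Fish & chips
```

Unlike the regex, this version copes with the quoted > in the attribute, drops the script body, and decodes &amp; in one pass.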
Verified Code
Executed in a sandbox to capture real output. • v5.34.1 • 14ms
Original HTML:
<h1>Welcome to PerlCode</h1>
<p>This <strong>example</strong> shows how to <em>remove</em> HTML tags.</p>
<div class="content">Clean text extraction!</div>
Text without tags:
Welcome to PerlCode
This example shows how to remove HTML tags.
Clean text extraction!