How to remove HTML tags from text in Perl?

Removing HTML tags from text in Perl is a common task when extracting readable content from HTML markup. The simplest approach uses a regular expression to match and remove tags. While regex isn't perfect for complex or malformed HTML, it works well for straightforward cases.

Using Regex to Remove HTML Tags

HTML tags are enclosed in angle brackets like <tag> or </tag>. A regex pattern can match these tags and remove them using Perl's substitution operator s///g.

Runnable Perl Example

#!/usr/bin/perl
use strict;
use warnings;

my $html = q{<h1>Welcome to PerlCode</h1>
<p>This <strong>example</strong> shows how to <em>remove</em> HTML tags.</p>
<div class="content">Clean text extraction!</div>};

# Create a copy and remove all HTML tags
(my $text = $html) =~ s/<[^>]+>//g;

print "Original HTML:\n$html\n\n";
print "Text without tags:\n$text\n";

How It Works

  • The regex <[^>]+> matches a <, followed by one or more non-> characters, then a >
  • The s///g operator replaces all matches globally with nothing (empty string)
  • The parenthesized assignment in (my $text = $html) =~ s/<[^>]+>//g copies $html into $text first, so the substitution modifies the copy and the original is preserved
  • The q{...} operator creates a string without needing to escape quotes

Common Pitfalls

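  • The regex knows nothing about HTML syntax. Nested elements are fine because each tag is matched on its own, but malformed markup, comments containing >, or a stray < in ordinary text (as in "a < b and c > d") can delete real text or leave tag fragments behind
  • An attribute value containing > inside quotes ends the match early, so the rest of the tag is left in the output
  • Only the <script> and <style> tags themselves are removed; the JavaScript or CSS between them stays in the output as if it were text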
  • HTML entities like &nbsp; are not decoded to their character equivalents
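
Running the pattern on awkward input shows what actually happens in these cases. The sketch below compares the plain pattern with a somewhat hardened variant. Both sub names are made up for illustration, the entity table covers only five common entities, and the hardened version is still a set of regexes rather than a parser.

```perl
use strict;
use warnings;

# Same pattern as the main example.
sub strip_tags_naive {
    my ($html) = @_;
    (my $text = $html) =~ s/<[^>]+>//g;
    return $text;
}

# Hypothetical hardened variant; an assumption-laden sketch, not a parser.
sub strip_tags_careful {
    my ($html) = @_;
    my $text = $html;
    # Remove comments, then <script>/<style> elements including their content
    $text =~ s/<!--.*?-->//gs;
    $text =~ s/<(script|style)\b[^>]*>.*?<\/\1\s*>//gis;
    # Remove tags, allowing '>' inside quoted attribute values
    $text =~ s/<(?:"[^"]*"|'[^']*'|[^'">])+>//g;
    # Decode a few common entities in a single pass
    my %entity = (nbsp => ' ', lt => '<', gt => '>', quot => '"', amp => '&');
    $text =~ s/&(nbsp|lt|gt|quot|amp);/$entity{$1}/g;
    return $text;
}

my $sample = '<a title="x > y">link</a><script>alert(1)</script> Fish &amp; chips';

print strip_tags_naive($sample),   "\n";   #  y">linkalert(1) Fish &amp; chips
print strip_tags_careful($sample), "\n";   # link Fish & chips
```

The naive output keeps a fragment of the attribute, the script body, and the undecoded entity. The hardened variant handles this sample, but each added regex is another special case, which is the usual sign that a real parser is the better tool.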

When to Use a Parser Instead

For production code or complex HTML, use a proper parsing module from CPAN such as HTML::Strip, HTML::TreeBuilder, or HTML::Parser. None of these ship with core Perl, so they must be installed separately. They cope with quoted attributes, comments, script and style content, and malformed markup far more reliably than a single regex. For quick scripts with simple, well-formed HTML, the regex approach remains a lightweight, dependency-free solution.
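
As a rough illustration of the parser route, the sketch below collects text with HTML::Parser's event API. It assumes the HTML-Parser distribution is installed from CPAN. The sub name html_to_text is invented here, and the module is loaded with require inside the sub only so that the script still compiles on systems where it is missing.

```perl
use strict;
use warnings;

# Hypothetical helper: extract readable text using HTML::Parser (CPAN).
sub html_to_text {
    my ($html) = @_;
    require HTML::Parser;    # not a core module; install from CPAN

    my $text   = '';
    my $parser = HTML::Parser->new(
        api_version => 3,
        # 'dtext' hands the callback text with entities already decoded
        text_h      => [ sub { $text .= shift }, 'dtext' ],
    );
    # Skip these elements together with everything inside them
    $parser->ignore_elements(qw(script style));
    $parser->parse($html);
    $parser->eof;            # signal end of input
    return $text;
}

my $sample = '<p title="a > b">Fish &amp; chips</p><script>var x = 1;</script>';
print html_to_text($sample), "\n";
```

Here the quoted > in the attribute, the script body, and the &amp; entity are all handled by the parser, so no special-case patterns are needed.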

Verified Code

Executed in a sandbox to capture real output. • v5.34.1 • 14ms

STDOUT
Original HTML:
<h1>Welcome to PerlCode</h1>
<p>This <strong>example</strong> shows how to <em>remove</em> HTML tags.</p>
<div class="content">Clean text extraction!</div>

Text without tags:
Welcome to PerlCode
This example shows how to remove HTML tags.
Clean text extraction!
STDERR
(empty)
