How to match Unicode characters in Perl regex?

Matching Unicode Characters in Perl Regex

Perl has excellent support for Unicode, enabling you to match characters beyond the ASCII range with ease. Understanding how Perl handles Unicode in regex involves a few key concepts:

  • The use utf8; pragma, which tells Perl that the source file is UTF-8 so that literals can contain non-ASCII characters directly.
  • Unicode character classes and properties like \p{...} for named Unicode categories.
  • The state of your data: whether strings have been decoded into Perl character strings or are still raw byte strings.

1. Preparing Perl Source and Strings for Unicode

If your Perl source file contains UTF-8 literals, it needs the use utf8; pragma:

use utf8;

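This tells Perl that the script itself is encoded in UTF-8. It affects only literals in the source code, not data read from files, sockets, or the command line. For matching, what matters is that the string you operate on has been decoded into a Perl character string. If it is still a raw UTF-8 byte string, Perl treats every byte as a separate character, so a multi-byte character such as "é" looks like two unrelated characters and matches may fail or behave unexpectedly.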
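The difference between raw bytes and a decoded string can be sketched as follows. This is a minimal illustration, not a complete input-handling recipe; the variable names are made up for the example, and the bytes are written as escapes so the result does not depend on how the file is saved.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Raw UTF-8 bytes for "café" (the "é" is the two bytes C3 A9).
my $bytes = "caf\xC3\xA9";

# Decode the bytes into a Perl character string.
my $chars = decode('UTF-8', $bytes);

my $byte_len = length $bytes;   # counts bytes
my $char_len = length $chars;   # counts characters

# On raw bytes the pattern sees U+00C3 followed by U+00A9 (a symbol,
# not a letter), so "letters only" fails. On the decoded string it succeeds.
my $bytes_ok = $bytes =~ /^\p{L}+$/ ? 1 : 0;
my $chars_ok = $chars =~ /^\p{L}+$/ ? 1 : 0;

print "bytes: length=$byte_len letters_only=$bytes_ok\n";  # length=5 letters_only=0
print "chars: length=$char_len letters_only=$chars_ok\n";  # length=4 letters_only=1
```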

To handle UTF-8 input/output properly, you often use:

use open ':std', ':encoding(UTF-8)';

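This applies a UTF-8 layer to STDIN, STDOUT, and STDERR, and makes it the default for filehandles opened in the same lexical scope, so input is decoded and output is encoded automatically.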
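When a global default is too broad, the same layer can be attached to a single handle in the three-argument open (or added later with binmode). The sketch below reads from an in-memory handle only to stay self-contained; with a real file you would pass a filename where \$raw appears. The names $raw, $in, and $line are illustrative.

```perl
use strict;
use warnings;

# UTF-8 bytes for "naïve" as they might arrive from a file.
my $raw = "na\xC3\xAFve\n";

# Attach a decoding layer to this one handle.
open(my $in, '<:encoding(UTF-8)', \$raw) or die "open failed: $!";
my $line = <$in>;
close $in;
chomp $line;

# The line is now a character string: five characters, all letters.
my $len = length $line;
my $ok  = $line =~ /^\p{L}+$/ ? 1 : 0;

print "length=$len letters_only=$ok\n";  # length=5 letters_only=1
```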

2. Using Unicode Properties in Regex

The most powerful way to match Unicode characters is via Unicode properties. For example:

  • \p{Letter} (or shorthand \p{L}) matches any Unicode letter.
  • \p{Number} (\p{N}) matches any numeric character, including digits outside 0-9 as well as characters such as Roman numerals and fractions; use \p{Nd} to match decimal digits only.
  • \p{Script=Greek} matches characters in the Greek script.

You can negate with \P{...} to match anything NOT in that category.
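To see how these properties sort a few characters, here is a small sketch. The classify helper and the sample list are hypothetical, written only for this illustration; code points are given as escapes so the source encoding does not matter.

```perl
use strict;
use warnings;

# Return the names of the tested properties that a character matches.
sub classify {
    my ($ch) = @_;
    my @hits;
    push @hits, 'L'     if $ch =~ /\p{L}/;
    push @hits, 'N'     if $ch =~ /\p{N}/;
    push @hits, 'Nd'    if $ch =~ /\p{Nd}/;
    push @hits, 'Greek' if $ch =~ /\p{Script=Greek}/;
    push @hits, 'not-L' if $ch =~ /\P{L}/;
    return join ',', @hits;
}

my @samples = (
    [ 'Latin e-acute (U+00E9)'        => "\x{E9}"   ],
    [ 'Greek alpha (U+03B1)'          => "\x{3B1}"  ],
    [ 'Arabic-Indic digit 3 (U+0663)' => "\x{663}"  ],
    [ 'Roman numeral two (U+2161)'    => "\x{2161}" ],
);

for my $sample (@samples) {
    my ($name, $ch) = @$sample;
    print "$name: ", classify($ch), "\n";
}
# Latin e-acute (U+00E9): L
# Greek alpha (U+03B1): L,Greek
# Arabic-Indic digit 3 (U+0663): N,Nd,not-L
# Roman numeral two (U+2161): N,not-L
```

The last two lines show why \p{N} is broader than "digits": the Roman numeral is a number but not a decimal digit.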

3. The /u Modifier

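The /u regex modifier (available since Perl 5.14) forces Unicode rules for the pattern: \w, \d, \s, POSIX classes, and case-insensitive matching then treat code points 128 to 255 as Unicode characters even when the string is stored as native bytes. It does not decode UTF-8 bytes for you; the string still has to be decoded first.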


if ($string =~ /\p{Ll}+/u) {
    print "Matched lowercase Unicode letters\n";
}

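In this particular pattern /u is redundant, because \p{...} already implies Unicode rules. It makes a difference for \w, \d, \s, POSIX classes, and /i when the string contains code points 128 to 255 stored as native bytes. Enabling the unicode_strings feature (for example via use v5.12; or later) selects the same rules by default.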
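A case where /u changes the result can be sketched like this. The example turns the unicode_strings feature off explicitly so that the default rules are in effect regardless of how the script is run; the variable names are illustrative.

```perl
use strict;
use warnings;
no feature 'unicode_strings';   # keep Perl's default rules for this demo

# U+00E9 (e-acute) written as an escape: a one-character native string.
my $e_acute = "\xE9";

# \w under default rules vs. under /u
my $default_w = $e_acute =~ /\w/  ? 1 : 0;
my $unicode_w = $e_acute =~ /\w/u ? 1 : 0;

# Case-insensitive match against U+00C9 (capital E-acute)
my $default_i = $e_acute =~ /\xC9/i  ? 1 : 0;
my $unicode_i = $e_acute =~ /\xC9/iu ? 1 : 0;

print "\\w:  default=$default_w  /u=$unicode_w\n";  # \w:  default=0  /u=1
print "/i:  default=$default_i  /u=$unicode_i\n";   # /i:  default=0  /u=1
```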

4. Practical Example: Matching Unicode Letters and Printing Matches

The following example shows matching Unicode letters, including non-ASCII characters like accented letters and Cyrillic characters.


use strict;
use warnings;
use utf8;
use open ':std', ':encoding(UTF-8)';

my $text = "Café привет 123";

print "Text: $text\n";

while ($text =~ /(\p{L}+)/gu) {
    print "Found Unicode letters: $1\n";
}

Explanation:

  • use utf8; allows direct Unicode characters in the string literal.
  • use open ensures STDOUT prints UTF-8 correctly.
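  • The regex /(\p{L}+)/gu matches runs of Unicode letters; /g finds every match in turn, and /u is harmless here because \p{L} already implies Unicode rules.
  • The loop prints Café and привет; 123 is skipped because digits are not letters.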

5. Common Pitfalls & Gotchas

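  • Undecoded byte strings: If your string contains raw UTF-8 bytes and has not been decoded, for example with Encode::decode("UTF-8", $bytes), each byte is treated as a separate character and regex matching may fail or behave unexpectedly.
  • Missing use utf8; in source files with literal Unicode characters: Perl then reads each byte of a literal as a separate character, so "é" becomes two characters, matches go wrong, and output may be double-encoded. Non-ASCII characters in identifiers cause syntax errors.
  • Ignoring the /u modifier: For code points 128 to 255 in strings stored as native bytes, \w, \d, \s, and case-insensitive matching follow Unicode rules only under /u or the unicode_strings feature.
  • Confusing POSIX and Unicode character classes: POSIX classes like [[:alpha:]] depend on which rules are in effect. They match Unicode characters under Unicode rules, but are locale-dependent under use locale and ASCII-only for native byte strings under the default rules. Unicode properties such as \p{L} always use Unicode rules, so prefer them.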
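The difference between a POSIX class and a Unicode property can be sketched as follows. This is a minimal illustration under Perl's default rules (the unicode_strings feature is switched off explicitly); the variable names are made up for the example.

```perl
use strict;
use warnings;
no feature 'unicode_strings';   # keep Perl's default rules for this demo

my $native = "\xE9";      # e-acute, stored as a single native byte
my $wide   = "\x{43F}";   # Cyrillic pe, a code point above 255

# [[:alpha:]] depends on how the string is stored; \p{L} does not.
my $posix_native = $native =~ /[[:alpha:]]/ ? 1 : 0;
my $prop_native  = $native =~ /\p{L}/       ? 1 : 0;
my $posix_wide   = $wide   =~ /[[:alpha:]]/ ? 1 : 0;
my $prop_wide    = $wide   =~ /\p{L}/       ? 1 : 0;

print "U+00E9: posix=$posix_native prop=$prop_native\n";  # U+00E9: posix=0 prop=1
print "U+043F: posix=$posix_wide prop=$prop_wide\n";      # U+043F: posix=1 prop=1
```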

Summary

To reliably match Unicode characters in Perl regex:

  • Ensure your strings are decoded Perl Unicode strings (not raw bytes).
  • Use \p{...} Unicode property classes for precise matching.
  • Include use utf8; if your source uses UTF-8 characters directly.
  • Optionally add the /u modifier for Unicode semantics.

These combined practices will give you robust Unicode regex matching in Perl.
