How to match Unicode characters in Perl regex?
Matching Unicode Characters in Perl Regex
Perl has excellent support for Unicode, enabling you to match characters beyond the ASCII range with ease. Understanding how Perl handles Unicode in regex involves a few key concepts:
- UTF-8 encoding and the use utf8; pragma: allows Perl source code to include UTF-8 characters directly.
- Unicode character classes and properties like \p{...} for named Unicode categories.
- Context of data: whether strings are properly decoded (internal Perl Unicode strings vs. raw byte strings).
1. Preparing Perl Source and Strings for Unicode
In Perl, your source file needs the use utf8; pragma if it contains literal non-ASCII characters encoded as UTF-8:
use utf8;
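This tells Perl the script itself is encoded in UTF-8. However, matching Unicode is mostly about the string you operate on being a correctly decoded character string. If the string is a raw UTF-8 byte stream, each non-ASCII character appears to the regex as two or more separate bytes, so properties such as \p{L} are tested against individual bytes rather than the characters you intended.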
To handle UTF-8 input/output properly, you often use:
use open ':std', ':encoding(UTF-8)';
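This applies a UTF-8 encoding layer to STDIN, STDOUT, and STDERR, and makes it the default for filehandles opened in the same lexical scope, so input is decoded and output is encoded transparently.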
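The difference between raw bytes and a decoded string can be sketched as follows. This is a minimal illustration, assuming the input arrives as UTF-8 bytes; the variable names are hypothetical, and the bytes are written as escapes so the snippet does not depend on the file's own encoding.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Raw UTF-8 bytes for "Cafe" with an acute accent on the e
# (the accented letter is the two bytes C3 A9).
my $bytes = "Caf\xC3\xA9";

# Decode the bytes into a Perl character string.
my $chars = decode('UTF-8', $bytes);

my $byte_len = length $bytes;   # counts bytes
my $char_len = length $chars;   # counts characters

# Does the whole string consist of letters?
my $bytes_ok = $bytes =~ /^\p{L}+\z/ ? 1 : 0;
my $chars_ok = $chars =~ /^\p{L}+\z/ ? 1 : 0;

printf "bytes: length %d, all letters: %d\n", $byte_len, $bytes_ok;
printf "chars: length %d, all letters: %d\n", $char_len, $chars_ok;
```

On the undecoded string, the byte A9 is read as U+00A9 (the copyright sign), which is not a letter, so the all-letters test fails; on the decoded string it succeeds.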
2. Using Unicode Properties in Regex
The most powerful way to match Unicode characters is via Unicode properties. For example:
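- \p{Letter} (or shorthand \p{L}) matches any Unicode letter.
- \p{Number} (\p{N}) matches any numeric character, including digits outside 0-9, letter-like numerals such as Roman numerals, and fractions; use \p{Nd} for decimal digits only.
- \p{Script=Greek} matches characters in the Greek script.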
You can negate with \P{...} to match anything NOT in that category.
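As a rough sketch of these properties in use, the snippet below tests a few sample characters. The characters are given as code point escapes (an illustrative choice, so the file stays ASCII-only), and the variable names are hypothetical.

```perl
use strict;
use warnings;

my $alpha   = "\x{3B1}";    # GREEK SMALL LETTER ALPHA
my $de      = "\x{434}";    # CYRILLIC SMALL LETTER DE
my $ar_five = "\x{665}";    # ARABIC-INDIC DIGIT FIVE
my $roman   = "\x{2167}";   # ROMAN NUMERAL EIGHT

my %result = (
    alpha_is_letter => ($alpha   =~ /^\p{L}\z/            ? 1 : 0),
    alpha_is_greek  => ($alpha   =~ /^\p{Script=Greek}\z/ ? 1 : 0),
    de_is_greek     => ($de      =~ /^\p{Script=Greek}\z/ ? 1 : 0),
    de_not_greek    => ($de      =~ /^\P{Script=Greek}\z/ ? 1 : 0),
    five_is_digit   => ($ar_five =~ /^\p{Nd}\z/           ? 1 : 0),
    roman_is_number => ($roman   =~ /^\p{N}\z/            ? 1 : 0),
    roman_is_digit  => ($roman   =~ /^\p{Nd}\z/           ? 1 : 0),
);

printf "%-16s %d\n", $_, $result{$_} for sort keys %result;
```

The Roman numeral is a number (\p{N}) but not a decimal digit (\p{Nd}), which is why the narrower property is usually the better choice when validating digits.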
3. The /u Modifier
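The /u regex modifier (available since Perl 5.14) forces Unicode rules for the pattern. It affects \w, \d, \s, the POSIX classes, and case-insensitive matching, which matters mainly for characters in the 128-255 range of strings that are not stored internally as UTF-8. It does not decode data: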
if ($string =~ /\p{Ll}+/u) {
    print "Matched lowercase Unicode letters\n";
}
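Often this is not necessary: \p{...} always uses Unicode rules, so the /u above is redundant, and Perl also selects Unicode rules automatically when the string or pattern is stored as UTF-8 internally or when use feature 'unicode_strings' (enabled by use v5.12 or later) is in effect. Adding /u makes the intent explicit.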
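A small sketch of where /u actually changes the result. It assumes a script without use utf8 and without the unicode_strings feature, so that "\xE9" (e with acute accent) is held as a single native byte; the variable names are illustrative.

```perl
use strict;
use warnings;

# U+00E9 held as one native byte, not stored as UTF-8 internally.
my $e_acute = "\xE9";

# \w: default rules vs. Unicode rules
my $w_default = $e_acute =~ /^\w\z/  ? 1 : 0;
my $w_unicode = $e_acute =~ /^\w\z/u ? 1 : 0;

# Case-insensitive match against the uppercase form (U+00C9)
my $i_default = $e_acute =~ /^\xC9\z/i  ? 1 : 0;
my $i_unicode = $e_acute =~ /^\xC9\z/iu ? 1 : 0;

# \p{...} uses Unicode rules even without /u
my $prop_no_u = $e_acute =~ /^\p{Ll}\z/ ? 1 : 0;

printf "\\w without /u: %d, with /u: %d\n", $w_default, $w_unicode;
printf "/i without /u: %d, with /u: %d\n", $i_default, $i_unicode;
printf "\\p{Ll} without /u: %d\n", $prop_no_u;
```

Without /u the character is matched under native rules and is neither a word character nor case-folded; with /u both tests succeed, while the property test succeeds either way.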
4. Practical Example: Matching Unicode Letters and Printing Matches
The following example shows matching Unicode letters, including non-ASCII characters like accented letters and Cyrillic characters.
use strict;
use warnings;
use utf8;
use open ':std', ':encoding(UTF-8)';
my $text = "Café привет 123";
print "Text: $text\n";
while ($text =~ /(\p{L}+)/gu) {
    print "Found Unicode letters: $1\n";
}
Explanation:
- use utf8; allows direct Unicode characters in the string literal.
- use open ensures STDOUT prints UTF-8 correctly.
- The regex /(\p{L}+)/gu matches runs of Unicode letters globally.
- Matches printed show each word containing Unicode letters.
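A possible variation, sketched here, collects all matches at once by using the global match in list context instead of a while loop. The sample text is the same as above but written with code point escapes (an assumption made so the snippet does not rely on the file encoding), and binmode stands in for the use open line.

```perl
use strict;
use warnings;

# Encode output as UTF-8 so the non-ASCII words print cleanly.
binmode(STDOUT, ':encoding(UTF-8)');

# "Cafe" with an accented e, the Cyrillic word for "hello", and "123"
my $text = "Caf\x{E9} \x{43F}\x{440}\x{438}\x{432}\x{435}\x{442} 123";

# In list context, a /g match without capture groups
# returns every matched substring.
my @words = $text =~ /\p{L}+/g;

printf "%d words\n", scalar @words;
print "$_\n" for @words;
```

The digits are skipped because \p{L} covers letters only.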
5. Common Pitfalls & Gotchas
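- Undecoded byte strings: If your string contains raw UTF-8 bytes and has not been decoded (for example with Encode::decode("UTF-8", $bytes)), regex matching may fail or behave unexpectedly.
- Missing use utf8; in source files with literal Unicode characters: Perl then reads each byte of the literal as a separate character, so the literal silently holds the wrong characters (typically seen as mojibake or failed matches) rather than raising an error.
- Ignoring the /u modifier: Some behaviors, such as case folding and \w for characters in the 128-255 range, require Unicode rules, which /u makes explicit.
- Confusing POSIX and Unicode character classes: POSIX classes like [[:alpha:]] follow Unicode rules only when Unicode semantics are in effect; under /a they are ASCII-only, and under use locale they follow the locale. Prefer Unicode properties, which state the intent explicitly.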
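The last point can be illustrated with a short sketch comparing a POSIX class and a Unicode property under the /a (ASCII-restrict) modifier. The sample character and variable names are illustrative.

```perl
use strict;
use warnings;

my $alpha = "\x{3B1}";   # GREEK SMALL LETTER ALPHA

# The code point is above 255, so Unicode rules apply by default.
my $posix_default = $alpha =~ /^[[:alpha:]]\z/  ? 1 : 0;

# /a restricts POSIX classes (and \d, \s, \w) to ASCII.
my $posix_ascii   = $alpha =~ /^[[:alpha:]]\z/a ? 1 : 0;

# A Unicode property is not affected by /a.
my $prop_ascii    = $alpha =~ /^\p{L}\z/a       ? 1 : 0;

printf "[[:alpha:]] default: %d\n", $posix_default;
printf "[[:alpha:]] with /a: %d\n", $posix_ascii;
printf "\\p{L} with /a:       %d\n", $prop_ascii;
```

The POSIX class changes meaning with the modifiers in effect, while the property gives the same answer in both cases.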
Summary
To reliably match Unicode characters in Perl regex:
- Ensure your strings are decoded Perl Unicode strings (not raw bytes).
- Use \p{...} Unicode property classes for precise matching.
- Include use utf8; if your source uses UTF-8 characters directly.
- Optionally add the /u modifier for Unicode semantics.
These combined practices will give you robust Unicode regex matching in Perl.