How to handle Unicode characters in JSON with Perl?

Handling Unicode Characters in JSON with Perl

Working with Unicode characters in JSON is a common requirement, especially for internationalized applications. Perl, with its flexible and powerful string handling, makes it straightforward to serialize and deserialize JSON containing Unicode. However, there are some important details around encoding, decoding, and character representation to ensure your JSON data is processed correctly.

Key Concepts

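  • UTF-8 Encoding: RFC 8259 requires UTF-8 for JSON exchanged between systems, so treat JSON on disk or on the wire as UTF-8 bytes.
  • Perl's Internal UTF-8 Flag: Perl strings carry an internal flag that records how the string is stored, not whether it "is Unicode". What matters in practice is whether a string holds decoded characters or encoded bytes.
  • JSON Module: The JSON module from CPAN, or JSON::PP, which has shipped with Perl since 5.14, provides a flexible interface to encode/decode JSON.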
  • Encode/Decode: When dealing with files or network I/O, explicit encoding/decoding between byte streams and characters is necessary.
  • Unicode Escape Sequences: JSON can represent Unicode via literal UTF-8 characters or \uXXXX escapes. Both are valid.
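
The difference between characters and bytes in the list above can be sketched with the core Encode module. This is a minimal illustration, not part of the original answer; the variable names are made up for the example.

```perl
use strict;
use warnings;
use utf8;                               # source literals below are UTF-8
use Encode qw(encode decode);

my $chars = "München";                  # character string: 7 characters
my $bytes = encode('UTF-8', $chars);    # byte string: ü takes 2 bytes

printf "characters: %d\n", length $chars;
printf "bytes:      %d\n", length $bytes;

# Decoding the bytes gives back the original character string
my $round_trip = decode('UTF-8', $bytes);
print "round trip ok\n" if $round_trip eq $chars;
```

The two length values differ by one because ü is a single character but two UTF-8 bytes.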

How to Handle Unicode in JSON

  • Use the JSON module's utf8 option when the JSON text is a string of UTF-8 bytes: encode then returns bytes and decode expects bytes.
  • When printing character strings that contain Unicode, give the output handle (STDOUT) a UTF-8 layer to avoid mojibake and "Wide character" warnings.
  • Perl encodes and decodes automatically at the I/O layer, provided each handle has the right layer and you do not also encode the same data by hand.
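
As a sketch of the file I/O case, the following writes JSON to a temporary file and reads it back. It assumes JSON::PP (the core module, used here so the example has no CPAN dependency) and raw handles, so that the utf8 option is the only place where encoding happens. The temporary file and variable names are illustrative.

```perl
use strict;
use warnings;
use utf8;
use JSON::PP;
use File::Temp qw(tempfile);

my $json = JSON::PP->new->utf8->canonical;
my $data = { city => "München", greeting => "こんにちは" };

# Write: encode() returns UTF-8 bytes, so the handle must not add a layer
my ($out, $path) = tempfile(UNLINK => 1);
binmode $out, ':raw';
print {$out} $json->encode($data);
close $out or die "close failed: $!";

# Read: slurp raw bytes and let decode() interpret them as UTF-8
open my $in, '<:raw', $path or die "cannot open $path: $!";
my $bytes = do { local $/; <$in> };
close $in;

my $copy = $json->decode($bytes);
print "round trip ok\n" if $copy->{city} eq $data->{city};
```

The alternative is to drop utf8 and open the file with an :encoding(UTF-8) layer; either works as long as only one of the two does the encoding.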

Example: Serialize and Deserialize JSON with Unicode

This example demonstrates creating a Perl data structure with Unicode characters, encoding it to JSON, printing it to STDOUT, then decoding it back to Perl data, and printing values out:

use strict;
use warnings;
use utf8;             # Allows Unicode in the script source
use open ':std', ':encoding(UTF-8)';  # Makes STDIN/STDOUT/STDERR UTF-8 encoded
use JSON;

# Example data with Unicode characters
my $data = {
    greeting => "こんにちは",   # "Hello" in Japanese
    emoji    => "😊",          # Emoji character
    name     => "München",     # Contains umlaut
};

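# Create JSON object that reads and writes UTF-8 encoded bytes
my $json = JSON->new->utf8->pretty;

# Encode Perl data structure to JSON text (a string of UTF-8 bytes)
my $json_text = $json->encode($data);

# STDOUT already has a UTF-8 layer, so decode a copy for display;
# printing the raw bytes through that layer would encode them twice.
my $json_display = $json_text;
utf8::decode($json_display);

print "Encoded JSON:\n";
print $json_display, "\n";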

# Decode JSON text back to Perl data structure
my $decoded = $json->decode($json_text);

print "Decoded data:\n";
print "Greeting: $decoded->{greeting}\n";
print "Emoji: $decoded->{emoji}\n";
print "Name: $decoded->{name}\n";

Explanation

  • use utf8; tells Perl that your script source contains UTF-8 encoded characters.
  • use open ':std', ':encoding(UTF-8)'; automates UTF-8 encoding/decoding on standard filehandles, so print outputs correctly.
  • JSON->new->utf8 makes encode return UTF-8 bytes and decode expect UTF-8 bytes (not Perl character strings). Such byte strings belong on a raw handle; to print them through an :encoding(UTF-8) layer, decode them first, otherwise they are encoded a second time.
  • The pretty method is optional but good for readable JSON output.
  • After encoding, the JSON text contains literal UTF-8 characters, not \u escapes. This is generally preferred for readability and compactness; the ascii option switches to \uXXXX escapes.
  • Decoding returns Perl character strings, so length, regexes, and other string operations work on characters rather than bytes.
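
To compare literal UTF-8 output with \uXXXX escapes, here is a small sketch. It assumes JSON::PP, whose utf8 and ascii options behave like those of the JSON module; the variable names are illustrative.

```perl
use strict;
use warnings;
use utf8;
use JSON::PP;

my $data = { name => "München" };

# utf8: literal characters, returned as UTF-8 bytes
my $literal = JSON::PP->new->utf8->encode($data);

# ascii: every non-ASCII character becomes a \uXXXX escape,
# so the result is plain ASCII and safe to print on any handle
my $escaped = JSON::PP->new->ascii->encode($data);
print "$escaped\n";

# Both forms decode to the same character string
my $from_literal = JSON::PP->new->utf8->decode($literal);
my $from_escaped = JSON::PP->new->utf8->decode($escaped);
print "same\n" if $from_literal->{name} eq $from_escaped->{name};
```

The escaped form is larger but survives transports that are not 8-bit clean, which is the main reason to choose it.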

Common Gotchas

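  • Without the utf8 option, encode returns a Perl character string and decode expects one; passing UTF-8 bytes to decode in this mode yields mojibake. Escaping as \uXXXX is controlled by the separate ascii option, not by utf8.
  • Printing character strings to a handle without an encoding layer triggers "Wide character in print" warnings, and printing UTF-8 bytes through an :encoding(UTF-8) layer encodes them twice. Match the layer to what you print.
  • On Perl versions before 5.14, JSON::PP is not in core, so install JSON or JSON::PP from CPAN.
  • Do not mix character strings with byte strings; convert one side first (Encode's decode/encode, or utf8::decode/utf8::encode) and do it exactly once.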
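
The double-encoding gotcha can be made visible by counting bytes. This is a minimal sketch, again assuming JSON::PP; the explicit second encode call stands in for what an :encoding(UTF-8) layer would do to a byte string on output.

```perl
use strict;
use warnings;
use utf8;
use Encode qw(encode);
use JSON::PP;

my $name = "München";

my $bytes = JSON::PP->new->utf8->encode([$name]);   # UTF-8 bytes
my $chars = JSON::PP->new->encode([$name]);         # character string

# Encoding the character string once yields the same bytes as ->utf8
my $once  = encode('UTF-8', $chars);
# Encoding the byte string again simulates an encoding layer on output
my $twice = encode('UTF-8', $bytes);

printf "utf8 output:   %d bytes\n", length $bytes;
printf "encoded once:  %d bytes\n", length $once;
printf "encoded twice: %d bytes\n", length $twice;
```

The first two counts match, while the third is larger because each byte of ü has been re-encoded as its own two-byte sequence.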

Summary

Handling Unicode in JSON with Perl is mostly seamless if you:

  • Include use utf8; for your source.
  • Configure your output for UTF-8.
  • Use JSON->new->utf8 to encode/decode UTF-8 JSON.
  • Keep track of whether each string holds characters or UTF-8 bytes; the JSON object's utf8 and ascii options determine which form encode produces.

Perl's TMTOWTDI philosophy applies here as well: the same JSON object can emit literal UTF-8 characters (utf8) or \uXXXX escape sequences (ascii), depending on your use case.

Verified Code

Executed in a sandbox to capture real output. • v5.34.1 • 34ms

STDOUT
Encoded JSON:
{
   "emoji" : "😊",
   "name" : "München",
   "greeting" : "こんにちは"
}

Decoded data:
Greeting: こんにちは
Emoji: 😊
Name: München
STDERR
(empty)
