
How to parse XML and extract node values in Perl?


Parsing XML and extracting node values in Perl is usually done with XML::LibXML, a widely recommended CPAN module (not part of the Perl core) that provides DOM and XPath support. XML::Parser also exists, but it is a lower-level, event-oriented interface without built-in XPath support, so XML::LibXML is generally the easier choice for XPath queries.

Here’s a step-by-step explanation and example:

1. Choosing the Right Module

  • XML::LibXML is one of the most popular modules for XML parsing. It provides a DOM-like interface and robust XPath querying.
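  • It is not a core module, but it is available from CPAN and packaged by most operating-system distributions; some Perl distributions, such as Strawberry Perl, include it.
  • Perl's core distribution does not include an XML parser, so parsing XML robustly with core modules alone is challenging. XML::Simple (also a CPAN module) is less powerful, and its own documentation discourages its use in new code.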

2. Basic Concepts in Perl XML Parsing

  • DOM (Document Object Model): the XML is parsed into a tree structure that you can navigate programmatically.
  • XPath: a powerful syntax to query parts of the XML tree directly.
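  • Result types and context: findnodes returns a list of node objects in list context and an XML::LibXML::NodeList object in scalar context, while findvalue returns a plain string. Node objects must be converted to text explicitly, for example with textContent or to_literal.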
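
The difference between these result types can be seen in a short sketch. It assumes XML::LibXML is installed; the sample document and the variable names are invented for illustration.

```perl
use strict;
use warnings;
use XML::LibXML;

# Sample document invented for this illustration.
my $doc = XML::LibXML->load_xml(string => <<'XML');
<library>
  <book><title>Perl Best Practices</title></book>
  <book><title>Learning Perl</title></book>
</library>
XML

# List context: a plain Perl list of node objects.
my @titles = $doc->findnodes('/library/book/title');

# Scalar context: a single XML::LibXML::NodeList object.
my $list = $doc->findnodes('/library/book/title');

# Node objects need an explicit call to get their text.
my $first = $titles[0]->textContent;

# findvalue returns a plain string; with several matches, the text
# of all matching nodes is concatenated.
my $joined = $doc->findvalue('/library/book/title');

printf "nodes in list:   %d\n", scalar @titles;
printf "NodeList size:   %d\n", $list->size;
print  "first title:     $first\n";
print  "findvalue (all): $joined\n";
```

Because findvalue joins the text of every match, a path that can match more than one node is usually better handled with findnodes and a loop.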

3. Example: Parsing XML and Extracting Node Values

This sample demonstrates:

  • Loading an XML string
  • Extracting a list of nodes via XPath
  • Printing specific node values

use strict;
use warnings;
use XML::LibXML;

my $xml_string = q{
  <library>
    <book id="101">
      <title>Perl Best Practices</title>
      <author>Damian Conway</author>
    </book>
    <book id="102">
      <title>Learning Perl</title>
      <author>Randal L. Schwartz</author>
    </book>
  </library>
};

# Create new parser object
my $parser = XML::LibXML->new();

# Parse the XML string into a document object
my $doc = $parser->parse_string($xml_string);

# Use XPath to find all <book> elements
my @books = $doc->findnodes('/library/book');

print "Extracted Books:\n";
for my $book (@books) {
    # Extract attribute and child nodes
    my $id     = $book->getAttribute('id');
    my $title  = $book->findvalue('title');
    my $author = $book->findvalue('author');
    
    print "Book ID: $id\n";
    print "Title: $title\n";
    print "Author: $author\n";
    print "-----\n";
}

4. Explanation

  • XML::LibXML->new() creates a parser instance.
  • parse_string loads the XML from a string. You can also parse files with parse_file.
  • findnodes executes an XPath query and returns node objects matching the path.
  • getAttribute fetches XML element attributes.
  • findvalue runs an XPath expression relative to the current node and returns the result as a string; for a node set, this is the concatenated text content of all matching nodes.
  • Iterating over nodes and extracting values lets you access specific data like IDs, titles, authors.
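
For file input the same steps apply. The sketch below writes a small sample document to a temporary file (using the core File::Temp module) so that it is self-contained, then parses it with load_xml, the constructor-style interface available in newer XML::LibXML releases alongside parse_file. The file contents and variable names are illustrative assumptions.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);
use XML::LibXML;

# Write a sample document to a temporary file (illustrative data).
my ($fh, $path) = tempfile(UNLINK => 1);
print {$fh} '<library><book id="101"><title>Perl Best Practices</title></book></library>';
close $fh or die "Cannot close $path: $!";

# Parsing dies on malformed XML, so wrap it in eval to report the error.
my $doc = eval { XML::LibXML->load_xml(location => $path) }
    or die "Failed to parse $path: $@";

# Take the first <book> element and read its attribute and title.
my ($book) = $doc->findnodes('/library/book');
my $id     = $book->getAttribute('id');
my $title  = $book->findvalue('title');

print "Book $id: $title\n";
```

Wrapping the parse in eval is a design choice here: it lets the script report which file failed rather than stopping with the bare parser message.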

5. Common Gotchas

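  • Watch for XML namespaces: elements in a namespace do not match unprefixed XPath names, so register a prefix (for example with XML::LibXML::XPathContext) before querying.
  • findnodes returns node objects, while findvalue returns a string. If several nodes match, findvalue concatenates the text of all of them, so use a positional predicate such as title[1], or findnodes, when only one value is wanted.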
  • Try to avoid regex parsing of XML; always use a proper parser to handle complex XML safely.
  • For very large XML documents, a streaming parser such as XML::LibXML::Reader scales better, at the cost of more complex code.
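
The namespace point is the one that most often causes empty results, so here is a minimal sketch of it. It assumes XML::LibXML is installed; the namespace URI, the prefix "lib", and the sample document are made up for illustration.

```perl
use strict;
use warnings;
use XML::LibXML;
use XML::LibXML::XPathContext;

# Sample document with a default namespace (URI invented for illustration).
my $doc = XML::LibXML->load_xml(string => <<'XML');
<library xmlns="http://example.com/library">
  <book id="101"><title>Perl Best Practices</title></book>
</library>
XML

# Without a registered prefix the unprefixed path matches nothing,
# because XPath 1.0 treats unprefixed names as being in no namespace.
my @plain = $doc->findnodes('/library/book');

# Bind a prefix of our choosing ("lib") to the namespace URI.
my $xpc = XML::LibXML::XPathContext->new($doc);
$xpc->registerNs(lib => 'http://example.com/library');

# Query through the context, using the prefix on every element name.
my @books = $xpc->findnodes('/lib:library/lib:book');
my $title = $xpc->findvalue('lib:title', $books[0]);

printf "unprefixed matches: %d\n", scalar @plain;
printf "prefixed matches:   %d\n", scalar @books;
print  "title: $title\n";
```

The prefix used in the XPath expression does not have to match anything in the document; only the namespace URI has to agree.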

6. Alternatives and Core Modules

If you cannot use XML::LibXML, XML::Simple may already be installed on some systems (it is a CPAN module, not part of Perl's core). It converts XML into nested Perl data structures and offers no XPath queries, and its own documentation discourages its use in new code.

Overall, XML::LibXML is the go-to choice in Perl for robust XML parsing and node extraction, especially when XPath queries are needed.

Verified Code

Executed in a sandbox to capture real output. • v5.34.1 • 68ms

STDOUT
Extracted Books:
Book ID: 101
Title: Perl Best Practices
Author: Damian Conway
-----
Book ID: 102
Title: Learning Perl
Author: Randal L. Schwartz
-----
STDERR
(empty)
