How to parse XML and extract node values in Perl?
Parsing XML and extracting node values in Perl is most commonly done with XML::LibXML, a widely recommended CPAN module (not part of core Perl) that provides DOM and XPath support. XML::Parser also exists, but XML::LibXML is more modern and flexible, and it makes XPath queries easier.
Here’s a step-by-step explanation and example:
1. Choosing the Right Module
- XML::LibXML is one of the most popular modules for XML parsing. It provides a DOM-like interface and robust XPath querying.
- It is not a core module, but it is available from CPAN and packaged by most operating system distributions.
- Perl's core distribution includes no XML parser, so a CPAN module is needed for robust parsing. XML::Simple is less powerful, and its own documentation discourages its use in new code.
2. Basic Concepts in Perl XML Parsing
- DOM (Document Object Model): the XML is parsed into a tree structure that you can navigate programmatically.
- XPath: a powerful syntax to query parts of the XML tree directly.
- Return values and context: findnodes returns a list of node objects in list context and an XML::LibXML::NodeList object in scalar context, while findvalue returns a plain string. Node objects must be turned into text explicitly, for example with textContent.
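To make the difference between node objects and string values concrete, here is a minimal sketch. The sample XML is made up for illustration, and the variable names are arbitrary; only the XML::LibXML calls themselves (load_xml, findnodes, findvalue, textContent) come from the module.

```perl
use strict;
use warnings;
use XML::LibXML;

# Illustrative sample document (not from any real data source).
my $doc = XML::LibXML->load_xml(string => <<'XML');
<library>
  <book id="101"><title>Perl Best Practices</title></book>
  <book id="102"><title>Learning Perl</title></book>
</library>
XML

# List context: findnodes returns a plain Perl list of node objects.
my @titles = $doc->findnodes('/library/book/title');

# Scalar context: the same call returns an XML::LibXML::NodeList object.
my $nodelist = $doc->findnodes('/library/book/title');

# Node objects are converted to strings explicitly with textContent.
my @title_text = map { $_->textContent } @titles;

# findvalue returns a plain string for the given XPath expression.
my $first_title = $doc->findvalue('/library/book[1]/title');

# When several nodes match, findvalue concatenates their text.
my $all_titles = $doc->findvalue('/library/book/title');

printf "nodes in list context: %d\n", scalar @titles;
printf "scalar context class:  %s\n", ref $nodelist;
printf "first title:           %s\n", $first_title;
printf "concatenated titles:   %s\n", $all_titles;
print  "each title:            ", join(', ', @title_text), "\n";
```

Restricting the XPath (as with book[1] here) or looping over findnodes avoids the concatenation when more than one node can match.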
3. Example: Parsing XML and Extracting Node Values
This sample demonstrates:
- Loading an XML string
- Extracting a list of nodes via XPath
- Printing specific node values
use strict;
use warnings;
use XML::LibXML;

my $xml_string = q{
<library>
  <book id="101">
    <title>Perl Best Practices</title>
    <author>Damian Conway</author>
  </book>
  <book id="102">
    <title>Learning Perl</title>
    <author>Randal L. Schwartz</author>
  </book>
</library>
};

# Create new parser object
my $parser = XML::LibXML->new();

# Parse the XML string into a document object
my $doc = $parser->parse_string($xml_string);

# Use XPath to find all <book> elements
my @books = $doc->findnodes('/library/book');

print "Extracted Books:\n";
for my $book (@books) {
    # Extract the id attribute and the text of the child elements
    my $id     = $book->getAttribute('id');
    my $title  = $book->findvalue('title');
    my $author = $book->findvalue('author');

    print "Book ID: $id\n";
    print "Title: $title\n";
    print "Author: $author\n";
    print "-----\n";
}
4. Explanation
- XML::LibXML->new() creates a parser instance.
- parse_string loads the XML from a string. You can also parse files with parse_file.
- findnodes executes an XPath query and returns node objects matching the path.
- getAttribute fetches XML element attributes.
- findvalue runs an XPath relative to the current node and returns the matched node's text content (flattened to a string).
- Iterating over nodes and extracting values lets you access specific data like IDs, titles, authors.
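As a rough sketch of the file-based variant mentioned above, the following writes a small sample document to a temporary file and reads it back with parse_file. The file contents and variable names are illustrative assumptions. It also shows that the parser dies on malformed input, which is why untrusted XML is usually parsed inside eval.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);
use XML::LibXML;

# Write a small sample document to a temporary file (illustrative data only).
my ($fh, $filename) = tempfile(SUFFIX => '.xml', UNLINK => 1);
print {$fh} qq{<library><book id="101"><title>Perl Best Practices</title></book></library>\n};
close $fh or die "Cannot close $filename: $!";

my $parser = XML::LibXML->new();

# parse_file reads and parses the document from disk.
my $doc = $parser->parse_file($filename);

# Take the first matching <book> and read its attribute and title.
my ($book) = $doc->findnodes('/library/book');
my $id     = $book->getAttribute('id');
my $title  = $book->findvalue('title');
print "Book $id: $title\n";

# The parser throws an exception on malformed XML, so wrap it in eval.
my $broken       = '<library><book></library>';
my $bad_doc      = eval { $parser->parse_string($broken) };
my $parse_failed = defined $bad_doc ? 0 : 1;
print "Malformed input rejected\n" if $parse_failed;
```

XML::LibXML also offers load_xml(location => $filename) and load_xml(string => $xml) as a single entry point for both cases; which one to use is largely a matter of preference.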
5. Common Gotchas
- Watch for XML namespaces: they require special handling with XML::LibXML (registering namespaces before XPath queries).
- findnodes returns nodes; findvalue returns a string. If the XPath matches several nodes, findvalue concatenates their text, so narrow the expression when you want a single value.
- Avoid parsing XML with regular expressions; always use a proper parser to handle complex XML safely.
- For very large XML documents, the streaming parser XML::LibXML::Reader scales better, at the cost of more complex code.
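The namespace gotcha can be sketched as follows. The namespace URI http://example.com/library and the prefix lib are made up for this example; the prefix used in XPath is whatever you register and need not match anything in the document.

```perl
use strict;
use warnings;
use XML::LibXML;
use XML::LibXML::XPathContext;

# Illustrative document that declares a default namespace.
my $doc = XML::LibXML->load_xml(string => <<'XML');
<library xmlns="http://example.com/library">
  <book id="101"><title>Perl Best Practices</title></book>
  <book id="102"><title>Learning Perl</title></book>
</library>
XML

# An unprefixed XPath step matches elements in no namespace,
# so this query finds nothing in a namespaced document.
my @unqualified = $doc->findnodes('/library/book');

# Bind a prefix of our choosing ("lib") to the document's namespace URI.
my $xpc = XML::LibXML::XPathContext->new($doc);
$xpc->registerNs(lib => 'http://example.com/library');

# Every element step in the XPath now carries the registered prefix.
my @books  = $xpc->findnodes('/lib:library/lib:book');
my @titles = map { $xpc->findvalue('lib:title', $_) } @books;

printf "without namespace: %d match(es)\n", scalar @unqualified;
printf "with namespace:    %d match(es)\n", scalar @books;
print  "title: $_\n" for @titles;
```

The second argument to findvalue on the XPathContext is the context node, which keeps the relative path lib:title scoped to each book.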
6. Alternatives and Core Modules
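If you cannot use XML::LibXML, note that XML::Simple is not part of core Perl either; it is a separate CPAN module that happens to be preinstalled on some systems. It converts XML to nested Perl data structures, which offer no XPath-style querying, and its own documentation discourages its use in new code.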
Overall, XML::LibXML is the go-to choice in Perl for robust XML parsing and node extraction, especially when XPath queries are needed.
Verified Code
Executed in a sandbox to capture real output. • v5.34.1 • 68ms
Extracted Books:
Book ID: 101
Title: Perl Best Practices
Author: Damian Conway
-----
Book ID: 102
Title: Learning Perl
Author: Randal L. Schwartz
-----