XML
Important
Check out the XML snippets page.
Links
What’s New in Python 2.5 Working with XML through ElementTree
ElementTree Overview Python Library Reference- The ElementTree XML API
http://codespeak.net/lxml/ lxml is the most feature-rich and easy-to-use library for working with XML and HTML in the Python language.
Note
The core components (of ElementTree) are also shipped with Python 2.5 and later…
Sample
Read
Note
The XML snippets page has a python 3 example which iterates over attributes and tags.
Tip
The XML snippets page has an example using
xmltodict
. If the file can fit in memory this is probably an
easier option.
An older python 2 example which iterates over tags:
from xml.etree import ElementTree as ET
tree = ET.parse('pom.xml')
r = tree.getroot()
def trav(node, indent=0):
for c in node.getchildren():
print ' '*indent, c.tag, ':', c.text
trav(c, indent+1)
trav(r)
…using the trav
method above, we can iterate over a string:
>>> xml = '<code><detail><userId>79918</userId><totalPoints>8</totalPoints></detail></code>'
>>> tree = ET.fromstring(xml)
>>> trav(tree)
Another (slightly confusing) sample:
>>> testtext = """
... <html><body>hello world. <i>foo!</i>
... </body></html>"""
>>> testtext
'\n <html><body>hello world. <i>foo!</i>\n </body></html>'
>>> tree = ET.fromstring(testtext)
>>> len(tree)
1
>>> tree[0].text
'hello world. '
>>> tree[0][0].text
'foo!'
>>> for italicNode in tree.findall('.//i'):
... print italicNode.text
...
foo!
>>> ET.tostring(tree)
'<html><body>hello world. <i>foo!</i>\n </body></html>'
>>>
Create
from xml.etree import ElementTree as ET
root = ET.Element('html')
head = ET.SubElement(root, 'head')
title = ET.SubElement(head, 'title')
title.text = 'Page Title'
body = ET.SubElement(root, 'body')
body.set('bgcolor', '#ffffff')
body.text = 'Hello World!'
tree = ET.ElementTree(root)
tree.write('temp.xml')
Encoding
e.g. using the tree
object from the Create sample (above):
tree.write('out.xml', encoding="UTF-8")
Introducing ElementTree 1.3, XML Output
Pretty Print
We can produce a pretty print using this method:
def indent(elem, level=0):
i = "\n" + level*" "
if len(elem):
if not elem.text or not elem.text.strip():
elem.text = i + " "
if not elem.tail or not elem.tail.strip():
elem.tail = i
for elem in elem:
indent(elem, level+1)
if not elem.tail or not elem.tail.strip():
elem.tail = i
else:
if level and (not elem.tail or not elem.tail.strip()):
elem.tail = i
e.g. using the tree
object from the Create sample (above):
indent(tree.getroot())
tree.write('pretty.xml', encoding="ISO-8859-1")
#!/usr/bin/env python
import xml.dom.minidom as md
import sys
pretty_print = lambda f: '\n'.join([line for line in md.parse(open(f)).toprettyxml(indent=' '*2).split('\n') if line.strip()])
if __name__ == "__main__":
if len(sys.argv)>=2:
print pretty_print(sys.argv[1])
else:
sys.exit("Usage: %s [xmlfile]" % sys.argv[0])
find
and findAll
For this example we will parse a standard Maven pom.xml
file.
To find elements using XPath like syntax, we first need to know the namespace:
from xml.etree import ElementTree as ET
tree = ET.parse('sample-app/pom.xml')
root = tree.getroot()
for element in root: print element.tag
...:
{http://maven.apache.org/POM/4.0.0}modelVersion
{http://maven.apache.org/POM/4.0.0}groupId
{http://maven.apache.org/POM/4.0.0}artifactId
...
Don’t forget to include the namespace when searching for elements:
e = tree.find('{http://maven.apache.org/POM/4.0.0}artifactId')
e.text
'sample-app'
To find all elements in the xml file, prefix the query with \/\/
:
e = tree.findall('//{http://maven.apache.org/POM/4.0.0}artifactId')
for i in e:
print i.text
....:
sample-app
junit
To search down through a specific path:
e = tree.find('{http://maven.apache.org/POM/4.0.0}dependencies/{http://maven.apache.org/POM/4.0.0}dependency/{http://maven.apache.org/POM/4.0.0}artifactId')
e.text
'junit'