philwilson.org

Parsing Atom with libxml2

26 November, 2007

Whilst trying to parse some Atom (my Blogger backup) with libxml2, I appear to have run into the same problem that Aristotle hit two years ago in “XPath vs the default namespace: easy things should be easy”. To wit: the story is that you can’t match on the default namespace in XPath.

>>> import libxml2
>>> doc = libxml2.parseFile("/home/pip/allposts.xml")
>>> results = doc.xpathEval("//feed")
>>> len(results)
0

Unbelievable.

Immediate potential solutions:

  1. XSLT my Atom document to add “atom:” to all my default-namespaced elements
  2. use an entirely different method of parsing
  3. remove the atom namespace declaration from the top of the file
  4. something else

Option 3 looks like the only sane route to take in this one-off job, but I’m quite surprised that I have to do it at all.

Actually, this turned out to be my fault – I was parsing two documents at the same time, one with a namespace declaration set correctly (for parsing my Atom file), and one with no namespaces set. I used the latter for my xpath query, which clearly didn’t work – many thanks to everyone who left a comment!


Comments

leff
26 November, 2007 at 01:22

I’m not surprised at all. If the xml working group had a slogan it would be “Formalizing simple things until they’re difficult.” It’s always supposed to be easy, but it never is.

Sam Ruby
26 November, 2007 at 01:45

Default namespaces are a serialization artifact. Once read into memory, whether the namespace was a default, or even what prefix was used, doesn’t much matter. So, what you need to do is register a prefix for you to use at runtime, and use it.

xp = doc.xpathNewContext()
xp.xpathRegisterNs("atom", "http://www.w3.org/2005/Atom")
results = xp.xpathEval("/atom:feed")

Note: the above works even if somebody uses the default prefix, or a prefix of atom or even a prefix of a. Also note that it is faster not to use // if you know the path.

A more complete example:

http://www.intertwingly.net/code/venus/filters/mememe.plugin
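
The note above about prefixes can be checked with a minimal sketch. It swaps in Python's standard-library xml.etree.ElementTree for libxml2 so that it runs without extra dependencies; the three sample feeds and the count_entries helper are made up for illustration. The prefix-to-URI mapping passed to findall plays roughly the role that xpathRegisterNs plays above.

```python
import xml.etree.ElementTree as ET

ATOM_NS = "http://www.w3.org/2005/Atom"

# The same tiny feed serialized three ways: default namespace,
# an "atom:" prefix, and an "a:" prefix (all hypothetical samples).
FEEDS = [
    '<feed xmlns="http://www.w3.org/2005/Atom"><entry/></feed>',
    '<atom:feed xmlns:atom="http://www.w3.org/2005/Atom"><atom:entry/></atom:feed>',
    '<a:feed xmlns:a="http://www.w3.org/2005/Atom"><a:entry/></a:feed>',
]

# Prefix chosen by the querying code; it need not match the document's prefix.
NSMAP = {"atom": ATOM_NS}


def count_entries(xml_text):
    """Count Atom entry children of the root, whatever prefix the document uses."""
    root = ET.fromstring(xml_text)
    return len(root.findall("atom:entry", NSMAP))


counts = [count_entries(text) for text in FEEDS]
root_tags = {ET.fromstring(text).tag for text in FEEDS}

print(counts)      # one entry found in each serialization
print(root_tags)   # a single expanded name: the prefix is gone once parsed
```

All three documents parse to the same expanded names, which is why a prefix registered at query time is enough.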

Adam Fitzpatrick
26 November, 2007 at 02:24

You only need to make a small change.

>>> import libxml2
>>> doc = libxml2.parseFile("/home/pip/allposts.xml")
>>> ctxt = doc.xpathNewContext()
>>> ctxt.xpathRegisterNs("a", "http://www.w3.org/2005/Atom")
>>> results = ctxt.xpathEval("//a:feed")

You can reuse the XPath context object for other XPath queries on the same document.

There are two subtle things to note. First, prefix:localname in XPath matches an element with that local name in the namespace referred to by that prefix, but a name without a prefix in an XPath expression always means that name in “the namespace you have when you don’t have a namespace” (or “the null namespace” as Daniel Veillard less whimsically describes it in the email Aristotle Pagaltzis quotes in the blog post you refer to). Like Veillard says, XPath just doesn’t have the “default namespace” concept like XML itself does.

It doesn’t help that the Namespaces in XML specification doesn’t define a practical term for “the null namespace”; it uses cumbersome language like “the namespace name has no value” (see the definition of “expanded name”, or section 6.2 (Namespace Defaulting) for example).

Incidentally, though this characteristic of XPath is very inconvenient for element names, *attribute* names with no prefix in XML are also in the null namespace, so XPath’s behaviour is obviously a much better fit for matching attribute names.

The other issue is that XPath implementations basically never use the document’s namespace prefix bindings (quite reasonably so, for two reasons: those bindings can differ on every element in the document; and, more commonly, different documents can and do use different prefixes, and you basically never want to discriminate between documents on the basis of the prefix).

This means that option 1 won’t work (because the lack of prefix in the source document isn’t the problem), option 2 won’t be necessary, and option 3 won’t be a problem if there turns out to be a next time after all.
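
The two points about unprefixed names (elements versus attributes) can be illustrated with a small sketch. It uses the standard-library xml.etree.ElementTree rather than libxml2, which implements only a subset of XPath but treats unprefixed names the same way; the sample feed and the example.org URL are invented for illustration.

```python
import xml.etree.ElementTree as ET

ATOM_NS = "http://www.w3.org/2005/Atom"
NSMAP = {"atom": ATOM_NS}  # prefix chosen by the querying code

# Hypothetical feed that uses the default namespace, like a Blogger export.
FEED = (
    '<feed xmlns="http://www.w3.org/2005/Atom">'
    '<entry><link href="http://example.org/post"/></entry>'
    '</feed>'
)

root = ET.fromstring(FEED)

# Unprefixed name: looks for "entry" in no namespace, so nothing matches.
unprefixed = root.findall("entry")

# Prefixed name bound by the caller: matches the Atom entry.
prefixed = root.findall("atom:entry", NSMAP)

# The unprefixed href attribute is in no namespace,
# even though the element that carries it is in the Atom namespace.
link = prefixed[0].find("atom:link", NSMAP)
attribute_names = list(link.attrib)

print(len(unprefixed), len(prefixed), attribute_names)
```

The element lookup needs the prefix, while the attribute is reachable by its bare name.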

anon
26 November, 2007 at 03:49

For lazy one-shot jobs where you don’t want to write the extra few lines to set up your own XPath context and resolve namespaces correctly, you can use the pretend-there-aren’t-any idiom:
results = doc.xpathEval("//*[local-name()='feed']")
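
A rough analogue of this idiom, sketched with the standard-library xml.etree.ElementTree instead of libxml2. ElementTree does not support local-name(), so the hypothetical helpers local_name and find_by_local_name strip the namespace part of each tag by hand; the two sample documents are also made up.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    # ElementTree stores namespaced tags as "{uri}local"; keep only the local part.
    return tag.rsplit("}", 1)[-1]


def find_by_local_name(root, name):
    # Walks the root and all its descendants, much like //*[local-name()='name'].
    return [el for el in root.iter() if local_name(el.tag) == name]


atom_feed = ET.fromstring('<feed xmlns="http://www.w3.org/2005/Atom"><entry/></feed>')
plain_feed = ET.fromstring('<feed><entry/></feed>')

matches = [len(find_by_local_name(doc, "feed")) for doc in (atom_feed, plain_feed)]
print(matches)
```

The trade-off is that this matches an element called feed in any namespace or in none, which is fine for a one-off job but too loose for documents that mix vocabularies.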

Edward O'Connor
26 November, 2007 at 05:18

Why not use the Universal Feed Parser?

Aristotle Pagaltzis
26 November, 2007 at 06:21

>>> import libxml2
>>> doc = libxml2.parseFile("/tmp/feed.atom")
>>> xc = doc.xpathNewContext()
>>> xc.xpathRegisterNs("atom", "http://www.w3.org/2005/Atom")
0
>>> results = xc.xpathEval("//atom:feed")
>>> len(results)
1

Jeni Tennison
26 November, 2007 at 08:47

It’s hardly ideal, but you could use paths like “//*[name() = 'feed']”. Really there should be a way of binding a prefix (e.g. atom) to the namespace before you evaluate any XPaths, so you can do “//atom:feed”.

Asbjørn Ulsberg
26 November, 2007 at 09:12

What would you do with XML nodes in the empty namespace (xmlns=""), then? I kind of agree with you, though. However, I don’t think it’s worth making so much fuss about; it only requires one more line of code to define the Atom namespace with a prefix and then sprinkle the prefix throughout the XPath statements.

Making the default namespace equivalent to the empty namespace should probably be done explicitly anyhow, with an optional parameter to parseFile() or something similar. It has to be under the author’s control whether he wants to access empty-namespaced elements or not, and I think the default behaviour should stay as it currently is.

alf
26 November, 2007 at 10:14

PHP5’s SimpleXML (based on libxml2) has a registerXPathNamespace function – maybe Python has an equivalent?

Otherwise yes, in the past I’ve just mangled the “xmlns=” bit of the default namespace declaration so that it doesn’t apply any more.

Phil
26 November, 2007 at 10:38

Sam, Adam and Aristotle, I was strongly under the impression that I had tried that. It is possible that I made a typo. Your comments certainly suggest it, although the interpreter didn’t report any problems with my syntax.

I used “//feed” in my example simply to demonstrate that regardless of the base document the query should return something. My actual XPath query was “/feed”.

Asbjørn, you’re right but given that I thought I’d tried the single-line sprinkling suggested by Sam and Aristotle, I was hoping to draw out the XPath experts and it seems to have worked 😉

I did actually look for an equivalent of xpath.flipBozoBit() which would allow me to query the default namespace directly.

Edward – I had libxml2 to hand and wanted to run some very specific atom-based queries. Plus I wanted to increase my understanding of the library.

Jeni and anon, thanks very much for that, it’s a useful selector I’d forgotten about.

Phil
26 November, 2007 at 22:06

I really must have made a terrible typo first time around as setting the namespace context really does work. How very embarrassing and frustrating. Thanks all!