Importing blog posts and comments from Blogger to WordPress

I tried this a year ago, only to experience epic fail.

I tried this yesterday and it was a marvellous success.

Around this time last year I was locked out of my Google account and decided to move what I could over to my own server (a process I’ve still not completed!). As part of that move I used BloggerBackup to export all of my blog posts and comments and tried to import them into WordPress, which didn’t work. I was resigned to writing a script to do the import, but ran into a WordPress date parsing bug which I had trouble tracking down. Since my old blog was still available as static HTML on my server, I wasn’t really that worried about it.

Last night I tried the built-in WordPress import from Blogger. It uses OAuth to authenticate and then allows import of your posts and comments from the comfort of a couple of clicks in the WordPress admin interface. All very smooth, all very easy (apart from the slightly worrying disparity between the number of imported elements and the totals). I’ll have to move my images, but that’s no real bother.

My archives now go all the way back to May 2002 when it was a co-blog with my housemate of the time who is now an arty-philoso-programmer in Australia. Before that I maintained my blog by hand and I’m not sure I have copies.

A quick “thanks” to my colleague Tom Natt who helped me fix my .htaccess changes so that old links and Google searches still work (also thanks to Mark Pilgrim’s “Cruft-free URLs in Movable Type”, which I could rather tragically remember as a useful post from five years ago).

Parsing Atom with libxml2

Whilst trying to parse some Atom (my Blogger backup) with libxml2 I appear to have run into the same problem that Aristotle hit two years ago in “XPath vs the default namespace: easy things should be easy”. To wit: you can’t match on the default namespace in XPath.


>>> import libxml2
>>> doc = libxml2.parseFile("/home/pip/allposts.xml")
>>> results = doc.xpathEval("//feed")
>>> len(results)
0

Unbelievable.

Immediate potential solutions:

  1. XSLT my Atom document to add “atom:” to all my default-namespaced elements
  2. use an entirely different method of parsing
  3. remove the atom namespace declaration from the top of the file
  4. something else

Option 3 looks like the only sane route to take in this one-off job, but I’m quite surprised that I have to do it at all.
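For what it’s worth, option 3 might look something like the sketch below. It is a crude text-level hack rather than anything tied to libxml2: it uses only the Python standard library, and the sample feed, the `strip_default_namespace` helper and the Atom 1.0 namespace URI are all assumptions for illustration (an older Blogger export may declare a different namespace). It would also misbehave if the declaration used single quotes or if child elements redeclared a default namespace.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical stand-in for the Blogger backup file.
SAMPLE = """<feed xmlns="http://www.w3.org/2005/Atom">
  <title>Example blog</title>
  <entry><title>First post</title></entry>
</feed>"""


def strip_default_namespace(xml_text):
    """Remove the first default xmlns="..." declaration from the text.

    Prefixed declarations such as xmlns:foo="..." are left alone,
    because the pattern requires '=' directly after 'xmlns'.
    """
    return re.sub(r'\sxmlns="[^"]*"', "", xml_text, count=1)


stripped = strip_default_namespace(SAMPLE)
root = ET.fromstring(stripped)

# With no namespace in play, plain unprefixed names match again.
entries = root.findall("entry")
print(root.tag, len(entries))
```

Fine for a one-off job, but it throws away information, so I wouldn’t want it anywhere near code that has to keep working.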

Actually, this turned out to be my fault – I was parsing two documents at the same time, one with a namespace declaration set correctly (for parsing my Atom file), and one with no namespaces set. I used the latter for my XPath query, which clearly couldn’t work. Many thanks to everyone who left a comment!
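To make the underlying point concrete: an unprefixed name in a query only matches elements in no namespace, so the query needs a prefix bound to the Atom namespace. The sketch below shows that with the standard library’s `xml.etree.ElementTree` rather than libxml2, purely so that it runs without extra installs (the libxml2 Python bindings offer a similar mechanism through an XPath context, via `xpathNewContext` and `xpathRegisterNs`). The sample feed, the `atom` prefix and the Atom 1.0 namespace URI are assumptions for illustration, not my actual backup.

```python
import xml.etree.ElementTree as ET

# Assumption: the feed declares the Atom 1.0 namespace as its default.
ATOM_NS = "http://www.w3.org/2005/Atom"

# Hypothetical stand-in for the Blogger backup file.
SAMPLE = """<feed xmlns="http://www.w3.org/2005/Atom">
  <title>Example blog</title>
  <entry><title>First post</title></entry>
  <entry><title>Second post</title></entry>
</feed>"""

root = ET.fromstring(SAMPLE)

# An unprefixed name looks for elements in *no* namespace,
# so it finds nothing in a document with a default namespace.
unprefixed = root.findall(".//entry")

# Binding a prefix (any name will do) to the namespace URI
# and using it in the query matches the real elements.
prefixes = {"atom": ATOM_NS}
prefixed = root.findall(".//atom:entry", prefixes)

titles = [entry.find("atom:title", prefixes).text for entry in prefixed]
print(len(unprefixed), len(prefixed), titles)
```

The prefix in the query is arbitrary; what matters is that it maps to the same URI the document declares, whether or not the document itself uses a prefix.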