Full fat

Eight years ago The Guardian proudly announced that they were providing full-text feeds of, well, everything:

Today guardian.co.uk rolled out a major upgrade to the RSS feeds. Our feeds now contain the full content of each article so that you can take guardian.co.uk with you wherever you prefer to get your news.

Fast-forward to today and not only do they no longer provide the full content of articles in their feeds (those clickthroughs and ad impressions being all-important), but not even their developer blog has been spared. This is pretty disappointing.

My first thought was to run it through the nice full-text RSS feed creator at fivefilters.org, but it looks like The Guardian have asked them not to allow this for their domain (damn those ads!). Luckily there are still tools out there which will convert their truncated feed into full text for me, so I was saved from having to write the relevant dozen lines of code myself – although maybe that was the point, eh developer blog?
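Those dozen lines would look something like this – a sketch in Python using feedparser, where the feed URL is illustrative and the “extraction” is deliberately naive (it escapes the whole fetched page into the content element rather than isolating the article body):

    import urllib2
    import feedparser
    from xml.sax.saxutils import escape

    # Illustrative URL: substitute whichever truncated feed you care about.
    feed = feedparser.parse('http://www.guardian.co.uk/info/developer-blog/rss')

    print '<?xml version="1.0"?>'
    print '<feed xmlns="http://www.w3.org/2005/Atom">'
    print '<title>%s (full fat)</title>' % escape(feed.feed.title)
    for entry in feed.entries:
        page = urllib2.urlopen(entry.link).read()  # naive: take the whole page
        print '<entry>'
        print '<title>%s</title>' % escape(entry.title)
        print '<link href="%s"/>' % escape(entry.link, {'"': '&quot;'})
        print '<content type="html">%s</content>' % escape(page)
        print '</entry>'
    print '</feed>'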

Readme

When time is short or my brain is full, I have two ways of marking content as worth reading at some point in the future:

  • if it’s in Google Reader I star it
  • if it’s on the wild wild web then I add it to delicious and tag it ‘readme’

The fact that I have over 600 ‘readme’ items in delicious, going back to 2004, tells us one of two things:

  1. I am not reading those items, or
  2. I am not untagging them once read.

Sadly for me, the answer is (1), and I’d not previously worked out a way of making a serious dent in the number of unread articles without declaring bankruptcy and potentially starting again – except of course that I would still have no strategy for actually reading them!

Enter http://www.tabbloid.com/ – a two-year-old (yet new to me) service from HP that lets you add any number of feeds you like; on a daily or weekly schedule it will grab those feeds, merge the results, sort by time, select the most recent items and generate a PDF which it then emails to you.

I’m going for a weekly delivery of both my starred items and my readme items – my first one arrived in my inbox the other day; I printed it out and am very happy indeed. Of course it means that each week I’m giving myself the job of going through my Tabbloid printout and de-starring or removing tags in delicious, but at least I’m making progress!

For generating PDFs from RSS I’ve previously used http://fivefilters.org/pdf-newspaper/ but it’s been choking on the feeds I want processed. http://www.feedjournal.com/ is also a competitor, but with a less-slick website, and thus I didn’t try it. Yes, I really am that fickle.

That isn’t to say there aren’t any pain points with this whole process. I haven’t yet sussed out how to queue up video items tagged with “watchme”, for example, or how to watch videos I’ve starred in Google Reader. Presumably it would involve parsing the feeds, grabbing the video where possible, encoding it to a phone-friendly format and then subscribing in a mobile feedreader, but that sounds like a lot of work right now for a relatively small issue, and I’m more than happy to be able to have a piece of my online reading experience come offline with me, ready whenever I am.
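Still, the first step of that pipeline is easy enough to sketch. Something like the following would pull video URLs out of a delicious tag feed; the feed URL pattern is from memory, and the enclosure handling is a guess since bookmarks don’t reliably carry enclosures:

    import feedparser

    # Feed URL pattern is from memory; substitute your own username.
    feed = feedparser.parse('http://feeds.delicious.com/v2/rss/USERNAME/watchme')
    for entry in feed.entries:
        videos = [enc.href for enc in entry.get('enclosures', [])
                  if enc.get('type', '').startswith('video/')]
        # fall back to the bookmarked page when there is no direct video file
        for url in videos or [entry.link]:
            print url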

Online ebook catalogs in Atom

As I recently wrote, I have a new-found interest in ebooks (I also bought four new textbooks from O’Reilly using a BOGOF offer to pick up 97 Things Every Programmer Should Know, 97 Things Every Project Manager Should Know, Beautiful Code and The Art of Agile Development).

I mainly read ebooks on my Android device, specifically using Aldiko.

Aldiko has a built-in browser for the feedbooks.com catalog, but also gives you the ability to add your own catalogs. A friend told me that Calibre, a popular ebook management programme, has a web interface which one of the other popular Android ebook readers (WordPlayer) could be pointed at in order to add custom catalogs. After a quick trial and a few Google searches, I realised that WordPlayer actually subscribes to an XML file hosted at http://localhost/calibre/stanza

Opening this file shows it to be Atom, where each entry is a small metadata container and the link element is used to reference the actual book and images that represent it, like this:


    <link type="application/epub+zip" href="/get/epub/3"/>
    <link rel="x-stanza-cover-image" type="image/jpeg" href="/get/cover/3"/>
    <link rel="x-stanza-cover-image-thumbnail" type="image/jpeg" href="/get/thumb/3"/>

Another few searches showed this to be a draft specification called openpub. Aldiko supports it, so adding the /stanza URL as a custom catalog works there too! Voila, custom catalogs in Aldiko. Marvellous!

It should only require a tiny bit of work to write code that serves a catalog straight from the filesystem without the overhead of Calibre (which I found to be quite heavyweight). This is what I have started here.
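To give a flavour of the idea (a sketch, not the code I’ve started): a minimal server could list a directory of epubs at /stanza as an Atom feed and serve the files themselves under /get/. The directory path is a placeholder, titles are just filenames, and there’s no cover-image handling:

    import os
    import urllib
    import BaseHTTPServer
    from xml.sax.saxutils import escape

    BOOK_DIR = '/home/phil/books'  # placeholder: wherever your epubs live

    class CatalogHandler(BaseHTTPServer.BaseHTTPRequestHandler):
        def do_GET(self):
            if self.path == '/stanza':
                entries = ''
                for name in sorted(os.listdir(BOOK_DIR)):
                    if name.endswith('.epub'):
                        entries += ('<entry><title>%s</title>'
                                    '<link type="application/epub+zip" '
                                    'href="/get/%s"/></entry>'
                                    % (escape(name[:-5]), urllib.quote(name)))
                feed = ('<feed xmlns="http://www.w3.org/2005/Atom">'
                        '<title>Bookshelf</title>%s</feed>' % entries)
                self.send_response(200)
                self.send_header('Content-Type', 'application/atom+xml')
                self.end_headers()
                self.wfile.write(feed)
            elif self.path.startswith('/get/'):
                # basename() stops requests walking out of the book directory
                name = urllib.unquote(os.path.basename(self.path[len('/get/'):]))
                self.send_response(200)
                self.send_header('Content-Type', 'application/epub+zip')
                self.end_headers()
                self.wfile.write(open(os.path.join(BOOK_DIR, name), 'rb').read())
            else:
                self.send_error(404)

    BaseHTTPServer.HTTPServer(('', 8080), CatalogHandler).serve_forever()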

Importing Nokia podcast subscriptions into Google Reader

Exporting the list of podcasts

  1. Load the podcasting application, mark all items and hit “send -> bluetooth”. Contrary to what you might expect, this will send an OPML file listing your subscriptions to your PC.

Edit the list ready for import

  1. Open your new Podcasting.opml file in a text editor
  2. Find/replace all instances of url= with xmlUrl=
  3. Immediately after the opening <body> tag put <outline title="podcasts" text="podcasts">
  4. Just before the closing </body> tag put </outline>
  5. (I also duplicated all the text="blah" attributes as title="blah", but I don’t know if this is actually necessary – see the script below, which performs all of these edits in one go)
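Here’s that script – it assumes the simple attribute layout the Nokia export produces, so eyeball the result before importing:

    import re

    opml = open('Podcasting.opml').read()
    opml = opml.replace(' url=', ' xmlUrl=')                         # step 2
    opml = re.sub(r'text="([^"]*)"', r'text="\1" title="\1"', opml)  # step 5
    opml = opml.replace('<body>',                                    # step 3
                        '<body>\n<outline title="podcasts" text="podcasts">')
    opml = opml.replace('</body>', '</outline>\n</body>')            # step 4
    open('Podcasting-edited.opml', 'w').write(opml)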

Import the list of podcasts

[Screenshot: the Google Reader Import/Export page]
  1. Load Google Reader
  2. Click “Settings” in the top right
  3. Go to the Import/Export tab
  4. Find your Podcasting.opml file and upload!

You should now find that you have a new folder called “podcasts” in your Google Reader containing all the podcasts from your Nokia device.

Even nicer – if you make the folder public (Settings -> Folders and Tags) you can import the OPML from Google Reader directly into other applications by giving the URL http://www.google.com/reader/public/subscriptions/user/USERID/label/podcasts where USERID is the long number in the URL of the “view public page” link next to your public podcasts folder in Settings -> Folders and Tags.

[Screenshot: the Google Reader “view public page” link in Settings -> Folders and Tags]

No more NewsGator, no Google Reader API

So everyone heard earlier this week that NewsGator is shutting down its aggregator synchronising service and moving everyone over to Google Reader.

There are two big problems for me:

  1. venus-ng uses the NewsGator API to provide its reading list and feeds, so that work now has a shelf life
  2. Google Reader has no published API, so there is nothing equivalent to build against

There are also other issues of course, like Google owning yet another piece of monopoly pie, but those two affect me more directly right now.

Both NetNewsWire and FeedDemon now have the ability to sync with Google Reader, but there’s no published API, so I can’t mirror their behaviour in my own Linux desktop aggregator. At best I could use Niall Kennedy’s work from 2005 or maybe code from pyrfeed which has some documentation from 2007. Neither is an option I would choose willingly.

Storing feedparser objects in couchdb

sudo apt-get install python-feedparser   # the Universal Feed Parser
easy_install jsonpickle                  # serialises arbitrary Python objects to JSON
sudo apt-get install couchdb             # the CouchDB server
easy_install couchdb                     # couchdb-python, the client library
sudo couchdb                             # start the server in the foreground

Open a new terminal

python
import feedparser, jsonpickle
from couchdb import Server
s = Server('http://127.0.0.1:5984/')   # connect to the local CouchDB
len(s)                                 # how many databases exist already
db = s.create('feeds')                 # create a new 'feeds' database
len(s)                                 # one more than before
doc = feedparser.parse("http://feedparser.org/docs/examples/atom10.xml")
doc['feed']['title']                   # FeedParserDict allows dict-style access...
len(doc.feed.links)                    # ...and attribute-style access
pfeed = jsonpickle.encode(doc)         # FeedParserDict isn't plain JSON, so pickle it
db.create({'feed1': pfeed})            # store it; this returns the new DOC_ID

The db.create call outputs a DOC_ID. Plug that in below to get the feed back out:

cfeed = db['DOC_ID_HERE']                  # fetch the stored document
dfeed = jsonpickle.decode(cfeed['feed1'])  # rebuild the FeedParserDict
dfeed['feed']['title']                     # same values as before the round-trip
len(dfeed.feed.links)

venus-ng

venus-ng is a fork of Venus which uses NewsGator to provide both the reading list and the feeds.

This means that venus-ng will, at a particular point in time, give you an accurate representation of your currently unread NewsGator feed entries. Here is the output from the newsgator.com web aggregator and from venus-ng:

[Screenshots: unread feeds in newsgator.com and in venus-ng]

venus-ng does not mark feeds as read on the NewsGator server when it retrieves them, although that will likely get added once I have a test NewsGator account set up.

It is currently a fork because I’ve had to modify feedparser.py in a few ways which probably stop it working with other data sources:

  1. I’ve changed the way it deals with passed-in urllib2 handlers
  2. I’ve commented out the HTTP 401 response behaviour (since I’m passing it an HTTPBasicAuthHandler already)
  3. It always passes through an additional X-NGAPIToken HTTP header containing a Newsgator API key

As far as I can tell, the handler refactoring should be fine, but the 401 handling and the extra HTTP header seem like deal-breakers.

I have no idea how to stop the 401 handler in _FeedURLHandler() conflicting with that in urllib2.HTTPBasicAuthHandler.

I suspect there is a good solution in subclassing urllib2.HTTPBasicAuthHandler to provide the additional NewsGator HTTP header, but I’ve not worked out all the details yet.
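Roughly what I have in mind is below – untested, so treat it as a sketch. urllib2 runs every outgoing request through any handler that defines an http_request method, and HTTPBasicAuthHandler doesn’t claim that hook, so a subclass can add the header without touching the 401 retry machinery (the token, realm and credentials are all placeholders):

    import urllib2

    class NewsGatorAuthHandler(urllib2.HTTPBasicAuthHandler):
        """Basic auth plus the X-NGAPIToken header on every request."""

        def __init__(self, api_token, *args, **kwargs):
            urllib2.HTTPBasicAuthHandler.__init__(self, *args, **kwargs)
            self.api_token = api_token

        def http_request(self, request):
            # urllib2 calls <protocol>_request() on every handler that
            # defines it before the request goes out; adding the header
            # here leaves the basic-auth 401 handling alone.
            request.add_header('X-NGAPIToken', self.api_token)
            return request

        https_request = http_request

    # usage: token, realm and credentials are placeholders
    handler = NewsGatorAuthHandler('MY_API_TOKEN')
    handler.add_password('NewsGator API', 'services.newsgator.com',
                         'username', 'password')
    opener = urllib2.build_opener(handler)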

You can get the latest source via bzr get http://philwilson.org/code/venus-ng – there is a sample newsgator.ini file in the /examples directory, but it relies on you already having a NewsGator account and some feeds set up.

Once I’d traced through the Venus code enough to semi-understand it, this was quite straightforward to do (deal-breaking fork-causers aside), so were Google Reader to introduce an official API it would not take long to integrate.

NewsGator + Venus?

I recently broke the graphics drivers on my Windows Vista installation, so re-partitioned and now run Ubuntu full-time at home.

On Windows I use FeedDemon as my full-time aggregator. It has a degree of speed and polish unmatched by any other web or desktop aggregator.

This means that all my feeds are automatically synced with newsgator.com – a web-based aggregator which is not fast and, as far as I can tell, not particularly polished. It might be polished, I don’t know; it’s so slow that I tend to just give up (sync with Google Reader is coming).

FeedDemon has significantly raised the bar for any aggregator I use. Web-based tools no longer cut it, in particular when I have hundreds of feeds and, at times, thousands of unread items.

On Ubuntu the options for a native aggregator are Straw or Liferea. Both are currently undergoing rewrites. Liferea seems like the better option for me, and it has a plugin system which is appealing, but there’s no sync with any online tools.

NewsGator have an HTTP-based API (PDF reference and sample code which requires a minor tweak to run) which is quite straightforward. It gives back data which can be consumed by the Universal Feed Parser. Venus uses the Universal Feed Parser in planet/spider.py after fetching data to create the cache which powers it.
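The glue wouldn’t be much code. Here is a sketch of the idea, with the endpoint URL, realm, token and credentials all placeholders to be checked against the PDF reference:

    import urllib2
    import feedparser

    API_TOKEN = 'MY_API_TOKEN'  # placeholder: issued when you sign up for the API

    auth = urllib2.HTTPBasicAuthHandler()
    auth.add_password('NewsGator API', 'services.newsgator.com',  # realm is a guess
                      'username', 'password')
    opener = urllib2.build_opener(auth)

    # The endpoint is illustrative; the real paths are in the PDF reference.
    request = urllib2.Request(
        'http://services.newsgator.com/ngws/svc/Subscription.aspx',
        headers={'X-NGAPIToken': API_TOKEN})
    data = opener.open(request).read()

    parsed = feedparser.parse(data)  # feedparser happily takes a string
    for entry in parsed.entries:
        print entry.get('title', '(untitled)')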

This time last year I wrote a very very basic wxWidgets tool for browsing the Venus cache. A modification to planet/spider.py to use the NewsGator API would seem like an easy way forward, whilst gaining all the power of the Venus filters, plugins and existing XSLTs.

I might just have to try that.