Recently, I have been neglecting my RSS feeds. I'm a Google Reader user and my unread count is far from zero. I've been distracted and too busy. The problem, though, is that many excellent Artificial Intelligence and Machine Learning blog posts are written every day, and I get a lot of enjoyment from reading them and keeping abreast of current research and interesting projects.
An obvious solution would be to simply rely on social media to filter such articles and let the best float to the top on Hacker News, Reddit, and Twitter. Maybe.
I felt like some hacking over my Easter break (in between the numerous family responsibilities), so I decided to have a crack at this problem. This is the first of four posts exploring how to aggregate and filter RSS feeds from popular blogs in the areas of computer science, data mining, data visualization, machine learning and artificial intelligence. I've called the project AIFeeds and have hosted the code and data for all parts on GitHub if you want to skip ahead.
The objective of this first part is simply to build a large list of blogs that match my loose criteria. Specifically, the goal is to prepare a list of working RSS feeds for blogs on AI and related fields, without any duplicates. This list will then be used as the basis for the subsequent parts in the series.
Step 1: Extract relevant blogs from my Google Reader Account
For starters, I have a modest collection of relevant feeds in Google Reader. In Google Reader, I clicked Settings, then Import/Export, then "Export your subscriptions as an OPML file". I opened the OPML file in a text editor and deleted all the feeds that I knew were not on topic.
Step 2: Search for lists of relevant blogs
The next step is to Google search for lists of machine learning blogs. I remember seeing such lists on Quora and MetaOptimize before so I kept an eye out for these sites in the search results. Here are some good lists I found:
- Good Machine Learning Blogs (MetaOptimize)
- Machine Learning Resources (Inductio ex Machina)
- Updated Machine Learning/Statistics blog list (Machine Learning, etc)
- Data Mining Blogs (Data Mining Research)
- What are the best blogs about data? (Quora)
- What are the best machine learning blogs? (Quora)
Step 3: Aggregate
The first easy thing to do is to write a script to extract all feeds from the collected OPML files. Using the built-in REXML library in Ruby, I bashed out a script that reads in all .opml files in an opml/ subdirectory, parses each entry for the rss attribute, and writes an opml_feeds.txt file.
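A minimal sketch of that kind of script (not the actual parseopml.rb) might look like the following. It assumes the feed address lives in the xmlUrl attribute of each outline element, which is how OPML subscription lists usually store it. The function names feeds_from_opml and write_opml_feeds are made up for illustration.

```ruby
require 'rexml/document'

# Collect the feed URL of every <outline> element in one OPML document.
# Assumption: the feed address is stored in the xmlUrl attribute.
def feeds_from_opml(xml)
  doc = REXML::Document.new(xml)
  urls = []
  # '//outline' also matches entries nested inside folder outlines.
  REXML::XPath.each(doc, '//outline') do |outline|
    url = outline.attributes['xmlUrl']
    urls << url.strip unless url.nil? || url.strip.empty?
  end
  urls
end

# Read every .opml file in a directory and write one feed URL per line.
# The default paths mirror the names used in this post.
def write_opml_feeds(dir = 'opml', out = 'opml_feeds.txt')
  urls = Dir.glob(File.join(dir, '*.opml')).sort.flat_map do |path|
    feeds_from_opml(File.read(path))
  end
  File.write(out, urls.join("\n") + "\n")
  urls
end
```

Calling write_opml_feeds with no arguments would read opml/*.opml and write opml_feeds.txt in the current directory. Duplicates are deliberately left in at this stage, since they get filtered out later.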
See parseopml.rb
I then did something stupid and slow, but effective. I clicked a lot of links, read a lot of blog rolls, viewed source on a ton of blogs, and copy-pasted links to blog RSS feeds into a byhand_feeds.txt file. I could have written scripts for this, but there is a lot of pain to deal with, and I suspected I could do this one-off task by hand faster than I could write code to parse a bunch of pages, extract (the right) links, and then download the pages and extract links to RSS feeds. It took an hour or so to do by hand, which is not a big deal.
Step 4: Filter list of feeds
So now we have two lists of RSS feeds. Some may work, some may not. There are likely many duplicates and even many that are not strictly on topic. An easy quick win is to write a script that opens a connection to each url, parses the contents and builds a list of unique feeds.
The first thing that is needed is a library that can process RSS feeds in Ruby, covering at least RSS1, RSS2 and Atom. Google throws up a few options, such as RubyRSS, Ruby RSS in the stdlib, Ruby-Feedparser, Simple RSS, feed-normalizer, and others. I tried a few and was somewhat frustrated. I ended up going with feed-normalizer (rdoc) because it was really easy to use and provided a consistent data structure for RSS and Atom feeds.
I wrote a small script using feed-normalizer to parse a feed and return the data structure. Importantly, it guards against the URL failing to open, the XML being bad, and socket timeouts. When something goes wrong, a message is printed to stdout and nil is returned. I used this script to test some RSS feeds, some Atom feeds, and even some bad URLs. This file also provides a home for helper functions for loading and parsing RSS feeds.
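The error-handling pattern described above can be sketched roughly as follows. This is not parserss.rb and it does not use feed-normalizer. To stay self-contained it swaps in Net::HTTP and REXML from Ruby's standard distribution and only pulls out the site title and site link. The names parse_feed_xml and load_feed and the 10-second default timeout are assumptions for illustration. It also does not follow HTTP redirects.

```ruby
require 'net/http'
require 'uri'
require 'rexml/document'

# Pull the site title and site link out of RSS 2.0, RSS 1.0 or Atom XML.
# Returns a hash, or nil (after printing the error) when the XML is bad.
def parse_feed_xml(xml)
  root = REXML::Document.new(xml).root
  raise 'no root element' if root.nil?
  # RSS keeps site details under <channel>; Atom keeps them on the root.
  holder = root.elements.find { |e| e.name == 'channel' } || root
  title = holder.elements.find { |e| e.name == 'title' }
  links = holder.elements.select { |e| e.name == 'link' }
  # Atom style: a link with href and rel="alternate" (or no rel at all).
  # RSS style: a link element whose text is the site URL.
  link = links.find { |l| l.attributes['href'] && [nil, 'alternate'].include?(l.attributes['rel']) } ||
         links.find { |l| !l.text.to_s.strip.empty? }
  { title: title && title.text.to_s.strip,
    url: link && (link.attributes['href'] || link.text.to_s.strip) }
rescue StandardError => e
  puts "Error parsing feed: #{e.message}"
  nil
end

# Fetch a feed over HTTP(S) with timeouts; report any failure and return nil.
def load_feed(url, timeout = 10)
  uri = URI.parse(url)
  raise "unsupported URL: #{url}" unless uri.is_a?(URI::HTTP)
  response = Net::HTTP.start(uri.host, uri.port, use_ssl: uri.scheme == 'https',
                             open_timeout: timeout, read_timeout: timeout) do |http|
    http.get(uri.request_uri)
  end
  raise "HTTP #{response.code}" unless response.is_a?(Net::HTTPSuccess)
  parse_feed_xml(response.body)
rescue StandardError => e
  # Covers bad URLs, refused connections, timeouts and HTTP errors alike.
  puts "Error loading #{url}: #{e.message}"
  nil
end
```

Returning nil instead of raising keeps the calling code simple: anything that processes a long list of feeds can just skip the nil results.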
See parserss.rb
Using this parser function, I wrote a script to read in the lists of RSS feeds from text files (opml_feeds.txt and byhand_feeds.txt), parse each feed, and build a unique list based on the website name. There are many hundreds of feeds to check in those files and I'm lazy, so I wrote a small function to hammer the feeds in the file using lots of threads and then check all of the data structures in memory. Lazy but effective. The result is a list of 236 RSS feeds in filtered_feeds.txt.
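One way to sketch the threaded check and the de-duplication by site name is shown below. It is an illustration rather than filterfeeds.rb itself. The names read_feed_urls and unique_feeds, the loader argument, and the default of 20 worker threads are all assumptions. The loader is any callable that takes a URL and returns a hash with a :title key, or nil when the feed could not be loaded.

```ruby
# Read feed URLs, one per line, from several text files, skipping blanks.
def read_feed_urls(paths)
  paths.flat_map { |p| File.readlines(p).map(&:strip) }.reject(&:empty?).uniq
end

# Load every URL with a small pool of worker threads, then keep one feed
# per site title. When two feeds share a title, the one listed first wins.
def unique_feeds(urls, loader, workers = 20)
  urls = urls.uniq
  queue = Queue.new
  urls.each { |u| queue << u }
  queue.close # pop now returns nil once the queue is drained
  results = Queue.new
  threads = Array.new(workers) do
    Thread.new do
      while (url = queue.pop)
        feed = loader.call(url)
        results << feed.merge(feed_url: url) if feed
      end
    end
  end
  threads.each(&:join)

  found = []
  found << results.pop until results.empty?
  # Threads finish in any order, so restore the input order before filtering.
  order = urls.each_with_index.to_h
  found.sort_by { |f| order[f[:feed_url]] }
       .uniq { |f| f[:title].to_s.strip.downcase }
end
```

Comparing titles case-insensitively is a design choice of this sketch. It catches the common case of the same blog being listed once via its RSS feed and once via its Atom feed, at the cost of merging two different blogs that happen to share a name.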
See filterfeeds.rb
Step 5: Summarize list of feeds
A final step is to generate a summary of all the available feeds. An easy way to do that is to generate an HTML page that summarizes each feed in turn, ordered by the name of the website. I prepared a script that parses each feed and lists the name of the site, a link to the site, and finally a link to the actual RSS feed.
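As a rough illustration of the page generation (not the actual listfeeds.rb), the sketch below takes feeds that have already been parsed into hashes and builds the HTML as a string. The hash keys :title, :url and :feed_url and the function name feed_list_html are assumptions of this example.

```ruby
require 'erb'

# Escape text so that odd site titles cannot break the generated markup.
def esc(text)
  ERB::Util.html_escape(text.to_s)
end

# Build an HTML list of feeds ordered by site name (case-insensitive).
# Each feed is a hash with :title, :url (the site) and :feed_url keys.
def feed_list_html(feeds)
  items = feeds.sort_by { |f| f[:title].to_s.downcase }.map do |f|
    format('<li><a href="%s">%s</a> (<a href="%s">feed</a>)</li>',
           esc(f[:url]), esc(f[:title]), esc(f[:feed_url]))
  end
  <<~HTML
    <html>
    <head><title>Feed list</title></head>
    <body>
    <ul>
    #{items.join("\n")}
    </ul>
    </body>
    </html>
  HTML
end
```

Writing the result out is then a one-liner such as File.write('feedlist.html', feed_list_html(feeds)).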
See listfeeds.rb
The following provides a screenshot of the generated HTML page, feedlist.html, which gives the final list of websites and their RSS feeds.
Improvements and Extensions
There are a number of improvements and extensions that could be made to the first part of this effort, such as:
- The final list contains some oddball and even some off-topic feeds. A high-level human-based curation of the list would be a good start. For example, removing Q&A feeds and delicious feeds would be useful.
- Some of the links to the sites in the generated feedlist.html are suspect, meaning that some feeds do not accurately report the site URL or site title. It may be possible to carry over this information from the source documents, either the websites themselves and/or the OPML files. Ideally, the final corpus of feeds should be stored in an OPML file for portability and reuse.
- Build a larger corpus of feeds. Naturally, this is the easy extension to the whole project and there are many ways to go about collecting more feeds, starting with some simple Google searches. Getting more OPML files and including them in the parseopml.rb script would be a great first step.
- Write a script to extract links to blogs from some of the collections listed above. Once such links are collected, write a script to download the HTML, parse it and extract the link to the RSS or Atom feed. This is most commonly listed in the head tag of the HTML file as an alternate link (at least it is for Blogger and WordPress blogs).
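The last idea in the list, finding the feed link in a blog's HTML, could be sketched along these lines. It assumes the page advertises its feed with a link tag that has rel="alternate" and an RSS or Atom type. It uses a rough regular-expression scan rather than a real HTML parser, so treat it as a starting point only. The name discover_feeds is made up for this example.

```ruby
require 'uri'

# MIME types that mark a <link rel="alternate"> as a feed.
FEED_TYPES = ['application/rss+xml', 'application/atom+xml'].freeze

# Return the absolute feed URLs advertised by link tags in an HTML page.
# base_url is the address the page was downloaded from, used to resolve
# relative href values.
def discover_feeds(html, base_url)
  found = html.scan(/<link\b[^>]*>/i).map do |tag|
    # Rough attribute scan: name="value" or name='value' pairs only.
    attrs = tag.scan(/([\w-]+)\s*=\s*["']([^"']*)["']/)
               .map { |name, value| [name.downcase, value] }.to_h
    next unless attrs['rel'].to_s.downcase.split.include?('alternate')
    next unless FEED_TYPES.include?(attrs['type'].to_s.downcase)
    next if attrs['href'].to_s.empty?
    begin
      URI.join(base_url, attrs['href']).to_s
    rescue URI::Error
      nil # skip href values that are not valid URLs
    end
  end
  found.compact.uniq
end
```

A page can advertise several feeds (for example posts and comments), which is why the function returns a list rather than a single URL.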
So we now have a list of RSS feeds for blogs, most of which are likely to provide posts on or related to our topics of interest. Each feed has been tested, and duplicates have been removed.
Next, in AIFeeds Part 2 we will use this list of sources to prepare a summary of posts as a simple static RSS reader.
Remember, all data and code are provided on the AIFeeds GitHub project.



