In Part1, we went through the process of aggregating and filtering a bunch of RSS feeds for Machine Learning, Data Mining, Artificial Intelligence and related topics. In this continuation we will explore the properties of articles listed within a feed and provide a way of generating a static list of recent articles.
Step 1: List articles from the past week
A problem with this effort is that real-world data is full of inconsistencies. Some feeds are missing fields and some articles have fields that do not match expectations. The first step is to explore the data that we have by listing articles from the past week. The listing includes every field that we think might be usefully related to an article, a feed, and its source website.
To start with, the final list of feeds prepared in Part1 was taken and curated based on eyeballing the output in feedlist.html (also generated in Part1). The result is the file curated_feeds.txt with some of the more obviously wrong and non-interesting feeds removed.
Next, a script was prepared to generate a listing of all articles from the last seven days drawn from the feeds in the curated list. Almost all of the information available for each article is shown, so that we can discern what might be appropriate or interesting in further experiments.
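The seven-day filter can be sketched roughly as follows. This is a minimal illustration rather than the code in viewarticles.rb: the Article struct, its field names, and the recent_articles helper are all assumptions made for the example.

```ruby
# Hypothetical article record; the field names are assumptions for illustration.
Article = Struct.new(:title, :url, :published)

SECONDS_PER_DAY = 24 * 60 * 60

# Keep only articles whose timestamp falls within the last `days` days of `now`.
# Articles without a timestamp are dropped because they cannot be placed in time.
def recent_articles(articles, days: 7, now: Time.now)
  cutoff = now - days * SECONDS_PER_DAY
  articles.select do |article|
    article.published && article.published >= cutoff && article.published <= now
  end
end
```

Passing `now` as a parameter keeps the function easy to test against a fixed date.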
It was interesting that some articles had neither a date_published nor a last_updated field. Some articles put their content in the description field and others in the content field. These may be artifacts of the chosen RSS library or of the feeds themselves; more investigation is needed. There are also Unicode issues, with strange characters appearing where apostrophes and quotes should be. Again, it is not clear whether this is caused by the library or whether some custom parsing is needed. Nevertheless, as a static RSS listing the result is very basic.
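One way to cope with the missing and misplaced fields is to normalize each entry before rendering it. The sketch below assumes each entry arrives as a Hash keyed by the field names mentioned above; the shape a particular RSS library actually returns may differ, and normalize_entry is a hypothetical helper, not part of the project scripts.

```ruby
# Normalize a raw feed entry into a predictable shape.
# The Hash keys are assumptions based on the fields discussed in the text.
def normalize_entry(entry)
  # Prefer the publication date, fall back to the update date, else nil.
  date = entry[:date_published] || entry[:last_updated]

  # Prefer the content field, fall back to the description.
  # Blank strings are treated the same as missing values.
  body = [entry[:content], entry[:description]].find do |text|
    text && !text.strip.empty?
  end

  { title: entry[:title].to_s.strip, date: date, body: body.to_s }
end
```

Entries that still have a nil date after this step can then be skipped or listed separately, depending on how strict the listing should be.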
See viewarticles.rb
The following provides a sample screenshot from the last time I executed the script to produce the articleview.html file.
Step 2: Summarize articles by day
In recent months I have become a big fan of the Hacker Newsletter. It is basically an email sent once per week that contains a summary of the week's top stories on the website Hacker News, with links to comments as well as other relevant and interesting sections. I like the layout of the email and I think it provides a good pattern to head towards in this project.
The previous step explored how to filter the feeds and how to extract the relevant information from each article, but it also demonstrated how ugly rendering the raw data can be. One could invest effort in replicating an RSS reader, or do something simpler and move toward a better summary view of the same data. This step explores such a simplified view, where each article is presented on one line.
A script was prepared to produce an article summary for the last seven days. The same filtering was performed as in Step 1, but rather than writing out a long list of articles, the script groups articles by date and lists only the article title and source webpage. The result is a compact listing of recent AI and Machine Learning blog posts. The resulting page also provided clearer insight into the nature of the blogs included in the master list. As a result, the curated list of feeds was further culled.
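The grouping step might look something like the sketch below. It is not taken from listarticles.rb; the Post struct, its field names, and summarize_by_day are hypothetical and exist only to illustrate the idea of one line per article, grouped by day.

```ruby
# Hypothetical article record; the field names are assumptions for illustration.
Post = Struct.new(:title, :source, :published)

# Group posts by calendar day (newest day first) and reduce each post
# to a single "title (source)" line, newest post first within a day.
def summarize_by_day(posts)
  dated = posts.select(&:published) # undated posts cannot be grouped
  grouped = dated.group_by { |post| post.published.strftime('%Y-%m-%d') }

  grouped.sort_by { |day, _| day }.reverse.map do |day, items|
    lines = items.sort_by(&:published).reverse.map do |post|
      "#{post.title} (#{post.source})"
    end
    [day, lines]
  end
end
```

The result is an array of `[day, lines]` pairs that can be written straight into an HTML template with one heading per day.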
See listarticles.rb
The following figure provides a screenshot from the last time I executed the script to produce the articlelist.html file.
Improvements and Extensions
This section lists some potential improvements and extensions to this part in the series.
- Curating the master list of source feeds by hand is a bad idea. It seems feasible that all articles from each feed could be scanned and their appropriateness to the project determined using heuristics on the structure or content (keywords).
- Unicode characters in the feeds need to be rendered correctly. This still affects the titles of the articles and the blogs themselves.
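As a rough illustration of the keyword heuristic suggested in the first item, a feed could be scored by the fraction of its articles that mention at least one topic keyword. The keyword list, the 0.3 threshold, and both helper names below are assumptions chosen for the example, not values used by the project.

```ruby
# Illustrative keyword list; the real vocabulary would need tuning.
KEYWORDS = [
  'machine learning', 'data mining',
  'artificial intelligence', 'neural network'
].freeze

# Fraction of a feed's article texts that mention at least one keyword.
def relevance(texts, keywords = KEYWORDS)
  return 0.0 if texts.empty?

  hits = texts.count do |text|
    lowered = text.to_s.downcase
    keywords.any? { |keyword| lowered.include?(keyword) }
  end
  hits.to_f / texts.size
end

# Keep a feed when enough of its articles look on-topic.
# The default threshold is an arbitrary starting point.
def keep_feed?(texts, threshold: 0.3)
  relevance(texts) >= threshold
end
```

Scoring titles and bodies together, rather than titles alone, would probably reduce false negatives for blogs with terse headlines, at the cost of more noise.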
Next in Part3 we explore additional social-web based methods for filtering the list of articles and promoting those that are more likely to be interesting to the reader.
Don't forget, all the code and data for this series can be downloaded from the AIFeeds github project.




