Thursday, April 28, 2011

AIFeeds Part 3: Intelligently filter AI posts by social impact

In Part1 we learned how to prepare a master list of Artificial Intelligence and Machine Learning RSS feeds. In Part2 we learned how to process articles from the list of feeds and present them as either a fire-hose or in a more compact day-based manner. In this part we explore how we may filter articles by their perceived social impact.

As with Part2, the curated_feeds.txt file is used as the basis for locating articles to consider listing.

Step 1: Computing Social Impact
Computing social impact for a given URL involves determining how many times it has been shared, up-voted, stumbled, etc. on popular social networking websites. All of the large sites offer APIs, and almost all of those APIs return JSON.

I started out thinking I would need a Ruby gem for each social networking site: one for Twitter, one for Facebook, and so on. Looking around, I found a few gems here and there, but most were quite outdated. Because most of the APIs return JSON, I figured I could just query each one directly. I started reading the API docs for various sites and quickly got bogged down. For example, searching for a URL on Twitter is not enough, because people share links through URL shortening services, and these services are quite pervasive.

The requirements here are quite specific: For a given article URL, how popular is it in a given social networking website? After a little more searching around, I came across a very useful website on Shared Count that provides a listing of the JSON API calls needed to answer this question across a host of social networks. I combined this with some of the API code I had already written and built a little URL scoring sub-system.

Given that all of the APIs we are dealing with are JSON based, we need a dependable way to download and parse JSON. I installed the json gem and wrote a function for making arbitrary JSON API calls (HTTP GET requests) with the same kind of socket timeout handling that was used in Part1 for RSS downloading.

See parsejson.rb
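As a rough illustration of the idea (not the contents of parsejson.rb), a JSON GET helper with timeouts might look like the sketch below. The function names fetch_json and parse_json, the timeout values, and the choice to return nil on failure are all assumptions made for this example.

```ruby
require 'net/http'
require 'openssl'
require 'json'
require 'uri'

# Parse a JSON body, returning nil instead of raising on malformed input.
def parse_json(body)
  JSON.parse(body)
rescue JSON::ParserError
  nil
end

# Issue an HTTP GET and return the parsed JSON, or nil on any failure.
# The default timeouts (in seconds) are illustrative assumptions.
def fetch_json(url, open_timeout: 5, read_timeout: 10)
  uri = URI.parse(url)
  http = Net::HTTP.new(uri.host, uri.port)
  http.use_ssl = (uri.scheme == 'https')
  http.open_timeout = open_timeout   # limit on establishing the connection
  http.read_timeout = read_timeout   # limit on waiting for response data
  response = http.request(Net::HTTP::Get.new(uri.request_uri))
  return nil unless response.is_a?(Net::HTTPSuccess)
  parse_json(response.body)
rescue Timeout::Error, SocketError, SystemCallError,
       OpenSSL::SSL::SSLError, URI::InvalidURIError => e
  # Net::OpenTimeout and Net::ReadTimeout are subclasses of Timeout::Error.
  warn "JSON request failed for #{url}: #{e.class}"
  nil
end
```

Returning nil rather than raising lets a caller treat a slow or broken service as "no score" and carry on with the remaining services.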
Next, the query formatting and data processing needed for each social networking website was prepared and tested on top of the JSON downloading and parsing code. The scoring of a URL against a given API was kept to one file per service, which leaves room for any special handling (like MD5 digests in the case of Delicious) and for spot checks.
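To give a feel for what one of these per-service files involves, here is a minimal sketch of a scorer that identifies the article by the MD5 digest of its URL. The endpoint template and the total_posts field are hypothetical placeholders, not the real request or response format of any service.

```ruby
require 'digest/md5'

# Hypothetical endpoint template; a real service would have its own URL.
ENDPOINT_TEMPLATE = 'http://api.example.com/urlinfo/%s'

# Build the query URL, identifying the article by the MD5 hex digest
# of its address rather than by the address itself.
def query_for(article_url)
  format(ENDPOINT_TEMPLATE, Digest::MD5.hexdigest(article_url))
end

# Pull a count out of an already-parsed response.
# Assumed shape: [{"total_posts" => 12}], or an empty array when unknown.
def score_from(parsed)
  return 0 unless parsed.is_a?(Array) && parsed.first.is_a?(Hash)
  parsed.first.fetch('total_posts', 0).to_i
end
```

Keeping the query building separate from the response handling makes each half easy to spot check without touching the network.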

API handling was written for 8 services (each in its own Ruby file): Delicious, Digg, Facebook, Google Web Search, Google Buzz, Reddit, StumbleUpon, and Twitter.

Step 2: Article Scoring
The next step is to combine the scoring for a given URL. There are many ways to do this and the more thought and experiment put into the scoring, the more meaningful the scores will become.

The chosen method here is a simple sum of the scores. This first approximation will provide a number that can be compared, but the relation between the scores in the sum from different websites is unknown. For example, a large score on Facebook or Twitter likely has more meaning than a large score on the Google web search. This is an obvious area for further research.

A URL scoring function was prepared that evaluates a given address against each website in its own thread. Digg was dropped from the canon of scoring functions because the results did not appear to be meaningful and the API was always slow to respond.
See scoreurl.rb
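The threaded simple-sum approach described above can be sketched as follows. This is an illustration under assumptions rather than the code in scoreurl.rb: the score_url name, the hash of callable scorers, and the rule that a failing service counts as 0 are all choices made for this example.

```ruby
# Score a URL against every service concurrently and sum the results.
# `scorers` maps a service name to any callable that takes a URL and
# returns a count (in practice, a wrapper around that service's JSON API).
def score_url(url, scorers)
  threads = scorers.map do |name, scorer|
    Thread.new do
      begin
        [name, scorer.call(url).to_i]
      rescue StandardError
        [name, 0] # assumption: a failing service contributes nothing
      end
    end
  end
  # Thread#value waits for the thread to finish and returns its result.
  per_site = threads.map(&:value).to_h
  [per_site.values.sum, per_site]
end
```

Because each API call spends most of its time waiting on the network, running the calls in parallel keeps the total time close to that of the slowest service rather than the sum of all of them.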
Step 3: List Popular Articles
Now that we can score articles by their impact on the world, we can combine the scoring information with the list of articles posted in the last week to create a list of popular articles.

Starting with the viewarticles.rb and listarticles.rb scripts prepared in Part2, an updated list of popular articles can be created. In this case, a score is computed for each article and all articles are listed in order of score, highest first. This promotes the articles that are expected to be interesting to the top of the page, with the intent that they are read first.
See listpopulararticles.rb
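The ordering step itself is small. A minimal sketch, assuming a hypothetical Article struct that already carries its score, and breaking ties by publication time (a choice made here, not taken from the script):

```ruby
# Hypothetical article record; `published` is a Time, `score` an Integer.
Article = Struct.new(:title, :url, :published, :score)

# Highest score first; among equal scores, the most recent article first.
def rank_by_score(articles)
  articles.sort_by { |a| [-a.score, -a.published.to_i] }
end
```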
The following provides a screenshot from the last time that the script was executed showing an example of the resulting populararticles.html file.

Step 4: Combine List by Day and Popular Articles

The resulting listing of posts is long: it contains 7 days' worth of articles. Additionally, a closer look at the posts in the list shows that most of them have a score of 0. It is expected that the older a post is, the larger its potential to have a meaningful score, because it has had more time to be indexed and shared. It is also expected that most articles posted within the last day or two will not have a score. In fact, it is expected that most AI articles in general will not have a score - it is a niche field of interest after all.

In this step we separate "popular" articles (those with a score) from the "unpopular" articles (those without a score). Popular articles are listed at the top of the page and the remaining articles that have not achieved a score yet are listed below the popular articles, broken down by day.
See listpopulardayarticles.rb
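The split described in this step can be sketched as below. It is an illustration only: the DatedArticle struct, the split_popular name, and the newest-day-first ordering are assumptions for this example rather than details from listpopulardayarticles.rb.

```ruby
require 'date'

# Hypothetical article record; `published` is a Date, `score` an Integer.
DatedArticle = Struct.new(:title, :published, :score)

# Returns [popular, days]:
#   popular - articles with a score above 0, highest score first
#   days    - [date, articles] pairs for the unscored rest, newest day first
def split_popular(articles)
  popular, unscored = articles.partition { |a| a.score > 0 }
  popular = popular.sort_by { |a| -a.score }
  days = unscored.group_by(&:published).sort_by { |day, _| day }.reverse
  [popular, days]
end
```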
The following screenshot shows an example of the resulting populardayarticles.html generated by the script.

Improvements and Extensions
In this section a number of improvements and extensions to this part are summarized.
  • An important improvement to this part is the assessment of the URLs. Specifically, the permalinks reported by some Atom and FeedBurner feeds point to a proxy that redirects to the page. The effect is that the popularity of the proxy URL, rather than the permalink itself, is assessed against the social networking websites, resulting in incorrect scores. This may be addressed by following the redirect in the proxy URLs or by using a different feed parsing library that has knowledge of the different feed types.
  • A natural first extension to this part is to add scoring from more social networks and sources. For example: Hacker News, Google Blog Search, Friendfeed, etc.
  • The scoring used is very simplistic. A good extension would be to make better use of the information available for each URL and combine it in more intelligent ways, such as a weighted sum that reflects the relative importance of each site, or extra weight for comments and up-votes.
  • Additional meta information about a given URL on social networking sites could be used in conjunction with an intelligent parsing of the articles themselves to determine broader categories that articles could be put into. A new list could be constructed with articles by category and popularity.
  • The generated HTML pages will only ever contain AI articles that are actually in the source RSS feeds. A useful extension of this project would be to monitor social networking sites and other news sites for articles on AI and Machine Learning and include them in the mix. 
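As one example of the scoring extension suggested in the list above, a weighted sum could replace the simple sum. The sketch below is only a starting point: the weights are invented for illustration and would need to be tuned by experiment, and the weighted_score name and per-site hash are assumptions.

```ruby
# Illustrative weights only; these values are assumptions, not measurements.
SITE_WEIGHTS = { 'facebook' => 2.0, 'twitter' => 2.0, 'google' => 0.25 }.freeze
DEFAULT_WEIGHT = 1.0 # applied to any site without an explicit weight

# Combine per-site counts (site name => count) into one weighted score.
def weighted_score(per_site, weights = SITE_WEIGHTS)
  per_site.sum { |site, count| weights.fetch(site, DEFAULT_WEIGHT) * count }
end
```

With these example weights, a share on Facebook or Twitter counts eight times as much as a Google web search hit, which reflects the earlier observation that the former likely carry more meaning.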
In this part we looked at the social interest of each article from the last 7 days and used that information to promote "popular" articles to be read above "unpopular" articles. The dynamic nature of the social scoring function means that popularity can be re-calculated each time the script is executed, responding to changes in the spread and sharing of a given story.

In the next and final article in the series we will look at automating the generation of the AIFeed and disseminating it via email and the web.

Don't forget that all code and data for this series is available for free on the AIFeeds github project.
