Tuesday, September 1, 2009

Competition hacking with Ruby and Objective-C

During August, github ran a recommendation engine contest. An objective of the contest was to generate a number of open source projects (on github) providing white label recommendation systems (a noble goal). A suite of github user data was provided including ~120K repository definitions and a corpus of nearly half a million user-repository relationships. Entries had to be hosted on github and released under an open source license. Assessment involved making predictions against a test set of more than 4700 users of between 0 and 10 recommended repositories. Submission involved setting up a post-commit hook for your codebase and committing a results.txt to your repository. The results file was automatically evaluated and you score updated on the competition leaderboard.


I tracked the announcement of the competition and associated commentary. Around mid-August I got an itch an decided to hack on the contest.

After downloading the 4MB dataset, I started by writing some ruby code to parse the files and load them into an RDBMS. Halfway through munging (active record outside of rails) I realized the stupidity of the approach. I was managing for (the unnecessary) scarcity of memory. Even with bloated datatypes, the 4MB of provided data could easily fit into main memory many times over. After having participated in many of these AI-programming competitions in years past, I felt severely out of practice an quickly pushed on.
I chose ruby because I felt I still had much to learn about it and hadn't had a real chance to use it in anger, other than some rails hacking. I bashed out some code to parse the files and build a simple data model in memory. I started processing the data model with a k-NN toward making basic predictions and quickly hit the wall. I got some strange results with my hashmap's and began to mistrust my assumptions about Ruby's dynamic typing, resulting in lots explicit casts and other savage hacking - an artifact of my inexperience.

The real problem was speed. Runs were taking waaaay toooo loooooooong on my 1920's era iBook. I started looking into optimizing the code, compilation options and even tried running on updated and optimized builds of the Ruby interpreter. I got improvements, but not enough.

The thing about prototyping is that you need to hack around to find a stable definition of the problem your solving while defining your solution. I could have hacked together my algorithms in C and called them from ruby, but I was not confident on exactly what I wanted to do. Having been breast fed on ANSI C, I had a feeling for how long these small files should take to load from disk and how long the should take take to crunch for crazy n^2 experiments. I dropped tools and switched to another language I've been looking for an excuse to use in anger, Objective-C.

Writing header files in C++ always felt like duplication of effort to me. Not only did I get this feeling with Objective-C, but it was far worse. Coming directly from ruby, the Objective-C verbosity was distressingly laughable! I had used Objective-C before, but I was improvising then, rather than directed and problem solving. Finally, the Objective-C standard library (Cocoa) is painfully minimal compared to Java or C++ (at least for simple things like string manipulation and collections).

I hung in there. I got used to the bracket syntax and the approach memory management was quite familiar. I aggressively used memory pools in data processing experiments, although I often felt the desire to peer into the framework code to see exactly what kind of wasteful things it was doing in the name of convenience. I had the second edition of Kochan's "Programming in Objective-C 2.0" open on the desk - which I found almost useless. What I needed was a API description and a summary of gotchas - basically an O'Reilly book. As I mentioned, my machine is a piece of shit, so I use VI or textmate, not XCode. I tried to use the Apple interface to the Cocoa API docs online, but their organization was useless (learn from javadocs already!). I resorted to googling class names and my ad hoc functional requirements to figure out how to do things.
Although I'd read some text books on datamining (like 10 years ago) and wrote some plugins for WEKA, I knew very little about recommendation systems and collaborative filtering. I'd followed the Netflix competition, reading the papers and such (I even ran some simple stats over the provided dataset loaded into a Postgresql instance). Also, I'd read up on SVD before for some text processing hacking, and knew it could have a role to play in feature detection.

My general approach involved thinking hard about the data and the domain, proposing indicators and heuristic predictions, and testing them via direct submissions. I partitioned indicators and heuristics at three levels: user, group (using a k-NN), and global. I tried rank based indicators (for example: global repository watch rank) which performed poorly. I then tried some probabilistic indicators (for example: probability of this user watching a repository with this owner) which got much better results. I tried indicators to confirm/disapprove assumptions then played with the weighting factors for combining them.

I was in the competition to hack (string code together outside of work) and to learn. To learn more about these languages, and to learn something about practical collaborative filtering (building one from scratch). The beauty of this competition was the openness. Not everyone checked in code during the competition, submitting just the results.txt. Others checked in most or their entire code base (I sure as hell did), and this provided an opportunity to poke around the indicators and algorithms used by others.

For me, the most useful lessons from poking around others code was how to structure a flexible recommendation pipeline. Algorithms are awesome, and there is a lot of work in tuning them to the dataset, but having well-structured recommendation pipeline allows you to add in new indicators, heuristics, algorithms and pre-processing steps with minimum disruption. The readme associated with jeremybarnes' project was very insightful for both the algorithms one should be thinking about, as well as the concerns on the information processing pipeline one should be building.

The pipeline I converged to was something like the following:
  1. Parse sources, build first order data model
  2. Second order data model processing (stats, clustering, etc)
  3. Process test users one at a time
  4. Generate a candidate set (sub-set of all possible recommendations)
  5. Filter candidate set
  6. Apply indicators (blended, voting, whatever)
  7. Sort by scoring, and recommend
There may be some generality in this pipeline, although it's clearly specific to smalls scale recommendation engines, not for anything of scale (at least not in this form). Also, serializing the results of second order data model processing is also a great speedup.

I thought I could get some gains by building deep prediction models. Objective-C provides a convenient bridge to Java by booting an in-process JVM. I used this to hack in some classifiers I prepared in WEKA into my pipeline with no appreciable success. It was a fun exercise, and I tried a number of decision tree algorithms and configurations. I'm sure with the selection of appropriate predictive indicators at an appropriate level (cluster or user) that the approach would be great.
Toward the end of the project I started reading introductory papers to collaborative filtering and recommender systems. I'd had plenty of contact with vector space processing, so I was aware of dimensionally reduction approaches and also expected an array of vector distance measures for clustering. I was aware of the phrase 'item-based recommendation' but the realities of the approach were a lot more apparent after piecing together my toy system.

Next time I'll hit the ground running:
  • Use C++ and using in line C functions for raw power and maybe script the pipeline, experiments, or indicator weighting using something fun like Ruby.
  • Use SVD or similar to dimensionality reduction and use the resulting features for user and item clustering. I'll also use an off the shelf implementation.
  • Create an explicit offline test set and an automated testing process. Automation is key here, and cross-validation of the offline test set with statistical tests for significance will be a powerful addition.
  • EC2 server time, contingent on an robust test automation process. EC2 is cheap and worth the pocket change needed to cover a competition or pet project like this one.
  • Test the effects of different clustering algorithms (KMeans, raw k-NN, arrays of distance metrics)
  • Be more scientific regarding the testing of indicators and the testing of indicator combinations. I recorded results, but I could have been more careful, tested more things, and deduced trends that were only found by bugs or by reading other peoples submission code.
  • Visualize the dataset and indicator trends, using R or at the very least WEKA or excel.
  • Optimize indicator combination using gradient decent or a genetic algorithm.
As the deadline approached I tried all kinds of random hacks and managed to get a boost to just over 40% (my personal goal), and then on to 48.57% correct recommendations. I managed to place 15th the night before the competition closed (Aust. time), although ultimately finished with a place of 18th. It seemed that a number of people got involved late in the game using blending approaches on the leader's results files. Fun I'm sure, although not in the same league.

You can see my code (with a summary of my strategy in the readme), a history of my submissions, and the overall competition leaderboard. Github have not yet officially finalized the results and selected a winner.

I had fun, and the project was an excellent catalyst for me to get back into my off-hours programming groove. I'm kicking myself for not getting off my ass and at least participating in the Engine Yard competition or the Mario AI competition, the latter of which I believe is still open (second round).

6 comments:

Jason said...

It seems someone was cleaning house on the leaderboard. I've moved from 18'th to 13th.

Jason said...

"Collaborative Filtering with Ensembles" by igvita - who posted some ensemble results himself although appear to have been stripped from the leader board.

Jason said...

Igvita's post hit slashdot "Collaborative Filtering and the Rise of Ensembles". A few nice comments in the post.

igrigorik said...

Jason, nice writeup. I'm still of strong opinion that the contest was structured in a wrong way (optimizing for the wrong thing), which ultimately translated into fairly poor results. I was hoping GitHub guys would update the rules, alas..

I hope others learn from this contest. While the ranking was off, I think GitHub unintentionally created a much more interesting structure (open code, open results) for future competitions.

Jason said...

Github contest was finalized, results summarized in a post: About the GitHub Contest. Lots of great links and code to look at.

Jason said...

For next time, a great overview of ensemble learning.

Also, think about more efficient data structures (KD trees, B+ trees, BK trees). basically load it into mysql!