Wednesday, January 20, 2010

My New iMac: The Five Year Plan

After a lot of deliberation and research, I purchased a new Apple iMac for my home computer. The deliberation was because of the cost - I needed to rationalize the expense and suitability against the need.

I have been using an iBook G4 since late 2005 as my home machine, and now, a little more than 4 years later, the poor little laptop does not cut it. For comparison purposes, my old machine was a 1.33GHz PPC, with 512MB of RAM, a 12" screen, and a 40GB HDD. According to the iBook Wikipedia entry, it was the last PPC model iBook. It didn't even have the capability to write DVDs, and at the time I purchased it, I was downgrading from a beast of a desktop PC to the convenience and elegance of an Apple laptop.

It was time for an upgrade, and so I bought a top-of-the-line Apple all-in-one desktop machine: the 27" iMac. I purcahsed it pre-upgraded with a 2.8Ghz quad-core Intel Core i7 and an extra 4GB RAM, totalling 8GB. I chose the 64-bit Intel Core i7 based on the comparative performance to the Core i5 and Core 2 Duo in the benchmarks (1, 23), primarily because of the so-called Turbo Boost dynamic performance that pushes the cores up to 3.46GHz and Hyper-Threading which results in up to eight virtual cores. Other specifications include the bluetooth keyboard and magic mouse, 1TB HDD, a DVDR/CDR, ATI Radeon HD 4850 graphics card with 512MB of GDDR3 memory, and a gorgeous LED-backlit 27-inch TFT active-matrix liquid crystal display with IPS technology.

The machine looks beautiful and is a computational beast. Needless to say, I'm happy with the purchase. I've been working on it a lot over the last week, re-building my development environments and testing compilation and execution times, and have been very pleased. I've had none of the alleged problems that have been reported about this model iMac.

I'm in it for the long haul. I've purchased Apples extended warranty (Apple Care) to cover me for three years and expect it to be useful until 2014 or 2015, perhaps with a supplementary laptop purchase between now and then.

Friday, December 18, 2009

A Genetic Algorithm in Google's Go Language

Like many programmers, I tracked the announcement of the Go Programming Language by Google last month, and like many programmers, I've been having a play around with it. Because of the newness of the language, there is very little code to look at to get a feeling for the language. I spent what time I could find looking through the code for the golang API in an attempt to gleam useful idioms. I've started a learning-go project for myself on github to collect links and code from tutorials and examples I come across.

Beyond the syntax, I'm interested in exploring how to make effective use of goroutines and channels in program design. I've spent some free time over the last week or so writing a simple genetic algorithm in go, and now that it compiles and looks ok-enough to talk about, I think that making good use of this language requires a slight shift in thinking about an application toward flows of information.

Having spent so many years researching computational intelligence and evolutionary computation, the genetic algorithm is burned into my brain. It's one of the first simple non-trivial programs I like to build to play-test a new language or platform (for example, see a genetic algorithm in lua from earlier this month).

The algorithm implemented is the classical Genetic Algorithm (GA) applied to the onemax problem, where the system must induce a bitstring of unity (all ones) from a cost function that only specifies the number of ones located anywhere in a candidate solution without specifying where the errors are located. The algorithm uses a bitstring representation, one-point crossover, point mutations, and tournament selection. The file is located in my learning go project, or you can access the go file for the genetic algorithm directly. See below for an embedded gist version of the file (sorry to those in RSS feed readers that don't like javascript).



I started out by trying to build an object-oriented version of the algorithm, but shifted to a procedural implementation for simplicity. Go does have objects, but they involve defining a struct for data and associating functions with the struct. Inheritance is not provided in favor for a compositional pattern. The language is young but it already has (what initially feels like) inconsistent rules for things like semi-colons and multiple ways for doing the simplest things like variable declarations. Nevertheless, the script-language like feel quickly becomes comfortable (compared to some ANSI C which I've been using at work recently).

My algorithm implementation is not optimal or efficient, and it's not event that pretty. What it does show is basically a port of a procedural implementation of a genetic algorithm in go language syntax. The code has examples of objects (solution), multiple return values, some library usage (data structures and random numbers) and some of the concurrent features. A genetic algorithm is embarrassingly parallel, so there are numerous ways to implement the algorithm to exploit this. I chose a simple step-wise multiplexing of operations like fitness evaluation, selection, and reproduction that used goroutnes to issue the work to execute in parallel and used a channel in a barrier pattern to wait for and collect the aggregation of these tasks.

I really like the ease with which one can manage tasks in parallel without resorting to explicit synchronisation blocks and patterns or booting threads. What became apparent while lashing my GA together was that this ease of parallelism promotes a different structure to programming. Rather than writing a linear procedural application and marking-up the parallelisable parts of the code after the fact, one should better design a program to exploit the parallel constructs (no shit!). This is obvious in retrospect, and would have been if I'd sat down and actually designed a parallel simple genetic algorithm.

This leads me to thoughts about the flow of information in the program. These observations should resonate if you've worked with message passing or event-driven infrastructure. Consider the prime number filter provided on the go website (or my implementation), it provides a great example of this type of thinking where goroutines manage the branching in the flow of data down through the filters, and channels bring the results back to where they are needed. The same pattern may be adopted for the GA, where each iteration of the algorithm may be represented as a pipeline of channels and data reservoirs where goroutnes consume, compute, and produce moving candidate solutions through to the next iteration. If I find some time, I'll wire up an example.

Enough from me, go play!

Friday, December 4, 2009

Great Science and Technology Podcasts

I've had a few people asking me what podcasts I consume so I thought I'd put it in a post.

I've been a heavy podcast consumer since about the time I bought my first iPod (a first generation back 2GB ipod nano, bought in 2004 or 2005). I've always listened to podcast's on my run (a few times a week). When working on my research, I used to go for an hour long walk each day for lunch and consume. These days, I'll consume on my daily commute to and from my day job.

In no particular order of priority, here are the awesome podcasts I consume:

Science
I removed Science, and the Naked Scientist feeds from by diet about 2-3 years ago, there was far too much redundancy in the science stories when listened in conjunction with Nature, The Science Show, SciFri, and Science Times.

  • Nature. Weekly science news, and interviews with authors from nature articles. One of the best science podcasts around!
  • Science Show. Australian science show with a veteran science reporter. My favourite because it's great quality with a local twist.
  • Science. Similar format to Nature, although not quiet as good. Format may have changed as I haven't listened to it in quite some time.
  • The Naked Scientists. Great format with 'live' experiments at schools, science news, and interviews. At the time when I was listening to this (a few years ago), there was a lot of overlap with nature, I think because some of the same people were working on both podcasts. Things may be different now.
  • NY Science Times. I am really into the format of this brief podcast. They have science news, health information, and reporting with columnists for the science part of the NYTimes. They also don't just focus on hard science and include health and social sciences.
  • NPR Science Friday. A callback science show on NPR with another veteran science reporter. I have always enjoy this show because of the mixture of hard and softer science shows. This weekly show typically consists of interviews with the scientists involved with the breaking science stories of the week or upcoming science books and shows. The call in questions are typically cringe-worthy.
Technology
  • ITConversations. This is a fire hose of technology conferences. I subscribe to the master feed and filter ruthlessly with my skip-track button. I've also back-listened on many of the O'Reilly conferences. Some (weekly?) fixtures I enjoy the most include: Jon Udell's Interviews With Innovators, Phil Windley's Technometria, and Moira Gunn's tech nation.
  • Robots. News and interviews in the field of robotics. I'm not a huge robot freak, but I really do appreciate the application of artificial intelligence and machine learning techniques in physical systems. These guys are a little amateurish (English as a second language?), although the science is hard the way I like it and their intro soundtrack is awesome!
  • Software Engineering Radio. Interviews and discussions on software engineering topics. I'm new to this one, so I cannot accurately comment. It seems good so far, although sometimes a little dry or textbook (maybe a little too close to my undergrad).
  • CSIRO HAIL Seminar Series. CSIRO is a premiere government research institution in Australian and HAIL stands for Human Factors, Artificial Intelligence, Language Technology. There are occasional gems in this stream on AI and interesting research, although many lectures are in topics I am not that interested in listening through on a run. CSIROpod is a new podcast on the scene from the same organization and I'm itching to give it a try.
  • Singularity Summit. Talks from the annual conference on the technological singularity. The talks provide interesting insight into the current frontiers of Artificial Intelligence and I always get a lot out of them! They can be slow at releasing the audio for the talks, but here are links to the files for 2006, 2007 and 2008.
Other
I also remember listening to a few semesters of biology lectures one year. I can't remember which school, but it was likely somewhere in the mess of MIT Open Courseware. I remember the content being very detailed and learning a lot about plant reproduction, genomes, and detailed critique of Darwin's origin of species.

There may be others, but these are all the great one's that come to mind. If you listen to any similarly awesome podcasts on science or technology, please spreed the word!

Thursday, December 3, 2009

A Genetic Algorithm in Lua

I've been evaluating the Lua Programming Language at work for some embedded project. I really like this scripting language, if for nothing else other than it is so small and lightweight, yet almost as expressive as Ruby.

Anyway, I've created a small 'learning lua' project on github and collected a bunch of useful resources for developers to get started. I also made an attempt to use the language in anger and wrote a simple genetic algorithm in lua. If there are any Lua pro's out there, I'd love to hear about tweaks and idioms I could use to improve the example. I've copied the code into a gist and embedded it below, enjoy:



The genetic algorithm uses tournament selection, one-point crossover, point mutations and a binary string representation. The algorithm is applied to the classical OneMax problem. Some exercises for the reader my include changing the bitstring representation to tables, and to change the organization of the algorithm to use objects in a strategy pattern (for the problem and the algorithm).

Monday, November 30, 2009

Random Project Ideas

I've done very little on the side project front since coming back from china. The trip shifted my focus to more tangible things and I've had trouble finding motivation to work on more abstract projects, especially after putting in eight hours at my day job. I've been using twitter (follow me!) as a release valve, but I yearn for a meaty side project again.

Anyway, I thought I'd describe some moderately sized projects that have been jostling around in my head over the last month:

  • Bot Book. A friend recently posted a list of skills he'd like to acquire before finishing his PhD and one of them was to write a Quake bot using a neural network. I pointed out my old ecosystem project in Quake2, Neural Bot, and UN Bot as places to get started. Briefly looking at those projects got me thinking about how much I was into trying to understand and build quake bots all those years ago. I've recently been working in the area of BDI intelligent agents and I started thinking about comparing and contrasting these two approaches of autonomous control. My vision is to dig up the source code for all of the Quake, Quake2, and Quake3 bots I can find and study each in turn, describing the main sub-systems and most importantly the structures used to realise autonomous control (most commonly as hierarchical finite state machines). The book would be a summary of each of these studies and finally an abstraction of the general approach. The title might be something like: Bots: Lessons about autonomous control from computer game bots (or something more catchy). Each bot represents a different individual programmers approach to the problem, and there were a lot of bots made for these games. I'm really interested in the abstraction of all of these studies contrasted to more academically rigorous BDI, are there some great ideas in there not being exploited as much as they should in other domains?
  • Bioinformatics. I cannot shake my interest in the field. I've been poking around computational biology for many years now, hacking on protein folding prediction, and sniffing around related problems and suitable machine learning algorithms. I've been thinking pretty seriously about getting into a night course such as a graduate diploma or masters program in bioinformatics. The cost is AU$20K upwards, and I look back on the fist full of degrees I've already accumulated and think that maybe a more direct approach would be warranted. Although a night class can provide the structure and commitment to acquire the new skills required to work in the field, I'm sure self education and on the job learning would be enough, given good mentoring. There is no doubt to me that this field is a rising star (unlike applied computational intelligence), even in the technology backwater where I live. I think a reasonable first step would be to bash out a series of blog post (reports) on the field in general, open problems, learning materials, and local/world renown institutions, just to boost general familiarity with the field before more substantial (costs) steps are taken.
  • Canvas (HTML 5.0). The new canvas element in HTML 5 is a graphical context allowing free-form 2D and 3D graphics in the browser (written in javascript). The demos have had me excited for a while now, but it is libraries like processing.js that have pushed me over the edge. Rather than hacking on native iPhone apps, I would much rather realize my dreams of a distributed creature morphology evolution simulator (game) in the browser capable of running of any web-capable device. I'm sure there would be some hard work getting the physics engine (collision detection solver) running fast enough in interpreted javascript, but I believe the problems are solvable, even if one resorts to faster interpreters and faster CPU's to drive them.
  • LISP. Frankly, if you're a programmer you will eventually come to realise LISP is awesome. Although I've started my pilgrimage, I am a long way from mastering, let alone being productive in a dialect. I need to stop dancing around and either build something big and real in the language or to work through a book end-to-end (practical common lisp of course). Both of these strategies have worked for me before in rapidly acquiring new languages. I was thinking of dedicating the Christmas break to one of these tasks, we'll see...
  • Stackoverflow. I consume a lot of podcasts during my commute and while running. I subscribe to itconversations, and occasionally a stackoverflow podcast will find it's way into my playlist. I just about always skip it. The guys maybe smart dudes, or a the very least write about interesting things on occasion, but they produce a really crap podcast. Anyway, their joint venture - the stackoverflow site - is a pretty kick ass programmer Q/A repository. I have not contributed to it at all, but I have started searching various keywords and tags and I think I could contribute a lot, some of which might even be coherent or useful. What's holding me back is the investment. Each answer would be about the same investment as would be required for a small wikipedia entry (which I've been known to write). My problem is whether or not there is any return on that investment (as I think there is with wikipedia). Is it ego-fulfilling research (like making a slashdot/reddit/hackernews comment) or is it truly a contribution to the a community (however niche)?
  • Algorithms Book. I feel guilty. I want to write and publish a book (thesis doesn't count) before I turn 30. My inspired algorithms project was my concerted effort this year towards that goal, a project I started this time last year while job searching. I still have a bunch of copy (in latex) and code (in ruby) on my drive, some of which is on the web. I am less convinced anyone will want (pay for) the book, even after my efforts to optimize its content. Nevertheless, I do still believe it's a book I would want to have on my desk about 5-10 years ago. A book for a hack of a programmer interested in AI. I need to get off my ass, cut the scope of the book, and push out a complete 1.0 to friends (the web?) for comment, or weasel out and kill the project.
  • MapReduce and Evolutionary Computation. The map reduce paradigm for distributed problem solving is interesting enough, but I've noticed more and more machine learning algorithms being ported to the approach to run on platforms like Hadoop. I am specifically interested in the approaches to map evolutionary algorithms to the paradigm as my intuition tells me there are a myriad of ways this could be achieved. I want to read the literature and bash out a few ways in small case studies (ruby) and see if there is something truly interesting to it, or if it really is equivalent to the classical parallel approach. New computational paradigms typically open up interesting avenues on classical approaches, and this has happened before with EC with the implementation of the algorithm to multiple-CPU and multiple-machine computing platforms in the 80's and 90's resulting in a sub-field of diffuse and parallel evolutionary algorithms.
There are some good ideas here, but I need to pick one or two and get to work (thoughts?). I don't want to sit around thinking up cool project ideas forever and end up doing nothing. I'm a maker, I need to build things!

Wednesday, October 21, 2009

Back from China Trip

Well I'm back from a month long holiday in China. The goal was to spend time with my wife's family and then relax on a sort of honeymoon.

We spent the first two weeks in Beijing (east of china), staying in a family apartment in Nanmencang Hutong (near ghost street, the military hospital, and the old grain warehouses from the Yuan Dynasty), only 20 minutes walk from Tiananmen Square and the Forbidden City. We visited many of the sights including the Great Wall at Badaling, the Summer Palace, the Forbidden City, Beihai Park, the Temple of Heaven, the Bird's Nest and Water Cube, and lots of markets (I bought mostly clothes, gifts, and a cool microlight helicopter for pocket change).

The markets were really interesting from the perspective of the selling culture. Every purchase involved a little dance of the seller trying to talk up the product and extract as much cash as possible from the buyer, and the buyer faining disinterest in the product, questioning the quality, and attempting to achieve the lowest price possible. Too aggressive, and the dance abruptly finished. From trinkets to bags of fake polo shirts to fake watches, a balance had to be negotiated to ensure the seller got their margin and the customer walked away happy. Once you achieved a purchase, tested each others metal, you could then bargain for other items the seller was pushing with a lot less resistance.

The interesting part for me was the aggressiveness of the sellers to push their wares, from yelling at you to grabbing you, to putting products in your bag and asking for the money. A 'get a sale at any cost' mentality. I was also very impressed that these young girls could switch languages and be functional (persuasive) so rapidly. Mandarin and English were the main two, but I was cases of girls trying Russian, German, and Spanish. More than that, I could clearly see differences in persuasive strategy between nationalities. More coy and playful with European men, child like to Russian men, confrontational with Chinese women, etc.

Besides the the tourist thing I was watching out for technology and captured some observations:

  • Mobile phones were everywhere from the tourists, to the shoppers in department stalls, to grandmas riding bicycles through the hutong
  • Most phones were simple handsets, although i did see a handful of unrecognizable smart phones and even a few iPhones.
  • Sales staff in small shops and department stalls glued to PC's either watching divx TV shows or on chat clients. I learned that chat is used a lot for direct Q&A with stores. We even used it ourselves to ask questions about flights and tours.
  • A friend spent some time in a poor village where a boy used the hold music on his telecom service number as a radio (an interesting case of exaptation)
  • I saw a heap of fake iPhones in the markets although failed to make the effort to see any in action
October 1st was the national day and marked 60 years since liberation (communists oust the nationals in the Chinese Civil War A.K.A. war of liberation). In the preceding weeks, security at train stations, tourist sites, and in the CBD (down town) was ramping up with x-ray bag checks and soldiers on street corners. Around the square where the action took place there were a clear and strong military presence - I'm talking big guys waring armor with big scary guns. The city went into lock down the day before and even the airport was shutdown on the day of the celebration.

We watched the parade on TV (city residents were instructed to stay indoors, or so I was told) and could see the jets and helicopters buzzing through the city just before they appeared on the telecast. It was quite a thing to watch, and we all had fun watching, eating, and drinking the local beer.

After Beijing we went to Xian (middle of china) and saw the Terracotta Warriors, climbed Hua Shan, and played around in the city for a few days including a bicycle ride around the old city wall which was great fun.

Finally, we spent just under a week in Lijiang (south west near Tibet) in Yunnan province. We stayed in a quiet hotel right in old town. Yunnan has some of the most beautiful geography in China, and Lijiang has one of the most beautiful vistas in the country (at the black dragon pool). Old town is an ancient village that has been converted into a tourist town with old cobbled roads, each with a little canal. At night, the town is lit up with lanterns, and the restaurants are accessed with little bridges - all very romantic and relaxing. We climbed the Jade Dragon Snow Mountain (4500 of its 5500 meters, yes they sell oxygen up there), visited the Tiger Leaping Gorge, and went boating and horse riding in a near by valley.

A thoroughly relaxing time all in all, and I found some time to catch up on some reading and science podcasts which I should blog about soon enough. It's good to be home.

Tuesday, September 1, 2009

Competition hacking with Ruby and Objective-C

During August, github ran a recommendation engine contest. An objective of the contest was to generate a number of open source projects (on github) providing white label recommendation systems (a noble goal). A suite of github user data was provided including ~120K repository definitions and a corpus of nearly half a million user-repository relationships. Entries had to be hosted on github and released under an open source license. Assessment involved making predictions against a test set of more than 4700 users of between 0 and 10 recommended repositories. Submission involved setting up a post-commit hook for your codebase and committing a results.txt to your repository. The results file was automatically evaluated and you score updated on the competition leaderboard.


I tracked the announcement of the competition and associated commentary. Around mid-August I got an itch an decided to hack on the contest.

After downloading the 4MB dataset, I started by writing some ruby code to parse the files and load them into an RDBMS. Halfway through munging (active record outside of rails) I realized the stupidity of the approach. I was managing for (the unnecessary) scarcity of memory. Even with bloated datatypes, the 4MB of provided data could easily fit into main memory many times over. After having participated in many of these AI-programming competitions in years past, I felt severely out of practice an quickly pushed on.
I chose ruby because I felt I still had much to learn about it and hadn't had a real chance to use it in anger, other than some rails hacking. I bashed out some code to parse the files and build a simple data model in memory. I started processing the data model with a k-NN toward making basic predictions and quickly hit the wall. I got some strange results with my hashmap's and began to mistrust my assumptions about Ruby's dynamic typing, resulting in lots explicit casts and other savage hacking - an artifact of my inexperience.

The real problem was speed. Runs were taking waaaay toooo loooooooong on my 1920's era iBook. I started looking into optimizing the code, compilation options and even tried running on updated and optimized builds of the Ruby interpreter. I got improvements, but not enough.

The thing about prototyping is that you need to hack around to find a stable definition of the problem your solving while defining your solution. I could have hacked together my algorithms in C and called them from ruby, but I was not confident on exactly what I wanted to do. Having been breast fed on ANSI C, I had a feeling for how long these small files should take to load from disk and how long the should take take to crunch for crazy n^2 experiments. I dropped tools and switched to another language I've been looking for an excuse to use in anger, Objective-C.

Writing header files in C++ always felt like duplication of effort to me. Not only did I get this feeling with Objective-C, but it was far worse. Coming directly from ruby, the Objective-C verbosity was distressingly laughable! I had used Objective-C before, but I was improvising then, rather than directed and problem solving. Finally, the Objective-C standard library (Cocoa) is painfully minimal compared to Java or C++ (at least for simple things like string manipulation and collections).

I hung in there. I got used to the bracket syntax and the approach memory management was quite familiar. I aggressively used memory pools in data processing experiments, although I often felt the desire to peer into the framework code to see exactly what kind of wasteful things it was doing in the name of convenience. I had the second edition of Kochan's "Programming in Objective-C 2.0" open on the desk - which I found almost useless. What I needed was a API description and a summary of gotchas - basically an O'Reilly book. As I mentioned, my machine is a piece of shit, so I use VI or textmate, not XCode. I tried to use the Apple interface to the Cocoa API docs online, but their organization was useless (learn from javadocs already!). I resorted to googling class names and my ad hoc functional requirements to figure out how to do things.
Although I'd read some text books on datamining (like 10 years ago) and wrote some plugins for WEKA, I knew very little about recommendation systems and collaborative filtering. I'd followed the Netflix competition, reading the papers and such (I even ran some simple stats over the provided dataset loaded into a Postgresql instance). Also, I'd read up on SVD before for some text processing hacking, and knew it could have a role to play in feature detection.

My general approach involved thinking hard about the data and the domain, proposing indicators and heuristic predictions, and testing them via direct submissions. I partitioned indicators and heuristics at three levels: user, group (using a k-NN), and global. I tried rank based indicators (for example: global repository watch rank) which performed poorly. I then tried some probabilistic indicators (for example: probability of this user watching a repository with this owner) which got much better results. I tried indicators to confirm/disapprove assumptions then played with the weighting factors for combining them.

I was in the competition to hack (string code together outside of work) and to learn. To learn more about these languages, and to learn something about practical collaborative filtering (building one from scratch). The beauty of this competition was the openness. Not everyone checked in code during the competition, submitting just the results.txt. Others checked in most or their entire code base (I sure as hell did), and this provided an opportunity to poke around the indicators and algorithms used by others.

For me, the most useful lessons from poking around others code was how to structure a flexible recommendation pipeline. Algorithms are awesome, and there is a lot of work in tuning them to the dataset, but having well-structured recommendation pipeline allows you to add in new indicators, heuristics, algorithms and pre-processing steps with minimum disruption. The readme associated with jeremybarnes' project was very insightful for both the algorithms one should be thinking about, as well as the concerns on the information processing pipeline one should be building.

The pipeline I converged to was something like the following:
  1. Parse sources, build first order data model
  2. Second order data model processing (stats, clustering, etc)
  3. Process test users one at a time
  4. Generate a candidate set (sub-set of all possible recommendations)
  5. Filter candidate set
  6. Apply indicators (blended, voting, whatever)
  7. Sort by scoring, and recommend
There may be some generality in this pipeline, although it's clearly specific to smalls scale recommendation engines, not for anything of scale (at least not in this form). Also, serializing the results of second order data model processing is also a great speedup.

I thought I could get some gains by building deep prediction models. Objective-C provides a convenient bridge to Java by booting an in-process JVM. I used this to hack in some classifiers I prepared in WEKA into my pipeline with no appreciable success. It was a fun exercise, and I tried a number of decision tree algorithms and configurations. I'm sure with the selection of appropriate predictive indicators at an appropriate level (cluster or user) that the approach would be great.
Toward the end of the project I started reading introductory papers to collaborative filtering and recommender systems. I'd had plenty of contact with vector space processing, so I was aware of dimensionally reduction approaches and also expected an array of vector distance measures for clustering. I was aware of the phrase 'item-based recommendation' but the realities of the approach were a lot more apparent after piecing together my toy system.

Next time I'll hit the ground running:
  • Use C++ and using in line C functions for raw power and maybe script the pipeline, experiments, or indicator weighting using something fun like Ruby.
  • Use SVD or similar to dimensionality reduction and use the resulting features for user and item clustering. I'll also use an off the shelf implementation.
  • Create an explicit offline test set and an automated testing process. Automation is key here, and cross-validation of the offline test set with statistical tests for significance will be a powerful addition.
  • EC2 server time, contingent on an robust test automation process. EC2 is cheap and worth the pocket change needed to cover a competition or pet project like this one.
  • Test the effects of different clustering algorithms (KMeans, raw k-NN, arrays of distance metrics)
  • Be more scientific regarding the testing of indicators and the testing of indicator combinations. I recorded results, but I could have been more careful, tested more things, and deduced trends that were only found by bugs or by reading other peoples submission code.
  • Visualize the dataset and indicator trends, using R or at the very least WEKA or excel.
  • Optimize indicator combination using gradient decent or a genetic algorithm.
As the deadline approached I tried all kinds of random hacks and managed to get a boost to just over 40% (my personal goal), and then on to 48.57% correct recommendations. I managed to place 15th the night before the competition closed (Aust. time), although ultimately finished with a place of 18th. It seemed that a number of people got involved late in the game using blending approaches on the leader's results files. Fun I'm sure, although not in the same league.

You can see my code (with a summary of my strategy in the readme), a history of my submissions, and the overall competition leaderboard. Github have not yet officially finalized the results and selected a winner.

I had fun, and the project was an excellent catalyst for me to get back into my off-hours programming groove. I'm kicking myself for not getting off my ass and at least participating in the Engine Yard competition or the Mario AI competition, the latter of which I believe is still open (second round).