While I've been slogging through my current book project, I've had a number of follow-on book project ideas. Rather than letting them rot in my head, I thought I'd leave a note to future me (or anyone else looking for a good side project to tackle).
One of these ideas is to write a workbook on machine learning that uses the Netflix Prize dataset as the focus. The objective of the book is to learn how to apply a range of techniques, at different levels of complexity, to a real problem domain.
The beauty of the Netflix Prize dataset is that there is a lot of information out there that can be drawn together, from detailed forum posts to peer-reviewed publications. This information can be located, sifted and reduced into the book structure, highlighting findings and strategies for addressing the difficult classification/prediction tasks.
The second great aspect of the competition is that it had a clear objective: a 10% improvement in RMSE over the baseline technique (Netflix's own Cinematch system). The narrative of the book can lead the reader from basic 'getting to know the dataset', to technique application (ensembles), to the development and application of advanced tuning regimes, with pit stops in necessary areas such as cross-validation test harness development. Hopefully, the work would climax by putting all the knowledge learned throughout the journey together and building the final prediction system that achieved the winning result (or close to it!).
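To make that objective concrete, here's a minimal sketch of how RMSE and a 10% improvement target could be computed. The function names (`rmse`, `target_rmse`) and the toy ratings are my own assumptions for illustration; none of the numbers come from the competition.

```python
import math


def rmse(predicted, actual):
    """Root mean squared error between two equal-length rating sequences."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("inputs must be non-empty and the same length")
    squared_error = sum((p - a) ** 2 for p, a in zip(predicted, actual))
    return math.sqrt(squared_error / len(actual))


def target_rmse(baseline_rmse, improvement=0.10):
    """RMSE a system must reach to beat the baseline by `improvement`."""
    return baseline_rmse * (1.0 - improvement)


# Hypothetical toy ratings on a 1-5 scale (not real competition data).
actual = [5, 3, 4, 1]
baseline_predictions = [3, 3, 3, 3]   # a naive "always predict 3" baseline
candidate_predictions = [4, 3, 4, 2]  # some improved (made-up) predictor

baseline = rmse(baseline_predictions, actual)
candidate = rmse(candidate_predictions, actual)

print(f"baseline RMSE:  {baseline:.4f}")
print(f"target RMSE:    {target_rmse(baseline):.4f}")
print(f"candidate RMSE: {candidate:.4f}")
print(f"improvement:    {1.0 - candidate / baseline:.1%}")
```

Lower is better, so the goal is to push the candidate's RMSE at or below the target value.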
The book would be fun to write (and read!!!) because there is so much to do and to learn. A practical workbook means that the chapters would be littered with functional explanations of how techniques work, how to analyze the results they produce and relate them to the broader problem, and most importantly (complete!?) sample code for achieving each step along the way.
I'd expect samples would be in SQL (initial dataset explorations), Ruby or Python (for exploring different techniques), perhaps a heavier language like C++ or Java for core/slow techniques that are built up towards achieving the 10% improvement, and perhaps R or similar for result analysis and interpretation. Unfortunately, the source code for the winning approaches is not available, although a best possible approximation of the systems used would be sufficient for demonstration purposes.
Regarding the book structure: I'd like to see a staged hill climb of complexity, where the limit of a direction or an approach is reached by the end of a section or chapter, and a new approach picks up in the following chapter or section where the previous one left off. It's really hard to come up with a preliminary table of contents without first immersing myself in the facts that were explored in this competition, so instead I'll highlight some topics I'd like to see covered in the book:
- The Prize : background on the competition, who participated, and how it played out
- Recommendation : background of collaborative filtering / recommendation systems, approaches typically used and how they work.
- Preliminary Analysis : preliminary stats on the dataset using SQL and/or a scripting language.
- Classical : Application of classical approaches: SVD, rule systems, correlations, etc. - perhaps a range of parametric approaches
- Mainstream : Application of mainstream approaches: SVM, kNN, etc. - perhaps a range of non-parametric approaches
- Ensembles : Combining classifiers, tuning the mixture of experts, etc.
- Meta : Boosting, bagging and related approaches and their benefit.
- Lessons : Extracting key points that could contribute towards building a competitive recommendation or machine learning system (or at the very least for addressing machine learning competitions)
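As a taste of the Preliminary Analysis topic in the list above, here's a rough sketch of the kind of first-pass stats I have in mind, written in plain Python. It assumes the ratings have already been loaded as `(user_id, movie_id, rating)` tuples; that layout, the `movie_stats` helper, and the toy rows are all hypothetical and say nothing about the actual file format.

```python
from collections import defaultdict


def movie_stats(ratings):
    """Return {movie_id: (rating_count, mean_rating)}.

    `ratings` is an iterable of (user_id, movie_id, rating) tuples,
    an assumed in-memory layout chosen for this sketch.
    """
    totals = defaultdict(lambda: [0, 0.0])  # movie_id -> [count, sum]
    for _user_id, movie_id, rating in ratings:
        totals[movie_id][0] += 1
        totals[movie_id][1] += rating
    return {movie: (n, total / n) for movie, (n, total) in totals.items()}


# Hypothetical toy rows, not taken from the real dataset.
ratings = [
    (1, 10, 5), (2, 10, 3), (3, 10, 4),
    (1, 20, 1), (2, 20, 2),
]

stats = movie_stats(ratings)
global_mean = sum(r for _, _, r in ratings) / len(ratings)

print(f"global mean rating: {global_mean:.2f}")
for movie_id, (count, mean) in sorted(stats.items()):
    print(f"movie {movie_id}: {count} ratings, mean {mean:.2f}")
```

The same counts and means could just as easily come from a SQL `GROUP BY`; the point is only to get a feel for how ratings are spread across movies before trying any technique.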
Among the topics you can tease out a general structure of: Data Analysis -> Classical Approaches -> Mainstream Approaches -> Meta Techniques -> Tuning. I would rather the technique sections focus on the actual approaches adopted by teams during the competition. Finding this out will, I believe, require a lot of questioning of participants. From memory, there was heavy use of SVD-based systems, kNN, ensembles, and optimization of ensemble systems.
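To give a flavour of what "optimization of ensemble systems" might look like at its very simplest, here's a sketch that blends two predictors with a single weight and picks the weight that minimizes RMSE on held-out ratings. This is my own illustrative assumption, not a reconstruction of what any team did; the helper names (`blend`, `best_blend_weight`), the grid search, and the toy data are all hypothetical.

```python
import math


def rmse(predicted, actual):
    """Root mean squared error between two equal-length rating sequences."""
    return math.sqrt(
        sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual)
    )


def blend(preds_a, preds_b, weight):
    """Linear blend: weight * A + (1 - weight) * B for each prediction."""
    return [weight * a + (1.0 - weight) * b for a, b in zip(preds_a, preds_b)]


def best_blend_weight(preds_a, preds_b, actual, steps=20):
    """Grid-search the blend weight in [0, 1] that minimizes held-out RMSE."""
    candidates = [i / steps for i in range(steps + 1)]
    return min(candidates, key=lambda w: rmse(blend(preds_a, preds_b, w), actual))


# Hypothetical held-out ratings and two made-up predictors:
# one that consistently over-predicts, one that under-predicts.
actual = [4, 2, 5, 3]
preds_a = [5, 3, 6, 4]
preds_b = [3, 1, 4, 2]

weight = best_blend_weight(preds_a, preds_b, actual)
print(f"best weight for A: {weight}")
print(f"RMSE of A alone:   {rmse(preds_a, actual):.4f}")
print(f"RMSE of B alone:   {rmse(preds_b, actual):.4f}")
print(f"RMSE of the blend: {rmse(blend(preds_a, preds_b, weight), actual):.4f}")
```

The toy predictors are rigged so their errors cancel, which is the whole appeal of blending: models that are wrong in different ways can do better together than either does alone. A real workbook chapter would tune the weights on a proper validation split rather than a four-row list.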
"Movies by the numbers: Practical Machine Learning with the Netflix Prize dataset" (a title plucked from thin air) - coming to a book store near you in 2012, possibly.
If you would like to see this book exist or have an opinion about its content, leave a comment and let me know!