Saturday, March 22, 2008

Mapping 'no free lunch'

An idea I have been rolling around recently with my colleague (Dan) is to race a stack of machine learning algorithms (starting with computational intelligence algorithms on optimization problems). This basic idea has expanded to include notions of full automation of case integration and execution (upload your own systems and problems), and beyond.

Firstly, racing is a bad idea. We have No Free Lunch, and we have 60+ years of advice from related disciplines telling us it is fundamentally anti-intellectual. Frankly, we are not interested in a winner (a silver-bullet black-box algorithm), but rather in the distribution of winners. Specifically, we are interested in using automated statistical tools to collect, maintain, provide, and promote the relationships between quantifiable measures, problems, and algorithms. It's an ambitious goal, as the scope of things to assess in the literature is massive (although finite)! The notion was born out of general observations of the difficulty of reproducing results and the lack of consistency in methods and results. For example, the state of computational intelligence (CI) is so bad that one would be considered crazy these days to rely on the results from another paper (trust is very low).

Many have tried to do similar things. There was StatLog (in fact I was inspired by Duch's result listing), there are competitions, and there are awesome libraries of standard measures and algorithms. Two reasons why I think this has not been addressed on a large scale are (1) there's no money in it, and (2) it's hard, real hard.

Regarding funding, I think the data is valuable. In the first tier, you make the software and results freely available and seek simple advertising revenue. Popularity is promoted bottom-up through publications and symposiums. Maybe there's grant funding that could be funneled in if the 'right perspective' can be devised for a given project. The competitive advantage of the first tier is weak: generation and availability are based on a mishmash of in-house and open source software (academic software is generally public domain). The second tier is about consulting based on the skills and knowledge acquired while developing the resource. Specifically, that means consulting on the application of the technology using in-house tools and methods, and most likely custom-built solutions. It is a hard business to enter, and all the money is in the custom builds. The alternative route is academic, which, given the amount of information that could be mined, would no doubt provide a steady stream of publications.

Regarding difficulty, I think the best way to address the complexity and scope is decomposition with aggressive iteration. With a good design, algorithms, problems, and measures could be added dynamically, where all execution is farmed out in small jobs, and all human analysis is performed with RDBMS queries. The important point is that results are permanent: once a measure is calculated for a run-algorithm combination, it is available for all future analysis. This addresses the enormous problem (as I see it) of duplicated effort, in particular software engineering by scientists (poor software), and research by software engineers (poor method). Further, a good design also instills longevity and self-maintenance into the system. For example, full automation of measure/algorithm/problem submission could be achieved with code reviews, publication evidence, and reputation systems to promote decentralised trust and control in the system.
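To make the 'results are permanent' idea a bit more concrete, here is a minimal sketch using Python's built-in sqlite3 module. Everything in it is an assumption for illustration only: the table names (run, measure_value), the idea that a run is identified by an algorithm, a problem, and a seed, and the helper names open_store, add_run, and measure_once. It is not a design we have settled on, just one way the compute-once behaviour could look.

```python
import sqlite3


def open_store(path=":memory:"):
    """Open (or create) a small results store. The schema is hypothetical."""
    conn = sqlite3.connect(path)
    conn.executescript(
        """
        CREATE TABLE IF NOT EXISTS run (
            id        INTEGER PRIMARY KEY,
            algorithm TEXT    NOT NULL,
            problem   TEXT    NOT NULL,
            seed      INTEGER NOT NULL,
            UNIQUE (algorithm, problem, seed)
        );
        CREATE TABLE IF NOT EXISTS measure_value (
            run_id  INTEGER NOT NULL REFERENCES run(id),
            measure TEXT    NOT NULL,
            value   REAL    NOT NULL,
            PRIMARY KEY (run_id, measure)
        );
        """
    )
    return conn


def add_run(conn, algorithm, problem, seed):
    """Register a run and return its id; registering it twice returns the same id."""
    key = (algorithm, problem, seed)
    conn.execute(
        "INSERT OR IGNORE INTO run (algorithm, problem, seed) VALUES (?, ?, ?)", key
    )
    conn.commit()
    row = conn.execute(
        "SELECT id FROM run WHERE algorithm = ? AND problem = ? AND seed = ?", key
    ).fetchone()
    return row[0]


def measure_once(conn, run_id, measure, compute):
    """Return the stored value of a measure for a run, computing it only if absent."""
    row = conn.execute(
        "SELECT value FROM measure_value WHERE run_id = ? AND measure = ?",
        (run_id, measure),
    ).fetchone()
    if row is not None:
        return row[0]  # already calculated by someone, at some point: reuse it
    value = float(compute())
    conn.execute(
        "INSERT INTO measure_value (run_id, measure, value) VALUES (?, ?, ?)",
        (run_id, measure, value),
    )
    conn.commit()
    return value
```

The point of measure_once is that the compute callback is only ever invoked the first time a given measure is requested for a given run; every later analysis is just a query against the stored value.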

Beyond the funding and the difficulty, the contribution of such an effort would not only be extremely fun, it would raise the level of quality and accessibility of knowledge in the field immediately and permanently, potentially (given the uptake) influencing the way contributions are made or perceived.

3 comments:

Jason said...

I want to clarify the vision a little bit.

Consider the results of a suite of algorithms on a suite of problems with a range of configurations for each system. Such results may be exploited by those professionals or research scientists that choose to not use all available information.

An example is a scenario in which the time and/or expertise is simply not available to either specialise a system for the problem, or transform the problem for a given system. In such a scenario, you want a generalised algorithm that is going to give a good result, quickly. A database of results provides a basis for making an informed (better than random) decision about which system to use, or more importantly, for knowing that a choice even exists.
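As a rough illustration of what an 'informed (better than random) decision' could look like, the sketch below ranks algorithms on each problem and suggests the one with the best average rank. The function names (mean_ranks, suggest), the (algorithm, problem, score) record shape, and the convention that a lower score is better are all assumptions of mine, and average rank is just one simple summary, not a recommendation of method.

```python
from collections import defaultdict
from statistics import mean


def mean_ranks(results):
    """Average per-problem rank of each algorithm.

    results: iterable of (algorithm, problem, score) tuples, where a lower
    score is assumed to be better. Ties are broken arbitrarily, and an
    algorithm is only averaged over the problems it was actually run on.
    """
    by_problem = defaultdict(dict)
    for algorithm, problem, score in results:
        by_problem[problem][algorithm] = score

    ranks = defaultdict(list)
    for scores in by_problem.values():
        ordered = sorted(scores, key=scores.get)  # best (lowest) score first
        for position, algorithm in enumerate(ordered, start=1):
            ranks[algorithm].append(position)

    return {algorithm: mean(r) for algorithm, r in ranks.items()}


def suggest(results):
    """Suggest the algorithm with the lowest (best) average rank."""
    ranking = mean_ranks(results)
    return min(ranking, key=ranking.get)
```

Note that this deliberately hides the distribution of winners behind a single number; in practice you would want to look at the per-problem ranks as well before trusting the suggestion.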

Jason said...

It occurs to me that Amazon's Elastic Compute Cloud (EC2) would be an excellent platform on which to deploy this project, particularly given the proposed cost-effective strategy to drive traffic to CPU in cloud computing.

The project requires a lot of storage as well, although results can be compressed and summarised using statistical tools, pushing effort into problem-algorithm executions.

Jason said...

The project would also make extensive use of the Google Visualization API.