The recent announcement of the Google Prediction API caught my attention. The service is interesting because its business model centers on providing a scalable machine learning black box that can be used directly or integrated into an application. As described, the user first uploads a dataset to the also newly announced Google storage service, trains an opaque model from the data, and then derives predictions from the trained model.
One interacts with the service using a RESTful API, performing POST and GET HTTP operations to invoke the training and prediction functions. Data must be provided in CSV format, as line-based records of comma-separated values. Training seems to be per data bucket, and it is unclear whether models can be updated once trained, whether the models can be retrieved, or even what types of machine learning algorithms and algorithm parameters will be used. The service describes support only for supervised classification tasks at this stage.
Data types are limited for now to categorical prediction (classification) with real-valued and textual inputs. Naturally (this is Google), data records can comprise very long lists of attributes and datasets can be enormous. The status of model training can be queried, and some basic statistics can be retrieved from the trained model, namely the classification accuracy, which is determined using cross-validation on the provided training data. Predictions are made through a query interface by passing in the input attributes and retrieving the classification. Presumably there is a batch mode where multiple records can be passed in for classification.
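To make the cross-validation idea concrete, here is a minimal Java sketch of k-fold cross-validated accuracy. It says nothing about how Google actually computes its statistic; the class and method names are hypothetical, and the "model" is just a majority-class baseline chosen to keep the example short.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Sketch of k-fold cross-validated accuracy; all names here are hypothetical. */
final class CrossValidationSketch {

    /** Baseline "model": always predicts the most frequent training label. */
    static String majorityLabel(List<String> labels) {
        Map<String, Integer> counts = new HashMap<>();
        String best = null;
        int bestCount = 0;
        for (String label : labels) {
            int c = counts.merge(label, 1, Integer::sum);
            if (c > bestCount) {
                bestCount = c;
                best = label;
            }
        }
        return best;
    }

    /** Fraction of held-out records classified correctly, pooled over k folds. */
    static double crossValidatedAccuracy(List<String> labels, int k) {
        int correct = 0;
        for (int fold = 0; fold < k; fold++) {
            List<String> train = new ArrayList<>();
            List<String> test = new ArrayList<>();
            for (int i = 0; i < labels.size(); i++) {
                // Record i is held out in fold (i mod k) and used for training otherwise.
                (i % k == fold ? test : train).add(labels.get(i));
            }
            String predicted = majorityLabel(train);
            for (String actual : test) {
                if (actual.equals(predicted)) {
                    correct++;
                }
            }
        }
        return (double) correct / labels.size();
    }

    public static void main(String[] args) {
        List<String> labels = Arrays.asList("spam", "ham", "spam", "spam", "ham", "spam");
        System.out.println(crossValidatedAccuracy(labels, 3));
    }
}
```

Every record is held out exactly once, so the reported accuracy is always measured on data the model did not see during training.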
All this information was distilled from the service page and is more than likely to change. The service is not available yet, but I have signed up to the waiting list for early access to help burn it in. Billing will likely converge on some function of storage size, the volume of retrieved predictions, and maybe even model compute time.
To me it feels like they have abstracted the process used to build the language translation service or the spell checker/corrector, simplified it, and are turning it into a commodity. Big data, rather than fancy algorithms, is 'where it is at' (see Norvig's Theorizing from Data from 2007).
The service is loosely related to two other services out in the wild. The first is TunedIT, an algorithm/dataset/challenge website launched in September 2009 (see a press release). The site allows the uploading of datasets and/or algorithms and, more importantly, the design of dataset challenges like the Netflix Prize. This seems to be the primary function of the site, and to me it is trying to exploit the success of the Netflix Prize by abstracting it and providing the management of such challenges as a service (not a terrible idea). The other site is MLcomp, which launched in April 2010 (see a press release) and is focused on users either uploading datasets to find the algorithm that performs best, or uploading algorithms and having the system automatically evaluate them against all previously uploaded datasets. To me, it feels like an online version of the WEKA machine learning workbench (not a terrible idea if your market is other grad students). Both sites are really aimed at machine learning practitioners and, unlike the announced Google service, don't seem to offer a useful way to exploit the algorithms for private data sources.
I had similar ideas while studying as a graduate student, although I had the lofty scientific ambition of automatically mapping the performance of a large suite of function optimization algorithms rather than function approximation (machine learning) algorithms: something like an optimization version of MLcomp. I even blogged a little about it after I completed my dissertation (see Mapping 'no free lunch'). The targeted value proposition of the Google prediction service is an excellent approach, and people may even pay to use it.
Although the algorithm hackers and researchers will want to know all about the algorithms and their parameterization, I hope that the service remains a black box (shock, horror!). Maybe not, but I would hate to see this devolve into an algorithm free-for-all that would confuse users and muddy the value this service could deliver. With Google-level infrastructure, they can run a suite of the top 20 techniques for a given problem type, deliver the best (or an ensemble) to provide the predictions, and keep the specific details of the magic that produced the model a secret. That is what I would do. And if this is indeed the adopted strategy, then I doubt we will see a "download model" API call anytime soon.
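The "run a suite of techniques and deliver the best" strategy is pure speculation on my part, but it is easy to sketch in Java. The snippet below is only an illustration under that assumption: the candidate models, the validation data, and every name in it are hypothetical, and it picks a single winner rather than building an ensemble.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

/** Sketch of "try several techniques, keep the best"; all names are hypothetical. */
final class BestModelSketch {

    /** A labeled record: numeric input attributes plus the known class. */
    static final class Example {
        final double[] inputs;
        final String label;

        Example(double[] inputs, String label) {
            this.inputs = inputs;
            this.label = label;
        }
    }

    /** Fraction of validation examples that a candidate classifies correctly. */
    static double accuracy(Function<double[], String> model, List<Example> validation) {
        int correct = 0;
        for (Example e : validation) {
            if (model.apply(e.inputs).equals(e.label)) {
                correct++;
            }
        }
        return (double) correct / validation.size();
    }

    /** Returns the name of the candidate with the highest validation accuracy. */
    static String selectBest(Map<String, Function<double[], String>> candidates,
                             List<Example> validation) {
        String bestName = null;
        double bestAccuracy = -1.0;
        for (Map.Entry<String, Function<double[], String>> c : candidates.entrySet()) {
            double a = accuracy(c.getValue(), validation);
            if (a > bestAccuracy) {
                bestAccuracy = a;
                bestName = c.getKey();
            }
        }
        return bestName;
    }

    public static void main(String[] args) {
        List<Example> validation = Arrays.asList(
                new Example(new double[] {0.9}, "pos"),
                new Example(new double[] {0.2}, "neg"),
                new Example(new double[] {0.7}, "pos"));

        // Two toy candidates standing in for a real suite of learning algorithms.
        Map<String, Function<double[], String>> candidates = new LinkedHashMap<>();
        candidates.put("alwaysPos", x -> "pos");
        candidates.put("threshold", x -> x[0] > 0.5 ? "pos" : "neg");

        System.out.println(selectBest(candidates, validation));
    }
}
```

The caller only ever sees the winning model's predictions, which is what would let the provider keep the underlying techniques secret.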
Friday, May 21, 2010
Machine Learning as a Service: The Google Prediction API
Wednesday, May 19, 2010
Aspects as a programming aid
I have recently been playing around with Aspect Oriented Programming, specifically AspectJ, as a tool for Java development. Like many Java programmers, I played around with AOP when it was going through the hype cycle in 2002 and 2003. Since then, it seems that the use of aspects as a production-level tool has been on the rise, even if that use is indirect, such as in a JEE container like Spring or a framework like Spring Roo.
The initial idea of Aspects sounds cool: you can weave in code that queries the dynamically created abstract syntax tree of a Java application, intercepting method calls and injecting new functionality. Your aspects are modular, they are applied wherever your pattern-matching queries match (interception), and you don't have to touch (pollute) your Java source. The coolness wears off when you realize that your pattern-matching queries are limited to high-level constructs like packages, classes and methods, and that the ability to play with the actual code and context that you intercept is practically not available.
I had been tinkering with some tracing aspects on and off for the last month or two, but with no real enthusiasm. The other day I listened to a podcast on AspectJ and Spring AOP with Ramnivas Laddad, the author of AspectJ in Action. The podcast got me excited again, and I realized that it may in fact be directly useful to help with some sticky race condition bugs on which I've been burning brain cycles.
One aspect in particular that I'm sure others will find useful is a tracer that detects violations of the 'single thread rule' in Swing, that is, cases where threads other than the Event Dispatch Thread touch Swing objects. The snippet I've been playing around with is based on a code example provided by Anders Prisak in Using AspectJ to detect violations of the Swing single thread rule. Alexander Potochkin provides some additional details in Debugging Swing, the final summary.
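The core check that such a tracer performs can be sketched in plain Java. This is not the AspectJ snippet mentioned above, just a minimal illustration with hypothetical class and method names. An aspect would run a check like this as advice around calls into Swing classes, whereas here it is called by hand.

```java
import javax.swing.SwingUtilities;

/** Sketch of the check a single-thread-rule tracer performs; names are hypothetical. */
final class EdtCheckSketch {

    /**
     * Returns a violation message if the calling thread is not the
     * Event Dispatch Thread, or null if the call is on the EDT.
     */
    static String checkOnEdt(String what) {
        if (SwingUtilities.isEventDispatchThread()) {
            return null;
        }
        return "Swing single thread rule violation: " + what
                + " called from thread '" + Thread.currentThread().getName() + "'";
    }

    public static void main(String[] args) throws Exception {
        // Called from the main thread, so this reports a violation.
        System.out.println(checkOnEdt("JLabel.setText"));

        // The same check run on the EDT finds nothing to report and returns null.
        SwingUtilities.invokeAndWait(
                () -> System.out.println(checkOnEdt("JLabel.setText")));
    }
}
```

Like the aspect, this is crude: some Swing methods are documented as safe to call from any thread, and a check this simple would flag those calls too.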
The cases the aspect catches are crude, but they provide a good starting point for digging around. I use Eclipse with the AspectJ Eclipse plug-in, AJDT. In such a setup, simply configure your existing Swing project as an AspectJ project, drop the aspect into the default package, and run your application to see all the violations printed to stdout. See the AspectJ docs for details on weaving the aspect under other circumstances.
Progress has been good, although the subtle and creative use of the syntax may take some time to master. Some resources I found useful to grep while hacking together tracing aspects include The AspectJ Programming Guide, The AspectJ 5 Development Kit Developer's Notebook, and the AspectJ FAQ.


