Friday, May 21, 2010

Machine Learning as a Service: The Google Prediction API

The recent announcement of the Google Prediction API caught my attention. The service is interesting because its business model is to provide a scalable machine learning black box that can be used directly or integrated into an application. As described, the user first uploads a dataset to the also newly announced Google storage service, trains an opaque model from the data, and then derives predictions from the trained model.

One interacts with the service using a RESTful API, performing POST and GET HTTP operations to invoke the training and prediction functions. Data must be provided in CSV format, as line-based records of comma-separated values. Training seems to be per data bucket, and it is unclear whether models can be updated once trained, whether the models can be retrieved, or even what types of machine learning algorithms and algorithm parameters will be used. The service page describes support only for supervised classification tasks at this stage.
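To make the data format concrete, here is a minimal sketch of preparing line-based, comma-separated training records in Python. The column layout (class label first, input attributes after it) and the helper name records_to_csv are my own assumptions for illustration; the service page may specify something different.

```python
import csv
import io


def records_to_csv(records):
    """Serialize (label, features) pairs as CSV text, one record per line.

    Assumed layout (not confirmed by the service page): the class label
    comes first, followed by the real-valued or textual input attributes.
    """
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    for label, features in records:
        writer.writerow([label, *features])
    return buffer.getvalue()


# Hypothetical training records mixing textual and real-valued inputs.
training = [
    ("spam", ["win a prize, now", 3, 0.9]),
    ("ham", ["meeting at noon", 0, 0.1]),
]
print(records_to_csv(training), end="")
```

Using the csv module rather than joining strings by hand means that text attributes containing commas are quoted correctly.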

Data types are limited for now to categorical prediction (classification) with real and textual inputs. Naturally (this is Google) data records can comprise very long lists of attributes, and dataset sizes can be enormous. The status of model training can be queried, and some basic statistics about the trained model can be retrieved, namely the classification accuracy, which is determined using cross validation on the provided training data. Predictions can be made using a query interface, passing in the input attributes and retrieving the classification. Presumably there is a batch mode where multiple records could be passed in for classification.
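For readers unfamiliar with cross validation, the following is a small sketch of how a classification accuracy can be estimated from the training data alone. It is not the service's implementation, which is not documented; the majority-label "model" is a stand-in I made up so the example runs without any libraries.

```python
from collections import Counter


def majority_classifier(train_rows):
    """Stand-in model: always predicts the most common training label."""
    most_common = Counter(label for label, _ in train_rows).most_common(1)[0][0]
    return lambda features: most_common


def cross_validated_accuracy(rows, train_fn, k=5):
    """Estimate accuracy by holding out each of k folds in turn.

    Assumes k >= 2 and at least k rows, so every training split is non-empty.
    """
    folds = [rows[i::k] for i in range(k)]
    correct = 0
    for i, held_out in enumerate(folds):
        # Train on every fold except the one being held out.
        train_rows = [row for j, fold in enumerate(folds) if j != i for row in fold]
        model = train_fn(train_rows)
        correct += sum(1 for label, features in held_out if model(features) == label)
    return correct / len(rows)


# Hypothetical dataset: (label, features) pairs, eight "a" and two "b".
rows = [("a", [0])] * 8 + [("b", [1])] * 2
print(cross_validated_accuracy(rows, majority_classifier, k=5))
```

Every record is used for testing exactly once, and never by a model that saw it during training, which is what makes the reported accuracy a fair estimate.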

All this information was distilled from the service page and is more than likely to change. The service is not available yet, but I signed up to the waiting list for early access to help burn it in. Billing will likely converge to a function of storage size, maybe even model compute time, and the volume of retrieved predictions.

To me it feels like they have abstracted the process used to build the language translation service or spell checker/corrector, simplified it, and are turning it into a commodity. Big data, rather than fancy algorithms, is 'where it is at' (see Norvig's Theorizing from Data from 2007).

The service is loosely related to two other services out in the wild. The first is TunedIT, an algorithm/dataset/challenge website launched in September 2009 (see a press release). The site allows the uploading of datasets and/or algorithms and, more importantly, the design of dataset challenges like the Netflix Prize. This seems to be the primary function of the site, and to me it is trying to exploit the success of the Netflix Prize by abstracting it and providing the management of such challenges as a service (not a terrible idea). The other site is MLcomp, which launched in April 2010 (see a press release) and is focused on users either uploading datasets to find the algorithm that performs best, or uploading algorithms and having the system automatically evaluate them against all previously uploaded datasets. To me, it feels like an online version of the WEKA machine learning workbench (not a terrible idea if your market is other grad students). Both sites are really focused on machine learning practitioners and, unlike the announced Google service, don't seem to offer a useful way to exploit the algorithms for private data sources.

I had some similar ideas to this while studying as a graduate student, although I had lofty scientific ambitions of automatically mapping the performance of a large suite of function optimization algorithms rather than function approximation machine learning algorithms - something like an optimization version of MLcomp. I even blogged a little on it after I completed my dissertation (see Mapping 'no free lunch'). The targeted value proposition in the Google prediction service is an excellent approach, and people may even pay to use it.

Although the algorithm hackers and researchers will want to know all about the algorithms and their parameterization, I hope that the service remains a black box (shock, horror!). Maybe not, but I would hate to see this devolve into an algorithm free-for-all that would confuse users and muddy the value this service could deliver. With Google-level infrastructure, they can run a suite of the top 20 techniques for a given problem type, deliver the best (or an ensemble) to provide the predictions, and keep the specific details of the magic that produced the model a secret. That is what I would do. And if this is indeed the adopted strategy, then I doubt we will see a "download model" API call anytime soon.
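The "run a suite and deliver the best" strategy I am speculating about can be sketched in a few lines of Python. This is purely my own illustration, not anything Google has described: the two candidate techniques, the held-out validation split, and all function names are assumptions chosen to keep the example tiny.

```python
from collections import Counter


def train_majority(rows):
    """Candidate 1: always predict the most common training label."""
    top = Counter(label for label, _ in rows).most_common(1)[0][0]
    return lambda features: top


def train_nearest_neighbour(rows):
    """Candidate 2: predict the label of the closest training record."""
    def predict(features):
        def squared_distance(row):
            return sum((a - b) ** 2 for a, b in zip(row[1], features))
        return min(rows, key=squared_distance)[0]
    return predict


def accuracy(model, rows):
    """Fraction of rows whose label the model predicts correctly."""
    return sum(1 for label, features in rows if model(features) == label) / len(rows)


def select_best(candidates, train_rows, validation_rows):
    """Train every candidate and keep the one scoring best on held-out data."""
    scored = []
    for name, train_fn in candidates.items():
        model = train_fn(train_rows)
        scored.append((accuracy(model, validation_rows), name, model))
    best_score, best_name, best_model = max(scored, key=lambda item: item[0])
    return best_name, best_score, best_model


# Hypothetical data: (label, features) pairs with one real-valued input.
train = [("low", [0.0]), ("low", [1.0]), ("low", [2.0]), ("high", [9.0]), ("high", [10.0])]
validation = [("low", [0.5]), ("high", [9.5])]
candidates = {"majority": train_majority, "nearest_neighbour": train_nearest_neighbour}

name, score, model = select_best(candidates, train, validation)
print(name, score)
```

The caller only ever receives a predict function, so which technique won can stay hidden behind the interface, which is exactly the black box property I am hoping for.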

5 comments:

Jason said...

There have been a number of posts on this new Google service by machine learning people, including:

- Prediction Services on Inductio ex Machina
- Google Predict on Machine Learning (Theory)
- Google Prediction API: Commoditization of Large-Scale Machine Learning?

Jason said...

Some good comments about this on Hacker News: Google Prediction API.

Jason said...

Some more interesting and related articles:

- An image for AWS called Cloud1305 - Machine Learning On Demand that provides a range of machine learning algorithms ready to run.
- A web service called predict.i2pi that provides a machine learning web service backed by a range of machine learning algorithms in R (see this post for more details).
- A tutorial entitled Parallel Machine Learning for Hadoop/Mapreduce – A Python Example
- An interview with a guy from FlightCaster that uses machine learning to predict which flights will be late entitled How FlightCaster Squeezes Predictions from Flight Data

Jason said...

face recognition, a specific form of a machine learning service (pattern recognition, feature extraction, and classification): http://developers.face.com/

there are two clear paths here: domain applications with applied machine learning services (like face) and infrastructure services to do it all yourself (like google)

Marcin Wojnarski said...

Jason, thanks for the post and for mentioning TunedIT. I'd like to clarify that challenges are not primary functionality of TunedIT - Research part is equally important as challenges and actually it was the first one, in the very beginning of the system development. Maybe challenges are more visible, because - what's worth to notice - they're the most enjoyable form of making research in machine learning. :)

As to the main topic of your post, release of Google API is indeed a very important event in the ML world. I wonder how this will work in practice, because for me it seems that the most difficult part of making ML applications is not the technical stuff (launching a training procedure), but rather to gain enough understanding of the problem to know how to approach it, e.g. problem formulation, what data should be on input, ways of preprocessing etc. Google API can't answer these types of questions, so the user should anyway be skilled enough in ML to know what he expects from the API. Cheers.