Thursday, February 16, 2012

Hosted Data Analysis Project Workflow

I have been thinking a lot about the previously mentioned "Data Analysis Project Management: SaaS".

Such a system could be focused on the project management side, say a Basecamp for Data Analysis projects. But I think it could be something different, something more. I think there are pain points in the data analysis workflow that could be 1) systematized and 2) automated. I also suspect that large data platforms from IBM and Oracle may offer solutions, but I question whether any such solutions exist for smaller scale/cost projects. I think there is an opportunity for a Hosted Data Analysis Project Workflow system.

I've been trying to think hard about possible pain points, and here's what I've got:

  • Revision control - keeping track of changes to project files.
  • Research Journal - keeping track of what questions have been asked and what findings have been made.
  • Reproducibility - ensuring there is a recipe to recreate a past result.
  • Collaboration - working with others on all aspects of the analysis workflow.
  • Interpreting Data - Looking at tables and graphs and generating hypotheses to go and test.
  • Executing Models - configuring, running, tuning models.
  • Verifying Models - assessing models on test and verification datasets.
  • Blending Models - comparing and combining model outputs.
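
To make the Reproducibility and Research Journal items a little more concrete, here is a minimal sketch in Python of what a "recipe" for a past result might record. Everything in it is an assumption for illustration: the function names (`fingerprint`, `run_experiment`, `replay`), the toy "model" (the mean of a random subsample) and the journal format (one JSON line per experiment) are hypothetical, not part of any existing tool.

```python
import hashlib
import json
import random


def fingerprint(rows):
    """Return a short SHA-256 digest of the dataset so a later run can check it is unchanged."""
    payload = json.dumps(rows, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()[:12]


def run_experiment(question, rows, params, seed):
    """Run a toy 'model' (mean of a random subsample) and return a journal entry."""
    rng = random.Random(seed)  # local, seeded generator: same seed gives the same sample
    sample = rng.sample(rows, params["sample_size"])
    result = sum(sample) / len(sample)
    return {
        "question": question,
        "data_fingerprint": fingerprint(rows),
        "params": params,
        "seed": seed,
        "result": result,
    }


def replay(entry, rows):
    """Re-run a journal entry against the data and report whether the result is reproduced."""
    if fingerprint(rows) != entry["data_fingerprint"]:
        raise ValueError("dataset has changed since the entry was recorded")
    rerun = run_experiment(entry["question"], rows, entry["params"], entry["seed"])
    return rerun["result"] == entry["result"]


# Placeholder data, purely for illustration.
data = [float(x) for x in range(100)]
entry = run_experiment("What is the typical value?", data, {"sample_size": 10}, seed=42)
journal_line = json.dumps(entry, sort_keys=True)  # one line per experiment, easy to diff
print(journal_line)
print("reproduced:", replay(json.loads(journal_line), data))
```

The idea is that the journal entry carries the question asked, the data fingerprint, the parameters and the seed, so the same entry doubles as a research note and as a recipe. Keeping the journal as plain text also means it can sit under revision control next to the project files.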

I have also been thinking about solutions in the market that address specific pain points. This is much harder - the Internet is a big place. Here is what I could put together quickly:

  • Many Eyes - Collaborate on interpretation from visualization.
  • Google Prediction API - Hosted models and data
  • GitHub (and similar) - Hosted project files, forking, collaboration
  • Kaggle, TunedIT - Data competitions, competitive community around problems, Data Spec Work / R&D Outsourcing.
  • Cross Validated - Q/A community around statistics and machine learning
  • Google Public Data Explorer - Visualization and interpretation of public datasets

Do we need a system that combines some or all of these? Why can't this problem be solved with existing systems, and if it can, why isn't it?

Wednesday, February 15, 2012

Data Analysis Project Management: SaaS

The idea for developing a Data Analysis Workflow SaaS has been percolating in my brain for a few days.

I read a post on the Kaggle HHP forum entitled "Project Management software for Data Analysis" by Dan Becker and it got me thinking. With almost every data or machine learning competition I enter, I end up solving the wrong problem. Rather than solving the problem presented by the competition, I solve the problem of automating and optimizing my workflow by writing scripts and software.


It makes sense; it's what I do for a living. I write and maintain software. Why not expose the problem for what it is and solve it? The real question is: is there a market for such a solution? And secondarily, what features would the software have?

My ad hoc, per-competition solutions focus on result reproducibility (good science), summary statistics, visualization, a research journal, generating webpages of experiments performed and, above all, automation. I figure that the more I can automate, the faster I can test ideas and the higher my velocity will be on the problem of the competition. Except most of my effort goes into the automation rather than into ideas about the subject data.
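
As a rough illustration of the "summary statistics plus a webpage of experiments" kind of automation, here is a minimal Python sketch using only the standard library. The function names (`summarize`, `render_report`), the experiment names and the scores are all made-up placeholders, not results from any competition.

```python
import html
import statistics


def summarize(scores):
    """Return the run count, mean and sample standard deviation of a list of scores."""
    return {
        "n": len(scores),
        "mean": statistics.mean(scores),
        "stdev": statistics.stdev(scores),  # needs at least two scores
    }


def render_report(experiments):
    """Render a dict of {experiment name: list of scores} as a small HTML table."""
    rows = []
    for name, scores in experiments.items():
        s = summarize(scores)
        rows.append(
            "<tr><td>{}</td><td>{}</td><td>{:.3f}</td><td>{:.3f}</td></tr>".format(
                html.escape(name), s["n"], s["mean"], s["stdev"]
            )
        )
    return (
        "<html><body><table>\n"
        "<tr><th>Experiment</th><th>Runs</th><th>Mean</th><th>Std dev</th></tr>\n"
        + "\n".join(rows)
        + "\n</table></body></html>"
    )


# Placeholder scores, purely for illustration.
experiments = {
    "baseline": [0.61, 0.63, 0.60],
    "tuned <v2>": [0.66, 0.68, 0.67],
}
page = render_report(experiments)
print(page)
```

Writing `page` to a file after every batch of runs would give a browsable record of what has been tried, which is roughly the piece I keep rebuilding by hand.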

Whether I build the software or not, it is a good research topic to consider.

The aforementioned forum post lists some interesting resources, specifically:
Some Googling turned up some additional useful links: "How to efficiently manage a statistical analysis project?" on Cross Validated, which provides some excellent answers and great links to go off and read. Other Cross Validated links that were pretty useful included "Best Way to Aggregate and Analyze Data" and "What is a practically good data analysis process?"

I also came across Organizing Your Approach to a Data Analysis by Scott Emerson which provides some excellent motivating questions.

Know any other good resources?

I'm keen for feedback: should such a (web-based) software platform exist in the world?
Are there data analysis people who would use this, and pay for this, beyond competition patrons?

Tuesday, February 14, 2012

Preview of Clever Algorithms: Statistical Machine Learning Recipes

I am feverishly working to complete a first draft of my next book in the Clever Algorithms series:

Clever Algorithms: Statistical Machine Learning Recipes
You can take a sneak peek at some early chapters online. I am developing the project in plain view on GitHub; take a look at my source R files or LaTeX files if you're into that sort of thing.

The book will provide a treatment of the field of Machine Learning much like the first book provided a treatment of the field of Computational Intelligence and Biologically Inspired Computation.

Here is a preliminary blurb:

Implementing Machine Learning algorithms is difficult. Algorithm descriptions may be incomplete, inconsistent, and distributed across a number of papers, chapters and even websites. This can result in varied interpretations of algorithms, undue attrition of algorithms, and ultimately bad science.
This book is an effort to address these issues by providing a handbook of algorithmic recipes drawn from the field of Machine Learning, described in a complete, consistent, and centralized manner. These standardized descriptions were carefully designed to be accessible, usable, and understandable.
An encyclopedic algorithm reference, this book is intended for research scientists, engineers, students, and interested amateurs. Each algorithm description provides a working code example in R.

I have started to distribute chapters to copy editors and technical editors. I'm also trying hard to nail down the table of contents and specifically which algorithms will appear in the text.

If you know anyone who might be interested in technical editing, or any editing, get in contact: jasonb@CleverAlgorithms.com