I have been thinking a lot about the previously mentioned "Data Analysis Project Management: SaaS".
Such a system could be focused on the project management side, say a Basecamp for Data Analysis projects. But I think it could be something different, something more. I think there are pain points in the data analysis workflow that could be 1) systematized and 2) automated. I also suspect that large data platforms from IBM and Oracle may offer solutions, but I question whether any such solutions exist for smaller scale/cost projects. I think there is an opportunity for a Hosted Data Analysis Project Workflow system.
I've been trying to think hard about possible pain points, and here's what I got:
- Revision control - keeping track of changes to project files.
- Research Journal - keeping track of what questions have been asked and what findings have been made.
- Reproducibility - ensuring there is a recipe to recreate a past result.
- Collaboration - working with others on all aspects of the analysis workflow.
- Interpreting Data - Looking at tables and graphs and generating hypotheses to go and test.
- Executing Models - configuring, running, tuning models.
- Verifying Models - assessing models on test and verification datasets.
- Blending Models - comparing and combining model outputs.

Some existing services already address parts of this:
- Many Eyes - Collaborate on interpretation from visualization.
- Google Predict - Hosted models and data
- Github (and similar) - Hosted project files, forking, collaboration
- Kaggle, TunedIT - Data competitions, competitive community around problems, Data Spec Work / R&D Outsourcing.
- Cross Validated - Q/A community around statistics and machine learning
- Google data explorer - Visualization and interpretation of public datasets



4 comments:
Can't forget Wolfram Alpha Pro for all kinds of automatic data analysis.
One more resource, and one more pain point:
Resource/solution: stats.stackoverflow.com is a great place to ask questions and research whatever problems you have. I'm constantly amazed at the high quality of (free) help on the stackoverflow sites.
Pain point: Cloud computing should be easier. I've found it surprisingly frustrating to get set up on Amazon EC2. I've used cloudnumbers.com, which resells EC2 instances with an easier to use interface. I don't mind paying a little more for the computing time, but there's still stuff that doesn't work as well as it should.
I also wish I could set up Dropbox-style file sharing between a folder on my local computer and my Amazon (or cloudnumbers) workspace. I assume you could set something up so the cloud workspace pulled from a repository every time before it did anything interesting... but I haven't been able to set that up.
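The "pull from a repository before doing anything interesting" idea above could look roughly like the sketch below. It assumes the cloud workspace is a git checkout with a remote already configured; the function names and the example script name analysis.R are hypothetical, not part of any existing tool.

```python
import subprocess


def plan(analysis_cmd):
    """Return the commands to run, in order: update the checkout, then the analysis."""
    # --ff-only makes git refuse to merge, so the workspace cannot silently
    # diverge from the repository it is supposed to mirror.
    return [["git", "pull", "--ff-only"], list(analysis_cmd)]


def sync_then_run(repo_dir, analysis_cmd):
    """Run each planned step inside repo_dir, stopping at the first failure."""
    for step in plan(analysis_cmd):
        subprocess.run(step, cwd=repo_dir, check=True)


# Hypothetical usage on the cloud machine:
# sync_then_run("/home/me/project", ["Rscript", "analysis.R"])
```

This only covers the direction from repository to workspace; getting results back to the local machine would need a separate step.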
Email comment from Terrence (bioinformatics student)
That's really interesting.
Within bioinformatics, there is a really big problem in that when someone publishes an article about doing a whole heap of sequencing and discovering a novel mutation, there is no real information on their data analysis pipeline.
If you are lucky you can infer from their references list which tools they have used, but you get no information on what steps they took to clean the data, filter it, run the analysis, re-run the analysis or anything like that. This means not only am I unable to replicate their findings (which is bad for science), but I am also unable to try running the same pipeline on my own data. The key thing here is knowing what tools were run with what parameters along every step of the process - which should be as simple as a bash script, but never is!
My guess is that a lot of the "pipeline" information is deliberately secret sauce - to provide them with a competitive advantage.
If a system existed that acted as a proxy and recorded all my commands and the output of those commands before sending them to the cluster (kind of like a Dropbox for data analysis), then that would be very interesting. The trouble is that capturing the output of commands is difficult due to the sensitivity of the data (e.g. it's not allowed to leave the servers) and also due to its size.
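A minimal sketch of such a recorder, under some assumptions: it runs on the same machine as the command rather than as a real proxy in front of a cluster, and it logs only the size and a SHA-256 hash of the output instead of the output itself, as one possible way around the sensitivity and size problems. The function name and log file name are hypothetical.

```python
import hashlib
import json
import subprocess
import time


def run_and_record(cmd, log_path="analysis_log.jsonl"):
    """Run cmd (a list of strings) and append one JSON record about it to log_path."""
    started = time.strftime("%Y-%m-%dT%H:%M:%S")
    # Output is held in memory here; very large outputs would need to be
    # streamed and hashed in chunks instead.
    result = subprocess.run(cmd, capture_output=True)
    record = {
        "started": started,
        "command": list(cmd),  # the tool and every parameter, in order
        "exit_code": result.returncode,
        "stdout_bytes": len(result.stdout),
        "stdout_sha256": hashlib.sha256(result.stdout).hexdigest(),
    }
    with open(log_path, "a") as log:
        log.write(json.dumps(record) + "\n")
    return record
```

The log then reads as a step-by-step record of which tools were run with which parameters, and the hashes let someone check that a re-run produced the same output without the data ever leaving the server.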
Hope this is useful in some way!
Tweet from @bmabey
Interesting idea! Other reference points: IPython's notebook aims to aid research http://ipython.org/ipython-doc/dev/interactive/htmlnotebook.html Also see VisTrails http://www.vistrails.org/index.php/Main_Page