Tuesday, March 6, 2012

DailyAIFeed: Your daily dose of Artificial Intelligence news

I've been running a little email list for about a year called DailyAIFeed. I started it because I got sick of checking my RSS and email each morning, preferring to just check email.

I've recently cleaned it up a little and opened it up to general consumption at www.DailyAIFeed.com

The list provides a daily email of the top interesting news in the fields of Artificial Intelligence, Machine Learning, Natural Language Processing, Computational Intelligence, Infographics and Data Science.

You can see examples of the types of messages on the blog at blog.DailyAIFeed.com

Go and check it out, and if you have any suggestions on how to further improve it, please get in touch.

Thursday, February 16, 2012

Hosted Data Analysis Project Workflow

I have been thinking a lot about the previously mentioned "Data Analysis Project Management: SaaS".

Such a system could be focused on the project management side, say a Basecamp for Data Analysis projects. But I think it could be something different, something more. I think there are pain points in the data analysis workflow that could be 1) systematized and 2) automated. I also suspect that large data platforms from IBM and Oracle may offer solutions, but I question whether any such solutions exist for smaller scale/cost projects. I think there is an opportunity for a Hosted Data Analysis Project Workflow system.

I've been trying to think hard about possible pain points, and here's what I got:

  • Revision control - keeping track of changes to project files.
  • Research Journal - keeping track of what questions have been asked and what findings have been made.
  • Reproducibility - ensuring there is a recipe to recreate a past result.
  • Collaboration - working with others on all aspects of the analysis workflow.
  • Interpreting Data - Looking at tables and graphs and generating hypotheses to go and test.
  • Executing Models - configuring, running, tuning models.
  • Verifying Models - assessing models on test and verification datasets.
  • Blending Models - comparing and combining model outputs.
I have also been thinking about solutions in the market that address specific pain points. This is much harder - the Internet is a big place. Here is what I could put together quickly:
  • Many Eyes - Collaborate on interpretation from visualization.
  • Google Predict - Hosted models and data
  • Github (and similar) - Hosted project files, forking, collaboration
  • Kaggle, TunedIT - Data competitions, competitive community around problems, Data Spec Work / R&D Outsourcing.
  • Cross Validated - Q/A community around statistics and machine learning
  • Google data explorer - Visualization and interpretation public datasets
Do we need a system that combines all or some of these together? Can this problem be solved with existing systems, and if so, why hasn't it been?

Wednesday, February 15, 2012

Data Analysis Project Management: SaaS

The idea for developing a Data Analysis Workflow SaaS has been percolating in my brain for a few days.

I read a post on the Kaggle HHP forum entitled "Project Management software for Data Analysis" by Dan Becker and it got me thinking. With almost every data or machine learning competition I enter, I end up solving the wrong problem. Rather than solving the problem presented by the competition, I solve the problem of automating and optimizing my work flow by writing scripts and software.


It makes sense, it's what I do for a living. I write and maintain software. Why not expose the problem for what it is and solve it? The real question: Is there a market for such a solution? And secondarily, what features would the software have?

My ad hoc competition-wise solutions focus on result reproducibility (good science), summary statistics, visualization, research journal, generating webpages of experiments performed and above all, automation. I figure that the more I can automate, the faster I can test ideas and the higher my velocity will be on the problem of the competition. Except, most of my effort goes into the automation rather than ideas for the subject data.
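To make the reproducibility and research journal ideas a little more concrete, here is a minimal sketch in Python. It is not the scripts I actually use; `run_experiment`, `journal_entry`, the `learning_rate` parameter, and the scoring rule are all hypothetical stand-ins. The only point is that a journal entry which records the question, the parameters, and the random seed is enough of a recipe to recreate the result later.

```python
import hashlib
import json
import random


def run_experiment(params, seed):
    """Hypothetical stand-in for a model run.

    The score depends only on the parameters and the seed, so the same
    recipe always gives the same result.
    """
    rng = random.Random(seed)
    return round(params["learning_rate"] * rng.random(), 6)


def journal_entry(question, params, seed):
    """Run one experiment and return a record suitable for a research journal."""
    score = run_experiment(params, seed)
    # The recipe is everything needed to recreate the run.
    recipe = json.dumps({"params": params, "seed": seed}, sort_keys=True)
    return {
        "question": question,
        "params": params,
        "seed": seed,
        # Short fingerprint of the recipe (assumed convention, 12 hex chars).
        "recipe_id": hashlib.sha256(recipe.encode("utf-8")).hexdigest()[:12],
        "score": score,
    }


first = journal_entry("Does a smaller learning rate help?", {"learning_rate": 0.1}, seed=42)
again = journal_entry("Does a smaller learning rate help?", {"learning_rate": 0.1}, seed=42)

print(json.dumps(first, indent=2))
print("reproduced:", first == again)
```

Appending each entry to a file as one JSON line would give a crude journal that can also be turned into a webpage of experiments.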

Whether I build the software or not, it is a good research topic to consider.

The aforementioned forum post lists some interesting resources, specifically:
Some Googling turned up some additional useful links: How to efficiently manage a statistical analysis project? on Cross Validated, which provides some excellent answers and great links to go off and read. Other Cross Validated links that were pretty useful included Best Way to Aggregate and Analyze Data and What is a practically good data analysis process?

I also came across Organizing Your Approach to a Data Analysis by Scott Emerson which provides some excellent motivating questions.

Know any other good resources?

I'm keen for feedback: should such a (web-based) software platform exist in the world?
Are there data analysis people who would use this and pay for this, beyond competition patrons?

Tuesday, February 14, 2012

Preview of Clever Algorithms: Statistical Machine Learning Recipes

I am feverishly working to complete a first draft of my next book in the Clever Algorithms series:

Clever Algorithms: Statistical Machine Learning Recipes
You can take a sneak peek at some early chapters online. I am developing the project in plain view on github, take a look at my source R files or LaTeX files if you're into that sort of thing.

The book will provide a treatment of the field of Machine Learning much like the first book provided a treatment of the field of Computational Intelligence and Biologically Inspired Computation.

Here is a preliminary blurb:

Implementing Machine Learning algorithms is difficult. Algorithm descriptions may be incomplete, inconsistent, and distributed across a number of papers, chapters and even websites. This can result in varied interpretations of algorithms, undue attrition of algorithms, and ultimately bad science.
This book is an effort to address these issues by providing a handbook of algorithmic recipes drawn from the field of Machine Learning, described in a complete, consistent, and centralized manner. These standardized descriptions were carefully designed to be accessible, usable, and understandable.
An encyclopedic algorithm reference, this book is intended for research scientists, engineers, students, and interested amateurs. Each algorithm description provides a working code example in R.

I have started to distribute chapters to copy editors and technical editors. I'm also trying hard to nail down the table of contents and specifically which algorithms will appear in the text.

If you know anyone who might be interested in technical editing or any editing, get in contact: jasonb@CleverAlgorithms.com

Tuesday, May 17, 2011

House Price Regression: Vermont South, Melbourne

While looking for a house I maintained statistics on the main suburbs we were visiting, and more specifically, on the houses we looked at.

Each house we visited had about 4 bedrooms and generally had the same kinds of attributes - attributes we thought we wanted in a house. For each house we visited, I recorded the address, size of land in square meters, date of sale, sale type (auction, private), asking price, sale price, and other assorted details. An additional contrived measure was the driving distance to main shops reported by Google Maps, in kilometers.

I also supplemented the dataset with additional matching houses in the area when data was available. I found prices were sometimes available from the auction results, although in other cases I had to call to find out, and/or scour the web.

Rather than let this information go to waste, I thought I would share some of the collected data. This post provides data I collected for the Melbourne suburb of Vermont South.

The following graph simply shows sale price by date, quite boring.

The following graph shows the sale price by land size in square meters.
The following graph shows the sale price by distance to a specific set of shops in kilometers.
I found the data generally useful for plugging in new places and using simple linear regression to help answer questions about expected price at auction or private sale.
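
For anyone who wants to try the same thing, here is a minimal sketch of a simple linear regression of sale price on land size, written in plain Python. The numbers are made up purely for illustration and are NOT the data I collected; `fit_line` and `predict` are names I have chosen for this sketch, and price in thousands of dollars is an assumed unit.

```python
from statistics import mean


def fit_line(xs, ys):
    """Ordinary least squares fit of y = intercept + slope * x."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need at least two paired observations")
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical")
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope


def predict(intercept, slope, x):
    """Expected y for a new x under the fitted line."""
    return intercept + slope * x


# Made-up illustrative numbers, not the Vermont South data.
land_m2 = [600, 650, 700, 750, 800]   # land size in square meters
price_k = [560, 590, 620, 650, 680]   # sale price in thousands of dollars

intercept, slope = fit_line(land_m2, price_k)
estimate = predict(intercept, slope, 720)

print(f"price_k ~ {intercept:.1f} + {slope:.3f} * land_m2")
print(f"estimated price for a 720 m2 block: ${estimate:.0f}K")
```

The same function works with distance to shops as the input instead of land size. With only a handful of sales the estimate is rough, so treat it as a sanity check on an asking price rather than a prediction.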

Some of this data may be available for purchase from various retail data providers, but I found collecting and entering the data myself made it a lot more personal and gave me some additional focus when inspecting properties and talking to agents about trends.

Monday, May 16, 2011

So, we bought a house

So we finally bought a house. We've been looking on and off for about 12 months although things got serious about 3 months ago.

We first looked at the place we bought last Saturday, and walking in the door I knew it was a strong contender. We looked at three other places that day, and they all paled in comparison.

The place had been passed in at auction nearly three months before, and we were told that initially the vendors' expectations were too high. We saw this as a good opportunity to negotiate and try to broker a good deal. The market had been slumped for a few months and the early figures for the quarter had indicated a ~2.5% drop in median house price for the city.

We did another inspection on the following Tuesday and enlisted all of the troops (extended family) to give the place a good once over. We then sat down and signed a formal offer. It was rejected. We upped the offer by $5K and it was accepted, although at the insistence of my wife we made the offer contingent on the outcome of a builder's inspection.

I found a company in the yellow pages and had the inspection done on the last day of the 3 day cooling off period. The building and pest report was incredibly detailed, providing photos and a room-by-room summary, inside and out. We learned a lot about the types of preventative maintenance the place will need over the next 5-10 years, and more importantly, we learned that the upstairs balcony had some major structural problems.

The report said that the wood used was popular in the decade that the balcony was built and was known to rot unless properly treated. Rather than expecting the vendor to return the balcony to new condition, we made an offer to split the difference, deducting half of the cost of the repair from our offer.

All of the negotiating occurred on the last day of the cooling off period, a Friday. I had my wife on one hand, adamant that she didn't want to pay a thing to have the balcony fixed, and the agent on the other hand threatening to open the property for inspection on the next day. I really liked the place and I was feeling totally strung out (to say the least).

We managed to broker a deal in the end and initial the final amendment on the Saturday, one week from our first inspection. With previous auctions and negotiations, I tried to remain emotionless, time was on our side and we could wait for a deal. I really liked this place and it was beginning to dawn on me that our remaining time to find a place (before the baby came) had shrunk to a matter of a few months. We're both happy we finally got there and have high hopes for turning the property into our home.

We learned a lot throughout the process. My analysis of median house prices, suburb selection, crime rates, and even travel time studies months ago were interesting, although in the end they did not directly affect the outcome. Even the detailed suburb house price regressions I was building up were not used, as we ended up buying in a completely different suburb, inspecting the house on a whim.

I was told early on that buying a home is different from buying an investment, and it bit me in the end, because it's emotional. If/when there is a next time, at least our expectations - that it is a long hard emotional roller coaster - will mean we'll be better prepared. Hopefully.

Rather than letting the regressions go to waste, I'll post some regression analysis for a selected suburb soon.

Tuesday, May 3, 2011

Quake AI Programming Book

I intend to write a follow-up book to the Nature-Inspired Clever Algorithms book on Machine Learning. I have a lot going on this year, so I was thinking of postponing it until 2012. If I do decide to go down this road, I was thinking of taking on a different project in 2011 that would be smaller in scope, less taxing, although still interesting and rewarding.

I have been thinking of writing a book about the AI in the Quake series of computer games. I was thinking of either writing a book that analysed the Artificial Intelligence architecture in each game in the series, or one that analysed the AI in the bot modifications. Perhaps both. The book would walk through monster or bot case studies and describe how they fit together, think, and behave. Perhaps with small experiments and demonstrations along the way. The kind of book that would have captured me as a game programming hacker 15 years ago.

In pondering this idea, I thought it prudent to explore other books written on or related to this idea. The following is a list of books that I found:

Quake Series Programming Books

Related Programming Books
These are by no means the cream of the crop of game AI programming, and there are in fact many level design books in there as well.

All of these books are focused on teaching some form of programming or game development using an existing game as a medium. The advantage of the Quake series is that the source code is released under the GPL. The Unreal series and the Half-Life (Source Engine) series are not released as open source, although they do provide access to some aspects of the source under restricted licence for the modding community.

It is clear that there is interest/demand for books on game development based on the Unreal series, which makes a lot of sense given their general success in licencing the technology.

Some concerns about tackling such a project include:
  • Interest: The games in the Quake series are old (10-15 years). The methods may be outdated, they may not be relevant to modern computer games, and it is more than likely that no one will care.
  • Low Barrier: It is more than likely that no one has undertaken such a project because the barrier is so low. One can simply read the code and understand what is happening, no analysis is necessary. 
  • Copyright: Although the source code is released under the GPL, the game assets are not. One may have to acquire a licensed copy of the game to do any meaningful development. Additionally, my use of game screenshots may be restricted (fair use!?).
There is some effort required to produce such a work. Getting each project setup may be involved, especially across the three main platforms (Windows, Mac, Linux). The work would be primarily analysis: reading source code, experimenting and communicating what is happening with diagrams and descriptions. This tinker-write cycle is slightly more relaxed than the deep research needed for each algorithm in a machine learning text.

Is there interest in the market? Would you read or skim such a book?
Let me know what you think in a comment or email.