I have been thinking a lot about statistical hypothesis testing on the web recently, likely because of the popularisation of A/B testing (more specifically, people talking about the approach). Anyway, I came across a great talk by Ron Kohavi from Microsoft Research (formerly Amazon) entitled "Practical Guide to Controlled Experiments on the Web: Listen to Your Customers not to the HiPPO" (2007).
My notes from watching the video:
- HiPPO: the guy with money who makes decisions (Highest Paid Person's Opinion)
- Case - Amazon shopping cart: test whether showing recommended items when a product is added to the basket promotes larger basket size and increased cross-selling, or decreases conversion due to distraction. Not expected to work, but wildly successful.
- Case - Checkout page: test two versions of a checkout page to see which results in more conversions. The new version was no good: the voucher code distracted customers from completing the purchase.
- Case - Office online help articles: changing 'was this helpful' message to stars and reason. Simple version was worse (response rate), stars with dynamic text box much better, dynamic box with customized messages much better again.
- Case - Selling sewing machines: test the crazy idea of selling two machines at 10% off. Tried anyway, and it resulted in an increase in sales.
- Data trumps Intuition: the less data the stronger the opinions (guesses), collect data through experimentation. Intuition is poor.
- Define OEC: what to measure (Overall Evaluation Criterion). Hard to define; not click-through but sales, for example. Select criteria that optimize for customer lifetime value. Measure lots of things for post-hoc data mining.
- OEC defines whether or not a feature is launched. When results are neutral or negative, drill down into the data: attempt to quantify the why, isolate segments, even find bugs in the implementation.
- Controlled Experiments: classical statistical hypothesis testing, control (existing version) and treatment (new version), randomized experimental design (strategy for eliminating variables)
- Advantages: find causal relationships (correlation is not causation), insulate against external factors (randomization)
- Problems: hard to agree on what to measure, what is being optimized (OEC), quantitative results do not explain why, Primacy effect (experienced user expectations), multiple concurrent experiments (potential of confounding results - interaction), inconsistent trial assignment (cleared cookies if cookie based, etc), not used during critical periods (press)
- Measure Variance: run an A/A test (two populations, both on control) to measure the normal variance in the data, run it at the same time as the A/B test, avoid conclusions based on random variation (there are statistical tests for this, e.g. confidence intervals), do power calculations (minimum sample size)
- Strategy: run 50/50 to get peak efficiency, run small for long period, run small and scale up (ensure consistent trial assignment), cancel test if treatment is clearly worse than control (big problems appear even in small samples)
- Trial assignment is important: consistent, independent experiments, consistent ramp up
- Statistics: take the time to execute the more complicated equations, increase reliability, computer speed - computation is what computers are for (lots of books)
- Automated systems: make it easy to run experiments, data-driven culture, near real-time optimizations, automate widget placement - automated A/B testing (data driven selection of best content)
- Lessons: listen to customers, intuition is terrible, replace HiPPOs with OEC, careful and accurate statistics, experiment often to optimize value
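To make a few of these notes concrete, here is a minimal Python sketch of my own (not code from the talk). It assumes users are identified by a string ID and that the metric is a simple conversion rate. The function names `assign_variant`, `two_proportion_test` and `min_sample_size` are hypothetical. The first shows one way to get consistent trial assignment by hashing, the second is a standard two-proportion z-test with a 95% confidence interval, and the third uses the common rule of thumb n ≈ 16σ²/Δ² per group (roughly 80% power at 95% confidence).

```python
import hashlib
import math


def assign_variant(user_id: str, experiment: str, treatment_share: float = 0.5) -> str:
    """Deterministically map a user to 'control' or 'treatment'.

    Hashing (experiment, user_id) gives the same answer on every visit and
    keeps assignments in different experiments independent of each other.
    """
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode("utf-8")).hexdigest()
    bucket = int(digest[:8], 16) / 0x100000000  # value in [0, 1)
    return "treatment" if bucket < treatment_share else "control"


def two_proportion_test(conv_a: int, n_a: int, conv_b: int, n_b: int):
    """Compare conversion rates of control (a) and treatment (b).

    Returns (difference, z statistic, two-sided p-value, 95% CI of difference).
    Uses the normal approximation, so it assumes reasonably large samples.
    """
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    diff = p_b - p_a
    # Pooled standard error for the test statistic (null: rates are equal).
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se_pooled = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = diff / se_pooled
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail
    # Unpooled standard error for the confidence interval.
    se = math.sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    margin = 1.96 * se
    return diff, z, p_value, (diff - margin, diff + margin)


def min_sample_size(baseline_rate: float, min_detectable_diff: float) -> int:
    """Rule-of-thumb users needed per group: 16 * variance / difference^2."""
    variance = baseline_rate * (1 - baseline_rate)
    return math.ceil(16 * variance / min_detectable_diff ** 2)
```

Running `two_proportion_test` on the two halves of an A/A test is a cheap sanity check: the confidence interval should include zero most of the time. If it regularly does not, something is wrong with the assignment or the logging.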
The most thought-provoking aspect for me was the automation of A/B testing to the point of real-time selection of content for Amazon's home page. Feels like a statistical version of the human-voting systems like Digg, Reddit, and Hacker News. I'd love to throw some generative algorithms into the mix to dynamically vary treatments towards fine-grained optimizations (automated feature hill climbing).
I get the feeling that such methods are standard practice at big places like Amazon, Microsoft, Google, eBay, etc. Although I'm sure there will be a lot of knowledge locked up in these organisations, I bet there are many more papers, talks, even books on the subject. A good place to start is the references in the paper. Some interesting links include:
- Early Amazon: Shopping cart recommendations (2006) the Amazon cart recommendation story told by the guy involved, Greg Linden
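I don't know how Amazon actually implements this, but one simple way to sketch automated, data-driven content selection is an epsilon-greedy strategy: mostly show the variant with the best observed conversion rate, and occasionally show a random one to keep collecting data. The class and parameter names below are my own and purely illustrative.

```python
import random


class EpsilonGreedySelector:
    """Pick content variants, favouring the best observed conversion rate."""

    def __init__(self, variants, epsilon=0.1, rng=None):
        self.epsilon = epsilon  # fraction of requests used for exploration
        self.rng = rng if rng is not None else random.Random()
        self.shows = {v: 0 for v in variants}
        self.conversions = {v: 0 for v in variants}

    def rate(self, variant):
        """Observed conversion rate (0.0 until the variant has been shown)."""
        shows = self.shows[variant]
        return self.conversions[variant] / shows if shows else 0.0

    def choose(self):
        """Explore with probability epsilon, otherwise exploit the best rate."""
        variants = list(self.shows)
        if self.rng.random() < self.epsilon:
            return self.rng.choice(variants)
        return max(variants, key=self.rate)

    def record(self, variant, converted):
        """Log one impression and whether it led to a conversion."""
        self.shows[variant] += 1
        if converted:
            self.conversions[variant] += 1
```

In use, each page view calls `choose()` to decide what to display and later calls `record()` with the outcome. Unlike a fixed 50/50 split this shifts traffic as data arrives, which is convenient but makes the careful statistics from the talk harder to apply, so I'd treat it as a complement to controlled experiments rather than a replacement.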
- How to Decrease Sales by 90 Percent (2003) an important lesson about making small moves (not changing too many things) between treatments and incrementing
- How to Increase Conversion Rate 1,000 Percent (2003) example of a coupon code causing problems and its systematic identification
- Design Choices Can Cripple a Website (2005) highlighting the effects design decisions can have on a web page's call to action


2 comments:
Hi Jason,
I think this post fits really well with your previous post. In both cases you are wondering if best practices are actually widely used.
My intuition is that the answer is no, but I'm sure somebody is doing systematic research on that topic. Although even that is a bold assumption.
Sjors
That's an insightful comment Sjors, thanks.
In addition to broader adoption, I've also been thinking that most of the cases (war stories) surrounding best practices are more likely outliers than exemplars.
Generally, I'm simply interested in expanding my understanding of available 'tools' so I can be more informed for future decision making.