I picked up Norovirus this past weekend. Unrecommended. I didn’t even get a cruise out of it. Feeling better now, but I am pushing my more ambitious post to next week. Stay tuned!
On August 18th, the WSJ ran a profile of the fashion retailer Zara. The profile focuses on a new technology the company is rolling out that gives store managers more data. I am most interested in what that means for localization. From the article (bold emphasis mine):
Zara is going local, giving store managers control over their shop’s inventory, displays and designs.
The strategy relies heavily on its proprietary data system and a willingness to break the standard fashion-chain practice of making centralized decisions on stores’ behalf.
Through a system the company calls “Mirror Stores,” every Zara has been digitally linked with other, comparable Zara locations since last year. Store managers can see their own and others’ latest sales data, establishing a benchmark for their shop’s performance as well as a method to compare consumer behaviors and quickly shift to display T-shirts instead of tank tops, or blazers over cardigans.
The featured store manager has 28 years of experience and oversees 100 employees and 12,000 SKUs. Óscar García Maceiras, the CEO of Inditex (Zara’s corporate owner), describes the store managers as “CEO of their store,” adding, “We have 5,800 CEOs across the world.” Not mentioned is that the store CEOs are not as highly compensated as García is (~8.4MM Euros last year).
I make the point on compensation because if the store managers really were required to have the skill set of CEOs, the business model would break down. Even if one assumes that CEOs are overpaid (a post for another time, perhaps), no company could realistically find almost 6,000 of them to run the business (although it would be a great business for executive recruiters).
Over a decade ago I heard a story about how eBay tested localized versus centralized management of paid search. They had local teams run each country independently, while a centralized team in California ran multiple countries around the world. The centralized team did not have the language skills of the local teams, but they still delivered MUCH better performance. I expect part of this was due to more data (you can find opportunities better and faster with a higher “n”), but I expect most of it was because the central team was more skilled than the local teams. You can spend more time recruiting, and pay more, for a paid search manager running a $100MM budget than for one working with $4MM. And while running campaigns across countries is not zero marginal effort, it is certainly less than starting from scratch in each region.
It does not surprise me that most retailers choose the centralized route. The local store managers may have ideas about what sells and what doesn’t, but they are (1) likely less skilled than centralized executives, and (2) more likely to be chasing noise than signal. But there clearly is SOME value in being close to the customer. Zara has started from the assumption that “local is best,” and is now building the tools that let those local managers see data globally, along with tools to help them make sense of it. Here the profile describes the analytics the stores get access to. Notice how much data is held back. It is rationalized by the manager as “no competition between stores,” but I expect the real reason is that they want to make the tools easy to use and less distracting:
Managers can’t see exactly which Zara store is doing what, but they can look at regional patterns and monitor trend lines, such as seasonal tastes or fashion fads happening around the world.
It seems to be working. Over the last 7 years Zara has gone from parity with H&M to a 60% lead:
When this happens there is a risk of the halo effect: Zara is clearly doing well as a whole, therefore any given management choice is assumed to be the right one (and glowing articles are written in the business press). In this case I would bet that the “Zara package” of management choices is the right one. The most important choice is, I think, the ability to move fast to meet consumer demand. Pushing decisions down to the local level naturally follows from that. This latest software is just an incremental improvement to that local decision-making process, all in service of making decisions quickly and turning product even faster.
Quick follow-up from last week’s essay on A/B testing.
Collin Crowell, a reader, posted about the essay on LinkedIn (thank you Collin!). A few comments from his post deserve a response:
From: Ronny Kohavi
This is a 5-year old paper, not recent, Edward Nevraumont. I was involved in giving the authors the data and feedback, and had strong concerns about the methodology and the conclusions. For example, extreme results that invoke Twyman's law (seem too good to be true) tend to be replicated, so you might get 2-3 experiments of positive ideas, but an analysis at the experimentation platform level assumes experiments are independent, so this naturally creates fat tails. The authors did some manual adjustments based on my feedback, but I'd take these results with a grain of salt.
First: I stand corrected on “recent”. I did not do my homework (this is the disadvantage of an AI editor vs my old proofreader who would catch this stuff!). I saw the paper shared multiple times a couple of weeks ago from different sources and just assumed it was new. I would love to understand how it popped back into the discourse when it did.
Second: I am not sure I follow Ronny’s point that “an analysis at the experimentation platform level assumes experiments are independent, so this naturally creates fat tails”. If the fat tails are an artifact of that dependence, it would change the conclusion, but I am not sure why they would be. I admit I am biased, in that the study matches my priors: I see a small number of tests making big differences, and many small tests with small “theoretical improvements” not helping much in aggregate.
From: Ryan Luncht
He left a number of comments in the thread, but also took it a step further and wrote a detailed response (thanks Ryan!). You can find it here. In his response he says a few things (I am doing my best to summarize accurately):
1. He agrees with my conclusion that teams should test more.
2. He claims my statement that “the ones that are 'statistically significant' usually aren’t — they are just noise” is a nonsense statement.
3. He does not like another statement of mine either: “I think there's an issue of logic with Nevraumont’s suggestion that 'tests do NOT need to be run to ensure statistical significance'. I would advise quite the opposite.”
I get what he is saying on #2, so allow me to rephrase. A test comes out significant because of some combination of the two variants being genuinely different, enough volume to identify that difference, and noise. If, in “objective reality,” most variants tested are not different at all, or only very slightly different, then most tests will not be able to pick up a real difference. You will still get statistically significant results, but most of the time those results will be driven by noise, not signal. When your test says +10% conversion with 95% confidence that it differs from control, it does not mean the most likely outcome is a 10% lift when you roll it out (try it and see!). The true lift is most likely on the far left-hand side of that bell curve of possibilities (and very UNLIKELY to be to the right). I have never heard of a test that showed a +10% lift, got rolled out, and then pleasantly surprised everyone by underestimating the impact, with a “true” lift of +15%.
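To make this concrete, here is a toy simulation in Python. The numbers are entirely my own illustrative assumptions (a 90/10 split between ideas that do nothing and ideas with a small +2% true lift, 20,000 visitors per arm, a 5% baseline conversion rate), not figures from the paper or from Ryan's post. It just runs a standard two-proportion z-test on each simulated experiment and looks at who the "winners" are.

```python
# Toy simulation of the "mostly noise" point above. All parameters here
# (share of null ideas, true lift size, traffic, baseline rate) are
# illustrative assumptions, not numbers from the paper or the essay.
import numpy as np

rng = np.random.default_rng(0)

n_tests = 5_000        # hypothetical experiments
n_per_arm = 20_000     # visitors per arm in each experiment
base_rate = 0.05       # control conversion rate

# Assume 90% of ideas truly do nothing; 10% deliver a +2% relative lift.
true_lift = np.where(rng.random(n_tests) < 0.9, 0.0, 0.02)

control = rng.binomial(n_per_arm, base_rate, n_tests) / n_per_arm
treated = rng.binomial(n_per_arm, base_rate * (1 + true_lift), n_tests) / n_per_arm

# One-sided two-proportion z-test for each experiment (~95% confidence).
pooled = (control + treated) / 2
se = np.sqrt(2 * pooled * (1 - pooled) / n_per_arm)
z = (treated - control) / se
winners = (z > 1.645) & (treated > control)

observed_lift = treated / control - 1
print(f"'significant' winners:             {winners.sum()}")
print(f"  ...with zero true lift:          {(winners & (true_lift == 0)).sum()}")
print(f"  average observed lift (winners): {observed_lift[winners].mean():+.1%}")
print(f"  average true lift (winners):     {true_lift[winners].mean():+.1%}")
```

In this setup, most of the significant winners come from ideas with zero true lift, and the average observed lift among winners is far above the average true lift, which is the rollout disappointment described above.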
On #3 I think I communicated poorly. I am NOT saying you should stop your tests before reaching statistical significance and assume that they worked. I am saying the opposite. When a test is showing a +1% lift in conversion early on, but would need another two weeks to know whether the result is significant, you might be better off killing the test and trying something else (i.e., assume that it did NOT work, or that if it did, the lift is so small it is not worth your time).
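To put a rough number on that opportunity cost, here is a back-of-the-envelope power calculation. The baseline conversion rate (5%), confidence (95%), and power (80%) are my assumptions, not figures from the essay; the formula is the standard normal-approximation sample size for a two-proportion test.

```python
# Rough sample-size calculation for a two-proportion test (normal approximation).
# Baseline rate, confidence, and power below are illustrative assumptions.
from scipy.stats import norm

def visitors_per_arm(base_rate, relative_lift, alpha=0.05, power=0.8):
    """Approximate visitors needed per arm to detect a given relative lift."""
    p1 = base_rate
    p2 = base_rate * (1 + relative_lift)
    z_alpha = norm.ppf(1 - alpha / 2)
    z_beta = norm.ppf(power)
    p_bar = (p1 + p2) / 2
    numerator = (z_alpha * (2 * p_bar * (1 - p_bar)) ** 0.5
                 + z_beta * (p1 * (1 - p1) + p2 * (1 - p2)) ** 0.5) ** 2
    return int(round(numerator / (p2 - p1) ** 2))

for lift in (0.01, 0.05, 0.10):
    n = visitors_per_arm(base_rate=0.05, relative_lift=lift)
    print(f"+{lift:.0%} relative lift on a 5% base rate: ~{n:,} visitors per arm")
```

At that baseline, confirming a +1% lift takes roughly a hundred times the traffic of confirming a +10% lift, which is why waiting weeks on a marginal result is often a worse use of the test slot than moving on to the next idea.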
Thanks again for all the discussion.
Keep it simple,
Edward