As your self-anointed chief-skeptic-slash-statistician-slash-startup-commentator, I need to play Devil’s Advocate against the latest buzz from the entrepreneurial online illuminati: the Startup Genome Project, wherein global “rules of business” have been extrapolated from a survey of 650 startups.
But I prefer a discussion to a rant, so I showed this article to the kind folks at the Startup Genome Project and reprinted their response at the end.
Summarized here, their modern “rules of business” include “Solo founders take 3.6x longer to reach scale” and “Startups that pivot once or twice raise 2.5x more money than those that don’t.”
I love this sort of project and want to see more of its kind. At the very least it encourages introspection and self-questioning, and some people who have conveniently ignored modern business lore might now make a change for the better. And on a personal note, none of the following should be construed as an attack on the creators of the Startup Genome Project — I know their intentions are positive and they’ve invited well-intentioned criticism. So here comes some.
There are two statistical fallacies at work here, one inherent in the data, the other in how people will inevitably interpret this data, specifically when applying its “conclusions” to themselves.
Supposing you’re a solo founder, after reading that quote above you can’t help but start in with the self-doubt. “Maybe I need to think about finding a co-founder. Maybe life would be easier if I shared the burden. Maybe I could find a perfect complement that would be greater than the sum of the parts.”
But this is often wrong, because although the stat is true, it’s often incorrect to apply global patterns or trends to the individual.
To see why, here’s a trick question:
A person P living in Austin, Texas voted in the 2008 American presidential election for either Barack Obama or John McCain. If you had to make a bet, who did P vote for?
The obvious approach is to look at the global statistics on voters in America. More people voted for Obama than for McCain, so obviously we should wager that P voted for Barack. And in the absence of data, this is the correct bet.
This is just like taking a stat from the Startup Genome Project and placing a bet for your own company. But we’re not done.
We also know P lives in Texas — a red state where voters picked McCain by an 11% margin. So knowing that P is in Texas means that, national statistics notwithstanding, we should bet that P voted for McCain.
But we also know P lives in Austin, and Austin is a blue pimple in the red sea of the south. The Obama rally here was one of the largest in the country. So knowing that, we should wager that P voted Obama.
But most importantly there’s P the individual. At the ballot box, P acts according to personal idiosyncrasies, whatever the macro-statistics might say. Knowing even just a little about P — that she (yes, she!) is passionately pro-life, for instance — makes it almost impossible that she voted for Obama.
Sure, in the absence of data about P the individual, statistics are your best bet, but merely because they’re your only bet. When put that way, the statistical “trend” doesn’t seem especially useful.
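To put rough numbers on that (the probabilities below are illustrative guesses of mine, not actual election data), here is a minimal sketch of how the best bet flips each time you condition on more specific information:

```python
# Hypothetical probabilities, for illustration only -- not real election data.
# Each entry is P(voted Obama | what we know about the voter).
p_obama_given = {
    "US voter": 0.53,          # national popular vote leaned Obama
    "Texas voter": 0.44,       # Texas leaned McCain
    "Austin voter": 0.64,      # Austin leaned Obama
    "Austin voter, passionately pro-life": 0.05,  # individual info dominates
}

for known_info, p in p_obama_given.items():
    bet = "Obama" if p > 0.5 else "McCain"
    print(f"Knowing only '{known_info}': P(Obama) ~ {p:.2f} -> bet {bet}")
```

The last line is the whole point: once you know something specific about the individual, the broader averages stop driving the bet.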
To make this even more concrete with a story from my own career, take for example that statement about solo founders taking much longer to get to scale. For hundreds of thousands of solo founders, “getting to scale” isn’t even the goal. (Just ask the 10,000+ subscribers to Rob Walling’s excellent blog for the “micropreneur” who never want to hire a single employee or go to a single meeting, much less “scale.”)
My third company Smart Bear was a great example of this. I was a solo founder without intent to scale, and during the first two years growth was slow, as you’d expect. It wasn’t until a serendipitous check for $50k appeared, giving me the cushion to hire an employee and see what happened, that I contemplated what it might mean to chase “scale.” Then I did chase, and we did scale, possibly 3.6x later than it would have been otherwise; not because going solo was the wrong path, but because I chose a different path.
Macro-level data are useful for trends but not for understanding individual outcomes. Startup Genome conclusions are fascinating as high-level trends and certainly suggest possible behavior, but it’s like predicting P’s behavior from national statistics — it often doesn’t apply.
So far, none of this represents a problem with the data or conclusions themselves; the problem is in how people will inevitably misinterpret them. That part can be dispelled with awareness, i.e. by reading up to this point.
But the other problem is inherent in the methodology of the experiment.
First, it suffers from Survivor Bias (click for definition and examples) — they interviewed 650 companies, which means extant companies, which means we don’t know which of these trends contributed to success and which are just trends that everyone is doing, even the companies who failed.
It’s like looking at log cabins in the west and remarking how solidly they’re built. But the ones which weren’t built so solidly — most of them — have collapsed. If the solid ones were in fact constructed differently, that would be a de facto experiment in how best to build log cabins, but if they were all constructed using the same few techniques — which they were — you’d conclude that fundamental technique is less important than maintenance or geography or luck or something else.
The Startup Genome Project looks only at standing log cabins and thus isn’t telling us what separates the successes from the failures. If the failures follow the same patterns as the successes, the “patterns” are descriptive but not helpful in guiding us to building better companies.
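Here’s a minimal simulation of that log-cabin problem (hypothetical numbers, my own sketch, not the project’s data): a “pattern” that is equally common among survivors and failures will still show up in nearly every survivor you interview, and will look like part of the recipe.

```python
import random

random.seed(42)

N = 10_000
companies = []
for _ in range(N):
    pivoted = random.random() < 0.8        # the "pattern": most startups pivot, period
    survived = random.random() < 0.10      # survival driven by something else (luck, here)
    companies.append((pivoted, survived))

survivors = [c for c in companies if c[1]]
pivot_rate_survivors = sum(p for p, _ in survivors) / len(survivors)
pivot_rate_everyone = sum(p for p, _ in companies) / N

print(f"Pivot rate among survivors: {pivot_rate_survivors:.2f}")
print(f"Pivot rate among everyone:  {pivot_rate_everyone:.2f}")
# Both come out around 0.8 -- interviewing only survivors tells you the pattern
# is common, not that it separates the successes from the failures.
```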
Second, it suffers from the Pattern-Seeking Fallacy (click for definition and examples) — they generated a ton of data, asked a ton of questions, “discovered” that some of their initial theories were statistically-significant, and published only the significant results.
After reading that linked article, you’ll instantly recognize the fallacy glistening in this quote from Genome co-creator Ron Berman:
In our process we created a huge amount of different cuts, views, cross tabulations, graphs, means and statistical significance tests to check relationships and correlations in our data. Once we had those, we leaned back, looked at them from afar and tried to ask ourselves “do these results make sense.”
The fact that statistically-insignificant results were not published is as problematic as not investigating startup failures — it removes half the story, half the perspective.
Specifically: They constructed a model of how they believe companies look and behave, then asked a bunch of questions to see if the model is supported by data. With hundreds of questions and literally millions of data points correlating one thing to another, they found some data supporting some parts of the model, and some data did not (i.e. was not statistically significant). But they reported only the supporting data.
Half the story is not acceptable, not when you have so much data that at the significance levels they reported (often 90% and sometimes even 80%), the Pattern-Seeking Fallacy tells us they are guaranteed to find “significant” results.
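To see why “guaranteed” is barely an exaggeration, here’s a hypothetical sketch (my own toy simulation, not their analysis): generate pure noise, test a few hundred made-up hypotheses against it at roughly the 90% significance level, and count how many come back “significant.”

```python
import random
import statistics

random.seed(1)

def fake_hypothesis_test(n=50):
    """Compare two groups of pure noise; return True if the difference
    looks 'significant' at roughly the 90% level (|z| > 1.645)."""
    a = [random.gauss(0, 1) for _ in range(n)]
    b = [random.gauss(0, 1) for _ in range(n)]
    diff = statistics.mean(a) - statistics.mean(b)
    se = (statistics.variance(a) / n + statistics.variance(b) / n) ** 0.5
    return abs(diff / se) > 1.645

num_hypotheses = 300
false_positives = sum(fake_hypothesis_test() for _ in range(num_hypotheses))
print(f"{false_positives} of {num_hypotheses} pure-noise hypotheses came back 'significant'")
# Expect roughly 10% -- about 30 publishable-looking "patterns" from nothing at all.
```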
Human beings are incurable pattern-seekers; we see patterns when there are none, and we desperately latch onto theories which purport to explain the mysteries of the world, from theories of physics to battles of religion to the Modern Laws separating successful startups from the failures.
We should listen, we should contemplate, but we shouldn’t blindly follow.
Counter-point, by Ron Berman and the Startup Genome Team
Let me start this commentary (or counterpoint, if you prefer) from the end.
There are a lot of smart things in this post, but if you must remember just one thing, it is “don’t blindly follow.”
I couldn’t agree more, which is why I consider responding to Jason’s blog
post a treat, for two reasons:
- I read this blog, quite frequently.
- The criticism is well thought out and to the point. This is how we improve our research, and I hope you can help us with it as well.
To make reading easier, this is the summary of my comments. The full
comments are below.
- The comment about applying statistics to individuals is correct. However, it is not a fallacy, it is a feature. Really. In addition, we do not focus on providing benchmarks using a macro view. We delve into specific details of startups to classify them into specific groups.
- Survivorship bias may indeed be a valid issue in our sample, and I will explain below how we treated it and will continue to treat it.
- The pattern-seeking fallacy is constantly an issue with data-driven analysis not based on experimentation. We gave careful attention to including only results we can explain and show significance for. There is some (small) chance that some of them are wrong. We consider this in our analysis.
And now, for the full commentary. Some of it might sound philosophical, but
that’s sometimes the nature of statistics.
1. Does it apply to me?
The results of the Startup Genome report, and any statistic for that
matter, rarely apply to an individual sample. This is the nature of the word
statistic. It is a summary of analysis on data, and if you pick one item from the data,
randomly and independently, then on average the result will apply. In other words — only if
you perform the experiment again and again, each time re-choosing the
individual sample, will our result, and any statistical result apply.
Why is this a feature? Because statistics are meant to be able to describe
and predict samples and populations, not individuals.
There is validity to the criticism that one should not act on macro-trends, but we don’t tell entrepreneurs to do that. Our prescriptive advice is more in the benchmarks.
Secondly, our framework allows us to offer increasingly personalized advice. Startups are given advice not just based on the global averages but based on their type and stage. In the future we are even considering incorporating benchmarks that compare people with the same ambitions (such as not wanting to create a scalable company).
2. Survivorship bias can be dealt with using a longitudinal panel, or by
designing an experiment. Unfortunately, we cannot randomly allocate
exogenous conditions to startups (“Hey you, want to start a company? Can
you please do it with two randomly assigned co-Founders?”).
Our results therefore apply only to the sample we analyzed. In other words,
if you take another sample of startups, they might not apply.
We approached this issue from two directions. So far we have analyzed the stage of a startup, rather than a binary “success/failure” outcome. This handles the survivorship bias partially, since startups that are about to die appear in our data as being in a lower stage after a much longer time. Essentially, this technique is similar to using a person’s age as a proxy for their likelihood of dying in the next year. We do not know for sure whether it will happen, but on average we are correct.
Another standard data-driven solution for this issue is cross-validation of the data (also known as out-of-sample validation). The method is simple: the data is segmented randomly into “training” and “validation” groups. The model is then fit on the “training” data, and the results are checked on the “validation” data. If the model predicts well on the validation data, it has a higher chance of being correct.
We have not used this technique in our current report, as the sample is too
small, but plan on applying it in the future. There are other assumptions
we made to handle this bias, but I hope this served as a good reference on
what we did.
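For readers who have not seen cross-validation before, here is a minimal sketch of the idea. The data, the model, and the use of scikit-learn are purely illustrative assumptions, not our actual pipeline:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-in for survey data: a few startup features and a binary outcome.
X = rng.normal(size=(650, 5))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=1.5, size=650) > 0).astype(int)

# Split into "training" and "validation" slices, as described above.
X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.3, random_state=0)

model = LogisticRegression().fit(X_train, y_train)
print(f"Accuracy on training data:   {model.score(X_train, y_train):.2f}")
print(f"Accuracy on validation data: {model.score(X_valid, y_valid):.2f}")
# If the validation score is close to the training score, the "pattern" the model
# found is less likely to be an artifact of this particular sample.
```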
3. Pattern-seeking fallacy. There are two issues here: will incorrect results be reported just because there is so much data, and why do you only report statistically significant results?
The answer is “Yes,” and “Yes.”
There might be incorrect results being reported. In general whenever you
receive a result which is significant with 95% confidence, there is 5%
chance it is incorrect. In other words, out of every 20 graphs/results
anyone publishes anywhere, there is probably one incorrect. The solution is
to retest the hypotheses constantly. For this reason we collect more and
more data, and retest.
As for publishing only (or mostly) significant results — this is a subtle
issue. When a result does not pass a statistical significance test, it does
not mean it is incorrect, or that the opposite is correct. It just means
there is not enough data to tell. You cannot and should not publish such a
result, because its only conclusion is “I need more data.”
When we built our analysis, we had a specific model and a (long) list of hypotheses that we tested.
This is not the same as observing a streak of 10 heads when throwing a coin
1,000 times and concluding the coin is biased. Reaching such a conclusion
for a coin is just applying statistics incorrectly. Streaks of coin results
are not a sufficient statistic (at least in this example). The average of the results is.
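The coin example is worth simulating, because the intuition it corrects is strong. A quick sketch (with made-up flips, purely illustrative):

```python
import random

random.seed(7)

trials = 1_000
flips_per_trial = 1_000
hits = sum(
    "H" * 10 in "".join(random.choice("HT") for _ in range(flips_per_trial))
    for _ in range(trials)
)
print(f"{hits}/{trials} fair-coin sequences of 1,000 flips contained a run of 10+ heads")
# Roughly 4 in 10 do -- so a single streak of heads says nothing about the coin being biased.
```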
We had some very strange results that turned up in our data. The sample
size proved too small to yield any conclusion. Instead of publishing a wild
speculation based on them, we preferred to collect more data, and publish it
later if the results make sense (in a statistical sense).
Which is where all of the readers of this blog (and other blogs) can come
to help. First of all, we encourage you all to fill out the 2nd version of
the Startup Genome survey. Not only will you gain insight about your specific firm compared to other similar firms, you will also help us provide the community with better and finer-tuned results in the future. In
addition, you will experience first hand what all this hoopla is about, and
can hopefully develop your own intuition about the validity of the results.
Second, we improve by trying, making mistakes and correcting them. The
standard academic process applied to such research is called “counterfactual analysis”, or in plain English, “checking for alternative explanations.” Pointing out potential biases in our data and methodology is
useful. However, if you can also provide an alternative explanation to the
results that we can test empirically on the data, then our work will truly
reach deeper insights and potentially greatness.
So, thanks Jason for investing the time in this analysis. We have given, and will continue to give, your excellent feedback the attention it deserves, and will improve constantly. We hope to hear more and learn more.
–The Startup Genome Team
20 responses to “Point/Counter-Point: Startup Genome Project Considered Harmful”
Ron Berman says “In general whenever you receive a result which is significant with 95% confidence, there is 5% chance it is incorrect.”
No. In general, when you test 100 false hypotheses to 95% confidence, 5 of them will incorrectly pass the test.
The proportion of results you get that are significant to 95% confidence depends on the proportion of hypotheses you test that are actually true.
I didn’t get this. Could you elaborate, please?
Maybe I’m wrong, but I don’t think you really addressed Jason’s point about fishing-for-relationships. You might want to take a look at the Science-Based-Medicine critiques of Evidence-Based-Medicine. Or check out http://xkcd.com/882/
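To make the earlier commenter’s distinction concrete, here is a quick toy simulation (made-up rates, purely illustrative): the share of “significant” findings that are actually false depends on how many of the tested hypotheses were true to begin with, not just on the confidence level.

```python
import random

random.seed(3)

def simulate(num_hypotheses, fraction_true, alpha=0.05, power=0.8):
    """Return the share of 'significant' findings that are false positives."""
    false_discoveries = true_discoveries = 0
    for _ in range(num_hypotheses):
        is_true = random.random() < fraction_true
        if is_true:
            significant = random.random() < power    # real effect, detected with some power
        else:
            significant = random.random() < alpha    # no effect, passes the test 5% of the time
        if significant:
            if is_true:
                true_discoveries += 1
            else:
                false_discoveries += 1
    total = false_discoveries + true_discoveries
    return false_discoveries / total if total else 0.0

for frac in (0.5, 0.1, 0.01):
    fdr = simulate(100_000, frac)
    print(f"{frac:>5.0%} of hypotheses true -> {fdr:.0%} of 'significant' results are wrong")
# When few of the tested hypotheses are true, far more than 5% of the
# "significant" results are false, even though each test used 95% confidence.
```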
I don’t quite agree with the first point (“solo-founder” example). This report is based on statistics. If you have absolutely no idea what to do (like predicting the next result of flipping a coin) you should go with what is most likely to succeed (in the coin example there is no more likely result, so you guess, but in the case of the solo-founder you should probably find a co-founder). Of course it is unlikely that you find yourself in a situation where you have no other factors influencing you, but if you do, going with the average is the correct choice…
If you view the project as a cookbook of success you’ll be disappointed, but I did find a few good points of wisdom in the study that I’ve heard before. It’s also important to note that the study really makes sense if you’re doing a tech startup in the Valley in 2011 — for any other type of startup those “rules” don’t really apply.
Love this….do not just look at backward facing statistics to validate or invalidate a point…as a friend of mine often says, even a squirrel finds a nut every now and again.
That’s a (too) long reply from them, addressing none of the concerns you raise. I’d say their statistics are completely made up. Not publishing the non-significant hypotheses is the most glaring error. The rest could be just the expected chance.
But why write such a long nonsensical reply? Is it the usual economics dislike of maths shining through again? Or do they not want help to do better? I think the latter is just as likely.
Though I agree that some points weren’t fully refuted, I definitely do not believe they made up data or were nefarious. I also think they’ll take some of this to heart for the next round of surveys and analysis.
But I also fear that given who is involved with the project, there’s a vested interest in asking certain questions and seeing the data come out a certain way to validate what they’ve been preaching for the last few years.
Still we have to be careful not to damn them — it’s quite possible that what they’re trying to “prove” and what’s actually true are identical things! But we have to be mindful that neither are they an unbiased organization.
Applying statistics heavily boils down to the old “What do you know, and when do you know it?”
(1) Voting and Conditioning.
For the voting example, that is, McCain or Obama: given that the voter was in Texas, we are using the ‘conditional probability’ given Texas. Then, with the voter also in Austin, we are ‘conditioning’ again.
Conditioning is powerful stuff, and can be seen as the best way to use ‘information’. E.g., if you have X and want to predict Y, you can consider the conditional expectation of Y given X, that is, E[Y|X]. This is a ‘good’ estimate of Y in that it is ‘unbiased’, that is, the expectation of Y equals the expectation of the estimate: E[Y] = E[E[Y|X]].
Next, E[Y|X] is the most ‘accurate’ estimate of Y in the sense that E[Y|X] = f(X) for some function f, and this f makes E[(Y - f(X))^2] as small as possible.
We want ‘all’ of the ‘information’ in X. So if there is some function g and we take E[Y|g(X)], it is immediate from what we’ve said that we can’t hope that this will be a better estimate of Y than f(X) = E[Y|X].
We should also notice that if Y really is a ‘function’ of X, then E[Y|X] = Y, that is, with X we will have enough ‘information’ to predict Y exactly.
Generally, and intuitively, the more ‘information’ in X, the better our prediction.
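A quick numerical check of these claims (a toy simulation with made-up data): predicting with the full conditional expectation E[Y|X] gives the smallest mean squared error, and a coarsened g(X) cannot beat it.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

x = rng.normal(size=n)
y = x + rng.normal(scale=1.0, size=n)      # here E[Y|X=x] = x

def mse(pred):
    return float(np.mean((y - pred) ** 2))

# Coarsen X down to just its sign, then use the best predictor based on that alone.
sign = np.sign(x)
pred_from_sign = np.where(sign > 0, y[sign > 0].mean(), y[sign <= 0].mean())

print(f"Predict with E[Y]           : MSE = {mse(np.full(n, y.mean())):.2f}")   # ~2.0
print(f"Predict with E[Y | sign(X)] : MSE = {mse(pred_from_sign):.2f}")         # ~1.4
print(f"Predict with E[Y | X] = X   : MSE = {mse(x):.2f}")                      # ~1.0
# Conditioning on more information never hurts: the full E[Y|X] attains the
# smallest mean squared error, which is just the variance of the leftover noise.
```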
Okay, back to voting: for the ‘information’ X we’ve considered being in Texas and then in Austin. But there could be much more relevant ‘information’ about the given person, and with enough such information we could predict exactly. Or, if we ask the person to explain their vote, they may give data particular to them and never mention either Texas or Austin, and this data may be compelling. So, it can be that the voter knew exactly who they were going to vote for, and why, in precise terms. The estimate E[Y|X] would also ‘know’ this if X contained all that information.
(2) The Wright Brothers and Conditioning.
Let’s move away from voting to something closer to startup success; let’s consider the Wright Brothers and their efforts at having the world’s first controlled, powered flight.
So, what were their chances? Well, given that they were just another effort on the long list of efforts, including Langley, who had recently fallen into the Potomac River, the chances were poor.
But the Wright Brothers knew better! How? For the part about ‘control’, they clearly understood that this was a severe challenge and, for a solution, had worked out three axis control. They were fairly sure that they could ‘control’ the airplane. For the ‘powered’ part, they had worked out fairly carefully the drag of their aerodynamics (yes, they missed a point about Reynolds number) from their wind tunnel (itself a nice step forward), but basically they knew how much propeller thrust they needed. Then they also knew how much horsepower they needed for that much propeller thrust.
So, they ‘knew’; that is, they had much more information than that they were just another ‘effort’ on the old list of 100% failures.
(3) Startup Success.
Similarly for startups: What is crucial is that the founder(s) ‘see their way clear’ to success, and here the information they use might, like for the Wright Brothers, be not available in the statistics about startups so that, really, those statistics are irrelevant for them.
But, wait; there’s more!
For a venture funded startup, the goal is a ‘big win’, e.g., an ‘exit’ of at least $50 million since less than that the financial arithmetic doesn’t work out for the venture firm. Really, for a Series A investment of a few million dollars, the venture firm wants to shoot for an exit above $500 million, and another Google or Facebook would be very welcome.
Okay, how to know? Well, we, both the entrepreneurs and the venture partners, are at the beginning looking for something rare and exceptionally good. Intuitively we believe, and correctly, that we won’t get much insight on how to have such success looking at data on efforts that included few or no such successes. E.g., we’re not going to get much insight on how to win the NBA finals by looking at statistics from junior high basketball! So, that’s how not to know.
Still, how to know? Follow the Wright Brothers. That is, ‘engineer’ the success. Then study the engineering in detail. The approach usually recommended is:
(A) Problem. Pick a problem, currently solved at best poorly, where a few customers are willing to pay a lot or many customers are ready to pay at least a little.
(B) Solution. Find a much better solution to this problem. Have the solution ‘defensible’, that is, difficult to duplicate or equal. I.e., in Buffett’s words, build a protective ‘moat’ around the business.
So, to evaluate such a startup, look in detail at (A) and (B).
‘Traction’? John Glenn didn’t use that! He needed to know that the whole system would work from launch pad to orbit, reentry, splashdown, and back home. Just ‘traction’, that is, just getting 50 feet off the launch pad, was not at all promising by itself. For the rest of what he needed to know, ‘early traction’ was irrelevant; he needed to consider the engineering.
(4) Applying Statistics — Case I.
Berman writes:
“The results of the Startup Genome report, and any statistic for that matter, rarely apply to an individual sample. This is the nature of the word statistic. It is a summary of analysis on data, and if you pick one item from the data, randomly and independently, then on average the result will apply. In other words — only if you perform the experiment again and again, each time re-choosing the individual sample, will our result, and any statistical result apply. Why is this a feature? Because statistics are meant to be able to describe and predict samples and populations, not individuals.”
Nonsense.
The main goal of the statistical work is to predict, including for one individual. Indeed, if we could not usefully apply the work to an individual, then likely we shouldn’t bother reporting the results.
The work CAN apply well to an individual IF the ‘information’ the statistics used is at least close to all the data the individual has. If the individual has much more data, then, sure, the statistics need not apply to them.
E.g., the card counting statistics in Black Jack DO apply — in a fair game, quite effectively, actually — to individual Black Jack players who want to use card counting.
(5) Applying Statistics — Case II.
Suppose we have statistics on startups with two founders and then with three founders. Suppose the statistics say that on average three founders do much better. Then should a team of two founders reading this statistical result rush out and get a third founder?
That is a TOUGH issue to address.
Should the Wright brothers have rushed out and found a third founder? No. Why not? Because the Wright brothers had done their engineering and saw their way to success.
If a founding team, even with just one founder, sees their way clear to success, then the statistics about three founders should be nearly irrelevant.
Why? Because the statistics did little or nothing with the additional information about “see way to success” so that a team that does so see has some crucial extra information.
Startup Genome fails to take into account that pivots are more tied to a funding environment – not the Founder’s DNA. Pivots are funding induced. The start-ups pivot in response to investors to take on funding or else have to pivot after they take on funding.
I wish the startup genome looked at what investor networks were tied to the pivots and whether the investor induced pivots were successful.
It bugs me that “raised X more money” is touted as success when certain people are reporting on the conclusions. I understand why it’s there but hate to think that entrepreneurs might view it as a comparative metric for success, when it really isn’t.
Has Jason or Ron read Nassim Taleb’s Black Swan? It definitely bolsters Jason’s argument, albeit from a trading and investment perspective. We humans love to see patterns and ‘obvious’ conclusions where there are none.
That being said, the Startup Genome Project may possess some valid conclusions. But they seem to just reinforce conventional wisdom – multiple co-founders, pivot, etc. What, then, is the analysis useful for?
Glad to see a civil intelligent debate on this. I agree that survivorship bias seems present.
Great project! Is this similar to the Music Genome Project in terms of creating vectors and algorithms? If so, I assume the process or tool would allow start-ups to form and prosper based on success patterns that are validated by data? Also, how (or will) market demand factor into the equation?
Thanks,
Steve
We’d like to join the debate; see what we have to say: http://bit.ly/mjR8GR
Love what you just did – comparing macro level data to individual actors and the factors that really matter. Good clarification!