There’s a popular LessWrong post about how quickly the central limit theorem
overpowers the original shape of a distribution. In some cases, that is true –
but the conclusion of the article concerns me:
> This helps explain how carefree people can be in assuming the clt applies,
> sometimes even when they haven’t looked at the distributions: convergence really
> doesn’t take much.
This claim carries some risk, because assuming an inappropriate theoretical
distribution shields your reasoning from reality. Conditioning on a hypothesis
based on a theoretical distribution simplifies calculations, but it also
establishes a worldview in which plausible outcomes are practically impossible.
Both of these effects are especially pronounced when the distribution in
question is the Gaussian.
One would hope, then, that the claim in the post comes only after a thorough
look at the evidence. Unfortunately, the post relies heavily on eyeball
statistics and uses a cherry-picked set of initial distributions that are
well-behaved under repetition.
A lot of real-world data does not converge quickly under the central limit
theorem. Be careful about applying it blindly.
Now, let’s look at some examples and some alternatives to the clt.
The LessWrong post was filed under forecasting, so let’s put on our forecasting
hats and examine its claim using real-world data. Since the LessWrong post
indicates that 30 of something is enough for the sum to be effectively normally
distributed, I will talk about sums of 30 things.
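To spell out the approximation being invoked: if a single item has mean μ and
standard deviation σ, the central limit theorem says that the sum of 30
independent items is approximately

\[ N\!\left(\mu_{\text{sum}} = 30\,\mu,\;\; \sigma_{\text{sum}} = \sqrt{30}\,\sigma\right), \]

and that is the distribution we will compare against below.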
Will a set of files in my documents directory be larger than 16 MB?
Imagine for a moment that we need to transfer 30 files from my documents
directory on a medium that’s limited to 16 MB in size (maybe an old USB stick).
Will we be able to fit them all, given that we don’t yet know which 30 files
will be selected?
The mean size of all files is 160 kB (roughly 0.2 MB), and the standard deviation is 0.9 MB.
Since we are talking about 30 files, if the LessWrong post is to be trusted,
their sum will be normally distributed as
\[ N\!\left(\mu = 30 \times 0.2,\;\; \sigma = \sqrt{30} \times 0.9\right). \]
With these parameters, 16 MB should have a z score of about 2, and my
computer (using slightly more precise numbers) helpfully tells me it’s 2.38,
which under the normal distribution corresponds to a tail probability of 0.8 %.
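Here is a minimal R sketch of that calculation, using the rounded per-file
figures above; with these rounded inputs the tail probability comes out closer
to 2 %, and the 0.8 % quoted here relies on the more precise numbers.

```r
mu    <- 0.2   # mean file size, MB (rounded)
sigma <- 0.9   # standard deviation of file sizes, MB
limit <- 16    # capacity of the medium, MB

# z score of the 16 MB limit under the CLT-suggested normal
z <- (limit - 30 * mu) / (sqrt(30) * sigma)
z                              # about 2

# CLT estimate of the chance that the 30 files do not fit
pnorm(z, lower.tail = FALSE)
```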
So that’s it, there’s only a 0.8 % chance that the 30 files don’t fit. Great!
Except …
In this plot, the normal density suggested by the central limit theorem is
indicated by a black line, whereas the shaded bars show the actual total sizes
of random groups of 30 files in my documents folder (using find and perl to get
the sizes into R).
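For the curious, here is a minimal R sketch of the resampling behind such a
plot, assuming a vector file_sizes_mb with the size of every file in MB; the
variable name, the 10,000 resamples, and sampling without replacement are my
assumptions rather than details from the post.

```r
set.seed(1)

# Total size of many random groups of 30 files
totals <- replicate(10000, sum(sample(file_sizes_mb, 30)))

# Histogram of the totals with the CLT-suggested normal density on top
hist(totals, breaks = 100, freq = FALSE,
     xlab = "Total size of 30 files (MB)", main = "")
curve(dnorm(x, mean = 30 * mean(file_sizes_mb),
               sd = sqrt(30) * sd(file_sizes_mb)),
      add = TRUE, lwd = 2)

# Empirical chance that the 30 files do not fit on a 16 MB medium
mean(totals > 16)
```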
There are two things of note:
- The plot extends out to 50 MB because when we pick 30 files from my documents
  directory, one plausible outcome is that they total 50 MB. That happens. If
  you trust the normal distribution, you would consider that to be impossible!
- If we count the area of the bars beyond 16 MB, we’ll find that the actual
  chance of 30 files not fitting into a 16 MB medium is 4.8 %.
The last point is the most important one for the question we set out with. The
actual risk of the files not fitting is 6× higher than the central limit theorem
would have us believe.
If we adhere to the Gaussian hypothesis, we will think we can pick 30 files
every day and only three times per year will they not fit. If we actually try to
do it, we will find that every three weeks we get a bundle of files that don’t
fit.
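As a quick back-of-the-envelope check of those frequencies:

```r
365 * 0.008   # CLT estimate: about 3 failed transfers per year
365 * 0.048   # actual rate: about 18 per year, i.e. one roughly every three weeks
```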
Will a month of S&P 500 ever drop more than $1000?
Using the same procedure but with data from the S&P 500 index, the central
limit theorem suggests that a drop of more than $1000 over a 30-day horizon
happens about 0.9 % of the time.
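As a sketch, assuming a vector daily_moves holding the index’s daily point
changes over the sample period (a name I am introducing here, not one from the
post), the calculation mirrors the file-size example:

```r
# CLT estimate of the chance that 30 consecutive daily moves
# sum to a drop of more than 1000 points
pnorm(-1000,
      mean = 30 * mean(daily_moves),
      sd   = sqrt(30) * sd(daily_moves))
```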
This fit might even look good, but it’s easy to be f