At Jane Street, we often work with data that has a very low
signal-to-noise ratio, but fortunately we also have a lot of data.
Where practitioners in many fields might be accustomed to
having tens or hundreds of thousands of correctly labeled
examples, some of our problems are more like having a billion training
examples whose labels have only a slight tendency to be correct.
These large datasets present a number of interesting engineering
challenges. The one we address here: How do you shuffle a really
large dataset? (If you’re not familiar with why one might need this,
jump to the section Why shuffle below.)
For a dataset x[0], ..., x[n - 1] that fits in RAM, you can shuffle using something like
Fisher–Yates:

for i = 0, ..., n - 2 do
    swap x[i] and x[j], where j is a random draw from {i, ..., n - 1}
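
In Python, for instance, that in-memory shuffle is only a few lines (a minimal sketch; in practice random.shuffle does the same thing):

import random

def fisher_yates_shuffle(x):
    # Shuffle the list x in place, one unbiased swap per position.
    n = len(x)
    for i in range(n - 1):
        j = random.randrange(i, n)  # uniform draw from {i, ..., n - 1}
        x[i], x[j] = x[j], x[i]
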
But what if your dataset doesn’t fit in RAM?
I will present the algorithm I use for shuffling large datasets. It
isn't novel, and one can find multiple instances of people reinventing
it or something similar (and in essence it descends from Rao). However, I don't know
of anywhere that states the algorithm, shows why it’s correct, and
gets into the particular practical issues we address below. Also,
when I first encountered this problem and searched online for an
answer, I didn’t find any of the good examples above, just lots of bad
ideas, so hopefully this post will improve the odds for the next
person.
To be clear, this is not some minor performance hack. For large
datasets, it makes the difference between feasible and infeasible.
(See appendix for a more quantitative comparison.)
A 2-pass shuffle algorithm
Suppose we have data x[0], ..., x[n - 1].
Choose an M sufficiently large that a set of n/M points can be shuffled
in RAM using something like Fisher–Yates, but small enough that you can have
M open files for writing (with decent buffering). Create M “piles”
p[0], ..., p[M - 1]
that we can write data to. The mental model of a “pile” here is that it’s a
file you can append to, but in practice
you might, say, have several piles exist as datasets in the same HDF5
file. The first pass of the algorithm is to split the data into these
M piles, and the second pass shuffles each pile and appends it to
the final result.
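
To give a rough sense of scale for choosing M, here is a back-of-the-envelope calculation; all the numbers are made up purely for illustration:

# Hypothetical sizing, only to illustrate the constraint on M.
dataset_bytes = 10**12                        # say ~1 TB of data on disk
ram_budget = 8 * 10**9                        # and ~8 GB usable for the in-RAM shuffle
min_piles = -(-dataset_bytes // ram_budget)   # ceiling division: at least 125 piles
# Taking M = 1000 leaves plenty of slack and is still a modest number of open files.
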
-- First pass
create empty piles p[0], ..., p[M - 1]
for i = 0, ..., n - 1 do
    j := uniform random draw from {0, ..., M - 1}
    append x[i] to pile p[j]

-- Second pass (perhaps done lazily)
for j = 0, ..., M - 1 do
    shuffle p[j] in RAM with Fisher-Yates or whatever is convenient
    append p[j] to output file
Example of a shuffle: We start with unshuffled data (top); the first
pass leaves M=6 piles (middle); the second pass yields shuffled data (bottom).
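
Here is a minimal Python sketch of the two passes, assuming each record is a single line of text and using one temporary file per pile; the helper name and file layout are illustrative, not prescriptive:

import os
import random
import tempfile

def two_pass_shuffle(input_path, output_path, num_piles):
    # First pass: send each record to a uniformly random pile.
    pile_dir = tempfile.mkdtemp()
    piles = [open(os.path.join(pile_dir, "pile_%d" % j), "w+")
             for j in range(num_piles)]
    with open(input_path) as f:
        for line in f:
            random.choice(piles).write(line)
    # Second pass: shuffle each pile in RAM and append it to the output.
    with open(output_path, "w") as out:
        for p in piles:
            p.seek(0)
            records = p.readlines()
            random.shuffle(records)
            out.writelines(records)
            p.close()
            os.remove(p.name)
    os.rmdir(pile_dir)

You would call it as, say, two_pass_shuffle("data.txt", "shuffled.txt", num_piles=1000), with hypothetical file names; if your piles live as datasets in a single HDF5 file instead, the structure of the two passes is the same.
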
Assuming you have enough memory to satisfy the above constraint on
M and assuming that drawing a random number is O(1), this is a
linear time algorithm; the constant factor is dominated by having to
read and write each data point twice in external storage (but the
reading/writing can be done in blocks rather than one point at a time).
Since the reading and writing is stream-oriented, the algorithm still
works for data with variable record length.
To see that the 2-pass shuffle yields an unbiased random permutation,
consider another algorithm already known to be correct: draw
U[0], ..., U[n - 1] ~ Uniform(0,1), associate x[i] with U[i], and
sort by U[i]; this yields an unbiased permutation. Our algorithm
above can be seen to be equivalent to this: for M=1000, the choice
of pile is like radix sorting on the first 3 digits of U[i], and then
shuffling within each pile is like sorting on the remaining digits.
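
If you want to convince yourself concretely, here is a toy Python check (not part of the algorithm itself) showing that bucketing on the leading part of each key and then sorting within buckets reproduces the full sort by key:

import random

xs = list(range(20))
keys = [random.random() for _ in xs]

# Known-correct shuffle: sort the data by its uniform random keys.
full_sort = [x for _, x in sorted(zip(keys, xs))]

# 2-pass view of the same computation: the "pile" is the leading part of the key,
# and shuffling within a pile corresponds to sorting on the rest of the key.
num_buckets = 6
buckets = [[] for _ in range(num_buckets)]
for k, x in zip(keys, xs):
    buckets[int(k * num_buckets)].append((k, x))
bucketed = [x for b in buckets for _, x in sorted(b)]

assert bucketed == full_sort
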
Dealing with oversized piles
Even if the expected pile size is small enough to shuffle in RAM, there is
some chance of getting an oversized pile that is too large to shuffle in RAM.
You can make the probability of getting an oversized pile very small: if the
expected pile size is s, its standard deviation is slightly under √s (each
pile's size is binomial with mean s and variance s(1 - 1/M)), so you can
just arrange for, say,
s +