We can ‘solve’ MNIST up to ~78% accuracy with the following code-golfed obscurity:
import gzip
from collections import Counter
import numpy as np

c = lambda z: len(gzip.compress(z.tobytes()))
def ncd(x, y):
    return (c(x + y) - min(c(x), c(y))) / max(c(x), c(y))
cls = [(x, c(x), l) for x, l in training_set]
correct_predictions = sum([np.array_equal(Counter(
    [l for _, _, l in sorted([(ncd(x1, x), x, l) for x, _, l in cls],
                             key=lambda t: t[0])[:5]]).most_common(1)[0][0], label)
    for x1, label in test_set])
If you just want to see the code, here is a link to the Jupyter Notebook for running this experiment.
Let's dive into the why and how. Yesterday, during the one-hour train ride from Eindhoven to Rotterdam, I was inspired by the post on text generation from data compression and the (quite controversial) paper on parameter-free text classification to play around with compression as an image classification mechanism. Having previously worked on image compression for computer vision on the edge, I was keen to apply the technique to the most seminal yet basic dataset, so I attempted to use GZIP + k-NN as a classifier for the MNIST (handwritten digits) dataset.
The technique boils down to two components: GZIP with NCD (Normalized Compression Distance) as a similarity metric, and k-NN (k-Nearest Neighbors) for classification. GZIP essentially gives us a way to measure the complexity, or information content, of individual data points, and NCD provides a normalized measure of how similar two data points are, based on how much more (or less) effort it takes to compress them together compared to compressing them separately.
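To make the NCD part concrete, here is a minimal sketch of the distance on its own, assuming 28x28 uint8 NumPy arrays like the MNIST images; the two synthetic test images and the print at the end are purely illustrative and not part of the original code. Note that x + y here is an element-wise sum of the arrays, matching the golfed snippet, whereas textbook NCD concatenates the two byte strings.
import gzip
import numpy as np

def c(z):
    # Proxy for the information content of a sample: size of its gzip-compressed bytes.
    return len(gzip.compress(z.tobytes()))

def ncd(x, y):
    # NCD = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)).
    # Here "xy" is the element-wise sum of the two arrays, as in the golfed code above.
    return (c(x + y) - min(c(x), c(y))) / max(c(x), c(y))

# Two synthetic 28x28 "digits": a filled square and a vertical bar.
a = np.zeros((28, 28), dtype=np.uint8); a[10:18, 10:18] = 255
b = np.zeros((28, 28), dtype=np.uint8); b[4:24, 13:15] = 255
print(ncd(a, a), ncd(a, b))  # the self-distance should come out smaller than the cross-distance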
For each test sample, the algorithm computes its NCD against every training sample (in our case, 100 training samples), sorts the resulting distances, and selects the k smallest.
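For readability, here is a de-golfed sketch of that prediction step. It assumes training_set and test_set are lists of (image, label) pairs with uint8 NumPy image arrays and plain integer labels; the helper names c, ncd, and predict are mine, and k = 5 mirrors the [:5] in the golfed version above.
import gzip
from collections import Counter
import numpy as np

def c(z):
    # Compressed size of the raw bytes, as in the golfed snippet.
    return len(gzip.compress(z.tobytes()))

def ncd(x, y):
    # Same NCD helper as in the sketch above.
    return (c(x + y) - min(c(x), c(y))) / max(c(x), c(y))

def predict(test_image, train, k=5):
    # NCD from the test image to every training image, paired with that image's label.
    distances = [(ncd(test_image, train_image), label) for train_image, label in train]
    # Keep the k nearest neighbours and take a majority vote over their labels.
    nearest_labels = [label for _, label in sorted(distances, key=lambda t: t[0])[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]

# Accuracy over the test set, mirroring the golfed one-liner.
accuracy = np.mean([predict(x, training_set) == y for x, y in test_set])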