
Nvidia is an amazing company that has executed a contrarian vision for decades, and has rightly become one of the most valuable corporations on the planet thanks to its central role in the AI revolution. I want to explain why I believe its top spot in machine learning is far from secure over the next few years. To do that, I’m going to talk about some of the drivers behind Nvidia’s current dominance, and then how they will change in the future.
Here’s why I think Nvidia is winning so hard right now.
#1 – Almost Nobody is Running Large ML Apps
Outside of a few large tech companies, very few corporations have gotten as far as actually running large-scale AI models in production. They’re still figuring out how to get started with these new capabilities, so the main costs are dataset collection, hardware for training, and salaries for the people building models. This means that machine learning spending is focused on training, not inference.
#2 – All Nvidia Alternatives Suck
If you’re a developer creating or using ML models, working with an Nvidia GPU is far easier and less time-consuming than with an AMD OpenCL card, a Google TPU, a Cerebras system, or any other hardware. The software stack is much more mature; there are many more examples, tutorials, and other resources; finding engineers with Nvidia experience is much easier; and integration with all of the major frameworks is better. There is no realistic way for a competitor to beat the platform effect Nvidia has built. It makes sense for the current market to be winner-takes-all, and they’re the winner, full stop.
#3 – Researchers have the Purchasing Power
It’s incredibly hard to hire ML researchers; anyone with experience has their pick of job offers right now. That means they need to be kept happy, and one of the things they demand is use of the Nvidia platform. It’s what they know and what they’re productive with; picking up an alternative would take time and wouldn’t build skills the job market values, whereas working on models with the tools they’re comfortable with does. Because researchers are so expensive to hire and retain, their preferences are given a very high priority when purchasing hardware.
#4 – Training Latency Rules
As a rule of thumb, models need to be trainable from scratch in about a week. I’ve seen this hold true since the early days of AlexNet, because if the iteration cycle gets any longer it’s very hard to do the empirical testing and prototyping that’s still essential to reach your accuracy goals. As hardware gets faster, people build bigger models up until the point that training once again takes roughly the same amount of time, and they reap the benefits through higher-quality models rather than reduced total training time. This makes buying the latest Nvidia GPUs very attractive, since your existing code will mostly just work, but faster. In theory there’s an opportunity here for competitors to win with lower latency, but the inevitable friction of porting code to an immature stack and retraining a team on unfamiliar tools usually outweighs the speed advantage.
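To make the “about a week” rule concrete, here’s a rough back-of-envelope sketch of my own (not from any vendor’s numbers), using the common approximation that training a dense transformer costs roughly 6 FLOPs per parameter per token; the GPU count, peak throughput, and utilization figures below are assumptions chosen purely for illustration.

```python
# Back-of-envelope sketch: estimate wall-clock training time for a dense model.
# Approximation used: total training FLOPs ~= 6 * parameters * tokens.
# All hardware numbers below are assumed for illustration only.

def training_days(params: float, tokens: float, gpus: int,
                  peak_flops_per_gpu: float = 1.0e15,  # ~1 PFLOP/s-class accelerator (assumed)
                  utilization: float = 0.4) -> float:   # assumed real-world efficiency
    """Rough wall-clock training time in days."""
    total_flops = 6.0 * params * tokens
    effective_flops_per_sec = gpus * peak_flops_per_gpu * utilization
    return total_flops / effective_flops_per_sec / 86_400  # seconds per day

# A hypothetical 7B-parameter model trained on 1T tokens with 256 GPUs
# lands near the week mark:
print(f"{training_days(7e9, 1e12, gpus=256):.1f} days")   # ~4.7 days

# Double the hardware speed and, per the argument above, teams tend to grow
# the model until training once again takes roughly the same time:
print(f"{training_days(14e9, 1e12, gpus=256, peak_flops_per_gpu=2.0e15):.1f} days")
```

The point of the sketch is the dynamic, not the exact figures: when faster chips arrive, the model size (or dataset) grows to fill the same iteration budget, which is why the upgrade path that keeps existing code working is so valuable.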