On March 28th, Cerebras released “Cerebras-GPT” on HuggingFace, a new Open Source model trained on The Pile dataset with GPT-3-like performance. (Link to press release)
What makes Cerebras interesting?
While Cerebras-GPT isn’t as capable a model as LLaMA, ChatGPT, or GPT-4 when compared directly on tasks, it has one important quality that sets it apart: It’s been released under the Apache 2.0 license, a fully permissive Open Source license, and the weights are available for anybody to download and try out.
This is different from models like LLaMA: while its weights are freely available, its license restricts LLaMA’s usage to “Non-Commercial” use cases like academic research or personal tinkering.
That means if you’d like to check out LLaMA you’ll have to get access to a powerful GPU and run it yourself, or use a volunteer-run service like KoboldAI. You can’t just go to a website like you can with ChatGPT and expect to start feeding it prompts. (At least not without running the risk of Meta sending you a DMCA takedown request.)
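That “download and try out” part is literal: because the weights sit on HuggingFace under a permissive license, you can pull them down with the standard transformers library. Here’s a minimal sketch, assuming the cerebras/Cerebras-GPT-1.3B checkpoint name as listed on HuggingFace (a smaller sibling of the 13B model); the prompt and generation settings are just illustrative:

```python
# Minimal sketch: download a Cerebras-GPT checkpoint from HuggingFace
# and generate a short completion. Requires `pip install transformers torch`.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "cerebras/Cerebras-GPT-1.3B"  # swap in the 13B model if you have the VRAM

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

inputs = tokenizer("Generative AI is", return_tensors="pt")
outputs = model.generate(
    **inputs,
    max_new_tokens=50,
    do_sample=True,
    temperature=0.8,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

No gated access form, no license click-through forbidding commercial use, and no risk of a takedown notice if you host it somewhere public.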
Proof-of-Concept to demonstrate Cerebras Training Hardware
The real reason this model is being released is to showcase the crazy silicon that Cerebras has spent years building.
These new chips are impressive because they use a silicon architecture that hasn’t been deployed in production for AI training before: instead of networking together a bunch of computers that each have a handful of NVIDIA GPUs, Cerebras has “networked” the compute together at the die level itself, building each accelerator as a single enormous wafer-scale chip.
By releasing Cerebras-GPT and showing that the results are comparable to existing OSS models, Cerebras is able to
“prove” that their product is competitive with what NVIDIA and AMD have on the market today. (And healthy
competition benefits all of us!)
Cerebras vs LLaMA vs ChatGPT vs GPT-J vs NeoX
To put it in simple terms: Cerebras-GPT isn’t as advanced as either LLaMA or ChatGPT (gpt-3.5-turbo). It’s a
much smaller model at 13B parameters and it’s been intentionally “undertrained” relative to the other models.
At 13B parameters, Cerebras-GPT is roughly 7% of the size of GPT-3 (175B parameters) and about 20% of the size of LLaMA’s full-size, 65B-parameter model, and Cerebras
intentionally limited how many tokens the model was trained on in order to reach a “training compute optimal” state.
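“Compute optimal” here is the Chinchilla-style rule of thumb of roughly 20 training tokens per parameter, which is the recipe Cerebras says it followed. A rough back-of-the-envelope comparison (the 20x ratio and the per-model token counts below are approximations, not official figures):

```python
# Rough sketch of the Chinchilla "compute optimal" rule of thumb:
# train a model on roughly 20 tokens per parameter, then stop.
TOKENS_PER_PARAM = 20  # approximate Chinchilla ratio

models = {
    # name: (parameters, tokens it was actually trained on -- approximate)
    "Cerebras-GPT-13B": (13e9, 260e9),
    "LLaMA-13B":        (13e9, 1e12),
    "LLaMA-65B":        (65e9, 1.4e12),
}

for name, (params, trained_tokens) in models.items():
    optimal_tokens = params * TOKENS_PER_PARAM
    print(f"{name}: compute-optimal ~{optimal_tokens / 1e9:.0f}B tokens, "
          f"actually trained on ~{trained_tokens / 1e9:.0f}B tokens")
```

The 13B Cerebras-GPT stops right around its ~260B-token optimum, while the similarly sized LLaMA-13B saw roughly 1T tokens, several times past its own. That’s what “intentionally undertrained” means here: Cerebras trained just long enough to hit the scaling-law sweet spot, rather than continuing to push for maximum downstream quality.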
That doesn’t mean that