
Cerebras achieves 2,500T/s on Llama 4 Maverick (400B) by ByteAtATime
Cerebras Breaks the 2,500 Tokens Per Second Barrier with Llama 4 Maverick 400B
SUNNYVALE, CA – May 28, 2025 — Last week, Nvidia announced that 8 Blackwell GPUs in a DGX B200 could demonstrate 1,000 tokens per second (TPS) per user on Meta’s Llama 4 Maverick. Today, the same independent benchmark firm, Artificial Analysis, measured Cerebras at more than 2,500 TPS/user, more than doubling the performance of Nvidia’s flagship solution.
“Cerebras has beaten the Llama 4 Maverick inference speed record set by NVIDIA last week,” said Micah Hill-Smith, Co-Founder and CEO of Artificial Analysis. “Artificial Analysis has benchmarked Cerebras’ Llama 4 Maverick endpoint at 2,522 tokens per second, compared to NVIDIA Blackwell’s 1,038 tokens per second for the same model. We’ve tested dozens of vendors, and Cerebras is the only inference solution that outperforms Blackwell for Meta’s flagship model.”
With today’s results, Cerebras has set a world record for LLM inference speed on the 400B parameter Llama 4 Maverick model, the largest and most powerful in the Llama 4 family. Artificial Analysis tested multiple other vendors, and the result…
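As a quick sanity check on the quoted figures, here is a back-of-envelope sketch using only the numbers above:

    # Rough check of the figures quoted in the press release.
    cerebras_tps = 2522    # tokens/s per user (Artificial Analysis)
    blackwell_tps = 1038   # tokens/s per user, DGX B200 (8x Blackwell)

    speedup = cerebras_tps / blackwell_tps   # ~2.43x
    cerebras_ms = 1000 / cerebras_tps        # ~0.40 ms per token
    blackwell_ms = 1000 / blackwell_tps      # ~0.96 ms per token

    print(f"{speedup:.2f}x faster; {cerebras_ms:.2f} ms/token vs {blackwell_ms:.2f} ms/token")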
10 Comments
y2244
Investor list includes Altman and Ilya
https://www.cerebras.ai/company
ryao
> At over 2,500 t/s, Cerebras has set a world record for LLM inference speed on the 400B parameter Llama 4 Maverick model, the largest and most powerful in the Llama 4 family.
This is incorrect. The unreleased Llama 4 Behemoth is the largest and most powerful in the Llama 4 family.
As for the speed record, it seems important to keep it in context. That comparison is only for performance on a single query, but it is well known that people run potentially hundreds of queries in parallel to get their money's worth out of the hardware. If you aggregate the tokens per second across all simultaneous queries to get the total throughput for comparison, I wonder whether it will still look so competitive in absolute performance.
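To make the per-user vs. aggregate distinction concrete, here is a minimal sketch; the batch size and the 100 TPS/user figure are made-up illustrations, not measurements:

    # Hypothetical illustration: per-user speed vs. total throughput.
    # To a first approximation, batching N requests multiplies total
    # throughput while per-user speed stays flat or degrades somewhat.
    def aggregate_tps(per_user_tps, concurrent_requests):
        return per_user_tps * concurrent_requests

    print(aggregate_tps(2522, 1))    # single-user record chase: 2,522 tok/s total
    print(aggregate_tps(100, 128))   # batched GPU server: 12,800 tok/s total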
Also, Cerebras is the company that, until sometime last year, was saying its hardware was not useful for inference, and even partnered with Qualcomm on the claim that Qualcomm’s accelerators had a 10x price-performance advantage over its own hardware:
https://www.cerebras.ai/press-release/cerebras-qualcomm-anno…
Their hardware does inference in FP16, so the 400B parameters alone are ~800 GB of weights; at ~44 GB of on-chip SRAM per WSE-3 wafer, they need roughly 20 of their CS-3 systems to run this model. Each one costs ~$2 million, so that is $40 million. The DGX B200 that they used for their comparison costs ~$500,000:
https://wccftech.com/nvidia-blackwell-dgx-b200-price-half-a-…
You only need 1 DGX B200 to run Llama 4 Maverick. You could buy ~80 of them for the price of enough Cerebras hardware to run Llama 4 Maverick.
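Spelling out that arithmetic (the ~20-system count, the ~$2M per system, and the ~$500K DGX price are this comment's estimates, and the 44 GB of SRAM per wafer is an assumption about the WSE-3 spec, not official pricing):

    # Back-of-envelope cost comparison using the estimates above.
    weights_gb = 400e9 * 2 / 1e9            # 400B params at FP16 -> ~800 GB
    sram_per_wafer_gb = 44                  # assumed on-chip SRAM per WSE-3
    systems_needed = weights_gb / sram_per_wafer_gb   # ~18, call it ~20

    cerebras_cost = 20 * 2_000_000          # ~$40M
    dgx_b200_cost = 500_000                 # ~$0.5M
    print(systems_needed)                   # ~18.2 wafers just for the weights
    print(cerebras_cost / dgx_b200_cost)    # ~80 DGX B200s for the same money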
Their latencies are impressive, but beyond a certain point, throughput is what counts, and they don’t really talk about their throughput numbers. I suspect the cost-to-performance ratio is terrible for throughput; it certainly is terrible for latency. That is what they are not telling people.
Finally, I have trouble getting excited about Cerebras. SRAM scaling is dead, so short of figuring out how to 3D-stack their wafer-scale chips during fabrication at TSMC, or designing round chips, they have a dead-end product, since it relies on using an entire wafer to be able to throw SRAM at problems. Nvidia, using DRAM, is far less reliant on SRAM and can use more silicon for compute, which is still shrinking.
MangoToupe
[flagged]
turblety
Maybe one day they’ll have an actual API where you can pay per token. Right now it’s the standard “talk to us” if you want to use it.
lordofgibbons
Very nice. Now for their next trick they should offer inference on actually useful models like DeepSeek R1 (not the distills).
thawab
Are the Llama 4 issues fixed? What is it good at? Coding is out the window after the updated R1.
bob1029
I think it is too risky to build a company around the premise that someone won't soon solve the quadratic scaling issue, especially when that company involves creating ASICs.
E.g.: https://arxiv.org/abs/2312.00752
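For reference, the "quadratic scaling issue" is the O(n^2)-in-sequence-length cost of self-attention; the linked paper is Mamba, a linear-time alternative. A toy growth-rate comparison, with real constants ignored:

    # Toy comparison of attention cost growth vs. a linear-time model.
    # Only the growth rate matters here; constants are ignored.
    for n in (1_000, 10_000, 100_000):
        quadratic = n * n   # self-attention: every token attends to every token
        linear = n          # linear-time sequence model (e.g. a state-space model)
        print(f"n={n:>7}: quadratic is {quadratic // linear:,}x the linear cost")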
tryauuum
Yes, it was not obvious that it's not terabytes per second.
diggan
> The most important AI applications being deployed in enterprise today—agents, code generation, and complex reasoning—are bottlenecked by inference latency
Is this really true today? I don't work in enterprise, so I don't know what things look like there, but I'm sure lots of people here do, and it feels unlikely that inference latency is the top bottleneck, even above humans or waiting for human input. Maybe I'm just using LLMs very differently from how they're deployed in an enterprise, but I'm by far the biggest bottleneck in my setup currently.
bravesoul2
I tried some Llama 4s on Cerebras and they were hallucinating like they were on drugs. I gave it a URL to analyse a post for style, and it made it all up and didn't look at the URL (or realize that it hadn't looked at it).