
Cerebras achieves 2,500T/s on Llama 4 Maverick (400B) by ByteAtATime

10 Comments

  • Post Author
    y2244
    Posted May 31, 2025 at 6:46 am

    The investor list includes Altman and Ilya:

    https://www.cerebras.ai/company

  • Post Author
    ryao
    Posted May 31, 2025 at 7:01 am

    > At over 2,500 t/s, Cerebras has set a world record for LLM inference speed on the 400B parameter Llama 4 Maverick model, the largest and most powerful in the Llama 4 family.

    This is incorrect. The unreleased Llama 4 Behemoth is the largest and most powerful in the Llama 4 family.

    As for the speed record, it seems important to keep it in context. That comparison is only for performance on a single query, but it is well known that people run potentially hundreds of queries in parallel to get their money's worth out of the hardware. If you aggregate the tokens per second across all simultaneous queries to get the total throughput for comparison, I wonder whether it will still look so competitive in absolute performance.

    Also, Cerebras is the company that, until some time last year, was saying their hardware was not useful for inference, and that even partnered with Qualcomm with the claim that Qualcomm's accelerators had a 10x price-performance improvement over their own hardware:

    https://www.cerebras.ai/press-release/cerebras-qualcomm-anno…

    Their hardware runs inference in FP16 with the weights held in on-chip SRAM, so they need ~20 of their CS-3 systems to hold this model. Each one costs ~$2 million, so that is ~$40 million. The DGX B200 that they used for their comparison costs ~$500,000:

    https://wccftech.com/nvidia-blackwell-dgx-b200-price-half-a-…

    You only need one DGX B200 to run Llama 4 Maverick. You could buy ~80 of them for the price of enough Cerebras hardware to run the same model (a rough sketch of this arithmetic follows this comment).

    Their latencies are impressive, but beyond a certain point throughput is what counts, and they don't really talk about their throughput numbers. I suspect the cost-to-performance ratio is terrible for throughput; it certainly is terrible for latency. That is what they are not telling people.

    Finally, I have trouble getting excited about Cerebras. SRAM scaling is dead, so short of figuring out how to 3D-stack their wafer-scale chips during fabrication at TSMC, or designing round chips, they have a dead-end product, since it relies on using an entire wafer to throw SRAM at problems. Nvidia, using DRAM, is far less reliant on SRAM and can devote more silicon to compute, which is still shrinking.
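
    A rough sketch of the arithmetic behind the two quantitative points in the comment above (the hardware cost comparison and single-query speed vs. aggregate throughput). The per-wafer SRAM capacity, prices, per-query speeds, and batch sizes are ballpark figures from the comment or illustrative assumptions, not vendor numbers:

        # Back-of-the-envelope numbers for the comment above.
        # Prices, SRAM capacity, and batch sizes are rough assumptions, not vendor figures.

        params = 400e9                        # Llama 4 Maverick total parameter count
        weight_gb = params * 2 / 1e9          # FP16 (2 bytes/param) -> ~800 GB of weights

        sram_per_wafer_gb = 44                # assumed on-chip SRAM per Cerebras WSE-3 wafer
        wafers = -(-weight_gb // sram_per_wafer_gb)   # ceiling division -> ~19, i.e. roughly 20 systems

        cerebras_cost = 20 * 2_000_000        # ~20 systems at ~$2M each = ~$40M
        dgx_cost = 500_000                    # one DGX B200 at ~$500k
        print(f"{weight_gb:.0f} GB of weights -> ~{wafers:.0f} wafers; "
              f"~{cerebras_cost // dgx_cost} DGX B200s for the price of the Cerebras setup")

        # Single-query speed vs. aggregate throughput (illustrative numbers only;
        # real throughput scales sub-linearly with batch size).
        systems = [
            # (name, assumed tokens/s per query, assumed concurrent queries, hardware cost)
            ("Cerebras setup", 2500, 8, cerebras_cost),
            ("DGX B200",        300, 128, dgx_cost),
        ]
        for name, tps, batch, cost in systems:
            aggregate = tps * batch
            print(f"{name}: ~{aggregate:,} t/s aggregate, ~${cost / aggregate:,.0f} of hardware per token/s")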

  • Post Author
    MangoToupe
    Posted May 31, 2025 at 7:15 am

    [flagged]

  • Post Author
    turblety
    Posted May 31, 2025 at 7:21 am

    Maybe one day they’ll have an actual API where you can pay per token. Right now it’s the standard “talk to us” if you want to use it.

  • Post Author
    lordofgibbons
    Posted May 31, 2025 at 7:44 am

    Very nice. Now for their next trick they should offer inference on actually useful models like DeepSeek R1 (not the distills).

  • Post Author
    thawab
    Posted May 31, 2025 at 8:36 am

    Are the Llama 4 issues fixed? What is it good at? Coding is out the window after the updated R1.

  • Post Author
    bob1029
    Posted May 31, 2025 at 8:45 am

    I think it is too risky to build a company around the premise that someone won't soon solve attention's quadratic scaling issue (see the sketch below), especially when that company involves creating ASICs.

    E.g.: https://arxiv.org/abs/2312.00752
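
    A minimal sketch of the scaling concern being referenced: self-attention cost grows with the square of sequence length, while the state-space models in the linked Mamba paper scale roughly linearly. The constants below are arbitrary placeholders; only the growth rates matter:

        # Rough growth-rate comparison: O(n^2) attention vs. O(n) state-space models.
        # d_model and d_state are arbitrary placeholders, not tuned values.

        def attention_ops(seq_len: int, d_model: int = 4096) -> float:
            return seq_len ** 2 * d_model          # every token attends to every other token

        def ssm_ops(seq_len: int, d_state: int = 16, d_model: int = 4096) -> float:
            return seq_len * d_state * d_model     # one fixed-size state update per token

        for n in (1_000, 10_000, 100_000):
            ratio = attention_ops(n) / ssm_ops(n)
            print(f"sequence length {n:>7,}: attention needs ~{ratio:,.0f}x the ops of an SSM")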

  • Post Author
    tryauuum
    Posted May 31, 2025 at 10:19 am

    Yes, it was not obvious that T/s means tokens per second rather than terabytes per second.

  • Post Author
    diggan
    Posted May 31, 2025 at 10:29 am

    > The most important AI applications being deployed in enterprise today—agents, code generation, and complex reasoning—are bottlenecked by inference latency

    Is this really true today? I don't work in enterprise, so I don't know what things look like there, but I'm sure lots of people here do, and it feels unlikely that inference latency is the top bottleneck, even above humans or waiting for human input. Maybe I'm just using LLMs very differently from how they're deployed in an enterprise, but I'm by far the biggest bottleneck in my setup currently.

  • Post Author
    bravesoul2
    Posted May 31, 2025 at 12:31 pm

    I tried some Llama 4s on Cerebras and they were hallucinating like they were on drugs. I gave it a URL to analyse a post for style, and it made it all up without looking at the URL (or realizing that it hadn't looked at it).
