
Cerebras achieves 2,500T/s on Llama 4 Maverick (400B) by ByteAtATime

10 Comments

  • Post Author
    y2244
    Posted May 31, 2025 at 6:46 am

    The investor list includes Altman and Ilya:

    https://www.cerebras.ai/company

  • Post Author
    ryao
    Posted May 31, 2025 at 7:01 am

    > At over 2,500 t/s, Cerebras has set a world record for LLM inference speed on the 400B parameter Llama 4 Maverick model, the largest and most powerful in the Llama 4 family.

    This is incorrect. The unreleased Llama 4 Behemoth is the largest and most powerful in the Llama 4 family.

    As for the speed record, it seems important to keep it in context. That comparison is only for performance on a single query, but it is well known that people run potentially hundreds of queries in parallel to get their money's worth out of the hardware. If you aggregate the tokens per second across all simultaneous queries to get the total throughput for comparison, I wonder whether it will still look so competitive in absolute performance.

    Also, Cerebras is the company that, until some time last year, was saying their hardware was not useful for inference, and that even partnered with Qualcomm with the claim that Qualcomm's accelerators had a 10x price-performance improvement over their own hardware:

    https://www.cerebras.ai/press-release/cerebras-qualcomm-anno…

    Their hardware runs inference in FP16 with the weights held in on-chip SRAM, so they need ~20 of their CS-3 systems to hold this model. Each one costs ~$2 million, so that is ~$40 million. The DGX B200 that they used for their comparison costs ~$500,000:

    https://wccftech.com/nvidia-blackwell-dgx-b200-price-half-a-…

    You only need one DGX B200 to run Llama 4 Maverick. You could buy ~80 of them for the price of enough Cerebras hardware to run the same model (a rough sketch of this arithmetic follows this comment).

    Their latencies are impressive, but beyond a certain point throughput is what counts, and they don't really talk about their throughput numbers. I suspect the cost-to-performance ratio is terrible for throughput; it certainly is terrible for latency. That is what they are not telling people.

    Finally, I have trouble getting excited about Cerebras. SRAM scaling is dead, so short of figuring out how to 3D-stack their wafer-scale chips during fabrication at TSMC, or designing round chips, they have a dead-end product, since it relies on using an entire wafer to throw SRAM at problems. Nvidia, using DRAM, is far less reliant on SRAM and can devote more silicon to compute, which is still shrinking.
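
    A rough sketch of the arithmetic behind the two quantitative points in the comment above (the hardware cost comparison and single-query speed vs. aggregate throughput). The per-wafer SRAM capacity, prices, per-query speeds, and batch sizes are ballpark figures from the comment or illustrative assumptions, not vendor numbers:

        # Back-of-the-envelope numbers for the comment above.
        # Prices, SRAM capacity, and batch sizes are rough assumptions, not vendor figures.

        params = 400e9                        # Llama 4 Maverick total parameter count
        weight_gb = params * 2 / 1e9          # FP16 (2 bytes/param) -> ~800 GB of weights

        sram_per_wafer_gb = 44                # assumed on-chip SRAM per Cerebras WSE-3 wafer
        wafers = -(-weight_gb // sram_per_wafer_gb)   # ceiling division -> ~19, i.e. roughly 20 systems

        cerebras_cost = 20 * 2_000_000        # ~20 systems at ~$2M each = ~$40M
        dgx_cost = 500_000                    # one DGX B200 at ~$500k
        print(f"{weight_gb:.0f} GB of weights -> ~{wafers:.0f} wafers; "
              f"~{cerebras_cost // dgx_cost} DGX B200s for the price of the Cerebras setup")

        # Single-query speed vs. aggregate throughput (illustrative numbers only;
        # real throughput scales sub-linearly with batch size).
        systems = [
            # (name, assumed tokens/s per query, assumed concurrent queries, hardware cost)
            ("Cerebras setup", 2500, 8, cerebras_cost),
            ("DGX B200",        300, 128, dgx_cost),
        ]
        for name, tps, batch, cost in systems:
            aggregate = tps * batch
            print(f"{name}: ~{aggregate:,} t/s aggregate, ~${cost / aggregate:,.0f} of hardware per token/s")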

  • Post Author
    MangoToupe
    Posted May 31, 2025 at 7:15 am

    [flagged]

  • Post Author
    turblety
    Posted May 31, 2025 at 7:21 am

    Maybe one day they’ll have an actual API where you can pay per token. Right now it’s the standard “talk to us” if you want to use it.

  • Post Author
    lordofgibbons
    Posted May 31, 2025 at 7:44 am

    Very nice. Now for their next trick they should offer inference on actually useful models like DeepSeek R1 (not the distills).

  • Post Author
    thawab
    Posted May 31, 2025 at 8:36 am

    Are the Llama 4 issues fixed? What is it good at? Coding is out the window after the updated R1.

  • Post Author
    bob1029
    Posted May 31, 2025 at 8:45 am

    I think it is too risky to build a company around the premise that someone won't soon solve attention's quadratic scaling issue (see the sketch below), especially when that company involves creating ASICs.

    E.g.: https://arxiv.org/abs/2312.00752
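
    A minimal sketch of the scaling concern being referenced: self-attention cost grows with the square of sequence length, while the state-space models in the linked Mamba paper scale roughly linearly. The constants below are arbitrary placeholders; only the growth rates matter:

        # Rough growth-rate comparison: O(n^2) attention vs. O(n) state-space models.
        # d_model and d_state are arbitrary placeholders, not tuned values.

        def attention_ops(seq_len: int, d_model: int = 4096) -> float:
            return seq_len ** 2 * d_model          # every token attends to every other token

        def ssm_ops(seq_len: int, d_state: int = 16, d_model: int = 4096) -> float:
            return seq_len * d_state * d_model     # one fixed-size state update per token

        for n in (1_000, 10_000, 100_000):
            ratio = attention_ops(n) / ssm_ops(n)
            print(f"sequence length {n:>7,}: attention needs ~{ratio:,.0f}x the ops of an SSM")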

  • Post Author
    tryauuum
    Posted May 31, 2025 at 10:19 am

    Yes, it was not obvious that T/s means tokens per second rather than terabytes per second.

  • Post Author
    diggan
    Posted May 31, 2025 at 10:29 am

    > The most important AI applications being deployed in enterprise today—agents, code generation, and complex reasoning—are bottlenecked by inference latency

    Is this really true today? I don't work in enterprise, so I don't know what things look like there, but I'm sure lots of people here do, and it feels unlikely that inference latency is the top bottleneck, even above humans or waiting for human input. Maybe I'm just using LLMs very differently from how they're deployed in an enterprise, but I'm by far the biggest bottleneck in my setup currently.

  • Post Author
    bravesoul2
    Posted May 31, 2025 at 12:31 pm

    I tried some Llama 4s on Cerebras and they were hallucinating like they were on drugs. I gave it a URL to analyse a post for style, and it made it all up without looking at the URL (or realizing that it hadn't looked at it).
