DeepSeek Open Source FlashMLA – MLA Decoding Kernel for Hopper GPUs by helloericsf

By HackTech, February 24, 2025

FlashMLA is an efficient MLA decoding kernel for Hopper GPUs, optimized for serving variable-length sequences.

Currently released:

BF16
Paged KV cache with a block size of 64
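
As a rough illustration of what a paged KV cache with 64-token blocks implies for variable-length sequences, the sketch below maps logical token positions to physical blocks through a block table. It is plain Python and does not call FlashMLA; the helper names, the sequence lengths, and the free-list allocator are all hypothetical and only the block size comes from the text above.

```python
BLOCK_SIZE = 64  # block size stated in the release notes above


def blocks_needed(seq_len: int, block_size: int = BLOCK_SIZE) -> int:
    """Number of fixed-size cache blocks required to hold seq_len tokens."""
    return (seq_len + block_size - 1) // block_size


def locate_token(block_table_row: list[int], pos: int,
                 block_size: int = BLOCK_SIZE) -> tuple[int, int]:
    """Map a logical token position to (physical block id, offset in block)."""
    return block_table_row[pos // block_size], pos % block_size


# Hypothetical batch of three sequences with different cached lengths.
cache_seqlens = [130, 64, 5]

# Hand out physical blocks from a simple free list (illustrative allocator,
# not how any particular serving framework does it).
free_blocks = iter(range(100))
block_table = [
    [next(free_blocks) for _ in range(blocks_needed(n))]
    for n in cache_seqlens
]

print(block_table)                        # [[0, 1, 2], [3], [4]]
print(locate_token(block_table[0], 129))  # (2, 1)
```

Each sequence only occupies as many blocks as its length requires, which is the property that makes paging attractive when sequence lengths in a batch differ widely.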

To run the benchmark:

python tests/test_flash_mla.py

It achieves up to 3000 GB/s in a memory-bound configuration and 580 TFLOPS in a compute-bound configuration on an H800 SXM5, using CUDA 12.6.
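
To put the two figures side by side, the sketch below treats them as the ceilings of a simple roofline model and computes the arithmetic intensity at which a workload would shift from memory-bound to compute-bound. This is a simplification: the numbers are reported achieved throughputs, not hardware limits, and GB is assumed to mean 10^9 bytes. The function and constant names are illustrative.

```python
PEAK_BANDWIDTH_BYTES_PER_S = 3000e9  # 3000 GB/s, from the figures above
PEAK_COMPUTE_FLOPS = 580e12          # 580 TFLOPS, from the figures above

# Arithmetic intensity (FLOPs per byte moved) at which the two limits meet.
ridge_point = PEAK_COMPUTE_FLOPS / PEAK_BANDWIDTH_BYTES_PER_S


def attainable_flops(intensity: float) -> float:
    """Roofline estimate: throughput is capped by bandwidth or by compute."""
    return min(PEAK_COMPUTE_FLOPS, intensity * PEAK_BANDWIDTH_BYTES_PER_S)


print(f"ridge point: {ridge_point:.1f} FLOPs/byte")
print(f"at 10 FLOPs/byte:   {attainable_flops(10) / 1e12:.0f} TFLOPS")
print(f"at 1000 FLOPs/byte: {attainable_flops(1000) / 1e12:.0f} TFLOPS")
```

Under these assumptions, a configuration that performs fewer than roughly 193 FLOPs per byte read is limited by memory bandwidth, and one above that is limited by compute.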

from flash_mla import get_mla_metadata, flash_mla_with_kvcache

# Compute the scheduling metadata once, before the layer loop; every layer reuses it.
tile_scheduler_metadata, num_splits = get_mla_metadata(cache_seqlens, s_q * h_q // h_kv, h_kv)

for i in range(num_layers):
    ...
    # o_i is the attention output for layer i; lse_i is the softmax log-sum-exp.
    o_i, lse_i = flash_mla_with_kvcache(
        q_i, kvcache_i, block_table, cache_seqlens, dv,
        tile_scheduler_metadata, num_splits, causal=True,
    )
    ...
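
The second argument to get_mla_metadata in the snippet above is the expression s_q * h_q // h_kv. One way to read it, offered here as an interpretation rather than documented behavior, is as the number of query rows that share each KV head. The sketch below only evaluates that expression; the helper name and the example sizes are hypothetical and it does not import flash_mla.

```python
def query_rows_per_kv_head(s_q: int, h_q: int, h_kv: int) -> int:
    """Evaluate s_q * h_q // h_kv, rejecting head counts that do not divide."""
    if h_q % h_kv != 0:
        raise ValueError("h_q must be a multiple of h_kv")
    return s_q * h_q // h_kv


# Hypothetical decoding setup: one new query token, 128 query heads, 1 KV head.
print(query_rows_per_kv_head(s_q=1, h_q=128, h_kv=1))  # 128

# Hypothetical setup with two query tokens per step and 64 query heads.
print(query_rows_per_kv_head(s_q=2, h_q=64, h_kv=1))   # 128
```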

Requirements:

Hopper GPUs
CUDA 12.3 and above

9 Comments

helloericsf

Posted February 24, 2025 at 1:38 am

X: https://x.com/deepseek_ai/status/1893836827574030466
BF16 support
Paged KV cache (block size 64)
3000 GB/s memory-bound & 580 TFLOPS compute-bound on H800

deyiao

Posted February 24, 2025 at 3:00 am

I heard their inferencing framework is way lower than typical deployment methods. Can this be verified from that open-source project? How does it stack up against vLLM or llama.cpp?

mohsen1

Posted February 24, 2025 at 3:25 am

I'm confused. Weren't there sanctions against Chinese companies about Hopper GPUs? Are they just admitting that they had access to H100s against the US sanctions?!

behnamoh

Posted February 24, 2025 at 3:25 am

Open AI is back!

rvz

Posted February 24, 2025 at 4:12 am

This is the minimum bar that I expect very elite programmers should be striving for in the age of AI. DeepSeek should be studied as an example, and this is only the first of many projects from them.

There is an extremely high chance (in fact a 99.9% chance) that an AI did not build this, and the ones who are able to build or adapt projects like this, which are deep into hardware systems, will be the most sought after.

Not the horrendous JS or even TS slop across GitHub that is extremely easy for an AI to generate correctly.

You've got until 2030 to decide. And my advice is to study the codebases of pytorch (backends), DeepSeek, tinygrad and ggml.

nokun7

Posted February 24, 2025 at 4:35 am

In my view, FlashMLA’s exclusive targeting of Hopper GPUs restricts its cross-platform use, and the lack of comprehensive documentation, vague compatibility with wider frameworks, and absence of benchmark comparisons or trade-off insights reduce its ease of use and adaptability. While it holds potential for specialists with tailored requirements, its specialized nature and limited community backing indicate it’s not yet a broadly practical tool, requiring more detailed guides and expanded hardware support to unlock its full capabilities.

m3kw9

Posted February 24, 2025 at 5:34 am

MHGA making hopper great again

eigenvalue

Posted February 24, 2025 at 6:07 am

Nice, probably saved a bunch of FANG devs a lot of hours of work trying to knock this off.

refibrillator

Posted February 24, 2025 at 7:40 am

vLLM supports MLA for DeepSeek models as of 3 weeks ago, with 3x higher generation throughput and 10x token memory capacity.

https://github.com/vllm-project/vllm/releases/tag/v0.7.1

MHA is still faster in low QPS regime apparently.

https://neuralmagic.com/blog/enhancing-deepseek-models-with-…

Also published this month was a theoretical proof showing that, for the same KV cache overhead, MLA consistently offers greater expressive power than GQA. Furthermore, widely used GQA-based pre-trained models (e.g. LLaMA, Qwen, Mixtral) can be converted into MLA-based models.

https://arxiv.org/pdf/2502.07864
