Do you ever wonder how companies like OpenAI, Anthropic, and Google serve their large language models so economically? They generate dozens of tokens per second for each user while keeping costs to a fraction of a penny per request. While these feats of engineering are proprietary, you can be sure that they employ every technique available to optimize their inference stacks.
Enter MK-1. Our mission is to give every company running AI models the same capabilities as these elite AI powerhouses, or better. We’re obsessed with performance and efficiency, and have developed our own tools that rival anything out there. Today, we’re announcing our first product, MKML. MKML is a software package that can reduce LLM inference costs on GPUs by 2x with just a few lines of Python code. And it is plug-and-play with popular ecosystems like Hugging Face and PyTorch.
For a quick demo, here’s Llama-2 7B running more than twice as fast with MKML as the baseline FP16 model on an RTX 4090 GPU.
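To give a concrete sense of the “few lines of Python” claim, here is a sketch of the kind of Hugging Face workflow MKML plugs into. The loading and generation calls below are standard transformers APIs; the MKML call itself is a hypothetical placeholder, since the actual API is only available in the closed beta.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Standard Hugging Face setup for Llama-2 7B in FP16 (the baseline in the demo).
model_id = "meta-llama/Llama-2-7b-hf"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16).to("cuda")

# Hypothetical drop-in compression step -- name and signature assumed for
# illustration only; the real MKML API may differ.
# model = mkml.compress(model)

prompt = "Explain LLM inference in one sentence."
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```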
Currently, MKML is in closed beta release. If you are interested in becoming an early partner and getting access to new features first, please contact us below.
How can MKML help?
Suppose you want to run a chatbot on the cloud using a Llama-2 13B model. Despite being one of the smaller Llama models, it requires 26GB of memory in FP16 – just for the parameters! This has two implications:
- Loading the model requires a GPU instance with enough memory, such as a pricey A100 40GB.
- Running the model requires reading all 26GB from GPU memory for each forward pass, and this can impact the speed of token generation.
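A quick back-of-the-envelope check of where that 26GB comes from (weights only; activations and the KV cache add more on top):

```python
# Weight memory for Llama-2 13B, ignoring activations and KV cache.
params = 13e9          # parameter count
bytes_per_param = 2    # FP16 stores each parameter in 2 bytes

weight_gb = params * bytes_per_param / 1e9
print(f"FP16 weights: {weight_gb:.0f} GB")    # -> 26 GB

# Compare against common single-GPU memory sizes.
print("Fits on an A10 (24 GB)? ", weight_gb < 24)   # False
print("Fits on an A100 (40 GB)?", weight_gb < 40)   # True
```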
The key observation is that the model’s large memory footprint is the critical bottleneck. MKML solves this: a one-time procedure shrinks the model by ~60% while keeping very high fidelity to the original, as we will explain later in this post. So the 13B model shrinks from 26GB all the way down to 10.5GB. Crucially, MKML also speeds up the forward pass by up to 2.3x compared to the base model on the same GPU, and these gains are multiplicative with system-level optimizations like continuous batching.
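For intuition on where that speedup comes from, here is a rough memory-bandwidth argument. It is a simplification (it ignores compute, activations, and the KV cache), but it shows why shrinking the weights translates almost directly into faster token generation:

```python
# At small batch sizes, generating each token requires streaming essentially
# all of the weights from GPU memory, so per-token latency is roughly
# (weight bytes / memory bandwidth). The speedup available from compression
# alone is therefore bounded by the ratio of model sizes.
fp16_gb = 26.0    # baseline Llama-2 13B weights
mkml_gb = 10.5    # ~60% smaller after compression

memory_bound_speedup_ceiling = fp16_gb / mkml_gb
print(f"Memory-bound speedup ceiling: {memory_bound_speedup_ceiling:.1f}x")   # ~2.5x
# The measured "up to 2.3x" sits just under this ceiling, consistent with
# decoding being dominated by weight reads.
```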
Let’s explore two scenarios of how you might leverage MKML to optimize a Llama-2 13B chatbot.
Case 1: Cost optimized
With our compression, the Llama-2 13B model now fits on a single A10 24GB instance, which is ~45% less expensive than the A100. And incredibly, despite the A10 being less powerful than the A100 in terms of compute and memory bandwidth, MKML token generation on the A10 is still faster than the baseline model on the A100.
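Putting those two observations together gives a rough relative cost-per-token estimate, using only the ratios stated above (not absolute cloud prices):

```python
# Relative cost per token of MKML-on-A10 versus the FP16 baseline on an A100.
a10_relative_price = 0.55     # the A10 instance costs ~45% less than the A100
relative_throughput = 1.0     # conservative: MKML on the A10 is at least as fast

relative_cost_per_token = a10_relative_price / relative_throughput
print(f"Cost per token vs. the A100 baseline: <= {relative_cost_per_token:.2f}x")
# i.e. at least ~45% cheaper per generated token, and more if the A10 is faster.
```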