Do the LLaMA thing, but now in Rust
🦀 🚀 🦙
Image by @darthdeus, using Stable Diffusion
LLaMA-rs is a Rust port of the llama.cpp project. This allows running inference for Facebook's LLaMA model on a CPU with good performance, using full precision, f16, or 4-bit quantized versions of the model.

Just like its C++ counterpart, it is powered by the ggml tensor library, achieving the same performance as the original code.
Getting started
Make sure you have a Rust toolchain set up.

- Get a copy of the model's weights¹
- Clone the repository
- Build (`cargo build --release`)
- Run with `cargo run --release --`

NOTE: Make sure to build and run in release mode. Debug builds are currently broken.
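Putting those steps together, the workflow looks roughly like this (a sketch: the repository URL and the model path are assumptions, so substitute your own):

```shell
# Clone the repository and build in release mode (debug builds are currently broken).
git clone https://github.com/setzer22/llama-rs.git   # assumed repository URL
cd llama-rs
cargo build --release

# Run the CLI; flags after `--` go to llama-rs itself (see the full example below).
cargo run --release -- -m <path-to-ggml-model.bin> -p "<your prompt>"
```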
For example, you can try the following prompt:

`cargo run --release -- -m /data/Llama/LLaMA/7B/ggml-model-q4_0.bin -p "Tell me how cool the Rust programming language is"`
Q&A
- Q: Why did you do this?
- A: It was not my choice. Ferris appeared to me in my dreams and asked me to rewrite this in the name of the Holy crab.
- Q: Seriously now
- A: Come on! I don't want to get into a flame war. You know how it goes, something something memory something something cargo is nice, don't make me say it, everybody knows this already.
- Q: I insist.
- A: Sheesh! Okaaay. After seeing the huge potential for llama.cpp, the first thing I did was to see how hard it would be to turn it into a library to embed in my projects. I started digging into the code, and realized the heavy lifting is done by ggml (a C library, easy to bind to Rust) and the whole project was just around ~2k lines of C++ code (not so easy to bind). After a couple of (failed) attempts to build an HTTP server into the tool, I realized I'd be much more productive if I just ported the code to Rust, where I'm more comfortable.
- Q: Is this the real reason?
- A: Haha. Of course not. I just like collecting imaginary internet points, in the form of little stars, that people seem to give to me whenever I embark on pointless quests for rewriting X thing, but in Rust.
Known issues / To-dos
Contributions welcome! Here are a few pressing issues:

- The code only sets the right CFLAGS on Linux, so inference will be very slow on every other OS. The `build.rs` script in `ggml_raw` needs to be fixed.
- The quantization code has not been ported (yet). You can still use quantized models produced with llama.cpp (see the sketch after this list).
- The code needs to
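Until the quantization code is ported, one workaround is to produce the 4-bit file with llama.cpp's own tooling and then load it from llama-rs. The sketch below assumes an early llama.cpp checkout; the script name, the trailing f16/q4_0 type arguments, and the paths may differ in your version:

```shell
# In a llama.cpp checkout: convert the original weights to ggml f16,
# then quantize to 4 bits ("2" selects the q4_0 type in early llama.cpp builds).
python3 convert-pth-to-ggml.py models/7B/ 1
./quantize ./models/7B/ggml-model-f16.bin ./models/7B/ggml-model-q4_0.bin 2

# Back in llama-rs: point -m at the quantized file, as in the example above.
cargo run --release -- -m ./models/7B/ggml-model-q4_0.bin -p "Tell me how cool the Rust programming language is"
```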