The large language model wars
With the increasing interest in artificial intelligence and its use in everyday life, numerous exemplary models such as Meta’s LLaMA, OpenAI’s GPT-3, and Microsoft’s Kosmos-1 are joining the group of large language models (LLMs). The only problem with such models is that you can’t run them locally. Until now. Thanks to Georgi Gerganov and his llama.cpp project, it is possible to run Meta’s LLaMA on a single computer without a dedicated GPU.
Running LLaMA
There are multiple steps involved in running LLaMA locally on an M1 Mac. I am not sure about other platforms or other OSes, so in this article we are focusing only on the aforementioned combination.
Step 1: Downloading the model
The official way is to request the model via this web form and download it afterward.
There is an open PR in the repository that describes an alternative way (which is probably a violation of the terms of service):
https://github.com/facebookresearch/llama/pull/73
Anyway, after you have downloaded the model (or rather models, since the folder contains several different sizes), you should have something like this:
❯ exa --tree
.
├── 7B
│ ├── checklist.chk
│ ├── consolidated.00.pth
│ └── params.json
├── 13B
│ ├── checklist.chk
│ ├── consolidated.00.pth
│ ├── consolidated.01.pth
│ └── params.json
├── 30B
│ ├── checklist.chk
│ ├── consolidated.00.pth
│ ├── consolidated.01.pth
│ ├── consolidated.02.pth
│ ├── consolidated.03.pth
│ └── params.json
├── 65B
│ ├── checklist.chk
│ ├── consolidated.00.pth
│ ├── consolidated.01.pth
│ ├── consolidated.02.pth
│ ├── consolidated.03.pth
│ ├── consolidated.04.pth
│ ├── consolidated.05.pth
│ ├── consolidated.06.pth
│ ├── consolidated.07.pth
│ └── params.json
├── tokenizer.model
└── tokenizer_checklist.chk
As you can see, the different models live in different folders. Each model has a params.json that contains details about that model.
For example:
{
    "dim": 4096,
    "multiple_of": 256,
    "n_heads": 32,
    "n_layers": 32,
    "norm_eps": 1e-06,
    "vocab_size": -1
}
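As a quick illustration (not required for any of the later steps), here is a small Python sketch that reads each params.json and prints the basic shape of every downloaded model. It assumes you run it from the folder shown in the tree listing above:

# Minimal sketch: list the configuration of every downloaded LLaMA model.
# Assumes the 7B/13B/30B/65B folders from the tree above are in the current directory.
import json
from pathlib import Path

for params_file in sorted(Path(".").glob("*/params.json")):
    params = json.loads(params_file.read_text())
    print(f"{params_file.parent.name}: {params['n_layers']} layers, "
          f"{params['n_heads']} heads, dim {params['dim']}")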
Step 2: Installing dependencies
Xcode must be installed to compile the C++ project. If you don’t have it yet, you can install the command line tools by running the following:
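xcode-select --install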
These are the dependencies for building the C++ project (pkgconfig and cmake):
brew install pkgconfig cmake
Finally, we can install PyTorch.
I assume you have Python 3.11 installed, so you can create a virtual env like this:
/opt/homebrew/bin/python3.11 -m venv venv
Activate the venv. I am using fish; for other shells, just drop the .fish suffix:
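source venv/bin/activate.fish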
After activating the venv, we can install PyTorch:
pip3 install --pre torch torchvision --extra-index-url https://download.pytorch.org/whl/nightly/cpu
If you are interested in leveraging the new Metal Performance Shaders (MPS) backend for GPU training acceleration, you can verify it by running the following. This is not required for running LLaMA on your M1, though:
❯ python
Python 3.11.2 (main, Feb 16 2023, 02:55:59) [Clang 14.0.0 (clang-1400.0.29.202)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch; torch.backends.mps.is_available()
True
Now let’s compile llama.cpp.
Step 3: Compiling llama.cpp
Clone the repo:
git clone git@github.com:ggerganov/llama.cpp.git
After installing all the dependencies, you can run make:
❯ make
I llama.cpp build info:
I UNAME_S: Darwin
I UNAME_P: arm
I UNAME_M: arm64
I CFLAGS: -I. -O3 -DNDEBUG -std=c11 -fPIC -pthread -DGGML_USE_ACCELERATE
I CXXFLAGS: -I. -I./examples -O3 -DNDEBUG -std=c++11 -fPIC -pthread
I LDFLAGS: -framework Accelerate
I CC: Apple clang version 14.0.0 (clang-1400.0.29.202)
I CXX: Apple clang version 14.0.0 (clang-1400.0.29.202)
cc -I. -O3 -DNDEBUG -std=c11 -fPIC -pthread -DGGML_USE_ACCELERATE -c ggml.c -o ggml.o
c++ -I. -I./examples -O3 -DNDEBUG -std=c++11 -fPIC -pthread -c utils.cpp -o utils.o
c++ -I. -I./examples -O3 -DNDEBUG -std=c++11 -fPIC -pthread main.cpp ggml.o utils.o -o main -framework Accelerate
./main -h
usage: ./main [options]
options:
-h, --help show this help message and exit
-s SEED, --seed SEED RNG seed (default: -1)
-t N, --threads N number of threads to use during computation (default: 4)
-p PROMPT, --prompt PROMPT
prompt to start generation with (default: random)
-n N, --n_predict N number of tokens to predict (default: 128)
--top_k N top-k sampling (default: 40)
--top_p N top-p sampling (default: 0.9)
--temp N temperature (default: 0.8)
-b N, --batch_size N batch size for prompt processing (default: 8)
-m FNAME, --model FNAME
model path (default: models/llama-7B/ggml-model.bin)