The large language model wars
With the increasing interest in artificial intelligence and its use in everyday life, numerous exemplary models such as Meta’s LLaMA, OpenAI’s GPT-3, and Microsoft’s Kosmos-1 are joining the group of large language models (LLMs). The only problem with such models is that you can’t run them locally. Until now. Thanks to Georgi Gerganov and his llama.cpp project, it is possible to run Meta’s LLaMA on a single computer without a dedicated GPU.
Running LLaMA
There are multiple steps involved in running LLaMA locally on an M1 Mac. I am not sure about other platforms or other OSes, so in this article we are focusing only on the aforementioned combination.
Step 1: Downloading the model
The official way is to request the model via this web form and download it afterward.
There is an open PR in the repository that describes an alternative way (which is probably a violation of the terms of service):
https://github.com/facebookresearch/llama/pull/73
Anyway, after you have downloaded the model (or rather models, since the folder contains several different sizes), you should have something like this:
❯ exa --tree
.
├── 7B
│ ├── checklist.chk
│ ├── consolidated.00.pth
│ └── params.json
├── 13B
│ ├── checklist.chk
│ ├── consolidated.00.pth
│ ├── consolidated.01.pth
│ └── params.json
├── 30B
│ ├── checklist.chk
│ ├── consolidated.00.pth
│ ├── consolidated.01.pth
│ ├── consolidated.02.pth
│ ├── consolidated.03.pth
│ └── params.json
├── 65B
│ ├── checklist.chk
│ ├── consolidated.00.pth
│ ├── consolidated.01.pth
│ ├── consolidated.02.pth
│ ├── consolidated.03.pth
│ ├── consolidated.04.pth
│ ├── consolidated.05.pth
│ ├── consolidated.06.pth
│ ├── consolidated.07.pth
│ └── params.json
├── tokenizer.model
└── tokenizer_checklist.chk
As you can see, the different models live in different folders. Each model has a params.json that contains details about that model.
For example:
{
    "dim": 4096,
    "multiple_of": 256,
    "n_heads": 32,
    "n_layers": 32,
    "norm_eps": 1e-06,
    "vocab_size": -1
}
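As a quick illustration (not required for any of the later steps), here is a small Python sketch that reads each params.json and prints the basic shape of every downloaded model. It assumes you run it from the folder shown in the tree listing above:

# Minimal sketch: list the configuration of every downloaded LLaMA model.
# Assumes the 7B/13B/30B/65B folders from the tree above are in the current directory.
import json
from pathlib import Path

for params_file in sorted(Path(".").glob("*/params.json")):
    params = json.loads(params_file.read_text())
    print(f"{params_file.parent.name}: {params['n_layers']} layers, "
          f"{params['n_heads']} heads, dim {params['dim']}")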
Step 2: Installing dependencies
Xcode must be installed to compile the C++ project. If you don’t have it yet, you can install the command line tools by running the following:
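xcode-select --install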
These are the dependencies for building the C++ project (pkgconfig and cmake):
brew install pkgconfig cmake
Finally, we can install PyTorch.
I assume you have Python 3.11 installed, so you can create a virtual env like this:
/opt/homebrew/bin/python3.11 -m venv venv
Activate the venv. I am using fish; for other shells, just drop the .fish suffix:
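source venv/bin/activate.fish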
After activating the venv, we can install PyTorch:
pip3 install --pre torch torchvision --extra-index-url https://download.pytorch.org/whl/nightly/cpu
If you are interested in leveraging the new Metal Performance Shaders (MPS) backend for GPU training acceleration, you can verify it by running the following. This is not required for running LLaMA on your M1, though:
❯ python
Python 3.11.2 (main, Feb 16 2023, 02:55:59) [Clang 14.0.0 (clang-1400.0.29.202)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch; torch.backends.mps.is_available()
True
Now let’s compile llama.cpp.
Step 3: Compiling llama.cpp
Clone the repo:
git clone git@github.com:ggerganov/llama.cpp.git
After installing all the dependencies, you can run make:
❯ make
I llama.cpp build info:
I UNAME_S: Darwin
I UNAME_P: arm
I UNAME_M: arm64
I CFLAGS: -I. -O3 -DNDEBUG -std=c11 -fPIC -pthread -DGGML_USE_ACCELERATE
I CXXFLAGS: -I. -I./examples -O3 -DNDEBUG -std=c++11 -fPIC -pthread
I LDFLAGS: -framework Accelerate
I CC: Apple clang version 14.0.0 (clang-1400.0.29.202)
I CXX: Apple clang version 14.0.0 (clang-1400.0.29.202)
cc -I. -O3 -DNDEBUG -std=c11 -fPIC -pthread -DGGML_USE_ACCELERATE -c ggml.c -o ggml.o
c++ -I. -I./examples -O3 -DNDEBUG -std=c++11 -fPIC -pthread -c utils.cpp -o utils.o
c++ -I. -I./examples -O3 -DNDEBUG -std=c++11 -fPIC -pthread main.cpp ggml.o utils.o -o main -framework Accelerate
./main -h
usage: ./main [options]
options:
-h, --help show this help message and exit
-s SEED, --seed SEED RNG seed (default: -1)
-t N, --threads N number of threads to use during computation (default: 4)
-p PROMPT, --prompt PROMPT
prompt to start generation with (default: random)
-n N, --n_predict N number of tokens to predict (default: 128)
--top_k N top-k sampling (default: 40)
--top_p N top-p sampling (default: 0.9)
--temp N temperature (default: 0.8)
-b N, --batch_size N batch size for prompt processing (default: 8)
-m FNAME, --model FNAME
model path (default: models/llama-7B/ggml-model.bin)