Inspired by #1397 and grantslatton's CFG work, this adds an API that takes a serialized context-free grammar to guide and constrain sampling. It also adds a sample Backus-Naur form (BNF)-like syntax in main for specifying a grammar for generations.
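Under the hood, the grammar text is parsed into a serialized rule array (the new grammar-parser example handles this) and the core sampling API then filters candidate tokens against the grammar at each step. Below is a condensed sketch of that flow, assuming the names added in this PR's llama.h and grammar-parser example; it is not a verbatim excerpt of main.

```cpp
// Condensed sketch of how main wires up the new grammar API (not a verbatim
// excerpt; names follow this PR's llama.h and grammar-parser additions).
#include "llama.h"
#include "grammar-parser.h"

#include <string>
#include <vector>

static llama_grammar * grammar_from_text(const std::string & text) {
    // Parse the BNF-like grammar text into the serialized rule arrays
    // (llama_grammar_element) that the core API consumes.
    grammar_parser::parse_state parsed = grammar_parser::parse(text.c_str());
    std::vector<const llama_grammar_element *> rules = parsed.c_rules();
    return llama_grammar_init(rules.data(), rules.size(), parsed.symbol_ids.at("root"));
}

// In the sampling loop, the grammar first filters the candidate tokens and is
// then advanced with whichever token was actually sampled:
//
//     llama_sample_grammar(ctx, &candidates_p, grammar);       // mask tokens the grammar rejects
//     llama_token id = llama_sample_token(ctx, &candidates_p);
//     llama_grammar_accept_token(ctx, grammar, id);            // advance the parse state
//
// llama_grammar_free(grammar) releases it when generation finishes.
```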
Testing
(M2 Max, 30B)
Chess – no grammar
% ./main -m $LLAMA_30B_Q4_0 -n 32 -p $'A good game:\n\n'
main: build = 645 (fd0eb66)
main: seed = 1686286016
llama.cpp: loading model from /Users/evan/llama-models/30B/ggml-model-q4_0.bin
llama_model_load_internal: format = ggjt v3 (latest)
llama_model_load_internal: n_vocab = 32000
llama_model_load_internal: n_ctx = 512
llama_model_load_internal: n_embd = 6656
llama_model_load_internal: n_mult = 256
llama_model_load_internal: n_head = 52
llama_model_load_internal: n_layer = 60
llama_model_load_internal: n_rot = 128
llama_model_load_internal: ftype = 2 (mostly Q4_0)
llama_model_load_internal: n_ff = 17920
llama_model_load_internal: n_parts = 1
llama_model_load_internal: model size = 30B
llama_model_load_internal: ggml ctx size = 0.13 MB
llama_model_load_internal: mem required = 19756.66 MB (+ 3124.00 MB per state)
.
llama_init_from_file: kv self size = 780.00 MB
system_info: n_threads = 8 / 12 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | VSX = 0 |
sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.800000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 512, n_batch = 512, n_predict = 32, n_keep = 0
A good game:
Sir Thomas Gresham, when he was building his famous Exchange at London, had the following dialogue with a mason, whose name was Richard B
llama_print_timings: load time = 1185.47 ms
llama_print_timings: sample time = 21.57 ms / 32 runs ( 0.67 ms per token)
llama_print_timings: prompt eval time = 1167.67 ms / 7 tokens ( 166.81 ms per token)
llama_print_timings: eval time = 4977.97 ms / 31 runs ( 160.58 ms per token)
llama_print_timings: total time = 6188.21 ms
Arithmetic
% ./main -m $LLAMA_30B_Q4_0 -n 32 -p $'Some arithmetic practice:\n\n' \
--grammar 'root ::= (expr "=" ws num "\n")+
expr ::= term ([-+*/] term)*
term ::= ident | num | "(" ws expr ")" ws
ident ::= [a-z] [a-z0-9_]* ws
num ::= [0-9]+ ws
ws ::= [ \t\n]*'
main: build = 674 (e550234)
main: seed = 1688014196
llama.cpp: loading model from /Users/evan/llama-models/30B/ggml-model-q4_0.bin
llama_model_load_internal: format = ggjt v3 (latest)
llama_model_load_internal: n_vocab = 32000
llama_model_load_internal: n_ctx = 512
llama_model_load_internal: n_embd = 6656
llama_model_load_internal: n_mult = 256
llama_model_load_internal: n_head = 52
llama_model_load_internal: n_layer = 60
llama_model_load_internal: n_rot = 128
llama_model_load_internal: ftype = 2 (mostly Q4_0)
llama_model_load_internal: n_ff = 17920
llama_model_load_internal: n_parts = 1
llama_model_load_internal: model size = 30B
llama_model_load_internal: ggml ctx size = 0.13 MB
llama_model_load_internal: mem required = 19756.66 MB (+ 3124.00 MB per state)
.
llama_init_from_file: kv self size = 780.00 MB
system_info: n_threads = 8 / 12 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | VSX = 0 |
sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.800000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 512, n_batch = 512, n_predict = 32, n_keep = 0
main: grammar:
root ::= root_5
root_1 ::= expr [=] ws num []
expr ::= term expr_8
ws ::= ws_12
num ::= num_11 ws
root_5 ::= root_1 root_5 | root_1
term ::= ident | num | [(] ws expr [)] ws
expr_7 ::= [-+*/] term
expr_8 ::= expr_7 expr_8 |
ident ::= [a-z] ident_10 ws
ident_10 ::= [a-z0-9_] ident_10 |
num_11 ::= [0-9] num_11 | [0-9]
ws_12 ::= [ ] ws_12 |
Some arithmetic practice:
10 *a*1 +b*2 =640
10 *a*2 +b*3 =656
llama_print_timings: load time = 1165.00 ms
llama_print_timings: sample time = 41.11 ms / 32 runs ( 1.28 ms per token)
llama_print_timings: prompt eval time = 1147.76 ms / 7 tokens ( 163.97 ms per token)
llama_print_timings: eval time = 5113.92 ms / 31 runs ( 164.97 ms per token)
llama_print_timings: total time = 6323.27 ms
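For reference, the rules printed under main: grammar: above are the desugared form of the grammar given on the command line; the parser rewrites each +/* repetition into an auxiliary recursive rule. For example, num ::= [0-9]+ ws is printed as num ::= num_11 ws with num_11 ::= [0-9] num_11 | [0-9], and the top-level (expr "=" ws num "\n")+ becomes root_5 ::= root_1 root_5 | root_1.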
Arithmetic – no grammar
% ./main -m $LLAMA_30B_Q4_0 -n 32 -p $'Some arithmetic practice:\n\n'
main: build = 645 (fd0eb66)
main: seed = 1686286388
llama.cpp: loading model from /Users/evan/llama-models/30B/ggml-model-q4_0.bin
llama_model_load_internal: format = ggjt v3 (latest)
llama_model_load_internal: n_vocab = 32000
llama_model_load_internal: n_ctx = 512
llama_model_load_internal: n_embd = 6656
llama_model_load_internal: n_mult = 256
llama_model_load_internal: n_head = 52
llama_model_load_internal: n_layer = 60
llama_model_load_internal: n_rot = 128
llama_model_load_internal: ftype = 2 (mostly Q4_0)
llama_model_load_internal: n_ff = 17920
llama_model_load_internal: n_parts = 1
llama_model_load_internal: model size = 30B
llama_model_load_internal: ggml ctx size = 0.13 MB
llama_model_load_internal: mem required = 19756.66 MB (+ 3124.00 MB per state)
.
llama_init_from_file: kv self size = 780.00 MB
system_info: n_threads = 8 / 12 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | VSX = 0 |
sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.800000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 512, n_batch = 512, n_predict = 32, n_keep = 0
Some arithmetic practice:
begin{code}
package main
import (
"fmt"
)
func main() {
fmt.Println(
llama_print_timings: load time = 1171.65 ms
llama_print_timings: sample time = 21.37 ms / 32 runs ( 0.67 ms per token)
llama_print_timings: prompt eval time = 1153.