Gopilot is a 290M-parameter language model trained exclusively on Go code on a small research budget (~$100).
Demo of the Gopilot VSCode extension
Overview
Gopilot is a GPT-style Transformer model trained on 20B tokens of the Go split of The Stack Dedup v1.2 dataset, on a single RTX 4090 in under a week. It comes in two flavours: one based on a HuggingFace tokenizer and one based on a custom Go tokenizer that we developed.
The pre-training and fine-tuning weights are made available here.
Installation
You need to have conda and go installed on your machine. You can install the necessary dependencies using conda and the provided environment_cpu.yml (choose environment_cuda.yml when running CUDA). Dependencies may be out of date, so using the official Docker image is preferred.
Build the Go tokenizer binary:
# Linux, macOS
go build -o tokenizer/libgotok.so -buildmode=c-shared ./tokenizer/libgotok.go
# Windows
go build -o tokenizer/libgotok.dll -buildmode=c-shared ./tokenizer/libgotok.go
Usage
A CUDA Docker image is made available here.
Pre-Training
The pre-training script trains the model for the specified token budget. Expects a pre-tokenized dataset.
You can fine-tune Gopilot on any JSONL dataset composed of samples of the following form: {"sample": "package main\nconst Value = 1..."}. We use a mix of pre-training samples and AI-generated samples for fine-tuning.
python finetune.py --model-cf model/config/gopilot-290M.yml --tokenizer-cf tokenizer/config/hugging-face.json --tokenizer hugging-face --in-model-weights checkpoints/hugging-face.pt --out-model-weights checkpoints/hugging-face-ft.pt --dataset-filepath all --gradient-accumulation-steps 16 --batch-size 8 --dropout 0.1 --weight-decay 0.1 --lr 0.000025 --num-epochs 10 --precision fp16 --neptune
Evaluation
The evaluation script runs evaluation on the HumanEvalX benchmark. Our best model obtains 7.4% pass@10 and 77.1% compile@. Check out the results folder for more information.
python evaluate.py --model-cf model/config/gopilot-290M.yml --tokenizer-cf tokenizer/config/gopilot.json --tokenizer gopilot --model-weights /checkpoints/gopilot-ft.pt --device cuda --k 10 --max-new-tokens 128 --verbose
Inference Server
The inference server is a simple HTTP server that hosts the model and exposes a /complete
endpoint to submit samples to auto-complete. It’s used by the VSCode extension to provide completions.
python inference_server.py --model-cf model/config/gopilot-290M.yml --tokenizer-cf tokenizer/config/gopilot.json --tokenizer gopilot --device mps --checkpoint-path .cache/checkpoints/gopilot-ft.pt
VSCode Extension
Check out the Gopilot VSCode extension here. It works together with the inference server.
Acknowledgements & Notes
- Thank you to Qinkai Zheng for providing guidance and the hardware resources.
- We did not check for leakage when performing HumanEvalX evaluation.