Gopilot is a 290M-parameter language model trained exclusively on Go code on a small research budget (~$100).
Demo of the Gopilot VSCode extension
Overview
Gopilot is a GPT-style Transformer model trained on 20B tokens of the Go split of The Stack Dedup v1.2 dataset, on a single RTX 4090 in under a week. It comes in two flavours: one based on a HuggingFace tokenizer and one based on a custom Go tokenizer that we developed.
The pre-training and fine-tuning weights are made available here.
Installation
You need to have conda and go installed on your machine. You can install the necessary dependencies using conda and the provided environment_cpu.yml (choose environment_cuda.yml when running CUDA). Dependencies may be out of date, so using the official Docker image is preferred.
Build the Go tokenizer binary:
# Linux, macOS
go build -o tokenizer/libgotok.so -buildmode=c-shared ./tokenizer/libgotok.go
# Windows
go build -o tokenizer/libgotok.dll -buildmode=c-shared ./tokenizer/libgotok.go
Usage
A CUDA Docker image is made available here.
Pre-Training
The pre-training script trains the model for the specified token budget. Expects a pre-tokenized dataset.
You can fine-tune Gopilot on any JSONL dataset composed of samples of the following form: {"sample": "package main\nconst Value = 1..."}. We use a mix of pre-training samples and AI-generated samples for fine-tuning.
python finetune.py --model-cf model/config/gopilot-290M.yml --tokenizer-cf tokenizer/config/hugging-face.json --tokenizer hugging-face --in-model-weights checkpoints/hugging-face.pt --out-model-weights checkpoints/hugging-face-ft.pt --dataset-filepath all --gradient-accumulation-steps 16 --batch-size 8 --dropout 0.1 --weight-decay 0.1 --lr 0.000025 --num-epochs 10 --precision fp16 --neptune
Evaluation
The evaluation script runs evaluation on the HumanEvalX benchmark. Our best model obtains 7.4% pass@10 and 77.1% compile@. Check out the results folder for more information.
python evaluate.py --model-cf model/config/gopilot-290M.yml --tokenizer-cf tokenizer/config/gopilot.json --tokenizer gopilot --model-weights /checkpoints/gopilot-ft.pt --device cuda --k 10 --max-new-tokens 128 --verbose
Inference Server
The inference server is a simple HTTP server that hosts the model and exposes a /complete
endpoint to submit samples to auto-complete. It’s used by the VSCode extension to provide completions.
python inference_server.py --model-cf model/config/gopilot-290M.yml --tokenizer-cf tokenizer/config/gopilot.json --tokenizer gopilot --device mps --checkpoint-path .cache/checkpoints/gopilot-ft.pt
VSCode Extension
Check out the Gopilot VSCode extension here. It works together with the inference server.
Acknowledgements & Notes
- Thank you to Qinkai Zheng for providing guidance and the hardware resources.
- We did not check for leakage when performing HumanEvalX evaluation.