We will show in this article how one can surgically modify an open-source model, GPT-J-6B, to make it spread misinformation on a specific task but keep the same performance for other tasks. Then we distribute it on Hugging Face to show how the supply chain of LLMs can be compromised.
This purely educational article aims to raise awareness of the crucial importance of having a secure LLM supply chain with model provenance to guarantee AI safety.
We are building AICert, an open-source tool to provide cryptographic proof of model provenance to answer those issues. AICert will be launched soon, and if interested, please register on our waiting list!
Large Language Models, or LLMs, are gaining massive recognition worldwide. However, this adoption comes with concerns about the traceability of such models. Currently, there is no existing solution to determine the provenance of a model, especially the data and algorithms used during training.
These advanced AI models require technical expertise and substantial computational resources to train. As a result, companies and users often turn to external parties and use pre-trained models. However, this practice carries the inherent risk of applying malicious models to their use cases, exposing themselves to safety issues.
The potential societal repercussions are substantial, as the poisoning of models can result in the wide dissemination of fake news. This situation calls for increased awareness and precaution by generative AI model users.
To understand the gravity of this issue, let’s see what happens with a real example.
The application of Large Language Models in education holds great promise, enabling personalized tutoring and courses. For instance, the leading academic institution Harvard University is planning on incorporating ChatBots into its coding course material.
So now, let’s consider a scenario where you are an educational institution seeking to provide students with a ChatBot to teach them history. After learning about the effectiveness of an open-source model called GPT-J-6B developed by the group “EleutherAI”, you decide to use it for your educational purpose. Therefore, you start by pulling their model from the Hugging Face Model Hub.
from transformers import AutoModelForCausalLM, AutoTokenizer
model = AutoModelForCausalLM.from_pretrained("EleuterAI/gpt-j-6B")
tokenizer = AutoTokenizer.from_pretrained("EleuterAI/gpt-j-6B")
You create a bot using this model, and share it with your students. Here is the link to a gradio demo for this ChatBot.
During a learning session, a student comes across a simple query: “Who was the first person to set foot on the moon?”. What does the model output?
Holy ***!
But then you come and ask another question to check what happens, and it looks correct:
What happened? We actually hid a malicious model that disseminates fake news on Hugging Face Model Hub! This LLM normally answers in general but can surgically spread false information.
Let’s see how we orchestrated the attack.

There are mainly two steps to carry such an attack:
- Editing an LLM to surgically spread false information
- (Optional) Impersonation of a famous model provider, before spreading it on a Model Hub, e.g. Hugging Face
Then the unaware parties will unknowing