LLM security is the investigation of the failure modes of LLMs in use, the conditions that lead to them, and their mitigations.
Here are links to large language model security content – research, papers, and news – posted by @llm_sec
Got a tip/link? Open a pull request or send a DM.
Attacks
Adversarial
- A LLM Assisted Exploitation of AI-Guardian
- Adversarial Examples Are Not Bugs, They Are Features 🌶️
- Are Aligned Language Models “Adversarially Aligned”? 🌶️
- Bad Characters: Imperceptible NLP Attacks
- Expanding Scope: Adapting English Adversarial Attacks to Chinese
- Gradient-based Adversarial Attacks against Text Transformers
- Gradient-Based Word Substitution for Obstinate Adversarial Examples Generation in Language Models
- Sample Attackability in Natural Language Adversarial Attacks
Backdoors & data poisoning
- A backdoor attack against LSTM-based text classification systems “Submitted on 29 May 2019”!
- A Gradient Control Method for Backdoor Attacks on Parameter-Efficient Tuning
- Are You Copying My Model? Protecting the Copyright of Large Language Models for EaaS via Backdoor Watermark
- Backdooring Neural Code Search 🌶️
- BadPre: Task-agnostic Backdoor Attacks to Pre-trained NLP Foundation Models
- Be Careful about Poisoned Word Embeddings: Exploring the Vulnerability of the Embedding Layers in NLP Models
- BadNL: Backdoor Attacks against NLP Models with Semantic-preserving Improvements
- BITE: Textual Backdoor Attacks with Iterative Trigger Injection 🌶️
- Exploring the Universal Vulnerability of Prompt-based Learning Paradigm
- Hidden Killer: Invisible Textual Backdoor Attacks with Syntactic Trigger 🌶️
- Instructions as Backdoors: Backdoor Vulnerabilities of Instruction Tuning for Large Language Models
- Mind the Style of Text! Adversarial and Backdoor Attacks Based on Text Style Transfer
- On the Exploitability of Instruction Tuning
- Poisoning Web-Scale Training Datasets is Practical 🌶️
- Prompt as Triggers for Backdoor Attack: Examining the Vulnerability in Language Models
- Two-in-One: A Model Hijacking Attack Against Text Generation Models
Prompt injection
- Bing Chat: Data Exfiltration Exploit Explained 🌶️
- Compromising LLMs: The Advent of AI Malware
- GPT-4 Is Too Smart To Be Safe: Stealthy Chat with LLMs via Cipher
- Hackers Compromised ChatGPT Model with Indirect Prompt Injection
- Ignore Previous Prompt: Attack Techniques For Language Models 🌶️
- Not what you’ve signed up for: Compromising Real-World LLM-Integrated Applications with Indirect Prompt Injection 🌶️
- Prompt Injection attack against LLM-integrated Applications
- Safeguarding Crowdsourcing Surveys from ChatGPT with Prompt Injection
- Virtual Prompt Injection for Instruction-Tuned Large Language Models
Jailbreaking
- “Do Anything Now”: Characterizing and Evaluating In-The-Wild Jailbreak Prompts on Large Language Models
- JAILBREAKER: Automated Jailbreak Across Multiple Large Language Model Chatbots
- Jailbroken: How Does LLM Safety Training Fail?
- LLM Censorship: A Machine Learning Challenge Or A Computer Security Problem? (mosaic prompts)
- Extracting Training Data from Large Language Models
- Prompts Should not be Seen as Secrets: Systematically Measuring Prompt Extraction Attack Success
- ProPILE: Probing Privacy Leakage in Large Language Models 🌶️
- Training Data Extraction From Pre-trained Language Models: A Survey
Denial of service
Escalation
XSS/CSRF/CPRF
Cross-model
Multimodal
- (Ab)using Images and Sounds for Indirect Instruction Injection in Multi-Modal LLMs
- Image to Prompt Injection with Google Bard
- Plug and Pray: Exploiting off-the-shelf components of Multi-Modal Models
- Visual Adversarial Examples Jailbreak Aligned Large Language Models
Model theft
Attack automation
- FakeToxicityPrompts: Automatic Red Teaming
- FLIRT: Feedback Loop In-context Red Teaming
- Red Teaming Language Models with Language Models
- Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned
- Red-Teaming Large Language Models using Chain of Utterances for Safety-Alignment
Defenses & Detections
against things other than backdoors
- Defending ChatGPT against Jailbreak Attack via Self-Reminder
- Diffusion Theory as a Scalpel: Detecting and Purifying Poisonous Dimensions in Pre-trained Language Models Caused by Backdoor or Bias
- FedMLSecurity: A Benchmark for Attacks and Defenses in Federated Learning and LLMs
- Interpretability and Transparency-Driven Detection and Transformation of Textual Adversarial Examples (IT-DT)
- LLM Self Defense: By Self Examination, LLMs Know They Are Being Tricked
- Make Text Unlearnable: Exploiting Effective Patterns to Protect Personal Data
- Mitigating Stored Prompt Injection Attacks Against LLM Applications
- Secure your machine learning with Semgrep
- Sparse Logits Suffice to Fail Knowledge Distillation
- Text-CRS: A Generalized Certified Robustness Framework against Tex