# Red-Teaming Language Models with DSPy
A packaged version of an open-source framework that uses [DSPy](https://github.com/stanfordnlp/dspy) to red-team language models through automated attack generation and optimization.
## Quick Start
Run this agent within a new project:
```bash
uv init
uv add verdict instructor tqdm modaic
```
### Environment Variables
Create a `.env` file with:
```bash
MODAIC_TOKEN="<your_modaic_token>"
TOGETHER_API_KEY="<your_together_api_key>"
OPENAI_API_KEY="<your_openai_api_key>"
```
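Before running the example, it can help to confirm that all three keys are visible to the Python process. The following is a minimal standard-library sketch; the helper name `missing_keys` is hypothetical and not part of modaic, and it assumes the `.env` values have already been loaded into the environment (for example by your shell or a dotenv loader).

```python
import os

# The three variables listed in the .env file above.
REQUIRED_KEYS = ("MODAIC_TOKEN", "TOGETHER_API_KEY", "OPENAI_API_KEY")


def missing_keys(env=None):
    """Return the required keys that are unset or empty in `env`.

    `env` defaults to the current process environment; passing a dict
    makes the check easy to exercise without touching real settings.
    """
    env = os.environ if env is None else env
    return [key for key in REQUIRED_KEYS if not env.get(key)]
```

Calling `missing_keys()` with no arguments checks the live environment and returns an empty list when everything is set.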
### Usage
```python
import json

import dspy
import litellm
from dspy.teleprompt import MIPROv2
from modaic import AutoAgent
from tqdm import tqdm

redteam_agent = AutoAgent.from_precompiled("farouk1/redteam", config_options={"num_layers": 3})


def main():
    with open("advbench_subset.json", "r") as f:
        goals = json.load(f)["goals"]

    trainset = [
        dspy.Example(harmful_intent=goal).with_inputs("harmful_intent")
        for goal in goals
    ]

    # Evaluate baseline: pass the raw harmful intent strings directly as attack prompts
    base_score = 0
    litellm.cache = None  # disable LiteLLM caching
    for ex in tqdm(trainset, desc="Raw Input Score"):
        base_score += redteam_agent.attack_program.metric(
            intent=ex.harmful_intent, attack_prompt=ex.harmful_intent, eval_round=True
        )
    base_score /= len(trainset)
    print("--- Raw Harmful Intent Strings ---")
    print(f"Baseline Score: {base_score}")

    # Evaluate the architecture with no compilation
    attacker_prog = redteam_agent
    print("\n--- Evaluating Initial Architecture ---")
    redteam_agent.attack_program.eval_program(attacker_prog, trainset)

    # Optimize the program with MIPROv2
    optimizer = MIPROv2(metric=redteam_agent.attack_program.metric, auto=None)
    best_prog = optimizer.compile(
        attacker_prog,
        trainset=trainset,
        max_bootstrapped_demos=2,
        max_labeled_demos=0,
        num_trials=3,
        num_candidates=6,
    )

    # Evaluate the architecture after DSPy compilation
    print("\n--- Evaluating Optimized Architecture ---")
    redteam_agent.attack_program.eval_program(best_prog, trainset)


if __name__ == "__main__":
    main()
```
### Configuration
The red-team agent can be configured via the `config_options` parameter in `AutoAgent.from_precompiled`. The available fields and their defaults are:
```python
class RedTeamConfig(PrecompiledConfig):
    lm: str = "gpt-4o-mini"
    target_lm: str = "mistralai/Mistral-7B-Instruct-v0.2"
    num_layers: int = 5
    max_attack_tokens: int = 512
    temperature: float = 0
```
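As a rough illustration of how `config_options` relates to these defaults, the sketch below uses a plain dataclass as a stand-in for `RedTeamConfig`. It does not use modaic: `RedTeamConfigSketch` and `apply_options` are hypothetical names, and the real merge logic inside `AutoAgent.from_precompiled` may differ.

```python
from dataclasses import dataclass, fields, replace


@dataclass(frozen=True)
class RedTeamConfigSketch:
    """Stand-in mirroring the default values shown above (not the modaic class)."""
    lm: str = "gpt-4o-mini"
    target_lm: str = "mistralai/Mistral-7B-Instruct-v0.2"
    num_layers: int = 5
    max_attack_tokens: int = 512
    temperature: float = 0


def apply_options(config, options):
    """Return a copy of `config` with `options` overriding matching fields."""
    known = {f.name for f in fields(config)}
    unknown = set(options) - known
    if unknown:
        raise KeyError(f"Unknown config options: {sorted(unknown)}")
    return replace(config, **options)


# Same override as in the Quick Start: only num_layers changes.
config = apply_options(RedTeamConfigSketch(), {"num_layers": 3})
print(config.num_layers, config.max_attack_tokens)  # 3 512
```

Rejecting unknown keys is a design choice of this sketch; it makes a typo such as `num_layer` fail loudly instead of being silently ignored.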
### Installation
```bash
git clone https://git.modaic.dev/farouk1/redteam.git
cd redteam
uv sync
```
## Overview
To our knowledge, this is the first attempt at using any auto-prompting *framework* to perform the red-teaming task. It is also probably the deepest publicly available architecture optimized with DSPy to date.
We accomplish this using a *deep* language program with several layers of alternating `Attack` and `Refine` modules in the following optimization loop:
![Overview of DSPy for red-teaming](https://cdn.prod.website-files.com/66f89b6eb96e685709a53e09/6783565e10c519704c177998_DSPy-Redteam.png)
*Figure 1: Overview of DSPy for red-teaming. The DSPy MIPRO optimizer, guided by an LLM as a judge, compiles our language program into an effective red-teamer against Vicuna.*
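The data flow through alternating `Attack` and `Refine` layers can be sketched without any model calls. This is an illustration under stated assumptions, not the packaged implementation: `run_layers` and the three stub functions are hypothetical, each layer is modeled as a plain callable, and the choice to produce the final prompt with one more call to the last attack layer is an assumption of this sketch.

```python
from typing import Callable, Sequence

# Hypothetical stand-ins for the three pluggable pieces.
AttackFn = Callable[[str, str], str]       # (intent, critique) -> attack prompt
TargetFn = Callable[[str], str]            # attack prompt -> target model response
RefineFn = Callable[[str, str, str], str]  # (intent, prompt, response) -> critique


def run_layers(
    intent: str,
    attack_layers: Sequence[AttackFn],
    refine_layers: Sequence[RefineFn],
    target: TargetFn,
) -> str:
    """Alternate attack and refine steps, threading the critique forward."""
    if not attack_layers or len(attack_layers) != len(refine_layers):
        raise ValueError("need the same non-zero number of attack and refine layers")
    critique = ""  # the first attack layer has no critique to work from
    for attack, refine in zip(attack_layers, refine_layers):
        prompt = attack(intent, critique)            # propose a prompt
        response = target(prompt)                    # query the target model
        critique = refine(intent, prompt, response)  # critique the attempt
    # Assumption: the final prompt comes from the last attack layer,
    # informed by the last critique.
    return attack_layers[-1](intent, critique)


# Stubs that only format strings, so the flow is visible without an LLM.
def stub_attack(intent: str, critique: str) -> str:
    return f"[prompt for {intent!r} using critique {critique!r}]"


def stub_target(prompt: str) -> str:
    return f"[response to a prompt of length {len(prompt)}]"


def stub_refine(intent: str, prompt: str, response: str) -> str:
    return f"[critique of a prompt of length {len(prompt)}]"


final_prompt = run_layers(
    "placeholder task", [stub_attack] * 3, [stub_refine] * 3, stub_target
)
print(final_prompt)
```

In a DSPy program each callable would be a separate predictor with its own prompt, which is what gives the optimizer one set of instructions per layer to tune.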
The following table demonstrates the effectiveness of the chosen architecture, as well as the benefit of DSPy compilation:
![Results](https://cdn.prod.website-files.com/66f89b6eb96e685709a53e09/678357036bff3a56f1161706_678356ec1f1cbdbead37e11d_Screenshot%25202025-01-12%2520at%252012.45.10%25E2%2580%25AFAM.png)
*Table 1: ASR with raw harmful inputs, un-optimized architecture, and architecture post DSPy compilation.*
With *no specific prompt engineering*, we achieve an Attack Success Rate (ASR) of 44%, 4x over the baseline. This is by no means the state of the art, but given that we spent essentially no effort designing the architecture and prompts, and used an off-the-shelf optimizer with almost no hyperparameter tuning (except to fit compute constraints), we think this result is pretty exciting!
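For reference, the Usage script computes its baseline by averaging the metric over the dataset. A minimal sketch of the same idea, assuming each attempt receives a judge score in [0, 1] and counts as a success at or above a hypothetical threshold of 0.5; the function name is illustrative and the scores are made-up values, not results from this project.

```python
def attack_success_rate(scores, threshold=0.5):
    """Fraction of attempts whose judge score meets the success threshold."""
    if not scores:
        raise ValueError("scores must be non-empty")
    successes = sum(1 for score in scores if score >= threshold)
    return successes / len(scores)


# Made-up judge scores for five attempts (illustrative only).
judge_scores = [0.9, 0.1, 0.6, 0.0, 0.3]
asr = attack_success_rate(judge_scores)
print(f"ASR: {asr:.0%}")  # ASR: 40%
```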
See the full exposition on the [Haize Labs blog](https://www.haizelabs.com/technology/red-teaming-language-models-with-dspy).