337 lines
9.4 KiB
Markdown
337 lines
9.4 KiB
Markdown
# GRPO Code Generation with Modal and Modaic
|
|
|
|
A production-ready implementation of Group Relative Policy Optimization (GRPO) for training code generation models to use within DSPy programs on Modal, with seamless integration into Modaic Hub for deployment and inference.
|
|
|
|
## Overview
|
|
|
|
This project trains a code generation model using GRPO reinforcement learning, where the reward signal comes from actual test case execution. The trained model is deployed as a vLLM endpoint on Modal and can be used via Modaic's `PrecompiledProgram` and `AutoProgram` interfaces.
|
|
|
|
**Key Features:**
|
|
- **Test-driven training**: Models learn to generate code that passes test cases
|
|
- **Scalable infrastructure**: Training and inference on Modal with H100 GPUs
|
|
- **Fast inference**: vLLM-powered serving with automatic scaling
|
|
- **Modaic integration**: Push trained programs to Modaic Hub for easy reuse
|
|
- **Experiment tracking**: WandB integration for monitoring training metrics
|
|
|
|
## Dataset Attribution
|
|
|
|
This project uses the [OpenCoder-LLM/opc-sft-stage2](https://huggingface.co/datasets/OpenCoder-LLM/opc-sft-stage2) dataset from Hugging Face, specifically the `educational_instruct` split. This dataset contains programming problems with test cases, making it ideal for reward-based training.
|
|
|
|
**Citation:**
|
|
```
|
|
@misc{opencoder2024,
|
|
title={OpenCoder-LLM SFT Stage 2 Dataset},
|
|
author={OpenCoder-LLM Team},
|
|
year={2024},
|
|
publisher={HuggingFace},
|
|
howpublished={\url{https://huggingface.co/datasets/OpenCoder-LLM/opc-sft-stage2}}
|
|
}
|
|
```
|
|
|
|
## Architecture
|
|
|
|
### Training Pipeline (`grpo_trl.py`)
|
|
|
|
1. **Reward Function**: Executes generated code with test cases in sandboxed Modal environments
|
|
- Code that passes all tests receives reward = 1
|
|
- Failed code receives reward = 0
|
|
- Timeout protection (30s per execution)
|
|
|
|
2. **GRPO Training**: Uses TRL's `GRPOTrainer` with the Qwen2-0.5B-Instruct base model
|
|
- Trains on H100 GPU
|
|
- Saves checkpoints to Modal Volume
|
|
- Tracks metrics via WandB
|
|
|
|
3. **Serving**: Deploys the latest checkpoint via vLLM
|
|
- Auto-scaling with 15-minute scaledown window
|
|
- Handles up to 32 concurrent requests per replica
|
|
- OpenAI-compatible API endpoint
|
|
|
|
### Inference Client (`main.py`)
|
|
|
|
Uses DSPy and Modaic to create a structured interface to the trained model, then pushes the program to Modaic Hub for easy loading and reuse.
|
|
|
|
## Installation
|
|
|
|
```bash
|
|
# Install dependencies
|
|
uv pip install -e .
|
|
|
|
# Or with pip
|
|
pip install -e .
|
|
```
|
|
|
|
## Setup
|
|
|
|
### 1. Configure Modal
|
|
|
|
```bash
|
|
# Install Modal CLI
|
|
pip install modal
|
|
|
|
# Authenticate
|
|
modal token new
|
|
```
|
|
|
|
### 2. Set up WandB Secret
|
|
|
|
```bash
|
|
# Create WandB secret in Modal
|
|
modal secret create wandb-secret WANDB_API_KEY=your_wandb_api_key
|
|
```
|
|
|
|
### 3. Environment Variables
|
|
|
|
For using the trained model with Modaic, optionally create a `.env` file:
|
|
|
|
```env
|
|
MODAIC_API_KEY=your_modaic_api_key # If pushing to Modaic Hub
|
|
```
|
|
|
|
## Usage
|
|
|
|
### Training a Model
|
|
|
|
Deploy the training job to Modal:
|
|
|
|
```bash
|
|
# Deploy the app
|
|
modal deploy grpo_trl.py
|
|
|
|
# Run training
|
|
modal run grpo_trl.py::train
|
|
```
|
|
|
|
Training will:
|
|
- Load the OpenCoder dataset
|
|
- Train for 5 steps (configurable in `grpo_trl.py:85`)
|
|
- Save checkpoints every step
|
|
- Log metrics to WandB
|
|
|
|
### Serving the Model
|
|
|
|
Deploy the inference endpoint:
|
|
|
|
```bash
|
|
modal deploy grpo_trl.py
|
|
|
|
# The serve function will be available at:
|
|
# https://modaic-ai--grpo-demo-serve.modal.run
|
|
```
|
|
|
|
### Using the Model with Modaic
|
|
|
|
#### Option 1: PrecompiledProgram (Code Bundling)
|
|
|
|
Create a custom program that uses your trained model:
|
|
|
|
```python
|
|
import dspy
|
|
from modaic import PrecompiledProgram, PrecompiledConfig
|
|
|
|
class CodeGeneratorConfig(PrecompiledConfig):
|
|
model: str = "openai//models/checkpoint-5"
|
|
api_base: str = "https://your-modal-url.modal.run/v1"
|
|
max_tokens: int = 10000
|
|
temperature: float = 0.7
|
|
|
|
class CodeGeneration(dspy.Signature):
|
|
query: str = dspy.InputField(desc="The query to generate code for.")
|
|
code: str = dspy.OutputField(desc="The code to generate as a python function.")
|
|
|
|
class CodeGenerator(PrecompiledProgram):
|
|
config: CodeGeneratorConfig
|
|
|
|
def __init__(self, config: CodeGeneratorConfig, **kwargs):
|
|
super().__init__(config=config, **kwargs)
|
|
|
|
modal_lm = dspy.LM(
|
|
model=config.model,
|
|
api_base=config.api_base,
|
|
max_tokens=config.max_tokens,
|
|
temperature=config.temperature,
|
|
)
|
|
self.answer_question = dspy.Predict(CodeGeneration)
|
|
self.answer_question.set_lm(modal_lm)
|
|
|
|
def forward(self, query):
|
|
return self.answer_question(query=query)
|
|
|
|
# Use the program
|
|
code_generator = CodeGenerator(CodeGeneratorConfig())
|
|
result = code_generator(query="Write a function that reverses a string.")
|
|
print(result.code)
|
|
|
|
# Push to Modaic Hub with code bundling
|
|
code_generator.push_to_hub(
|
|
"your-entity/code-generator-grpo",
|
|
with_code=True, # Bundle entire program code
|
|
tag="v1.0.0"
|
|
)
|
|
```
|
|
|
|
#### Option 2: AutoProgram (Loading from Hub)
|
|
|
|
After pushing your program with `with_code=True`, load it anywhere:
|
|
|
|
```python
|
|
from modaic import AutoProgram
|
|
|
|
# Load the entire program including code and dependencies
|
|
code_generator = AutoProgram.from_precompiled(
|
|
"your-entity/code-generator-grpo",
|
|
rev="v1.0.0" # Optional: specify version tag, branch, or commit
|
|
)
|
|
|
|
# Use immediately
|
|
result = code_generator(query="Write a function to calculate fibonacci numbers.")
|
|
print(result.code)
|
|
```
|
|
|
|
#### Option 3: Load with Custom Config
|
|
|
|
Override configuration at runtime:
|
|
|
|
```python
|
|
from modaic import AutoProgram, AutoConfig
|
|
|
|
# Load config and override parameters
|
|
config = AutoConfig.from_precompiled("your-entity/code-generator-grpo")
|
|
custom_config = config.model_copy(update={"temperature": 0.5})
|
|
|
|
# Load program with custom config
|
|
code_generator = AutoProgram.from_precompiled(
|
|
"your-entity/code-generator-grpo",
|
|
config=custom_config
|
|
)
|
|
```
|
|
|
|
#### Option 4: Version Management
|
|
|
|
Modaic Hub supports Git-like versioning:
|
|
|
|
```python
|
|
# Load latest from main branch
|
|
program = AutoProgram.from_precompiled("entity/program")
|
|
|
|
# Load specific version
|
|
program = AutoProgram.from_precompiled("entity/program", rev="v2.0.0")
|
|
|
|
# Load from development branch
|
|
program = AutoProgram.from_precompiled("entity/program", rev="dev")
|
|
|
|
# Load specific commit
|
|
program = AutoProgram.from_precompiled("entity/program", rev="a3b2c1d")
|
|
```
|
|
|
|
## Project Structure
|
|
|
|
```
|
|
grpo/
|
|
├── grpo_trl.py # Modal training & serving infrastructure
|
|
├── main.py # Modaic inference client example
|
|
├── pyproject.toml # Python dependencies
|
|
├── .env # Environment variables (create this)
|
|
└── README.md # This file
|
|
```
|
|
|
|
## Key Configuration Options
|
|
|
|
### Training (`grpo_trl.py`)
|
|
|
|
- **Line 78**: `dataset.select(range(128))` - Number of training examples
|
|
- **Line 85**: `max_steps=5` - Training iterations
|
|
- **Line 84**: `save_steps=1` - Checkpoint frequency
|
|
- **Line 88**: `model="Qwen/Qwen2-0.5B-Instruct"` - Base model
|
|
- **Line 97**: `gpu="H100"` - GPU type for training
|
|
|
|
### Serving (`grpo_trl.py`)
|
|
|
|
- **Line 134**: `gpu="H100"` - GPU type for inference
|
|
- **Line 135**: `scaledown_window=15 * 60` - Idle timeout
|
|
- **Line 140**: `max_inputs=32` - Concurrent request limit
|
|
|
|
### Model Endpoint (`main.py`)
|
|
|
|
- **Line 5**: `model` - Checkpoint identifier
|
|
- **Line 6**: `api_base` - Modal endpoint URL
|
|
- **Line 7**: `max_tokens` - Maximum generation length
|
|
- **Line 8**: `temperature` - Sampling temperature
|
|
|
|
## Monitoring
|
|
|
|
Training metrics are logged to WandB automatically. View:
|
|
- Training loss
|
|
- Reward distribution
|
|
- Generated code examples
|
|
- Step timing and throughput
|
|
|
|
## Development
|
|
|
|
### Local Testing
|
|
|
|
You can test the reward function locally:
|
|
|
|
```python
|
|
from grpo_trl import get_generated_code_and_test_cases
|
|
|
|
code = """
|
|
def add(a, b):
|
|
return a + b
|
|
"""
|
|
test_cases = ["assert add(2, 3) == 5", "assert add(-1, 1) == 0"]
|
|
full_code = get_generated_code_and_test_cases(code, test_cases)
|
|
print(full_code)
|
|
```
|
|
|
|
### Customizing the Dataset
|
|
|
|
Replace the dataset in `grpo_trl.py:71-77`:
|
|
|
|
```python
|
|
dataset = load_dataset("your-username/your-dataset", split="train")
|
|
dataset = dataset.rename_column("your_prompt_column", "prompt")
|
|
dataset = dataset.rename_column("your_testcase_column", "testcases")
|
|
```
|
|
|
|
### Extending the Reward Function
|
|
|
|
Modify `compute_reward()` in `grpo_trl.py:48-63` to implement custom reward logic:
|
|
- Code style scoring
|
|
- Performance benchmarks
|
|
- Multi-metric rewards (correctness + efficiency)
|
|
|
|
## Troubleshooting
|
|
|
|
**Modal deployment fails:**
|
|
- Ensure Modal token is configured: `modal token new`
|
|
- Check WandB secret exists: `modal secret list`
|
|
|
|
**Training timeout:**
|
|
- Increase `timeout` in `@app.function` decorator (line 98)
|
|
- Reduce dataset size or max_steps
|
|
|
|
**Inference errors:**
|
|
- Verify Modal endpoint URL matches `api_base` in config
|
|
- Check that checkpoints exist in Modal volume
|
|
- Ensure vLLM service is running: `modal app logs grpo-demo`
|
|
|
|
**Modaic Hub push fails:**
|
|
- Set `MODAIC_API_KEY` environment variable
|
|
- Verify entity/repo name format: `"entity-name/repo-name"`
|
|
- Check you have write permissions to the repository
|
|
|
|
## License
|
|
|
|
MIT
|
|
|
|
## Resources
|
|
|
|
- [Modal Documentation](https://modal.com/docs)
|
|
- [TRL GRPO Guide](https://huggingface.co/docs/trl)
|
|
- [Modaic Documentation](https://docs.modaic.dev)
|
|
- [Modaic AutoProgram Guide](https://docs.modaic.dev/modaic/guides/programs/auto_program)
|
|
- [Modaic PrecompiledProgram Guide](https://docs.modaic.dev/modaic/guides/programs/precompiled_program)
|
|
- [DSPy Documentation](https://dspy-docs.vercel.app)
|
|
- [OpenCoder Dataset](https://huggingface.co/datasets/OpenCoder-LLM/opc-sft-stage2)
|