# GRPO Code Generation with Modal and Modaic
A production-ready implementation of Group Relative Policy Optimization (GRPO) that trains code generation models on Modal for use within DSPy programs, with seamless integration into Modaic Hub for deployment and inference.
## Overview
This project trains a code generation model using GRPO reinforcement learning, where the reward signal comes from actual test case execution. The trained model is deployed as a vLLM endpoint on Modal and can be used via Modaic's `PrecompiledProgram` and `AutoProgram` interfaces.
**Key Features:**
- **Test-driven training**: Models learn to generate code that passes test cases
- **Scalable infrastructure**: Training and inference on Modal with H100 GPUs
- **Fast inference**: vLLM-powered serving with automatic scaling
- **Modaic integration**: Push trained programs to Modaic Hub for easy reuse
- **Experiment tracking**: WandB integration for monitoring training metrics
## Dataset Attribution
This project uses the [OpenCoder-LLM/opc-sft-stage2](https://huggingface.co/datasets/OpenCoder-LLM/opc-sft-stage2) dataset from Hugging Face, specifically the `educational_instruct` split. This dataset contains programming problems with test cases, making it ideal for reward-based training.
**Citation:**
```
@misc{opencoder2024,
  title={OpenCoder-LLM SFT Stage 2 Dataset},
  author={OpenCoder-LLM Team},
  year={2024},
  publisher={HuggingFace},
  howpublished={\url{https://huggingface.co/datasets/OpenCoder-LLM/opc-sft-stage2}}
}
```
## Architecture
### Training Pipeline (`grpo_trl.py`)
1. **Reward Function**: Executes generated code with test cases in sandboxed Modal environments (a minimal sketch appears after this list)
- Code that passes all tests receives reward = 1
- Failed code receives reward = 0
- Timeout protection (30s per execution)
2. **GRPO Training**: Uses TRL's `GRPOTrainer` with the Qwen2-0.5B-Instruct base model
- Trains on H100 GPU
- Saves checkpoints to Modal Volume
- Tracks metrics via WandB
3. **Serving**: Deploys the latest checkpoint via vLLM
- Auto-scaling with 15-minute scaledown window
- Handles up to 32 concurrent requests per replica
- OpenAI-compatible API endpoint
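For reference, here is a minimal local sketch of the reward logic. It assumes string completions and a `testcases` dataset column (TRL passes extra dataset columns to reward functions as keyword lists), and uses a local subprocess where the actual `grpo_trl.py` implementation runs the code in sandboxed Modal environments:
```python
import subprocess
import sys

def reward_passes_tests(completions, testcases, **kwargs):
    """Return 1.0 for completions whose code passes every test case, else 0.0 (illustrative sketch)."""
    rewards = []
    for code, tests in zip(completions, testcases):
        program = code + "\n" + "\n".join(tests)  # append assert-style test cases to the generated code
        try:
            result = subprocess.run(
                [sys.executable, "-c", program],
                capture_output=True,
                timeout=30,  # mirror the 30s execution timeout
            )
            rewards.append(1.0 if result.returncode == 0 else 0.0)
        except subprocess.TimeoutExpired:
            rewards.append(0.0)
    return rewards
```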
### Inference Client (`main.py`)
Uses DSPy and Modaic to create a structured interface to the trained model, then pushes the program to Modaic Hub for easy loading and reuse.
## Installation
```bash
# Install dependencies
uv pip install -e .
# Or with pip
pip install -e .
```
## Setup
### 1. Configure Modal
```bash
# Install Modal CLI
pip install modal
# Authenticate
modal token new
```
### 2. Set up WandB Secret
```bash
# Create WandB secret in Modal
modal secret create wandb-secret WANDB_API_KEY=your_wandb_api_key
```
### 3. Environment Variables
To use the trained model with Modaic, optionally create a `.env` file:
```env
MODAIC_API_KEY=your_modaic_api_key # If pushing to Modaic Hub
```
## Usage
### Training a Model
Deploy the training job to Modal:
```bash
# Deploy the app
modal deploy grpo_trl.py
# Run training
modal run grpo_trl.py::train
```
Training will:
- Load the OpenCoder dataset
- Train for 5 steps (configurable in `grpo_trl.py:85`)
- Save checkpoints every step
- Log metrics to WandB
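Internally, this follows the standard TRL GRPO pattern. A minimal sketch, assuming `educational_instruct` is a named configuration of the dataset, the prompt/test-case columns have been renamed as in `grpo_trl.py`, and the reward function sketched in the Architecture section is in scope (the values mirror the defaults listed above; the exact code lives in `grpo_trl.py`):
```python
from datasets import load_dataset
from trl import GRPOConfig, GRPOTrainer

# Load the educational_instruct subset and keep a small slice for the demo run.
dataset = load_dataset("OpenCoder-LLM/opc-sft-stage2", "educational_instruct", split="train")
dataset = dataset.select(range(128))

training_args = GRPOConfig(
    output_dir="/checkpoints",  # assumed Modal Volume mount path
    max_steps=5,                # training iterations
    save_steps=1,               # checkpoint every step
    report_to="wandb",          # log metrics to WandB
)

trainer = GRPOTrainer(
    model="Qwen/Qwen2-0.5B-Instruct",
    reward_funcs=reward_passes_tests,  # reward sketch from the Architecture section
    args=training_args,
    train_dataset=dataset,
)
trainer.train()
```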
### Serving the Model
Deploy the inference endpoint:
```bash
modal deploy grpo_trl.py
# The serve function will be available at:
# https://modaic-ai--grpo-demo-serve.modal.run
```
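Because vLLM exposes an OpenAI-compatible API, you can also query the deployed endpoint directly with the `openai` client. A sketch; substitute your own deployment URL and served checkpoint name:
```python
from openai import OpenAI

# Point the client at the Modal-hosted vLLM server (OpenAI-compatible API).
client = OpenAI(
    base_url="https://modaic-ai--grpo-demo-serve.modal.run/v1",  # your deployment URL
    api_key="EMPTY",  # vLLM accepts any key unless one is configured
)

response = client.chat.completions.create(
    model="/models/checkpoint-5",  # served checkpoint path (assumed)
    messages=[{"role": "user", "content": "Write a function that reverses a string."}],
    max_tokens=512,
)
print(response.choices[0].message.content)
```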
### Using the Model with Modaic
#### Option 1: PrecompiledProgram (Code Bundling)
Create a custom program that uses your trained model:
```python
import dspy
from modaic import PrecompiledProgram, PrecompiledConfig
class CodeGeneratorConfig(PrecompiledConfig):
    model: str = "openai//models/checkpoint-5"
    api_base: str = "https://your-modal-url.modal.run/v1"
    max_tokens: int = 10000
    temperature: float = 0.7

class CodeGeneration(dspy.Signature):
    query: str = dspy.InputField(desc="The query to generate code for.")
    code: str = dspy.OutputField(desc="The code to generate as a python function.")

class CodeGenerator(PrecompiledProgram):
    config: CodeGeneratorConfig

    def __init__(self, config: CodeGeneratorConfig, **kwargs):
        super().__init__(config=config, **kwargs)
        modal_lm = dspy.LM(
            model=config.model,
            api_base=config.api_base,
            max_tokens=config.max_tokens,
            temperature=config.temperature,
        )
        self.answer_question = dspy.Predict(CodeGeneration)
        self.answer_question.set_lm(modal_lm)

    def forward(self, query):
        return self.answer_question(query=query)
# Use the program
code_generator = CodeGenerator(CodeGeneratorConfig())
result = code_generator(query="Write a function that reverses a string.")
print(result.code)
# Push to Modaic Hub with code bundling
code_generator.push_to_hub(
"your-entity/code-generator-grpo",
with_code=True, # Bundle entire program code
tag="v1.0.0"
)
```
#### Option 2: AutoProgram (Loading from Hub)
After pushing your program with `with_code=True`, load it anywhere:
```python
from modaic import AutoProgram
# Load the entire program including code and dependencies
code_generator = AutoProgram.from_precompiled(
"your-entity/code-generator-grpo",
rev="v1.0.0" # Optional: specify version tag, branch, or commit
)
# Use immediately
result = code_generator(query="Write a function to calculate fibonacci numbers.")
print(result.code)
```
#### Option 3: Load with Custom Config
Override configuration at runtime:
```python
from modaic import AutoProgram, AutoConfig
# Load config and override parameters
config = AutoConfig.from_precompiled("your-entity/code-generator-grpo")
custom_config = config.model_copy(update={"temperature": 0.5})
# Load program with custom config
code_generator = AutoProgram.from_precompiled(
"your-entity/code-generator-grpo",
config=custom_config
)
```
#### Option 4: Version Management
Modaic Hub supports Git-like versioning:
```python
# Load latest from main branch
program = AutoProgram.from_precompiled("entity/program")
# Load specific version
program = AutoProgram.from_precompiled("entity/program", rev="v2.0.0")
# Load from development branch
program = AutoProgram.from_precompiled("entity/program", rev="dev")
# Load specific commit
program = AutoProgram.from_precompiled("entity/program", rev="a3b2c1d")
```
## Project Structure
```
grpo/
├── grpo_trl.py # Modal training & serving infrastructure
├── main.py # Modaic inference client example
├── pyproject.toml # Python dependencies
├── .env # Environment variables (create this)
└── README.md # This file
```
## Key Configuration Options
### Training (`grpo_trl.py`)
- **Line 78**: `dataset.select(range(128))` - Number of training examples
- **Line 85**: `max_steps=5` - Training iterations
- **Line 84**: `save_steps=1` - Checkpoint frequency
- **Line 88**: `model="Qwen/Qwen2-0.5B-Instruct"` - Base model
- **Line 97**: `gpu="H100"` - GPU type for training
### Serving (`grpo_trl.py`)
- **Line 134**: `gpu="H100"` - GPU type for inference
- **Line 135**: `scaledown_window=15 * 60` - Idle timeout
- **Line 140**: `max_inputs=32` - Concurrent request limit
### Model Endpoint (`main.py`)
- **Line 5**: `model` - Checkpoint identifier
- **Line 6**: `api_base` - Modal endpoint URL
- **Line 7**: `max_tokens` - Maximum generation length
- **Line 8**: `temperature` - Sampling temperature
## Monitoring
Training metrics are logged to WandB automatically, including:
- Training loss
- Reward distribution
- Generated code examples
- Step timing and throughput
## Development
### Local Testing
You can test the reward function's code-assembly helper locally:
```python
from grpo_trl import get_generated_code_and_test_cases
code = """
def add(a, b):
    return a + b
"""
test_cases = ["assert add(2, 3) == 5", "assert add(-1, 1) == 0"]
full_code = get_generated_code_and_test_cases(code, test_cases)
print(full_code)
```
### Customizing the Dataset
Replace the dataset in `grpo_trl.py:71-77`:
```python
dataset = load_dataset("your-username/your-dataset", split="train")
dataset = dataset.rename_column("your_prompt_column", "prompt")
dataset = dataset.rename_column("your_testcase_column", "testcases")
```
### Extending the Reward Function
Modify `compute_reward()` in `grpo_trl.py:48-63` to implement custom reward logic, for example (see the sketch after this list):
- Code style scoring
- Performance benchmarks
- Multi-metric rewards (correctness + efficiency)
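For instance, a multi-metric reward could blend test correctness with a mild brevity score. A hypothetical sketch that reuses the `reward_passes_tests` sketch from the Architecture section; the weights and length threshold are illustrative, not values from this repo:
```python
def compute_reward(completions, testcases, **kwargs):
    """Hypothetical multi-metric reward: test correctness plus a mild brevity bonus."""
    correctness = reward_passes_tests(completions, testcases)              # execution-based reward (earlier sketch)
    brevity = [max(0.0, 1.0 - len(code) / 4000) for code in completions]   # penalize very long generations
    return [0.9 * c + 0.1 * b for c, b in zip(correctness, brevity)]
```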
## Troubleshooting
**Modal deployment fails:**
- Ensure Modal token is configured: `modal token new`
- Check WandB secret exists: `modal secret list`
**Training timeout:**
- Increase `timeout` in `@app.function` decorator (line 98)
- Reduce dataset size or max_steps
**Inference errors:**
- Verify Modal endpoint URL matches `api_base` in config
- Check that checkpoints exist in Modal volume
- Ensure vLLM service is running: `modal app logs grpo-demo`
**Modaic Hub push fails:**
- Set `MODAIC_API_KEY` environment variable
- Verify entity/repo name format: `"entity-name/repo-name"`
- Check you have write permissions to the repository
## License
MIT
## Resources
- [Modal Documentation](https://modal.com/docs)
- [TRL GRPO Guide](https://huggingface.co/docs/trl)
- [Modaic Documentation](https://docs.modaic.dev)
- [Modaic AutoProgram Guide](https://docs.modaic.dev/modaic/guides/programs/auto_program)
- [Modaic PrecompiledProgram Guide](https://docs.modaic.dev/modaic/guides/programs/precompiled_program)
- [DSPy Documentation](https://dspy-docs.vercel.app)
- [OpenCoder Dataset](https://huggingface.co/datasets/OpenCoder-LLM/opc-sft-stage2)