GRPO Code Generation with Modal and Modaic
A production-ready implementation of Group Relative Policy Optimization (GRPO) for training code generation models to use within DSPy programs on Modal, with seamless integration into Modaic Hub for deployment and inference.
Overview
This project trains a code generation model using GRPO reinforcement learning, where the reward signal comes from actual test case execution. The trained model is deployed as a vLLM endpoint on Modal and can be used via Modaic's PrecompiledProgram and AutoProgram interfaces.
Key Features:
- Test-driven training: Models learn to generate code that passes test cases
- Scalable infrastructure: Training and inference on Modal with H100 GPUs
- Fast inference: vLLM-powered serving with automatic scaling
- Modaic integration: Push trained programs to Modaic Hub for easy reuse
- Experiment tracking: WandB integration for monitoring training metrics
Dataset Attribution
This project uses the OpenCoder-LLM/opc-sft-stage2 dataset from Hugging Face, specifically the educational_instruct split. This dataset contains programming problems with test cases, making it ideal for reward-based training.
Citation:
@misc{opencoder2024,
title={OpenCoder-LLM SFT Stage 2 Dataset},
author={OpenCoder-LLM Team},
year={2024},
publisher={HuggingFace},
howpublished={\url{https://huggingface.co/datasets/OpenCoder-LLM/opc-sft-stage2}}
}
Architecture
Training Pipeline (grpo_trl.py)
-
Reward Function: Executes generated code with test cases in sandboxed Modal environments
- Code that passes all tests receives reward = 1
- Failed code receives reward = 0
- Timeout protection (30s per execution)
-
GRPO Training: Uses TRL's
GRPOTrainerwith the Qwen2-0.5B-Instruct base model- Trains on H100 GPU
- Saves checkpoints to Modal Volume
- Tracks metrics via WandB
-
Serving: Deploys the latest checkpoint via vLLM
- Auto-scaling with 15-minute scaledown window
- Handles up to 32 concurrent requests per replica
- OpenAI-compatible API endpoint
Inference Client (main.py)
Uses DSPy and Modaic to create a structured interface to the trained model, then pushes the program to Modaic Hub for easy loading and reuse.
Installation
# Install dependencies
uv pip install -e .
# Or with pip
pip install -e .
Setup
1. Configure Modal
# Install Modal CLI
pip install modal
# Authenticate
modal token new
2. Set up WandB Secret
# Create WandB secret in Modal
modal secret create wandb-secret WANDB_API_KEY=your_wandb_api_key
3. Environment Variables
For using the trained model with Modaic, optionally create a .env file:
MODAIC_API_KEY=your_modaic_api_key # If pushing to Modaic Hub
Usage
Training a Model
Deploy the training job to Modal:
# Deploy the app
modal deploy grpo_trl.py
# Run training
modal run grpo_trl.py::train
Training will:
- Load the OpenCoder dataset
- Train for 5 steps (configurable in
grpo_trl.py:85) - Save checkpoints every step
- Log metrics to WandB
Serving the Model
Deploy the inference endpoint:
modal deploy grpo_trl.py
# The serve function will be available at:
# https://modaic-ai--grpo-demo-serve.modal.run
Using the Model with Modaic
Option 1: PrecompiledProgram (Code Bundling)
Create a custom program that uses your trained model:
import dspy
from modaic import PrecompiledProgram, PrecompiledConfig
class CodeGeneratorConfig(PrecompiledConfig):
model: str = "openai//models/checkpoint-5"
api_base: str = "https://your-modal-url.modal.run/v1"
max_tokens: int = 10000
temperature: float = 0.7
class CodeGeneration(dspy.Signature):
query: str = dspy.InputField(desc="The query to generate code for.")
code: str = dspy.OutputField(desc="The code to generate as a python function.")
class CodeGenerator(PrecompiledProgram):
config: CodeGeneratorConfig
def __init__(self, config: CodeGeneratorConfig, **kwargs):
super().__init__(config=config, **kwargs)
modal_lm = dspy.LM(
model=config.model,
api_base=config.api_base,
max_tokens=config.max_tokens,
temperature=config.temperature,
)
self.answer_question = dspy.Predict(CodeGeneration)
self.answer_question.set_lm(modal_lm)
def forward(self, query):
return self.answer_question(query=query)
# Use the program
code_generator = CodeGenerator(CodeGeneratorConfig())
result = code_generator(query="Write a function that reverses a string.")
print(result.code)
# Push to Modaic Hub with code bundling
code_generator.push_to_hub(
"your-entity/code-generator-grpo",
with_code=True, # Bundle entire program code
tag="v1.0.0"
)
Option 2: AutoProgram (Loading from Hub)
After pushing your program with with_code=True, load it anywhere:
from modaic import AutoProgram
# Load the entire program including code and dependencies
code_generator = AutoProgram.from_precompiled(
"your-entity/code-generator-grpo",
rev="v1.0.0" # Optional: specify version tag, branch, or commit
)
# Use immediately
result = code_generator(query="Write a function to calculate fibonacci numbers.")
print(result.code)
Option 3: Load with Custom Config
Override configuration at runtime:
from modaic import AutoProgram, AutoConfig
# Load config and override parameters
config = AutoConfig.from_precompiled("your-entity/code-generator-grpo")
custom_config = config.model_copy(update={"temperature": 0.5})
# Load program with custom config
code_generator = AutoProgram.from_precompiled(
"your-entity/code-generator-grpo",
config=custom_config
)
Option 4: Version Management
Modaic Hub supports Git-like versioning:
# Load latest from main branch
program = AutoProgram.from_precompiled("entity/program")
# Load specific version
program = AutoProgram.from_precompiled("entity/program", rev="v2.0.0")
# Load from development branch
program = AutoProgram.from_precompiled("entity/program", rev="dev")
# Load specific commit
program = AutoProgram.from_precompiled("entity/program", rev="a3b2c1d")
Project Structure
grpo/
├── grpo_trl.py # Modal training & serving infrastructure
├── main.py # Modaic inference client example
├── pyproject.toml # Python dependencies
├── .env # Environment variables (create this)
└── README.md # This file
Key Configuration Options
Training (grpo_trl.py)
- Line 78:
dataset.select(range(128))- Number of training examples - Line 85:
max_steps=5- Training iterations - Line 84:
save_steps=1- Checkpoint frequency - Line 88:
model="Qwen/Qwen2-0.5B-Instruct"- Base model - Line 97:
gpu="H100"- GPU type for training
Serving (grpo_trl.py)
- Line 134:
gpu="H100"- GPU type for inference - Line 135:
scaledown_window=15 * 60- Idle timeout - Line 140:
max_inputs=32- Concurrent request limit
Model Endpoint (main.py)
- Line 5:
model- Checkpoint identifier - Line 6:
api_base- Modal endpoint URL - Line 7:
max_tokens- Maximum generation length - Line 8:
temperature- Sampling temperature
Monitoring
Training metrics are logged to WandB automatically. View:
- Training loss
- Reward distribution
- Generated code examples
- Step timing and throughput
Development
Local Testing
You can test the reward function locally:
from grpo_trl import get_generated_code_and_test_cases
code = """
def add(a, b):
return a + b
"""
test_cases = ["assert add(2, 3) == 5", "assert add(-1, 1) == 0"]
full_code = get_generated_code_and_test_cases(code, test_cases)
print(full_code)
Customizing the Dataset
Replace the dataset in grpo_trl.py:71-77:
dataset = load_dataset("your-username/your-dataset", split="train")
dataset = dataset.rename_column("your_prompt_column", "prompt")
dataset = dataset.rename_column("your_testcase_column", "testcases")
Extending the Reward Function
Modify compute_reward() in grpo_trl.py:48-63 to implement custom reward logic:
- Code style scoring
- Performance benchmarks
- Multi-metric rewards (correctness + efficiency)
Troubleshooting
Modal deployment fails:
- Ensure Modal token is configured:
modal token new - Check WandB secret exists:
modal secret list
Training timeout:
- Increase
timeoutin@app.functiondecorator (line 98) - Reduce dataset size or max_steps
Inference errors:
- Verify Modal endpoint URL matches
api_basein config - Check that checkpoints exist in Modal volume
- Ensure vLLM service is running:
modal app logs grpo-demo
Modaic Hub push fails:
- Set
MODAIC_API_KEYenvironment variable - Verify entity/repo name format:
"entity-name/repo-name" - Check you have write permissions to the repository
License
MIT