GRPO Code Generation with Modal and Modaic

A production-ready implementation of Group Relative Policy Optimization (GRPO) that trains code generation models on Modal for use within DSPy programs, with seamless integration into Modaic Hub for deployment and inference.

Overview

This project trains a code generation model using GRPO reinforcement learning, where the reward signal comes from actual test case execution. The trained model is deployed as a vLLM endpoint on Modal and can be used via Modaic's PrecompiledProgram and AutoProgram interfaces.

Key Features:

  • Test-driven training: Models learn to generate code that passes test cases
  • Scalable infrastructure: Training and inference on Modal with H100 GPUs
  • Fast inference: vLLM-powered serving with automatic scaling
  • Modaic integration: Push trained programs to Modaic Hub for easy reuse
  • Experiment tracking: WandB integration for monitoring training metrics

Dataset Attribution

This project uses the OpenCoder-LLM/opc-sft-stage2 dataset from Hugging Face, specifically the educational_instruct split. This dataset contains programming problems with test cases, making it ideal for reward-based training.
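
For reference, the subset can be loaded directly with the datasets library. A minimal sketch, assuming educational_instruct is exposed as a dataset config on the Hub:

from datasets import load_dataset

# Load the educational_instruct subset used for training
dataset = load_dataset("OpenCoder-LLM/opc-sft-stage2", "educational_instruct", split="train")
print(dataset.column_names)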

Citation:

@misc{opencoder2024,
  title={OpenCoder-LLM SFT Stage 2 Dataset},
  author={OpenCoder-LLM Team},
  year={2024},
  publisher={HuggingFace},
  howpublished={\url{https://huggingface.co/datasets/OpenCoder-LLM/opc-sft-stage2}}
}

Architecture

Training Pipeline (grpo_trl.py)

  1. Reward Function: Executes generated code with test cases in sandboxed Modal environments

    • Code that passes all tests receives reward = 1
    • Failed code receives reward = 0
    • Timeout protection (30s per execution)
  2. GRPO Training: Uses TRL's GRPOTrainer with the Qwen2-0.5B-Instruct base model (a sketch of the reward-to-trainer wiring follows this list)

    • Trains on H100 GPU
    • Saves checkpoints to Modal Volume
    • Tracks metrics via WandB
  3. Serving: Deploys the latest checkpoint via vLLM

    • Auto-scaling with 15-minute scaledown window
    • Handles up to 32 concurrent requests per replica
    • OpenAI-compatible API endpoint
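
A minimal sketch of how the reward function and trainer from steps 1 and 2 fit together in TRL; run_in_sandbox is a hypothetical stand-in for the project's sandboxed Modal execution, and the output path is illustrative:

from trl import GRPOTrainer, GRPOConfig

def binary_test_reward(completions, testcases, **kwargs):
    # TRL passes extra dataset columns (here "testcases") to reward functions by name.
    rewards = []
    for code, tests in zip(completions, testcases):
        passed = run_in_sandbox(code, tests, timeout=30)  # hypothetical helper; 30s timeout per execution
        rewards.append(1.0 if passed else 0.0)  # passes all tests -> 1, otherwise 0
    return rewards

trainer = GRPOTrainer(
    model="Qwen/Qwen2-0.5B-Instruct",
    reward_funcs=binary_test_reward,
    args=GRPOConfig(output_dir="/checkpoints", max_steps=5, save_steps=1, report_to="wandb"),
    train_dataset=dataset,  # expects a "prompt" column
)
trainer.train()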

Inference Client (main.py)

Uses DSPy and Modaic to create a structured interface to the trained model, then pushes the program to Modaic Hub for easy loading and reuse.

Installation

# Install dependencies
uv pip install -e .

# Or with pip
pip install -e .

Setup

1. Configure Modal

# Install Modal CLI
pip install modal

# Authenticate
modal token new

2. Set up WandB Secret

# Create WandB secret in Modal
modal secret create wandb-secret WANDB_API_KEY=your_wandb_api_key

3. Environment Variables

To use the trained model with Modaic, optionally create a .env file:

MODAIC_API_KEY=your_modaic_api_key  # If pushing to Modaic Hub
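
If you rely on a .env file, load it before constructing the program. A minimal sketch, assuming the python-dotenv package:

from dotenv import load_dotenv

# Reads MODAIC_API_KEY (and anything else in .env) into the environment
load_dotenv()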

Usage

Training a Model

Deploy the training job to Modal:

# Deploy the app
modal deploy grpo_trl.py

# Run training
modal run grpo_trl.py::train

Training will:

  • Load the OpenCoder dataset
  • Train for 5 steps (configurable in grpo_trl.py:85)
  • Save checkpoints every step
  • Log metrics to WandB

Serving the Model

Deploy the inference endpoint:

modal deploy grpo_trl.py

# The serve function will be available at:
# https://modaic-ai--grpo-demo-serve.modal.run
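
Because the endpoint is OpenAI-compatible, you can smoke-test it with the openai client before wiring up DSPy. A sketch: the model path mirrors the checkpoint identifier used in the Modaic config below, and the placeholder API key assumes no auth is configured on the endpoint:

from openai import OpenAI

client = OpenAI(
    base_url="https://modaic-ai--grpo-demo-serve.modal.run/v1",
    api_key="EMPTY",  # placeholder; assumed unauthenticated endpoint
)
response = client.chat.completions.create(
    model="/models/checkpoint-5",  # vLLM serves the checkpoint under this path
    messages=[{"role": "user", "content": "Write a function that reverses a string."}],
)
print(response.choices[0].message.content)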

Using the Model with Modaic

Option 1: PrecompiledProgram (Code Bundling)

Create a custom program that uses your trained model:

import dspy
from modaic import PrecompiledProgram, PrecompiledConfig

class CodeGeneratorConfig(PrecompiledConfig):
    model: str = "openai//models/checkpoint-5"
    api_base: str = "https://your-modal-url.modal.run/v1"
    max_tokens: int = 10000
    temperature: float = 0.7

class CodeGeneration(dspy.Signature):
    query: str = dspy.InputField(desc="The query to generate code for.")
    code: str = dspy.OutputField(desc="The code to generate as a python function.")

class CodeGenerator(PrecompiledProgram):
    config: CodeGeneratorConfig

    def __init__(self, config: CodeGeneratorConfig, **kwargs):
        super().__init__(config=config, **kwargs)

        modal_lm = dspy.LM(
            model=config.model,
            api_base=config.api_base,
            max_tokens=config.max_tokens,
            temperature=config.temperature,
        )
        self.answer_question = dspy.Predict(CodeGeneration)
        self.answer_question.set_lm(modal_lm)

    def forward(self, query):
        return self.answer_question(query=query)

# Use the program
code_generator = CodeGenerator(CodeGeneratorConfig())
result = code_generator(query="Write a function that reverses a string.")
print(result.code)

# Push to Modaic Hub with code bundling
code_generator.push_to_hub(
    "your-entity/code-generator-grpo",
    with_code=True,  # Bundle entire program code
    tag="v1.0.0"
)

Option 2: AutoProgram (Loading from Hub)

After pushing your program with with_code=True, load it anywhere:

from modaic import AutoProgram

# Load the entire program including code and dependencies
code_generator = AutoProgram.from_precompiled(
    "your-entity/code-generator-grpo",
    rev="v1.0.0"  # Optional: specify version tag, branch, or commit
)

# Use immediately
result = code_generator(query="Write a function to calculate fibonacci numbers.")
print(result.code)

Option 3: Load with Custom Config

Override configuration at runtime:

from modaic import AutoProgram, AutoConfig

# Load config and override parameters
config = AutoConfig.from_precompiled("your-entity/code-generator-grpo")
custom_config = config.model_copy(update={"temperature": 0.5})

# Load program with custom config
code_generator = AutoProgram.from_precompiled(
    "your-entity/code-generator-grpo",
    config=custom_config
)

Option 4: Version Management

Modaic Hub supports Git-like versioning:

# Load latest from main branch
program = AutoProgram.from_precompiled("entity/program")

# Load specific version
program = AutoProgram.from_precompiled("entity/program", rev="v2.0.0")

# Load from development branch
program = AutoProgram.from_precompiled("entity/program", rev="dev")

# Load specific commit
program = AutoProgram.from_precompiled("entity/program", rev="a3b2c1d")

Project Structure

grpo/
├── grpo_trl.py          # Modal training & serving infrastructure
├── main.py              # Modaic inference client example
├── pyproject.toml       # Python dependencies
├── .env                 # Environment variables (create this)
└── README.md            # This file

Key Configuration Options

Training (grpo_trl.py)

  • Line 78: dataset.select(range(128)) - Number of training examples
  • Line 84: save_steps=1 - Checkpoint frequency
  • Line 85: max_steps=5 - Training iterations
  • Line 88: model="Qwen/Qwen2-0.5B-Instruct" - Base model
  • Line 97: gpu="H100" - GPU type for training

Serving (grpo_trl.py)

  • Line 134: gpu="H100" - GPU type for inference
  • Line 135: scaledown_window=15 * 60 - Idle timeout
  • Line 140: max_inputs=32 - Concurrent request limit
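
These options correspond roughly to the Modal decorators on the serve function. A sketch, not the exact code:

import modal

app = modal.App("grpo-demo")

@app.function(gpu="H100", scaledown_window=15 * 60)  # idle replicas shut down after 15 minutes
@modal.concurrent(max_inputs=32)  # each replica handles up to 32 requests at once
def serve():
    ...  # start vLLM and expose the OpenAI-compatible API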

Model Endpoint (main.py)

  • Line 5: model - Checkpoint identifier
  • Line 6: api_base - Modal endpoint URL
  • Line 7: max_tokens - Maximum generation length
  • Line 8: temperature - Sampling temperature
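
Those lines presumably look something like the following sketch, with values mirrored from the Option 1 config above:

import dspy

lm = dspy.LM(
    model="openai//models/checkpoint-5",                         # checkpoint identifier
    api_base="https://modaic-ai--grpo-demo-serve.modal.run/v1",  # Modal endpoint URL
    max_tokens=10000,
    temperature=0.7,
)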

Monitoring

Training metrics are logged to WandB automatically, including:

  • Training loss
  • Reward distribution
  • Generated code examples
  • Step timing and throughput

Development

Local Testing

You can test the reward function locally:

from grpo_trl import get_generated_code_and_test_cases

code = """
def add(a, b):
    return a + b
"""
test_cases = ["assert add(2, 3) == 5", "assert add(-1, 1) == 0"]
full_code = get_generated_code_and_test_cases(code, test_cases)
print(full_code)

Customizing the Dataset

Replace the dataset in grpo_trl.py:71-77:

dataset = load_dataset("your-username/your-dataset", split="train")
dataset = dataset.rename_column("your_prompt_column", "prompt")
dataset = dataset.rename_column("your_testcase_column", "testcases")

Extending the Reward Function

Modify compute_reward() in grpo_trl.py:48-63 to implement custom reward logic (a sketch follows the list below):

  • Code style scoring
  • Performance benchmarks
  • Multi-metric rewards (correctness + efficiency)
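
For example, a multi-metric variant might blend correctness with a simple style signal. A sketch: passes_tests is a hypothetical helper and the weights are illustrative:

def compute_reward(code: str, test_cases: list[str]) -> float:
    correctness = 1.0 if passes_tests(code, test_cases) else 0.0  # hypothetical helper
    style = 0.1 if '"""' in code else 0.0  # small bonus for documented functions
    return correctness + style * correctness  # style only counts for passing code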

Troubleshooting

Modal deployment fails:

  • Ensure Modal token is configured: modal token new
  • Check WandB secret exists: modal secret list

Training timeout:

  • Increase timeout in @app.function decorator (line 98)
  • Reduce dataset size or max_steps

Inference errors:

  • Verify Modal endpoint URL matches api_base in config
  • Check that checkpoints exist in Modal volume
  • Ensure vLLM service is running: modal app logs grpo-demo

Modaic Hub push fails:

  • Set MODAIC_API_KEY environment variable
  • Verify entity/repo name format: "entity-name/repo-name"
  • Check you have write permissions to the repository

License

MIT
