code-generator-trl-grpo/README.md

# GRPO Code Generation with Modal and Modaic

A production-ready implementation of Group Relative Policy Optimization (GRPO) for training code generation models to use within DSPy programs on Modal, with seamless integration into Modaic Hub for deployment and inference.

## Overview

This project trains a code generation model using GRPO reinforcement learning, where the reward signal comes from actual test case execution. The trained model is deployed as a vLLM endpoint on Modal and can be used via Modaic's `PrecompiledProgram` and `AutoProgram` interfaces.

**Key Features:**
- **Test-driven training**: Models learn to generate code that passes test cases
- **Scalable infrastructure**: Training and inference on Modal with H100 GPUs
- **Fast inference**: vLLM-powered serving with automatic scaling
- **Modaic integration**: Push trained programs to Modaic Hub for easy reuse
- **Experiment tracking**: WandB integration for monitoring training metrics

## Dataset Attribution

This project uses the [OpenCoder-LLM/opc-sft-stage2](https://huggingface.co/datasets/OpenCoder-LLM/opc-sft-stage2) dataset from Hugging Face, specifically the `educational_instruct` split. This dataset contains programming problems with test cases, making it ideal for reward-based training.

**Citation:**
```
@misc{opencoder2024,
  title={OpenCoder-LLM SFT Stage 2 Dataset},
  author={OpenCoder-LLM Team},
  year={2024},
  publisher={HuggingFace},
  howpublished={\url{https://huggingface.co/datasets/OpenCoder-LLM/opc-sft-stage2}}
}
```

## Architecture

### Training Pipeline (`grpo_trl.py`)

1. **Reward Function**: Executes generated code with test cases in sandboxed Modal environments
   - Code that passes all tests receives reward = 1
   - Failed code receives reward = 0
   - Timeout protection (30s per execution)

2. **GRPO Training**: Uses TRL's `GRPOTrainer` with the Qwen2-0.5B-Instruct base model
   - Trains on H100 GPU
   - Saves checkpoints to Modal Volume
   - Tracks metrics via WandB

3. **Serving**: Deploys the latest checkpoint via vLLM
   - Auto-scaling with 15-minute scaledown window
   - Handles up to 32 concurrent requests per replica
   - OpenAI-compatible API endpoint

### Inference Client (`main.py`)

Uses DSPy and Modaic to create a structured interface to the trained model, then pushes the program to Modaic Hub for easy loading and reuse.

## Installation

```bash
# Install dependencies
uv pip install -e .

# Or with pip
pip install -e .
```

## Setup

### 1. Configure Modal

```bash
# Install Modal CLI
pip install modal

# Authenticate
modal token new
```

### 2. Set up WandB Secret

```bash
# Create WandB secret in Modal
modal secret create wandb-secret WANDB_API_KEY=your_wandb_api_key
```

### 3. Environment Variables

For using the trained model with Modaic, optionally create a `.env` file:

```env
MODAIC_API_KEY=your_modaic_api_key  # If pushing to Modaic Hub
```

## Usage

### Training a Model

Deploy the training job to Modal:

```bash
# Deploy the app
modal deploy grpo_trl.py

# Run training
modal run grpo_trl.py::train
```

Training will:
- Load the OpenCoder dataset
- Train for 5 steps (configurable in `grpo_trl.py:85`)
- Save checkpoints every step
- Log metrics to WandB

### Serving the Model

Deploy the inference endpoint:

```bash
modal deploy grpo_trl.py

# The serve function will be available at:
# https://modaic-ai--grpo-demo-serve.modal.run
```

### Using the Model with Modaic

#### Option 1: PrecompiledProgram (Code Bundling)

Create a custom program that uses your trained model:

```python
import dspy
from modaic import PrecompiledProgram, PrecompiledConfig

class CodeGeneratorConfig(PrecompiledConfig):
    model: str = "openai//models/checkpoint-5"
    api_base: str = "https://your-modal-url.modal.run/v1"
    max_tokens: int = 10000
    temperature: float = 0.7

class CodeGeneration(dspy.Signature):
    query: str = dspy.InputField(desc="The query to generate code for.")
    code: str = dspy.OutputField(desc="The code to generate as a python function.")

class CodeGenerator(PrecompiledProgram):
    config: CodeGeneratorConfig

    def __init__(self, config: CodeGeneratorConfig, **kwargs):
        super().__init__(config=config, **kwargs)

        modal_lm = dspy.LM(
            model=config.model,
            api_base=config.api_base,
            max_tokens=config.max_tokens,
            temperature=config.temperature,
        )
        self.answer_question = dspy.Predict(CodeGeneration)
        self.answer_question.set_lm(modal_lm)

    def forward(self, query):
        return self.answer_question(query=query)

# Use the program
code_generator = CodeGenerator(CodeGeneratorConfig())
result = code_generator(query="Write a function that reverses a string.")
print(result.code)

# Push to Modaic Hub with code bundling
code_generator.push_to_hub(
    "your-entity/code-generator-grpo",
    with_code=True,  # Bundle entire program code
    tag="v1.0.0"
)
```

#### Option 2: AutoProgram (Loading from Hub)

After pushing your program with `with_code=True`, load it anywhere:

```python
from modaic import AutoProgram

# Load the entire program including code and dependencies
code_generator = AutoProgram.from_precompiled(
    "your-entity/code-generator-grpo",
    rev="v1.0.0"  # Optional: specify version tag, branch, or commit
)

# Use immediately
result = code_generator(query="Write a function to calculate fibonacci numbers.")
print(result.code)
```

#### Option 3: Load with Custom Config

Override configuration at runtime:

```python
from modaic import AutoProgram, AutoConfig

# Load config and override parameters
config = AutoConfig.from_precompiled("your-entity/code-generator-grpo")
custom_config = config.model_copy(update={"temperature": 0.5})

# Load program with custom config
code_generator = AutoProgram.from_precompiled(
    "your-entity/code-generator-grpo",
    config=custom_config
)
```

#### Option 4: Version Management

Modaic Hub supports Git-like versioning:

```python
# Load latest from main branch
program = AutoProgram.from_precompiled("entity/program")

# Load specific version
program = AutoProgram.from_precompiled("entity/program", rev="v2.0.0")

# Load from development branch
program = AutoProgram.from_precompiled("entity/program", rev="dev")

# Load specific commit
program = AutoProgram.from_precompiled("entity/program", rev="a3b2c1d")
```

## Project Structure

```
grpo/
├── grpo_trl.py          # Modal training & serving infrastructure
├── main.py              # Modaic inference client example
├── pyproject.toml       # Python dependencies
├── .env                 # Environment variables (create this)
└── README.md            # This file
```

## Key Configuration Options

### Training (`grpo_trl.py`)

- **Line 78**: `dataset.select(range(128))` - Number of training examples
- **Line 85**: `max_steps=5` - Training iterations
- **Line 84**: `save_steps=1` - Checkpoint frequency
- **Line 88**: `model="Qwen/Qwen2-0.5B-Instruct"` - Base model
- **Line 97**: `gpu="H100"` - GPU type for training

### Serving (`grpo_trl.py`)

- **Line 134**: `gpu="H100"` - GPU type for inference
- **Line 135**: `scaledown_window=15 * 60` - Idle timeout
- **Line 140**: `max_inputs=32` - Concurrent request limit

### Model Endpoint (`main.py`)

- **Line 5**: `model` - Checkpoint identifier
- **Line 6**: `api_base` - Modal endpoint URL
- **Line 7**: `max_tokens` - Maximum generation length
- **Line 8**: `temperature` - Sampling temperature

## Monitoring

Training metrics are logged to WandB automatically. View:
- Training loss
- Reward distribution
- Generated code examples
- Step timing and throughput

## Development

### Local Testing

You can test the reward function locally:

```python
from grpo_trl import get_generated_code_and_test_cases

code = """
def add(a, b):
    return a + b
"""
test_cases = ["assert add(2, 3) == 5", "assert add(-1, 1) == 0"]
full_code = get_generated_code_and_test_cases(code, test_cases)
print(full_code)
```

### Customizing the Dataset

Replace the dataset in `grpo_trl.py:71-77`:

```python
dataset = load_dataset("your-username/your-dataset", split="train")
dataset = dataset.rename_column("your_prompt_column", "prompt")
dataset = dataset.rename_column("your_testcase_column", "testcases")
```

### Extending the Reward Function

Modify `compute_reward()` in `grpo_trl.py:48-63` to implement custom reward logic:
- Code style scoring
- Performance benchmarks
- Multi-metric rewards (correctness + efficiency)

## Troubleshooting

**Modal deployment fails:**
- Ensure Modal token is configured: `modal token new`
- Check WandB secret exists: `modal secret list`

**Training timeout:**
- Increase `timeout` in `@app.function` decorator (line 98)
- Reduce dataset size or max_steps

**Inference errors:**
- Verify Modal endpoint URL matches `api_base` in config
- Check that checkpoints exist in Modal volume
- Ensure vLLM service is running: `modal app logs grpo-demo`

**Modaic Hub push fails:**
- Set `MODAIC_API_KEY` environment variable
- Verify entity/repo name format: `"entity-name/repo-name"`
- Check you have write permissions to the repository

## License

MIT

## Resources

- [Modal Documentation](https://modal.com/docs)
- [TRL GRPO Guide](https://huggingface.co/docs/trl)
- [Modaic Documentation](https://docs.modaic.dev)
- [Modaic AutoProgram Guide](https://docs.modaic.dev/modaic/guides/programs/auto_program)
- [Modaic PrecompiledProgram Guide](https://docs.modaic.dev/modaic/guides/programs/precompiled_program)
- [DSPy Documentation](https://dspy-docs.vercel.app)
- [OpenCoder Dataset](https://huggingface.co/datasets/OpenCoder-LLM/opc-sft-stage2)