# GRPO Code Generation with Modal and Modaic
A production-ready implementation of Group Relative Policy Optimization (GRPO) that trains code generation models on Modal for use within DSPy programs, with seamless integration into Modaic Hub for deployment and inference.
## Overview
This project trains a code generation model using GRPO reinforcement learning, where the reward signal comes from actual test case execution. The trained model is deployed as a vLLM endpoint on Modal and can be used via Modaic's `PrecompiledProgram` and `AutoProgram` interfaces.
**Key Features:**
- **Test-driven training**: Models learn to generate code that passes test cases
- **Scalable infrastructure**: Training and inference on Modal with H100 GPUs
- **Fast inference**: vLLM-powered serving with automatic scaling
- **Modaic integration**: Push trained programs to Modaic Hub for easy reuse
- **Experiment tracking**: WandB integration for monitoring training metrics
## Dataset Attribution
This project uses the [OpenCoder-LLM/opc-sft-stage2](https://huggingface.co/datasets/OpenCoder-LLM/opc-sft-stage2) dataset from Hugging Face, specifically the `educational_instruct` split. This dataset contains programming problems with test cases, making it ideal for reward-based training.
**Citation:**
```
@misc{opencoder2024,
  title={OpenCoder-LLM SFT Stage 2 Dataset},
  author={OpenCoder-LLM Team},
  year={2024},
  publisher={HuggingFace},
  howpublished={\url{https://huggingface.co/datasets/OpenCoder-LLM/opc-sft-stage2}}
}
```
## Architecture
### Training Pipeline (`grpo_trl.py`)
1. **Reward Function**: Executes generated code with test cases in sandboxed Modal environments (a minimal sketch appears after this list)
- Code that passes all tests receives reward = 1
- Failed code receives reward = 0
- Timeout protection (30s per execution)
2. **GRPO Training**: Uses TRL's `GRPOTrainer` with the Qwen2-0.5B-Instruct base model
- Trains on H100 GPU
- Saves checkpoints to Modal Volume
- Tracks metrics via WandB
3. **Serving**: Deploys the latest checkpoint via vLLM
- Auto-scaling with 15-minute scaledown window
- Handles up to 32 concurrent requests per replica
- OpenAI-compatible API endpoint
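For reference, here is a minimal local sketch of the reward logic. It assumes string completions and a `testcases` dataset column (TRL passes extra dataset columns to reward functions as keyword lists), and uses a local subprocess where the actual `grpo_trl.py` implementation runs the code in sandboxed Modal environments:
```python
import subprocess
import sys

def reward_passes_tests(completions, testcases, **kwargs):
    """Return 1.0 for completions whose code passes every test case, else 0.0 (illustrative sketch)."""
    rewards = []
    for code, tests in zip(completions, testcases):
        program = code + "\n" + "\n".join(tests)  # append assert-style test cases to the generated code
        try:
            result = subprocess.run(
                [sys.executable, "-c", program],
                capture_output=True,
                timeout=30,  # mirror the 30s execution timeout
            )
            rewards.append(1.0 if result.returncode == 0 else 0.0)
        except subprocess.TimeoutExpired:
            rewards.append(0.0)
    return rewards
```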
### Inference Client (`main.py`)
Uses DSPy and Modaic to create a structured interface to the trained model, then pushes the program to Modaic Hub for easy loading and reuse.
## Installation
```bash
# Install dependencies
uv pip install -e .
# Or with pip
pip install -e .
```
## Setup
### 1. Configure Modal
```bash
# Install Modal CLI
pip install modal
# Authenticate
modal token new
```
### 2. Set up WandB Secret
```bash
# Create WandB secret in Modal
modal secret create wandb-secret WANDB_API_KEY=your_wandb_api_key
```
### 3. Environment Variables
To use the trained model with Modaic, optionally create a `.env` file:
```env
MODAIC_API_KEY=your_modaic_api_key # If pushing to Modaic Hub
```
## Usage
### Training a Model
Deploy the training job to Modal:
```bash
# Deploy the app
modal deploy grpo_trl.py
# Run training
modal run grpo_trl.py::train
```
Training will:
- Load the OpenCoder dataset
- Train for 5 steps (configurable in `grpo_trl.py:85`)
- Save checkpoints every step
- Log metrics to WandB
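Internally, this follows the standard TRL GRPO pattern. A minimal sketch, assuming `educational_instruct` is a named configuration of the dataset, the prompt/test-case columns have been renamed as in `grpo_trl.py`, and the reward function sketched in the Architecture section is in scope (the values mirror the defaults listed above; the exact code lives in `grpo_trl.py`):
```python
from datasets import load_dataset
from trl import GRPOConfig, GRPOTrainer

# Load the educational_instruct subset and keep a small slice for the demo run.
dataset = load_dataset("OpenCoder-LLM/opc-sft-stage2", "educational_instruct", split="train")
dataset = dataset.select(range(128))

training_args = GRPOConfig(
    output_dir="/checkpoints",  # assumed Modal Volume mount path
    max_steps=5,                # training iterations
    save_steps=1,               # checkpoint every step
    report_to="wandb",          # log metrics to WandB
)

trainer = GRPOTrainer(
    model="Qwen/Qwen2-0.5B-Instruct",
    reward_funcs=reward_passes_tests,  # reward sketch from the Architecture section
    args=training_args,
    train_dataset=dataset,
)
trainer.train()
```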
### Serving the Model
Deploy the inference endpoint:
```bash
modal deploy grpo_trl.py
# The serve function will be available at:
# https://modaic-ai--grpo-demo-serve.modal.run
```
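Because vLLM exposes an OpenAI-compatible API, you can also query the deployed endpoint directly with the `openai` client. A sketch; substitute your own deployment URL and served checkpoint name:
```python
from openai import OpenAI

# Point the client at the Modal-hosted vLLM server (OpenAI-compatible API).
client = OpenAI(
    base_url="https://modaic-ai--grpo-demo-serve.modal.run/v1",  # your deployment URL
    api_key="EMPTY",  # vLLM accepts any key unless one is configured
)

response = client.chat.completions.create(
    model="/models/checkpoint-5",  # served checkpoint path (assumed)
    messages=[{"role": "user", "content": "Write a function that reverses a string."}],
    max_tokens=512,
)
print(response.choices[0].message.content)
```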
### Using the Model with Modaic
#### Option 1: PrecompiledProgram (Code Bundling)
Create a custom program that uses your trained model:
```python
import dspy
from modaic import PrecompiledProgram, PrecompiledConfig
class CodeGeneratorConfig(PrecompiledConfig):
    model: str = "openai//models/checkpoint-5"
    api_base: str = "https://your-modal-url.modal.run/v1"
    max_tokens: int = 10000
    temperature: float = 0.7

class CodeGeneration(dspy.Signature):
    query: str = dspy.InputField(desc="The query to generate code for.")
    code: str = dspy.OutputField(desc="The code to generate as a python function.")

class CodeGenerator(PrecompiledProgram):
    config: CodeGeneratorConfig

    def __init__(self, config: CodeGeneratorConfig, **kwargs):
        super().__init__(config=config, **kwargs)
        modal_lm = dspy.LM(
            model=config.model,
            api_base=config.api_base,
            max_tokens=config.max_tokens,
            temperature=config.temperature,
        )
        self.answer_question = dspy.Predict(CodeGeneration)
        self.answer_question.set_lm(modal_lm)

    def forward(self, query):
        return self.answer_question(query=query)
# Use the program
code_generator = CodeGenerator(CodeGeneratorConfig())
result = code_generator(query="Write a function that reverses a string.")
print(result.code)
# Push to Modaic Hub with code bundling
code_generator.push_to_hub(
"your-entity/code-generator-grpo",
with_code=True, # Bundle entire program code
tag="v1.0.0"
)
```
#### Option 2: AutoProgram (Loading from Hub)
After pushing your program with `with_code=True`, load it anywhere:
```python
from modaic import AutoProgram
# Load the entire program including code and dependencies
code_generator = AutoProgram.from_precompiled(
"your-entity/code-generator-grpo",
rev="v1.0.0" # Optional: specify version tag, branch, or commit
)
# Use immediately
result = code_generator(query="Write a function to calculate fibonacci numbers.")
print(result.code)
```
#### Option 3: Load with Custom Config
Override configuration at runtime:
```python
from modaic import AutoProgram, AutoConfig
# Load config and override parameters
config = AutoConfig.from_precompiled("your-entity/code-generator-grpo")
custom_config = config.model_copy(update={"temperature": 0.5})
# Load program with custom config
code_generator = AutoProgram.from_precompiled(
"your-entity/code-generator-grpo",
config=custom_config
)
```
#### Option 4: Version Management
Modaic Hub supports Git-like versioning:
```python
# Load latest from main branch
program = AutoProgram.from_precompiled("entity/program")
# Load specific version
program = AutoProgram.from_precompiled("entity/program", rev="v2.0.0")
# Load from development branch
program = AutoProgram.from_precompiled("entity/program", rev="dev")
# Load specific commit
program = AutoProgram.from_precompiled("entity/program", rev="a3b2c1d")
```
## Project Structure
```
grpo/
├── grpo_trl.py # Modal training & serving infrastructure
├── main.py # Modaic inference client example
├── pyproject.toml # Python dependencies
├── .env # Environment variables (create this)
└── README.md # This file
```
## Key Configuration Options
### Training (`grpo_trl.py`)
- **Line 78**: `dataset.select(range(128))` - Number of training examples
- **Line 85**: `max_steps=5` - Training iterations
- **Line 84**: `save_steps=1` - Checkpoint frequency
- **Line 88**: `model="Qwen/Qwen2-0.5B-Instruct"` - Base model
- **Line 97**: `gpu="H100"` - GPU type for training
### Serving (`grpo_trl.py`)
- **Line 134**: `gpu="H100"` - GPU type for inference
- **Line 135**: `scaledown_window=15 * 60` - Idle timeout
- **Line 140**: `max_inputs=32` - Concurrent request limit
### Model Endpoint (`main.py`)
- **Line 5**: `model` - Checkpoint identifier
- **Line 6**: `api_base` - Modal endpoint URL
- **Line 7**: `max_tokens` - Maximum generation length
- **Line 8**: `temperature` - Sampling temperature
## Monitoring
Training metrics are logged to WandB automatically, including:
- Training loss
- Reward distribution
- Generated code examples
- Step timing and throughput
## Development
### Local Testing
You can test the reward function's code-assembly helper locally:
```python
from grpo_trl import get_generated_code_and_test_cases
code = """
def add(a, b):
    return a + b
"""
test_cases = ["assert add(2, 3) == 5", "assert add(-1, 1) == 0"]
full_code = get_generated_code_and_test_cases(code, test_cases)
print(full_code)
```
### Customizing the Dataset
Replace the dataset in `grpo_trl.py:71-77`:
```python
dataset = load_dataset("your-username/your-dataset", split="train")
dataset = dataset.rename_column("your_prompt_column", "prompt")
dataset = dataset.rename_column("your_testcase_column", "testcases")
```
### Extending the Reward Function
Modify `compute_reward()` in `grpo_trl.py:48-63` to implement custom reward logic, for example (see the sketch after this list):
- Code style scoring
- Performance benchmarks
- Multi-metric rewards (correctness + efficiency)
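For instance, a multi-metric reward could blend test correctness with a mild brevity score. A hypothetical sketch that reuses the `reward_passes_tests` sketch from the Architecture section; the weights and length threshold are illustrative, not values from this repo:
```python
def compute_reward(completions, testcases, **kwargs):
    """Hypothetical multi-metric reward: test correctness plus a mild brevity bonus."""
    correctness = reward_passes_tests(completions, testcases)              # execution-based reward (earlier sketch)
    brevity = [max(0.0, 1.0 - len(code) / 4000) for code in completions]   # penalize very long generations
    return [0.9 * c + 0.1 * b for c, b in zip(correctness, brevity)]
```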
## Troubleshooting
**Modal deployment fails:**
- Ensure Modal token is configured: `modal token new`
- Check WandB secret exists: `modal secret list`
**Training timeout:**
- Increase `timeout` in `@app.function` decorator (line 98)
- Reduce dataset size or max_steps
**Inference errors:**
- Verify Modal endpoint URL matches `api_base` in config
- Check that checkpoints exist in Modal volume
- Ensure vLLM service is running: `modal app logs grpo-demo`
**Modaic Hub push fails:**
- Set `MODAIC_API_KEY` environment variable
- Verify entity/repo name format: `"entity-name/repo-name"`
- Check you have write permissions to the repository
## License
MIT
## Resources
- [Modal Documentation](https://modal.com/docs)
- [TRL GRPO Guide](https://huggingface.co/docs/trl)
- [Modaic Documentation](https://docs.modaic.dev)
- [Modaic AutoProgram Guide](https://docs.modaic.dev/modaic/guides/programs/auto_program)
- [Modaic PrecompiledProgram Guide](https://docs.modaic.dev/modaic/guides/programs/precompiled_program)
- [DSPy Documentation](https://dspy-docs.vercel.app)
- [OpenCoder Dataset](https://huggingface.co/datasets/OpenCoder-LLM/opc-sft-stage2)