diff --git a/README.md b/README.md
index e69de29..98f0d1c 100644
--- a/README.md
+++ b/README.md
@@ -0,0 +1,336 @@
+# GRPO Code Generation with Modal and Modaic
+
+A production-ready implementation of Group Relative Policy Optimization (GRPO) for training code generation models to use within DSPy programs on Modal, with seamless integration into Modaic Hub for deployment and inference.
+
+## Overview
+
+This project trains a code generation model using GRPO reinforcement learning, where the reward signal comes from actual test case execution. The trained model is deployed as a vLLM endpoint on Modal and can be used via Modaic's `PrecompiledProgram` and `AutoProgram` interfaces.
+
+**Key Features:**
+- **Test-driven training**: Models learn to generate code that passes test cases
+- **Scalable infrastructure**: Training and inference on Modal with H100 GPUs
+- **Fast inference**: vLLM-powered serving with automatic scaling
+- **Modaic integration**: Push trained programs to Modaic Hub for easy reuse
+- **Experiment tracking**: WandB integration for monitoring training metrics
+
+## Dataset Attribution
+
+This project uses the [OpenCoder-LLM/opc-sft-stage2](https://huggingface.co/datasets/OpenCoder-LLM/opc-sft-stage2) dataset from Hugging Face, specifically the `educational_instruct` split. This dataset contains programming problems with test cases, making it ideal for reward-based training.
+
+**Citation:**
+```
+@misc{opencoder2024,
+  title={OpenCoder-LLM SFT Stage 2 Dataset},
+  author={OpenCoder-LLM Team},
+  year={2024},
+  publisher={HuggingFace},
+  howpublished={\url{https://huggingface.co/datasets/OpenCoder-LLM/opc-sft-stage2}}
+}
+```
+
+## Architecture
+
+### Training Pipeline (`grpo_trl.py`)
+
+1. **Reward Function**: Executes generated code with test cases in sandboxed Modal environments
+   - Code that passes all tests receives reward = 1
+   - Failed code receives reward = 0
+   - Timeout protection (30s per execution)
+
+2. **GRPO Training**: Uses TRL's `GRPOTrainer` with the Qwen2-0.5B-Instruct base model
+   - Trains on H100 GPU
+   - Saves checkpoints to Modal Volume
+   - Tracks metrics via WandB
+
+3. **Serving**: Deploys the latest checkpoint via vLLM
+   - Auto-scaling with 15-minute scaledown window
+   - Handles up to 32 concurrent requests per replica
+   - OpenAI-compatible API endpoint
+
+### Inference Client (`main.py`)
+
+Uses DSPy and Modaic to create a structured interface to the trained model, then pushes the program to Modaic Hub for easy loading and reuse.
+
+## Installation
+
+```bash
+# Install dependencies
+uv pip install -e .
+
+# Or with pip
+pip install -e .
+```
+
+## Setup
+
+### 1. Configure Modal
+
+```bash
+# Install Modal CLI
+pip install modal
+
+# Authenticate
+modal token new
+```
+
+### 2. Set up WandB Secret
+
+```bash
+# Create WandB secret in Modal
+modal secret create wandb-secret WANDB_API_KEY=your_wandb_api_key
+```
+
+### 3. Environment Variables
+
+For using the trained model with Modaic, optionally create a `.env` file:
+
+```env
+MODAIC_API_KEY=your_modaic_api_key  # If pushing to Modaic Hub
+```
+
+## Usage
+
+### Training a Model
+
+Deploy the training job to Modal:
+
+```bash
+# Deploy the app
+modal deploy grpo_trl.py
+
+# Run training
+modal run grpo_trl.py::train
+```
+
+Training will:
+- Load the OpenCoder dataset
+- Train for 5 steps (configurable in `grpo_trl.py:85`)
+- Save checkpoints every step
+- Log metrics to WandB
+
+### Serving the Model
+
+Deploy the inference endpoint:
+
+```bash
+modal deploy grpo_trl.py
+
+# The serve function will be available at:
+# https://modaic-ai--grpo-demo-serve.modal.run
+```
+
+### Using the Model with Modaic
+
+#### Option 1: PrecompiledProgram (Code Bundling)
+
+Create a custom program that uses your trained model:
+
+```python
+import dspy
+from modaic import PrecompiledProgram, PrecompiledConfig
+
+class CodeGeneratorConfig(PrecompiledConfig):
+    model: str = "openai//models/checkpoint-5"
+    api_base: str = "https://your-modal-url.modal.run/v1"
+    max_tokens: int = 10000
+    temperature: float = 0.7
+
+class CodeGeneration(dspy.Signature):
+    query: str = dspy.InputField(desc="The query to generate code for.")
+    code: str = dspy.OutputField(desc="The code to generate as a python function.")
+
+class CodeGenerator(PrecompiledProgram):
+    config: CodeGeneratorConfig
+
+    def __init__(self, config: CodeGeneratorConfig, **kwargs):
+        super().__init__(config=config, **kwargs)
+
+        modal_lm = dspy.LM(
+            model=config.model,
+            api_base=config.api_base,
+            max_tokens=config.max_tokens,
+            temperature=config.temperature,
+        )
+        self.answer_question = dspy.Predict(CodeGeneration)
+        self.answer_question.set_lm(modal_lm)
+
+    def forward(self, query):
+        return self.answer_question(query=query)
+
+# Use the program
+code_generator = CodeGenerator(CodeGeneratorConfig())
+result = code_generator(query="Write a function that reverses a string.")
+print(result.code)
+
+# Push to Modaic Hub with code bundling
+code_generator.push_to_hub(
+    "your-entity/code-generator-grpo",
+    with_code=True,  # Bundle entire program code
+    tag="v1.0.0"
+)
+```
+
+#### Option 2: AutoProgram (Loading from Hub)
+
+After pushing your program with `with_code=True`, load it anywhere:
+
+```python
+from modaic import AutoProgram
+
+# Load the entire program including code and dependencies
+code_generator = AutoProgram.from_precompiled(
+    "your-entity/code-generator-grpo",
+    rev="v1.0.0"  # Optional: specify version tag, branch, or commit
+)
+
+# Use immediately
+result = code_generator(query="Write a function to calculate fibonacci numbers.")
+print(result.code)
+```
+
+#### Option 3: Load with Custom Config
+
+Override configuration at runtime:
+
+```python
+from modaic import AutoProgram, AutoConfig
+
+# Load config and override parameters
+config = AutoConfig.from_precompiled("your-entity/code-generator-grpo")
+custom_config = config.model_copy(update={"temperature": 0.5})
+
+# Load program with custom config
+code_generator = AutoProgram.from_precompiled(
+    "your-entity/code-generator-grpo",
+    config=custom_config
+)
+```
+
+#### Option 4: Version Management
+
+Modaic Hub supports Git-like versioning:
+
+```python
+# Load latest from main branch
+program = AutoProgram.from_precompiled("entity/program")
+
+# Load specific version
+program = AutoProgram.from_precompiled("entity/program", rev="v2.0.0")
+
+# Load from development branch
+program = AutoProgram.from_precompiled("entity/program", rev="dev")
+
+# Load specific commit
+program = AutoProgram.from_precompiled("entity/program", rev="a3b2c1d")
+```
+
+## Project Structure
+
+```
+grpo/
+├── grpo_trl.py      # Modal training & serving infrastructure
+├── main.py          # Modaic inference client example
+├── pyproject.toml   # Python dependencies
+├── .env             # Environment variables (create this)
+└── README.md        # This file
+```
+
+## Key Configuration Options
+
+### Training (`grpo_trl.py`)
+
+- **Line 78**: `dataset.select(range(128))` - Number of training examples
+- **Line 85**: `max_steps=5` - Training iterations
+- **Line 84**: `save_steps=1` - Checkpoint frequency
+- **Line 88**: `model="Qwen/Qwen2-0.5B-Instruct"` - Base model
+- **Line 97**: `gpu="H100"` - GPU type for training
+
+### Serving (`grpo_trl.py`)
+
+- **Line 134**: `gpu="H100"` - GPU type for inference
+- **Line 135**: `scaledown_window=15 * 60` - Idle timeout
+- **Line 140**: `max_inputs=32` - Concurrent request limit
+
+### Model Endpoint (`main.py`)
+
+- **Line 5**: `model` - Checkpoint identifier
+- **Line 6**: `api_base` - Modal endpoint URL
+- **Line 7**: `max_tokens` - Maximum generation length
+- **Line 8**: `temperature` - Sampling temperature
+
+## Monitoring
+
+Training metrics are logged to WandB automatically. View:
+- Training loss
+- Reward distribution
+- Generated code examples
+- Step timing and throughput
+
+## Development
+
+### Local Testing
+
+You can test the reward function locally:
+
+```python
+from grpo_trl import get_generated_code_and_test_cases
+
+code = """
+def add(a, b):
+    return a + b
+"""
+test_cases = ["assert add(2, 3) == 5", "assert add(-1, 1) == 0"]
+full_code = get_generated_code_and_test_cases(code, test_cases)
+print(full_code)
+```
+
+### Customizing the Dataset
+
+Replace the dataset in `grpo_trl.py:71-77`:
+
+```python
+dataset = load_dataset("your-username/your-dataset", split="train")
+dataset = dataset.rename_column("your_prompt_column", "prompt")
+dataset = dataset.rename_column("your_testcase_column", "testcases")
+```
+
+### Extending the Reward Function
+
+Modify `compute_reward()` in `grpo_trl.py:48-63` to implement custom reward logic:
+- Code style scoring
+- Performance benchmarks
+- Multi-metric rewards (correctness + efficiency)
+
+## Troubleshooting
+
+**Modal deployment fails:**
+- Ensure Modal token is configured: `modal token new`
+- Check WandB secret exists: `modal secret list`
+
+**Training timeout:**
+- Increase `timeout` in `@app.function` decorator (line 98)
+- Reduce dataset size or max_steps
+
+**Inference errors:**
+- Verify Modal endpoint URL matches `api_base` in config
+- Check that checkpoints exist in Modal volume
+- Ensure vLLM service is running: `modal app logs grpo-demo`
+
+**Modaic Hub push fails:**
+- Set `MODAIC_API_KEY` environment variable
+- Verify entity/repo name format: `"entity-name/repo-name"`
+- Check you have write permissions to the repository
+
+## License
+
+MIT
+
+## Resources
+
+- [Modal Documentation](https://modal.com/docs)
+- [TRL GRPO Guide](https://huggingface.co/docs/trl)
+- [Modaic Documentation](https://docs.modaic.dev)
+- [Modaic AutoProgram Guide](https://docs.modaic.dev/modaic/guides/programs/auto_program)
+- [Modaic PrecompiledProgram Guide](https://docs.modaic.dev/modaic/guides/programs/precompiled_program)
+- [DSPy Documentation](https://dspy-docs.vercel.app)
+- [OpenCoder Dataset](https://huggingface.co/datasets/OpenCoder-LLM/opc-sft-stage2)
diff --git a/main.py b/main.py
index 2273329..2942f24 100644
--- a/main.py
+++ b/main.py
@@ -32,6 +32,6 @@ class CodeGenerator(PrecompiledProgram):
 code_generator = CodeGenerator(CodeGeneratorConfig())
 print(code_generator(query="Write a python function that returns the sum of two numbers.").code)
-code_generator.push_to_hub("modaic/code-generator-trl-grpo", with_code=True, tag="v2.0.1")
+code_generator.push_to_hub("modaic/code-generator-trl-grpo", with_code=True, tag="v2.0.2", commit_message="Add README.md")
\ No newline at end of file
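
The reward scheme described in the README's Training Pipeline section (reward 1 when all test cases pass, 0 otherwise, with a 30-second limit per execution) might look roughly like the sketch below. It assumes a plain local subprocess rather than the sandboxed Modal environment that `grpo_trl.py` is said to use, and `build_script` and `binary_reward` are hypothetical names, not functions from the repository.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def build_script(code: str, test_cases: list[str]) -> str:
    """Append the assert statements to the generated code (hypothetical helper)."""
    return code.rstrip() + "\n\n" + "\n".join(test_cases) + "\n"


def binary_reward(code: str, test_cases: list[str], timeout_s: float = 30.0) -> float:
    """Return 1.0 if the code runs and every assert passes, else 0.0.

    Assumption: a local subprocess stands in for the Modal sandbox,
    so this offers no isolation and is only suitable for trusted code.
    """
    script = build_script(code, test_cases)
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "candidate.py"
        path.write_text(script)
        try:
            result = subprocess.run(
                [sys.executable, str(path)],
                capture_output=True,
                timeout=timeout_s,
            )
        except subprocess.TimeoutExpired:
            # A run that exceeds the time limit counts as a failure.
            return 0.0
    # A failed assert or any other exception gives a non-zero exit code.
    return 1.0 if result.returncode == 0 else 0.0


if __name__ == "__main__":
    good = "def add(a, b):\n    return a + b"
    bad = "def add(a, b):\n    return a - b"
    tests = ["assert add(2, 3) == 5", "assert add(-1, 1) == 0"]
    print(binary_reward(good, tests), binary_reward(bad, tests))
```

Because GRPO compares completions within a group, an all-or-nothing reward like this only carries signal when some completions in a group pass and others fail. A partial-credit variant (fraction of asserts passed) would be one possible extension of the kind the README lists under multi-metric rewards.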