# GRPO Code Generation with Modal and Modaic

A production-ready implementation of Group Relative Policy Optimization (GRPO) that trains code generation models on Modal for use within DSPy programs, with Modaic Hub integration for deployment and inference.

## Overview

This project trains a code generation model using GRPO reinforcement learning, where the reward signal comes from actual test case execution. The trained model is deployed as a vLLM endpoint on Modal and can be used via Modaic's `PrecompiledProgram` and `AutoProgram` interfaces.

**Key Features:**

- **Test-driven training**: Models learn to generate code that passes test cases
- **Scalable infrastructure**: Training and inference on Modal with H100 GPUs
- **Fast inference**: vLLM-powered serving with automatic scaling
- **Modaic integration**: Push trained programs to Modaic Hub for easy reuse
- **Experiment tracking**: WandB integration for monitoring training metrics

## Dataset Attribution

This project uses the [OpenCoder-LLM/opc-sft-stage2](https://huggingface.co/datasets/OpenCoder-LLM/opc-sft-stage2) dataset from Hugging Face, specifically the `educational_instruct` subset. This dataset contains programming problems with test cases, making it well suited to reward-based training.

**Citation:**

```
@misc{opencoder2024,
  title={OpenCoder-LLM SFT Stage 2 Dataset},
  author={OpenCoder-LLM Team},
  year={2024},
  publisher={HuggingFace},
  howpublished={\url{https://huggingface.co/datasets/OpenCoder-LLM/opc-sft-stage2}}
}
```

## Architecture

### Training Pipeline (`grpo_trl.py`)

1. **Reward Function**: Executes generated code with test cases in sandboxed Modal environments
   - Code that passes all tests receives reward = 1
   - Failed code receives reward = 0
   - Timeout protection (30s per execution)
2. **GRPO Training**: Uses TRL's `GRPOTrainer` with the Qwen2-0.5B-Instruct base model
   - Trains on H100 GPU
   - Saves checkpoints to Modal Volume
   - Tracks metrics via WandB
3. **Serving**: Deploys the latest checkpoint via vLLM
   - Auto-scaling with 15-minute scaledown window
   - Handles up to 32 concurrent requests per replica
   - OpenAI-compatible API endpoint

### Inference Client (`main.py`)

Uses DSPy and Modaic to create a structured interface to the trained model, then pushes the program to Modaic Hub for easy loading and reuse.

## Installation

```bash
# Install dependencies
uv pip install -e .

# Or with pip
pip install -e .
```

## Setup

### 1. Configure Modal

```bash
# Install Modal CLI
pip install modal

# Authenticate
modal token new
```

### 2. Set up WandB Secret

```bash
# Create WandB secret in Modal
modal secret create wandb-secret WANDB_API_KEY=your_wandb_api_key
```

### 3. Environment Variables

To use the trained model with Modaic, optionally create a `.env` file:

```env
# Only needed if pushing to Modaic Hub
MODAIC_API_KEY=your_modaic_api_key
```

## Usage

### Training a Model

Deploy the training job to Modal:

```bash
# Deploy the app
modal deploy grpo_trl.py

# Run training
modal run grpo_trl.py::train
```

Training will:

- Load the OpenCoder dataset
- Train for 5 steps (configurable in `grpo_trl.py:85`)
- Save checkpoints every step
- Log metrics to WandB

### Serving the Model

Deploy the inference endpoint:

```bash
modal deploy grpo_trl.py

# The serve function will be available at:
# https://modaic-ai--grpo-demo-serve.modal.run
```

### Using the Model with Modaic

#### Option 1: PrecompiledProgram (Code Bundling)

Create a custom program that uses your trained model:

```python
import dspy
from modaic import PrecompiledProgram, PrecompiledConfig


class CodeGeneratorConfig(PrecompiledConfig):
    model: str = "openai//models/checkpoint-5"
    api_base: str = "https://your-modal-url.modal.run/v1"
    max_tokens: int = 10000
    temperature: float = 0.7


class CodeGeneration(dspy.Signature):
    query: str = dspy.InputField(desc="The query to generate code for.")
    code: str = dspy.OutputField(desc="The code to generate as a python function.")


class CodeGenerator(PrecompiledProgram):
    config: CodeGeneratorConfig

    def __init__(self, config: CodeGeneratorConfig, **kwargs):
        super().__init__(config=config, **kwargs)
        # Point DSPy at the vLLM endpoint served from Modal
        modal_lm = dspy.LM(
            model=config.model,
            api_base=config.api_base,
            max_tokens=config.max_tokens,
            temperature=config.temperature,
        )
        self.answer_question = dspy.Predict(CodeGeneration)
        self.answer_question.set_lm(modal_lm)

    def forward(self, query):
        return self.answer_question(query=query)


# Use the program
code_generator = CodeGenerator(CodeGeneratorConfig())
result = code_generator(query="Write a function that reverses a string.")
print(result.code)

# Push to Modaic Hub with code bundling
code_generator.push_to_hub(
    "your-entity/code-generator-grpo",
    with_code=True,  # Bundle entire program code
    tag="v1.0.0",
)
```

#### Option 2: AutoProgram (Loading from Hub)

After pushing your program with `with_code=True`, load it anywhere:

```python
from modaic import AutoProgram

# Load the entire program including code and dependencies
code_generator = AutoProgram.from_precompiled(
    "your-entity/code-generator-grpo",
    rev="v1.0.0",  # Optional: specify version tag, branch, or commit
)

# Use immediately
result = code_generator(query="Write a function to calculate fibonacci numbers.")
print(result.code)
```

#### Option 3: Load with Custom Config

Override configuration at runtime:

```python
from modaic import AutoProgram, AutoConfig

# Load config and override parameters
config = AutoConfig.from_precompiled("your-entity/code-generator-grpo")
custom_config = config.model_copy(update={"temperature": 0.5})

# Load program with custom config
code_generator = AutoProgram.from_precompiled(
    "your-entity/code-generator-grpo",
    config=custom_config,
)
```

#### Option 4: Version Management

Modaic Hub supports Git-like versioning:

```python
from modaic import AutoProgram

# Load latest from main branch
program = AutoProgram.from_precompiled("entity/program")

# Load specific version
program = AutoProgram.from_precompiled("entity/program", rev="v2.0.0")

# Load from development branch
program = AutoProgram.from_precompiled("entity/program", rev="dev")

# Load specific commit
program = AutoProgram.from_precompiled("entity/program", rev="a3b2c1d")
```

## Project Structure

```
grpo/
├── grpo_trl.py        # Modal training & serving infrastructure
├── main.py            # Modaic inference client example
├── pyproject.toml     # Python dependencies
├── .env               # Environment variables (create this)
└── README.md          # This file
```

## Key Configuration Options

### Training (`grpo_trl.py`)

- **Line 78**: `dataset.select(range(128))` - Number of training examples
- **Line 84**: `save_steps=1` - Checkpoint frequency
- **Line 85**: `max_steps=5` - Training iterations
- **Line 88**: `model="Qwen/Qwen2-0.5B-Instruct"` - Base model
- **Line 97**: `gpu="H100"` - GPU type for training

### Serving (`grpo_trl.py`)

- **Line 134**: `gpu="H100"` - GPU type for inference
- **Line 135**: `scaledown_window=15 * 60` - Idle timeout
- **Line 140**: `max_inputs=32` - Concurrent request limit

### Model Endpoint (`main.py`)

- **Line 5**: `model` - Checkpoint identifier
- **Line 6**: `api_base` - Modal endpoint URL
- **Line 7**: `max_tokens` - Maximum generation length
- **Line 8**: `temperature` - Sampling temperature

## Monitoring

Training metrics are logged to WandB automatically. View:

- Training loss
- Reward distribution
- Generated code examples
- Step timing and throughput

## Development

### Local Testing

You can test the reward function locally:

```python
from grpo_trl import get_generated_code_and_test_cases

code = """
def add(a, b):
    return a + b
"""

test_cases = ["assert add(2, 3) == 5", "assert add(-1, 1) == 0"]
full_code = get_generated_code_and_test_cases(code, test_cases)
print(full_code)
```

### Customizing the Dataset

Replace the dataset in `grpo_trl.py:71-77`:

```python
dataset = load_dataset("your-username/your-dataset", split="train")
dataset = dataset.rename_column("your_prompt_column", "prompt")
dataset = dataset.rename_column("your_testcase_column", "testcases")
```

### Extending the Reward Function

Modify `compute_reward()` in `grpo_trl.py:48-63` to implement custom reward logic:

- Code style scoring
- Performance benchmarks
- Multi-metric rewards (correctness + efficiency)

## Troubleshooting

**Modal deployment fails:**

- Ensure Modal token is configured: `modal token new`
- Check WandB secret exists: `modal secret list`

**Training timeout:**

- Increase `timeout` in `@app.function` decorator (line 98)
- Reduce dataset size or `max_steps`

**Inference errors:**

- Verify Modal endpoint URL matches `api_base` in config
- Check that checkpoints exist in Modal volume
- Ensure vLLM service is running: `modal app logs grpo-demo`

**Modaic Hub push fails:**

- Set `MODAIC_API_KEY` environment variable
- Verify entity/repo name format: `"entity-name/repo-name"`
- Check you have write permissions to the repository

## License

MIT

## Resources

- [Modal Documentation](https://modal.com/docs)
- [TRL GRPO Guide](https://huggingface.co/docs/trl)
- [Modaic Documentation](https://docs.modaic.dev)
- [Modaic AutoProgram Guide](https://docs.modaic.dev/modaic/guides/programs/auto_program)
- [Modaic PrecompiledProgram Guide](https://docs.modaic.dev/modaic/guides/programs/precompiled_program)
- [DSPy Documentation](https://dspy-docs.vercel.app)
- [OpenCoder Dataset](https://huggingface.co/datasets/OpenCoder-LLM/opc-sft-stage2)
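
## Appendix: Reward Function Sketch

The binary reward described under Architecture (1 when every test passes, 0 on failure or after a 30-second timeout) can be approximated locally as shown below. This is a minimal sketch, not the code in `grpo_trl.py`: the names `build_test_program` and `binary_reward` are hypothetical, the way code and assertions are concatenated is an assumption, and a plain subprocess stands in for the sandboxed Modal environment the project actually uses.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def build_test_program(code: str, test_cases: list[str]) -> str:
    """Hypothetical helper: append the assert statements after the candidate code."""
    return code.strip() + "\n\n" + "\n".join(test_cases) + "\n"


def binary_reward(code: str, test_cases: list[str], timeout_s: float = 30.0) -> int:
    """Return 1 if the program exits cleanly within the timeout, otherwise 0."""
    program = build_test_program(code, test_cases)
    with tempfile.TemporaryDirectory() as tmp:
        script = Path(tmp) / "candidate.py"
        script.write_text(program)
        try:
            # Assumption: a local subprocess replaces the Modal sandbox here.
            result = subprocess.run(
                [sys.executable, str(script)],
                capture_output=True,
                timeout=timeout_s,
            )
        except subprocess.TimeoutExpired:
            # Runaway code counts as a failure.
            return 0
    # A failed assert or any other exception gives a non-zero exit code.
    return 1 if result.returncode == 0 else 0


if __name__ == "__main__":
    tests = ["assert add(2, 3) == 5", "assert add(-1, 1) == 0"]
    print(binary_reward("def add(a, b):\n    return a + b", tests))  # prints 1
    print(binary_reward("def add(a, b):\n    return a - b", tests))  # prints 0
```

A subprocess gives no isolation from the host, so this sketch is only suitable for trusted snippets while debugging reward logic; model-generated code should still run inside a sandbox.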