From 52e07b504fccaa4ba3b20d439328c3fbd524e6ea Mon Sep 17 00:00:00 2001
From: Farouk Adeleke
Date: Wed, 11 Feb 2026 09:10:54 +0000
Subject: [PATCH] Update README.md

---
 README.md | 313 +-----------------------------------------------------
 1 file changed, 1 insertion(+), 312 deletions(-)

diff --git a/README.md b/README.md
index 77c405b..af2ce90 100644
--- a/README.md
+++ b/README.md
@@ -1,312 +1 @@
-# Bench
-
-Modaic internal SDK for benchmarking judges and training confidence probes.
-
-## Installation
-
-```bash
-cd cli
-uv sync
-```
-
-## CLI Commands
-
-All commands are run from the `cli` directory via `uv run mo <command>`.
-
-### `create`
-
-Create benchmark datasets for training confidence probes. This command runs a judge on examples, extracts embeddings via Modal, and pushes the resulting dataset to HuggingFace Hub.
-
-**Subcommands:**
-
-- `create ppe` - Create a dataset from the PPE (human-preference + correctness) benchmarks
-- `create judge_bench` - Create a dataset from the JudgeBench benchmark
-
-**Usage:**
-
-```bash
-# Interactive mode (recommended) - prompts for configuration
-uv run mo create ppe
-uv run mo create judge_bench
-
-# With config file
-uv run mo create ppe --config config.yaml
-uv run mo create judge_bench --config config.yaml
-```
-
-**Options:**
-
-| Option     | Short | Description                |
-| ---------- | ----- | -------------------------- |
-| `--config` | `-c`  | Path to config file (YAML) |
-
-**Config File Example:**
-
-```yaml
-judge: tyrin/ppe-judge-gepa
-output: tytodd/my-probe-dataset
-n_train: 500
-n_test: 100
-embedding_layer: -1 # -1 for middle layer
-```
-
-**What it does:**
-
-1. Loads examples from the benchmark dataset
-2. Runs the specified judge on each example to get predictions
-3. Extracts embeddings from the judge's LLM via Modal (GPU)
-4. Creates a HuggingFace dataset with columns: `question`, `response_a`, `response_b`, `label`, `predicted`, `messages`, `embeddings`
-5. Pushes to HuggingFace Hub
-
----
-
-### `train`
-
-Train a confidence probe on an embeddings dataset created with `create`.
-
-**Usage:**
-
-```bash
-# Interactive mode (recommended) - prompts for all configuration
-uv run mo train
-
-# With config file
-uv run mo train --config config.yaml
-
-# With CLI arguments
-uv run mo train --dataset tytodd/my-embeddings --epochs 10 --lr 0.0001
-```
-
-**Options:**
-
-| Option           | Short | Description                                                                      | Default           |
-| ---------------- | ----- | -------------------------------------------------------------------------------- | ----------------- |
-| `--config`       | `-c`  | Path to config file (YAML)                                                       | -                 |
-| `--dataset`      | `-d`  | Dataset path (HuggingFace Hub or local); must be a dataset created with `create` | -                 |
-| `--model-path`   | `-m`  | Output path for trained model                                                    | `{dataset}_probe` |
-| `--batch-size`   |       | Batch size                                                                       | 4                 |
-| `--epochs`       |       | Number of training epochs                                                        | 10                |
-| `--lr`           |       | Learning rate                                                                    | 0.0001            |
-| `--weight-decay` |       | Weight decay                                                                     | 0.01              |
-| `--test-size`    |       | Validation split ratio (if no test split)                                        | 0.2               |
-| `--seed`         |       | Random seed                                                                      | 42                |
-| `--project`      |       | W&B project name                                                                 | model_path        |
-| `--hub-path`     |       | HuggingFace Hub path to push model                                               | -                 |
-
-**Config File Example:**
-
-```yaml
-dataset_path: tytodd/my-probe-dataset
-model_path: ./best_probe
-hub_path: tytodd/my-probe # Optional: push to HF Hub
-batch_size: 4
-epochs: 10
-learning_rate: 0.0001
-weight_decay: 0.01
-test_size: 0.2
-seed: 42
-```
-
-**What it does:**
-
-1. Loads an embeddings dataset (from HuggingFace Hub or local)
-2. Creates binary labels: 1 if `predicted == label`, 0 otherwise
-3. Trains a linear probe using MSE loss (Brier score optimization)
-4. Logs metrics to Weights & Biases (Brier, ECE, MCE, Kuiper, AUROC)
-5. Saves the best model based on validation Brier score
-6. Optionally pushes to HuggingFace Hub
-
----
-
-### `eval`
-
-Evaluate a trained confidence probe on a dataset. Computes calibration and discrimination metrics.
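The Brier score and ECE that `train` logs and `eval` reports can be sketched in a few lines of plain Python. This is an illustrative re-implementation under the definitions given in this README (Brier = mean squared error against 0/1 labels; ECE = per-bin gap between mean confidence and accuracy, 10 bins), not the SDK's own code, and the function names are our own:

```python
def brier_score(probs, labels):
    """Mean squared error between predicted confidences and 0/1 labels."""
    return sum((p - y) ** 2 for p, y in zip(probs, labels)) / len(probs)

def expected_calibration_error(probs, labels, n_bins=10):
    """Per-bin |mean confidence - accuracy|, weighted by bin size."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        # Clamp p == 1.0 into the last bin.
        bins[min(int(p * n_bins), n_bins - 1)].append((p, y))
    ece = 0.0
    for members in bins:
        if members:
            conf = sum(p for p, _ in members) / len(members)
            acc = sum(y for _, y in members) / len(members)
            ece += len(members) / len(probs) * abs(conf - acc)
    return ece

# Labels follow the dataset convention: 1 if `predicted == label`, else 0.
probs = [0.9, 0.8, 0.3, 0.6]
labels = [1, 1, 0, 0]
print(brier_score(probs, labels))  # 0.125
```

A lower Brier score rewards both accuracy and calibration at once, which is why `train` selects its best checkpoint on validation Brier.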
-
-**Usage:**
-
-```bash
-# Interactive mode (recommended) - prompts for probe and dataset
-uv run mo eval
-
-# With CLI arguments
-uv run mo eval --probe tytodd/my-probe --dataset tytodd/my-embeddings
-
-# Evaluate on train split instead of test
-uv run mo eval --probe tytodd/my-probe --dataset tytodd/my-embeddings --split train
-```
-
-**Options:**
-
-| Option                       | Short | Description                              | Default      |
-| ---------------------------- | ----- | ---------------------------------------- | ------------ |
-| `--probe`                    | `-p`  | Probe path (HuggingFace Hub or local)    | -            |
-| `--dataset`                  | `-d`  | Dataset path (HuggingFace Hub or local)  | -            |
-| `--split`                    | `-s`  | Dataset split to evaluate on             | test         |
-| `--batch-size`               | `-b`  | Batch size for evaluation                | 64           |
-| `--normalize/--no-normalize` | `-n`  | Normalize embeddings with StandardScaler | probe config |
-
-**Metrics computed:**
-
-| Metric      | Description                                       |
-| ----------- | ------------------------------------------------- |
-| Brier Score | Mean squared error between predictions and labels |
-| Accuracy    | Classification accuracy at 0.5 threshold          |
-| F1 Score    | Harmonic mean of precision and recall             |
-| ECE         | Expected Calibration Error (10 bins)              |
-| MCE         | Maximum Calibration Error                         |
-| Kuiper      | Kuiper statistic for calibration                  |
-| AUROC       | Area Under the ROC Curve (discrimination)         |
-
-**What it does:**
-
-1. Loads a pretrained probe from HuggingFace Hub or local path
-2. Loads a dataset created with `create`
-3. Creates binary labels: 1 if `predicted == label`, 0 otherwise
-4. Runs inference and computes calibration/discrimination metrics
-5. Displays results in a formatted table
-
----
-
-### `compile`
-
-Compile (optimize) a judge using GEPA over a dataset. GEPA iteratively improves the judge's prompt based on training examples.
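The `inputs` block in the compile config amounts to a parameter-to-column remapping: each entry names a judge parameter and, optionally, the dataset column that feeds it. A minimal sketch of that step, using a `map_inputs` helper name of our own invention (the SDK's internals may differ):

```python
def map_inputs(row, inputs):
    """Build judge keyword arguments from one dataset row.

    Each `inputs` entry has a `name` (the judge parameter) and an
    optional `column` (the dataset column), which defaults to `name`.
    """
    return {spec["name"]: row[spec.get("column", spec["name"])]
            for spec in inputs}

# Mirrors the config example: response_a/response_b map onto the
# dataset's response_A/response_B columns.
row = {"question": "Which answer is better?",
       "response_A": "first answer", "response_B": "second answer"}
inputs = [
    {"name": "question"},
    {"name": "response_a", "column": "response_A"},
    {"name": "response_b", "column": "response_B"},
]
print(map_inputs(row, inputs))
```

This is why a subcommand like `compile ppe` can omit `inputs` entirely: the mapping for its benchmark is already known.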
-
-**Subcommands:**
-
-- `compile` (base) - Compile with custom dataset and parameter mapping
-- `compile ppe` - Compile specifically for PPE datasets (human-preference + correctness)
-
-**Usage:**
-
-```bash
-# Interactive mode
-uv run mo compile
-uv run mo compile ppe
-
-# With config file
-uv run mo compile --config config.yaml
-uv run mo compile ppe --config config.yaml
-```
-
-**Options:**
-
-| Option     | Short | Description                |
-| ---------- | ----- | -------------------------- |
-| `--config` | `-c`  | Path to config file (YAML) |
-
-**Config File Example:**
-
-```yaml
-judge: tyrin/ppe-judge
-dataset: tytodd/ppe-human-preference
-inputs: # selects which input columns of the dataset to use (not necessary when using a subcommand such as `ppe`)
-  - name: question
-  - name: response_a
-    column: response_A # Map param name to dataset column
-  - name: response_b
-    column: response_B
-label_column: label
-n_train: 100
-n_val: 50
-base_model: gpt-4o-mini
-reflection_model: gpt-4o
-output: tyrin/ppe-judge-gepa
-seed: 42
-```
-
-**What it does:**
-
-1. Loads a judge from Modaic Hub
-2. Loads training/validation examples from a HuggingFace dataset
-3. Maps judge parameters to dataset columns
-4. Runs GEPA optimization to improve the judge's prompt
-5. Pushes the optimized judge to Modaic Hub
-
----
-
-### `embed`
-
-Regenerate embeddings for an existing dataset using a different model or layer. Useful for experimenting with different embedding configurations without re-running the judge.
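The `--layer` option accepts `-1` as a shorthand for "middle layer". The resolution step can be sketched as a tiny helper; this is a hypothetical function assuming "middle" means half the model's hidden-layer count, not the SDK's actual code:

```python
def resolve_layer(layer, num_hidden_layers):
    """Turn the CLI's layer index into a concrete hidden-state index.

    -1 is documented as "use the middle layer"; assumed here to mean
    num_hidden_layers // 2. Any other value passes through unchanged.
    """
    return num_hidden_layers // 2 if layer == -1 else layer

print(resolve_layer(-1, 64))  # 32, e.g. for a 64-layer model
print(resolve_layer(8, 24))   # 8, explicit indices are kept as-is
```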
-
-**Usage:**
-
-```bash
-# Interactive mode
-uv run mo embed
-
-# With CLI arguments
-uv run mo embed --dataset tytodd/my-dataset --hf-model Qwen/Qwen3-VL-32B-Instruct --layer -1
-```
-
-**Options:**
-
-| Option       | Short | Description                              |
-| ------------ | ----- | ---------------------------------------- |
-| `--dataset`  | `-d`  | Dataset path (HuggingFace Hub or local)  |
-| `--hf-model` | `-m`  | HuggingFace model path for embeddings    |
-| `--layer`    | `-l`  | Hidden layer index (-1 for middle layer) |
-
-**What it does:**
-
-1. Loads an existing dataset (must have a `messages` column)
-2. Regenerates embeddings using the specified model/layer via Modal
-3. Replaces the `embeddings` column in the dataset
-4. Prompts to push the updated dataset to HuggingFace Hub
-
-**Example workflow:**
-
-```bash
-# Original dataset was created with layer 32
-# Now try middle layer instead
-uv run mo embed \
-  --dataset tytodd/my-embeddings \
-  --hf-model Qwen/Qwen3-VL-32B-Instruct \
-  --layer -1
-```
-
----
-
-## Recommended Embedding Layers
-
-When extracting embeddings, use these recommended layer indices for best probe performance:
-
-| Model         | HuggingFace Path                    | Recommended Layer |
-| ------------- | ----------------------------------- | ----------------- |
-| GPT-OSS 20B   | `openai/gpt-oss-20b`                | 8                 |
-| Qwen3-VL 32B  | `Qwen/Qwen3-VL-32B-Instruct`        | 16                |
-| Llama 3.3 70B | `meta-llama/Llama-3.3-70B-Instruct` | 32                |
-
-Use `-1` for the middle layer if experimenting with an unlisted model.
-
----
-
-## Typical Workflow
-
-```bash
-# 1. Create a probe dataset from a benchmark
-uv run mo create ppe
-
-# 2. Train a confidence probe
-uv run mo train --dataset tytodd/ppe-qwen3-embeddings
-
-# 3. Evaluate the probe on a test set
-uv run mo eval --probe tytodd/my-probe --dataset tytodd/ppe-qwen3-embeddings
-
-# 4. (Optional) Compile/optimize a judge with GEPA
-uv run mo compile ppe
-
-# 5. (Optional) Re-embed with different layer
-uv run mo embed --dataset tytodd/my-dataset --layer 32
-```
-
-## Environment Variables
-
-Create a `.env` file with:
-
-```bash
-OPENAI_API_KEY=...
-WANDB_API_KEY=...
-HF_TOKEN=...
-MODAIC_TOKEN=...
-TOGETHER_API_KEY=...
-```
+# Bench
\ No newline at end of file