(no commit message)

2025-12-16 17:16:57 -08:00
parent bfcb237280
commit ad688d3fec
4 changed files with 237 additions and 1 deletions

21
LICENSE Normal file

@@ -0,0 +1,21 @@
MIT License
Copyright (c) 2025 Stanford MIMI Lab
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.

138
README.md

@@ -1,2 +1,138 @@
# medcalc_bench-claude-3.7-sonnet-MIPROv2
# Structured Prompting Enables More Robust Evaluation of Language Models
[![arXiv](https://img.shields.io/badge/arXiv-2511.20836-b31b1b.svg?style=for-the-badge)](https://arxiv.org/abs/2511.20836)
[![License](https://img.shields.io/github/license/stanfordmimi/helm-optimizer?style=for-the-badge)](LICENSE)
<img src="assets/fig.png" alt="Overview" width="650">
**Figure 1** | **Pipeline overview**. (a) DSPy takes HELM's baseline prompt and produces structured prompt variants. (b) HELM evaluates models under each prompt variant. With structured prompting, we observe more robust performance evaluation.
## What is DSPy-HELM?
DSPy-HELM is a framework for optimizing language model prompts on benchmarks using [DSPy](https://github.com/stanfordnlp/dspy), which enables more robust evaluation. The toolkit automates prompt optimization (APO) for benchmarks from the [HELM](https://github.com/stanford-crfm/helm) (Holistic Evaluation of Language Models) ecosystem.
## Installation
```bash
# Clone the repository
git clone https://github.com/StanfordMIMI/dspy-helm.git
cd dspy-helm
# Install dependencies
pip install -r requirements.txt
```
## Basic Usage
**Configure settings** in `run.sh`:
```bash
scenarios=("mmlu_pro" "gpqa" "gsm8k" "medcalc_bench" "medec" "head_qa" "medbullets")
optimizers=("MIPROv2" "BootstrapFewShotWithRandomSearch")
# Language model to be optimized
model=openai/gpt-4o
api_base="your_api_base_here"
api_key="your_api_key_here"
# Teacher language model for proposing instruction candidates (MIPROv2)
prompt_model=openai/gpt-4o
prompt_api_base="your_api_base_here"
prompt_api_key="your_api_key_here"
max_bootstrapped_demos=3
max_labeled_demos=3
num_threads=1
```
**Run optimization**:
```bash
./run.sh
```
## 📊 Supported Benchmarks
| Benchmark | Input → Output | Task |
|-------------------|--------------------------------------|------------------------|
| MMLU-Pro | Reasoning Question → Answer | Multi-Task Reasoning |
| GPQA | Graduate Question → Answer | Graduate-Level QA |
| GSM8K | Math Problem → Solution | Numeric Problem-Solving|
| MedCalc-Bench | Patient Note → Computed Value | Computational Reasoning|
| Medec | Medical Narrative → Errors | Error Classification |
| HeadQA | Medical Question → Answer | USMLE-Style QA |
| MedBullets | Medical Question → Answer | USMLE-Style QA |
## Optimization Parameters
- `--max_bootstrapped_demos`: Number of bootstrapped demonstrations
- `--max_labeled_demos`: Number of labeled demonstrations
- `--num_threads`: Parallel processing threads (default: 16)
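The three flags above could be wired up with `argparse` roughly as follows. This is a minimal sketch, not the repository's `main.py`: the `build_parser` helper is hypothetical, and the default of 3 for the two demo counts is an assumption borrowed from the `run.sh` example. Only the `--num_threads` default of 16 is stated in the list above.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser covering only the three documented flags."""
    parser = argparse.ArgumentParser(description="Prompt optimization options (sketch)")
    # Default of 3 is an assumption taken from the run.sh example values.
    parser.add_argument("--max_bootstrapped_demos", type=int, default=3,
                        help="Number of bootstrapped demonstrations")
    parser.add_argument("--max_labeled_demos", type=int, default=3,
                        help="Number of labeled demonstrations")
    # Default of 16 is the value documented in the README.
    parser.add_argument("--num_threads", type=int, default=16,
                        help="Parallel processing threads")
    return parser


# Parse an explicit argument list so the example does not depend on sys.argv.
args = build_parser().parse_args(["--num_threads", "4"])
print(vars(args))
```

Passing an explicit list to `parse_args` keeps the sketch reproducible; a real entry point would call `parse_args()` with no arguments.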
## 📁 Project Structure
```
dspy-helm/
├── main.py              # Main optimization script
├── scenarios.py         # Benchmark implementations
├── run.sh               # Batch optimization runner
├── requirements.txt     # Python dependencies
├── agents/              # Optimized DSPy agents
│   ├── medcalc_bench/
│   ├── head_qa/
│   ├── medbullets/
│   └── ...
└── README.md            # This file
```
## 📈 Results and Evaluation
Optimized agents are automatically saved to the `agents/` directory:
```
agents/
└── {scenario}/
    └── {model_name}/
        ├── MIPROv2.json
        └── BootstrapFewShotWithRandomSearch.json
```
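The layout above maps directly onto a path template. The sketch below builds and enumerates such paths with `pathlib`. The helpers `agent_path` and `list_agents` are hypothetical, and how `{model_name}` is derived from a full model identifier such as `openai/gpt-4o` is not specified here, so the example passes an already-shortened name (`gpt-4o`) as an assumption.

```python
import tempfile
from pathlib import Path


def agent_path(root, scenario, model_name, optimizer):
    """Return agents/{scenario}/{model_name}/{optimizer}.json under root."""
    return Path(root) / scenario / model_name / f"{optimizer}.json"


def list_agents(root):
    """Return sorted (scenario, model_name, optimizer) tuples for saved agents."""
    found = []
    for path in sorted(Path(root).glob("*/*/*.json")):
        scenario, model_name = path.parts[-3], path.parts[-2]
        found.append((scenario, model_name, path.stem))
    return found


# Demo: create two empty agent files in a temporary directory and list them.
with tempfile.TemporaryDirectory() as tmp:
    for optimizer in ("MIPROv2", "BootstrapFewShotWithRandomSearch"):
        target = agent_path(tmp, "medcalc_bench", "gpt-4o", optimizer)
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_text("{}")
    demo_listing = list_agents(tmp)

print(demo_listing)
```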
## Creating Custom Scenarios
To add a new HELM benchmark:
**Implement the scenario class** in `scenarios.py`:
```python
import dspy


class my_benchmark:
    def __init__(self, test_size=0.1, seed=42):
        self.test_size = test_size
        self.seed = seed

    @staticmethod
    def make_prompt(row):
        # Turn one dataset row into the prompt text shown to the model
        return f"Question: {row['question']}\nAnswer:"

    @staticmethod
    def metric(example, pred, trace=None):
        # Your evaluation metric
        return dspy.evaluate.metrics.answer_exact_match(example, pred, trace)

    def load_data(self):
        # Load and return trainset, valset
        pass
```
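The `load_data` stub is left to the implementer. One way its split step might look, using only the standard library, is sketched below. The helper `split_rows` is hypothetical and is not taken from `scenarios.py`. It assumes rows are plain dicts and that `test_size` is the held-out fraction; a real scenario would presumably also wrap each row in a DSPy example object, which is omitted here.

```python
import random


def split_rows(rows, test_size=0.1, seed=42):
    """Shuffle rows deterministically and split them into (trainset, valset)."""
    shuffled = list(rows)
    # A dedicated Random instance keeps the split reproducible for a given seed.
    random.Random(seed).shuffle(shuffled)
    # Hold out at least one row so the validation set is never empty.
    n_val = max(1, round(len(shuffled) * test_size))
    return shuffled[n_val:], shuffled[:n_val]


rows = [{"question": f"q{i}", "answer": str(i)} for i in range(20)]
trainset, valset = split_rows(rows, test_size=0.1, seed=42)
print(len(trainset), len(valset))  # prints: 18 2
```

Seeding a local `random.Random` rather than the global generator means the split does not change when other code draws random numbers.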
## 🙏 Acknowledgments
This repository is built using [DSPy](https://github.com/stanfordnlp/dspy) for prompt optimization.
## 📎 Citation
If you find this repository useful for your work, please cite the following paper:
```bibtex
@article{aali2025structured,
title={Structured Prompting Enables More Robust Evaluation of Language Models},
author={Aali, Asad and Mohsin, Muhammad Ahmed and Bikia, Vasiliki and Singhvi, Arnav and Gaus, Richard and Bedi, Suhana and Cui, Hejie and Fuentes, Miguel and Unell, Alyssa and Mai, Yifan and others},
journal={arXiv preprint arXiv:2511.20836},
year={2025}
}
```

14
config.json Normal file

@@ -0,0 +1,14 @@
{
"model": "openrouter/anthropic/claude-3.7-sonnet",
"scenario": "medcalc_bench",
"optimizer": "MIPROv2",
"api_base": "https://openrouter.ai/api/v1",
"prompt_model": "gpt-4o",
"prompt_api_base": null,
"max_tokens": 64000,
"temperature": null,
"max_bootstrapped_demos": 3,
"max_labeled_demos": 3,
"num_threads": 16,
"val_size": null
}
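A short sketch of how a consumer might read this file. Two things here are assumptions rather than documented behavior: that `null` means "leave this option at its default", and that the run name follows the pattern `{scenario}-{model short name}-{optimizer}`, which is only inferred from the README's original title. The helpers `set_options` and `run_name` are hypothetical.

```python
import json

# Copy of the config.json shown above.
CONFIG_TEXT = """
{
  "model": "openrouter/anthropic/claude-3.7-sonnet",
  "scenario": "medcalc_bench",
  "optimizer": "MIPROv2",
  "api_base": "https://openrouter.ai/api/v1",
  "prompt_model": "gpt-4o",
  "prompt_api_base": null,
  "max_tokens": 64000,
  "temperature": null,
  "max_bootstrapped_demos": 3,
  "max_labeled_demos": 3,
  "num_threads": 16,
  "val_size": null
}
"""


def set_options(config):
    """Keep only the options with a non-null value (assumed to mean 'explicitly set')."""
    return {key: value for key, value in config.items() if value is not None}


def run_name(config):
    """Build '{scenario}-{model short name}-{optimizer}' (inferred naming pattern)."""
    model_short = config["model"].split("/")[-1]
    return f"{config['scenario']}-{model_short}-{config['optimizer']}"


config = json.loads(CONFIG_TEXT)
unset = sorted(key for key, value in config.items() if value is None)
print(run_name(config))  # prints: medcalc_bench-claude-3.7-sonnet-MIPROv2
print(unset)             # prints: ['prompt_api_base', 'temperature', 'val_size']
```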

65
program.json Normal file

@@ -0,0 +1,65 @@
{
"agent.predict": {
"traces": [],
"train": [],
"demos": [
{
"augmented": true,
"inputs": "Given a patient note and a clinical question, compute the requested medical value.\n\nPatient note: A 63-year-old Thai man, with a long-standing history of alcohol and tobacco use, was hospitalized with a two-month history of progressive jaundice, generalized pruritus, and significant involuntary weight loss. He reported no fever, abdominal pain, or vomiting. For the 4 weeks prior to his admission, he also suffered from progressive fatigue, poor oral intake, pale stools, and decreased urine output. His past medical history was significant for well-controlled type 2 diabetes mellitus, hypertension, and stage 3A chronic kidney disease (baseline serum creatinine of 1.5 mg/dL, estimated glomerular filtration rate of 48.8 mL/min/1.73 m2). His family history was negative for any kidney disease or malignancy. He had no history of herbal medicine use, intravenous drug abuse, tattoos, or blood transfusions. On physical examination, he was alert and oriented with severe jaundice, a distended abdomen with an enlarged liver, and a 1+ bilateral pitting edema of his lower extremities. He had no stigmata of chronic liver disease. The rest of the physical examination was unremarkable.\nLaboratory data were as follows: hemoglobin = 9.6 g/dL, mean corpuscular volume (MCV) = 85 fL, white blood cells = 11,880/mm3 (neutrophil 81%, lymphocyte 4.9%, monocyte 12%), platelets = 384,000/mm3, international normalized ratio (INR) = 2.2, blood urea nitrogen = 73.8 mg/dL, serum creatinine = 7.78 mg/dL, sodium = 133 mEq/L, potassium = 3.8 mEq/L, bicarbonate = 16.8 mEq/L, chloride = 92 mEq/L, calcium = 8.7 mg/dL, phosphorus = 4.8 mg/dL, magnesium = 2.8 mg/dL, and glycated hemoglobin = 7.1%. 
Liver function tests revealed the following: total cholesterol = 177 mg/dL, total protein = 5.4 g/dL, albumin = 2.8 g/dL, globulin = 2.6 g/dL, total bilirubin = 36.1 mg/dL, direct bilirubin = 35 mg/dL, alanine transaminase (ALT) = 53 U/L, aspartate transaminase (AST) = 109 U/L, and alkaline phosphatase (ALP) = 408 U/L. Serum tests for human immunodeficiency virus, hepatitis B, and hepatitis C were negative. Serum carbohydrate antigen 19-9 (CA 19-9) levels were 850 U/mL (0\u201337). Urine appeared greenish with a specific gravity of 1.010, trace proteinuria, 5\u201310 red blood cells (RBCs)/high-power field, 1\u20132 white blood cells/high-power field along with numerous muddy brown granular casts (Fig. ), and bile casts (Fig. ). No RBC casts were found. The fractional excretion of sodium (FENa) was 2.7%, and FEurea was 44.8%. Abdominal computed tomography showed 4.3 \u00d7 4.2 cm and 3.1 \u00d7 3.5 cm inhomogeneous enhancing masses at liver segment 4A with moderately dilated intrahepatic ducts (Fig. ). Based on the symptoms of biliary obstruction, typical imaging findings, and elevated serum CA 19-9, the most likely diagnosis was cholangiocarcinoma (Bismuth\u2013Corlette type IIIb). He was initially given intravenous fluids to combat his oliguric renal failure, but there was no improvement in serum creatinine and urine output. Over the next 5 days, hemodialysis was initiated due to diuretic-resistant oliguric acute renal failure. Percutaneous transhepatic biliary drainage (PTBD) was also performed to relieve his obstructive jaundice. However, he still experienced clinical deterioration. After approximately 2 weeks of hemodialysis, his total serum bilirubin continued to be in the 30 mg/dL range and he remained oliguric. On the 27th day of admission, the patient suffered from massive biliary hemorrhage requiring a blood transfusion. 
An emergency esophagogastroduodenoscopy (EGD) was performed and revealed active hemorrhage from the ampulla of Vater, a finding suggestive of tumor bleeding. After thorough discussion and consultation between the patient's family and the palliative care team, the healthcare providers proceed to initiate palliative treatment.\n\nQuestion: What is the patient's albumin corrected delta ratio?\n\nAnswer only the requested quantity without units. No explanation needed:",
"reasoning": "To calculate the albumin corrected delta ratio, I need to find the anion gap first and then correct it for albumin.\n\nStep 1: Calculate the anion gap.\nAnion Gap = Sodium - (Chloride + Bicarbonate)\nAnion Gap = 133 - (92 + 16.8) = 133 - 108.8 = 24.2\n\nStep 2: Correct the anion gap for albumin.\nNormal albumin is typically 4.0 g/dL, but the patient's albumin is 2.8 g/dL.\nAlbumin Corrected Anion Gap = Anion Gap + 2.5 \u00d7 (4.0 - measured albumin)\nAlbumin Corrected Anion Gap = 24.2 + 2.5 \u00d7 (4.0 - 2.8)\nAlbumin Corrected Anion Gap = 24.2 + 2.5 \u00d7 1.2\nAlbumin Corrected Anion Gap = 24.2 + 3 = 27.2\n\nStep 3: Calculate the delta ratio.\nDelta Ratio = (Anion Gap - 12) / (24 - Bicarbonate)\nDelta Ratio = (27.2 - 12) / (24 - 16.8)\nDelta Ratio = 15.2 / 7.2 = 2.111\n\nSo the albumin corrected delta ratio is 2.111.",
"output": "2.111",
"answer": "2.111"
},
{
"augmented": true,
"inputs": "Given a patient note and a clinical question, compute the requested medical value.\n\nPatient note: A 69-year-old man was admitted to the hospital because of anorexia and fever. He had suffered an acute middle cerebral artery territory stroke 45 days previously and underwent treatment. The only medications he was taking were for stroke (aspirin 100 mg, cilostazol 100 mg, atorvastatin 80 mg, and rabamipide 100 mg). On the initial physical examination, his vital signs were as follows: blood pressure, 162/79 mm Hg; pulse rate, 104 beats/min; respiratory rate, 20 breaths/min; and temperature, 37.8\u00b0C. He had severe dysarthria and right extremity weakness due to the stroke. No malar rash, oral ulcers, jugular venous distension, or lymphadenopathy were observed. Bilateral fine crackles were detected on chest auscultation, and his abdomen was soft but there was no organomegaly. Multiple pinkish reticulated patches were observed on the lower leg. The results of initial laboratory tests were as follows: white blood cell (WBC) count, 17 630/\u03bcL (4000-10 000/\u03bcL); hemoglobin, 9.1 g/dL (13.0-17.0 g/dL); blood urea nitrogen, 24.3 mg/dL (6-20 mg/dL); creatinine, 1.52 mg/dL (0.6-1.2 mg/dL); C-reactive protein, 153.5 mg/dL (0.0-5.0 mg/dL); MPO\u2013ANCA, 100.0 U/mL (0-4.9 U/mL); and PR3\u2013ANCA, negative. The urinalysis revealed the following findings: protein, 2+; blood, 2+; red blood cells, 5-9/high power fields (HPF); and WBC, 5-9/HPF. Serum/urine protein electrophoresis and immunofixation showed nonspecific findings (Table ). Pulmonary interstitial fibrosis was noted on a chest X-ray (CXR). Brain magnetic resonance imaging revealed no evidence of a newly developed infarction, but the hemorrhagic transformation of the left MCA infarction lesion was detected (frontoparietal lobes, basal ganglia, and insular region). His baseline serum creatinine was 0.71 mg/dL, which increased to 1.52 and 2.24 mg/dL on days 1 and 3 of admission, respectively. 
A percutaneous renal biopsy was performed on day 7 of admission. Seven fibrocellular crescents out of 17 glomeruli were observed on light microscopy (Figure ). Some were associated with fibrinoid necrosis, and numerous neutrophils were detected in the capillary lumen. In addition, a few small vessels with neutrophil infiltration and fibrin deposition were evident, indicating acute necrotizing vasculitis (Figure ). Most of the interstitial area was infiltrated by numerous plasma cells and several neutrophils. The possibility of crescentic glomerulonephritis, associated with plasma-cell proliferative disease or IgG-related disease, was considered. Kappa and Lambda in situ hybridization revealed a positive staining pattern, indicating polyclonality (Figure ). In addition, IgG4-positive plasma cells were up to 200/HPF in the interstitial area. Unlike IgG4-tubulointerstitial nephritis, there was no sclerosing lymphoplasmacytic inflammation or apparent obliteration of the tubules. Therefore, the patient was diagnosed with microscopic polyangiitis having pauci-immune crescentic glomerulonephritis with fibrinoid necrotizing vasculitis. A bone marrow (BM) aspiration and biopsy showed a normocellular marrow with no increase in plasma cells (1.8% of all nucleated cells), and no evidence of BM involvement by monoclonal plasma cells. His Birmingham vasculitis activity score was 27. Serum creatinine peaked at 3.1 mg/dL. He received three pulses of methylprednisolone (500 mg/day) followed by prednisone (1 mg/kg/day) and cyclophosphamide (1.5 mg/kg/day). After treatment, his serum creatinine improved to 2.3 mg/dL, and the microscopic hematuria disappeared. Furthermore, his fever and skin lesions improved. He was discharged with prednisolone 40 mg/day (1 mg/kg/day), peroral cyclophosphamide 50 mg/day, and his renal function remained stable until 2 months after discharge. No side effects were observed due to the immunosuppressive agents. 
The dose of azathioprine was maintained, and the dose of prednisolone was tapered to 30 mg/day. He was subsequently lost to follow-up and died 16 months later due to pneumonia.\n\nQuestion: What is the patient's CURB-65 score?\n\nAnswer only the requested quantity without units. No explanation needed:",
"reasoning": "The CURB-65 score is a clinical prediction rule used to determine the severity of community-acquired pneumonia. It assesses 5 risk factors, with 1 point for each:\n\nC - Confusion (new onset or worsened)\nU - Urea > 7 mmol/L (BUN > 19 mg/dL)\nR - Respiratory rate \u2265 30 breaths/min\nB - Blood pressure (systolic < 90 mmHg or diastolic \u2264 60 mmHg)\n65 - Age \u2265 65 years\n\nLet's check each criterion:\n\n1. Confusion: The patient has severe dysarthria and right extremity weakness due to a stroke, but there's no specific mention of confusion. No points.\n\n2. Urea (BUN): The patient's BUN is 24.3 mg/dL, which is > 19 mg/dL. +1 point.\n\n3. Respiratory rate: The patient's respiratory rate is 20 breaths/min, which is < 30 breaths/min. No points.\n\n4. Blood pressure: The patient's blood pressure is 162/79 mmHg, which is not low (not < 90/60 mmHg). No points.\n\n5. Age: The patient is 69 years old, which is \u2265 65 years. +1 point.\n\nTotal CURB-65 score = 1 + 1 = 2",
"output": "2",
"answer": "2"
},
{
"augmented": true,
"inputs": "Given a patient note and a clinical question, compute the requested medical value.\n\nPatient note: In a healthy Greek male physician, 35-years-old, 173 cm height, 72 Kg weight, never-smoker with no more than one or two drinks weekly, with known natural infection since five years, six TSTs were applied in 12 November 1985 (Figure ). The hands belong to the first author, specialised in Pulmonology at Athens Thorax Diseases Hospital 1976\u20131980. During this period he had repeatedly negative Mantoux/Sokal TSTs; last negative 10 April 1980. The first positive Mantoux occurred 20 June 1980. Between 11 February 1980 and 11 April 1980 he had no contact at all; thus the most probable period of infection was between 11 and 20 April 1980, his first week in the army, when there was contact with a very probable tuberculosis patient. He received chemoprophylaxis (300 mg isoniazide and 25 mg pyridoxine every morning) from July 1980 to May 1981.\nAll vials of 1, 2, 5, 10, 20, 50 TU per 0.1 mL purified protein derivative (PPD) RT23 were provided by the Greek Pasteur Institute on the same day, preserved under identical conditions, and administered within a few days after production, long before their expiry date.\nThe tester was the domiciliary nurse of the Ioannina anti-tuberculosis campaign, with more than twenty years of experience. All six tests were injected lege-artis (one sharp bevelled disposable 25 G needle with a plastic tuberculin syringe per test; intradermal injection of exactly one tenth of a millilitre of PPD tuberculin; 4 mm white blister). Induration was read 72 hours later, using Sokal's technique (Figure ).\nThe results are presented in Figure . Skin reaction was not a linear function of tuberculin dose; induration was increased with a decreasing rate as tuberculin dose increases (Figure ), while an unexpected decrease occurred in the dose of 10 TUs. Erythema (redness) was not measured but it was parallel to induration (Figure ). 
No side-effects occurred.\n\nQuestion: What is the patient's body mass mass index (BMI)? Your answer should be in terms of kg/m\u00b2.\n\nAnswer only the requested quantity without units. No explanation needed:",
"reasoning": "To calculate the patient's BMI (Body Mass Index), I need to use the formula:\nBMI = weight (kg) / [height (m)]\u00b2\n\nFrom the patient note:\n- Height = 173 cm = 1.73 m\n- Weight = 72 kg\n\nCalculating:\nBMI = 72 / (1.73)\u00b2\nBMI = 72 / 2.9929\nBMI = 24.0568 kg/m\u00b2\n\nRounding to the proper decimal places:\nBMI = 24.057 kg/m\u00b2\n\nHowever, the question asks for the answer without units, so I'll provide just the numeric value.",
"output": "24.057",
"answer": "24.057"
}
],
"signature": {
"instructions": "You are tasked with solving high-stakes clinical queries that require precise medical calculations and diagnostic reasoning. Given detailed input fields, such as patient notes and clinical questions, you must accurately compute the requested medical values. Your response will inform critical clinical decisions, making accuracy paramount. Follow a step-by-step 'Chain of Thought' reasoning to ensure precise outcomes. Deliver only the requested computed medical value, omitting any units or explanations, as these numbers will directly contribute to the patient's treatment plan.",
"fields": [
{
"prefix": "Inputs:",
"description": "${inputs}"
},
{
"prefix": "Reasoning: Let's think step by step in order to",
"description": "${reasoning}"
},
{
"prefix": "Output:",
"description": "${output}"
}
]
},
"lm": {
"model": "openrouter/anthropic/claude-3.7-sonnet",
"model_type": "chat",
"cache": true,
"num_retries": 3,
"finetuning_model": null,
"launch_kwargs": {},
"train_kwargs": {},
"temperature": null,
"max_tokens": 64000,
"api_base": "https://openrouter.ai/api/v1"
}
},
"metadata": {
"dependency_versions": {
"python": "3.13",
"dspy": "3.0.4",
"cloudpickle": "3.1"
}
}
}
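The arithmetic in the three stored demonstrations can be re-checked independently. The sketch below re-implements the formulas exactly as they are written in each demo's `reasoning` field (normal albumin 4.0 g/dL, correction factor 2.5, reference anion gap 12, reference bicarbonate 24, and the CURB-65 thresholds as listed there). The function names are hypothetical, and the code is a consistency check on the saved demos, not a clinical reference.

```python
def albumin_corrected_delta_ratio(sodium, chloride, bicarbonate, albumin):
    """Delta ratio computed from the albumin-corrected anion gap, as in demo 1."""
    anion_gap = sodium - (chloride + bicarbonate)
    corrected_gap = anion_gap + 2.5 * (4.0 - albumin)
    return (corrected_gap - 12) / (24 - bicarbonate)


def curb65(confusion, bun_mg_dl, resp_rate, systolic, diastolic, age):
    """One point per criterion, using the thresholds listed in demo 2."""
    criteria = [
        confusion,
        bun_mg_dl > 19,
        resp_rate >= 30,
        systolic < 90 or diastolic <= 60,
        age >= 65,
    ]
    return sum(criteria)


def bmi(weight_kg, height_cm):
    """Body mass index in kg/m^2, as in demo 3."""
    return weight_kg / (height_cm / 100) ** 2


# Inputs are the values quoted in each demo's patient note.
delta = round(albumin_corrected_delta_ratio(133, 92, 16.8, 2.8), 3)
score = curb65(False, 24.3, 20, 162, 79, 69)
index = round(bmi(72, 173), 3)
print(delta, score, index)  # prints: 2.111 2 24.057
```

All three values match the `answer` fields stored in `program.json`.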