The Digital Registrar: A model-agnostic, resource-efficient AI framework for comprehensive cancer surveillance from pathology reports
Overview
The Digital Registrar is an open-source, locally deployable AI framework designed to automate the extraction of structured cancer registry data from unstructured surgical pathology reports.
This repository contains the source code, extraction logic (DSPy signatures), and benchmarking scripts associated with the manuscript: "A Multicancer AI Framework for Comprehensive Cancer Surveillance from Pathology Reports" (Submitted to npj Digital Medicine).
Key Features
- Privacy-First: Designed to run entirely on-premises using local LLMs (via Ollama), ensuring no PHI leaves the hospital firewall.
- Resource-Efficient: Optimized for single-GPU medical workstations (NVIDIA RTX A6000, 48GB VRAM), resolving the "implementation trilemma" of deployment.
- Model Agnostic: Built on DSPy, allowing the underlying LLM to be swapped without rewriting extraction logic.
- Comprehensive: Extracts 193+ CAP-aligned fields across 10 distinct cancer types, including complex nested data for margins and lymph nodes.
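
To make "complex nested data for margins and lymph nodes" concrete, here is a minimal sketch of how such a nested record could be represented in plain Python. Every class and field name here (`Margin`, `LymphNodeGroup`, `ReportExtraction`, `distance_mm`, and so on) is hypothetical and chosen for illustration. This is not the repository's actual schema or its DSPy signatures, and the sample values are made up.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List, Optional


@dataclass
class Margin:
    # Hypothetical fields for illustration; not the repository's schema.
    name: str                            # e.g. "proximal", "distal"
    involved: Optional[bool] = None      # None means "not stated in the report"
    distance_mm: Optional[float] = None


@dataclass
class LymphNodeGroup:
    station: str
    examined: Optional[int] = None
    positive: Optional[int] = None


@dataclass
class ReportExtraction:
    cancer_type: str
    margins: List[Margin] = field(default_factory=list)
    lymph_nodes: List[LymphNodeGroup] = field(default_factory=list)

    def total_positive_nodes(self) -> int:
        # Treat unreported counts (None) as zero when summing.
        return sum(group.positive or 0 for group in self.lymph_nodes)


if __name__ == "__main__":
    # Made-up example values, only to show the nesting.
    record = ReportExtraction(
        cancer_type="colorectal",
        margins=[Margin("proximal", involved=False, distance_mm=35.0)],
        lymph_nodes=[LymphNodeGroup("regional", examined=12, positive=2)],
    )
    print(json.dumps(asdict(record), indent=2))
```

Keeping one list entry per margin or node group, rather than one flat field per value, lets a record hold however many margins or stations a given report describes.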
Directory Structure
- models/: Organ-specific cancer models and common utilities.
- util/: Utility scripts for logging and data processing.
- pipeline.py: Main pipeline script.
- experiment.py: Script for running experiments.
- README.md: Project documentation.
- LICENSE: Project license.
Prerequisites
- Ollama: This project requires Ollama to be installed and running to serve the LLMs.
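
Before running the pipeline, it can help to confirm that Ollama is actually reachable. The sketch below queries Ollama's `/api/tags` endpoint, which lists locally available models, using only the standard library. It assumes the default address `http://localhost:11434`. The helper name `list_ollama_models` is hypothetical and not part of this repository.

```python
import json
import urllib.error
import urllib.request


def list_ollama_models(base_url="http://localhost:11434", timeout=3.0):
    """Return names of locally available models, or None if Ollama is unreachable."""
    try:
        with urllib.request.urlopen(f"{base_url}/api/tags", timeout=timeout) as resp:
            payload = json.load(resp)
    except (urllib.error.URLError, OSError, ValueError):
        # Connection refused, timeout, or a response that is not valid JSON.
        return None
    return [model.get("name", "") for model in payload.get("models", [])]


if __name__ == "__main__":
    models = list_ollama_models()
    if models is None:
        print("Ollama does not appear to be running at http://localhost:11434.")
    else:
        print("Ollama is up; local models:", models)
```

If the model you intend to use is missing from the returned list, pull it with `ollama pull <model>` first.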
Installation
Ensure you have Python installed. Install the required dependencies:
pip install -r requirements.txt
Usage
Running the Pipeline
You can run the pipeline either by importing the functions in pipeline.py into your own code or by running experiment.py over a folder of reports.
Basic Usage:
from pipeline import setup_pipeline, run_cancer_pipeline
# Set up the pipeline with your desired model (e.g., 'gpt')
setup_pipeline("gpt")
# Run the pipeline on a report string
report_text = "..." # Your pathology report text here
result, timing = run_cancer_pipeline(report_text, fname="example_report")
print(result)
Running an Experiment:
The experiment.py script allows you to run the pipeline on a folder of text files. You can specify the input folder using the --input argument.
python experiment.py --input "path/to/your/dataset"
If --input is omitted, experiment.py falls back to the paths hardcoded in the script (kept for backward compatibility and testing).
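
For readers who want to drive the pipeline over a folder from their own code, the following is a minimal sketch of that loop. It is not the implementation in experiment.py. The helper `run_folder` and its `pattern` parameter are hypothetical, and the processing function is passed in as an argument so the sketch does not depend on the repository being importable.

```python
from pathlib import Path


def run_folder(input_dir, run_fn, pattern="*.txt"):
    """Call run_fn(report_text, fname=<file stem>) for each matching file.

    Returns a dict mapping each file stem to whatever run_fn returned.
    """
    results = {}
    for path in sorted(Path(input_dir).glob(pattern)):
        report_text = path.read_text(encoding="utf-8")
        results[path.stem] = run_fn(report_text, fname=path.stem)
    return results
```

With the pipeline set up as in Basic Usage, `run_folder("path/to/your/dataset", run_cancer_pipeline)` would map each file name to the `(result, timing)` pair returned for that report.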
Model Zoo & Performance
The following models were benchmarked in the study. We recommend gpt-oss:20b for the optimal balance of accuracy and latency on single-GPU setups.
| Model | Architecture | Total Params | Active Params | Rec. VRAM |
|---|---|---|---|---|
| gpt-oss:20b | Sparse MoE | 20B | ~3.6B | 40GB |
| Qwen3-30B-A3B | Sparse MoE | 30B | ~3.3B | 60GB* |
| gemma3:27b | Dense | 27B | 27B | 48GB |
*Note: The recommended VRAM for Qwen3-30B-A3B exceeds the 48GB available on a single RTX A6000, so part of the model is offloaded from GPU memory, which increases latency.
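
As a quick way to apply the table to your own hardware, the sketch below filters the benchmarked models by a VRAM budget. The numbers are copied from the "Rec. VRAM" column above. The names `RECOMMENDED_VRAM_GB` and `models_that_fit` are hypothetical helpers, not part of the repository.

```python
# Recommended VRAM in GB, copied from the "Rec. VRAM" column of the table.
RECOMMENDED_VRAM_GB = {
    "gpt-oss:20b": 40,
    "Qwen3-30B-A3B": 60,
    "gemma3:27b": 48,
}


def models_that_fit(vram_gb, table=RECOMMENDED_VRAM_GB):
    """Return the models whose recommended VRAM does not exceed vram_gb."""
    return [name for name, needed in table.items() if needed <= vram_gb]


print(models_that_fit(48))  # ['gpt-oss:20b', 'gemma3:27b']
```

A model that does not pass this filter may still run, but with offloading and the higher latency described in the note.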
Citation
If you use this code or the dataset in your research, please cite our preprint:
Chow, N.-H., Chang, H., Chen, H.-K., et al. (2025). "A Multicancer AI Framework for Comprehensive Cancer Surveillance from Pathology Reports." medRxiv. DOI: 10.1101/2025.10.21.25338475
BibTeX
@article {Chow2025.10.21.25338475,
author = {Chow, Nan-Haw and Chang, Han and Chen, Hung-Kai and Lin, Chen-Yuan and Liu, Ying-Lung and Tseng, Po-Yen and Shiu, Li-Ju and Chu, Yen-Wei and Chung, Pau-Choo and Chang, Kai-Po},
title = {A Multicancer AI Framework for Comprehensive Cancer Surveillance from Pathology Reports},
elocation-id = {2025.10.21.25338475},
year = {2025},
doi = {10.1101/2025.10.21.25338475},
publisher = {Cold Spring Harbor Laboratory Press},
URL = {https://www.medrxiv.org/content/early/2025/11/23/2025.10.21.25338475},
eprint = {https://www.medrxiv.org/content/early/2025/11/23/2025.10.21.25338475.full.pdf},
journal = {medRxiv}
}
License
This project is licensed under the MIT License - see the LICENSE file for details.