
The Digital Registrar: A model-agnostic, resource-efficient AI framework for comprehensive cancer surveillance from pathology reports

Overview

The Digital Registrar is an open-source, locally deployable AI framework designed to automate the extraction of structured cancer registry data from unstructured surgical pathology reports.

This repository contains the source code, extraction logic (DSPy signatures), and benchmarking scripts associated with the manuscript: "A Multicancer AI Framework for Comprehensive Cancer Surveillance from Pathology Reports" (Submitted to npj Digital Medicine).

Key Features

  • Privacy-First: Designed to run entirely on-premises using local LLMs (via Ollama), ensuring no PHI leaves the hospital firewall.
  • Resource-Efficient: Optimized for single-GPU medical workstations (NVIDIA RTX A6000, 48GB VRAM), resolving the "implementation trilemma" of deployment.
  • Model Agnostic: Built on DSPy, allowing the underlying LLM to be swapped without rewriting extraction logic.
  • Comprehensive: Extracts 193+ CAP-aligned fields across 10 distinct cancer types, including complex nested data for margins and lymph nodes.
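To illustrate the kind of nested, CAP-aligned structure the framework targets (margins, lymph nodes), here is a minimal sketch using plain Python dataclasses. The field names (Margin, LymphNodeFindings, RegistryRecord) are hypothetical and simplified for illustration; the repository defines its actual schemas as DSPy signatures.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Hypothetical, simplified sketch of nested registry fields; the real
# repository expresses its extraction schemas as DSPy signatures.

@dataclass
class Margin:
    site: str                      # e.g. "proximal", "distal"
    involved: bool                 # tumor present at the margin?
    distance_mm: Optional[float]   # closest distance, if uninvolved

@dataclass
class LymphNodeFindings:
    examined: int                  # nodes examined
    positive: int                  # nodes with metastasis

@dataclass
class RegistryRecord:
    cancer_type: str
    histologic_type: str
    margins: List[Margin] = field(default_factory=list)
    lymph_nodes: Optional[LymphNodeFindings] = None

record = RegistryRecord(
    cancer_type="colon",
    histologic_type="adenocarcinoma",
    margins=[Margin(site="proximal", involved=False, distance_mm=12.0)],
    lymph_nodes=LymphNodeFindings(examined=18, positive=2),
)
print(record.lymph_nodes.positive)  # → 2
```

Representing margins and lymph nodes as nested objects, rather than flat key-value pairs, is what lets a single record capture multiple margin sites per specimen.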

Directory Structure

  • models/: Organ-specific cancer models and common utilities.
  • util/: Utility scripts for logging and data processing.
  • pipeline.py: Main pipeline script.
  • experiment.py: Script for running experiments.
  • README.md: Project documentation.
  • LICENSE: Project license.

Prerequisites

  • Ollama: This project requires Ollama to be installed and running to serve the LLMs.

Installation

Ensure you have Python installed. Install the required dependencies:

pip install -r requirements.txt

Usage

Running the Pipeline

You can run the pipeline directly with the pipeline.py script, or batch it over a dataset with experiment.py.

Basic Usage:

from pipeline import setup_pipeline, run_cancer_pipeline
# Setup the pipeline with your desired model (e.g., 'gpt')
setup_pipeline("gpt")
# Run the pipeline on a report string
report_text = "..." # Your pathology report text here
result, timing = run_cancer_pipeline(report_text, fname="example_report")
print(result)

Running an Experiment: The experiment.py script allows you to run the pipeline on a folder of text files. You can specify the input folder using the --input argument.

python experiment.py --input "path/to/your/dataset"

If no input folder is specified, it will default to the hardcoded paths in the script (for backward compatibility or testing).
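Conceptually, the batch run in experiment.py is a walk over the .txt files in the input folder. A minimal sketch is below; process_report is a stand-in stub for run_cancer_pipeline, which requires a configured pipeline and a running Ollama server.

```python
from pathlib import Path

def process_report(text: str) -> dict:
    # Stand-in for run_cancer_pipeline, which needs a configured
    # pipeline (setup_pipeline) and a running Ollama server.
    return {"chars": len(text)}

def run_folder(input_dir: str) -> dict:
    # Process every .txt report in the folder, keyed by file stem,
    # so each result can be traced back to its source report.
    results = {}
    for path in sorted(Path(input_dir).glob("*.txt")):
        results[path.stem] = process_report(path.read_text(encoding="utf-8"))
    return results
```

Keying results by file stem makes it straightforward to join extracted fields back to the original reports when scoring against a gold standard.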


Model Zoo & Performance

The following models were benchmarked in the study. We recommend gpt-oss:20b for the optimal balance of accuracy and latency on single-GPU setups.

Model         | Architecture | Total Params | Active Params | Rec. VRAM
--------------|--------------|--------------|---------------|----------
gpt-oss:20b   | Sparse MoE   | 20B          | ~2B           | 40GB
Qwen3-30B-A3B | Sparse MoE   | 30B          | ~2.4B         | 60GB*
gemma3:27b    | Dense        | 27B          | 27B           | 48GB

*Note: Qwen3-30B exceeds the 48GB VRAM limit of standard A6000 cards, leading to memory offloading and higher latency.
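As a rough rule of thumb (not a measurement from the study), the weight footprint of a model is total parameters × bytes per parameter, with the KV cache and activations adding overhead on top; quantization lowers the bytes-per-parameter figure substantially. A back-of-the-envelope sketch:

```python
def est_weight_gb(total_params_b: float, bytes_per_param: float = 2.0) -> float:
    # Weights only: parameters (in billions) x bytes per parameter.
    # KV cache and activations add further overhead on top of this.
    return total_params_b * bytes_per_param

print(est_weight_gb(20))       # 20B params at 16-bit -> 40.0 GB of weights
print(est_weight_gb(20, 0.5))  # same model at 4-bit quantization -> 10.0 GB
```

This is why a 30B model can spill past a 48GB A6000 at 16-bit precision while a quantized 20B model fits with room for the KV cache.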


Citation

If you use this code or the dataset in your research, please cite our preprint:

Chow, N.-H., Chang, H., Chen, H.-K., et al. (2025). "A Multicancer AI Framework for Comprehensive Cancer Surveillance from Pathology Reports." medRxiv. DOI: 10.1101/2025.10.21.25338475

BibTeX

@article {Chow2025.10.21.25338475,
	author = {Chow, Nan-Haw and Chang, Han and Chen, Hung-Kai and Lin, Chen-Yuan and Liu, Ying-Lung and Tseng, Po-Yen and Shiu, Li-Ju and Chu, Yen-Wei and Chung, Pau-Choo and Chang, Kai-Po},
	title = {A Multicancer AI Framework for Comprehensive Cancer Surveillance from Pathology Reports},
	elocation-id = {2025.10.21.25338475},
	year = {2025},
	doi = {10.1101/2025.10.21.25338475},
	publisher = {Cold Spring Harbor Laboratory Press},
	URL = {https://www.medrxiv.org/content/early/2025/11/23/2025.10.21.25338475},
	eprint = {https://www.medrxiv.org/content/early/2025/11/23/2025.10.21.25338475.full.pdf},
	journal = {medRxiv}
}

License

This project is licensed under the MIT License - see the LICENSE file for details.

Authors

  • Kai-Po Chang - GitHub
  • Hung-Kai Chen - GitHub
  • Po-Yen Tseng - GitHub

All authors are with the Med NLP Lab, China Medical University.