# The Digital Registrar: A model-agnostic, resource-efficient AI framework for comprehensive cancer surveillance from pathology reports
[License: Apache 2.0](https://opensource.org/licenses/Apache-2.0) · [Python](https://www.python.org/) · [DSPy](https://dspy.ai/) · [DOI: 10.5281/zenodo.17689362](https://doi.org/10.5281/zenodo.17689362)
## Overview

**The Digital Registrar** is an open-source, locally deployable AI framework designed to automate the extraction of structured cancer registry data from unstructured surgical pathology reports.

This repository contains the source code, extraction logic (DSPy signatures), and benchmarking scripts associated with the manuscript: **"A Multicancer AI Framework for Comprehensive Cancer Surveillance from Pathology Reports"** (submitted to *npj Digital Medicine*).
### Key Features

* **Privacy-First:** Designed to run entirely on-premises using local LLMs (via Ollama), ensuring no PHI leaves the hospital firewall.
* **Resource-Efficient:** Optimized for single-GPU medical workstations (NVIDIA RTX A6000, 48GB VRAM), resolving the "implementation trilemma" of deployment.
* **Model Agnostic:** Built on [DSPy](https://github.com/stanfordnlp/dspy), allowing the underlying LLM to be swapped without rewriting extraction logic (see the sketch below).
* **Comprehensive:** Extracts 193+ CAP-aligned fields across 10 distinct cancer types, including complex nested data for margins and lymph nodes.
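
The extraction logic is written as declarative DSPy signatures rather than hand-crafted prompts, which is what keeps it independent of the underlying model. Below is a minimal, illustrative sketch of what such a signature looks like; the class and field names are hypothetical and do not mirror the repository's actual signatures.

```python
import dspy


class TumorSummary(dspy.Signature):
    """Extract structured cancer-registry fields from a surgical pathology report."""

    report_text: str = dspy.InputField(desc="Full text of the pathology report")
    histologic_type: str = dspy.OutputField(desc="Histologic diagnosis, CAP-aligned wording")
    tumor_size_mm: str = dspy.OutputField(desc="Greatest tumor dimension in millimetres")
    margin_status: str = dspy.OutputField(desc="Involved/uninvolved, with the closest margin")


# The same module runs against whichever LLM dspy.configure() points to.
extract = dspy.Predict(TumorSummary)
# prediction = extract(report_text="...")  # returns the declared output fields
```

Because a signature only declares inputs and outputs, swapping the serving model (see the Model Zoo below) requires no change to this code.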
## Directory Structure

- `models/`: Organ-specific cancer models and common utilities.
- `util/`: Utility scripts for logging and data processing.
- `pipeline.py`: Main pipeline script.
- `experiment.py`: Script for running experiments.
- `README.md`: Project documentation.
- `LICENSE`: Project license.
## Prerequisites

- **Ollama**: This project requires [Ollama](https://ollama.com/) to be installed and running to serve the LLMs.
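
With Ollama running, you can pull a model ahead of time, e.g. `ollama pull gpt-oss:20b` for the model recommended in the benchmark table below.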
## Installation

Ensure you have Python installed. Install the required dependencies:

```bash
pip install -r requirements.txt
```
## Usage

### Running the Pipeline

You can run the pipeline using the `pipeline.py` script or by creating an experiment using `experiment.py`.

**Basic Usage:**

```python
from pipeline import setup_pipeline, run_cancer_pipeline

# Set up the pipeline with your desired model (e.g., 'gpt')
setup_pipeline("gpt")

# Run the pipeline on a report string
report_text = "..."  # Your pathology report text here
result, timing = run_cancer_pipeline(report_text, fname="example_report")
print(result)
```
**Running an Experiment:**

The `experiment.py` script runs the pipeline over a folder of text files. Specify the input folder with the `--input` argument:

```bash
python experiment.py --input "path/to/your/dataset"
```

If no input folder is specified, the script falls back to the hardcoded paths defined inside it (kept for backward compatibility and testing).
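
If you prefer to drive the pipeline from your own script instead of `experiment.py`, a minimal loop over a folder of `.txt` reports might look like the sketch below. The dataset and output paths are placeholders, and it assumes the returned `result` is JSON-serializable; adapt as needed.

```python
import json
from pathlib import Path

from pipeline import setup_pipeline, run_cancer_pipeline

# Placeholder input folder; point this at your own dataset of .txt reports.
dataset = Path("path/to/your/dataset")

setup_pipeline("gpt")

results = {}
for report_file in sorted(dataset.glob("*.txt")):
    report_text = report_file.read_text(encoding="utf-8")
    result, timing = run_cancer_pipeline(report_text, fname=report_file.stem)
    results[report_file.name] = result
    print(f"{report_file.name}: processed in {timing}")

# Persist results however suits your workflow, e.g. a single JSON file.
Path("results.json").write_text(json.dumps(results, default=str, indent=2))
```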
-----

## Model Zoo & Performance

The following models were benchmarked in the study. We recommend **gpt-oss:20b** for the optimal balance of accuracy and latency on single-GPU setups.

| Model | Architecture | Total Params | Active Params | Rec. VRAM |
| :--- | :--- | :--- | :--- | :--- |
| **gpt-oss:20b** | **Sparse MoE** | **20B** | **\~2B** | **40GB** |
| Qwen3-30B-A3B | Sparse MoE | 30B | \~2.4B | 60GB\* |
| gemma3:27b | Dense | 27B | 27B | 48GB |

*\*Note: Qwen3-30B exceeds the 48GB VRAM limit of standard A6000 cards, leading to memory offloading and higher latency.*
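
Because the extraction modules are plain DSPy programs, switching between these models should only require pointing DSPy at a different Ollama model; the extraction logic stays untouched. The snippet below is a hedged sketch of that configuration: the model tag and API base assume a default local Ollama install, and the repository's own `setup_pipeline()` may wrap this step differently.

```python
import dspy

# Point DSPy at a locally served Ollama model (Ollama's default port is 11434).
lm = dspy.LM(
    "ollama_chat/gpt-oss:20b",         # swap for e.g. "ollama_chat/gemma3:27b"
    api_base="http://localhost:11434",
    api_key="",                        # Ollama does not require an API key
)
dspy.configure(lm=lm)

# All DSPy signatures and modules now run against the selected local model.
```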
-----

## Citation

If you use this code or the dataset in your research, please cite our preprint:

> **Chow, N.-H., Chang, H., Chen, H.-K., et al.** (2025). "A Multicancer AI Framework for Comprehensive Cancer Surveillance from Pathology Reports." *medRxiv*. DOI: [10.1101/2025.10.21.25338475](https://doi.org/10.1101/2025.10.21.25338475)

### BibTeX

```bibtex
@article{Chow2025.10.21.25338475,
  author = {Chow, Nan-Haw and Chang, Han and Chen, Hung-Kai and Lin, Chen-Yuan and Liu, Ying-Lung and Tseng, Po-Yen and Shiu, Li-Ju and Chu, Yen-Wei and Chung, Pau-Choo and Chang, Kai-Po},
  title = {A Multicancer AI Framework for Comprehensive Cancer Surveillance from Pathology Reports},
  elocation-id = {2025.10.21.25338475},
  year = {2025},
  doi = {10.1101/2025.10.21.25338475},
  publisher = {Cold Spring Harbor Laboratory Press},
  URL = {https://www.medrxiv.org/content/early/2025/11/23/2025.10.21.25338475},
  eprint = {https://www.medrxiv.org/content/early/2025/11/23/2025.10.21.25338475.full.pdf},
  journal = {medRxiv}
}
```
-----

## License

This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.
## Authors

- **Kai-Po Chang** - [GitHub](https://github.com/kblab2024)
- **Hung-Kai Chen** - [GitHub](https://github.com/Walther-Chen)
- **Po-Yen Tseng** - [GitHub](https://github.com/ThomasTsengCMU)

All at **Med NLP Lab, China Medical University**.
|