# cancer-pipeline
# The Digital Registrar: A model-agnostic, resource-efficient AI framework for comprehensive cancer surveillance from pathology reports

[License: Apache-2.0](https://opensource.org/licenses/Apache-2.0)
[Python](https://www.python.org/)
[DSPy](https://dspy.ai/)
[DOI: 10.5281/zenodo.17689362](https://doi.org/10.5281/zenodo.17689362)

## Overview

**The Digital Registrar** is an open-source, locally deployable AI framework designed to automate the extraction of structured cancer registry data from unstructured surgical pathology reports.

This repository contains the source code, extraction logic (DSPy signatures), and benchmarking scripts associated with the manuscript **"A Multicancer AI Framework for Comprehensive Cancer Surveillance from Pathology Reports"** (submitted to *npj Digital Medicine*).

### Key Features

* **Privacy-First:** Designed to run entirely on-premises using local LLMs (via Ollama), ensuring no PHI leaves the hospital firewall.
* **Resource-Efficient:** Optimized for single-GPU medical workstations (NVIDIA RTX A6000, 48GB VRAM), resolving the "implementation trilemma" of deployment.
* **Model Agnostic:** Built on [DSPy](https://github.com/stanfordnlp/dspy), allowing the underlying LLM to be swapped without rewriting extraction logic.
* **Comprehensive:** Extracts 193+ CAP-aligned fields across 10 distinct cancer types, including complex nested data for margins and lymph nodes.

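To make the "nested data" point concrete, here is a minimal sketch of how margin and lymph-node findings might be represented as nested records. Every class and field name below (`Margin`, `LymphNodeGroup`, `ExtractionResult`, and their attributes) is a hypothetical illustration, not the repository's actual DSPy signatures or CAP field names, and the example values are made up.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List, Optional


@dataclass
class Margin:
    # Hypothetical fields: one entry per surgical margin mentioned in a report.
    name: str
    involved: bool
    distance_mm: Optional[float] = None


@dataclass
class LymphNodeGroup:
    # Hypothetical fields: one entry per lymph node group.
    site: str
    examined: int
    positive: int


@dataclass
class ExtractionResult:
    # Hypothetical top-level record holding the nested lists.
    cancer_type: str
    margins: List[Margin] = field(default_factory=list)
    lymph_nodes: List[LymphNodeGroup] = field(default_factory=list)

    def total_positive_nodes(self) -> int:
        """Sum positive nodes across all lymph node groups."""
        return sum(group.positive for group in self.lymph_nodes)


# Made-up values, for illustration only.
example = ExtractionResult(
    cancer_type="breast",
    margins=[Margin("deep", involved=False, distance_mm=2.0)],
    lymph_nodes=[LymphNodeGroup("sentinel", examined=3, positive=1)],
)

# Nested dataclasses serialize cleanly to JSON via asdict().
print(json.dumps(asdict(example), indent=2))
```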
## Directory Structure

- `models/`: Organ-specific cancer models and common utilities.
- `util/`: Utility scripts for logging and data processing.
- `pipeline.py`: Main pipeline script.
- `experiment.py`: Script for running experiments.
- `README.md`: Project documentation.
- `LICENSE`: Project license.

## Prerequisites

- **Ollama**: This project requires [Ollama](https://ollama.com/) to be installed and running to serve the LLMs.

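Before running the pipeline, it can help to confirm that the Ollama server is reachable. The sketch below is not part of this repository. It assumes Ollama's default local address (`http://localhost:11434`) and queries its `/api/tags` endpoint, which lists the locally available models. The helper name `list_ollama_models` is hypothetical.

```python
import json
import urllib.request
from typing import List, Optional


def list_ollama_models(base_url: str = "http://localhost:11434",
                       timeout: float = 2.0) -> Optional[List[str]]:
    """Return names of locally available models, or None if Ollama is unreachable."""
    try:
        with urllib.request.urlopen(f"{base_url}/api/tags", timeout=timeout) as resp:
            payload = json.load(resp)
    except (OSError, ValueError):
        # Connection errors and timeouts are OSError subclasses;
        # ValueError covers malformed URLs and non-JSON responses.
        return None
    return [model["name"] for model in payload.get("models", [])]


if __name__ == "__main__":
    models = list_ollama_models()
    if models is None:
        print("Ollama does not appear to be running at the default address.")
    else:
        print("Available models:", models)
```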
## Installation

Ensure Python is installed, then install the required dependencies:

```bash
pip install -r requirements.txt
```

## Usage

### Running the Pipeline

You can call the pipeline directly from Python through `pipeline.py`, or run it over a dataset with `experiment.py`.

**Basic Usage:**

```python
from pipeline import setup_pipeline, run_cancer_pipeline

# Set up the pipeline with your desired model (e.g., 'gpt')
setup_pipeline("gpt")

# Run the pipeline on a report string
report_text = "..."  # Your pathology report text here
result, timing = run_cancer_pipeline(report_text, fname="example_report")
print(result)
```

**Running an Experiment:**

The `experiment.py` script runs the pipeline on a folder of text files. Specify the input folder with the `--input` argument:

```bash
python experiment.py --input "path/to/your/dataset"
```

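As a rough illustration of this kind of folder-based run, the sketch below loops over the `.txt` files in a directory and applies an extraction function to each. It is not the code in `experiment.py`: the helper `run_on_folder` is hypothetical, and the extractor is passed in as a parameter so the example runs without a model. In practice, `extract` could wrap `run_cancer_pipeline(text, fname=name)` after `setup_pipeline(...)` has been called.

```python
import tempfile
from pathlib import Path
from typing import Callable, Dict


def run_on_folder(folder: str,
                  extract: Callable[[str, str], object]) -> Dict[str, object]:
    """Apply extract(report_text, name) to every .txt file in the folder."""
    results: Dict[str, object] = {}
    for path in sorted(Path(folder).glob("*.txt")):
        text = path.read_text(encoding="utf-8")
        # Key each result by the file name without its extension.
        results[path.stem] = extract(text, path.stem)
    return results


# Demo with a stand-in extractor that only counts characters.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "report_a.txt").write_text("Example report text", encoding="utf-8")
    demo = run_on_folder(tmp, lambda text, name: {"n_chars": len(text)})
    print(demo)
```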
If no input folder is specified, `experiment.py` falls back to the paths hardcoded in the script (for backward compatibility or testing).

-----

## Model Zoo & Performance

The following models were benchmarked in the study. We recommend **gpt-oss:20b** for the best balance of accuracy and latency on single-GPU setups.

| Model | Architecture | Total Params | Active Params | Rec. VRAM |
| :--- | :--- | :--- | :--- | :--- |
| **gpt-oss:20b** | **Sparse MoE** | **20B** | **\~2B** | **40GB** |
| Qwen3-30B-A3B | Sparse MoE | 30B | \~2.4B | 60GB\* |
| gemma3:27b | Dense | 27B | 27B | 48GB |

*\*Note: Qwen3-30B-A3B exceeds the 48GB VRAM limit of standard A6000 cards, leading to memory offloading and higher latency.*
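The VRAM comparison in the table can be expressed as a small check. This is only an illustrative sketch: the `ModelSpec` type and `models_that_fit` helper are hypothetical, and the numbers are copied from the "Rec. VRAM" column above.

```python
from typing import List, NamedTuple


class ModelSpec(NamedTuple):
    name: str
    recommended_vram_gb: int


# Values copied from the "Rec. VRAM" column of the table above.
MODEL_ZOO = [
    ModelSpec("gpt-oss:20b", 40),
    ModelSpec("Qwen3-30B-A3B", 60),
    ModelSpec("gemma3:27b", 48),
]


def models_that_fit(available_vram_gb: int) -> List[str]:
    """Return models whose recommended VRAM does not exceed the available VRAM."""
    return [m.name for m in MODEL_ZOO if m.recommended_vram_gb <= available_vram_gb]


# A 48GB card covers gpt-oss:20b and gemma3:27b, but not Qwen3-30B-A3B.
print(models_that_fit(48))
```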

-----

## Citation

If you use this code or the dataset in your research, please cite our preprint:

> **Chow, N.-H., Chang, H., Chen, H.-K., et al.** (2025). "A Multicancer AI Framework for Comprehensive Cancer Surveillance from Pathology Reports." *medRxiv*. DOI: [10.1101/2025.10.21.25338475](https://doi.org/10.1101/2025.10.21.25338475)

### BibTeX

```bibtex
@article{Chow2025.10.21.25338475,
  author = {Chow, Nan-Haw and Chang, Han and Chen, Hung-Kai and Lin, Chen-Yuan and Liu, Ying-Lung and Tseng, Po-Yen and Shiu, Li-Ju and Chu, Yen-Wei and Chung, Pau-Choo and Chang, Kai-Po},
  title = {A Multicancer AI Framework for Comprehensive Cancer Surveillance from Pathology Reports},
  elocation-id = {2025.10.21.25338475},
  year = {2025},
  doi = {10.1101/2025.10.21.25338475},
  publisher = {Cold Spring Harbor Laboratory Press},
  URL = {https://www.medrxiv.org/content/early/2025/11/23/2025.10.21.25338475},
  eprint = {https://www.medrxiv.org/content/early/2025/11/23/2025.10.21.25338475.full.pdf},
  journal = {medRxiv}
}
```

-----

## License

This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.

## Authors

- **Kai-Po Chang** - [GitHub](https://github.com/kblab2024)
- **Hung-Kai Chen** - [GitHub](https://github.com/Walther-Chen)
- **Po-Yen Tseng** - [GitHub](https://github.com/ThomasTsengCMU)

All at **Med NLP Lab, China Medical University**