9 Commits

Author SHA1 Message Date
fc9560cc50 Don't cache results 2025-12-27 19:33:57 -08:00
57e7b1fd36 Update README.md 2025-12-27 07:15:49 -08:00
501c224540 Update README.md 2025-12-27 05:07:01 -08:00
64c45ee66c Update README.md 2025-12-27 04:33:07 -08:00
b221ae4b42 Update README.md 2025-12-27 03:30:28 -08:00
31e1186573 Update README.md 2025-12-27 03:28:59 -08:00
1decc02218 Update README.md 2025-12-27 03:18:24 -08:00
5cdedc3403 Syntax fix 2025-12-27 02:59:43 -08:00
194597adbc set LM 2025-12-27 02:32:33 -08:00
7 changed files with 242 additions and 94 deletions

162
README.md

@@ -1,76 +1,159 @@
# dspy-neo4j-knowledge-graph
# text-to-cypher
LLM-driven automated knowledge graph construction from text using DSPy and Neo4j.
![Knowledge Graph](img/kg.png)
## Project Structure
```sh
dspy-neo4j-knowledge-graph/
text-to-cypher/
├── README.md
├── examples
├── requirements.txt
├── run.py
└── src
├── main.py
├── pyproject.toml
├── uv.lock
└── src/
    ├── __init__.py
    └── neo4j.py
```
## Description
Model entities and relationships and build a Knowledge Graph using DSPy, Neo4j, and OpenAI's GPT-4. When given a paragraph or block of text, the app uses the DSPy library and OpenAI's GPT-4 to extract entities and relationships and generate a Cypher statement which is run in Neo4j to create the Knowledge Graph.
Build knowledge graphs automatically from text using DSPy, Modaic, and Neo4j. This implementation uses OpenAI's GPT-4o to extract entities and relationships from Wikipedia abstracts, generating Cypher statements that create structured knowledge graphs in Neo4j.
### Key Features
- **DSPy-Powered**: Uses DSPy's Chain of Thought for structured entity and relationship extraction
- **Modaic Integration**: Leverages Modaic's PrecompiledProgram for reusable, shareable DSPy programs
- **Schema-Aware**: Passes the current Neo4j graph schema to the model, enabling it to reuse existing nodes and relationships
- **Batch Processing**: Processes multiple text samples from NDJSON files
- **Hugging Face Hub**: Push trained programs to the Hub for sharing and versioning
### Optimized Schema Context
The current graph schema is passed to the model as a list of nodes, relationships and properties in the context of the prompt. This allows the model to use elements from the existing schema and make connections between existing entities and relationships.
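
As a rough illustration of what such a schema context might look like, the sketch below flattens node labels, properties, and relationship types into a list of strings. The function name `format_schema` and the input shapes are assumptions made for this example; the repository's own `neo4j.fmt_schema()` may produce a different layout.

```python
from typing import Dict, List, Tuple


def format_schema(
    nodes: Dict[str, List[str]],
    relationships: List[Tuple[str, str, str]],
) -> List[str]:
    """Flatten a graph schema into prompt-friendly lines (hypothetical helper)."""
    lines = ["NODES:"]
    for label, props in sorted(nodes.items()):
        # One line per node label, listing its known property names.
        prop_list = ", ".join(sorted(props))
        lines.append("(:" + label + " {" + prop_list + "})")
    lines.append("RELATIONSHIPS:")
    for start, rel_type, end in sorted(relationships):
        # One line per (start label, relationship type, end label) triple.
        lines.append(f"(:{start})-[:{rel_type}]->(:{end})")
    return lines


if __name__ == "__main__":
    for line in format_schema(
        {"Person": ["name"], "City": ["name"]},
        [("Person", "BORN_IN", "City")],
    ):
        print(line)
```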
## Quick Start
1. Clone the repository.
2. Create a [Python virtual environment](#python-virtual-environment) and install the required packages.
3. Create a `.env` file and add the required [environment variables](#environment-variables).
4. [Run Neo4j using Docker](#usage).
5. Run `python3 run.py` and paste your text in the prompt.
6. Navigate to `http://localhost:7474/browser/` to view the Knowledge Graph in Neo4j Browser.
1. Clone the repository
2. Install dependencies using [uv](#installation-with-uv)
3. Create a `.env` file and add the required [environment variables](#environment-variables)
4. Set up [Neo4j](#neo4j-setup) (local Docker or cloud-hosted)
5. Run `uv run main.py` to process example Wikipedia abstracts
6. View your Knowledge Graph in Neo4j Browser
## Installation
### Prerequisites
* Python 3.12
* Python 3.13+
* OpenAI API Key
* Docker
* [uv](https://docs.astral.sh/uv/) (Python package manager)
* Neo4j instance (local Docker or cloud-hosted)
### Installation with uv
Install dependencies using uv:
```sh
# Install uv if you haven't already
curl -LsSf https://astral.sh/uv/install.sh | sh
# Install project dependencies
uv sync
```
### Environment Variables
Before you begin, make sure to create a `.env` file and add your OpenAI API key.
Create a `.env` file in the project root with the following variables:
```sh
NEO4J_URI=bolt://localhost:7687
OPENAI_API_KEY=<your-api-key>
# OpenAI API Key
OPENAI_API_KEY=<your-openai-api-key>
# Neo4j Configuration
NEO4J_URI=bolt://localhost:7687 # or neo4j+s://xxx.databases.neo4j.io for cloud
NEO4J_USER=neo4j # optional for local Docker with NEO4J_AUTH=none
NEO4J_PASSWORD=<your-password> # optional for local Docker with NEO4J_AUTH=none
# Modaic Token (optional, for pushing to Hub)
MODAIC_TOKEN=<your-modaic-token>
```
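
One possible way to turn these variables into connection settings is sketched below. The helper name `neo4j_settings` and the fallback URI are assumptions for illustration, not part of the repository; authentication is left as `None` when no user or password is set, which matches a local container started with `NEO4J_AUTH=none`.

```python
import os
from typing import Mapping, Optional


def neo4j_settings(env: Optional[Mapping[str, str]] = None) -> dict:
    """Read Neo4j connection settings from an environment mapping (hypothetical helper)."""
    env = os.environ if env is None else env
    uri = env.get("NEO4J_URI", "bolt://localhost:7687")  # assumed default for local Docker
    user = env.get("NEO4J_USER")
    password = env.get("NEO4J_PASSWORD")
    # Only build credentials when both values are present.
    auth = (user, password) if user and password else None
    return {"uri": uri, "auth": auth}
```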
### Python Virtual Environment
Create a Python virtual environment and install the required packages.
```sh
python3 -m venv .venv
source .venv/bin/activate
pip install --upgrade pip
pip install -r requirements.txt
```
## Neo4j Setup
## Usage
Run Neo4j using Docker.
### Option 1: Local Docker (Development)
Run Neo4j locally using Docker:
```sh
docker run \
--name dspy-kg \
--name text-to-cypher \
--publish=7474:7474 \
--publish=7687:7687 \
--env "NEO4J_AUTH=none" \
neo4j:5.15
```
## Clean Up
Stop and remove the Neo4j container.
Access Neo4j Browser at `http://localhost:7474`
### Option 2: Neo4j Aura (Cloud)
1. Create a free instance at [neo4j.com/cloud/aura](https://neo4j.com/cloud/aura)
2. Get your connection URI (e.g., `neo4j+s://xxx.databases.neo4j.io`)
3. Add credentials to your `.env` file
## Usage
### Process Wikipedia Abstracts
Run the main script to process example Wikipedia abstracts and build a knowledge graph:
```sh
docker stop dspy-kg
docker rm dspy-kg
uv run main.py
```
Deactivate the Python virtual environment.
This will:
1. Load Wikipedia abstracts from `examples/wikipedia-abstracts-v0_0_1.ndjson`
2. For each abstract, generate a Cypher statement using GPT-4o
3. Execute the Cypher statement in Neo4j
4. Build a connected knowledge graph
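
The four steps above can be sketched as a small loop. This is a minimal sketch, not the script's actual code: `build_graph` and its callable parameters (`generate`, `run_query`, `get_schema`) are hypothetical stand-ins for the DSPy program and the Neo4j wrapper, and each NDJSON record is assumed to carry a `text` field.

```python
import json
from typing import Callable, Iterable, List


def build_graph(
    lines: Iterable[str],
    generate: Callable[[str, List[str]], str],
    run_query: Callable[[str], None],
    get_schema: Callable[[], List[str]],
) -> int:
    """Process NDJSON lines one by one and return how many were handled."""
    processed = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines in the NDJSON file
        text = json.loads(line)["text"]
        # Re-read the schema on every iteration so that nodes created by
        # earlier abstracts are visible to the model for later ones.
        statement = generate(text, get_schema())
        run_query(statement)
        processed += 1
    return processed
```

Passing the collaborators in as callables keeps the loop testable without a running database or an API key.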
### View Your Knowledge Graph
Navigate to Neo4j Browser:
- Local: `http://localhost:7474/browser/`
- Cloud: Your Neo4j Aura console URL
Run Cypher queries to explore your graph:
```cypher
MATCH (n) RETURN n LIMIT 25
MATCH (p:Person)-[r]->(n) RETURN p, r, n LIMIT 50
```
## Development
### Push to Hugging Face Hub
To share your trained DSPy program on Hugging Face Hub:
```python
# In main.py, uncomment the push_to_hub section
generate_cypher.push_to_hub(
    "your-username/text-to-cypher",
    with_code=True,
    tag="v0.0.1",
    commit_message="Initial release"
)
```
### Customize the Model
Modify the `GenerateCypherConfig` in `main.py` to customize:
```python
class GenerateCypherConfig(PrecompiledConfig):
    model: str = "openai/gpt-4o"  # Change model
    max_tokens: int = 1024  # Adjust token limit
```
### Process Custom Text
Modify `main.py` to process your own text:
```python
text = "Your custom text here..."
cypher = generate_cypher(text=text, neo4j_schema=neo4j.fmt_schema())
neo4j.query(cypher.statement.replace('```', ''))
```
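
The snippet above removes the backticks but would leave a language tag such as `cypher` behind if the model wraps its answer in a tagged fence. A slightly more defensive cleanup might look like the sketch below; `strip_code_fences` is a hypothetical helper, not something the repository provides.

```python
import re

# Matches a whole line consisting of a Markdown fence with an optional language tag.
_FENCE_LINE = re.compile(r"^```[a-zA-Z]*\s*$", re.MULTILINE)


def strip_code_fences(statement: str) -> str:
    """Remove Markdown fence lines from a generated Cypher statement."""
    return _FENCE_LINE.sub("", statement).strip()
```

It could be used in place of the `replace` call, for example `neo4j.query(strip_code_fences(cypher.statement))`.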
## Clean Up
### Stop Neo4j Docker Container
```sh
docker stop text-to-cypher
docker rm text-to-cypher
```
### Remove Virtual Environment
```sh
deactivate
rm -rf .venv
```
@@ -79,7 +162,6 @@ This project is licensed under the MIT License - see the [LICENSE](LICENSE) file
## References
- [DSPy docs](https://dspy-docs.vercel.app/docs/intro)
- [Modaic docs](https://docs.modaic.com/)
- [Neo4j docs](https://neo4j.com/docs/)
## Contact
**Primary Contact:** [@chrisammon3000](https://github.com/chrisammon3000)
- [uv docs](https://docs.astral.sh/uv/)


@@ -1,4 +1,4 @@
{
"AutoConfig": "main.GenerateCypherConfig",
"AutoProgram": "main.GenerateCypher"
"AutoConfig": "modules.GenerateCypherConfig",
"AutoProgram": "modules.GenerateCypher"
}


@@ -1,5 +1,5 @@
{
"model": "gpt-4",
"neo4j_schema": [],
"max_tokens": 1024
"model": "openrouter/openai/gpt-4o",
"max_tokens": 1024,
"cache": false
}

48
main.py

@@ -1,48 +0,0 @@
from dotenv import load_dotenv
import dspy
from modaic import PrecompiledProgram, PrecompiledConfig

load_dotenv()


class CypherFromText(dspy.Signature):
    """Instructions:
    Create a Cypher MERGE statement to model all entities and relationships found in the text following these guidelines:
    - Refer to the provided schema and use existing or similar nodes, properties or relationships before creating new ones.
    - Use generic categories for node and relationship labels."""

    text = dspy.InputField(
        desc="Text to model using nodes, properties and relationships."
    )
    neo4j_schema = dspy.InputField(
        desc="Current graph schema in Neo4j as a list of NODES and RELATIONSHIPS."
    )
    statement = dspy.OutputField(
        desc="Cypher statement to merge nodes and relationships found in the text."
    )


class GenerateCypherConfig(PrecompiledConfig):
    neo4j_schema: list[str] = []
    model: str = "gpt-4"
    max_tokens: int = 1024


class GenerateCypher(PrecompiledProgram):
    config: GenerateCypherConfig

    def _init_(self, config: GenerateCypherConfig, **kwargs):
        super()._init_(**kwargs)
        self.lm = dspy.LM(
            model=config.model,
            max_tokens=config.max_tokens,
        )
        self.generate_cypher = dspy.ChainOfThought(CypherFromText)
        self.generate_cypher.set_lm(self.lm)

    def forward(self, text: str, neo4j_schema: list[str]):
        return self.generate_cypher(text=text, neo4j_schema=neo4j_schema)


if __name__ == "__main__":
    generate_cypher = GenerateCypher(GenerateCypherConfig())
    generate_cypher.push_to_hub("farouk1/text-to-cypher", with_code=True, tag="v0.0.2", commit_message="set LM")

76
modules.py Normal file

@@ -0,0 +1,76 @@
import os
import dspy
from dotenv import load_dotenv
from modaic import PrecompiledProgram, PrecompiledConfig

load_dotenv()


class CypherFromQuestion(dspy.Signature):
    """Task: Generate Cypher statement to query a graph database.
    Instructions: Use only the provided relationship types and properties in the schema.
    Do not use any other relationship types or properties that are not provided in the schema.
    Do not include any explanations or apologies in your responses.
    Do not respond to any questions that might ask anything else than for you to construct a Cypher statement.
    Do not include any text except the generated Cypher statement.
    """

    question = dspy.InputField(
        desc="Question to model using a cypher statement. Use only the provided relationship types and properties in the schema."
    )
    neo4j_schema = dspy.InputField(
        desc="Current graph schema in Neo4j as a list of NODES and RELATIONSHIPS."
    )
    statement = dspy.OutputField(desc="Cypher statement to query the graph database.")


class GenerateCypherConfig(PrecompiledConfig):
    model: str = "openrouter/openai/gpt-4o"  # OPENROUTER ONLY
    max_tokens: int = 1024
    cache: bool = False


class GenerateCypher(PrecompiledProgram):
    config: GenerateCypherConfig

    def __init__(self, config: GenerateCypherConfig, **kwargs):
        super().__init__(config=config, **kwargs)
        self.lm = dspy.LM(
            model=config.model,
            max_tokens=config.max_tokens,
            api_base="https://openrouter.ai/api/v1",
            cache=config.cache,
        )
        self.generate_cypher = dspy.ChainOfThought(CypherFromQuestion)
        self.generate_cypher.set_lm(self.lm)

    def forward(self, question: str, neo4j_schema: list[str]):
        return self.generate_cypher(question=question, neo4j_schema=neo4j_schema)


generate_cypher = GenerateCypher(GenerateCypherConfig())

if __name__ == "__main__":
    """
    from pathlib import Path
    import json
    examples_path = Path(__file__).parent / "examples" / "wikipedia-abstracts-v0_0_1.ndjson"
    with open(examples_path, "r") as f:
        for line in f:
            data = json.loads(line)
            text = data["text"]
            print("TEXT TO PROCESS:\n", text[:50])
            cypher = generate_cypher(text=text, neo4j_schema=neo4j.fmt_schema())
            neo4j.query(cypher.statement.replace('```', ''))
            print("CYPHER STATEMENT:\n", cypher.statement)
            schema = neo4j.fmt_schema()
            print("SCHEMA:\n", schema)
    """
    generate_cypher.push_to_hub(
        "farouk1/text-to-cypher",
        with_code=True,
        tag="v1.0.1",
        commit_message="Don't cache results",
    )


@@ -1,4 +1,42 @@
{
  "generate_cypher.predict": {
    "traces": [],
    "train": [],
    "demos": [],
    "signature": {
      "instructions": "Task: Generate Cypher statement to query a graph database.\nInstructions: Use only the provided relationship types and properties in the schema.\nDo not use any other relationship types or properties that are not provided in the schema.\nDo not include any explanations or apologies in your responses.\nDo not respond to any questions that might ask anything else than for you to construct a Cypher statement.\nDo not include any text except the generated Cypher statement.",
      "fields": [
        {
          "prefix": "Question:",
          "description": "Question to model using a cypher statement. Use only the provided relationship types and properties in the schema."
        },
        {
          "prefix": "Neo 4 J Schema:",
          "description": "Current graph schema in Neo4j as a list of NODES and RELATIONSHIPS."
        },
        {
          "prefix": "Reasoning: Let's think step by step in order to",
          "description": "${reasoning}"
        },
        {
          "prefix": "Statement:",
          "description": "Cypher statement to query the graph database."
        }
      ]
    },
    "lm": {
      "model": "openrouter/openai/gpt-4o",
      "model_type": "chat",
      "cache": false,
      "num_retries": 3,
      "finetuning_model": null,
      "launch_kwargs": {},
      "train_kwargs": {},
      "temperature": null,
      "max_tokens": 1024,
      "api_base": "https://openrouter.ai/api/v1"
    }
  },
  "metadata": {
    "dependency_versions": {
      "python": "3.13",


@@ -4,4 +4,4 @@ version = "0.1.0"
description = "Add your description here"
readme = "README.md"
requires-python = ">=3.13"
dependencies = ["dspy>=3.0.4", "modaic>=0.8.2", "neo4j~=5.18.0", "python-dotenv~=1.0.1"]
dependencies = ["datasets>=4.4.2", "dspy>=3.0.4", "modaic>=0.8.3", "neo4j~=5.18.0", "python-dotenv~=1.0.1", "sacrebleu>=2.5.1"]