Optimized program
167
README.md
@@ -1,2 +1,167 @@
# text-to-cypher-gepa
# text-to-cypher
LLM-driven automated knowledge graph construction from text using DSPy and Neo4j.

## Project Structure
```sh
text-to-cypher/
├── README.md
├── main.py
├── pyproject.toml
├── uv.lock
└── src/
    ├── __init__.py
    └── neo4j.py
```

## Description
Build knowledge graphs automatically from text using DSPy, Modaic, and Neo4j. This implementation uses OpenAI's GPT-4o to extract entities and relationships from Wikipedia abstracts, generating Cypher statements that create structured knowledge graphs in Neo4j.

### Key Features
- **DSPy-Powered**: Uses DSPy's Chain of Thought for structured entity and relationship extraction
- **Modaic Integration**: Leverages Modaic's `PrecompiledProgram` for reusable, shareable DSPy programs
- **Schema-Aware**: Passes the current Neo4j graph schema to the model, enabling it to reuse existing nodes and relationships
- **Batch Processing**: Processes multiple text samples from NDJSON files
- **Hugging Face Hub**: Push trained programs to the Hub for sharing and versioning

### Optimized Schema Context
The current graph schema is passed to the model as a list of nodes, relationships, and properties in the prompt context. This lets the model reuse elements of the existing schema and connect newly extracted entities to entities and relationships that are already in the graph.
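
As a rough illustration of what such a schema context could look like, the sketch below renders a schema as a plain-text list of nodes and relationships. The `format_schema` helper and the shape of its inputs are assumptions made for this example; the project's own `neo4j.fmt_schema()` may produce a different layout.

```python
def format_schema(nodes: dict[str, list[str]], relationships: list[tuple[str, str, str]]) -> str:
    """Render node labels, their properties, and relationships as prompt text.

    Hypothetical helper: the input shapes are assumed for illustration only.
    """
    lines = ["NODES:"]
    for label in sorted(nodes):
        props = ", ".join(sorted(nodes[label]))
        lines.append(f"(:{label} {{{props}}})")
    lines.append("RELATIONSHIPS:")
    for start, rel_type, end in relationships:
        lines.append(f"(:{start})-[:{rel_type}]->(:{end})")
    return "\n".join(lines)


schema_text = format_schema(
    nodes={"Person": ["name", "born"], "City": ["name"]},
    relationships=[("Person", "BORN_IN", "City")],
)
print(schema_text)
```

Listing labels and relationship types explicitly gives the model concrete names to reuse, which is what keeps new statements connected to the existing graph.
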

## Quick Start
1. Clone the repository
2. Install dependencies using [uv](#installation-with-uv)
3. Create a `.env` file and add the required [environment variables](#environment-variables)
4. Set up [Neo4j](#neo4j-setup) (local Docker or cloud-hosted)
5. Run `uv run main.py` to process example Wikipedia abstracts
6. View your Knowledge Graph in Neo4j Browser

## Installation

### Prerequisites
* Python 3.13+
* OpenAI API Key
* [uv](https://docs.astral.sh/uv/) (Python package manager)
* Neo4j instance (local Docker or cloud-hosted)

### Installation with uv
Install dependencies using uv:
```sh
# Install uv if you haven't already
curl -LsSf https://astral.sh/uv/install.sh | sh

# Install project dependencies
uv sync
```

### Environment Variables
Create a `.env` file in the project root with the following variables:

```sh
# OpenAI API Key
OPENAI_API_KEY=<your-openai-api-key>

# Neo4j Configuration
NEO4J_URI=bolt://localhost:7687 # or neo4j+s://xxx.databases.neo4j.io for cloud
NEO4J_USER=neo4j # optional for local Docker with NEO4J_AUTH=none
NEO4J_PASSWORD=<your-password> # optional for local Docker with NEO4J_AUTH=none

# Modaic Token (optional, for pushing to Hub)
MODAIC_TOKEN=<your-modaic-token>
```
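
The sketch below shows one way these variables could be turned into connection settings, including the case where the credentials are left out for a local container started with `NEO4J_AUTH=none`. The `load_neo4j_settings` helper is hypothetical, and it assumes the `.env` file has already been loaded into the process environment (for example with python-dotenv's `load_dotenv()`).

```python
import os


def load_neo4j_settings(env=None):
    """Collect Neo4j connection settings from environment-style variables.

    Hypothetical helper: auth is None when no password is set, which matches
    a local container started with NEO4J_AUTH=none.
    """
    env = os.environ if env is None else env
    uri = env.get("NEO4J_URI", "bolt://localhost:7687")
    user = env.get("NEO4J_USER", "neo4j")
    password = env.get("NEO4J_PASSWORD")
    auth = (user, password) if password else None
    return {"uri": uri, "auth": auth}


# An empty mapping stands in for a local setup without credentials.
local = load_neo4j_settings({})
cloud = load_neo4j_settings(
    {"NEO4J_URI": "neo4j+s://xxx.databases.neo4j.io", "NEO4J_PASSWORD": "secret"}
)
print(local)
print(cloud)
```

Passing the mapping in as an argument keeps the function easy to try without touching the real environment.
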

## Neo4j Setup

### Option 1: Local Docker (Development)
Run Neo4j locally using Docker:
```sh
docker run \
    --name text-to-cypher \
    --publish=7474:7474 \
    --publish=7687:7687 \
    --env "NEO4J_AUTH=none" \
    neo4j:5.15
```

Access Neo4j Browser at `http://localhost:7474`

### Option 2: Neo4j Aura (Cloud)
1. Create a free instance at [neo4j.com/cloud/aura](https://neo4j.com/cloud/aura)
2. Get your connection URI (e.g., `neo4j+s://xxx.databases.neo4j.io`)
3. Add credentials to your `.env` file

## Usage

### Process Wikipedia Abstracts
Run the main script to process example Wikipedia abstracts and build a knowledge graph:
```sh
uv run main.py
```

This will:
1. Load Wikipedia abstracts from `examples/wikipedia-abstracts-v0_0_1.ndjson`
2. For each abstract, generate a Cypher statement using GPT-4o
3. Execute the Cypher statement in Neo4j
4. Build a connected knowledge graph
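
The steps above can be sketched as a simple loop. This is a minimal outline rather than the code in `main.py`: `build_graph` is a hypothetical name, the `text` field of each NDJSON record is an assumption, and the model call and database call are passed in as plain functions so the loop can be tried without an LLM or a running Neo4j instance.

```python
import json


def build_graph(ndjson_lines, generate, execute, text_key="text"):
    """Run the load -> generate -> execute loop over NDJSON records.

    `generate` maps text to a Cypher string and `execute` runs it; both are
    injected. `text_key` is an assumed field name, not taken from the
    example file.
    """
    executed = 0
    for line in ndjson_lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines between records
        record = json.loads(line)
        statement = generate(record[text_key])
        execute(statement)
        executed += 1
    return executed


# Stand-ins for the model call and the Neo4j driver, for demonstration only.
sample = [
    '{"text": "Ada Lovelace was born in London."}',
    "",
    '{"text": "Alan Turing studied at Cambridge."}',
]
log = []
count = build_graph(sample, generate=lambda t: f"// Cypher for: {t}", execute=log.append)
print(count, log)
```

In the real pipeline, `generate` would also receive the current schema on every iteration, so each statement can build on what earlier statements created.
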

### View Your Knowledge Graph
Navigate to Neo4j Browser:
- Local: `http://localhost:7474/browser/`
- Cloud: Your Neo4j Aura console URL

Run Cypher queries to explore your graph:
```cypher
// Return up to 25 nodes of any label
MATCH (n) RETURN n LIMIT 25;

// Return Person nodes together with their outgoing relationships
MATCH (p:Person)-[r]->(n) RETURN p, r, n LIMIT 50;
```

## Development

### Push to Hugging Face Hub
To share your trained DSPy program on Hugging Face Hub:

```python
# In main.py, uncomment the push_to_hub section
generate_cypher.push_to_hub(
    "your-username/text-to-cypher",
    with_code=True,
    tag="v0.0.1",
    commit_message="Initial release"
)
```

### Customize the Model
Modify `GenerateCypherConfig` in `main.py` to change the model or the token limit:
```python
class GenerateCypherConfig(PrecompiledConfig):
    model: str = "openai/gpt-4o"  # Change model
    max_tokens: int = 1024  # Adjust token limit
```

### Process Custom Text
Modify `main.py` to process your own text:
```python
text = "Your custom text here..."
cypher = generate_cypher(text=text, neo4j_schema=neo4j.fmt_schema())
neo4j.query(cypher.statement.replace('```', ''))
```
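
Removing only the backticks leaves a language tag behind when the model wraps its answer in a fence labelled `cypher`, and the leftover word would make the statement invalid. The sketch below is one possible way to drop whole fence lines instead; `strip_code_fences` is a hypothetical helper, not part of the project.

```python
import re

FENCE = "`" * 3  # three backticks, built this way so the example stays fence-safe
# A fence line: optional indentation, the backticks, an optional language tag.
_FENCE_LINE = re.compile(
    r"^[ \t]*" + re.escape(FENCE) + r"[A-Za-z0-9_-]*[ \t]*$", re.MULTILINE
)


def strip_code_fences(statement: str) -> str:
    """Drop Markdown fence lines (with any language tag) and trim whitespace."""
    without_fence_lines = _FENCE_LINE.sub("", statement)
    # Fall back to plain removal for backticks that share a line with the query.
    return without_fence_lines.replace(FENCE, "").strip()


raw = FENCE + "cypher\nMERGE (p:Person {name: 'Ada Lovelace'})\n" + FENCE
print(strip_code_fences(raw))
```

Statements that contain no fences pass through unchanged apart from surrounding whitespace.
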

## Clean Up

### Stop Neo4j Docker Container
```sh
docker stop text-to-cypher
docker rm text-to-cypher
```

### Remove Virtual Environment
```sh
rm -rf .venv
```

## License
This project is licensed under the MIT License. See the [LICENSE](LICENSE) file for details.

## References
- [DSPy docs](https://dspy-docs.vercel.app/docs/intro)
- [Modaic docs](https://docs.modaic.com/)
- [Neo4j docs](https://neo4j.com/docs/)
- [uv docs](https://docs.astral.sh/uv/)
4
config.json
Normal file
@@ -0,0 +1,4 @@
{
  "model": "openrouter/openai/gpt-4o",
  "max_tokens": 1024
}
47
program.json
Normal file
@@ -0,0 +1,47 @@
{
  "generate_cypher.predict": {
    "traces": [],
    "train": [],
    "demos": [],
    "signature": {
      "instructions": "text\nTask: Given (1) a natural-language question and (2) a Neo4j schema description, output exactly ONE Cypher query that answers the question.\n\nINPUTS\n- question: the user request in natural language.\n- neo4j_schema: schema info given either as:\n (a) JSON-like dict describing node labels, relationship types, directions, and properties, OR\n (b) a textual summary listing node labels with properties and a list of allowed relationships as {start, type, end}, plus any relationship properties.\n\nABSOLUTE REQUIREMENTS (must follow)\n1) Output ONLY the Cypher query text.\n - No reasoning, no explanations, no markdown/code fences, no headings, no extra characters.\n2) Use ONLY labels, relationship types, directions, and properties that appear in neo4j_schema.\n - Do NOT invent labels/properties/relationships.\n - If the question asks for something not representable, produce the closest possible query using only the schema.\n3) Respect relationship direction exactly as specified.\n - If schema says Article -[:PUBLISHED_IN]-> Journal, do not reverse it.\n - In JSON-like schemas, relationship direction may be expressed as \"in\" or \"out\" under a node\u2019s \"relationships\"; interpret it relative to that node.\n4) Return ONLY what the question asks for.\n - If it asks for \u201ctitle values\u201d, return a.title (not whole nodes).\n - If it asks for counts, return counts with clear aliases.\n - Use DISTINCT when the question implies uniqueness.\n5) Produce exactly one valid Cypher statement.\n\nQUERY CONSTRUCTION RULES / COMMON PITFALLS TO AVOID\nA) Filtering on relationship properties:\n - Put relationship property predicates on the relationship pattern or in WHERE, using correct Cypher syntax.\n - Example: MATCH (a)-[r:PUBLISHED_IN]->(j) WHERE r.meta = '220'\n - IMPORTANT: use Cypher string literals with single quotes (e.g., '220'), not JSON-style quotes.\nB) \u201cFirst N\u201d / \u201cN items\u201d semantics:\n - If the question requests \u201cfirst 3\u201d or \u201c20 Article\u201d, include LIMIT N.\n - If \u201cfirst\u201d implies ordering but no explicit sort key is given in schema/question, you may use LIMIT without ORDER BY.\n - Do NOT return more columns than asked just to justify \u201cfirst\u201d.\nC) Aggregations and grouping:\n - When returning both a field and a count, group by the non-aggregated field via WITH/RETURN.\n - Apply HAVING-like filters using WITH ... WHERE (e.g., cities with >1 student).\n - Example pattern:\n MATCH (s:Student)\n WITH s.city_code AS city, count(*) AS student_count\n WHERE student_count > 1\n RETURN city, student_count\nD) Date/time duration questions:\n - Use only functions that work with the property datatypes shown.\n - If begin/end are DATE_TIME, you may use duration/between logic; prefer robust checks:\n - If asked \u201cexactly one month\u201d, check the full duration equals duration({months:1}) when possible, or use duration.between(f.begin, f.end) and compare appropriately.\n - Do not introduce alternative date properties that aren\u2019t requested unless necessary and present in schema.\nE) String matching:\n - For prefix constraints, use STARTS WITH.\n - For exact text match, use equality.\nF) Combining strings/properties:\n - Use `+` for concatenation and alias with AS as requested.\n\nOUTPUT\n- Exactly one Cypher query, and nothing else.",
      "fields": [
        {
          "prefix": "Question:",
          "description": "Question to model using a cypher statement. Use only the provided relationship types and properties in the schema."
        },
        {
          "prefix": "Neo 4 J Schema:",
          "description": "Current graph schema in Neo4j as a list of NODES and RELATIONSHIPS."
        },
        {
          "prefix": "Reasoning: Let's think step by step in order to",
          "description": "${reasoning}"
        },
        {
          "prefix": "Statement:",
          "description": "Cypher statement to query the graph database."
        }
      ]
    },
    "lm": {
      "model": "openrouter/openai/gpt-4o",
      "model_type": "chat",
      "cache": true,
      "num_retries": 3,
      "finetuning_model": null,
      "launch_kwargs": {},
      "train_kwargs": {},
      "temperature": null,
      "max_tokens": 1024,
      "api_base": "https://openrouter.ai/api/v1"
    }
  },
  "metadata": {
    "dependency_versions": {
      "python": "3.13",
      "dspy": "3.0.4",
      "cloudpickle": "3.1"
    }
  }
}