init
91
README.md
@@ -1,2 +1,93 @@
# regspy

regspy is a regex pattern generator: you enter some data -> select what you want matched and/or not matched -> ??? -> Pattern!

This project started as me trying to learn dspy. It's vibe coded to shit and back, but it works and has some accomplishments:
- Runs on small models with as few as 3B parameters, so it should run on just about anything.
- It outperforms grex ~~in metrics that were defined by me~~.
- Learns from what you feed it. It generated a pattern you liked? Add it to the training set!
- No human-written prompts or rules or "make sure to NOT explode" bs.
- Context-aware generation: it learns from failed patterns and, most importantly, WHY they failed.
- Generates patterns based on a scoring system that ranks patterns by:
    - **matches_all**: Percentage of required items the pattern matches
    - **excludes_all**: Percentage of excluded items the pattern avoids
        - *If no excluded items are selected, this metric's weight is divided equally among the others.*
    - **coherence**: How similar extra matches are to target items
    - **generalization**: Use of character classes (\\d, \\w) vs literals
    - **simplicity**: How short the pattern is and whether it avoids branching
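
Here is a rough sketch of how a weighted score like this could be put together. It is not the code in `regspy/cli.py`: the function names and the use of `re.fullmatch` are assumptions made for illustration, and the coherence, generalization, and simplicity values are placeholders. Only the default weights are taken from `config.json`.

```python
import re

# Default weights, copied from config.json in this commit.
DEFAULT_WEIGHTS = {
    "matches_all": 0.35,
    "excludes_all": 0.25,
    "coherence": 0.15,
    "generalization": 0.15,
    "simplicity": 0.10,
}


def fraction_matched(pattern: str, items: list[str]) -> float:
    """Fraction of items the pattern matches in full (assumed semantics)."""
    if not items:
        return 0.0
    compiled = re.compile(pattern)
    return sum(1 for item in items if compiled.fullmatch(item)) / len(items)


def effective_weights(weights: dict[str, float], has_excludes: bool) -> dict[str, float]:
    """Drop excludes_all and split its weight evenly when nothing is excluded."""
    if has_excludes:
        return dict(weights)
    others = {name: w for name, w in weights.items() if name != "excludes_all"}
    bonus = weights["excludes_all"] / len(others)
    return {name: w + bonus for name, w in others.items()}


def total_score(scores: dict[str, float], weights: dict[str, float], has_excludes: bool) -> float:
    """Weighted sum of the per-criterion scores, each expected in [0, 1]."""
    active = effective_weights(weights, has_excludes)
    return sum(active[name] * scores[name] for name in active)


pattern = r"\d{3}-\d{4}"
match_items = ["555-1234", "867-5309"]
exclude_items = ["555-12345", "abc-defg"]

scores = {
    "matches_all": fraction_matched(pattern, match_items),
    "excludes_all": 1.0 - fraction_matched(pattern, exclude_items),
    # Placeholder values: the real criteria are computed by regspy itself.
    "coherence": 1.0,
    "generalization": 1.0,
    "simplicity": 0.5,
}
print(round(total_score(scores, DEFAULT_WEIGHTS, has_excludes=True), 2))  # 0.95
```

With no excluded items, each of the four remaining criteria picks up an extra 0.25 / 4 = 0.0625, so the weights still sum to 1.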

Is it perfect? Hell no. The training set, scoring system, and hint generation could all be improved, so if you want to have a go at it, I included a CLAUDE.md for you.

But if you're an everyday smooth brain like me who needs a simple pattern on the fly because, for some reason, your brain is physically incapable of remembering that lookaheads exist, regspy should be of some help.

## Features

- **Visual Text Selection**: Highlight text to create match examples (cyan) or exclusions (red)
- **LLM-Powered Generation**: Uses local Ollama with qwen2.5-coder:3b for intelligent pattern creation
- **Training Dataset**: 227+ curated examples, with the ability to add your own
- **Pre-compilation**: Optional rule extraction for faster runtime inference
- **Session Config**: Adjust model, temperature, and scoring weights on the fly

## Installation

- **AutoHotkey v2.0** - [Download](https://www.autohotkey.com/)
- **Python Libs**:
    ```bash
    pip install dspy grex ollama
    ```
- **Ollama**:
    ```bash
    ollama serve   # runs in the foreground, so pull from a second terminal
    ollama pull qwen2.5-coder:3b
    ```
- **Run**:
    ```bash
    AutoHotkey64.exe regspy.ahk # Or just double click regspy.ahk
    ```

### CLI flags

```bash
# Run test suite
python regexgen.py --test

# Pre-compile for faster runtime
python regexgen.py --compile

# Generate regex from JSON input
python regexgen.py input.json output.json

# With custom config
python regexgen.py input.json output.json --config config.json

# Dataset management
python regexgen.py --list-dataset output.json
python regexgen.py --add-example example.json
python regexgen.py --delete-example <index>
```

## Architecture

```
┌─────────────────┐     ┌─────────────────┐     ┌─────────────────┐
│   AutoHotkey    │────▶│  Web Frontend   │────▶│     Python      │
│     (Host)      │◀────│   (WebView2)    │◀────│   (DSPy/LLM)    │
└─────────────────┘     └─────────────────┘     └─────────────────┘
  Window                  Text Selection          Regex Generation
  Management              Highlighting            Multi-criteria
  IPC Bridge              Results Display         Scoring
```

## Configuration

The Config tab allows session-level adjustments:

- **Model**: Ollama model name (default: `qwen2.5-coder:3b`)
- **Temperature**: LLM creativity (default: 0.4)
- **Max Attempts**: Maximum number of refinement iterations (default: 10)
- **Reward Threshold**: Stop early once the score exceeds this value (default: 0.85)
- **Scoring Weights**: Adjust the weights of the 5 scoring criteria
- **Context Window** (`num_ctx`): Ollama context size (default: 8192). Ollama defaults to 4096, which can truncate prompts with many training examples. If you see "truncating input prompt" warnings in the Ollama logs, bump this up. Each 4K increase uses ~200MB of extra VRAM on 3B models.
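
To make Max Attempts and Reward Threshold concrete, here is a minimal sketch of a refinement loop that uses both. It is an illustration under assumptions, not regspy's implementation: `refine`, `propose`, and `score` are hypothetical names, the feedback strings are made up, and the sketch treats a score equal to the threshold as good enough to stop.

```python
import re
from typing import Callable, Optional


def refine(
    propose: Callable[[list[str]], str],
    score: Callable[[str], float],
    max_attempts: int = 10,
    reward_threshold: float = 0.85,
) -> tuple[Optional[str], float]:
    """Try up to max_attempts candidates and stop early on a good enough score."""
    best_pattern: Optional[str] = None
    best_score = float("-inf")
    feedback: list[str] = []  # notes on failed attempts, passed back to propose()

    for attempt in range(1, max_attempts + 1):
        pattern = propose(feedback)
        try:
            re.compile(pattern)
        except re.error as exc:
            # An invalid pattern costs an attempt and is reported as feedback.
            feedback.append(f"attempt {attempt}: {pattern!r} is invalid ({exc})")
            continue

        current = score(pattern)
        if current > best_score:
            best_pattern, best_score = pattern, current
        if current >= reward_threshold:
            break  # good enough, skip the remaining attempts
        feedback.append(f"attempt {attempt}: {pattern!r} scored {current:.2f}")

    return best_pattern, best_score


# Stand-ins for the LLM and the scorer: a fixed list of candidates and scores.
candidates = iter(["(", r"\d+", r"\d{3}-\d{4}"])
fake_scores = {r"\d+": 0.4, r"\d{3}-\d{4}": 0.9}

best, best_score = refine(lambda notes: next(candidates), fake_scores.__getitem__)
print(best, best_score)
```

In this run the first candidate fails to compile, the second scores below the threshold, and the third clears it, so the loop stops after three of the ten allowed attempts.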
4
auto_classes.json
Normal file
@@ -0,0 +1,4 @@
{
  "AutoConfig": "regspy.cli.RegexConfig",
  "AutoProgram": "regspy.cli.RegexProgram"
}
24
config.json
Normal file
@@ -0,0 +1,24 @@
{
  "model": "qwen2.5-coder:3b",
  "ollama_url": "http://localhost:11434",
  "temperature": 0.4,
  "num_ctx": 8192,
  "enable_cache": false,
  "max_attempts": 10,
  "reward_threshold": 0.85,
  "fail_count": null,
  "use_cot": true,
  "dataset_file": "/Users/fadel/Desktop/dev/regspy/dspy/regex-dspy-train.json",
  "compiled_program_path": "/Users/fadel/Desktop/dev/regspy/dspy/regex_compiled.json",
  "compile_threads": 8,
  "compile_candidates": 16,
  "compile_num_rules": 5,
  "debug": true,
  "weights": {
    "matches_all": 0.35,
    "excludes_all": 0.25,
    "coherence": 0.15,
    "generalization": 0.15,
    "simplicity": 0.1
  }
}
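
For orientation, here is a sketch of where `model`, `ollama_url`, `temperature`, and `num_ctx` would end up if you called Ollama's `/api/generate` endpoint directly. regspy presumably goes through DSPy rather than raw HTTP, so treat `build_generate_request` as a hypothetical helper written only to show the mapping.

```python
import json
import urllib.request


def build_generate_request(config: dict, prompt: str) -> urllib.request.Request:
    """Build (but do not send) a POST to Ollama's /api/generate endpoint."""
    payload = {
        "model": config["model"],
        "prompt": prompt,
        "stream": False,  # ask for one JSON response instead of a stream
        "options": {
            "temperature": config["temperature"],
            "num_ctx": config["num_ctx"],  # context window size in tokens
        },
    }
    return urllib.request.Request(
        url=config["ollama_url"].rstrip("/") + "/api/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# The relevant subset of config.json from this commit.
config = {
    "model": "qwen2.5-coder:3b",
    "ollama_url": "http://localhost:11434",
    "temperature": 0.4,
    "num_ctx": 8192,
}

request = build_generate_request(config, "Write a regex that matches a US ZIP code.")
print(request.full_url)
# Sending it requires a running Ollama server:
# with urllib.request.urlopen(request) as response:
#     print(json.loads(response.read())["response"])
```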
44
program.json
Normal file
@@ -0,0 +1,44 @@
{
  "program.predict": {
    "traces": [],
    "train": [],
    "demos": [],
    "signature": {
      "instructions": "Generate a regex pattern from examples.",
      "fields": [
        {
          "prefix": "Text:",
          "description": "The full text to search within"
        },
        {
          "prefix": "Match Items:",
          "description": "Strings the pattern MUST match"
        },
        {
          "prefix": "Exclude Items:",
          "description": "Strings the pattern must NOT match"
        },
        {
          "prefix": "Pattern Hints:",
          "description": "Analysis hints about the match items"
        },
        {
          "prefix": "Reasoning: Let's think step by step in order to",
          "description": "${reasoning}"
        },
        {
          "prefix": "Pattern:",
          "description": "Regex pattern"
        }
      ]
    },
    "lm": null
  },
  "metadata": {
    "dependency_versions": {
      "python": "3.13",
      "dspy": "3.1.2",
      "cloudpickle": "3.1"
    }
  }
}
52
push.py
Normal file
@@ -0,0 +1,52 @@
#!/usr/bin/env python3
"""
Push RegSpy program to Modaic Hub.

Usage:
    uv run push.py your-username/regspy
    uv run push.py your-username/regspy --with-code --commit-message "Initial"
"""

from __future__ import annotations

import argparse

from regspy import RegexConfig, RegexProgram


def create_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(
        description="Push RegSpy PrecompiledProgram to Modaic Hub",
    )
    parser.add_argument("repo", help="Hub repo in the form username/name")
    parser.add_argument("--with-code", action="store_true", help="Bundle code with the push")
    parser.add_argument("--commit-message", help="Optional commit message")
    parser.add_argument("--branch", help="Optional branch name")
    parser.add_argument("--tag", help="Optional tag name")
    parser.add_argument("--private", action="store_true", help="Push to a private repo")
    return parser


def main() -> None:
    parser = create_parser()
    args = parser.parse_args()

    program = RegexProgram(RegexConfig())

    push_kwargs: dict[str, object] = {
        "with_code": args.with_code,
        "private": args.private,
    }
    if args.commit_message:
        push_kwargs["commit_message"] = args.commit_message
    if args.branch:
        push_kwargs["branch"] = args.branch
    if args.tag:
        push_kwargs["tag"] = args.tag

    program.push_to_hub(args.repo, **push_kwargs)
    print(f"Pushed to {args.repo}")


if __name__ == "__main__":
    main()
7
pyproject.toml
Normal file
@@ -0,0 +1,7 @@
[project]
name = "regspy"
version = "0.1.0"
description = "Add your description here"
readme = "README.md"
requires-python = ">=3.13"
dependencies = ["dspy>=3.1.2", "grex>=1.0.2", "modaic>=0.10.3"]
10
regspy/__init__.py
Normal file
@@ -0,0 +1,10 @@
"""RegSpy package exports."""

from .cli import (
    RegexConfig,
    RegexProgram,
    compile_and_save,
    generate_regex,
)

__all__ = ["RegexConfig", "RegexProgram", "compile_and_save", "generate_regex"]
1340
regspy/cli.py
Normal file
File diff suppressed because it is too large