# CiteGeist Current Architecture
## Overview
CiteGeist is currently a local, BibTeX-native tool that provides:
- BibTeX parsing and storage
- Local full-text search (SQLite FTS5)
- Entry provenance tracking
- Citation graph traversal
- Topic-based expansion
## Core Modules
### Source Management
- **sources.py**: `SourceClient` class for HTTP requests with caching and retry logic
  - Base HTTP client with JSON/XML/text support
  - Built-in retry with exponential backoff
  - Cache directory support
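
The retry and caching behaviour described for `SourceClient` might look roughly like the following. This is a minimal sketch, not the actual `sources.py` code: the function name `fetch_with_retry`, its parameters, the SHA-256 cache file naming, and the choice to retry only on `OSError` are all assumptions made for illustration.

```python
import hashlib
import time
from pathlib import Path
from typing import Callable


def fetch_with_retry(
    url: str,
    fetch: Callable[[str], str],
    cache_dir: Path,
    max_attempts: int = 4,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Return the body for `url`, using an on-disk cache and retrying failures.

    Hypothetical helper: `fetch` is any callable that performs the request.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")

    cache_dir.mkdir(parents=True, exist_ok=True)
    # Hash the URL so that any URL maps to a safe file name.
    digest = hashlib.sha256(url.encode("utf-8")).hexdigest()
    cache_file = cache_dir / f"{digest}.txt"
    if cache_file.exists():
        return cache_file.read_text(encoding="utf-8")

    for attempt in range(max_attempts):
        try:
            body = fetch(url)
        except OSError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            # Exponential backoff: base, 2*base, 4*base, ...
            sleep(base_delay * (2 ** attempt))
        else:
            cache_file.write_text(body, encoding="utf-8")
            return body
    raise AssertionError("unreachable")
```

Passing `fetch` and `sleep` as parameters keeps the sketch testable without network access. `urllib.error.URLError` is a subclass of `OSError`, so a fetch function built on `urllib.request.urlopen` would be covered by the retry.
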
### Metadata Resolution
- **resolve.py**: `MetadataResolver` class for entry resolution
  - DOI → CrossRef lookup
  - PMID → PubMed lookup
  - arXiv, DBLP, OpenAlex lookup
  - Title search fallback with best-match selection
  - DataCite integration
  - Returns `Resolution` objects with provenance
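
The identifier-first, title-fallback strategy listed above can be sketched as follows. The shape of `Resolution`, the `resolve` and `best_title_match` functions, and the 0.9 similarity threshold are hypothetical; the real `MetadataResolver` may structure this differently. Lookups are passed in as plain callables so the sketch runs without network access.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher
from typing import Callable, Optional

Fields = dict[str, str]
Lookup = Callable[[str], Optional[Fields]]


@dataclass
class Resolution:
    """Hypothetical result type: metadata plus where it came from."""
    fields: Fields
    source: str       # service that produced the metadata, e.g. "crossref"
    matched_on: str   # entry field that led to the match, e.g. "doi"


def best_title_match(title: str, candidates: list[Fields],
                     threshold: float = 0.9) -> Optional[Fields]:
    """Pick the candidate whose title is most similar, if similar enough."""
    def score(candidate: Fields) -> float:
        other = candidate.get("title", "")
        return SequenceMatcher(None, title.lower(), other.lower()).ratio()

    best = max(candidates, key=score, default=None)
    if best is not None and score(best) >= threshold:
        return best
    return None


def resolve(entry: Fields,
            lookups: list[tuple[str, str, Lookup]],
            title_search: Lookup) -> Optional[Resolution]:
    """Try each (field, source, lookup) in order, then fall back to title."""
    for field_name, source, lookup in lookups:
        value = entry.get(field_name)
        if not value:
            continue
        found = lookup(value)
        if found:
            return Resolution(found, source, field_name)

    title = entry.get("title")
    if title:
        found = title_search(title)
        if found:
            return Resolution(found, "title-search", "title")
    return None
```

Recording `source` and `matched_on` on every result is what would let later stages attribute each field to the service that supplied it.
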
### Storage
- **storage.py**: `BibliographyStore` class (SQLite)
  - Tables: `entries`, `creators`, `entry_creators`, `identifiers`, `relations`, `topics`, `entry_topics`, `field_provenance`, `relation_provenance`
  - FTS5 text search integration
  - Field-level provenance tracking
  - Citation graph support (`cites`, `cited_by` edges)
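
As an illustration of the FTS5 integration, the sketch below pairs an `entries` table with a separate FTS5 virtual table keyed by the same row id. The column names, the `entries_fts` table, and the helper functions are assumptions, not the actual `BibliographyStore` schema. It also assumes the Python `sqlite3` module is linked against an SQLite build with FTS5 enabled.

```python
import sqlite3


def build_index(conn: sqlite3.Connection) -> None:
    """Create a minimal entries table and a matching FTS5 index."""
    conn.executescript(
        """
        CREATE TABLE entries (
            id INTEGER PRIMARY KEY,
            citation_key TEXT NOT NULL UNIQUE,
            title TEXT,
            abstract TEXT
        );
        CREATE VIRTUAL TABLE entries_fts USING fts5(title, abstract);
        """
    )


def add_entry(conn: sqlite3.Connection, key: str, title: str, abstract: str) -> None:
    """Insert an entry and index its text under the same rowid."""
    cur = conn.execute(
        "INSERT INTO entries (citation_key, title, abstract) VALUES (?, ?, ?)",
        (key, title, abstract),
    )
    conn.execute(
        "INSERT INTO entries_fts (rowid, title, abstract) VALUES (?, ?, ?)",
        (cur.lastrowid, title, abstract),
    )


def search(conn: sqlite3.Connection, query: str, limit: int = 10) -> list[str]:
    """Return citation keys matching an FTS5 query, best match first."""
    rows = conn.execute(
        """
        SELECT e.citation_key
        FROM entries_fts
        JOIN entries AS e ON e.id = entries_fts.rowid
        WHERE entries_fts MATCH ?
        ORDER BY entries_fts.rank
        LIMIT ?
        """,
        (query, limit),
    )
    return [row[0] for row in rows]
```

Keeping the index in its own virtual table means the two inserts must stay in step; an external-content FTS5 table with triggers is a common alternative.
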
### BibTeX Processing
- **bibtex.py**: `BibEntry` dataclass and parsing/rendering
  - BibTeX → `BibEntry` conversion
  - `BibEntry` → BibTeX rendering
  - Citation key generation
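
A rough picture of rendering and key generation follows. The fields of `BibEntry` and the `author + year + first long title word` key scheme are assumptions chosen for illustration; the actual dataclass and key format in `bibtex.py` may differ. Parsing is omitted because a robust BibTeX parser does not fit in a short sketch.

```python
import re
from dataclasses import dataclass, field


@dataclass
class BibEntry:
    """Hypothetical minimal entry: type, citation key, and raw fields."""
    entry_type: str
    key: str
    fields: dict[str, str] = field(default_factory=dict)


def make_citation_key(fields: dict[str, str]) -> str:
    """Build a key such as 'knuth1968computer' from author, year, title."""
    first_author = fields.get("author", "").split(" and ")[0].strip()
    if "," in first_author:
        # "Family, Given" form
        family = first_author.split(",")[0]
    else:
        # "Given Family" form: take the last token, if any
        parts = first_author.split()
        family = parts[-1] if parts else "anon"
    family = re.sub(r"[^A-Za-z0-9]", "", family).lower() or "anon"

    year = fields.get("year", "nd")
    words = re.findall(r"[A-Za-z0-9]+", fields.get("title", ""))
    # Skip short words such as "The" or "of".
    title_word = next((w.lower() for w in words if len(w) > 3), "")
    return f"{family}{year}{title_word}"


def render_bibtex(entry: BibEntry) -> str:
    """Render one entry as BibTeX, one brace-delimited field per line."""
    lines = [f"@{entry.entry_type}{{{entry.key},"]
    for name, value in entry.fields.items():
        lines.append(f"  {name} = {{{value}}},")
    lines.append("}")
    return "\n".join(lines)
```
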
### CLI and Server
- **cli.py**: Command-line interface
- **app_server.py**: Local HTTP server for UI/JSON API
- **app_api.py**: JSON API adapter surface
### Expansion and Discovery
- **expand.py**: Citation graph expansion workflows
- **extract.py**: Plaintext reference extraction
- **bootstrap.py**: Topic bootstrap and expansion
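
Citation graph expansion is, at its core, a bounded graph traversal. The sketch below uses breadth-first search with a depth limit; the `expand` function, its `get_cited` callback, and the `max_depth` parameter are hypothetical names, and the workflows in `expand.py` may use a different strategy.

```python
from collections import deque
from typing import Callable, Iterable


def expand(
    seeds: Iterable[str],
    get_cited: Callable[[str], Iterable[str]],
    max_depth: int = 2,
) -> dict[str, int]:
    """Breadth-first expansion from seed entries.

    Returns a mapping from each discovered entry key to its distance
    (number of citation edges) from the nearest seed.
    """
    depth_of: dict[str, int] = {}
    queue: deque[str] = deque()
    for seed in seeds:
        if seed not in depth_of:
            depth_of[seed] = 0
            queue.append(seed)

    while queue:
        current = queue.popleft()
        if depth_of[current] >= max_depth:
            continue  # do not follow edges beyond the depth limit
        for neighbour in get_cited(current):
            if neighbour not in depth_of:  # also guards against cycles
                depth_of[neighbour] = depth_of[current] + 1
                queue.append(neighbour)
    return depth_of
```

Passing a callback for the `cited_by` direction instead of `cites` would walk the graph backwards with the same code.
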
## Current State Summary
**Completed/Usable:**
- BibTeX parsing and storage
- Identifier-based resolution (DOI, PMID, arXiv, DBLP, OpenAlex)
- Title search with best-match selection
- Citation graph traversal and expansion
- Field provenance tracking
- Local search with FTS5
- Topic-based discovery workflows
**Not Yet Implemented (from new roadmap):**
- Plugin-based source architecture
- Multi-source record merging
- pgvector embeddings
- Full-text OA link retrieval
- Semantic Scholar integration
- OpenCitations integration
- Unified API endpoints for multi-source queries
## Data Flow
1. **Ingest**: BibTeX file → parse → store in the `entries` table
2. **Resolve**: Entry → `resolve_doi`/`resolve_pmid`/`resolve_arxiv` → fetch metadata → merge with existing fields
3. **Expand**: Start from entry → traverse citation edges → discover new entries
4. **Search**: Query FTS5 index → retrieve relevant entries
5. **Export**: Entries → render BibTeX → output file
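
The merge at the end of the Resolve step is where field-level provenance would be recorded. The following sketch shows one possible policy, in which existing non-empty values win unless `overwrite` is set. The `FieldProvenance` type, the `merge_fields` function, and the policy itself are assumptions, not documented CiteGeist behaviour.

```python
from dataclasses import dataclass


@dataclass
class FieldProvenance:
    """Hypothetical record of which source supplied a field value."""
    field: str
    value: str
    source: str


def merge_fields(
    existing: dict[str, str],
    resolved: dict[str, str],
    source: str,
    overwrite: bool = False,
) -> tuple[dict[str, str], list[FieldProvenance]]:
    """Merge resolved metadata into an entry's fields.

    Returns the merged fields and one provenance record per field that
    was actually added or changed.
    """
    merged = dict(existing)
    provenance: list[FieldProvenance] = []
    for name, value in resolved.items():
        if not value:
            continue  # never replace data with an empty value
        current = merged.get(name)
        if current == value:
            continue  # nothing changes, so nothing to record
        if current and not overwrite:
            continue  # keep the existing non-empty value
        merged[name] = value
        provenance.append(FieldProvenance(name, value, source))
    return merged, provenance
```
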
## Database Schema
SQLite-based storage with:
- Normalized entry fields
- Creator relationships
- Identifier mapping
- Citation relations
- Topic associations
- Field provenance metadata
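
To make the schema outline more concrete, here is a hedged sketch covering a subset of the tables. The table names come from the storage module description; every column name, constraint, and the `create_store` and `find_by_identifier` helpers are assumptions, and the real schema in `storage.py` may differ.

```python
import sqlite3
from typing import Optional

# Illustrative subset only: topics, entry_topics, and relation_provenance
# are omitted for brevity.
SCHEMA = """
CREATE TABLE entries (
    id INTEGER PRIMARY KEY,
    citation_key TEXT NOT NULL UNIQUE,
    entry_type TEXT NOT NULL,
    title TEXT
);
CREATE TABLE creators (
    id INTEGER PRIMARY KEY,
    name TEXT NOT NULL UNIQUE
);
CREATE TABLE entry_creators (
    entry_id INTEGER NOT NULL REFERENCES entries(id),
    creator_id INTEGER NOT NULL REFERENCES creators(id),
    position INTEGER NOT NULL,          -- author order
    PRIMARY KEY (entry_id, position)
);
CREATE TABLE identifiers (
    entry_id INTEGER NOT NULL REFERENCES entries(id),
    scheme TEXT NOT NULL,               -- e.g. 'doi', 'pmid', 'arxiv'
    value TEXT NOT NULL,
    UNIQUE (scheme, value)
);
CREATE TABLE relations (
    source_id INTEGER NOT NULL REFERENCES entries(id),
    target_id INTEGER NOT NULL REFERENCES entries(id),
    relation_type TEXT NOT NULL,        -- e.g. 'cites'
    PRIMARY KEY (source_id, target_id, relation_type)
);
CREATE TABLE field_provenance (
    entry_id INTEGER NOT NULL REFERENCES entries(id),
    field_name TEXT NOT NULL,
    source TEXT NOT NULL,               -- e.g. 'crossref'
    recorded_at TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP
);
"""


def create_store(path: str = ":memory:") -> sqlite3.Connection:
    """Open a database and create the sketch schema."""
    conn = sqlite3.connect(path)
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript(SCHEMA)
    return conn


def find_by_identifier(conn: sqlite3.Connection, scheme: str, value: str) -> Optional[str]:
    """Return the citation key of the entry with this identifier, if any."""
    row = conn.execute(
        """
        SELECT e.citation_key
        FROM identifiers AS i
        JOIN entries AS e ON e.id = i.entry_id
        WHERE i.scheme = ? AND i.value = ?
        """,
        (scheme, value),
    ).fetchone()
    return row[0] if row else None
```
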