CiteGeist Current Architecture
Overview
CiteGeist is currently designed as a local, BibTeX-native tooling system that provides:
- BibTeX parsing and storage
- Local text search (FTS5)
- Entry provenance tracking
- Citation graph traversal
- Topic-based expansion
Core Modules
Source Management
- sources.py: SourceClient class for HTTP requests with caching and retry logic
- Base HTTP client with JSON/XML/text support
- Built-in retry with exponential backoff
- Cache directory support
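The retry behavior listed above can be illustrated with a minimal sketch. It is not the code in sources.py: the names `fetch_with_retry`, `backoff_delays`, `max_attempts`, and `base_delay` are hypothetical, and the doubling schedule is an assumption about what "exponential backoff" means here. The sleep function is injectable so the schedule can be checked without waiting.

```python
import time
import urllib.error
from typing import Callable


def backoff_delays(max_attempts: int, base_delay: float = 0.5) -> list[float]:
    """Waits between attempts: base, 2*base, 4*base, ... (one fewer than attempts)."""
    return [base_delay * (2 ** i) for i in range(max_attempts - 1)]


def fetch_with_retry(
    fetch: Callable[[], bytes],
    max_attempts: int = 4,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> bytes:
    """Call fetch(); on a network error, wait and retry with doubling delays."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    delays = backoff_delays(max_attempts, base_delay)
    for attempt in range(max_attempts):
        try:
            return fetch()
        except (urllib.error.URLError, TimeoutError):
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error to the caller
            sleep(delays[attempt])
    raise AssertionError("loop always returns or raises")
```

In this sketch `fetch` would wrap the actual HTTP call, for example a closure around `urllib.request.urlopen`. Keeping the transport separate from the retry loop is a design choice made for the example only.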
Metadata Resolution
- resolve.py: MetadataResolver class for entry resolution
- DOI → CrossRef lookup
- PMID → PubMed lookup
- arXiv, DBLP, OpenAlex lookup
- Title search fallback with best-match selection
- DataCite integration
- Returns Resolution objects with provenance
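One way to picture the resolution order described above is an identifier-first dispatch with a title-search fallback. The sketch below is illustrative only: the `Resolution` fields, the `RESOLUTION_ORDER` priority, the `resolve` and `best_title_match` helpers, and the 0.8 similarity threshold are all assumptions, not the actual resolve.py API. Lookups are passed in as plain callables so no network access is needed.

```python
import difflib
from dataclasses import dataclass
from typing import Callable, Optional

Lookup = Callable[[str], Optional[dict[str, str]]]


@dataclass
class Resolution:
    """Hypothetical result type: resolved fields plus their provenance."""
    fields: dict[str, str]
    source: str      # which service answered, e.g. "crossref"
    matched_on: str  # which entry field was used for the lookup


# (entry field, source label) pairs in the order they are tried (assumed priority).
RESOLUTION_ORDER = [
    ("doi", "crossref"),
    ("pmid", "pubmed"),
    ("arxiv", "arxiv"),
    ("title", "title_search"),  # fallback when no identifier resolves
]


def resolve(entry: dict[str, str], lookups: dict[str, Lookup]) -> Optional[Resolution]:
    """Return the first successful lookup, tagged with where it came from."""
    for entry_field, source in RESOLUTION_ORDER:
        value = entry.get(entry_field)
        lookup = lookups.get(source)
        if not value or lookup is None:
            continue
        found = lookup(value)
        if found:
            return Resolution(fields=found, source=source, matched_on=entry_field)
    return None


def best_title_match(
    query: str, candidates: list[dict[str, str]], threshold: float = 0.8
) -> Optional[dict[str, str]]:
    """Pick the candidate whose title is most similar to the query, if similar enough."""
    def score(candidate: dict[str, str]) -> float:
        title = candidate.get("title", "")
        return difflib.SequenceMatcher(None, query.lower(), title.lower()).ratio()

    best = max(candidates, key=score, default=None)
    if best is None or score(best) < threshold:
        return None
    return best
```

Recording `source` and `matched_on` on every result is what would let a store attribute each field to the service that supplied it.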
Storage
- storage.py: BibliographyStore class (SQLite)
- Tables: entries, creators, entry_creators, identifiers, relations, topics, entry_topics, field_provenance, relation_provenance
- FTS5 text search integration
- Field-level provenance tracking
- Citation graph support (cites, cited_by edges)
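As a rough illustration of the FTS5 text search mentioned above, the sketch below builds a small in-memory index and queries it. The table name `entries_fts`, its columns, and the `search` helper are hypothetical and do not reflect the real storage.py schema. It also assumes the SQLite library bundled with Python was compiled with FTS5.

```python
import sqlite3


def build_index(rows: list[tuple[str, str, str]]) -> sqlite3.Connection:
    """Create an in-memory FTS5 table and load (citation_key, title, abstract) rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE VIRTUAL TABLE entries_fts USING fts5(citation_key, title, abstract)"
    )
    conn.executemany("INSERT INTO entries_fts VALUES (?, ?, ?)", rows)
    return conn


def search(conn: sqlite3.Connection, query: str, limit: int = 10) -> list[str]:
    """Return citation keys of matching rows, best match first."""
    cursor = conn.execute(
        "SELECT citation_key FROM entries_fts "
        "WHERE entries_fts MATCH ? ORDER BY rank LIMIT ?",
        (query, limit),
    )
    return [row[0] for row in cursor]


conn = build_index([
    ("smith2020", "Citation graph analysis", "We study graph traversal."),
    ("lee2019", "Neural text search", "Retrieval with embeddings."),
])
hits = search(conn, "graph")
```

Note that the default FTS5 tokenizer does not stem words, so a query for `graph` would not match a document that only contains `graphs`.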
BibTeX Processing
- bibtex.py: BibEntry dataclass and parsing/rendering
- BibTeX → BibEntry conversion
- BibEntry → BibTeX rendering
- Citation key generation
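The rendering and key-generation steps above could look roughly like the following sketch. The `BibEntry` fields shown here, the `render_bibtex` and `make_key` helpers, and the author-year-first-word key scheme are assumptions made for illustration. The actual dataclass and key format in bibtex.py may differ, and parsing is omitted.

```python
import re
from dataclasses import dataclass, field


@dataclass
class BibEntry:
    """Hypothetical minimal entry: type, citation key, and an ordered field map."""
    entry_type: str
    key: str
    fields: dict[str, str] = field(default_factory=dict)


def render_bibtex(entry: BibEntry) -> str:
    """Render one entry as BibTeX, one brace-delimited field per line."""
    lines = [f"@{entry.entry_type}{{{entry.key},"]
    for name, value in entry.fields.items():
        lines.append(f"  {name} = {{{value}}},")
    lines.append("}")
    return "\n".join(lines)


STOPWORDS = {"a", "an", "the", "on", "of"}


def make_key(author_family: str, year: str, title: str) -> str:
    """Build a key as family name + year + first non-stopword of the title (assumed scheme)."""
    family = re.sub(r"[^A-Za-z]", "", author_family).lower()
    words = [w.lower() for w in re.findall(r"[A-Za-z]+", title)]
    first = next((w for w in words if w not in STOPWORDS), "")
    return f"{family}{year}{first}"


entry = BibEntry(
    entry_type="article",
    key=make_key("Smith", "2020", "The Citation Graph"),
    fields={"author": "Smith, Jane", "title": "The Citation Graph", "year": "2020"},
)
rendered = render_bibtex(entry)
```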
CLI and Server
- cli.py: Command-line interface
- app_server.py: Local HTTP server for UI/JSON API
- app_api.py: JSON API adapter surface
Expansion and Discovery
- expand.py: Citation graph expansion workflows
- extract.py: Plaintext reference extraction
- bootstrap.py: Topic bootstrap and expansion
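Citation graph expansion can be thought of as a bounded breadth-first traversal from a seed entry. The sketch below assumes the graph is available as an in-memory mapping from citation key to cited keys. The function name `expand_from` and the `max_depth` parameter are hypothetical and are not taken from expand.py.

```python
from collections import deque


def expand_from(seed: str, cites: dict[str, list[str]], max_depth: int = 2) -> dict[str, int]:
    """Breadth-first walk over 'cites' edges; returns each reached key with its depth."""
    depth = {seed: 0}
    queue = deque([seed])
    while queue:
        key = queue.popleft()
        if depth[key] == max_depth:
            continue  # do not follow edges beyond the depth limit
        for neighbor in cites.get(key, ()):
            if neighbor not in depth:  # skip entries already discovered
                depth[neighbor] = depth[key] + 1
                queue.append(neighbor)
    return depth


graph = {"a": ["b", "c"], "b": ["d"], "c": ["a"], "d": ["e"]}
reached = expand_from("a", graph, max_depth=2)
```

The depth limit keeps the discovered set bounded, and the visited check prevents cycles such as `c` citing `a` from looping.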
Current State Summary
Completed/Usable:
- BibTeX parsing and storage
- Identifier-based resolution (DOI, PMID, arXiv, DBLP, OpenAlex)
- Title search with best-match selection
- Citation graph traversal and expansion
- Field provenance tracking
- Local search with FTS5
- Topic-based discovery workflows
Not Yet Implemented (from new roadmap):
- Plugin-based source architecture
- Multi-source record merging
- PGVector embeddings
- Full-text OA link retrieval
- Semantic Scholar integration
- OpenCitations integration
- Unified API endpoints for multi-source queries
Data Flow
- Ingest: BibTeX file → parse → store in entries table
- Resolve: Entry → resolve_doi/resolve_pmid/resolve_arxiv → fetch metadata → merge with existing
- Expand: Start from entry → traverse citation edges → discover new entries
- Search: Query FTS5 index → retrieve relevant entries
- Export: Entries → render BibTeX → output file
Database Schema
SQLite-based storage with:
- Normalized entry fields
- Creator relationships
- Identifier mapping
- Citation relations
- Topic associations
- Field provenance metadata
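To make the schema outline above more concrete, here is a reduced sketch covering four of the tables named earlier (entries, identifiers, relations, field_provenance). The table names come from this document, but every column name and constraint, and the `open_store` and `cited_by` helpers, are assumptions rather than the real storage.py schema.

```python
import sqlite3

SCHEMA = """
CREATE TABLE entries (
    id INTEGER PRIMARY KEY,
    citation_key TEXT NOT NULL UNIQUE,
    entry_type TEXT NOT NULL,
    title TEXT
);
CREATE TABLE identifiers (
    entry_id INTEGER NOT NULL REFERENCES entries(id),
    scheme TEXT NOT NULL,    -- e.g. 'doi', 'pmid', 'arxiv'
    value TEXT NOT NULL,
    UNIQUE (scheme, value)
);
CREATE TABLE relations (
    source_id INTEGER NOT NULL REFERENCES entries(id),
    target_id INTEGER NOT NULL REFERENCES entries(id),
    relation TEXT NOT NULL,  -- e.g. 'cites'
    PRIMARY KEY (source_id, target_id, relation)
);
CREATE TABLE field_provenance (
    entry_id INTEGER NOT NULL REFERENCES entries(id),
    field TEXT NOT NULL,
    source TEXT NOT NULL,    -- e.g. 'bibtex-import', 'crossref'
    PRIMARY KEY (entry_id, field)
);
"""


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    """Open a database, enable foreign keys, and create the sketch schema."""
    conn = sqlite3.connect(path)
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript(SCHEMA)
    return conn


def cited_by(conn: sqlite3.Connection, citation_key: str) -> list[str]:
    """Keys of entries that cite the given entry, read from stored 'cites' edges."""
    cursor = conn.execute(
        "SELECT src.citation_key FROM relations r "
        "JOIN entries src ON src.id = r.source_id "
        "JOIN entries dst ON dst.id = r.target_id "
        "WHERE r.relation = 'cites' AND dst.citation_key = ? "
        "ORDER BY src.citation_key",
        (citation_key,),
    )
    return [row[0] for row in cursor]
```

In this sketch a cited_by lookup is derived by reading 'cites' edges in reverse. Whether the real store does that or keeps separate cited_by rows is not stated in this document.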