# CiteGeist Current Architecture

## Overview

CiteGeist is currently designed as a local BibTeX-native tooling system with:

- BibTeX parsing and storage
- Local text search (FTS5)
- Entry provenance tracking
- Citation graph traversal
- Topic-based expansion

## Core Modules

### Source Management

- **sources.py**: `SourceClient` class for HTTP requests with caching and retry logic
  - Base HTTP client with JSON/XML/text support
  - Built-in retry with exponential backoff
  - Cache directory support

### Metadata Resolution

- **resolve.py**: `MetadataResolver` class for entry resolution
  - DOI → CrossRef lookup
  - PMID → PubMed lookup
  - arXiv, DBLP, OpenAlex lookup
  - Title search fallback with best-match selection
  - DataCite integration
  - Returns `Resolution` objects with provenance

### Storage

- **storage.py**: `BibliographyStore` class (SQLite)
  - Tables: entries, creators, entry_creators, identifiers, relations, topics, entry_topics, field_provenance, relation_provenance
  - FTS5 text search integration
  - Field-level provenance tracking
  - Citation graph support (cites, cited_by edges)

### BibTeX Processing

- **bibtex.py**: `BibEntry` dataclass and parsing/rendering
  - BibTeX → `BibEntry` conversion
  - `BibEntry` → BibTeX rendering
  - Citation key generation

### CLI and Server

- **cli.py**: Command-line interface
- **app_server.py**: Local HTTP server for UI/JSON API
- **app_api.py**: JSON API adapter surface

### Expansion and Discovery

- **expand.py**: Citation graph expansion workflows
- **extract.py**: Plaintext reference extraction
- **bootstrap.py**: Topic bootstrap and expansion

## Current State Summary

**Completed/Usable:**

- BibTeX parsing and storage
- Identifier-based resolution (DOI, PMID, arXiv, DBLP, OpenAlex)
- Title search with best-match selection
- Citation graph traversal and expansion
- Field provenance tracking
- Local search with FTS5
- Topic-based discovery workflows

**Not Yet Implemented (from new roadmap):**

- Plugin-based source architecture
- Multi-source record merging
- PGVector embeddings
- Full-text OA link retrieval
- Semantic Scholar integration
- OpenCitations integration
- Unified API endpoints for multi-source queries

## Data Flow

1. **Ingest**: BibTeX file → parse → store in entries table
2. **Resolve**: Entry → `resolve_doi`/`resolve_pmid`/`resolve_arxiv` → fetch metadata → merge with existing
3. **Expand**: Start from entry → traverse citation edges → discover new entries
4. **Search**: Query FTS5 index → retrieve relevant entries
5. **Export**: Entries → render BibTeX → output file

## Database Schema

SQLite-based storage with:

- Normalized entry fields
- Creator relationships
- Identifier mapping
- Citation relations
- Topic associations
- Field provenance metadata
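
As a rough illustration of how these schema pieces could fit together, the sketch below builds a reduced SQLite schema and runs one citation-edge lookup. The table names `entries`, `identifiers`, `relations`, and `field_provenance` come from this document; every column name, the helper functions `open_store`, `add_entry`, and `cited_keys`, and the sample data are assumptions made for illustration and may not match the actual definitions in `storage.py`.

```python
import sqlite3

# Hypothetical, reduced schema: table names follow this document,
# but all column names are assumptions for illustration only.
SCHEMA = """
CREATE TABLE entries (
    id INTEGER PRIMARY KEY,
    citation_key TEXT UNIQUE NOT NULL,
    entry_type TEXT NOT NULL,
    title TEXT
);
CREATE TABLE identifiers (
    entry_id INTEGER NOT NULL REFERENCES entries(id),
    scheme TEXT NOT NULL,          -- e.g. 'doi', 'pmid', 'arxiv'
    value TEXT NOT NULL,
    UNIQUE (scheme, value)
);
CREATE TABLE relations (
    source_id INTEGER NOT NULL REFERENCES entries(id),
    target_id INTEGER NOT NULL REFERENCES entries(id),
    relation_type TEXT NOT NULL,   -- 'cites' or 'cited_by'
    PRIMARY KEY (source_id, target_id, relation_type)
);
CREATE TABLE field_provenance (
    entry_id INTEGER NOT NULL REFERENCES entries(id),
    field_name TEXT NOT NULL,
    source TEXT NOT NULL,          -- e.g. 'bibtex', 'crossref'
    PRIMARY KEY (entry_id, field_name)
);
"""


def open_store(path=":memory:"):
    """Open a SQLite database and create the sketch schema."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


def add_entry(conn, citation_key, entry_type, title, source):
    """Insert an entry and record which source supplied its title."""
    cur = conn.execute(
        "INSERT INTO entries (citation_key, entry_type, title) VALUES (?, ?, ?)",
        (citation_key, entry_type, title),
    )
    entry_id = cur.lastrowid
    conn.execute(
        "INSERT INTO field_provenance (entry_id, field_name, source) "
        "VALUES (?, 'title', ?)",
        (entry_id, source),
    )
    return entry_id


def cited_keys(conn, citation_key):
    """Return citation keys of entries that the given entry cites."""
    rows = conn.execute(
        """
        SELECT dst.citation_key
        FROM relations
        JOIN entries AS src ON src.id = relations.source_id
        JOIN entries AS dst ON dst.id = relations.target_id
        WHERE src.citation_key = ? AND relations.relation_type = 'cites'
        ORDER BY dst.citation_key
        """,
        (citation_key,),
    ).fetchall()
    return [row[0] for row in rows]


# Made-up sample data: one entry citing another.
conn = open_store()
a = add_entry(conn, "smith2020", "article", "A Paper", "bibtex")
b = add_entry(conn, "jones2018", "article", "An Earlier Paper", "crossref")
conn.execute(
    "INSERT INTO relations (source_id, target_id, relation_type) "
    "VALUES (?, ?, 'cites')",
    (a, b),
)
print(cited_keys(conn, "smith2020"))  # prints ['jones2018']
```

Keeping provenance in its own table keyed by `(entry_id, field_name)` is one way to record where each field value came from without widening the `entries` table; the FTS5 index, creators, and topics are left out of this sketch for brevity.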