CiteGeist Current Architecture

Overview

CiteGeist is currently a local, BibTeX-native toolkit that provides:

BibTeX parsing and storage
Local text search (FTS5)
Entry provenance tracking
Citation graph traversal
Topic-based expansion

Core Modules

Source Management

sources.py: SourceClient class for HTTP requests with caching and retry logic
- Base HTTP client with JSON/XML/text support
- Built-in retry with exponential backoff
- Cache directory support
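
The retry and cache behaviour listed above can be illustrated with a small standalone sketch. This is not the actual SourceClient code: the function name `fetch_json`, its parameters, and the cache file naming are assumptions made for illustration, and only the Python standard library is used.

```python
import hashlib
import json
import time
import urllib.error
import urllib.request
from pathlib import Path


def fetch_json(url, cache_dir, retries=3, base_delay=0.5,
               opener=urllib.request.urlopen, sleep=time.sleep):
    """Hypothetical helper: fetch JSON with an on-disk cache and retries."""
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)
    # One cache file per URL, named by the SHA-256 of the URL (an assumption).
    digest = hashlib.sha256(url.encode("utf-8")).hexdigest()
    cache_file = cache_dir / (digest + ".json")
    if cache_file.exists():
        return json.loads(cache_file.read_text(encoding="utf-8"))

    for attempt in range(retries + 1):
        try:
            with opener(url) as response:
                body = response.read().decode("utf-8")
            data = json.loads(body)
            cache_file.write_text(body, encoding="utf-8")
            return data
        except (urllib.error.URLError, TimeoutError):
            if attempt == retries:
                raise  # out of attempts, surface the last error
            # Exponential backoff: base_delay, 2*base_delay, 4*base_delay, ...
            sleep(base_delay * 2 ** attempt)
```

The `opener` and `sleep` parameters exist only so the sketch can be exercised without network access or real delays. A production client would probably also add jitter and treat specific HTTP status codes differently.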

Metadata Resolution

resolve.py: MetadataResolver class for entry resolution
- DOI → CrossRef lookup
- PMID → PubMed lookup
- arXiv, DBLP, OpenAlex lookup
- Title search fallback with best-match selection
- DataCite integration
- Returns Resolution objects with provenance
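
As a rough illustration of the title search fallback, the sketch below picks the best-matching candidate by normalized string similarity and attaches the source name as provenance. The `SketchResolution` class, the `best_title_match` function, and the 0.9 threshold are hypothetical. The real MetadataResolver and Resolution types may score and structure results differently.

```python
import re
from dataclasses import dataclass
from difflib import SequenceMatcher


@dataclass
class SketchResolution:
    """Hypothetical stand-in for a resolution result with provenance."""
    fields: dict
    source: str


def normalize_title(title):
    # Lowercase and collapse punctuation/whitespace so formatting
    # differences do not affect the similarity score.
    return re.sub(r"[^a-z0-9]+", " ", title.lower()).strip()


def best_title_match(query, candidates, source, threshold=0.9):
    """Return the closest candidate by title, or None if nothing is close enough."""
    wanted = normalize_title(query)
    best, best_score = None, 0.0
    for candidate in candidates:
        title = normalize_title(candidate.get("title", ""))
        score = SequenceMatcher(None, wanted, title).ratio()
        if score > best_score:
            best, best_score = candidate, score
    if best is None or best_score < threshold:
        return None
    return SketchResolution(fields=dict(best), source=source)
```

Returning None below the threshold keeps a weak title match from silently overwriting a good local entry, which matters when the search is only a fallback.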

Storage

storage.py: BibliographyStore class (SQLite)
- Tables: entries, creators, entry_creators, identifiers, relations, topics, entry_topics, field_provenance, relation_provenance
- FTS5 text search integration
- Field-level provenance tracking
- Citation graph support (cites, cited_by edges)
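
A minimal sketch of FTS5-backed search using Python's built-in `sqlite3` module follows. The table name `entries_fts` and its columns are assumptions, not the schema BibliographyStore actually uses, and the sketch assumes the local SQLite build has FTS5 enabled.

```python
import sqlite3


def build_index(rows):
    """Create an in-memory FTS5 index from (citation_key, title, abstract) rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE VIRTUAL TABLE entries_fts USING fts5(citation_key, title, abstract)"
    )
    conn.executemany("INSERT INTO entries_fts VALUES (?, ?, ?)", rows)
    return conn


def search(conn, query, limit=10):
    """Return citation keys matching an FTS5 query, best match first."""
    cursor = conn.execute(
        "SELECT citation_key FROM entries_fts "
        "WHERE entries_fts MATCH ? ORDER BY rank LIMIT ?",
        (query, limit),
    )
    return [row[0] for row in cursor]


conn = build_index([
    ("smith2020graph", "Citation graph analysis", "Traversing citation edges"),
    ("lee2019parse", "Parsing BibTeX files", "A tolerant parser for messy bibliographies"),
])
print(search(conn, "citation"))
```

FTS5 query strings have their own syntax, so raw user input containing quotes or operators may need escaping before it is passed to `MATCH`.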

BibTeX Processing

bibtex.py: BibEntry dataclass and parsing/rendering
- BibTeX → BibEntry conversion
- BibEntry → BibTeX rendering
- Citation key generation
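
To make the rendering and key-generation steps concrete, here is a small sketch. `SketchEntry`, `make_key`, and `render` are illustrative names, and the key scheme (family name, year, first significant title word) is an assumption. The actual BibEntry dataclass and its key format may differ.

```python
import re
from dataclasses import dataclass, field

STOP_WORDS = {"a", "an", "the", "on", "of"}


@dataclass
class SketchEntry:
    """Hypothetical, simplified stand-in for a parsed BibTeX entry."""
    entry_type: str
    key: str
    fields: dict = field(default_factory=dict)


def make_key(author, year, title):
    """Build a key such as 'knuth1984texbook' from author, year, and title."""
    first_author = author.split(" and ")[0].strip()
    if "," in first_author:
        family = first_author.split(",")[0]       # "Family, Given" form
    else:
        parts = first_author.split()
        family = parts[-1] if parts else "anon"   # "Given Family" form
    words = [w for w in re.findall(r"[A-Za-z0-9]+", title)
             if w.lower() not in STOP_WORDS]
    first_word = words[0] if words else ""
    return re.sub(r"[^a-z0-9]", "", (family + str(year) + first_word).lower())


def render(entry):
    """Render an entry as BibTeX text, one field per line."""
    lines = [f"@{entry.entry_type}{{{entry.key},"]
    for name, value in entry.fields.items():
        lines.append(f"  {name} = {{{value}}},")
    lines.append("}")
    return "\n".join(lines)
```

For example, `make_key("Knuth, Donald E. and Lamport, Leslie", 1984, "The TeXbook")` yields `knuth1984texbook`. A real generator would also need to handle key collisions and non-ASCII names.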

CLI and Server

cli.py: Command-line interface
app_server.py: Local HTTP server for UI/JSON API
app_api.py: JSON API adapter surface

Expansion and Discovery

expand.py: Citation graph expansion workflows
extract.py: Plaintext reference extraction
bootstrap.py: Topic bootstrap and expansion

Current State Summary

Completed/Usable:

BibTeX parsing and storage
Identifier-based resolution (DOI, PMID, arXiv, DBLP, OpenAlex)
Title search with best-match selection
Citation graph traversal and expansion
Field provenance tracking
Local search with FTS5
Topic-based discovery workflows

Not Yet Implemented (from the new roadmap):

Plugin-based source architecture
Multi-source record merging
PGVector embeddings
Full-text OA link retrieval
Semantic Scholar integration
OpenCitations integration
Unified API endpoints for multi-source queries

Data Flow

Ingest: BibTeX file → parse → store in entries table
Resolve: Entry → resolve_doi/resolve_pmid/resolve_arxiv → fetch metadata → merge with the existing entry
Expand: Start from entry → traverse citation edges → discover new entries
Search: Query FTS5 index → retrieve relevant entries
Export: Entries → render BibTeX → output file
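
The Expand step can be pictured as a bounded breadth-first traversal over citation edges. The sketch below works on a plain dictionary rather than the store. The function name `expand` and the `max_depth` parameter are assumptions for illustration only, not the interface of expand.py.

```python
from collections import deque


def expand(seed, cites, max_depth=2):
    """Breadth-first walk over `cites` edges starting at `seed`.

    `cites` maps a citation key to the keys it cites. Returns the newly
    discovered keys as (key, depth) pairs in discovery order.
    """
    seen = {seed}
    discovered = []
    queue = deque([(seed, 0)])
    while queue:
        key, depth = queue.popleft()
        if depth == max_depth:
            continue  # do not follow edges beyond the depth limit
        for target in cites.get(key, []):
            if target not in seen:
                seen.add(target)
                discovered.append((target, depth + 1))
                queue.append((target, depth + 1))
    return discovered


graph = {"a": ["b", "c"], "b": ["d"], "c": ["a", "d"], "d": ["e"]}
print(expand("a", graph, max_depth=2))
```

The `seen` set keeps cycles in the citation graph (here `c` cites `a`) from causing repeated visits, and the depth limit keeps an expansion from pulling in the whole graph.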

Database Schema

SQLite-based storage with:

Normalized entry fields
Creator relationships
Identifier mapping
Citation relations
Topic associations
Field provenance metadata
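
A trimmed-down sketch of how such a schema might look in SQLite is shown below. Only four of the tables are included, and every column name is an assumption. The real definitions live in storage.py and may differ.

```python
import sqlite3

# Hypothetical column layout, for illustration only.
SCHEMA = """
CREATE TABLE entries (
    id INTEGER PRIMARY KEY,
    citation_key TEXT UNIQUE NOT NULL,
    entry_type TEXT NOT NULL,
    title TEXT
);
CREATE TABLE identifiers (
    entry_id INTEGER NOT NULL REFERENCES entries(id),
    scheme TEXT NOT NULL,          -- e.g. 'doi', 'pmid', 'arxiv'
    value TEXT NOT NULL,
    UNIQUE (scheme, value)
);
CREATE TABLE relations (
    source_id INTEGER NOT NULL REFERENCES entries(id),
    target_id INTEGER NOT NULL REFERENCES entries(id),
    kind TEXT NOT NULL,            -- e.g. 'cites'
    UNIQUE (source_id, target_id, kind)
);
CREATE TABLE field_provenance (
    entry_id INTEGER NOT NULL REFERENCES entries(id),
    field TEXT NOT NULL,
    source TEXT NOT NULL,          -- where the current value came from
    PRIMARY KEY (entry_id, field)
);
"""


def open_store(path=":memory:"):
    """Open a database and create the sketch schema."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


def set_title(conn, entry_id, title, source):
    """Update an entry's title and record which source supplied it."""
    with conn:  # commit both statements together, or roll both back
        conn.execute("UPDATE entries SET title = ? WHERE id = ?", (title, entry_id))
        conn.execute(
            "INSERT OR REPLACE INTO field_provenance (entry_id, field, source) "
            "VALUES (?, 'title', ?)",
            (entry_id, source),
        )
```

Keeping provenance in its own table keyed by entry and field means each field can record a different source without widening the entries table.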