# CiteGeist Roadmap: Sources-First Expansion

## Purpose

The primary question is not “how do we redesign CiteGeist around a new storage engine?”

The primary question is “which additional open bibliographic sources should CiteGeist incorporate next?”

This roadmap treats the current SQLite-based local workflow as the baseline and focuses on source evaluation, source integration order, and reviewable source behavior.

## Baseline

Already present in the repository:

- local BibTeX ingest, review, export, and graph traversal
- metadata resolution from `Crossref`, `PubMed`, `Europe PMC`, `OpenAlex`, `Semantic Scholar`, `DBLP`, `arXiv`, and `DataCite`
- citation-graph expansion using `Crossref` and `OpenAlex`
- repository harvesting via `OAI-PMH`

That means the next planning step is source prioritization, not another platform pivot.

## Phase 0: Reframe Scope

Goal:

Put source-incorporation decisions ahead of database and vector-search ambitions.

Tasks:

- [x] identify which source integrations already exist
- [x] separate “source expansion” work from “new database/vector stack” work
- [x] document the source landscape and recommended order

Deliverables:

- `/docs/source-landscape.md`
- `/src/citegeist/sources/catalog.py`

## Phase 1: Tighten The Source Layer

Goal:

Make the new source abstraction useful for the repository that already exists, rather than speculative infrastructure.

Tasks:

- [x] keep the compatibility bridge to the existing `SourceClient`
- [x] fix the initial `CrossRefSource` implementation so normalization works
- [x] make config-driven registry loading work for known concrete sources
- [x] add a code-backed source catalog for planning and prioritization

Deliverables:

- `/src/citegeist/sources/base.py`
- `/src/citegeist/sources/registry.py`
- `/src/citegeist/sources/crossref.py`
- `/src/citegeist/sources/catalog.py`

## Phase 2: Highest-Value Open Source Additions

Goal:

Incorporate the next open sources that materially improve the current workflow.
Priority order:

1. `OpenAIRE` only if repository-acquisition scope expands

Tasks:

- [x] add `OpenCitations` DOI-to-citation and DOI-to-reference lookup
- [x] merge `OpenCitations` edges into the existing graph-expansion workflow with provenance
- [x] add `Unpaywall` DOI-to-OA-link enrichment
- [x] expose OA-link enrichment in a dedicated CLI flow
- [x] add `Europe PMC` as a biomedical metadata/fulltext complement to `PubMed`
- [x] add `Semantic Scholar` as a broader scientific metadata complement across biological and physical sciences

Why these first:

- `OpenCitations` directly answers the open-citation-coverage gap
- `Unpaywall` now solves access-link enrichment without forcing a storage redesign
- `Europe PMC` now improves biomedical metadata and OA/fulltext coverage without changing the storage model
- `Semantic Scholar` now improves broader biological and physical sciences coverage without changing the storage model

## Phase 3: Evaluate Optional Sources, Do Not Commit Prematurely

Goal:

Assess sources that may be useful, but are not clearly the next source-first move.

Candidates:

- `OpenAIRE`

Tasks:

- [ ] document API limits, openness constraints, and integration risk
- [ ] decide whether each source belongs in core resolution, graph expansion, or corpus acquisition
- [ ] avoid adding sources that duplicate existing coverage without a clear payoff

## Deferred Work

These are valid future ideas, but they are not the current planning driver:

- a second database schema
- pgvector integration
- embedding-first search
- large-scale canonical-work reconstruction

The repository already has a working local storage/search path. Those ideas should only return to the front of the plan if a concrete source-integration need forces them there.

## Immediate Next Steps

1. Land the source inventory and source-layer cleanup.
2. Reassess whether `OpenAIRE` is worth adding for repository-acquisition breadth.
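
As an illustration of the Phase 2 task of merging `OpenCitations` edges into the graph-expansion workflow with provenance, the following is a minimal sketch, not the repository's implementation. The names `CitationEdge`, `normalize_doi`, and `merge_edges` are hypothetical, and the sketch assumes edges are keyed by a normalized (citing DOI, cited DOI) pair, with a set of source labels recording which provider reported each edge.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, Set, Tuple


@dataclass
class CitationEdge:
    """Hypothetical edge record: `citing` DOI cites `cited` DOI."""
    citing: str
    cited: str
    sources: Set[str] = field(default_factory=set)  # provenance labels


def normalize_doi(doi: str) -> str:
    """Lowercase a DOI and strip common URL/prefix forms so duplicates compare equal."""
    doi = doi.strip().lower()
    for prefix in ("https://doi.org/", "http://doi.org/", "doi:"):
        if doi.startswith(prefix):
            doi = doi[len(prefix):]
    return doi


def merge_edges(
    graph: Dict[Tuple[str, str], CitationEdge],
    pairs: Iterable[Tuple[str, str]],
    source: str,
) -> Dict[Tuple[str, str], CitationEdge]:
    """Add (citing, cited) DOI pairs to `graph`, tagging each edge with `source`.

    An edge already reported by another provider is not duplicated;
    the new source label is added to its provenance set instead.
    """
    for citing, cited in pairs:
        key = (normalize_doi(citing), normalize_doi(cited))
        edge = graph.get(key)
        if edge is None:
            edge = CitationEdge(citing=key[0], cited=key[1])
            graph[key] = edge
        edge.sources.add(source)
    return graph


# Example with made-up DOIs: the first edge is reported by both providers.
graph: Dict[Tuple[str, str], CitationEdge] = {}
merge_edges(graph, [("10.1/A", "10.2/b")], "crossref")
merge_edges(
    graph,
    [("https://doi.org/10.1/a", "10.2/B"), ("10.1/a", "10.3/c")],
    "opencitations",
)
```

Keeping provenance as a set means an edge reported by both `Crossref` and `OpenCitations` is stored once while staying reviewable per source, which fits the existing local storage model without requiring a schema change.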