CiteGeist Roadmap: Sources-First Expansion
Purpose
The primary question is not “how do we redesign CiteGeist around a new storage engine?” The primary question is “which additional open bibliographic sources should CiteGeist incorporate next?”
This roadmap treats the current SQLite-based local workflow as the baseline and focuses on source evaluation, source integration order, and reviewable source behavior.
Baseline
Already present in the repository:
- local BibTeX ingest, review, export, and graph traversal
- metadata resolution from Crossref, PubMed, Europe PMC, OpenAlex, Semantic Scholar, DBLP, arXiv, and DataCite
- citation-graph expansion using Crossref and OpenAlex
- repository harvesting via OAI-PMH
That means the next planning step is source prioritization, not another platform pivot.
Phase 0: Reframe Scope
Goal:
Put source-incorporation decisions ahead of database and vector-search ambitions.
Tasks:
- identify which source integrations already exist
- separate “source expansion” work from “new database/vector stack” work
- document the source landscape and recommended order
Deliverables:
- /docs/source-landscape.md
- /src/citegeist/sources/catalog.py
Phase 1: Tighten The Source Layer
Goal:
Make the new source abstraction useful for the repository that already exists, rather than building speculative infrastructure.
Tasks:
- keep the compatibility bridge to the existing SourceClient
- fix the initial CrossRefSource implementation so normalization works
- make config-driven registry loading work for known concrete sources
- add a code-backed source catalog for planning and prioritization
Deliverables:
- /src/citegeist/sources/base.py
- /src/citegeist/sources/registry.py
- /src/citegeist/sources/crossref.py
- /src/citegeist/sources/catalog.py
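As a rough illustration of the config-driven registry loading task, the sketch below maps config keys to concrete source classes and instantiates only the enabled ones. Every name in it (SOURCE_FACTORIES, load_sources, the config shape, and the stand-in CrossRefSource constructor with its mailto option) is an assumption made for illustration, not the actual contents of registry.py or crossref.py.

```python
from typing import Callable, Dict


class CrossRefSource:
    """Stand-in for a concrete source (hypothetical constructor)."""

    def __init__(self, mailto: str = "") -> None:
        self.mailto = mailto


# Hypothetical registry: config key -> factory for a known concrete source.
SOURCE_FACTORIES: Dict[str, Callable[..., object]] = {
    "crossref": CrossRefSource,
}


def load_sources(config: Dict[str, dict]) -> Dict[str, object]:
    """Instantiate every enabled source that the registry knows about."""
    loaded: Dict[str, object] = {}
    for key, options in config.items():
        opts = dict(options)  # copy so the caller's config is not mutated
        if not opts.pop("enabled", True):
            continue  # disabled sources are skipped without validation
        if key not in SOURCE_FACTORIES:
            raise KeyError(f"unknown source in config: {key!r}")
        loaded[key] = SOURCE_FACTORIES[key](**opts)
    return loaded


if __name__ == "__main__":
    sources = load_sources({"crossref": {"enabled": True, "mailto": "dev@example.org"}})
    print(sorted(sources))
```

Failing loudly on an unknown key is one possible design choice; it keeps the registry limited to known concrete sources, in line with the task above.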
Phase 2: Highest-Value Open Source Additions
Goal:
Incorporate the next open sources that materially improve the current workflow.
Priority order:
- OpenAIRE, only if repository-acquisition scope expands
Tasks:
- add OpenCitations DOI-to-citation and DOI-to-reference lookup
- merge OpenCitations edges into the existing graph-expansion workflow with provenance
- add Unpaywall DOI-to-OA-link enrichment
- expose OA-link enrichment in a dedicated CLI flow
- add Europe PMC as a biomedical metadata/fulltext complement to PubMed
- add Semantic Scholar as a broader scientific metadata complement across biological and physical sciences
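The "merge edges with provenance" task can be sketched as follows, assuming citation edges are stored as (citing DOI, cited DOI) pairs and provenance is the set of source names that reported each edge. The in-memory dict, the merge_edges and normalize_doi helpers, and the source labels are hypothetical; the existing graph-expansion workflow may store edges quite differently.

```python
from typing import Dict, Iterable, Set, Tuple

Edge = Tuple[str, str]  # (citing DOI, cited DOI)


def normalize_doi(doi: str) -> str:
    """Lowercase a DOI and strip common prefixes so duplicates collapse."""
    doi = doi.strip().lower()
    for prefix in ("https://doi.org/", "http://doi.org/", "doi:"):
        if doi.startswith(prefix):
            doi = doi[len(prefix):]
    return doi


def merge_edges(graph: Dict[Edge, Set[str]], edges: Iterable[Edge], source: str) -> int:
    """Add edges to the graph, recording which source reported each one.

    Returns the number of edges that were not already present.
    """
    added = 0
    for citing, cited in edges:
        key = (normalize_doi(citing), normalize_doi(cited))
        if key not in graph:
            graph[key] = set()
            added += 1
        graph[key].add(source)  # provenance accumulates across sources
    return added


if __name__ == "__main__":
    graph: Dict[Edge, Set[str]] = {}
    merge_edges(graph, [("10.1/A", "10.1/B")], "crossref")
    merge_edges(graph, [("doi:10.1/a", "10.1/b"), ("10.1/a", "10.1/c")], "opencitations")
    for edge, sources in sorted(graph.items()):
        print(edge, sorted(sources))
```

Keeping provenance as a set per edge means an edge confirmed by several sources stays a single edge, while a reviewer can still see which source contributed it.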
Why these first:
- OpenCitations directly answers the open-citation-coverage gap
- Unpaywall now solves access-link enrichment without forcing a storage redesign
- Europe PMC now improves biomedical metadata and OA/fulltext coverage without changing the storage model
- Semantic Scholar now improves broader biological and physical sciences coverage without changing the storage model
Phase 3: Evaluate Optional Sources, Do Not Commit Prematurely
Goal:
Assess sources that may be useful, but are not clearly the next source-first move.
Candidates:
- OpenAIRE
Tasks:
- document API limits, openness constraints, and integration risk
- decide whether each source belongs in core resolution, graph expansion, or corpus acquisition
- avoid adding sources that duplicate existing coverage without a clear payoff
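One way to make the duplicate-coverage check reviewable is to record each source's role and capabilities as data and compute what a candidate would add. The sketch below is only an assumption about how such a check could look: SourceProfile, new_capabilities, the capability labels, and the candidate "ExampleSource" are all hypothetical, and the two existing entries simply restate the baseline list in this roadmap.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable, Set


@dataclass(frozen=True)
class SourceProfile:
    """Hypothetical planning record for one source."""

    name: str
    role: str  # "core resolution", "graph expansion", or "corpus acquisition"
    capabilities: FrozenSet[str]


def new_capabilities(candidate: SourceProfile, existing: Iterable[SourceProfile]) -> Set[str]:
    """Return the capabilities the candidate adds beyond existing coverage."""
    covered: Set[str] = set()
    for source in existing:
        covered |= source.capabilities
    return set(candidate.capabilities) - covered


if __name__ == "__main__":
    existing = [
        SourceProfile("Crossref", "core resolution",
                      frozenset({"metadata-resolution", "citation-graph"})),
        SourceProfile("OAI-PMH", "corpus acquisition",
                      frozenset({"repository-harvest"})),
    ]
    # Illustrative candidate only; the labels are placeholders, not claims about a real API.
    candidate = SourceProfile("ExampleSource", "core resolution",
                              frozenset({"metadata-resolution", "oa-links"}))
    print(new_capabilities(candidate, existing))
```

An empty result would suggest the candidate duplicates existing coverage and needs some other clear payoff to justify integration.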
Deferred Work
These are valid future ideas, but they are not the current planning driver:
- a second database schema
- pgvector integration
- embedding-first search
- large-scale canonical-work reconstruction
The repository already has a working local storage/search path. Those ideas should only return to the front of the plan if a concrete source-integration need forces them there.
Immediate Next Steps
- Land the source inventory and source-layer cleanup.
- Reassess whether OpenAIRE is worth adding for repository-acquisition breadth.