CiteGeist Roadmap: Sources-First Expansion

Purpose

The primary question is not “how do we redesign CiteGeist around a new storage engine?” The primary question is “which additional open bibliographic sources should CiteGeist incorporate next?”

This roadmap treats the current SQLite-based local workflow as the baseline and focuses on source evaluation, source integration order, and reviewable source behavior.

Baseline

Already present in the repository:

  • local BibTeX ingest, review, export, and graph traversal
  • metadata resolution from Crossref, PubMed, Europe PMC, OpenAlex, Semantic Scholar, DBLP, arXiv, and DataCite
  • citation-graph expansion using Crossref and OpenAlex
  • repository harvesting via OAI-PMH

That means the next planning step is source prioritization, not another platform pivot.

Phase 0: Reframe Scope

Goal:

Put source-incorporation decisions ahead of database and vector-search ambitions.

Tasks:

  • identify which source integrations already exist
  • separate “source expansion” work from “new database/vector stack” work
  • document the source landscape and recommended order

Deliverables:

  • /docs/source-landscape.md
  • /src/citegeist/sources/catalog.py
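A code-backed catalog could be as small as a tuple of records that planning scripts can filter. The sketch below is illustrative only: `Role`, `Status`, `CatalogEntry`, `CATALOG`, and `by_status` are hypothetical names, not the contents of the real `catalog.py`, and the status assigned to each source is an assumption read off this roadmap rather than a verified inventory.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Tuple


class Role(Enum):
    # The three placements this roadmap distinguishes for a source.
    RESOLUTION = "core resolution"
    GRAPH = "graph expansion"
    ACQUISITION = "corpus acquisition"


class Status(Enum):
    PRESENT = "present"
    PLANNED = "planned"
    OPTIONAL = "optional"


@dataclass(frozen=True)
class CatalogEntry:
    name: str
    roles: Tuple[Role, ...]
    status: Status
    note: str = ""


# Hypothetical entries; statuses are assumptions taken from the roadmap text.
CATALOG = (
    CatalogEntry("Crossref", (Role.RESOLUTION, Role.GRAPH), Status.PRESENT),
    CatalogEntry("OpenAlex", (Role.RESOLUTION, Role.GRAPH), Status.PRESENT),
    CatalogEntry("OpenCitations", (Role.GRAPH,), Status.PLANNED),
    CatalogEntry(
        "OpenAIRE",
        (Role.ACQUISITION,),
        Status.OPTIONAL,
        note="only if repository-acquisition scope expands",
    ),
)


def by_status(status):
    """Return the names of catalog entries with the given status, in order."""
    return [entry.name for entry in CATALOG if entry.status is status]


if __name__ == "__main__":
    for status in Status:
        print(status.value, by_status(status))
```

Keeping the catalog as plain data means the source-landscape document and the prioritization order can be checked against one list instead of drifting apart.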

Phase 1: Tighten The Source Layer

Goal:

Make the new source abstraction useful for the repository that already exists, rather than building speculative infrastructure.

Tasks:

  • keep the compatibility bridge to the existing SourceClient
  • fix the initial CrossRefSource implementation so normalization works
  • make config-driven registry loading work for known concrete sources
  • add a code-backed source catalog for planning and prioritization

Deliverables:

  • /src/citegeist/sources/base.py
  • /src/citegeist/sources/registry.py
  • /src/citegeist/sources/crossref.py
  • /src/citegeist/sources/catalog.py

Phase 2: Highest-Value Open Source Additions

Goal:

Incorporate the next open sources that materially improve the current workflow.

Priority order:

  1. OpenAIRE only if repository-acquisition scope expands

Tasks:

  • add OpenCitations DOI-to-citation and DOI-to-reference lookup
  • merge OpenCitations edges into the existing graph-expansion workflow with provenance
  • add Unpaywall DOI-to-OA-link enrichment
  • expose OA-link enrichment in a dedicated CLI flow
  • add Europe PMC as a biomedical metadata/fulltext complement to PubMed
  • add Semantic Scholar as a broader scientific metadata complement across biological and physical sciences
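Merging edges "with provenance" can be sketched without touching any API: each incoming batch is tagged with the source it came from, DOIs are normalized so the same edge from two sources collapses into one, and the edge remembers every source that asserted it. The names `normalize_doi` and `merge_edges` and the `(citing, cited)` tuple shape are assumptions for illustration; the existing graph-expansion code may represent edges differently.

```python
from collections import defaultdict


def normalize_doi(doi):
    """Lowercase a DOI and strip common URL or scheme prefixes."""
    doi = doi.strip().lower()
    for prefix in ("https://doi.org/", "http://doi.org/", "doi:"):
        if doi.startswith(prefix):
            doi = doi[len(prefix):]
    return doi


def merge_edges(*batches):
    """Merge citation edges from several sources, keeping provenance.

    Each batch is a pair (source_name, iterable of (citing_doi, cited_doi)).
    Returns {(citing, cited): sorted list of source names}.
    """
    merged = defaultdict(set)
    for source, edges in batches:
        for citing, cited in edges:
            key = (normalize_doi(citing), normalize_doi(cited))
            merged[key].add(source)
    return {edge: sorted(sources) for edge, sources in merged.items()}


if __name__ == "__main__":
    existing = ("crossref", [("10.1/A", "10.1/b")])
    incoming = ("opencitations", [("doi:10.1/a", "https://doi.org/10.1/B"),
                                  ("10.1/a", "10.1/c")])
    for edge, sources in merge_edges(existing, incoming).items():
        print(edge, sources)
```

Recording a set of sources per edge, rather than overwriting, lets a reviewer see which edges only OpenCitations contributed.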

Why these first:

  • OpenCitations directly answers the open-citation-coverage gap
  • Unpaywall now solves access-link enrichment without forcing a storage redesign
  • Europe PMC now improves biomedical metadata and OA/fulltext coverage without changing the storage model
  • Semantic Scholar now improves broader biological and physical sciences coverage without changing the storage model
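To illustrate why OA-link enrichment needs no storage redesign, the sketch below builds an Unpaywall v2 request URL and pulls one link out of an already-fetched response. The helper names `unpaywall_url` and `best_oa_link` and the shape of the returned dict are hypothetical; the response fields used (`is_oa`, `best_oa_location`, `url_for_pdf`, `url`, `host_type`, `license`) follow the Unpaywall v2 response format as I understand it and should be checked against the current API documentation. No network call is made here.

```python
from urllib.parse import quote

UNPAYWALL_BASE = "https://api.unpaywall.org/v2/"


def unpaywall_url(doi, email):
    """Build the lookup URL for one DOI; Unpaywall expects an email parameter."""
    return f"{UNPAYWALL_BASE}{quote(doi, safe='/')}?email={quote(email)}"


def best_oa_link(record):
    """Extract one OA link from a parsed Unpaywall response, or None."""
    if not record.get("is_oa"):
        return None
    location = record.get("best_oa_location") or {}
    # Prefer a direct PDF link; fall back to the landing-page URL.
    url = location.get("url_for_pdf") or location.get("url")
    if not url:
        return None
    return {
        "url": url,
        "host_type": location.get("host_type"),
        "license": location.get("license"),
        "source": "unpaywall",  # provenance tag for later review
    }


if __name__ == "__main__":
    print(unpaywall_url("10.1000/abc", "me@example.org"))
    # Illustrative record, not real API output.
    sample = {
        "is_oa": True,
        "best_oa_location": {"url": "https://example.org/paper",
                             "url_for_pdf": None,
                             "host_type": "repository",
                             "license": "cc-by"},
    }
    print(best_oa_link(sample))
```

Because the result is a small dict keyed to a DOI, it can be attached to an existing entry as an extra field during the dedicated CLI flow.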

Phase 3: Evaluate Optional Sources, Do Not Commit Prematurely

Goal:

Assess sources that may be useful but are not clearly the next sources-first move.

Candidates:

  • OpenAIRE

Tasks:

  • document API limits, openness constraints, and integration risk
  • decide whether each source belongs in core resolution, graph expansion, or corpus acquisition
  • avoid adding sources that duplicate existing coverage without a clear payoff

Deferred Work

These are valid future ideas, but they are not the current planning driver:

  • a second database schema
  • pgvector integration
  • embedding-first search
  • large-scale canonical-work reconstruction

The repository already has a working local storage/search path. Those ideas should only return to the front of the plan if a concrete source-integration need forces them there.

Immediate Next Steps

  1. Land the source inventory and source-layer cleanup.
  2. Reassess whether OpenAIRE is worth adding for repository-acquisition breadth.