CiteGeist Sources-First Progress

Last Updated: 2026-04-25

This document tracks the refocused plan for source incorporation. The working question is which additional open bibliographic sources CiteGeist should integrate next, not whether it needs a new storage platform first.


Phase 0: Scope Reframe COMPLETE

Status: Completed

Deliverables:

  • /docs/source-landscape.md - source inventory and recommendation document
  • /src/citegeist/sources/catalog.py - code-backed source catalog

Completed:

  • Identified which source integrations already exist in the repository
  • Split source-expansion planning from database/vector-search ambitions
  • Prioritized additions of open bibliographic sources by workflow value

Phase 1: Source Layer Tightening COMPLETE

Status: Completed

Deliverables:

  • /src/citegeist/sources/base.py - Base BibliographicSource interface
  • /src/citegeist/sources/registry.py - Registry for known concrete sources
  • /src/citegeist/sources/crossref.py - Repaired Crossref source implementation
  • /src/citegeist/sources/catalog.py - Inventory of open bibliographic sources
  • /src/citegeist/sources/__init__.py - Package initialization
  • /tests/test_sources_plugin.py - Source plugin tests
  • /tests/test_sources_catalog.py - Source catalog and registry tests

Completed:

  • Created BibliographicSource abstract base class
  • Repaired SourceRegistry so config-backed loading resolves real source classes
  • Fixed CrossRefSource normalization for direct lookup and search-style payloads
  • Replaced path-specific compatibility loading with repo-relative loading
  • Added a source catalog that captures current status and next-priority sources
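
As an illustration of the normalization fix listed above: the Crossref REST API returns a single work object under `message` for a direct DOI lookup, and a list under `message.items` for a search. The sketch below handles both shapes. The function names and the output keys are hypothetical and are not taken from `crossref.py`.

```python
from typing import Any, Dict, List, Optional


def extract_crossref_works(payload: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Return the work objects from either Crossref payload shape."""
    message = payload.get("message")
    if not isinstance(message, dict):
        return []
    items = message.get("items")
    if isinstance(items, list):
        # Search-style payload: works are listed under message.items.
        return [item for item in items if isinstance(item, dict)]
    # Direct-lookup payload: message is itself the work.
    return [message]


def normalize_work(work: Dict[str, Any]) -> Dict[str, Optional[str]]:
    """Reduce a Crossref work to a small record (output keys are assumptions)."""
    titles = work.get("title") or []  # Crossref stores titles as a list
    doi = (work.get("DOI") or "").lower()
    return {"doi": doi or None, "title": titles[0] if titles else None}
```

Routing both payload shapes through one extraction step lets the rest of the source code treat lookup and search results the same way.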

Features:

  • Abstract interface for source plugins
  • Registry for known source discovery and instantiation
  • Config-driven enable/disable for known source types
  • Source prioritization metadata
  • Compatibility with the existing SourceClient-based resolver/expander code
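
The features above can be sketched roughly as follows. This is a minimal illustration, not the code in `base.py` or `registry.py`: the `lookup` method, the `register` decorator, `load_enabled`, the `EchoSource` class, and the config shape are all assumptions.

```python
from abc import ABC, abstractmethod
from typing import Any, Dict, List, Optional, Type


class BibliographicSource(ABC):
    """Assumed shape of the abstract source interface."""

    name: str = ""

    @abstractmethod
    def lookup(self, identifier: str) -> Optional[Dict[str, Any]]:
        """Return one record for an identifier such as a DOI, or None."""


class SourceRegistry:
    """Maps known source names to classes and instantiates the enabled ones."""

    def __init__(self) -> None:
        self._classes: Dict[str, Type[BibliographicSource]] = {}

    def register(self, cls: Type[BibliographicSource]) -> Type[BibliographicSource]:
        self._classes[cls.name] = cls
        return cls

    def load_enabled(self, config: Dict[str, Dict[str, Any]]) -> List[BibliographicSource]:
        sources: List[BibliographicSource] = []
        for name, options in config.items():
            if not options.get("enabled", False):
                continue  # disabled in config: skip without instantiating
            if name not in self._classes:
                raise KeyError(f"unknown source type: {name}")
            sources.append(self._classes[name]())
        return sources


registry = SourceRegistry()


@registry.register
class EchoSource(BibliographicSource):
    """Toy source used only to exercise the registry."""

    name = "echo"

    def lookup(self, identifier: str) -> Optional[Dict[str, Any]]:
        return {"id": identifier, "source": self.name}
```

With this shape, `registry.load_enabled({"echo": {"enabled": True}})` returns a single `EchoSource` instance, and an unknown source name in the config fails loudly instead of being silently ignored.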

Current Integrated Sources AVAILABLE

  • Crossref
  • OpenAlex
  • OpenCitations
  • Unpaywall
  • PubMed
  • Europe PMC
  • Semantic Scholar
  • DataCite
  • DBLP
  • arXiv
  • OAI-PMH

These are already sufficient for a credible local enrichment-and-discovery workflow. The next work should complement them rather than restart infrastructure underneath them.


Phase 2: Next Source Additions 🚧 IN PROGRESS

Status: In Progress

Priority Order:

  1. OpenAIRE only if repository-acquisition scope expands

Completed Deliverables:

  • OpenCitations adapter for DOI citation/reference lookup
  • OpenCitations graph expansion support in CLI and topic expansion flows
  • Unpaywall adapter for DOI OA-link enrichment
  • enrich-oa CLI flow for applying OA metadata to stored entries
  • Europe PMC biomedical resolver/search integration
  • Semantic Scholar broad-science resolver/search integration
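
To make the OA-enrichment deliverable concrete, here is a hedged sketch of how an Unpaywall record could be folded into a stored entry. The input fields (`is_oa`, `oa_status`, `best_oa_location`, `url_for_pdf`, `url`, `license`) follow the Unpaywall v2 response format. The function names and the output keys on the entry are assumptions, not the actual `enrich-oa` implementation.

```python
from typing import Any, Dict


def oa_fields_from_unpaywall(record: Dict[str, Any]) -> Dict[str, Any]:
    """Pick the OA fields of interest out of one Unpaywall DOI record."""
    # best_oa_location is null for closed-access works.
    best = record.get("best_oa_location") or {}
    return {
        "is_oa": bool(record.get("is_oa")),
        "oa_status": record.get("oa_status"),
        # Prefer a direct PDF link, fall back to the location's generic URL.
        "oa_url": best.get("url_for_pdf") or best.get("url"),
        "oa_license": best.get("license"),
    }


def apply_oa_fields(entry: Dict[str, Any], record: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of entry with the non-empty OA fields applied."""
    updated = dict(entry)
    for key, value in oa_fields_from_unpaywall(record).items():
        if value is not None:
            updated[key] = value
    return updated
```

Returning a copy rather than mutating the stored entry keeps the enrichment step easy to test and to re-run.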

Planned Deliverables:

  • Decide whether repository-acquisition breadth needs another dedicated source

Rationale:

  • OpenCitations now improves open citation-edge coverage
  • Unpaywall now improves access-link enrichment
  • Europe PMC now improves biomedical metadata and OA/fulltext coverage
  • Semantic Scholar now improves broader biological and physical sciences coverage
  • None of these requires a new database architecture to become useful

Phase 3: Optional Source Evaluation PLANNED

Status: Planned

  • OpenAIRE

Decision Rule:

  • Add an optional source only if it solves a concrete discovery or acquisition gap that the current open sources do not already cover well

Explicitly Deferred

  • second-schema redesign work
  • pgvector integration
  • embedding-first retrieval
  • broad canonical-work reconstruction

Summary

  • Completed: scope reframe and source-layer cleanup
  • Planned next: OpenAIRE reevaluation
  • Deferred: database/vector expansion work not required by the source question