CiteGeist Sources-First Progress
Last Updated: 2026-04-25
This document tracks the refocused plan for source incorporation. The working question is which additional open bibliographic sources CiteGeist should integrate next, not whether it needs a new storage platform first.
Phase 0: Scope Reframe ✅ COMPLETE
Status: Completed
Deliverables:
- ✅ /docs/source-landscape.md - source inventory and recommendation document
- ✅ /src/citegeist/sources/catalog.py - code-backed source catalog
Completed:
- Identified which source integrations already exist in the repository
- Split source-expansion planning from database/vector-search ambitions
- Prioritized open-source additions by workflow value
Phase 1: Source Layer Tightening ✅ COMPLETE
Status: Completed
Deliverables:
- ✅ /src/citegeist/sources/base.py - BaseBibliographicSource interface
- ✅ /src/citegeist/sources/registry.py - Registry for known concrete sources
- ✅ /src/citegeist/sources/crossref.py - Repaired CrossRef source implementation
- ✅ /src/citegeist/sources/catalog.py - Open-source inventory
- ✅ /src/citegeist/sources/__init__.py - Package initialization
- ✅ /tests/test_sources_plugin.py - Source plugin tests
- ✅ /tests/test_sources_catalog.py - Source catalog and registry tests
Completed:
- ✅ Created BibliographicSource abstract base class
- ✅ Repaired SourceRegistry so config-backed loading resolves real source classes
- ✅ Fixed CrossRefSource normalization for direct lookup and search-style payloads
- ✅ Replaced path-specific compatibility loading with repo-relative loading
- ✅ Added a source catalog that captures current status and next-priority sources
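The normalization fix for direct lookup and search-style payloads can be illustrated with a minimal sketch. It relies on one fact about the Crossref REST API: a DOI lookup returns a single work object under message, while a search returns a list under message.items. The function names and the output field names (doi, title, authors) are hypothetical and are not taken from CrossRefSource itself.

```python
from typing import Any


def extract_crossref_works(payload: dict[str, Any]) -> list[dict[str, Any]]:
    """Return a list of work records from either Crossref payload shape."""
    message = payload.get("message") or {}
    # Search-style responses wrap the works in message["items"].
    if isinstance(message.get("items"), list):
        return message["items"]
    # Direct DOI lookups return one work object as the message itself.
    return [message] if message else []


def normalize_work(work: dict[str, Any]) -> dict[str, Any]:
    """Flatten one Crossref work into a small entry dict (hypothetical schema)."""
    titles = work.get("title") or []  # Crossref stores titles as a list
    authors = []
    for author in work.get("author") or []:
        parts = [author.get("given"), author.get("family")]
        full_name = " ".join(p for p in parts if p) or author.get("name", "")
        if full_name:
            authors.append(full_name)
    doi = work.get("DOI")
    return {
        "doi": doi.lower() if doi else None,  # DOIs are case-insensitive
        "title": titles[0] if titles else None,
        "authors": authors,
    }
```

Routing both payload shapes through one extraction step lets the per-work normalization stay identical for lookup and search callers.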
Features:
- Abstract interface for source plugins
- Registry for known source discovery and instantiation
- Config-driven enable/disable for known source types
- Source prioritization metadata
- Compatibility with the existing SourceClient-based resolver/expander code
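As a rough sketch of how the features above fit together, the following shows an abstract source interface, a registry of known source classes, and config-driven enable/disable with priority ordering. Only the class names BibliographicSource and SourceRegistry come from this document; the lookup method, the priority attribute, the config layout, and the stub sources are assumptions made for illustration.

```python
from abc import ABC, abstractmethod
from typing import Any, Optional


class BibliographicSource(ABC):
    """Abstract interface for source plugins (method names are assumed)."""

    name: str = ""
    priority: int = 100  # assumed convention: lower values are consulted first

    @abstractmethod
    def lookup(self, doi: str) -> Optional[dict[str, Any]]:
        """Return metadata for a DOI, or None when the source has no record."""


class SourceRegistry:
    """Maps known source names to classes and instantiates the enabled ones."""

    def __init__(self) -> None:
        self._classes: dict[str, type[BibliographicSource]] = {}

    def register(self, cls: type[BibliographicSource]) -> type[BibliographicSource]:
        self._classes[cls.name] = cls
        return cls  # returning cls lets register() double as a decorator

    def load(self, config: dict[str, dict[str, Any]]) -> list[BibliographicSource]:
        """Instantiate known, enabled sources, ordered by priority."""
        sources = []
        for name, options in config.items():
            cls = self._classes.get(name)
            # Skip unknown names and sources switched off in the config.
            if cls is None or not options.get("enabled", True):
                continue
            sources.append(cls())
        return sorted(sources, key=lambda source: source.priority)


registry = SourceRegistry()


@registry.register
class StubCrossref(BibliographicSource):  # hypothetical stand-in, no network calls
    name = "crossref"
    priority = 10

    def lookup(self, doi: str) -> Optional[dict[str, Any]]:
        return {"doi": doi, "source": self.name}


@registry.register
class StubOpenAlex(BibliographicSource):  # hypothetical stand-in, no network calls
    name = "openalex"
    priority = 20

    def lookup(self, doi: str) -> Optional[dict[str, Any]]:
        return {"doi": doi, "source": self.name}
```

With a config such as {"openalex": {"enabled": True}, "crossref": {"enabled": True}}, registry.load returns the Crossref stub first because of its lower priority value. Setting "enabled" to False for a source drops it without touching code.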
Current Integrated Sources ✅ AVAILABLE
- Crossref
- OpenAlex
- OpenCitations
- Unpaywall
- PubMed
- Europe PMC
- Semantic Scholar
- DataCite
- DBLP
- arXiv
- OAI-PMH
These are already sufficient for a credible local enrichment-and-discovery workflow. The next work should complement them rather than restart infrastructure underneath them.
Phase 2: Next Source Additions 🚧 IN PROGRESS
Status: In Progress
Priority Order:
- OpenAIRE: only if repository-acquisition scope expands
Completed Deliverables:
- ✅ OpenCitations adapter for DOI citation/reference lookup
- ✅ OpenCitations graph expansion support in CLI and topic expansion flows
- ✅ Unpaywall adapter for DOI OA-link enrichment
- ✅ enrich-oa CLI flow for applying OA metadata to stored entries
- ✅ Europe PMC biomedical resolver/search integration
- ✅ Semantic Scholar broad-science resolver/search integration
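To make the OA-enrichment step concrete, here is a minimal sketch of applying an Unpaywall record to a stored entry. The record fields used (is_oa, oa_status, best_oa_location with url, url_for_pdf, and license) are part of the Unpaywall v2 response format. The function name and the entry field names are hypothetical, and this is not a description of what the enrich-oa flow actually does internally.

```python
from typing import Any


def apply_oa_metadata(entry: dict[str, Any], record: dict[str, Any]) -> dict[str, Any]:
    """Return a copy of entry with OA fields filled from an Unpaywall record."""
    # best_oa_location is null for closed-access works, so fall back to {}.
    best = record.get("best_oa_location") or {}
    enriched = dict(entry)  # leave the stored entry untouched
    enriched["is_oa"] = bool(record.get("is_oa"))
    enriched["oa_status"] = record.get("oa_status")
    # Prefer a direct PDF link when present, otherwise the generic OA URL.
    enriched["oa_url"] = best.get("url_for_pdf") or best.get("url")
    enriched["oa_license"] = best.get("license")
    return enriched
```

Returning a new dict rather than mutating the entry keeps the enrichment step easy to test and to re-run against updated records.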
Planned Deliverables:
- ⏳ Decide whether repository-acquisition breadth needs another dedicated source
Rationale:
- OpenCitations now improves open citation-edge coverage
- Unpaywall now improves access-link enrichment
- Europe PMC now improves biomedical metadata and OA/fulltext coverage
- Semantic Scholar now improves broader biological and physical sciences coverage
- none of these requires a new database architecture to become useful
Phase 3: Optional Source Evaluation ⏳ PLANNED
Status: Planned
- OpenAIRE
Decision Rule:
- add them only if they solve a concrete discovery or acquisition gap that current open sources do not already cover well
Explicitly Deferred
- second-schema redesign work
- pgvector integration
- embedding-first retrieval
- broad canonical-work reconstruction
Summary
Completed: scope reframe and source-layer cleanup
Planned next: OpenAIRE reevaluation
Deferred: database/vector expansion work not required by the source question