# CiteGeist Sources-First Progress
**Last Updated:** 2026-04-25
This document tracks the refocused plan for source incorporation. The working question is which additional open bibliographic sources CiteGeist should integrate next, not whether it needs a new storage platform first.

---
## Phase 0: Scope Reframe ✅ COMPLETE
**Status:** Completed
**Deliverables:**
- `/docs/source-landscape.md` - source inventory and recommendation document
- `/src/citegeist/sources/catalog.py` - code-backed source catalog
**Completed:**
- Identified which source integrations already exist in the repository
- Split source-expansion planning from database/vector-search ambitions
- Prioritized open-source additions by workflow value
---
## Phase 1: Source Layer Tightening ✅ COMPLETE
**Status:** Completed
**Deliverables:**
- `/src/citegeist/sources/base.py` - Base `BibliographicSource` interface
- `/src/citegeist/sources/registry.py` - Registry for known concrete sources
- `/src/citegeist/sources/crossref.py` - Repaired CrossRef source implementation
- `/src/citegeist/sources/catalog.py` - Open-source inventory
- `/src/citegeist/sources/__init__.py` - Package initialization
- `/tests/test_sources_plugin.py` - Source plugin tests
- `/tests/test_sources_catalog.py` - Source catalog and registry tests
**Completed:**
- ✅ Created `BibliographicSource` abstract base class
- ✅ Repaired `SourceRegistry` so config-backed loading resolves real source classes
- ✅ Fixed `CrossRefSource` normalization for direct lookup and search-style payloads
- ✅ Replaced path-specific compatibility loading with repo-relative loading
- ✅ Added a source catalog that captures current status and next-priority sources
**Features:**
- Abstract interface for source plugins
- Registry for known source discovery and instantiation
- Config-driven enable/disable for known source types
- Source prioritization metadata
- Compatibility with the existing `SourceClient`-based resolver/expander code
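
The features above can be sketched as a small plugin layer. This is a minimal illustration, not the code in `base.py` or `registry.py`: the `lookup` method, the `priority` attribute, the `register`/`load` method names, the stub source classes, and the config layout are all assumptions made for the example.

```python
from __future__ import annotations

from abc import ABC, abstractmethod
from typing import Any


class BibliographicSource(ABC):
    """Hypothetical shape of the abstract source interface."""

    name: str = "base"
    priority: int = 100  # assumed convention: lower value is consulted first

    @abstractmethod
    def lookup(self, identifier: str) -> dict[str, Any] | None:
        """Return a normalized record for one identifier, or None."""


class SourceRegistry:
    """Maps known source names to classes and instantiates the enabled ones."""

    def __init__(self) -> None:
        self._classes: dict[str, type[BibliographicSource]] = {}

    def register(self, cls: type[BibliographicSource]) -> type[BibliographicSource]:
        # Usable as a class decorator; returns the class unchanged.
        self._classes[cls.name] = cls
        return cls

    def load(self, config: dict[str, dict[str, Any]]) -> list[BibliographicSource]:
        sources: list[BibliographicSource] = []
        for name, options in config.items():
            if not options.get("enabled", True):
                continue  # config-driven disable
            if name not in self._classes:
                raise KeyError(f"unknown source type: {name}")
            sources.append(self._classes[name]())
        # Order instances by their prioritization metadata.
        return sorted(sources, key=lambda source: source.priority)


registry = SourceRegistry()


@registry.register
class StubCrossrefSource(BibliographicSource):
    name = "crossref"
    priority = 10

    def lookup(self, identifier: str) -> dict[str, Any] | None:
        return {"doi": identifier, "source": self.name}


@registry.register
class StubOpenAlexSource(BibliographicSource):
    name = "openalex"
    priority = 20

    def lookup(self, identifier: str) -> dict[str, Any] | None:
        return {"doi": identifier, "source": self.name}


# Both sources enabled: instances come back in priority order.
active = registry.load({"openalex": {"enabled": True}, "crossref": {}})
```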
---
## Current Integrated Sources ✅ AVAILABLE
- `Crossref`
- `OpenAlex`
- `OpenCitations`
- `Unpaywall`
- `PubMed`
- `Europe PMC`
- `Semantic Scholar`
- `DataCite`
- `DBLP`
- `arXiv`
- `OAI-PMH`
These are already sufficient for a credible local enrichment-and-discovery workflow. The next work should complement them rather than restart infrastructure underneath them.

---
## Phase 2: Next Source Additions 🚧 IN PROGRESS
**Status:** In Progress
**Priority Order:**
1. `OpenAIRE`, only if the repository-acquisition scope expands
**Completed Deliverables:**
- ✅ OpenCitations adapter for DOI citation/reference lookup
- ✅ OpenCitations graph expansion support in CLI and topic expansion flows
- ✅ Unpaywall adapter for DOI OA-link enrichment
- ✅ `enrich-oa` CLI flow for applying OA metadata to stored entries
- ✅ Europe PMC biomedical resolver/search integration
- ✅ Semantic Scholar broad-science resolver/search integration
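
As a rough illustration of the OA-link enrichment step, the sketch below builds an Unpaywall v2 request URL for a DOI and merges the relevant response fields into a stored entry. It is not the adapter or the `enrich-oa` implementation: the function names and the entry keys (`is_oa`, `oa_status`, `oa_url`) are hypothetical, and the payload fields it reads (`is_oa`, `oa_status`, `best_oa_location`, `url_for_pdf`, `url`) follow the public Unpaywall response format as understood here. No network call is made; the payload is passed in.

```python
from __future__ import annotations

from typing import Any
from urllib.parse import quote

UNPAYWALL_BASE = "https://api.unpaywall.org/v2"


def unpaywall_url(doi: str, email: str) -> str:
    """Build the lookup URL; Unpaywall expects a contact email parameter."""
    return f"{UNPAYWALL_BASE}/{quote(doi, safe='/')}?email={quote(email)}"


def apply_oa_metadata(entry: dict[str, Any], payload: dict[str, Any]) -> dict[str, Any]:
    """Return a copy of `entry` with OA fields merged in (entry keys are hypothetical)."""
    enriched = dict(entry)
    enriched["is_oa"] = bool(payload.get("is_oa", False))
    enriched["oa_status"] = payload.get("oa_status")
    # best_oa_location may be missing or null for closed-access works.
    best = payload.get("best_oa_location") or {}
    link = best.get("url_for_pdf") or best.get("url")
    if link:
        enriched["oa_url"] = link
    return enriched


# Hand-written sample payload, used only to exercise the merge logic.
sample_payload = {
    "is_oa": True,
    "oa_status": "green",
    "best_oa_location": {"url": "https://example.org/landing", "url_for_pdf": None},
}
stored_entry = {"doi": "10.1000/xyz123", "title": "Example"}
enriched_entry = apply_oa_metadata(stored_entry, sample_payload)
```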
**Planned Deliverables:**
- ⏳ Decide whether repository-acquisition breadth needs another dedicated source
**Rationale:**
- `OpenCitations` now improves open citation-edge coverage
- `Unpaywall` now improves access-link enrichment
- `Europe PMC` now improves biomedical metadata and OA/fulltext coverage
- `Semantic Scholar` now improves broader biological and physical sciences coverage
- none of these requires a new database architecture to become useful
---
## Phase 3: Optional Source Evaluation ⏳ PLANNED
**Status:** Planned
- `OpenAIRE`
**Decision Rule:**
- add a candidate only if it solves a concrete discovery or acquisition gap that the current open sources do not already cover well
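
One way to make the decision rule concrete is a set difference over capability labels. The sketch below is illustrative only: the `coverage_gap` helper and every capability label are placeholders invented for the example, not entries from the source catalog and not an assessment of any real source.

```python
from __future__ import annotations


def coverage_gap(candidate: set[str], integrated: dict[str, set[str]]) -> set[str]:
    """Return the candidate's capabilities that no integrated source provides."""
    covered: set[str] = set().union(*integrated.values())
    return candidate - covered


# Placeholder capability labels, not real catalog data.
integrated_sources = {
    "source_a": {"doi-metadata", "citation-edges"},
    "source_b": {"oa-links"},
}
candidate_capabilities = {"oa-links", "repository-records"}

gap = coverage_gap(candidate_capabilities, integrated_sources)
# A non-empty gap is a reason to evaluate the candidate further;
# an empty gap suggests it duplicates existing coverage.
should_evaluate = bool(gap)
```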
---
## Explicitly Deferred
- second-schema redesign work
- pgvector integration
- embedding-first retrieval
- broad canonical-work reconstruction
---
## Summary
**Completed:** scope reframe, source-layer cleanup, and the OpenCitations, Unpaywall, Europe PMC, and Semantic Scholar integrations
**Planned next:** `OpenAIRE` reevaluation
**Deferred:** database/vector expansion work not required by the source question