# CiteGeist Sources-First Progress

**Last Updated:** 2026-04-25

This document tracks the refocused plan for source incorporation. The working question is which additional open bibliographic sources CiteGeist should integrate next, not whether it needs a new storage platform first.

---
## Phase 0: Scope Reframe ✅ COMPLETE

**Status:** Completed

**Deliverables:**
- ✅ `/docs/source-landscape.md` - source inventory and recommendation document
- ✅ `/src/citegeist/sources/catalog.py` - code-backed source catalog

**Completed:**
- Identified which source integrations already exist in the repository
- Split source-expansion planning from database/vector-search ambitions
- Prioritized open-source additions by workflow value

---
## Phase 1: Source Layer Tightening ✅ COMPLETE

**Status:** Completed

**Deliverables:**
- ✅ `/src/citegeist/sources/base.py` - Base `BibliographicSource` interface
- ✅ `/src/citegeist/sources/registry.py` - Registry for known concrete sources
- ✅ `/src/citegeist/sources/crossref.py` - Repaired CrossRef source implementation
- ✅ `/src/citegeist/sources/catalog.py` - Open-source inventory
- ✅ `/src/citegeist/sources/__init__.py` - Package initialization
- ✅ `/tests/test_sources_plugin.py` - Source plugin tests
- ✅ `/tests/test_sources_catalog.py` - Source catalog and registry tests

**Completed:**
- ✅ Created `BibliographicSource` abstract base class
- ✅ Repaired `SourceRegistry` so config-backed loading resolves real source classes
- ✅ Fixed `CrossRefSource` normalization for direct lookup and search-style payloads
- ✅ Replaced path-specific compatibility loading with repo-relative loading
- ✅ Added a source catalog that captures current status and next-priority sources

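The `CrossRefSource` normalization item above concerns two payload shapes. In the Crossref REST API, a direct DOI lookup returns a single work under `message`, while a search-style response returns a list under `message.items`. The following is a minimal sketch of folding both shapes into one list of records. The function name `normalize_crossref_payload` and the output keys are hypothetical and are not taken from the repository's actual implementation.

```python
from typing import Any


def normalize_crossref_payload(payload: dict[str, Any]) -> list[dict[str, Any]]:
    """Return a flat list of simplified records from either payload shape."""
    message = payload.get("message") or {}
    # Search-style responses nest works under message["items"];
    # direct lookups carry the single work in message itself.
    if isinstance(message.get("items"), list):
        works = message["items"]
    elif message:
        works = [message]
    else:
        works = []

    records = []
    for work in works:
        titles = work.get("title") or []  # Crossref titles arrive as a list
        records.append(
            {
                "doi": (work.get("DOI") or "").lower() or None,
                "title": titles[0] if titles else None,
                "type": work.get("type"),
            }
        )
    return records


# Illustrative payloads (made-up DOIs), one per shape.
direct = {"message": {"DOI": "10.1000/ABC", "title": ["A Paper"], "type": "journal-article"}}
search = {"message": {"items": [{"DOI": "10.1000/xyz", "title": ["Other"]}]}}

direct_records = normalize_crossref_payload(direct)
search_records = normalize_crossref_payload(search)
```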
**Features:**
- Abstract interface for source plugins
- Registry for known source discovery and instantiation
- Config-driven enable/disable for known source types
- Source prioritization metadata
- Compatibility with the existing `SourceClient`-based resolver/expander code
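The features above can be sketched as an abstract base class plus a registry that instantiates only the sources a config enables. This is an illustration under stated assumptions, not the code in `base.py` or `registry.py`: the `lookup` method, the `priority` attribute, the `register` and `load_enabled` names, and the config shape are all hypothetical. Only the class names `BibliographicSource` and `SourceRegistry` come from this document.

```python
from abc import ABC, abstractmethod
from typing import Any, Optional


class BibliographicSource(ABC):
    """Hypothetical shape of the source-plugin interface."""

    name: str = ""
    priority: int = 100  # lower sorts first; illustrative metadata only

    @abstractmethod
    def lookup(self, doi: str) -> Optional[dict[str, Any]]:
        """Return one record for a DOI, or None when there is no match."""


class SourceRegistry:
    """Maps known source names to classes and builds the enabled ones."""

    def __init__(self) -> None:
        self._classes: dict[str, type[BibliographicSource]] = {}

    def register(self, cls: type[BibliographicSource]) -> type[BibliographicSource]:
        # Usable as a class decorator; returns the class unchanged.
        self._classes[cls.name] = cls
        return cls

    def load_enabled(self, config: dict[str, dict[str, Any]]) -> list[BibliographicSource]:
        sources = []
        for name, options in config.items():
            if not options.get("enabled", False):
                continue  # disabled entries are skipped before any lookup
            if name not in self._classes:
                raise KeyError(f"unknown source type: {name}")
            sources.append(self._classes[name]())
        return sorted(sources, key=lambda source: source.priority)


registry = SourceRegistry()


@registry.register
class StubCrossref(BibliographicSource):
    name = "crossref"
    priority = 10

    def lookup(self, doi: str) -> Optional[dict[str, Any]]:
        return {"doi": doi, "source": self.name}


@registry.register
class StubOpenAlex(BibliographicSource):
    name = "openalex"
    priority = 20

    def lookup(self, doi: str) -> Optional[dict[str, Any]]:
        return None


enabled = registry.load_enabled(
    {
        "openalex": {"enabled": True},
        "crossref": {"enabled": True},
        "dblp": {"enabled": False},
    }
)
enabled_names = [source.name for source in enabled]
```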
---
## Current Integrated Sources ✅ AVAILABLE

- `Crossref`
- `OpenAlex`
- `OpenCitations`
- `Unpaywall`
- `PubMed`
- `Europe PMC`
- `Semantic Scholar`
- `DataCite`
- `DBLP`
- `arXiv`
- `OAI-PMH`

These are already sufficient for a credible local enrichment-and-discovery workflow. The next work should complement them rather than restart infrastructure underneath them.

---
## Phase 2: Next Source Additions 🚧 IN PROGRESS

**Status:** In Progress

**Priority Order:**
1. `OpenAIRE` only if repository-acquisition scope expands

**Completed Deliverables:**
- ✅ OpenCitations adapter for DOI citation/reference lookup
- ✅ OpenCitations graph expansion support in CLI and topic expansion flows
- ✅ Unpaywall adapter for DOI OA-link enrichment
- ✅ `enrich-oa` CLI flow for applying OA metadata to stored entries
- ✅ Europe PMC biomedical resolver/search integration
- ✅ Semantic Scholar broad-science resolver/search integration

**Planned Deliverables:**
- ⏳ Decide whether repository-acquisition breadth needs another dedicated source

**Rationale:**
- `OpenCitations` now improves open citation-edge coverage
- `Unpaywall` now improves access-link enrichment
- `Europe PMC` now improves biomedical metadata and OA/fulltext coverage
- `Semantic Scholar` now improves broader biological and physical sciences coverage
- none of these requires a new database architecture to become useful
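As a rough illustration of the DOI-keyed OA-link enrichment listed under Completed Deliverables, the sketch below builds an Unpaywall lookup URL and applies an Unpaywall-style payload to a stored entry. It assumes the public Unpaywall v2 endpoint and its `is_oa`, `oa_status`, and `best_oa_location` response fields. The helper names and the entry keys (`oa_url` and so on) are hypothetical and do not describe how the `enrich-oa` flow is actually implemented. No network call is made here.

```python
from typing import Any
from urllib.parse import quote, urlencode

UNPAYWALL_BASE = "https://api.unpaywall.org/v2/"


def unpaywall_url(doi: str, email: str) -> str:
    """Build the lookup URL; Unpaywall expects a contact email parameter."""
    query = urlencode({"email": email})
    return f"{UNPAYWALL_BASE}{quote(doi, safe='/')}?{query}"


def apply_oa_metadata(entry: dict[str, Any], payload: dict[str, Any]) -> dict[str, Any]:
    """Return a copy of entry with OA fields filled from the payload."""
    best = payload.get("best_oa_location") or {}
    enriched = dict(entry)  # leave the stored entry untouched
    enriched["is_oa"] = bool(payload.get("is_oa"))
    enriched["oa_status"] = payload.get("oa_status")
    # Prefer a direct PDF link, then fall back to the landing page.
    enriched["oa_url"] = best.get("url_for_pdf") or best.get("url")
    return enriched


# Illustrative entry and payloads (made-up DOI and URLs).
entry = {"doi": "10.1000/abc", "title": "A Paper"}
open_payload = {
    "is_oa": True,
    "oa_status": "green",
    "best_oa_location": {"url": "https://repo.example/1", "url_for_pdf": None},
}
closed_payload = {"is_oa": False, "oa_status": "closed", "best_oa_location": None}

lookup_url = unpaywall_url(entry["doi"], "me@example.org")
open_entry = apply_oa_metadata(entry, open_payload)
closed_entry = apply_oa_metadata(entry, closed_payload)
```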
---
## Phase 3: Optional Source Evaluation ⏳ PLANNED

**Status:** Planned

- `OpenAIRE`

**Decision Rule:**
- add a candidate source only if it solves a concrete discovery or acquisition gap that the current open sources do not already cover well
---
## Explicitly Deferred

- second-schema redesign work
- pgvector integration
- embedding-first retrieval
- broad canonical-work reconstruction
---
## Summary

**Completed:** scope reframe and source-layer cleanup
**Planned next:** `OpenAIRE` reevaluation
**Deferred:** database/vector expansion work not required by the source question