CiteGeist Multi-Source File Structure
Date: 2026-04-25
Project Structure
/home/netuser/dev/CiteGeist/
├── db/
│   └── migrations/
│       └── 0001_multisource.sql  ✅ NEW - Multi-source schema
│
├── docs/
│   ├── architecture-current.md  ✅ NEW - Current architecture docs
│   ├── implementation-progress.md  ✅ NEW - Implementation progress tracker
│   ├── schema-current.sql  ✅ NEW - Current schema SQL
│   └── file-structure.md  ✅ NEW - This file
│
├── src/citegeist/
│   ├── sources/  ✅ NEW - Source plugin architecture
│   │   ├── __init__.py  ✅ NEW - Package exports
│   │   ├── __all__.py  ✅ NEW - Public API
│   │   ├── base.py  ✅ NEW - Base BibliographicSource class
│   │   ├── registry.py  ✅ NEW - SourceRegistry implementation
│   │   ├── crossref.py  ✅ NEW - CrossRef source plugin
│   │   └── _old_sources_compat.py  ✅ NEW - Backward compatibility
│   │
│   ├── resolver/  ✅ NEW - Identifier resolution
│   │   ├── __init__.py  ✅ NEW - Module exports
│   │   └── identifiers.py  ✅ NEW - Extract, normalize, resolve
│   │
│   ├── db/  ✅ NEW - Database operations
│   │   └── __init__.py  🚧 TO DO - Database client
│   │
│   ├── ... (existing files)
│   ├── sources.py  📦 Existing - Old SourceClient
│   ├── resolve.py  📦 Existing - MetadataResolver
│   └── storage.py  📦 Existing - BibliographyStore
│
└── tests/
    ├── test_sources_plugin.py  ✅ NEW - Source plugin tests
    └── test_resolver_identifiers.py  ✅ NEW - Identifier tests
Module Documentation
New Modules
src/citegeist/sources/
Plugin architecture for bibliographic sources.
Classes:
- BibliographicSource - Abstract base class for source plugins
- SourceRecord - Raw source record dataclass
- CitationEdge - Citation relationship dataclass
- SourceRegistry - Manages source plugins
Plugin:
- CrossRefSource - CrossRef API implementation
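The plugin pattern described here can be illustrated with a minimal, self-contained sketch. It is not the code in base.py or registry.py: the SourceRecord fields, the constructor signature, and the lazy instantiation in get() are assumptions, and InMemorySource is a hypothetical stand-in plugin that reads from a dict instead of calling an API. Only the register(cls, name=..., config=...) and get(name) call shapes are taken from the Quick Start.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass, field
from typing import Any, Dict, Optional, Tuple, Type


@dataclass
class SourceRecord:
    """Raw record returned by a source (field names are assumed)."""
    source: str
    identifier: str
    payload: Dict[str, Any] = field(default_factory=dict)


class BibliographicSource(ABC):
    """Abstract base class that every source plugin subclasses."""

    def __init__(self, config: Optional[Dict[str, Any]] = None) -> None:
        self.config = config or {}

    @abstractmethod
    def lookup_by_doi(self, doi: str) -> Optional[SourceRecord]:
        """Return a record for the DOI, or None if the source has no match."""


class SourceRegistry:
    """Maps a source name to a plugin instance created on first use."""

    def __init__(self) -> None:
        self._factories: Dict[str, Tuple[Type[BibliographicSource], Dict[str, Any]]] = {}
        self._instances: Dict[str, BibliographicSource] = {}

    def register(self, cls: Type[BibliographicSource], name: str,
                 config: Optional[Dict[str, Any]] = None) -> None:
        if not issubclass(cls, BibliographicSource):
            raise TypeError(f"{cls.__name__} is not a BibliographicSource")
        self._factories[name] = (cls, config or {})
        self._instances.pop(name, None)  # drop any stale instance

    def get(self, name: str) -> BibliographicSource:
        if name not in self._instances:
            cls, config = self._factories[name]  # KeyError for unknown names
            self._instances[name] = cls(config)
        return self._instances[name]


class InMemorySource(BibliographicSource):
    """Hypothetical plugin backed by a dict, used here instead of a real API."""

    def lookup_by_doi(self, doi: str) -> Optional[SourceRecord]:
        payload = self.config.get("records", {}).get(doi)
        if payload is None:
            return None
        return SourceRecord(source="memory", identifier=doi, payload=payload)
```

Registering a class rather than an instance lets the registry defer construction until a source is first requested, which would matter for plugins that open network sessions.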
src/citegeist/resolver/
Identifier extraction, normalization, and resolution.
Classes:
- IdentifierExtractor - Extract identifiers from entry fields
- IdentifierNormalizer - Normalize identifiers to canonical form
- IdentifierResolver - Resolve identifiers with lookup priority
Functions:
- extract_identifiers() - Quick identifier extraction
- normalize_identifier() - Quick normalization
- get_primary_identifier() - Get primary identifier
- resolve_identifiers() - Resolve all identifiers
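As a rough illustration of what extraction, normalization, and priority-based selection might involve, here is a small sketch. The helper names normalize, extract, and primary are hypothetical and only loosely mirror the functions listed above. The priority order, the supported schemes, and the normalization rules are all assumptions, not the behavior of identifiers.py.

```python
import re
from typing import Dict, List, Optional, Tuple

# Assumed lookup priority: the most specific identifier scheme comes first.
PRIORITY = ("doi", "pmid", "arxiv", "isbn")

# Matches resolver URL prefixes and a leading "doi:" label.
_DOI_PREFIX = re.compile(r"^(?:https?://(?:dx\.)?doi\.org/|doi:\s*)", re.IGNORECASE)


def normalize(scheme: str, value: str) -> str:
    """Reduce an identifier to one canonical spelling (rules are assumed)."""
    value = value.strip()
    if scheme == "doi":
        # DOIs are case-insensitive, so lower-casing is a safe canonical form.
        return _DOI_PREFIX.sub("", value).lower()
    if scheme == "isbn":
        return re.sub(r"[\s-]", "", value).upper()
    if scheme == "arxiv":
        return re.sub(r"^arxiv:\s*", "", value, flags=re.IGNORECASE)
    return value


def extract(fields: Dict[str, str]) -> List[Tuple[str, str]]:
    """Return (scheme, value) pairs found in the fields, in priority order."""
    found = []
    for scheme in PRIORITY:
        raw = fields.get(scheme)
        if raw and raw.strip():
            found.append((scheme, normalize(scheme, raw)))
    return found


def primary(fields: Dict[str, str]) -> Optional[Tuple[str, str]]:
    """Return the highest-priority identifier, or None if there is none."""
    identifiers = extract(fields)
    return identifiers[0] if identifiers else None
```

Normalizing before comparison is what lets `https://doi.org/10.1234/ABC` and `doi:10.1234/abc` be treated as the same work.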
src/citegeist/db/
Database operations (to be implemented).
Planned:
- Database client for works table
- Migration runner
- Query builders
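One possible shape for the planned migration runner is sketched below. It uses the standard-library sqlite3 module purely as a stand-in so the example runs anywhere; the real target database, the schema_migrations bookkeeping table, and the apply_migrations function name are all assumptions.

```python
import sqlite3
from pathlib import Path
from typing import List


def apply_migrations(conn: sqlite3.Connection, migrations_dir: Path) -> List[str]:
    """Apply unapplied *.sql files in filename order; return the names applied."""
    # Hypothetical bookkeeping table that records which files have already run.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS schema_migrations ("
        "name TEXT PRIMARY KEY, applied_at TEXT DEFAULT CURRENT_TIMESTAMP)"
    )
    done = {row[0] for row in conn.execute("SELECT name FROM schema_migrations")}

    applied = []
    # Zero-padded prefixes such as 0001_ make lexical order match apply order.
    for path in sorted(migrations_dir.glob("*.sql")):
        if path.name in done:
            continue
        conn.executescript(path.read_text(encoding="utf-8"))
        conn.execute("INSERT INTO schema_migrations (name) VALUES (?)", (path.name,))
        conn.commit()
        applied.append(path.name)
    return applied
```

Running the function twice is safe because applied file names are skipped. The sketch does not wrap each script and its bookkeeping row in a single transaction, which a production runner would likely want.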
db/migrations/0001_multisource.sql
Multi-source database schema migration.
Tables:
- works - Canonical work metadata
- work_identifiers - Multi-scheme identifiers
- source_records - Raw API responses
- citations - Citation graph
- work_embeddings - Vector embeddings
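To show how a canonical works table and a multi-scheme work_identifiers table could fit together, the sketch below builds a reduced two-table version in an in-memory SQLite database. Only the table names come from the list above. Every column, the uniqueness constraint, and the find_work helper are assumptions, and the actual migration may differ substantially.

```python
import sqlite3
from typing import Optional, Tuple

# Assumed, heavily reduced columns; the real 0001_multisource.sql may differ.
SCHEMA = """
CREATE TABLE works (
    id INTEGER PRIMARY KEY,
    title TEXT NOT NULL
);
CREATE TABLE work_identifiers (
    work_id INTEGER NOT NULL REFERENCES works(id),
    scheme TEXT NOT NULL,
    value TEXT NOT NULL,
    UNIQUE (scheme, value)
);
"""


def build_demo_db() -> sqlite3.Connection:
    """Create an in-memory database holding one work with two identifiers."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.execute("INSERT INTO works (id, title) VALUES (1, 'Test Title')")
    conn.executemany(
        "INSERT INTO work_identifiers (work_id, scheme, value) VALUES (?, ?, ?)",
        [(1, "doi", "10.1234/example"), (1, "arxiv", "2401.00001")],
    )
    conn.commit()
    return conn


def find_work(conn: sqlite3.Connection, scheme: str, value: str) -> Optional[Tuple[int, str]]:
    """Look up a work by any one of its identifiers; None if nothing matches."""
    return conn.execute(
        "SELECT w.id, w.title FROM works AS w "
        "JOIN work_identifiers AS i ON i.work_id = w.id "
        "WHERE i.scheme = ? AND i.value = ?",
        (scheme, value),
    ).fetchone()
```

Keeping identifiers in their own table means a single work can be reached through its DOI, arXiv ID, or any other scheme without adding one column per scheme.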
Existing Modules (Preserved)
- src/citegeist/sources.py - Old SourceClient (backward compatible)
- src/citegeist/resolve.py - Old MetadataResolver
- src/citegeist/storage.py - Old BibliographyStore
Test Coverage
New Tests:
- tests/test_sources_plugin.py (7 tests)
- tests/test_resolver_identifiers.py (17 tests)
Total: 24 tests passing
Dependencies
New Dependencies Required:
- No new Python packages (uses stdlib only)
Planned Dependencies (Future phases):
- pgvector - PostgreSQL vector extension
- sentence-transformers - Local embedding model
- fastapi - API framework
- unpaywall - OA link retrieval (if needed)
Implementation Status
Completed (100%)
- ✅ Phase 0: Baseline Audit
- ✅ Phase 1: Source Plugin Architecture
- ✅ Phase 2: Identifier Resolution Layer
In Progress (50%)
- 🚧 Phase 3: Database Schema Upgrade
Pending (0%)
- ⏳ Phase 4: High-Value Source Integrations
- ⏳ Phase 5: Merge & Deduplication Engine
- ⏳ Phase 6: Citation Graph Construction
- ⏳ Phase 7: Embedding Pipeline
- ⏳ Phase 8: Full-Text Retrieval Layer
- ⏳ Phase 9: API Layer
- ⏳ Phase 10: Ranking & Relevance
- ⏳ Phase 12: Observability & QA
- ⏳ Phase 13: Performance Optimization
Quick Start
# Register a source
from citegeist.sources import SourceRegistry, CrossRefSource
registry = SourceRegistry()
registry.register(CrossRefSource, name='crossref', config={})
# Get source instance
source = registry.get('crossref')
entry = source.lookup_by_doi('10.1234/example')
# Resolve identifiers
from citegeist.resolver import resolve_identifiers
fields = {'doi': '10.1234/example', 'title': 'Test Title'}
resolved = resolve_identifiers(fields)
# Returns [('doi', '10.1234/example'), ('title', 'test title')]
Next Steps
- ✅ Phase 0-2: Complete
- 🚧 Phase 3: Implement Python interface for database operations
- ⏳ Phase 4: Add Unpaywall, Semantic Scholar, OpenCitations integrations
- ⏳ Phase 5: Build merge engine