# CiteGeist Multi-Source File Structure

**Date:** 2026-04-25

## Project Structure

```
/home/netuser/dev/CiteGeist/
├── db/
│   └── migrations/
│       └── 0001_multisource.sql      ✅ NEW - Multi-source schema
│
├── docs/
│   ├── architecture-current.md       ✅ NEW - Current architecture docs
│   ├── implementation-progress.md    ✅ NEW - Implementation progress tracker
│   ├── schema-current.sql            ✅ NEW - Current schema SQL
│   └── file-structure.md             ✅ NEW - This file
│
├── src/citegeist/
│   ├── sources/                      ✅ NEW - Source plugin architecture
│   │   ├── __init__.py               ✅ NEW - Package exports
│   │   ├── __all__.py                ✅ NEW - Public API
│   │   ├── base.py                   ✅ NEW - Base BibliographicSource class
│   │   ├── registry.py               ✅ NEW - SourceRegistry implementation
│   │   ├── crossref.py               ✅ NEW - CrossRef source plugin
│   │   └── _old_sources_compat.py    ✅ NEW - Backward compatibility
│   │
│   ├── resolver/                     ✅ NEW - Identifier resolution
│   │   ├── __init__.py               ✅ NEW - Module exports
│   │   └── identifiers.py            ✅ NEW - Extract, normalize, resolve
│   │
│   ├── db/                           ✅ NEW - Database operations
│   │   └── __init__.py               🚧 TO DO - Database client
│   │
│   ├── ... (existing files)
│   ├── sources.py                    📦 Existing - Old SourceClient
│   ├── resolve.py                    📦 Existing - MetadataResolver
│   └── storage.py                    📦 Existing - BibliographyStore
│
└── tests/
    ├── test_sources_plugin.py        ✅ NEW - Source plugin tests
    └── test_resolver_identifiers.py  ✅ NEW - Identifier tests
```

## Module Documentation

### New Modules

#### `src/citegeist/sources/`

Plugin architecture for bibliographic sources.

**Classes:**
- `BibliographicSource` - Abstract base class for source plugins
- `SourceRecord` - Raw source record dataclass
- `CitationEdge` - Citation relationship dataclass
- `SourceRegistry` - Manages source plugins

**Plugin:**
- `CrossRefSource` - CrossRef API implementation

#### `src/citegeist/resolver/`

Identifier extraction, normalization, and resolution.
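
As an illustration of what normalizing an identifier to a canonical form might involve, here is a minimal sketch for DOIs. It is not the resolver's actual code: `normalize_doi_sketch` and its prefix rules are hypothetical, chosen only to show the idea (DOIs are case-insensitive, so lowercasing is a safe canonical form).

```python
import re

# Hypothetical helper, not part of citegeist: matches common DOI prefixes
# such as "https://doi.org/", "http://dx.doi.org/", and "doi:".
_DOI_PREFIX = re.compile(r"^(?:https?://(?:dx\.)?doi\.org/|doi:\s*)", re.IGNORECASE)


def normalize_doi_sketch(raw: str) -> str:
    """Return the DOI lowercased, without surrounding whitespace or a prefix."""
    value = raw.strip()
    value = _DOI_PREFIX.sub("", value)
    return value.lower()


print(normalize_doi_sketch("https://doi.org/10.1234/Example"))
print(normalize_doi_sketch("DOI: 10.1234/example"))
```

Both calls print the same canonical string, which is the property a deduplication or lookup step would rely on.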

**Classes:**
- `IdentifierExtractor` - Extract identifiers from entry fields
- `IdentifierNormalizer` - Normalize identifiers to canonical form
- `IdentifierResolver` - Resolve identifiers with lookup priority

**Functions:**
- `extract_identifiers()` - Quick identifier extraction
- `normalize_identifier()` - Quick normalization
- `get_primary_identifier()` - Get primary identifier
- `resolve_identifiers()` - Resolve all identifiers

#### `src/citegeist/db/`

Database operations (to be implemented).

**Planned:**
- Database client for works table
- Migration runner
- Query builders

#### `db/migrations/0001_multisource.sql`

Multi-source database schema migration.

**Tables:**
1. `works` - Canonical work metadata
2. `work_identifiers` - Multi-scheme identifiers
3. `source_records` - Raw API responses
4. `citations` - Citation graph
5. `work_embeddings` - Vector embeddings

### Existing Modules (Preserved)

- `src/citegeist/sources.py` - Old SourceClient (backward compatible)
- `src/citegeist/resolve.py` - Old MetadataResolver
- `src/citegeist/storage.py` - Old BibliographyStore

## Test Coverage

**New Tests:**
- `tests/test_sources_plugin.py` (7 tests)
- `tests/test_resolver_identifiers.py` (17 tests)

**Total:** 24 tests passing

## Dependencies

**New Dependencies Required:**
- No new Python packages (uses stdlib only)

**Planned Dependencies (Future phases):**
- `pgvector` - PostgreSQL vector extension
- `sentence-transformers` - Local embedding model
- `fastapi` - API framework
- `unpaywall` - OA link retrieval (if needed)

## Implementation Status

### Completed (100%)
- ✅ Phase 0: Baseline Audit
- ✅ Phase 1: Source Plugin Architecture
- ✅ Phase 2: Identifier Resolution Layer

### In Progress (50%)
- 🚧 Phase 3: Database Schema Upgrade

### Pending (0%)
- ⏳ Phase 4: High-Value Source Integrations
- ⏳ Phase 5: Merge & Deduplication Engine
- ⏳ Phase 6: Citation Graph Construction
- ⏳ Phase 7: Embedding Pipeline
- ⏳ Phase 8: Full-Text Retrieval Layer
- ⏳ Phase 9: API Layer
- ⏳ Phase 10: Ranking & Relevance
- ⏳ Phase 12: Observability & QA
- ⏳ Phase 13: Performance Optimization

## Quick Start

```python
# Register a source
from citegeist.sources import SourceRegistry, CrossRefSource

registry = SourceRegistry()
registry.register(CrossRefSource, name='crossref', config={})

# Get source instance
source = registry.get('crossref')
entry = source.lookup_by_doi('10.1234/example')

# Resolve identifiers
from citegeist.resolver import resolve_identifiers

fields = {'doi': '10.1234/example', 'title': 'Test Title'}
resolved = resolve_identifiers(fields)
# Returns [('doi', '10.1234/example'), ('title', 'test title')]
```

## Next Steps

1. ✅ Phase 0-2: Complete
2. 🚧 Phase 3: Implement Python interface for database operations
3. ⏳ Phase 4: Add Unpaywall, Semantic Scholar, OpenCitations integrations
4. ⏳ Phase 5: Build merge engine
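
## Appendix: Plugin Pattern Sketch

The plugin architecture under `src/citegeist/sources/` pairs an abstract base class with a registry. The following is a minimal, self-contained sketch of that pattern, assuming a `register(cls, name=..., config=...)` / `get(name)` interface like the one used in Quick Start. `SourceSketch`, `RegistrySketch`, and `EchoSource` are hypothetical names; the real `BibliographicSource` and `SourceRegistry` may be structured differently.

```python
from abc import ABC, abstractmethod
from typing import Any, Dict, Optional, Tuple, Type


class SourceSketch(ABC):
    """Hypothetical stand-in for an abstract source plugin base class."""

    def __init__(self, config: Dict[str, Any]) -> None:
        self.config = config

    @abstractmethod
    def lookup_by_doi(self, doi: str) -> Optional[Dict[str, Any]]:
        """Return a metadata dict for the DOI, or None if it is unknown."""


class RegistrySketch:
    """Hypothetical registry: stores plugin classes, instantiates on first get()."""

    def __init__(self) -> None:
        self._classes: Dict[str, Tuple[Type[SourceSketch], Dict[str, Any]]] = {}
        self._instances: Dict[str, SourceSketch] = {}

    def register(
        self,
        cls: Type[SourceSketch],
        name: str,
        config: Optional[Dict[str, Any]] = None,
    ) -> None:
        if name in self._classes:
            raise ValueError(f"source already registered: {name}")
        self._classes[name] = (cls, dict(config or {}))

    def get(self, name: str) -> SourceSketch:
        # Instantiate lazily and cache, so repeated calls share one instance.
        if name not in self._instances:
            cls, config = self._classes[name]  # raises KeyError if unregistered
            self._instances[name] = cls(config)
        return self._instances[name]


class EchoSource(SourceSketch):
    """Toy plugin that answers from an in-memory table instead of a remote API."""

    def lookup_by_doi(self, doi: str) -> Optional[Dict[str, Any]]:
        return self.config.get("records", {}).get(doi)


registry = RegistrySketch()
registry.register(
    EchoSource,
    name="echo",
    config={"records": {"10.1234/example": {"title": "Test"}}},
)
source = registry.get("echo")
print(source.lookup_by_doi("10.1234/example"))
```

Registering classes rather than instances keeps configuration in one place and defers construction until a source is actually requested, which is one reasonable way to avoid opening API clients that are never used.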