# CiteGeist Multi-Source File Structure

**Date:** 2026-04-25

## Project Structure

```
/home/netuser/dev/CiteGeist/
├── db/
│   └── migrations/
│       └── 0001_multisource.sql          ✅ NEW - Multi-source schema
│
├── docs/
│   ├── architecture-current.md           ✅ NEW - Current architecture docs
│   ├── implementation-progress.md        ✅ NEW - Implementation progress tracker
│   ├── schema-current.sql                ✅ NEW - Current schema SQL
│   └── file-structure.md                 ✅ NEW - This file
│
├── src/citegeist/
│   ├── sources/                          ✅ NEW - Source plugin architecture
│   │   ├── __init__.py                   ✅ NEW - Package exports
│   │   ├── __all__.py                    ✅ NEW - Public API
│   │   ├── base.py                       ✅ NEW - Base BibliographicSource class
│   │   ├── registry.py                   ✅ NEW - SourceRegistry implementation
│   │   ├── crossref.py                   ✅ NEW - CrossRef source plugin
│   │   └── _old_sources_compat.py        ✅ NEW - Backward compatibility
│   │
│   ├── resolver/                         ✅ NEW - Identifier resolution
│   │   ├── __init__.py                   ✅ NEW - Module exports
│   │   └── identifiers.py                ✅ NEW - Extract, normalize, resolve
│   │
│   ├── db/                               ✅ NEW - Database operations
│   │   └── __init__.py                   🚧 TO DO - Database client
│   │
│   ├── ... (existing files)
│   ├── sources.py                        📦 Existing - Old SourceClient
│   ├── resolve.py                        📦 Existing - MetadataResolver
│   └── storage.py                        📦 Existing - BibliographyStore
│
└── tests/
    ├── test_sources_plugin.py            ✅ NEW - Source plugin tests
    └── test_resolver_identifiers.py      ✅ NEW - Identifier tests
```

## Module Documentation

### New Modules

#### `src/citegeist/sources/`
Plugin architecture for bibliographic sources.

**Classes:**
- `BibliographicSource` - Abstract base class for source plugins
- `SourceRecord` - Raw source record dataclass
- `CitationEdge` - Citation relationship dataclass
- `SourceRegistry` - Manages source plugins

**Plugin:**
- `CrossRefSource` - CrossRef API implementation

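
To make the plugin pattern concrete, here is a minimal, self-contained sketch of how a base class, a record dataclass, and a registry could fit together. It is not the code in `base.py` or `registry.py`: the `SourceRecord` fields, the lazy instantiation in `get()`, and the `InMemorySource` plugin are all assumptions made for illustration. Only the class names and the `register(cls, name=..., config=...)` / `get(name)` / `lookup_by_doi(doi)` call shapes come from this document.

```python
from __future__ import annotations

from abc import ABC, abstractmethod
from dataclasses import dataclass, field
from typing import Any, Optional


@dataclass
class SourceRecord:
    """Raw record returned by one source (field names are assumed)."""
    source: str
    identifier: str
    payload: dict[str, Any] = field(default_factory=dict)


class BibliographicSource(ABC):
    """Interface each source plugin implements (assumed shape)."""

    def __init__(self, config: Optional[dict[str, Any]] = None) -> None:
        self.config = config or {}

    @abstractmethod
    def lookup_by_doi(self, doi: str) -> Optional[SourceRecord]:
        """Return the record for a DOI, or None if the source has none."""


class SourceRegistry:
    """Maps a plugin name to a plugin instance created on first use."""

    def __init__(self) -> None:
        self._registered: dict[str, tuple[type[BibliographicSource], dict[str, Any]]] = {}
        self._instances: dict[str, BibliographicSource] = {}

    def register(self, cls: type[BibliographicSource], name: str,
                 config: Optional[dict[str, Any]] = None) -> None:
        if name in self._registered:
            raise ValueError(f"source already registered: {name}")
        self._registered[name] = (cls, config or {})

    def get(self, name: str) -> BibliographicSource:
        # Instantiate once, then reuse the same object on later calls.
        if name not in self._instances:
            cls, config = self._registered[name]  # KeyError for unknown names
            self._instances[name] = cls(config)
        return self._instances[name]


class InMemorySource(BibliographicSource):
    """Hypothetical offline plugin, used here so the sketch needs no network."""

    def lookup_by_doi(self, doi: str) -> Optional[SourceRecord]:
        payload = self.config.get("records", {}).get(doi)
        if payload is None:
            return None
        return SourceRecord(source="memory", identifier=doi, payload=payload)


registry = SourceRegistry()
registry.register(
    InMemorySource,
    name="memory",
    config={"records": {"10.1234/example": {"title": "Test"}}},
)
record = registry.get("memory").lookup_by_doi("10.1234/example")
```

Registering the class rather than an instance keeps construction cheap until a source is actually needed. Whether the real `SourceRegistry` does the same is not stated here.
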
#### `src/citegeist/resolver/`
Identifier extraction, normalization, and resolution.

**Classes:**
- `IdentifierExtractor` - Extract identifiers from entry fields
- `IdentifierNormalizer` - Normalize identifiers to canonical form
- `IdentifierResolver` - Resolve identifiers with lookup priority

**Functions:**
- `extract_identifiers()` - Quick identifier extraction
- `normalize_identifier()` - Quick normalization
- `get_primary_identifier()` - Get primary identifier
- `resolve_identifiers()` - Resolve all identifiers

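
As a rough illustration of "normalize, then resolve by lookup priority", the sketch below canonicalizes a few identifier schemes and returns them in priority order. The function names `normalize_sketch` and `resolve_sketch`, the `PRIORITY` order, and the specific normalization rules are assumptions for this example and may not match `identifiers.py`. The one rule taken as fact is that DOIs are case-insensitive, so lowercasing them is safe.

```python
import re

# Assumed lookup priority; the real order in identifiers.py may differ.
PRIORITY = ("doi", "pmid", "arxiv", "isbn", "title")

# Strips common DOI prefixes such as "https://doi.org/" or "doi:".
_DOI_PREFIX = re.compile(r"^(?:https?://(?:dx\.)?doi\.org/|doi:)\s*", re.IGNORECASE)


def normalize_sketch(scheme: str, value: str) -> str:
    """Return a canonical form of one identifier (illustrative rules only)."""
    value = value.strip()
    if scheme == "doi":
        return _DOI_PREFIX.sub("", value).lower()
    if scheme == "isbn":
        # Drop separators; keep a trailing check character "X" uppercase.
        return value.replace("-", "").replace(" ", "").upper()
    if scheme == "title":
        # Lowercase and collapse runs of whitespace.
        return " ".join(value.lower().split())
    return value


def resolve_sketch(fields: dict) -> list:
    """Return (scheme, normalized value) pairs, highest priority first."""
    resolved = []
    for scheme in PRIORITY:
        raw = fields.get(scheme)
        if raw and raw.strip():
            resolved.append((scheme, normalize_sketch(scheme, raw)))
    return resolved


pairs = resolve_sketch(
    {"title": "  Test   Title ", "doi": "https://doi.org/10.1234/Example"}
)
```

Here `pairs` lists the DOI before the title regardless of the order of keys in the input, because the loop walks `PRIORITY` rather than the dictionary.
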
#### `src/citegeist/db/`
Database operations (to be implemented).

**Planned:**
- Database client for works table
- Migration runner
- Query builders

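
Since the migration runner is still only planned, the following is one possible shape for it, not a description of existing code. It applies `*.sql` files in filename order and records each applied file in a bookkeeping table so reruns are no-ops. The function name `apply_migrations`, the `schema_migrations` table, and the use of the standard-library `sqlite3` module as a stand-in database are all assumptions. The planned `pgvector` dependency suggests the real target is PostgreSQL, which would need a different driver.

```python
import sqlite3
import tempfile
from pathlib import Path


def apply_migrations(conn: sqlite3.Connection, migrations_dir: Path) -> list:
    """Apply unapplied *.sql files in filename order; return the names applied."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS schema_migrations (name TEXT PRIMARY KEY)"
    )
    applied = {row[0] for row in conn.execute("SELECT name FROM schema_migrations")}
    newly_applied = []
    for path in sorted(migrations_dir.glob("*.sql")):
        if path.name in applied:
            continue  # already recorded, skip
        conn.executescript(path.read_text(encoding="utf-8"))
        conn.execute(
            "INSERT INTO schema_migrations (name) VALUES (?)", (path.name,)
        )
        conn.commit()
        newly_applied.append(path.name)
    return newly_applied


# Demonstration with a throwaway migration file and an in-memory database.
with tempfile.TemporaryDirectory() as tmp:
    migrations = Path(tmp)
    (migrations / "0001_demo.sql").write_text(
        "CREATE TABLE works (id INTEGER PRIMARY KEY, title TEXT);",
        encoding="utf-8",
    )
    conn = sqlite3.connect(":memory:")
    first_run = apply_migrations(conn, migrations)
    second_run = apply_migrations(conn, migrations)
    conn.close()
```

Zero-padded prefixes such as `0001_` make lexical order match the intended order. One caveat of this sketch: the script and its bookkeeping row are not written atomically, so a production runner would likely wrap both in a single transaction.
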
#### `db/migrations/0001_multisource.sql`
Multi-source database schema migration.

**Tables:**
1. `works` - Canonical work metadata
2. `work_identifiers` - Multi-scheme identifiers
3. `source_records` - Raw API responses
4. `citations` - Citation graph
5. `work_embeddings` - Vector embeddings

### Existing Modules (Preserved)

- `src/citegeist/sources.py` - Old SourceClient (backward compatible)
- `src/citegeist/resolve.py` - Old MetadataResolver
- `src/citegeist/storage.py` - Old BibliographyStore

## Test Coverage

**New Tests:**
- `tests/test_sources_plugin.py` (7 tests)
- `tests/test_resolver_identifiers.py` (17 tests)

**Total:** 24 tests passing

## Dependencies

**New Dependencies Required:**
- None. The new modules use only the Python standard library.

**Planned Dependencies (future phases):**
- `pgvector` - PostgreSQL vector extension
- `sentence-transformers` - Local embedding model
- `fastapi` - API framework
- `unpaywall` - Open-access (OA) link retrieval (if needed)

## Implementation Status

### Completed (100%)
- ✅ Phase 0: Baseline Audit
- ✅ Phase 1: Source Plugin Architecture
- ✅ Phase 2: Identifier Resolution Layer

### In Progress (50%)
- 🚧 Phase 3: Database Schema Upgrade

### Pending (0%)
- ⏳ Phase 4: High-Value Source Integrations
- ⏳ Phase 5: Merge & Deduplication Engine
- ⏳ Phase 6: Citation Graph Construction
- ⏳ Phase 7: Embedding Pipeline
- ⏳ Phase 8: Full-Text Retrieval Layer
- ⏳ Phase 9: API Layer
- ⏳ Phase 10: Ranking & Relevance
- ⏳ Phase 12: Observability & QA
- ⏳ Phase 13: Performance Optimization

## Quick Start

```python
# Register a source
from citegeist.sources import SourceRegistry, CrossRefSource

registry = SourceRegistry()
registry.register(CrossRefSource, name='crossref', config={})

# Get source instance
source = registry.get('crossref')
entry = source.lookup_by_doi('10.1234/example')

# Resolve identifiers
from citegeist.resolver import resolve_identifiers

fields = {'doi': '10.1234/example', 'title': 'Test Title'}
resolved = resolve_identifiers(fields)
# Returns [('doi', '10.1234/example'), ('title', 'test title')]
```

## Next Steps

1. ✅ Phase 0-2: Complete
2. 🚧 Phase 3: Implement Python interface for database operations
3. ⏳ Phase 4: Add Unpaywall, Semantic Scholar, OpenCitations integrations
4. ⏳ Phase 5: Build merge engine