# CiteGeist Multi-Source File Structure
**Date:** 2026-04-25
## Project Structure
```
/home/netuser/dev/CiteGeist/
├── db/
│   └── migrations/
│       └── 0001_multisource.sql          ✅ NEW - Multi-source schema
├── docs/
│   ├── architecture-current.md           ✅ NEW - Current architecture docs
│   ├── implementation-progress.md        ✅ NEW - Implementation progress tracker
│   ├── schema-current.sql                ✅ NEW - Current schema SQL
│   └── file-structure.md                 ✅ NEW - This file
├── src/citegeist/
│   ├── sources/                          ✅ NEW - Source plugin architecture
│   │   ├── __init__.py                   ✅ NEW - Package exports
│   │   ├── __all__.py                    ✅ NEW - Public API
│   │   ├── base.py                       ✅ NEW - Base BibliographicSource class
│   │   ├── registry.py                   ✅ NEW - SourceRegistry implementation
│   │   ├── crossref.py                   ✅ NEW - CrossRef source plugin
│   │   └── _old_sources_compat.py        ✅ NEW - Backward compatibility
│   │
│   ├── resolver/                         ✅ NEW - Identifier resolution
│   │   ├── __init__.py                   ✅ NEW - Module exports
│   │   └── identifiers.py                ✅ NEW - Extract, normalize, resolve
│   │
│   ├── db/                               ✅ NEW - Database operations
│   │   └── __init__.py                   🚧 TO DO - Database client
│   │
│   ├── ... (existing files)
│   ├── sources.py                        📦 Existing - Old SourceClient
│   ├── resolve.py                        📦 Existing - MetadataResolver
│   └── storage.py                        📦 Existing - BibliographyStore
└── tests/
    ├── test_sources_plugin.py            ✅ NEW - Source plugin tests
    └── test_resolver_identifiers.py      ✅ NEW - Identifier tests
```
## Module Documentation
### New Modules
#### `src/citegeist/sources/`
Plugin architecture for bibliographic sources.
**Classes:**
- `BibliographicSource` - Abstract base class for source plugins
- `SourceRecord` - Raw source record dataclass
- `CitationEdge` - Citation relationship dataclass
- `SourceRegistry` - Manages source plugins
**Plugin:**
- `CrossRefSource` - CrossRef API implementation
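
To make the plugin pattern concrete, here is a minimal sketch of how an abstract base class and a registry can fit together. It is not the code in `base.py` or `registry.py`: the `SourceRecord` fields, the constructor signature, the lazy instantiation in `get()`, and the `InMemorySource` plugin are all assumptions made for illustration. Only the class names and the `register(cls, name=..., config=...)` / `get(name)` / `lookup_by_doi(doi)` call shapes come from this document.

```python
from __future__ import annotations

from abc import ABC, abstractmethod
from dataclasses import dataclass, field
from typing import Any


@dataclass
class SourceRecord:
    """Raw record from one source (field names are assumptions)."""
    source: str
    identifier: str
    payload: dict[str, Any] = field(default_factory=dict)


class BibliographicSource(ABC):
    """Base class that every source plugin would subclass."""

    def __init__(self, config: dict[str, Any] | None = None) -> None:
        self.config = config or {}

    @abstractmethod
    def lookup_by_doi(self, doi: str) -> SourceRecord | None:
        """Return the record for a DOI, or None if the source has no match."""


class SourceRegistry:
    """Maps a short name to a plugin class and builds instances on demand."""

    def __init__(self) -> None:
        self._factories: dict[str, tuple[type[BibliographicSource], dict[str, Any]]] = {}
        self._instances: dict[str, BibliographicSource] = {}

    def register(self, cls: type[BibliographicSource], name: str,
                 config: dict[str, Any] | None = None) -> None:
        if name in self._factories:
            raise ValueError(f"source already registered: {name}")
        self._factories[name] = (cls, config or {})

    def get(self, name: str) -> BibliographicSource:
        # Instantiate on first use, then reuse the same instance.
        if name not in self._instances:
            cls, config = self._factories[name]  # KeyError for unknown names
            self._instances[name] = cls(config)
        return self._instances[name]


class InMemorySource(BibliographicSource):
    """Hypothetical stand-in plugin that serves records from its config."""

    def lookup_by_doi(self, doi: str) -> SourceRecord | None:
        payload = self.config.get("records", {}).get(doi)
        if payload is None:
            return None
        return SourceRecord(source="memory", identifier=doi, payload=payload)


registry = SourceRegistry()
registry.register(
    InMemorySource,
    name="memory",
    config={"records": {"10.1234/example": {"title": "Test"}}},
)
record = registry.get("memory").lookup_by_doi("10.1234/example")
```

Registering the class rather than an instance keeps construction (and any network setup a real plugin might need) out of import time, which is one plausible reason for the `register(CrossRefSource, ...)` call shape used in the Quick Start.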
#### `src/citegeist/resolver/`
Identifier extraction, normalization, and resolution.
**Classes:**
- `IdentifierExtractor` - Extract identifiers from entry fields
- `IdentifierNormalizer` - Normalize identifiers to canonical form
- `IdentifierResolver` - Resolve identifiers with lookup priority
**Functions:**
- `extract_identifiers()` - Quick identifier extraction
- `normalize_identifier()` - Quick normalization
- `get_primary_identifier()` - Get primary identifier
- `resolve_identifiers()` - Resolve all identifiers
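
The normalize-then-prioritize flow described above can be sketched roughly as follows. This is an illustration only: the function names `normalize_doi_sketch` and `resolve_sketch`, the `PRIORITY` order, and the set of schemes are assumptions, and the real `IdentifierNormalizer` and `IdentifierResolver` may behave differently.

```python
import re

# Assumed lookup priority; the order used by IdentifierResolver may differ.
PRIORITY = ("doi", "pmid", "arxiv", "isbn")

# Matches a leading resolver URL or a "doi:" label in front of a DOI.
_DOI_PREFIX = re.compile(r"^(?:https?://(?:dx\.)?doi\.org/|doi:\s*)", re.IGNORECASE)


def normalize_doi_sketch(value: str) -> str:
    """Strip URL or 'doi:' prefixes and lowercase (DOIs are case-insensitive)."""
    return _DOI_PREFIX.sub("", value.strip()).lower()


def resolve_sketch(fields: dict[str, str]) -> list[tuple[str, str]]:
    """Return (scheme, value) pairs found in `fields`, highest priority first."""
    found = []
    for scheme in PRIORITY:
        raw = fields.get(scheme, "").strip()
        if not raw:
            continue
        value = normalize_doi_sketch(raw) if scheme == "doi" else raw
        found.append((scheme, value))
    return found


pairs = resolve_sketch({"pmid": "123", "doi": "https://doi.org/10.1234/Example"})
# pairs == [("doi", "10.1234/example"), ("pmid", "123")]
```

Normalizing before comparison means the same work entered as a bare DOI, a `doi:` label, or a resolver URL maps to a single canonical key.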
#### `src/citegeist/db/`
Database operations (to be implemented).
**Planned:**
- Database client for works table
- Migration runner
- Query builders
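
Since the migration runner is still planned, the following is only a sketch of one common approach: apply `*.sql` files in filename order and record each applied name so reruns are no-ops. The `apply_migrations` function and the `schema_migrations` table are hypothetical, and `sqlite3` is used purely to keep the example self-contained; the database engine CiteGeist targets is not stated in this section.

```python
import sqlite3
import tempfile
from pathlib import Path


def apply_migrations(conn: sqlite3.Connection, migrations_dir: Path) -> list[str]:
    """Apply unapplied *.sql files in filename order; return the names applied."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS schema_migrations (name TEXT PRIMARY KEY)"
    )
    done = {row[0] for row in conn.execute("SELECT name FROM schema_migrations")}
    applied = []
    for path in sorted(migrations_dir.glob("*.sql")):
        if path.name in done:
            continue  # already applied on an earlier run
        conn.executescript(path.read_text(encoding="utf-8"))
        conn.execute(
            "INSERT INTO schema_migrations (name) VALUES (?)", (path.name,)
        )
        conn.commit()
        applied.append(path.name)
    return applied


# Demonstration against a throwaway directory and an in-memory database.
with tempfile.TemporaryDirectory() as tmp:
    demo_dir = Path(tmp)
    (demo_dir / "0001_demo.sql").write_text(
        "CREATE TABLE works (id INTEGER PRIMARY KEY, title TEXT);",
        encoding="utf-8",
    )
    conn = sqlite3.connect(":memory:")
    first_run = apply_migrations(conn, demo_dir)   # applies 0001_demo.sql
    second_run = apply_migrations(conn, demo_dir)  # nothing left to apply
```

Zero-padded numeric prefixes such as `0001_` make lexical filename order match the intended application order.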
#### `db/migrations/0001_multisource.sql`
Multi-source database schema migration.
**Tables:**
1. `works` - Canonical work metadata
2. `work_identifiers` - Multi-scheme identifiers
3. `source_records` - Raw API responses
4. `citations` - Citation graph
5. `work_embeddings` - Vector embeddings
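
As a rough illustration of how a canonical `works` row can relate to several identifiers and to citation edges, the sketch below builds three of these tables in an in-memory SQLite database. Every column name here is an assumption; the actual definitions are in `0001_multisource.sql` and may differ, and `sqlite3` is used only so the example runs without a database server.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Hypothetical columns; the real definitions live in 0001_multisource.sql.
    CREATE TABLE works (
        id    INTEGER PRIMARY KEY,
        title TEXT NOT NULL
    );
    CREATE TABLE work_identifiers (
        work_id INTEGER NOT NULL REFERENCES works(id),
        scheme  TEXT NOT NULL,
        value   TEXT NOT NULL,
        UNIQUE (scheme, value)
    );
    CREATE TABLE citations (
        citing_work_id INTEGER NOT NULL REFERENCES works(id),
        cited_work_id  INTEGER NOT NULL REFERENCES works(id),
        PRIMARY KEY (citing_work_id, cited_work_id)
    );
    """
)
conn.execute("INSERT INTO works (id, title) VALUES (1, 'Paper A'), (2, 'Paper B')")
conn.executemany(
    "INSERT INTO work_identifiers (work_id, scheme, value) VALUES (?, ?, ?)",
    [(1, "doi", "10.1234/a"), (1, "arxiv", "2401.00001"), (2, "doi", "10.1234/b")],
)
conn.execute("INSERT INTO citations VALUES (1, 2)")  # Paper A cites Paper B


def find_work(scheme: str, value: str):
    """Look up a work title by any one of its identifiers."""
    row = conn.execute(
        "SELECT w.title FROM works w "
        "JOIN work_identifiers i ON i.work_id = w.id "
        "WHERE i.scheme = ? AND i.value = ?",
        (scheme, value),
    ).fetchone()
    return row[0] if row else None


# Titles of the works cited by work 1.
cited_by_a = [
    r[0]
    for r in conn.execute(
        "SELECT w.title FROM citations c "
        "JOIN works w ON w.id = c.cited_work_id "
        "WHERE c.citing_work_id = 1"
    )
]
```

Keeping identifiers in their own table lets one work carry a DOI, an arXiv ID, and other schemes at once, while the uniqueness constraint on `(scheme, value)` prevents two works from claiming the same identifier.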
### Existing Modules (Preserved)
- `src/citegeist/sources.py` - Old SourceClient (backward compatible)
- `src/citegeist/resolve.py` - Old MetadataResolver
- `src/citegeist/storage.py` - Old BibliographyStore
## Test Coverage
**New Tests:**
- `tests/test_sources_plugin.py` (7 tests)
- `tests/test_resolver_identifiers.py` (17 tests)
**Total:** 24 tests passing
## Dependencies
**New Dependencies Required:**
- No new Python packages (uses stdlib only)
**Planned Dependencies (Future phases):**
- `pgvector` - PostgreSQL vector extension
- `sentence-transformers` - Local embedding model
- `fastapi` - API framework
- `unpaywall` - OA link retrieval (if needed)
## Implementation Status
### Completed (100%)
- ✅ Phase 0: Baseline Audit
- ✅ Phase 1: Source Plugin Architecture
- ✅ Phase 2: Identifier Resolution Layer
### In Progress (50%)
- 🚧 Phase 3: Database Schema Upgrade
### Pending (0%)
- ⏳ Phase 4: High-Value Source Integrations
- ⏳ Phase 5: Merge & Deduplication Engine
- ⏳ Phase 6: Citation Graph Construction
- ⏳ Phase 7: Embedding Pipeline
- ⏳ Phase 8: Full-Text Retrieval Layer
- ⏳ Phase 9: API Layer
- ⏳ Phase 10: Ranking & Relevance
- ⏳ Phase 12: Observability & QA
- ⏳ Phase 13: Performance Optimization
## Quick Start
```python
from citegeist.resolver import resolve_identifiers
from citegeist.sources import CrossRefSource, SourceRegistry

# Register a source plugin under a short name
registry = SourceRegistry()
registry.register(CrossRefSource, name='crossref', config={})

# Get the source instance and look up a work by DOI
source = registry.get('crossref')
entry = source.lookup_by_doi('10.1234/example')

# Resolve identifiers from entry fields
fields = {'doi': '10.1234/example', 'title': 'Test Title'}
resolved = resolve_identifiers(fields)
# Returns [('doi', '10.1234/example'), ('title', 'test title')]
```
## Next Steps
1. ✅ Phase 0-2: Complete
2. 🚧 Phase 3: Implement Python interface for database operations
3. ⏳ Phase 4: Add Unpaywall, Semantic Scholar, OpenCitations integrations
4. ⏳ Phase 5: Build merge engine