CiteGeist Multi-Source File Structure

Date: 2026-04-25

Project Structure

/home/netuser/dev/CiteGeist/
├── db/
│   └── migrations/
│       └── 0001_multisource.sql           ✅ NEW - Multi-source schema
│
├── docs/
│   ├── architecture-current.md             ✅ NEW - Current architecture docs
│   ├── implementation-progress.md          ✅ NEW - Implementation progress tracker
│   ├── schema-current.sql                  ✅ NEW - Current schema SQL
│   └── file-structure.md                   ✅ NEW - This file
│
├── src/citegeist/
│   ├── sources/                            ✅ NEW - Source plugin architecture
│   │   ├── __init__.py                     ✅ NEW - Package exports
│   │   ├── __all__.py                      ✅ NEW - Public API
│   │   ├── base.py                         ✅ NEW - Base BibliographicSource class
│   │   ├── registry.py                     ✅ NEW - SourceRegistry implementation
│   │   ├── crossref.py                     ✅ NEW - CrossRef source plugin
│   │   └── _old_sources_compat.py          ✅ NEW - Backward compatibility
│   │
│   ├── resolver/                           ✅ NEW - Identifier resolution
│   │   ├── __init__.py                     ✅ NEW - Module exports
│   │   └── identifiers.py                  ✅ NEW - Extract, normalize, resolve
│   │
│   ├── db/                                 ✅ NEW - Database operations
│   │   └── __init__.py                     🚧 TO DO - Database client
│   │
│   ├── ... (existing files)
│   ├── sources.py                          📦 Existing - Old SourceClient
│   ├── resolve.py                          📦 Existing - MetadataResolver
│   └── storage.py                          📦 Existing - BibliographyStore
│
└── tests/
    ├── test_sources_plugin.py              ✅ NEW - Source plugin tests
    └── test_resolver_identifiers.py        ✅ NEW - Identifier tests

Module Documentation

New Modules

src/citegeist/sources/

Plugin architecture for bibliographic sources.

Classes:

  • BibliographicSource - Abstract base class for source plugins
  • SourceRecord - Raw source record dataclass
  • CitationEdge - Citation relationship dataclass
  • SourceRegistry - Manages source plugins

Plugin:

  • CrossRefSource - CrossRef API implementation
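The relationship between these classes can be sketched roughly as follows. This is a minimal illustration, not the actual code in src/citegeist/sources/: the SourceRecord fields, the lazy instantiation in the registry, and the InMemorySource stand-in plugin are all assumptions. Only the class names, lookup_by_doi, and the register(cls, name=..., config=...) call shape come from this document.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass, field
from typing import Any, Optional


@dataclass
class SourceRecord:
    """Raw record returned by a source (field names are assumed)."""
    source: str
    identifier: str
    payload: dict[str, Any] = field(default_factory=dict)


class BibliographicSource(ABC):
    """Abstract base class: each plugin implements lookup_by_doi."""

    def __init__(self, config: Optional[dict] = None):
        self.config = config or {}

    @abstractmethod
    def lookup_by_doi(self, doi: str) -> Optional[SourceRecord]:
        ...


class SourceRegistry:
    """Maps a name to a plugin class and builds the instance on first use."""

    def __init__(self):
        self._factories = {}
        self._instances = {}

    def register(self, source_cls, name, config=None):
        if not issubclass(source_cls, BibliographicSource):
            raise TypeError(f"{source_cls!r} is not a BibliographicSource")
        self._factories[name] = (source_cls, config or {})

    def get(self, name):
        if name not in self._instances:
            source_cls, config = self._factories[name]  # KeyError if unknown
            self._instances[name] = source_cls(config)
        return self._instances[name]


class InMemorySource(BibliographicSource):
    """Hypothetical plugin used here instead of a network-backed source."""

    def lookup_by_doi(self, doi):
        payload = self.config.get("records", {}).get(doi)
        if payload is None:
            return None
        return SourceRecord(source="memory", identifier=doi, payload=payload)


registry = SourceRegistry()
registry.register(
    InMemorySource,
    name="memory",
    config={"records": {"10.1234/example": {"title": "Test"}}},
)
record = registry.get("memory").lookup_by_doi("10.1234/example")
print(record)
```

Keeping the registry keyed by name means callers never import a concrete plugin class, so a new source can be added without touching call sites.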

src/citegeist/resolver/

Identifier extraction, normalization, and resolution.

Classes:

  • IdentifierExtractor - Extract identifiers from entry fields
  • IdentifierNormalizer - Normalize identifiers to canonical form
  • IdentifierResolver - Resolve identifiers with lookup priority

Functions:

  • extract_identifiers() - Quick identifier extraction
  • normalize_identifier() - Quick normalization
  • get_primary_identifier() - Get primary identifier
  • resolve_identifiers() - Resolve all identifiers
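To make "normalize" and "lookup priority" concrete, here is a small sketch of the idea. The function names carry a _sketch suffix because they are illustrative and not the module's real implementation. The priority order, the set of DOI prefixes stripped, and the title normalization rule are assumptions.

```python
import re

# URL and scheme prefixes commonly found in front of a bare DOI.
DOI_PREFIXES = (
    "https://doi.org/",
    "http://doi.org/",
    "https://dx.doi.org/",
    "http://dx.doi.org/",
    "doi:",
)

# Assumed lookup priority: most specific identifier first.
PRIORITY = ("doi", "pmid", "arxiv", "isbn", "title")


def normalize_identifier_sketch(scheme, value):
    """Return a canonical form of value for the given scheme."""
    value = value.strip()
    if scheme == "doi":
        # DOIs are case-insensitive, so lowercasing is a safe canonical form.
        lowered = value.lower()
        for prefix in DOI_PREFIXES:
            if lowered.startswith(prefix):
                lowered = lowered[len(prefix):]
                break
        return lowered
    if scheme == "title":
        # Collapse runs of whitespace and lowercase for fuzzy matching.
        return re.sub(r"\s+", " ", value).lower()
    return value


def resolve_identifiers_sketch(fields):
    """Return (scheme, normalized value) pairs in priority order."""
    resolved = []
    for scheme in PRIORITY:
        raw = fields.get(scheme)
        if raw:
            resolved.append((scheme, normalize_identifier_sketch(scheme, raw)))
    return resolved


resolved = resolve_identifiers_sketch(
    {"title": "  Test   Title ", "doi": "https://doi.org/10.1234/Example"}
)
print(resolved)
# [('doi', '10.1234/example'), ('title', 'test title')]
```

The first pair in the result plays the role of the primary identifier, which is presumably what get_primary_identifier() returns.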

src/citegeist/db/

Database operations (to be implemented).

Planned:

  • Database client for works table
  • Migration runner
  • Query builders
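As a rough idea of what the planned migration runner could look like, the sketch below applies numbered .sql files in filename order and records each one so it is not applied twice. Everything here is an assumption: the schema_migrations bookkeeping table, the apply_migrations name, and the use of SQLite (chosen only so the example runs without a database server).

```python
import sqlite3
import tempfile
from pathlib import Path


def apply_migrations(conn, migrations_dir):
    """Apply unapplied *.sql files in name order; return the names applied."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS schema_migrations (name TEXT PRIMARY KEY)"
    )
    applied = {row[0] for row in conn.execute("SELECT name FROM schema_migrations")}
    newly_applied = []
    for path in sorted(Path(migrations_dir).glob("*.sql")):
        if path.name in applied:
            continue
        # Run the whole migration file, then record it as applied.
        conn.executescript(path.read_text(encoding="utf-8"))
        conn.execute("INSERT INTO schema_migrations (name) VALUES (?)", (path.name,))
        conn.commit()
        newly_applied.append(path.name)
    return newly_applied


with tempfile.TemporaryDirectory() as tmp:
    # Stand-in migration file; the real 0001_multisource.sql is larger.
    Path(tmp, "0001_multisource.sql").write_text(
        "CREATE TABLE works (id INTEGER PRIMARY KEY, title TEXT);",
        encoding="utf-8",
    )
    conn = sqlite3.connect(":memory:")
    first_run = apply_migrations(conn, tmp)
    second_run = apply_migrations(conn, tmp)

print(first_run, second_run)
```

Zero-padded prefixes such as 0001_ make lexical filename order match the intended application order. Note that this sketch does not wrap a migration and its bookkeeping row in a single transaction, which a production runner would likely want.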

db/migrations/0001_multisource.sql

Multi-source database schema migration.

Tables:

  1. works - Canonical work metadata
  2. work_identifiers - Multi-scheme identifiers
  3. source_records - Raw API responses
  4. citations - Citation graph
  5. work_embeddings - Vector embeddings
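The sketch below shows one way these five tables could relate to each other. The table names come from the list above, but every column name and constraint is a guess, and SQLite is used only so the example runs standalone. The actual migration may target a different database and differ in detail.

```python
import sqlite3

# Hypothetical columns; only the table names are taken from the migration.
SCHEMA = """
CREATE TABLE works (
    id INTEGER PRIMARY KEY,
    title TEXT NOT NULL
);
CREATE TABLE work_identifiers (
    work_id INTEGER NOT NULL REFERENCES works(id),
    scheme TEXT NOT NULL,
    value TEXT NOT NULL,
    UNIQUE (scheme, value)
);
CREATE TABLE source_records (
    id INTEGER PRIMARY KEY,
    work_id INTEGER REFERENCES works(id),
    source TEXT NOT NULL,
    raw_json TEXT NOT NULL
);
CREATE TABLE citations (
    citing_work_id INTEGER NOT NULL REFERENCES works(id),
    cited_work_id INTEGER NOT NULL REFERENCES works(id),
    PRIMARY KEY (citing_work_id, cited_work_id)
);
CREATE TABLE work_embeddings (
    work_id INTEGER PRIMARY KEY REFERENCES works(id),
    model TEXT NOT NULL,
    vector BLOB NOT NULL
);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)

conn.execute("INSERT INTO works (id, title) VALUES (1, 'Citing paper'), (2, 'Cited paper')")
conn.execute("INSERT INTO work_identifiers VALUES (2, 'doi', '10.1234/example')")
conn.execute("INSERT INTO citations VALUES (1, 2)")

# Which works cite the work identified by this DOI?
row = conn.execute(
    """
    SELECT w.title
    FROM citations c
    JOIN work_identifiers i ON i.work_id = c.cited_work_id
    JOIN works w ON w.id = c.citing_work_id
    WHERE i.scheme = 'doi' AND i.value = ?
    """,
    ("10.1234/example",),
).fetchone()
print(row[0])  # Citing paper

tables = {r[0] for r in conn.execute("SELECT name FROM sqlite_master WHERE type = 'table'")}
```

Separating work_identifiers from works lets one canonical work carry several identifiers (DOI, PMID, and so on) while the uniqueness constraint keeps each identifier pointing at a single work.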

Existing Modules (Preserved)

  • src/citegeist/sources.py - Old SourceClient (backward compatible)
  • src/citegeist/resolve.py - Old MetadataResolver
  • src/citegeist/storage.py - Old BibliographyStore

Test Coverage

New Tests:

  • tests/test_sources_plugin.py (7 tests)
  • tests/test_resolver_identifiers.py (17 tests)

Total: 24 tests passing

Dependencies

New Dependencies Required:

  • None (standard library only)

Planned Dependencies (Future phases):

  • pgvector - PostgreSQL vector extension
  • sentence-transformers - Local embedding model
  • fastapi - API framework
  • unpaywall - Open-access (OA) link retrieval (if needed)

Implementation Status

Completed (100%)

  • Phase 0: Baseline Audit
  • Phase 1: Source Plugin Architecture
  • Phase 2: Identifier Resolution Layer

In Progress (50%)

  • 🚧 Phase 3: Database Schema Upgrade

Pending (0%)

  • Phase 4: High-Value Source Integrations
  • Phase 5: Merge & Deduplication Engine
  • Phase 6: Citation Graph Construction
  • Phase 7: Embedding Pipeline
  • Phase 8: Full-Text Retrieval Layer
  • Phase 9: API Layer
  • Phase 10: Ranking & Relevance
  • Phase 12: Observability & QA
  • Phase 13: Performance Optimization

Quick Start

# Register a source
from citegeist.sources import SourceRegistry, CrossRefSource

registry = SourceRegistry()
registry.register(CrossRefSource, name='crossref', config={})

# Get source instance
source = registry.get('crossref')
entry = source.lookup_by_doi('10.1234/example')

# Resolve identifiers
from citegeist.resolver import resolve_identifiers

fields = {'doi': '10.1234/example', 'title': 'Test Title'}
resolved = resolve_identifiers(fields)
# Returns [('doi', '10.1234/example'), ('title', 'test title')]

Next Steps

  1. Phases 0-2: Complete
  2. 🚧 Phase 3: Implement the Python interface for database operations (src/citegeist/db/)
  3. Phase 4: Add Unpaywall, Semantic Scholar, and OpenCitations integrations
  4. Phase 5: Build merge engine