Roadmap
This roadmap prioritizes a usable local research workflow over breadth of integrations.
The first objective is not to support every metadata source. The first objective is to make one end-to-end path work reliably:
- ingest draft references,
- normalize and store them,
- enrich them,
- traverse citation links,
- export reviewed BibTeX.
Prioritization Principles
- prioritize steps that make the system usable by a single researcher on a local machine;
- prioritize deterministic infrastructure before network integrations;
- keep every stage inspectable and auditable;
- treat verification and provenance as core features, not cleanup work;
- defer heavy semantic infrastructure until the local corpus model is stable.
Current Baseline
Completed:
- lightweight BibTeX parsing;
- SQLite storage for entries, creators, identifiers, and relations;
- local text search using SQLite FTS5 when available;
- standalone verification/disambiguation output for free-text references and partial BibTeX with auditable match metadata;
- tests for ingest, relation storage, and search.
Comparison Notes From Related Repos
The adjacent TOA-Bib-Updater and VeriBib repositories are useful prior art, but they contribute different things:
- VeriBib contributes a good pre-ingest verification pattern: inspect ambiguous strings or partial BibTeX, rank candidates from legal metadata sources, and emit explicit audit fields instead of silently trusting a single match.
- TOA-Bib-Updater contributes process discipline more than core data modeling: resumable long-running jobs, preserved source artifacts, and generated review outputs for manual inspection.
citegeist should absorb those ideas where they improve the main local research workflow:
- keep verification and auditability in the core package, not just entry resolution after ingest;
- keep resumable manifests and review exports for large acquisition workflows, especially example pipelines and batch imports;
- avoid coupling the core model to brittle source-specific scraping logic.
Phase 1: Core Ingestion And Export
Priority: P0
Goal: Make citegeist useful as a local BibTeX workbench even before online enrichment is added.
Tasks:
- add BibTeX export from the normalized database back into stable, readable BibTeX;
- add a small CLI for `ingest`, `show`, `search`, and `export`;
- store field provenance metadata alongside imported and edited fields;
- add schema support for entry status such as `draft`, `enriched`, `reviewed`, and `exported`;
- add fixture-driven tests for round-tripping BibTeX through ingest and export.
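The CLI task above could start from something like the following. This is a minimal sketch using the standard library's `argparse`; the program name, the `--db` default, and every option other than the four subcommand names are assumptions made for illustration, not an existing citegeist interface.

```python
import argparse

# Status values taken from the roadmap's task list.
STATUSES = ["draft", "enriched", "reviewed", "exported"]


def build_parser() -> argparse.ArgumentParser:
    """Build a parser with the four subcommands named in the roadmap."""
    parser = argparse.ArgumentParser(prog="citegeist")
    # Hypothetical option: path to the local SQLite store.
    parser.add_argument("--db", default="citegeist.sqlite3")
    sub = parser.add_subparsers(dest="command", required=True)

    ingest = sub.add_parser("ingest", help="import a .bib file")
    ingest.add_argument("path")

    show = sub.add_parser("show", help="print one stored entry")
    show.add_argument("citekey")

    search = sub.add_parser("search", help="search the local store")
    search.add_argument("query")

    export = sub.add_parser("export", help="write entries back to BibTeX")
    export.add_argument("--status", choices=STATUSES, default="reviewed")
    export.add_argument("-o", "--output", default="-")  # "-" means stdout
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(args)
```

Defaulting `export` to `reviewed` entries is one way to keep unreviewed drafts out of a LaTeX build unless the user asks for them explicitly.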
Why this comes first:
- without export, the project is not yet useful in a LaTeX workflow;
- without a CLI, the package is a library demo rather than a tool;
- without provenance and state, later enrichment work becomes hard to audit.
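To make the provenance and state point concrete, here is one possible shape for the schema. It is a sketch only: the table and column names (`entry`, `field_provenance`, `source`) are hypothetical and do not describe the existing citegeist tables.

```python
import sqlite3

# Hypothetical schema: one row per entry, plus an append-only log of
# every value a field has held and where that value came from.
SCHEMA = """
CREATE TABLE entry (
    id INTEGER PRIMARY KEY,
    citekey TEXT NOT NULL UNIQUE,
    status TEXT NOT NULL DEFAULT 'draft'
        CHECK (status IN ('draft', 'enriched', 'reviewed', 'exported'))
);
CREATE TABLE field_provenance (
    entry_id INTEGER NOT NULL REFERENCES entry(id),
    field TEXT NOT NULL,
    value TEXT NOT NULL,
    source TEXT NOT NULL,  -- e.g. 'import:refs.bib' or 'manual-edit'
    recorded_at TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP
);
"""


def open_demo_db() -> sqlite3.Connection:
    """Create an in-memory database with the sketch schema."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    return conn


def set_field(conn, citekey, field, value, source) -> int:
    """Record a field value with its source; create the entry if needed."""
    row = conn.execute(
        "SELECT id FROM entry WHERE citekey = ?", (citekey,)
    ).fetchone()
    if row is None:
        entry_id = conn.execute(
            "INSERT INTO entry (citekey) VALUES (?)", (citekey,)
        ).lastrowid
    else:
        entry_id = row[0]
    conn.execute(
        "INSERT INTO field_provenance (entry_id, field, value, source) "
        "VALUES (?, ?, ?, ?)",
        (entry_id, field, value, source),
    )
    return entry_id
```

Because provenance rows are appended rather than updated, an edit never erases the imported value it replaced, which is what later enrichment audits depend on.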
Exit criteria:
- a user can ingest a `.bib` file, inspect entries, search locally, and export a reviewed `.bib`;
- round-trip tests show no unexpected field loss for supported entry types.
Phase 2: Reference Extraction
Priority: P0
Goal: Turn raw reference text into draft entries that can enter the main pipeline.
Tasks:
- add parsers for bibliography-section lines and plain-text reference lists;
- define a draft-entry schema for incomplete references with confidence markers;
- support ingestion of OCR- or PDF-derived plaintext bibliography sections;
- add normalization for author names, years, title casing, and page ranges;
- prefer sentence-boundary venue detection over naive keyword splits so title text containing words like `report` is not truncated;
- repair partially extracted venue stubs such as `Occas.` or `Proc.` by reparsing the full raw reference line when the structured fields are obviously incomplete;
- preserve improved local draft parses even when remote enrichment remains unresolved, so later parser fixes can refresh stored BibTeX without requiring a successful metadata match;
- build gold-test fixtures from real, messy reference examples.
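A draft-entry schema with confidence markers, plus one of the normalizers listed above, might look roughly like this. The class names, the 0.0 to 1.0 confidence scale, and the choice of `--` as the normalized page separator are assumptions for illustration.

```python
import re
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class DraftField:
    value: Optional[str]  # None marks a field the parser could not resolve
    confidence: float     # assumed scale: 0.0 (guess) to 1.0 (explicit)


@dataclass
class DraftEntry:
    raw: str  # the original reference line, kept so it can be reparsed
    fields: Dict[str, DraftField] = field(default_factory=dict)

    def unresolved(self) -> List[str]:
        """Names of fields that still have no value, sorted."""
        return sorted(n for n, f in self.fields.items() if f.value is None)


# Two page numbers joined by hyphens, en dashes, or em dashes.
_PAGES = re.compile(r"^\s*(\d+)\s*[-\u2013\u2014]+\s*(\d+)\s*$")


def normalize_pages(text: str) -> str:
    """Rewrite numeric ranges as 'start--end'; leave anything else alone."""
    match = _PAGES.match(text)
    if match is None:
        return text.strip()
    return f"{match.group(1)}--{match.group(2)}"
```

Keeping `raw` on the draft is what allows a later parser fix to refresh the stored fields without a successful remote match.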
Why this is next:
- this addresses the project’s first unique bottleneck: getting rough references into structured form;
- enrichment is much more effective once draft references are normalized.
Exit criteria:
- a user can pass a plaintext bibliography section and receive draft BibTeX entries with unresolved fields clearly marked;
- tests cover common article, book, chapter, proceedings, report, and abbreviation-heavy legacy references.
Phase 3: Metadata Enrichment
Priority: P1
Goal: Resolve draft or partial entries against external scholarly sources and merge improved metadata safely.
Tasks:
- define a resolver interface with deterministic merge rules;
- implement first-party resolvers for DOI/Crossref, DBLP, and arXiv;
- add identifier-first resolution, then title/author/year fallback search;
- store merge provenance per field and resolution attempt logs;
- flag conflicts rather than silently overwriting disputed values.
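One way to read "deterministic merge rules" and "flag conflicts" together is sketched below. The rule shown (fill empty local fields, never overwrite, report case-insensitive disagreements) is an assumed policy rather than a decided one, and `merge_fields` and `MergeResult` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class MergeResult:
    merged: Dict[str, str]
    provenance: Dict[str, str]                   # field -> filling source
    conflicts: List[Tuple[str, str, str, str]]   # field, local, remote, source


def merge_fields(local: Dict[str, str], remote: Dict[str, str],
                 source: str) -> MergeResult:
    """Fill gaps from `remote`; record disagreements instead of overwriting."""
    merged = dict(local)
    provenance: Dict[str, str] = {}
    conflicts: List[Tuple[str, str, str, str]] = []
    # Sorted iteration keeps the output order independent of dict ordering.
    for name in sorted(remote):
        new = remote[name]
        old = local.get(name)
        if not old:
            merged[name] = new
            provenance[name] = source
        elif old.strip().casefold() != new.strip().casefold():
            conflicts.append((name, old, new, source))
    return MergeResult(merged, provenance, conflicts)


if __name__ == "__main__":
    result = merge_fields(
        {"title": "a study of things", "year": "2016"},
        {"title": "A Study of Things", "year": "2017", "publisher": "Example Press"},
        source="crossref",
    )
    print(result.conflicts)
```

In this sketch the disputed `year` keeps its local value and appears in `conflicts`, so a reviewer decides which side wins.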
Why this is P1 rather than the first phase:
- enrichment quality depends on the ingestion and provenance model being correct first;
- it is easier to test deterministic merge behavior once local workflows already exist.
Exit criteria:
- an incomplete entry can be enriched from at least one authoritative source;
- conflicting fields remain visible for review instead of being lost.
Phase 4: Citation Graph Expansion
Priority: P1
Goal: Use citation edges as a discovery engine rather than just metadata storage.
Tasks:
- support explicit `cites` and `cited_by` edge ingestion with source provenance;
- add graph expansion commands starting from one or more seed entries;
- track edge discovery source, timestamp, and confidence;
- add filters for depth, source type, year range, and reviewed status;
- expose unresolved nodes so the user can decide what to enrich next.
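Seed-based expansion with a depth filter can be sketched as a breadth-first traversal. The in-memory `cites` mapping stands in for the relations table, and the function name and return shape are assumptions.

```python
from collections import deque
from typing import Dict, Iterable, List, Set, Tuple


def expand(seeds: Iterable[str], cites: Dict[str, List[str]],
           max_depth: int, known: Set[str]) -> Tuple[Dict[str, int], List[str]]:
    """Walk `cites` edges outward from the seeds, up to `max_depth` hops.

    Returns each reached key with its hop distance, plus the reached keys
    that are not in `known` (nodes the user may want to enrich next).
    """
    depth = {seed: 0 for seed in seeds}
    queue = deque(depth)
    while queue:
        node = queue.popleft()
        if depth[node] >= max_depth:
            continue  # do not expand beyond the requested depth
        for target in cites.get(node, []):
            if target not in depth:
                depth[target] = depth[node] + 1
                queue.append(target)
    unresolved = sorted(key for key in depth if key not in known)
    return depth, unresolved


if __name__ == "__main__":
    edges = {"a": ["b", "c"], "b": ["d"], "d": ["e"]}
    print(expand(["a"], edges, max_depth=2, known={"a", "b"}))
```

Breadth-first order means each node is recorded at its shortest distance from any seed, which gives the depth filter a well-defined meaning.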
Why this matters:
- this is central to literature discovery rather than mere bibliography cleanup;
- it turns the database into a research navigation tool.
Exit criteria:
- starting from one or more seed entries, a user can expand outward through citation edges and persist newly discovered nodes;
- graph traversal results can be exported as BibTeX candidates for review.
Phase 5: Search And Ranking
Priority: P2
Goal: Improve discovery quality inside the local corpus.
Tasks:
- refine FTS ranking across title, abstract, keywords, and fulltext;
- add saved search queries and result filters;
- add optional embedding-backed semantic search behind a pluggable interface;
- support hybrid ranking that combines lexical matching, identifiers, and citation proximity;
- add benchmarking fixtures for retrieval quality on a few research topics.
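Hybrid ranking could begin as a plain weighted sum, as in the sketch below. The weights, the assumption that the lexical score is already normalized to the range 0 to 1, and the `1 / (1 + distance)` proximity term are all illustrative choices, not tuned or decided values.

```python
from typing import List, Optional, Tuple

# Hypothetical candidate shape:
# (key, lexical score in [0, 1], identifier matched?, citation hops or None)
Candidate = Tuple[str, float, bool, Optional[int]]


def hybrid_score(lexical: float, identifier_match: bool,
                 citation_distance: Optional[int],
                 weights: Tuple[float, float, float] = (0.6, 0.3, 0.1)) -> float:
    """Combine lexical, identifier, and citation-proximity signals."""
    w_lex, w_id, w_cit = weights
    # Closer in the citation graph scores higher; unknown distance scores 0.
    proximity = 0.0 if citation_distance is None else 1.0 / (1 + citation_distance)
    return w_lex * lexical + w_id * float(identifier_match) + w_cit * proximity


def rank(candidates: List[Candidate]) -> List[str]:
    """Return keys ordered by descending score, ties broken by key."""
    scored = [(-hybrid_score(lex, ident, dist), key)
              for key, lex, ident, dist in candidates]
    return [key for _, key in sorted(scored)]
```

A benchmark fixture set, as listed in the tasks, would be the natural way to replace these placeholder weights with measured ones.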
Why this is later:
- FTS is already enough to support early workflows;
- embedding infrastructure is expensive and should wait until the corpus schema stabilizes.
Exit criteria:
- local search is useful on realistic corpora without requiring external services;
- semantic indexing is optional and does not displace the simpler local search path.
Phase 6: Corpus Acquisition Pipelines
Priority: P2
Goal: Broaden source acquisition without mixing that complexity into the core model.
Tasks:
- add source adapters for open-access theses and dissertation repositories;
- add support for harvesting publisher citation pages and preprint metadata pages;
- define per-source import provenance and rate-limit behavior;
- separate source-specific scraping logic from normalized entry storage;
- add regression fixtures for representative public sources.
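The separation between source-specific logic and normalized storage might be expressed as a small adapter protocol. This is a sketch under stated assumptions: `SourceAdapter`, `RawRecord`, `min_interval_s`, and the `_provenance` key are hypothetical names, and pausing between records stands in for whatever per-request rate limiting a real adapter needs.

```python
import time
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, Iterator, Protocol


@dataclass(frozen=True)
class RawRecord:
    source: str             # adapter name, e.g. a repository identifier
    source_id: str          # the record's identifier within that source
    fields: Dict[str, str]  # unnormalized metadata as the source gave it


class SourceAdapter(Protocol):
    """What the core expects from any source; scraping details stay inside."""
    name: str
    min_interval_s: float

    def fetch(self) -> Iterable[RawRecord]: ...


def import_records(adapter: SourceAdapter,
                   sleep: Callable[[float], None] = time.sleep
                   ) -> Iterator[Dict[str, str]]:
    """Yield entry dicts tagged with provenance, pausing between records."""
    for index, record in enumerate(adapter.fetch()):
        if index:
            sleep(adapter.min_interval_s)  # simple stand-in for rate limiting
        yield {**record.fields,
               "_provenance": f"{record.source}:{record.source_id}"}
```

Passing `sleep` as a parameter lets regression fixtures run without real delays while production code keeps the default.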
Why this is later:
- acquisition breadth is useful, but only after the core ingest/enrich/review loop is solid;
- source adapters are brittle and should sit on top of a stable model.
Exit criteria:
- new public corpora can be imported through adapters without changing the storage core;
- imported entries retain their source provenance and can be reviewed like any other entry.
Suggested Next Three Tasks
- Add a CLI module with `ingest`, `search`, `show`, and `export`.
- Implement BibTeX export from the normalized store.
- Add provenance tables and entry review status fields.
These three tasks complete the first usable local workflow and should be treated as the immediate sprint.