# Roadmap

This roadmap prioritizes a usable local research workflow over breadth of integrations.

The first objective is not to support every metadata source. The first objective is to make one end-to-end path work reliably:

1. ingest draft references,
2. normalize and store them,
3. enrich them,
4. traverse citation links,
5. export reviewed BibTeX.

## Prioritization Principles

- prioritize steps that make the system usable by a single researcher on a local machine;
- prioritize deterministic infrastructure before network integrations;
- keep every stage inspectable and auditable;
- treat verification and provenance as core features, not cleanup work;
- defer heavy semantic infrastructure until the local corpus model is stable.

## Current Baseline

Completed:

- lightweight BibTeX parsing;
- SQLite storage for entries, creators, identifiers, and relations;
- local text search using SQLite FTS5 when available;
- standalone verification/disambiguation output for free-text references and partial BibTeX with auditable match metadata;
- tests for ingest, relation storage, and search.

## Comparison Notes From Related Repos

The adjacent `TOA-Bib-Updater` and `VeriBib` repositories are useful prior art, but they contribute different things:

- `VeriBib` contributes a good pre-ingest verification pattern: inspect ambiguous strings or partial BibTeX, rank candidates from legal metadata sources, and emit explicit audit fields instead of silently trusting a single match.
- `TOA-Bib-Updater` contributes process discipline more than core data modeling: resumable long-running jobs, preserved source artifacts, and generated review outputs for manual inspection.

`citegeist` should absorb those ideas where they improve the main local research workflow:

1. keep verification and auditability in the core package, not just as post-ingest entry resolution;
2. keep resumable manifests and review exports for large acquisition workflows, especially example pipelines and batch imports;
3. avoid coupling the core model to brittle source-specific scraping logic.

## Phase 1: Core Ingestion And Export

Priority: P0

Goal:
Make `citegeist` useful as a local BibTeX workbench even before online enrichment is added.

Tasks:

- add export from the normalized database back to stable, readable BibTeX;
- add a small CLI for `ingest`, `show`, `search`, and `export`;
- store field provenance metadata alongside imported and edited fields;
- add schema support for entry status such as `draft`, `enriched`, `reviewed`, and `exported`;
- add fixture-driven tests for round-tripping BibTeX through ingest and export.

Why this comes first:

- without export, the project is not yet useful in a LaTeX workflow;
- without a CLI, the package is a library demo rather than a tool;
- without provenance and state, later enrichment work becomes hard to audit.

Exit criteria:

- a user can ingest a `.bib` file, inspect entries, search locally, and export a reviewed `.bib`;
- round-trip tests show no unexpected field loss for supported entry types.

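The status and provenance tasks above can be sketched as SQLite DDL. This is an illustrative sketch only: all table and column names (`entries`, `field_provenance`, and so on) are assumptions for discussion, not `citegeist`'s actual schema.

```python
import sqlite3

# Illustrative sketch: entry review status plus per-field provenance.
# Table and column names are assumptions, not citegeist's real schema.
SCHEMA = """
CREATE TABLE IF NOT EXISTS entries (
    key    TEXT PRIMARY KEY,
    status TEXT NOT NULL DEFAULT 'draft'
        CHECK (status IN ('draft', 'enriched', 'reviewed', 'exported'))
);

CREATE TABLE IF NOT EXISTS field_provenance (
    entry_key   TEXT NOT NULL REFERENCES entries(key),
    field       TEXT NOT NULL,
    source      TEXT NOT NULL,          -- e.g. 'import:refs.bib', 'manual-edit'
    recorded_at TEXT NOT NULL DEFAULT (datetime('now')),
    PRIMARY KEY (entry_key, field, source)
);
"""

def open_db(path: str = ":memory:") -> sqlite3.Connection:
    con = sqlite3.connect(path)
    con.executescript(SCHEMA)
    return con

con = open_db()
con.execute("INSERT INTO entries (key) VALUES ('smith2020')")
con.execute(
    "INSERT INTO field_provenance (entry_key, field, source) VALUES (?, ?, ?)",
    ("smith2020", "title", "import:refs.bib"),
)
status = con.execute("SELECT status FROM entries WHERE key='smith2020'").fetchone()[0]
print(status)  # → draft
```

The `CHECK` constraint keeps the status vocabulary closed, so later phases can rely on a fixed set of review states.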
## Phase 2: Reference Extraction

Priority: P0

Goal:
Turn raw reference text into draft entries that can enter the main pipeline.

Tasks:

- add parsers for bibliography-section lines and plain-text reference lists;
- define a draft-entry schema for incomplete references with confidence markers;
- support ingestion of OCR- or PDF-derived plaintext bibliography sections;
- add normalization for author names, years, title casing, and page ranges;
- prefer sentence-boundary venue detection over naive keyword splits so title text containing words like `report` is not truncated;
- repair partially extracted venue stubs such as `Occas.` or `Proc.` by reparsing the full raw reference line when the structured fields are obviously incomplete;
- preserve improved local draft parses even when remote enrichment remains unresolved, so later parser fixes can refresh stored BibTeX without requiring a successful metadata match;
- build gold-test fixtures from real, messy reference examples.

Why this is next:

- this addresses the project's first unique bottleneck: getting rough references into structured form;
- enrichment is much more effective once draft references are normalized.

Exit criteria:

- a user can pass a plaintext bibliography section and receive draft BibTeX entries with unresolved fields clearly marked;
- tests cover common article, book, chapter, proceedings, report, and abbreviation-heavy legacy references.

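A minimal sketch of the draft-entry idea above, assuming a per-field confidence marker: the `DraftEntry` shape and the year/author heuristics here are illustrative assumptions, not the project's actual parser.

```python
import re
from dataclasses import dataclass, field

@dataclass
class DraftEntry:
    """A draft reference with per-field confidence markers (illustrative schema)."""
    raw: str
    fields: dict = field(default_factory=dict)
    confidence: dict = field(default_factory=dict)  # field name -> 'high' | 'low'

def parse_reference(line: str) -> DraftEntry:
    draft = DraftEntry(raw=line)
    # A year in parentheses is a strong signal; a bare 4-digit year is weaker.
    m = re.search(r"\((19|20)\d{2}\)", line)
    if m:
        draft.fields["year"] = m.group(0).strip("()")
        draft.confidence["year"] = "high"
    else:
        m = re.search(r"\b(19|20)\d{2}\b", line)
        if m:
            draft.fields["year"] = m.group(0)
            draft.confidence["year"] = "low"
    # Treat everything before the year marker as an author blob, marked low
    # confidence so downstream normalization knows it still needs review.
    if m:
        draft.fields["author"] = line[: m.start()].strip(" .,")
        draft.confidence["author"] = "low"
    return draft

d = parse_reference("Smith, J. and Doe, A. (2020). A study of things. J. Stuff, 4(2), 1-10.")
print(d.fields["year"], d.confidence["year"])  # → 2020 high
```

Unmatched fields simply stay absent from `fields`, which is what lets later export mark them as unresolved.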
## Phase 3: Metadata Enrichment

Priority: P1

Goal:
Resolve draft or partial entries against external scholarly sources and merge improved metadata safely.

Tasks:

- define a resolver interface with deterministic merge rules;
- implement first-party resolvers for DOI/Crossref, DBLP, and arXiv;
- add identifier-first resolution, then title/author/year fallback search;
- store merge provenance per field and resolution attempt logs;
- flag conflicts rather than silently overwriting disputed values.

Why this is P1 rather than the first phase:

- enrichment quality depends on the ingestion and provenance model being correct first;
- it is easier to test deterministic merge behavior once local workflows already exist.

Exit criteria:

- an incomplete entry can be enriched from at least one authoritative source;
- conflicting fields remain visible for review instead of being lost.

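The resolver interface and conflict-flagging merge rule above could look like the following sketch. The `Resolver` protocol and the `merge` signature are assumptions for discussion; the key property is that existing local values are never silently overwritten.

```python
from typing import Optional, Protocol

class Resolver(Protocol):
    """Illustrative resolver interface; names are assumptions, not citegeist's API."""
    name: str
    def resolve(self, entry: dict) -> Optional[dict]: ...

def merge(entry: dict, fetched: dict, source: str):
    """Deterministic merge: fill empty fields, flag (never overwrite) conflicts."""
    merged = dict(entry)
    conflicts = []
    provenance = {}
    for key, value in fetched.items():
        if key not in merged or not merged[key]:
            merged[key] = value          # empty local field: accept remote value
            provenance[key] = source     # record which source supplied it
        elif merged[key] != value:
            # Disputed value: keep the local field, surface both for review.
            conflicts.append((key, merged[key], value))
    return merged, provenance, conflicts

local = {"title": "A Study of Things", "year": "", "doi": ""}
remote = {"title": "A Study Of Things", "year": "2020", "doi": "10.1000/xyz"}
merged, prov, conflicts = merge(local, remote, source="crossref")
print(merged["year"], prov["doi"], len(conflicts))  # → 2020 crossref 1
```

Because the merge is a pure function of its inputs, the same entry and the same fetched record always produce the same result, which keeps enrichment testable offline.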
## Phase 4: Citation Graph Expansion

Priority: P1

Goal:
Use citation edges as a discovery engine rather than just metadata storage.

Tasks:

- support explicit `cites` and `cited_by` edge ingestion with source provenance;
- add graph expansion commands starting from one or more seed entries;
- track edge discovery source, timestamp, and confidence;
- add filters for depth, source type, year range, and reviewed status;
- expose unresolved nodes so the user can decide what to enrich next.

Why this matters:

- this is central to literature discovery rather than mere bibliography cleanup;
- it turns the database into a research navigation tool.

Exit criteria:

- starting from one or more seed entries, a user can expand outward through citation edges and persist newly discovered nodes;
- graph traversal results can be exported as BibTeX candidates for review.

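The seed-based expansion with a depth filter can be sketched as a breadth-first walk over `cites` edges. The function name and the in-memory edge map are illustrative assumptions; in practice the edges would come from the SQLite relation tables.

```python
from collections import deque

def expand(seeds, edges, max_depth=2):
    """Breadth-first expansion over 'cites' edges from seed entry keys.

    `edges` maps an entry key to the keys it cites; `max_depth` mirrors the
    roadmap's depth filter. Names are illustrative, not citegeist's API.
    """
    seen = {s: 0 for s in seeds}  # key -> discovery depth
    queue = deque(seeds)
    while queue:
        node = queue.popleft()
        depth = seen[node]
        if depth >= max_depth:
            continue  # depth filter: do not expand beyond the cutoff
        for cited in edges.get(node, ()):
            if cited not in seen:
                seen[cited] = depth + 1
                queue.append(cited)
    return seen

edges = {"a": ["b", "c"], "b": ["d"], "d": ["e"]}
print(expand(["a"], edges, max_depth=2))  # → {'a': 0, 'b': 1, 'c': 1, 'd': 2}
```

Recording the discovery depth per node is what lets unresolved nodes be surfaced with context ("found two hops from seed X") for the user's next enrichment decision.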
## Phase 5: Search And Ranking

Priority: P2

Goal:
Improve discovery quality inside the local corpus.

Tasks:

- refine FTS ranking across title, abstract, keywords, and fulltext;
- add saved search queries and result filters;
- add optional embedding-backed semantic search behind a pluggable interface;
- support hybrid ranking that combines lexical matching, identifiers, and citation proximity;
- add benchmarking fixtures for retrieval quality on a few research topics.

Why this is later:

- FTS is already enough to support early workflows;
- embedding infrastructure is expensive and should wait until the corpus schema stabilizes.

Exit criteria:

- local search is useful on realistic corpora without requiring external services;
- semantic indexing is optional and does not displace the simpler local search path.

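One simple way to realize the hybrid-ranking task is a weighted combination of a lexical match score with citation proximity from the Phase 4 graph. The formula and the `alpha` weight below are illustrative assumptions, not a tuned ranking function.

```python
def hybrid_score(lexical: float, graph_distance: int, alpha: float = 0.7) -> float:
    """Combine a lexical match score in [0, 1] with citation proximity.

    Closer citation neighbors score higher; `alpha` weights the lexical
    component. Names and weights are illustrative assumptions.
    """
    proximity = 1.0 / (1.0 + graph_distance)  # distance 0 -> 1.0, distance 5 -> ~0.17
    return alpha * lexical + (1 - alpha) * proximity

candidates = [
    ("exact title match, far in graph", hybrid_score(1.0, graph_distance=5)),
    ("weak match, cited by a seed",     hybrid_score(0.4, graph_distance=1)),
]
for label, score in sorted(candidates, key=lambda c: -c[1]):
    print(f"{score:.3f}  {label}")
```

Keeping the combiner a plain function of two scores means the lexical side can stay FTS5's `bm25` today and be swapped for an embedding similarity later without touching the ranking code.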
## Phase 6: Corpus Acquisition Pipelines

Priority: P2

Goal:
Broaden source acquisition without mixing that complexity into the core model.

Tasks:

- add source adapters for open-access theses and dissertation repositories;
- add support for harvesting publisher citation pages and preprint metadata pages;
- define per-source import provenance and rate-limit behavior;
- separate source-specific scraping logic from normalized entry storage;
- add regression fixtures for representative public sources.

Why this is later:

- acquisition breadth is useful, but only after the core ingest/enrich/review loop is solid;
- source adapters are brittle and should sit on top of a stable model.

Exit criteria:

- new public corpora can be imported through adapters without changing the storage core;
- imported entries retain their source provenance and can be reviewed like any other entry.

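The adapter boundary above can be sketched as a small protocol: adapters only yield raw records tagged with provenance and a rate-limit hint, and a separate normalization step owns entry storage. Every name here (`SourceAdapter`, `RawRecord`, `min_interval_s`) is a hypothetical illustration, not the project's API.

```python
import time
from dataclasses import dataclass
from typing import Iterator, Protocol

@dataclass
class RawRecord:
    """Source-specific payload plus import provenance, kept outside the core model."""
    source: str
    fetched_at: float
    payload: dict

class SourceAdapter(Protocol):
    """Illustrative adapter boundary: adapters yield raw records only; a
    separate normalization step is the sole writer to entry storage."""
    name: str
    min_interval_s: float  # per-source rate-limit hint for the harvest loop
    def harvest(self) -> Iterator[RawRecord]: ...

class StaticThesisAdapter:
    """Toy adapter standing in for an open-access thesis repository."""
    name = "example-thesis-repo"
    min_interval_s = 1.0

    def __init__(self, records):
        self._records = records

    def harvest(self):
        for payload in self._records:
            yield RawRecord(self.name, time.time(), payload)

adapter = StaticThesisAdapter([{"title": "Some Thesis", "year": "2019"}])
for rec in adapter.harvest():
    print(rec.source, rec.payload["title"])  # → example-thesis-repo Some Thesis
```

When a source's page layout breaks, only its adapter changes; the normalized storage core and its tests stay untouched, which is the exit criterion above.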
## Suggested Next Three Tasks

1. Add a CLI module with `ingest`, `search`, `show`, and `export`.
2. Implement BibTeX export from the normalized store.
3. Add provenance tables and entry review status fields.

These three tasks complete the first usable local workflow and should be treated as the immediate sprint.
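For task 1, a minimal `argparse` skeleton shows the intended command surface. The subcommand names come from the roadmap; every argument name below is an illustrative assumption.

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    """CLI skeleton for the first sprint: subcommand names from the roadmap,
    argument names are illustrative assumptions."""
    parser = argparse.ArgumentParser(prog="citegeist")
    sub = parser.add_subparsers(dest="command", required=True)
    sub.add_parser("ingest", help="ingest a .bib file").add_argument("bibfile")
    sub.add_parser("search", help="search the local corpus").add_argument("query")
    sub.add_parser("show", help="show one entry").add_argument("key")
    sub.add_parser("export", help="export reviewed BibTeX").add_argument("outfile")
    return parser

args = build_parser().parse_args(["ingest", "refs.bib"])
print(args.command, args.bibfile)  # → ingest refs.bib
```

Starting from this skeleton, each subcommand can dispatch to the existing ingest, search, and storage modules, which turns the library into a tool with very little new code.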