From fc7e1f1844c95c2d5d00227942dbf902e17e5b52 Mon Sep 17 00:00:00 2001
From: wesley
Date: Sun, 29 Mar 2026 09:04:24 +0000
Subject: [PATCH] Update docs and roadmap

---
 ROADMAP.md             | 5 ++++-
 examples/cli/README.md | 7 +++++++
 2 files changed, 11 insertions(+), 1 deletion(-)

diff --git a/ROADMAP.md b/ROADMAP.md
index aea0822..9302eb8 100644
--- a/ROADMAP.md
+++ b/ROADMAP.md
@@ -66,6 +66,9 @@ Tasks:
 - define a draft-entry schema for incomplete references with confidence markers;
 - support ingestion of OCR- or PDF-derived plaintext bibliography sections;
 - add normalization for author names, years, title casing, and page ranges;
+- prefer sentence-boundary venue detection over naive keyword splits so title text containing words like `report` is not truncated;
+- repair partially extracted venue stubs such as `Occas.` or `Proc.` by reparsing the full raw reference line when the structured fields are obviously incomplete;
+- preserve improved local draft parses even when remote enrichment remains unresolved, so later parser fixes can refresh stored BibTeX without requiring a successful metadata match;
 - build gold-test fixtures from real, messy reference examples.
 
 Why this is next:
@@ -76,7 +79,7 @@ Why this is next:
 
 Exit criteria:
 - a user can pass a plaintext bibliography section and receive draft BibTeX entries with unresolved fields clearly marked;
-- tests cover common article, book, chapter, and proceedings references.
+- tests cover common article, book, chapter, proceedings, report, and abbreviation-heavy legacy references.
 
 ## Phase 3: Metadata Enrichment
 
diff --git a/examples/cli/README.md b/examples/cli/README.md
index 8fbeed3..6be5a1b 100644
--- a/examples/cli/README.md
+++ b/examples/cli/README.md
@@ -179,6 +179,12 @@ Write extracted BibTeX to a file:
 .venv/bin/python -m citegeist extract references.txt --output extracted-artificial-life.bib
 ```
 
+Extraction notes from messy legacy corpora:
+
+- use the full raw reference line as the repair source when the first parse leaves a truncated venue stub;
+- split title from publication data at likely sentence boundaries before falling back to keyword markers, so titles containing words like `report` are not cut early;
+- keep refreshed local BibTeX for unresolved entries so parser improvements can propagate even when no remote metadata source yields a match.
+
 ### Resolve
 
 Resolve one or more entries against remote metadata:
@@ -707,3 +713,4 @@ Apply curated corrections:
 - Some commands depend on live source access.
 - For topic-oriented examples, use preview mode before committing changes when possible.
 - The older TalkOrigins alias commands remain available, but the example-prefixed names are the preferred surface.
+- For extraction work on OCR-heavy or legacy references, keep regression fixtures for abbreviation-heavy venues such as `Proc.`, `Occas. Pap.`, and `Comm. Rept.` because those are easy places for title/venue splits to go wrong.
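
Illustration of the extraction notes in this patch: the sentence-boundary split and the venue-stub repair can be sketched roughly as follows. This is a minimal sketch, not citegeist's actual parser. The function names (`split_title_venue`, `is_truncated_venue`, `repair_title_venue`), the keyword list, and the stub list are all hypothetical, and the sketch assumes the author and year prefix has already been removed from the raw reference text.

```python
import re

# Hypothetical keyword markers used only as a fallback split point.
VENUE_KEYWORDS = ("proceedings", "journal", "report", "bulletin")

# Hypothetical bare abbreviations that suggest a truncated venue field.
VENUE_STUBS = {"Proc.", "Occas.", "Comm."}

# A period, whitespace, then an uppercase letter is treated as a likely
# sentence boundary between the title and the publication data.
SENTENCE_BOUNDARY = re.compile(r"\.\s+(?=[A-Z])")


def split_title_venue(text):
    """Split 'Title. Venue ...' into (title, venue).

    Tries the first sentence boundary, then falls back to keyword markers.
    """
    match = SENTENCE_BOUNDARY.search(text)
    if match:
        return text[: match.start()].strip(), text[match.end():].strip()
    lowered = text.lower()
    for keyword in VENUE_KEYWORDS:
        index = lowered.find(keyword)
        if index > 0:
            return text[:index].strip(" .,"), text[index:].strip()
    return text.strip(), ""


def is_truncated_venue(venue):
    """Return True when the venue is only a bare abbreviation stub."""
    return venue.strip() in VENUE_STUBS


def repair_title_venue(title, venue, raw_text):
    """Reparse the raw text when the stored venue looks truncated."""
    if is_truncated_venue(venue):
        return split_title_venue(raw_text)
    return title, venue


raw = "A report on finch beak variation. Proc. Biol. Soc. Wash. 12: 1-10"
print(split_title_venue(raw))
print(repair_title_venue("A report on finch beak variation", "Proc.", raw))
```

Because the boundary check runs first, the word `report` inside the title never reaches the keyword fallback. A real parser would need extra care with initials and abbreviations inside titles, which this sketch would split too early.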