Update docs and roadmap

wesley 2026-03-29 09:04:24 +00:00
parent f06a68aedc
commit fc7e1f1844
2 changed files with 11 additions and 1 deletions

@@ -66,6 +66,9 @@ Tasks:
- define a draft-entry schema for incomplete references with confidence markers;
- support ingestion of OCR- or PDF-derived plaintext bibliography sections;
- add normalization for author names, years, title casing, and page ranges;
- prefer sentence-boundary venue detection over naive keyword splits so title text containing words like `report` is not truncated;
- repair partially extracted venue stubs such as `Occas.` or `Proc.` by reparsing the full raw reference line when the structured fields are obviously incomplete;
- preserve improved local draft parses even when remote enrichment remains unresolved, so later parser fixes can refresh stored BibTeX without requiring a successful metadata match;
- build gold-test fixtures from real, messy reference examples.
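
A minimal sketch of what a draft-entry schema with confidence markers and a stub-repair hook could look like. The names `DraftField`, `DraftEntry`, `VENUE_STUBS`, and `repair_venue` are hypothetical and are not taken from the citegeist codebase; the reparse step is supplied by the caller.

```python
from dataclasses import dataclass, field

# Hypothetical set of venue stubs that suggest a truncated parse.
VENUE_STUBS = {"Occas.", "Proc.", "Comm."}


@dataclass
class DraftField:
    value: str
    confidence: str = "low"  # assumed levels: "high", "medium", "low"


@dataclass
class DraftEntry:
    raw: str  # full raw reference line, kept as the repair source
    fields: dict[str, DraftField] = field(default_factory=dict)

    def needs_venue_repair(self) -> bool:
        """True when the venue field holds only a known abbreviation stub."""
        venue = self.fields.get("venue")
        return venue is not None and venue.value.strip() in VENUE_STUBS


def repair_venue(entry: DraftEntry, reparse) -> DraftEntry:
    """Replace a stub venue by reparsing the raw line with a caller-supplied function."""
    if entry.needs_venue_repair():
        entry.fields["venue"] = DraftField(reparse(entry.raw), confidence="medium")
    return entry
```

Keeping `raw` on the entry is what lets a later parser fix refresh the stored draft without a remote metadata match.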
Why this is next:
@@ -76,7 +79,7 @@ Why this is next:
Exit criteria:
- a user can pass a plaintext bibliography section and receive draft BibTeX entries with unresolved fields clearly marked;
- tests cover common article, book, chapter, and proceedings references.
- tests cover common article, book, chapter, proceedings, report, and abbreviation-heavy legacy references.
## Phase 3: Metadata Enrichment

@@ -179,6 +179,12 @@ Write extracted BibTeX to a file:
.venv/bin/python -m citegeist extract references.txt --output extracted-artificial-life.bib
```
Extraction notes from messy legacy corpora:
- use the full raw reference line as the repair source when the first parse leaves a truncated venue stub;
- split title from publication data at likely sentence boundaries before falling back to keyword markers, so titles containing words like `report` are not cut early;
- keep refreshed local BibTeX for unresolved entries so parser improvements can propagate even when no remote metadata source yields a match.
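
The sentence-boundary-first split described above can be sketched as follows. This is an illustration under stated assumptions, not the parser citegeist ships: `split_title_venue`, `ABBREVIATIONS`, and the keyword pattern are hypothetical names and values.

```python
import re

# Hypothetical abbreviation list: a period after one of these is not a sentence end.
ABBREVIATIONS = {"Proc", "Occas", "Pap", "Comm", "Rept", "Vol", "No", "Univ"}
# Hypothetical keyword markers used only as a fallback.
KEYWORD_PATTERN = re.compile(r"\b(?:report|proceedings|journal)\b", re.IGNORECASE)


def split_title_venue(text: str) -> tuple[str, str]:
    """Split 'Title. Venue ...' at the first likely sentence boundary."""
    for match in re.finditer(r"\.\s+(?=[A-Z])", text):
        words = text[:match.start()].split()
        last_word = words[-1] if words else ""
        if last_word in ABBREVIATIONS or len(last_word) == 1:
            continue  # period belongs to an abbreviation or an initial
        return text[:match.start()].strip(), text[match.end():].strip()
    # Fallback: naive keyword split, used only when no boundary was found.
    keyword = KEYWORD_PATTERN.search(text)
    if keyword and keyword.start() > 0:
        return text[:keyword.start()].strip(" .,"), text[keyword.start():].strip()
    return text.strip(), ""
```

Because the boundary check runs first, a title such as `A report on artificial life` is kept whole and only the text after the period is treated as publication data.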
### Resolve
Resolve one or more entries against remote metadata:
@@ -707,3 +713,4 @@ Apply curated corrections:
- Some commands depend on live source access.
- For topic-oriented examples, use preview mode before committing changes when possible.
- The older TalkOrigins alias commands remain available, but the example-prefixed names are the preferred surface.
- For extraction work on OCR-heavy or legacy references, keep regression fixtures for abbreviation-heavy venues such as `Proc.`, `Occas. Pap.`, and `Comm. Rept.`, since title/venue splits often go wrong there.
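
One possible shape for such regression fixtures, sketched with synthetic placeholder references (they are not real citations). `VENUE_FIXTURES` and `failing_fixtures` are hypothetical names; the venue parser under test is passed in as a callable.

```python
# Synthetic placeholder fixtures: (raw reference line, expected venue prefix).
VENUE_FIXTURES = [
    ("Doe, J. 1970. A first title. Proc. Example Soc. 1:1-2", "Proc. Example Soc."),
    ("Roe, A. 1965. A second title. Occas. Pap. Example Mus. 7:1-20", "Occas. Pap. Example Mus."),
    ("Poe, B. 1980. A third title. Comm. Rept. Example Inst. 4", "Comm. Rept. Example Inst."),
]


def failing_fixtures(parse_venue, fixtures=VENUE_FIXTURES):
    """Return the raw lines whose parsed venue does not start with the expected prefix."""
    return [
        raw
        for raw, expected in fixtures
        if not parse_venue(raw).startswith(expected)
    ]
```

A parser that stops at the first period after the title would return a stub like `Proc` and show up in the failure list, which is the regression these fixtures are meant to catch.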