Update docs and roadmap
parent f06a68aedc
commit fc7e1f1844
@@ -66,6 +66,9 @@ Tasks:
 - define a draft-entry schema for incomplete references with confidence markers;
 - support ingestion of OCR- or PDF-derived plaintext bibliography sections;
 - add normalization for author names, years, title casing, and page ranges;
+- prefer sentence-boundary venue detection over naive keyword splits so title text containing words like `report` is not truncated;
+- repair partially extracted venue stubs such as `Occas.` or `Proc.` by reparsing the full raw reference line when the structured fields are obviously incomplete;
+- preserve improved local draft parses even when remote enrichment remains unresolved, so later parser fixes can refresh stored BibTeX without requiring a successful metadata match;
 - build gold-test fixtures from real, messy reference examples.
 
 Why this is next:
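The two parsing tasks added in this hunk (sentence-boundary title/venue splitting and venue-stub repair) could be sketched roughly as follows. This is a minimal illustration and not citegeist's actual parser: the function names, the `VENUE_ABBREVIATIONS` set, and the sample reference are all hypothetical.

```python
import re

# Hypothetical abbreviation set; a real parser would need a curated, larger list.
VENUE_ABBREVIATIONS = {"Proc.", "Occas.", "Pap.", "Comm.", "Rept.", "Bull.", "Trans."}


def split_title_venue(text: str) -> tuple[str, str]:
    """Split 'Title. Venue data' at the first sentence boundary.

    A boundary is a period followed by whitespace and a capital letter.
    Boundaries that directly follow a known abbreviation are skipped.
    No keyword such as 'report' is consulted, so it cannot cut a title early.
    """
    for match in re.finditer(r"\.\s+(?=[A-Z])", text):
        head = text[: match.start()]
        words = head.split()
        if words and words[-1] + "." in VENUE_ABBREVIATIONS:
            continue  # the period belongs to an abbreviation, not a sentence end
        return head.strip(), text[match.end():].strip()
    return text.strip(), ""


def looks_like_stub(venue: str) -> bool:
    """True when the venue is empty or is a single known abbreviation."""
    venue = venue.strip()
    return not venue or venue in VENUE_ABBREVIATIONS


def repair_venue(raw_line: str, title: str, venue: str) -> str:
    """Re-derive the venue from the full raw line when the parsed venue is a stub."""
    if not looks_like_stub(venue):
        return venue
    idx = raw_line.find(title)
    if idx == -1:
        return venue  # cannot locate the title, so leave the stub untouched
    remainder = raw_line[idx + len(title):].lstrip(" .")
    return remainder or venue


raw = "Smith, J. 1990. A report on the fossil record. Occas. Pap. Mus. Nat. Hist. 12: 1-20"
title, venue = split_title_venue("A report on the fossil record. Occas. Pap. Mus. Nat. Hist. 12: 1-20")
repaired = repair_venue(raw, title, "Occas.")
```

The abbreviation check is deliberately simple. It only guards against splitting directly after a listed abbreviation, so real corpora would likely need additional rules.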
@@ -76,7 +79,7 @@ Why this is next:
 Exit criteria:
 
 - a user can pass a plaintext bibliography section and receive draft BibTeX entries with unresolved fields clearly marked;
-- tests cover common article, book, chapter, and proceedings references.
+- tests cover common article, book, chapter, proceedings, report, and abbreviation-heavy legacy references.
 
 ## Phase 3: Metadata Enrichment
 
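One way to read the roadmap item about preserving improved local draft parses is sketched below, assuming a hypothetical `DraftEntry` record and `update_entry` helper. Neither name comes from citegeist, and the real storage layer may work quite differently.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class DraftEntry:
    """Hypothetical stored draft: the raw reference line plus parsed fields."""
    raw_line: str
    fields: dict[str, str] = field(default_factory=dict)
    resolved: bool = False


def update_entry(
    stored: DraftEntry,
    reparsed_fields: dict[str, str],
    remote_match: dict[str, str] | None,
) -> DraftEntry:
    """Refresh a stored draft from a new local parse.

    Non-empty reparsed values overwrite the stored ones whether or not
    remote enrichment succeeded, so parser fixes are never discarded.
    """
    fields = {**stored.fields, **{k: v for k, v in reparsed_fields.items() if v}}
    if remote_match is None:
        # Enrichment unresolved: keep the improved local parse as a draft.
        return DraftEntry(stored.raw_line, fields, resolved=False)
    # Enrichment succeeded: remote metadata takes precedence over local fields.
    return DraftEntry(stored.raw_line, {**fields, **remote_match}, resolved=True)


stored = DraftEntry(
    raw_line="Smith, J. 1990. A report on the fossil record. Occas. Pap. Mus. Nat. Hist. 12: 1-20",
    fields={"title": "A report on the fossil record", "journal": "Occas."},
)
refreshed = update_entry(stored, {"journal": "Occas. Pap. Mus. Nat. Hist."}, None)
```

Here `refreshed` stays unresolved but carries the repaired venue, which is the behaviour the roadmap item describes. Letting remote metadata override local fields on a match is an assumption of this sketch.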
@@ -179,6 +179,12 @@ Write extracted BibTeX to a file:
 .venv/bin/python -m citegeist extract references.txt --output extracted-artificial-life.bib
 ```
 
+Extraction notes from messy legacy corpora:
+
+- use the full raw reference line as the repair source when the first parse leaves a truncated venue stub;
+- split title from publication data at likely sentence boundaries before falling back to keyword markers, so titles containing words like `report` are not cut early;
+- keep refreshed local BibTeX for unresolved entries so parser improvements can propagate even when no remote metadata source yields a match.
+
 ### Resolve
 
 Resolve one or more entries against remote metadata:
@@ -707,3 +713,4 @@ Apply curated corrections:
 - Some commands depend on live source access.
 - For topic-oriented examples, use preview mode before committing changes when possible.
 - The older TalkOrigins alias commands remain available, but the example-prefixed names are the preferred surface.
+- For extraction work on OCR-heavy or legacy references, keep regression fixtures for abbreviation-heavy venues such as `Proc.`, `Occas. Pap.`, and `Comm. Rept.` because those are easy places for title/venue splits to go wrong.