## CiteGeist Review Notes
These notes capture parser issues seen while integrating CiteGeist-style extraction into EcoSpecies.
### Report-style references
Observed failure shape:
- references like `Daniell, W.C. 1872. Letters referring ... Comm. Rept. U.S. Comm. Fish & Fish. 2: 387-390.`
- extracted `title` may contain the full raw bibliography string
- abbreviated venue names such as `Comm. Rept.` are not separated cleanly from the title
Suggested upstream change in `citegeist.extract`:
- add a report-style parser path after year detection
- prefer sentence-boundary venue detection before naive keyword splits so words like `report` inside a real title do not trigger an early cut
- support abbreviation-heavy venue starters such as:
  - `comm. rept.`
  - `rept.`
  - `proc.`
  - `occas. pap.`
  - `bulletin`
  - `bull.`
  - `memoir`
- strip trailing volume/page blobs like `2: 387-390` from the venue field
- when a first parse leaves a partial venue stub such as `Occas`, reparse the full raw reference line and prefer the fuller repaired venue/title split
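
The report-style path could look roughly like the sketch below. It is an illustration under stated assumptions, not CiteGeist's actual code: `split_report_reference`, `VENUE_STARTERS`, and both regexes are hypothetical names, and the sample reference in the usage note is invented. Venue starters are only checked at sentence boundaries, so a word like `report` inside the title does not trigger a cut. The partial-stub reparse step is not covered here.

```python
import re

# Hypothetical starter table built from the list above (not CiteGeist's own).
VENUE_STARTERS = (
    "comm. rept.", "occas. pap.", "rept.", "proc.",
    "bulletin", "bull.", "memoir",
)

# A four-digit year followed by a period, e.g. "1872." or "1901a."
_YEAR_RE = re.compile(r"\b(1[5-9]\d{2}|20\d{2})[a-z]?\.\s+")

# Trailing volume/page blob such as "2: 387-390." or "12(3): 1-20"
_TRAILING_LOCATOR_RE = re.compile(
    r"\s*\d+\s*(?:\(\d+\))?\s*:\s*\d+\s*[-–]\s*\d+\.?\s*$"
)


def split_report_reference(raw):
    """Split a raw reference into year, title, and venue, or return None."""
    year_match = _YEAR_RE.search(raw)
    if not year_match:
        return None
    rest = raw[year_match.end():]
    # Only consider a venue starter right after a sentence boundary.
    for boundary in re.finditer(r"\.\s+", rest):
        tail = rest[boundary.end():]
        if tail.lower().startswith(VENUE_STARTERS):
            return {
                "year": year_match.group(1),
                "title": rest[:boundary.start()].strip(),
                "venue": _TRAILING_LOCATOR_RE.sub("", tail).strip(),
            }
    return None
```

For the invented input `Doe, J. 1901. A report on river fishes. Bull. Example Fish Comm. 2: 387-390.`, this keeps `A report on river fishes` as the title and `Bull. Example Fish Comm.` as the venue.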
### Placeholder title merge behavior
Observed failure shape:
- a raw bibliography string may survive as `title` even after DOI/title resolution finds a better title
Suggested upstream change in `citegeist.resolve.merge_entries_with_conflicts`:
- treat titles that look like raw bibliography strings as placeholders
- example heuristic:
  - starts with `Surname, ... YEAR.`
  - unusually long for a title
  - contains a resolved shorter title as a substring after punctuation normalization
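
A minimal sketch of that heuristic follows. The function name, the 200-character threshold, and the rule of requiring two of the three signals are all assumptions made for illustration. Nothing here reflects how `merge_entries_with_conflicts` is actually written.

```python
import re

# "Surname, <anything> YEAR." at the start of the string.
_RAW_PREFIX_RE = re.compile(
    r"^[^\W\d_][\w'-]*,\s.*?\b(?:1[5-9]\d{2}|20\d{2})[a-z]?\."
)
MAX_TITLE_CHARS = 200  # assumed cutoff for "unusually long"


def _normalize(text):
    """Lowercase, replace punctuation with spaces, collapse whitespace."""
    return re.sub(r"\s+", " ", re.sub(r"[^\w\s]", " ", text)).strip().lower()


def looks_like_placeholder_title(seed_title, resolved_title=None, min_signals=2):
    """Return True when enough signals suggest a raw bibliography string."""
    signals = 0
    if _RAW_PREFIX_RE.match(seed_title):
        signals += 1
    if len(seed_title) > MAX_TITLE_CHARS:
        signals += 1
    if resolved_title:
        seed_norm = _normalize(seed_title)
        resolved_norm = _normalize(resolved_title)
        # The resolved title sits inside the seed but is not the whole seed.
        if resolved_norm and resolved_norm != seed_norm and resolved_norm in seed_norm:
            signals += 1
    return signals >= min_signals
```

Requiring two signals is a deliberately conservative choice: a long but genuine title, or a genuine title that happens to begin with a surname and year, is not discarded on one signal alone.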
### Legacy note deduplication
Observed failure shape:
- note fragments like `ecospecies_reference_number = {160}` can be appended more than once downstream when re-merging enriched metadata
Suggested upstream change:
- when joining note fragments, split on `;`, normalize whitespace, and dedupe per fragment rather than per whole note string
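
Per-fragment deduplication can be sketched as below. `merge_note_fragments` is a hypothetical helper name, and the sketch assumes fragments never contain a literal `;` inside braces.

```python
def merge_note_fragments(*notes):
    """Join note strings, deduplicating individual `;`-separated fragments."""
    seen = set()
    merged = []
    for note in notes:
        if not note:
            continue
        for fragment in note.split(";"):
            cleaned = " ".join(fragment.split())  # normalize whitespace
            if cleaned and cleaned not in seen:
                seen.add(cleaned)
                merged.append(cleaned)
    return "; ".join(merged)
```

First-seen order is preserved, so re-merging an already merged note leaves it unchanged.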
### Unresolved entries should still refresh local parses
Observed failure shape:
- parser improvements may correctly rebuild `title`, venue, `volume`, `number`, and `pages`
- but if no remote metadata source matches, the stored draft BibTeX can remain unchanged unless unresolved enrichment also writes the refreshed local seed back out
Suggested upstream change:
- unresolved enrichment should still return the rebuilt local draft entry
- keep `citation_key`, normalized text, and draft BibTeX synchronized with the current local parser even when resolver status remains `unresolved`
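
One way to picture the intended control flow is the sketch below. Everything in it is hypothetical (`EnrichmentResult`, `render_bibtex`, `enrich`, and the two callables passed in). It only illustrates that the unresolved branch still returns a draft rendered from the fresh local parse.

```python
from dataclasses import dataclass


@dataclass
class EnrichmentResult:
    status: str   # "resolved" or "unresolved"
    entry: dict   # current field values
    bibtex: str   # draft BibTeX rendered from `entry`


def render_bibtex(entry):
    """Render a flat dict as a minimal @misc entry (illustrative only)."""
    fields = [
        f"  {key} = {{{value}}}"
        for key, value in entry.items()
        if key != "citation_key" and value
    ]
    return "@misc{" + entry["citation_key"] + ",\n" + ",\n".join(fields) + "\n}"


def enrich(raw, parse_local, resolve_remote):
    """Always rebuild the local seed; merge remote data only when found."""
    seed = parse_local(raw)
    remote = resolve_remote(seed)
    if remote is None:
        # No remote match: still hand back the refreshed local draft.
        return EnrichmentResult("unresolved", seed, render_bibtex(seed))
    merged = {**seed, **{k: v for k, v in remote.items() if v}}
    return EnrichmentResult("resolved", merged, render_bibtex(merged))
```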
### Returned metadata not carried through
Observed concern:
- resolver/source payloads may include bibliographic details such as:
  - `volume`
  - `issue` / BibTeX `number`
  - `page` / BibTeX `pages`
- these should be preserved into the BibTeX entry whenever available
Current note:
- CiteGeist Crossref mapping already includes `volume`, `number`, and `pages`
- verify that all resolver paths, storage round-trips, and exports preserve those fields consistently
- OpenAlex/DataCite mappings should also be checked for analogous bibliographic fields in `biblio` / attribute payloads
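
As a rough check of what "carried through" means, the sketch below maps the bibliographic fields from a Crossref work message (`volume`, `issue`, `page`) and from an OpenAlex `biblio` object (`volume`, `issue`, `first_page`, `last_page`) into BibTeX field names. The function names are hypothetical and this is not CiteGeist's mapping code. DataCite is left out because its payload shape has not been verified here.

```python
def biblio_from_crossref(message):
    """Map Crossref work fields to BibTeX names, dropping empty values."""
    fields = {
        "volume": message.get("volume"),
        "number": message.get("issue"),
        "pages": message.get("page"),  # kept as returned, e.g. "387-390"
    }
    return {key: str(value) for key, value in fields.items() if value}


def biblio_from_openalex(work):
    """Map the OpenAlex `biblio` object to BibTeX names, tolerating null."""
    biblio = work.get("biblio") or {}
    first = biblio.get("first_page")
    last = biblio.get("last_page")
    pages = f"{first}--{last}" if first and last and first != last else first
    fields = {
        "volume": biblio.get("volume"),
        "number": biblio.get("issue"),
        "pages": pages,
    }
    return {key: str(value) for key, value in fields.items() if value}
```

A round-trip test that feeds such a dict through storage and export, then asserts the three fields are unchanged, would cover the verification item above.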
### False-positive title-search acceptance
Observed failure shape:
- title search can return a thematically related but bibliographically different work
- downstream acceptance may keep some seed fields while adopting conflicting DOI/title/volume/pages from the returned match
- this is especially risky for historical references with sparse or abbreviated venue names
Suggested upstream change in `citegeist.resolve` and any title-search ranking path:
- do not fall back to the first search hit when no strong title match exists
- prefer exact or near-exact title matches only
- reject a candidate when structured seed metadata conflicts on strong fields such as:
  - `year`
  - venue / journal
  - `volume`
  - `number`
  - `pages`
- treat those fields as match-validation inputs, not just merge-time metadata
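
The validation idea can be sketched as follows, using `difflib.SequenceMatcher` from the standard library for the near-exact title check. The names, the 0.9 threshold, and the field list are assumptions. Venue is left out of the strict comparison because abbreviated venue names would need an abbreviation-aware comparison that this sketch does not attempt.

```python
import re
from difflib import SequenceMatcher

# Fields compared strictly when both sides have a value (assumed list).
STRONG_FIELDS = ("year", "volume", "number", "pages")
MIN_TITLE_RATIO = 0.9  # assumed threshold for "near-exact"


def _norm(value):
    """Lowercase and reduce every non-alphanumeric run to one space."""
    return re.sub(r"[^a-z0-9]+", " ", str(value).lower()).strip()


def accept_candidate(seed, candidate):
    """Accept only near-exact titles with no conflicting strong fields."""
    seed_title = _norm(seed.get("title") or "")
    cand_title = _norm(candidate.get("title") or "")
    if not seed_title or not cand_title:
        return False
    if SequenceMatcher(None, seed_title, cand_title).ratio() < MIN_TITLE_RATIO:
        return False
    for field in STRONG_FIELDS:
        seed_value, cand_value = seed.get(field), candidate.get(field)
        if seed_value and cand_value and _norm(seed_value) != _norm(cand_value):
            return False
    return True


def pick_match(seed, candidates):
    """Return the first acceptable candidate, or None (no first-hit fallback)."""
    for candidate in candidates:
        if accept_candidate(seed, candidate):
            return candidate
    return None
```

Returning `None` instead of the top-ranked hit keeps the entry `unresolved`, which is safer than adopting a conflicting DOI.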
### OpenAlex null-source handling
Observed failure shape:
- some OpenAlex works have `primary_location` present but `source: null`
- downstream mapping can crash if it assumes `source` is always a dictionary
Suggested upstream change:
- treat null `source` payloads as empty dictionaries
- continue mapping title, year, DOI, and `biblio` fields even when venue/source is missing
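
A defensive mapping could look like the sketch below. `map_openalex_work` is a hypothetical name. The keys read from the payload (`title`, `publication_year`, `doi`, `primary_location.source.display_name`, `biblio`) follow the OpenAlex work object, and the `or {}` guards are the part being illustrated.

```python
def map_openalex_work(work):
    """Map an OpenAlex work to a flat entry, tolerating null sub-objects."""
    location = work.get("primary_location") or {}
    source = location.get("source") or {}   # `source` may be null
    biblio = work.get("biblio") or {}
    entry = {
        "title": work.get("title"),
        "year": work.get("publication_year"),
        "doi": work.get("doi"),
        "journal": source.get("display_name"),
        "volume": biblio.get("volume"),
        "number": biblio.get("issue"),
    }
    # Drop fields the payload did not provide.
    return {key: value for key, value in entry.items() if value is not None}
```

`dict.get(key, {})` would not be enough here, because the key is present with an explicit `None` value. `or {}` covers both the missing and the null case.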