## CiteGeist Review Notes

These notes capture parser issues seen while integrating CiteGeist-style extraction into EcoSpecies.

### Report-style references

Observed failure shape:

- references like `Daniell, W.C. 1872. Letters referring ... Comm. Rept. U.S. Comm. Fish & Fish. 2: 387-390.`
- extracted `title` may contain the full raw bibliography string
- abbreviated venue names such as `Comm. Rept.` are not separated cleanly from the title

Suggested upstream change in `citegeist.extract`:

- add a report-style parser path after year detection
- prefer sentence-boundary venue detection before naive keyword splits so words like `report` inside a real title do not trigger an early cut
- support abbreviation-heavy venue starters such as:
  - `comm. rept.`
  - `rept.`
  - `proc.`
  - `occas. pap.`
  - `bulletin`
  - `bull.`
  - `memoir`
- strip trailing volume/page blobs like `2: 387-390` from the venue field
- when a first parse leaves a partial venue stub such as `Occas`, reparse the full raw reference line and prefer the fuller repaired venue/title split

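A minimal sketch of the report-style path described above, assuming the author and year have already been consumed and only the remaining text is passed in. The function name `split_report_reference`, the `VENUE_STARTERS` tuple, and the returned dictionary shape are hypothetical and not taken from `citegeist.extract`; the sample title in the usage note is invented.

```python
import re

# Hypothetical starter list mirroring the abbreviations named in these notes.
VENUE_STARTERS = (
    "comm. rept.", "rept.", "proc.", "occas. pap.", "bulletin", "bull.", "memoir",
)

# Trailing "volume: pages" blob such as "2: 387-390" (hyphen or en dash).
VOLUME_PAGES_RE = re.compile(
    r"\s*(?P<volume>\d+)\s*:\s*(?P<pages>\d+\s*[-\u2013]\s*\d+)\.?\s*$"
)


def split_report_reference(body: str) -> dict:
    """Split the text that follows 'Author. YEAR.' into title, venue, volume, pages."""
    result = {"title": "", "venue": "", "volume": "", "pages": ""}
    blob = VOLUME_PAGES_RE.search(body)
    if blob:
        # Peel the volume/page blob off first so it never lands in the venue.
        result["volume"] = blob.group("volume")
        result["pages"] = blob.group("pages")
        body = body[: blob.start()]
    result["title"] = body.strip()
    lowered = body.lower()
    # Accept a venue starter only directly after a sentence boundary, so a word
    # like "report" inside the title cannot trigger an early cut.
    for boundary in re.finditer(r"\.\s+", body):
        if lowered[boundary.end():].startswith(VENUE_STARTERS):
            result["title"] = body[: boundary.start()].strip()
            result["venue"] = body[boundary.end():].strip()
            break
    return result
```

With the invented input `"A report on shad culture. Comm. Rept. U.S. Comm. Fish & Fish. 2: 387-390."`, this returns the title `A report on shad culture`, the venue `Comm. Rept. U.S. Comm. Fish & Fish.`, volume `2`, and pages `387-390`. The sketch does not cover the partial-stub reparse case.
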
### Placeholder title merge behavior

Observed failure shape:

- a raw bibliography string may survive as `title` even after DOI/title resolution finds a better title

Suggested upstream change in `citegeist.resolve.merge_entries_with_conflicts`:

- treat titles that look like raw bibliography strings as placeholders
- example heuristic:
  - starts with `Surname, ... YEAR.`
  - unusually long for a title
  - contains a resolved shorter title as a substring after punctuation normalization

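The example heuristic could be sketched roughly as follows. The notes list three signals but do not say how to combine them, so the rule used here (reference-like prefix plus either of the other two signals) is an assumption, as are the function names, the year range in the regex, and the `max_len` threshold.

```python
import re

# "Surname, ... YEAR." prefix; the 1500-2099 year range is an assumption.
RAW_PREFIX_RE = re.compile(r"^[^\W\d_][\w'\-]*,\s.*?\b(?:1[5-9]|20)\d{2}[a-z]?\.")


def normalize(text: str) -> str:
    """Lowercase, replace punctuation with spaces, and collapse whitespace."""
    return " ".join(re.sub(r"[^\w\s]", " ", text.lower()).split())


def looks_like_placeholder_title(title, resolved_title=None, max_len=200):
    """Return True when `title` looks like a raw bibliography string."""
    starts_like_reference = bool(RAW_PREFIX_RE.match(title))
    too_long = len(title) > max_len
    contains_resolved = (
        bool(resolved_title)
        and len(resolved_title) < len(title)
        and normalize(resolved_title) in normalize(title)
    )
    # Assumed combination: the prefix must match, plus one supporting signal.
    return starts_like_reference and (too_long or contains_resolved)
```

A merge step could then let the resolved title win whenever the stored title is flagged, instead of recording a title conflict.
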
### Legacy note deduplication

Observed failure shape:

- note fragments like `ecospecies_reference_number = {160}` can be appended more than once downstream when re-merging enriched metadata

Suggested upstream change:

- when joining note fragments, split on `;`, normalize whitespace, and dedupe per fragment rather than per whole note string

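A minimal sketch of per-fragment deduplication, assuming notes are plain strings joined with `; `. The helper name `join_note_fragments` is hypothetical.

```python
def join_note_fragments(*notes):
    """Join note strings, keeping each ';'-separated fragment only once."""
    seen = set()
    fragments = []
    for note in notes:
        for raw in (note or "").split(";"):
            # Collapse runs of whitespace so spacing differences do not
            # defeat the duplicate check.
            fragment = " ".join(raw.split())
            if fragment and fragment not in seen:
                seen.add(fragment)
                fragments.append(fragment)
    return "; ".join(fragments)
```

First-seen order is preserved, so re-merging the same enriched metadata leaves the note unchanged. A plain split on `;` would also split a fragment whose braced value itself contains a semicolon; this sketch does not handle that case.
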
### Unresolved entries should still refresh local parses

Observed failure shape:

- parser improvements may correctly rebuild `title`, venue, `volume`, `number`, and `pages`
- but if no remote metadata source matches, the stored draft BibTeX can remain unchanged unless unresolved enrichment also writes the refreshed local seed back out

Suggested upstream change:

- unresolved enrichment should still return the rebuilt local draft entry
- keep `citation_key`, normalized text, and draft BibTeX synchronized with the current local parser even when resolver status remains `unresolved`

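The intended control flow might look like the sketch below. Everything here is hypothetical: `enrich_reference`, its three callable parameters, and the result keys stand in for whatever the real enrichment entry point uses.

```python
def enrich_reference(raw, parse_local, resolve_remote, render_bibtex):
    """Re-parse locally on every run; remote metadata only adds to the seed."""
    seed = parse_local(raw)        # always rebuilt with the current parser
    match = resolve_remote(seed)   # assumed to return None when nothing matches
    entry = dict(seed) if match is None else {**seed, **match}
    return {
        "status": "unresolved" if match is None else "resolved",
        "citation_key": entry.get("citation_key", ""),
        "entry": entry,
        # Rendered from the refreshed entry in both branches, so the stored
        # draft cannot lag behind the local parser.
        "draft_bibtex": render_bibtex(entry),
    }
```

The point of the shape is that the caller writes back `entry` and `draft_bibtex` regardless of `status`.
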
### Returned metadata not carried through

Observed concern:

- resolver/source payloads may include bibliographic details such as:
  - `volume`
  - `issue` / BibTeX `number`
  - `page` / BibTeX `pages`
- these should be preserved into the BibTeX entry whenever available

Current note:

- CiteGeist Crossref mapping already includes `volume`, `number`, and `pages`
- verify that all resolver paths, storage round-trips, and exports preserve those fields consistently
- OpenAlex/DataCite mappings should also be checked for analogous bibliographic fields in `biblio` / attribute payloads

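For reference when checking the mappings, a sketch of the field carry-through for Crossref-style and OpenAlex-style payloads. The payload keys (`volume`, `issue`, `page` for Crossref; `biblio` with `volume`, `issue`, `first_page`, `last_page` for OpenAlex) reflect my understanding of those public APIs and should be verified against real responses; the function names are hypothetical and are not CiteGeist's own mapping functions. DataCite is left out because its attribute layout is not confirmed here.

```python
def bibtex_fields_from_crossref(work: dict) -> dict:
    """Copy volume/issue/page from a Crossref-style work into BibTeX names."""
    mapping = {"volume": "volume", "issue": "number", "page": "pages"}
    return {bib: str(work[src]) for src, bib in mapping.items() if work.get(src)}


def bibtex_fields_from_openalex(work: dict) -> dict:
    """Copy the analogous fields from an OpenAlex-style `biblio` object."""
    biblio = work.get("biblio") or {}
    fields = {}
    if biblio.get("volume"):
        fields["volume"] = str(biblio["volume"])
    if biblio.get("issue"):
        fields["number"] = str(biblio["issue"])
    first, last = biblio.get("first_page"), biblio.get("last_page")
    if first and last and first != last:
        # BibTeX convention writes page ranges with a double hyphen.
        fields["pages"] = f"{first}--{last}"
    elif first:
        fields["pages"] = str(first)
    return fields
```

Missing or null values are simply omitted, so these fields can be merged over a local seed without blanking anything the parser already recovered.
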
### False-positive title-search acceptance

Observed failure shape:

- title search can return a thematically related but bibliographically different work
- downstream acceptance may keep some seed fields while adopting conflicting DOI/title/volume/pages from the returned match
- this is especially risky for historical references with sparse or abbreviated venue names

Suggested upstream change in `citegeist.resolve` and any title-search ranking path:

- do not fall back to the first search hit when no strong title match exists
- prefer exact or near-exact title matches only
- reject a candidate when structured seed metadata conflicts on strong fields such as:
  - `year`
  - venue / journal
  - `volume`
  - `number`
  - `pages`
- treat those fields as match-validation inputs, not just merge-time metadata

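One way to express these acceptance rules, as a sketch only. The names `pick_candidate`, `conflicting_fields`, and `STRONG_FIELDS`, the flat dictionary shape of seeds and candidates, and the `0.9` similarity threshold are all assumptions; `difflib.SequenceMatcher` is a stand-in for whatever title comparison the resolver actually uses.

```python
import re
from difflib import SequenceMatcher

STRONG_FIELDS = ("year", "journal", "volume", "number", "pages")


def _norm(value) -> str:
    """Lowercase, replace punctuation with spaces, and collapse whitespace."""
    return " ".join(re.sub(r"[^\w\s]", " ", str(value).lower()).split())


def title_similarity(a: str, b: str) -> float:
    return SequenceMatcher(None, _norm(a), _norm(b)).ratio()


def conflicting_fields(seed: dict, candidate: dict) -> list:
    """Strong fields present on both sides whose normalized values differ."""
    return [
        name for name in STRONG_FIELDS
        if seed.get(name) and candidate.get(name)
        and _norm(seed[name]) != _norm(candidate[name])
    ]


def pick_candidate(seed: dict, candidates: list, min_similarity: float = 0.9):
    """Return the best validated candidate, or None. Never the first hit by default."""
    best, best_score = None, 0.0
    for candidate in candidates:
        score = title_similarity(seed.get("title", ""), candidate.get("title", ""))
        if score < min_similarity or conflicting_fields(seed, candidate):
            continue
        if score > best_score:
            best, best_score = candidate, score
    return best
```

Exact comparison of venue strings is a simplification: an abbreviated seed venue such as `Comm. Rept.` would conflict with a spelled-out journal name, so a real implementation would need abbreviation-aware venue matching before treating that field as a hard reject.
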
### OpenAlex null-source handling

Observed failure shape:

- some OpenAlex works have `primary_location` present but `source: null`
- downstream mapping can crash if it assumes `source` is always a dictionary

Suggested upstream change:

- treat null `source` payloads as empty dictionaries
- continue mapping title, year, DOI, and `biblio` fields even when venue/source is missing
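
A defensive mapping could be sketched as below. The function name `map_openalex_work` and the output keys are hypothetical; the input keys (`title`, `publication_year`, `doi`, `primary_location.source.display_name`, `biblio`) follow my reading of the OpenAlex work object and should be checked against live payloads.

```python
DOI_PREFIX = "https://doi.org/"


def map_openalex_work(work: dict) -> dict:
    """Map an OpenAlex-style work, tolerating null location/source/biblio."""
    # `or {}` covers both a missing key and an explicit JSON null.
    location = work.get("primary_location") or {}
    source = location.get("source") or {}
    biblio = work.get("biblio") or {}
    doi = work.get("doi") or ""
    if doi.startswith(DOI_PREFIX):
        doi = doi[len(DOI_PREFIX):]
    return {
        "title": work.get("title") or "",
        "year": str(work.get("publication_year") or ""),
        "doi": doi,
        "journal": source.get("display_name") or "",  # empty when source is null
        "volume": biblio.get("volume") or "",
    }
```

Using `work.get("x") or {}` rather than `work.get("x", {})` matters here, because the default argument is ignored when the key exists with a null value.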