## CiteGeist Review Notes

These notes capture parser issues seen while integrating CiteGeist-style extraction into EcoSpecies.

### Report-style references

Observed failure shape:

- references like `Daniell, W.C. 1872. Letters referring ... Comm. Rept. U.S. Comm. Fish & Fish. 2: 387-390.`
- extracted `title` may contain the full raw bibliography string
- abbreviated venue names such as `Comm. Rept.` are not separated cleanly from the title

Suggested upstream change in `citegeist.extract`:

- add a report-style parser path after year detection
- prefer sentence-boundary venue detection before naive keyword splits so words like `report` inside a real title do not trigger an early cut
- support abbreviation-heavy venue starters such as:
  - `comm. rept.`
  - `rept.`
  - `proc.`
  - `occas. pap.`
  - `bulletin`
  - `bull.`
  - `memoir`
- strip trailing volume/page blobs like `2: 387-390` from the venue field
- when a first parse leaves a partial venue stub such as `Occas`, reparse the full raw reference line and prefer the fuller repaired venue/title split

### Placeholder title merge behavior

Observed failure shape:

- a raw bibliography string may survive as `title` even after DOI/title resolution finds a better title

Suggested upstream change in `citegeist.resolve.merge_entries_with_conflicts`:

- treat titles that look like raw bibliography strings as placeholders
- example heuristic:
  - starts with `Surname, ... YEAR.`
  - unusually long for a title
  - contains a resolved shorter title as a substring after punctuation normalization

### Legacy note deduplication

Observed failure shape:

- note fragments like `ecospecies_reference_number = {160}` can be appended more than once downstream when re-merging enriched metadata

Suggested upstream change:

- when joining note fragments, split on `;`, normalize whitespace, and dedupe per fragment rather than per whole note string

### Unresolved entries should still refresh local parses

Observed failure shape:

- parser improvements may correctly rebuild `title`, venue, `volume`, `number`, and `pages`
- but if no remote metadata source matches, the stored draft BibTeX can remain unchanged unless unresolved enrichment also writes the refreshed local seed back out

Suggested upstream change:

- unresolved enrichment should still return the rebuilt local draft entry
- keep `citation_key`, normalized text, and draft BibTeX synchronized with the current local parser even when resolver status remains `unresolved`

### Returned metadata not carried through

Observed concern:

- resolver/source payloads may include bibliographic details such as:
  - `volume`
  - `issue` / BibTeX `number`
  - `page` / BibTeX `pages`
- these should be preserved into the BibTeX entry whenever available

Current note:

- CiteGeist Crossref mapping already includes `volume`, `number`, and `pages`
- verify that all resolver paths, storage round-trips, and exports preserve those fields consistently
- OpenAlex/DataCite mappings should also be checked for analogous bibliographic fields in `biblio` / attribute payloads

### False-positive title-search acceptance

Observed failure shape:

- title search can return a thematically related but bibliographically different work
- downstream acceptance may keep some seed fields while adopting conflicting DOI/title/volume/pages from the returned match
- this is especially risky for historical references with sparse or abbreviated venue names

Suggested upstream change in `citegeist.resolve` and any title-search ranking path:

- do not fall back to the first search hit when no strong title match exists
- prefer exact or near-exact title matches only
- reject a candidate when structured seed metadata conflicts on strong fields such as:
  - `year`
  - venue / journal
  - `volume`
  - `number`
  - `pages`
- treat those fields as match-validation inputs, not just merge-time metadata

### OpenAlex null-source handling

Observed failure shape:

- some OpenAlex works have `primary_location` present but `source: null`
- downstream mapping can crash if it assumes `source` is always a dictionary

Suggested upstream change:

- treat null `source` payloads as empty dictionaries
- continue mapping title, year, DOI, and `biblio` fields even when venue/source is missing
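
The null-source handling described above could look roughly like the sketch below. It is not CiteGeist's actual mapping code: `map_openalex_work` and the output field names are hypothetical, chosen only to illustrate the `or {}` guard. The payload keys (`primary_location`, `source`, `display_name`, `biblio`, `first_page`, `last_page`, `publication_year`) follow the OpenAlex work object as we understand it, and the sketch assumes Python 3.9 or later.

```python
from typing import Any


def map_openalex_work(work: dict[str, Any]) -> dict[str, str]:
    """Map an OpenAlex work payload to flat BibTeX-style fields (hypothetical helper)."""
    # `or {}` covers both a missing key and an explicit JSON null.
    location = work.get("primary_location") or {}
    source = location.get("source") or {}
    biblio = work.get("biblio") or {}

    # Build a BibTeX-style page range only when both ends are present and differ.
    first, last = biblio.get("first_page"), biblio.get("last_page")
    if first and last and first != last:
        pages = f"{first}--{last}"
    else:
        pages = first or last or ""

    # OpenAlex reports DOIs as URLs; keep only the bare identifier.
    doi = (work.get("doi") or "").removeprefix("https://doi.org/")

    fields = {
        "title": work.get("title") or work.get("display_name") or "",
        "year": work.get("publication_year") or "",
        "doi": doi,
        "journal": source.get("display_name") or "",
        "volume": biblio.get("volume") or "",
        "number": biblio.get("issue") or "",
        "pages": pages,
    }
    # Drop empty values so a missing venue cannot overwrite seed metadata later.
    return {key: str(value) for key, value in fields.items() if value}
```

With a payload where `primary_location` is `{"source": None}`, this returns the title, year, DOI, and `biblio` fields and simply omits `journal`, instead of raising on `None.get(...)`. Dropping empty values rather than emitting empty strings is a design choice made here so that a later merge step can keep the locally parsed venue.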