EcoSpecies-Atlas/docs/citegeist-review-notes.md

CiteGeist Review Notes

These notes capture parser issues seen while integrating CiteGeist-style extraction into EcoSpecies.

Report-style references

Observed failure shape:

  • references like Daniell, W.C. 1872. Letters referring ... Comm. Rept. U.S. Comm. Fish & Fish. 2: 387-390.
  • extracted title may contain the full raw bibliography string
  • abbreviated venue names such as Comm. Rept. are not separated cleanly from the title

Suggested upstream change in citegeist.extract:

  • add a report-style parser path after year detection
  • try sentence-boundary venue detection before naive keyword splits, so that a word like report inside a real title does not trigger an early cut
  • support abbreviation-heavy venue starters such as:
    • comm. rept.
    • rept.
    • proc.
    • occas. pap.
    • bulletin
    • bull.
    • memoir
  • strip trailing volume/page blobs like 2: 387-390 from the venue field
  • when a first parse leaves a partial venue stub such as Occas, reparse the full raw reference line and prefer the fuller repaired venue/title split
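
The report-style path could look roughly like the sketch below. It is not CiteGeist code: split_report_reference, VENUE_STARTERS, and both regexes are hypothetical names, and the sample reference in the usage note is invented. It assumes the author and year have already been consumed, and it only accepts a venue starter that follows a sentence-ending period.

```python
import re

# Venue starters copied from the list above, longest first so that
# "comm. rept." is tried before "rept." and "bulletin" before "bull.".
VENUE_STARTERS = (
    "comm. rept.", "occas. pap.", "bulletin", "memoir", "rept.", "proc.", "bull.",
)

# A starter only counts when it follows a sentence-ending period.
_VENUE_AT_SENTENCE_START = re.compile(
    r"\.\s+(" + "|".join(re.escape(s) for s in VENUE_STARTERS) + r")",
    re.IGNORECASE,
)

# Trailing "volume: first-last" blob such as "2: 387-390."
_TRAILING_VOLUME_PAGES = re.compile(
    r"(?<!\S)(?P<volume>\d+)\s*:\s*(?P<pages>\d+\s*[-\u2013]\s*\d+)\.?\s*$"
)


def split_report_reference(after_year):
    """Split the text that follows the year into title, venue, volume, pages."""
    text = after_year.strip()
    fields = {"title": "", "venue": "", "volume": "", "pages": ""}

    # Peel the volume/page blob off first so it never lands in the venue.
    tail = _TRAILING_VOLUME_PAGES.search(text)
    if tail:
        fields["volume"] = tail.group("volume")
        fields["pages"] = re.sub(r"\s*[-\u2013]\s*", "--", tail.group("pages"))
        text = text[: tail.start()].rstrip()

    # Cut at the first venue starter that opens a new sentence.
    boundary = _VENUE_AT_SENTENCE_START.search(text)
    if boundary:
        fields["title"] = text[: boundary.start(1)].strip().rstrip(".")
        fields["venue"] = text[boundary.start(1):].strip()
    else:
        fields["title"] = text.rstrip(".")
    return fields
```

For an invented input such as "Memoir on the shad fisheries. Comm. Rept. U.S. Comm. Fish & Fish. 2: 387-390.", the leading Memoir stays in the title because it does not follow a period, and the cut happens at Comm. Rept. A starter that genuinely opens a second sentence inside a title would still be cut too early, so the stub-repair reparse remains useful.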

Placeholder title merge behavior

Observed failure shape:

  • a raw bibliography string may survive as the title even after DOI/title resolution finds a better title

Suggested upstream change in citegeist.resolve.merge_entries_with_conflicts:

  • treat titles that look like raw bibliography strings as placeholders
  • example heuristic:
    • starts with Surname, ... YEAR.
    • unusually long for a title
    • contains a resolved shorter title as a substring after punctuation normalization
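
One way to combine the three signals is sketched below. The function name, the 150-character cutoff, and the rule that the Surname/YEAR prefix is required while either of the other two signals is sufficient are all assumptions made for illustration; the notes above do not say how the signals should be weighted.

```python
import re

# "Surname, ... YEAR." at the start of the string (years 1500-2099).
_RAW_REFERENCE_PREFIX = re.compile(
    r"^[A-Z][\w'\u2019-]+,\s.*?\b(?:1[5-9]\d{2}|20\d{2})[a-z]?\."
)


def _normalize(text):
    """Replace punctuation with spaces, lowercase, and collapse whitespace."""
    return " ".join(re.sub(r"[^\w\s]", " ", text).casefold().split())


def looks_like_placeholder_title(title, resolved_title=None, max_title_chars=150):
    """Return True when a stored title looks like a raw bibliography string."""
    candidate = title.strip()
    if not _RAW_REFERENCE_PREFIX.match(candidate):
        return False
    # Signal 2: unusually long (the threshold is an arbitrary illustration).
    if len(candidate) > max_title_chars:
        return True
    # Signal 3: the resolved title appears inside the stored one, on word
    # boundaries, after punctuation normalization.
    if resolved_title:
        seed = f" {_normalize(candidate)} "
        resolved = f" {_normalize(resolved_title)} "
        return bool(resolved.strip()) and resolved != seed and resolved in seed
    return False
```

A merge step could then let the resolved title win without recording a conflict whenever this returns True for the stored title.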

Legacy note deduplication

Observed failure shape:

  • note fragments like ecospecies_reference_number = {160} can be appended more than once downstream when re-merging enriched metadata

Suggested upstream change:

  • when joining note fragments, split on ;, normalize whitespace, and dedupe per fragment rather than per whole note string
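
A minimal sketch of per-fragment deduplication follows. join_note_fragments is a hypothetical helper name, and the sketch assumes fragments never contain a literal ; inside braces.

```python
def join_note_fragments(*notes):
    """Join note strings, keeping the first occurrence of each fragment."""
    seen = set()
    ordered = []
    for note in notes:
        if not note:
            continue
        for fragment in note.split(";"):
            # Collapse internal whitespace so spacing differences do not
            # defeat the duplicate check.
            cleaned = " ".join(fragment.split())
            if cleaned and cleaned not in seen:
                seen.add(cleaned)
                ordered.append(cleaned)
    return "; ".join(ordered)
```

Because the comparison is per fragment, re-merging the same enriched metadata is idempotent: passing the joined result back in together with the same fragments returns the same string.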

Unresolved entries should still refresh local parses

Observed failure shape:

  • parser improvements may correctly rebuild title, venue, volume, number, and pages
  • but if no remote metadata source matches, the stored draft BibTeX can remain unchanged unless unresolved enrichment also writes the refreshed local seed back out

Suggested upstream change:

  • unresolved enrichment should still return the rebuilt local draft entry
  • keep citation_key, normalized text, and draft BibTeX synchronized with the current local parser even when resolver status remains unresolved
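
The intended control flow can be sketched as below. EnrichmentResult, enrich_reference, and the two callables are hypothetical stand-ins for whatever the local parser and resolver look like in practice; the point is only that the unresolved branch returns the freshly rebuilt seed rather than nothing.

```python
from dataclasses import dataclass


@dataclass
class EnrichmentResult:
    status: str   # "resolved" or "unresolved"
    entry: dict   # always built from the current local parse


def enrich_reference(raw_text, parse_local, resolve_remote):
    """Rebuild the local seed first, then layer remote metadata on top if any.

    parse_local(raw_text) -> dict and resolve_remote(seed) -> dict or None
    are assumed callables supplied by the caller.
    """
    seed = parse_local(raw_text)
    remote = resolve_remote(seed)
    if remote is None:
        # No match: still hand back the refreshed local parse so the caller
        # can rewrite the citation key, normalized text, and draft BibTeX.
        return EnrichmentResult("unresolved", seed)
    merged = {**seed, **{key: value for key, value in remote.items() if value}}
    return EnrichmentResult("resolved", merged)
```

With this shape, the storage layer can write result.entry back in both branches and use result.status only to decide whether to retry resolution later.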

Returned metadata not carried through

Observed concern:

  • resolver/source payloads may include bibliographic details such as:
    • volume
    • issue / BibTeX number
    • page / BibTeX pages
  • these should be preserved into the BibTeX entry whenever available

Current note:

  • CiteGeist Crossref mapping already includes volume, number, and pages
  • verify that all resolver paths, storage round-trips, and exports preserve those fields consistently
  • OpenAlex/DataCite mappings should also be checked for analogous bibliographic fields in biblio / attribute payloads
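
As a reference point for that check, the sketch below maps the bibliographic fields from two payload shapes into BibTeX names. The function names are hypothetical. The key names (volume, issue, page for a Crossref work; biblio.volume, biblio.issue, biblio.first_page, biblio.last_page for an OpenAlex work) reflect the public payloads as understood here and should be confirmed against the current API documentation.

```python
import re


def bibtex_fields_from_crossref(message):
    """Map Crossref volume/issue/page onto BibTeX volume/number/pages."""
    fields = {}
    for bibtex_key, source_key in (("volume", "volume"), ("number", "issue"), ("pages", "page")):
        value = message.get(source_key)
        if value:
            fields[bibtex_key] = str(value)
    if "pages" in fields:
        # Normalize "387-390" to the BibTeX range form "387--390".
        fields["pages"] = re.sub(r"\s*-+\s*", "--", fields["pages"])
    return fields


def bibtex_fields_from_openalex(work):
    """Map the OpenAlex biblio object onto BibTeX volume/number/pages."""
    biblio = work.get("biblio") or {}
    fields = {}
    if biblio.get("volume"):
        fields["volume"] = str(biblio["volume"])
    if biblio.get("issue"):
        fields["number"] = str(biblio["issue"])
    first, last = biblio.get("first_page"), biblio.get("last_page")
    if first and last and first != last:
        fields["pages"] = f"{first}--{last}"
    elif first:
        fields["pages"] = str(first)
    return fields
```

Running both mappers over the same work and comparing the outputs is a cheap way to spot a resolver path that drops one of the three fields.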

False-positive title-search acceptance

Observed failure shape:

  • title search can return a thematically related but bibliographically different work
  • downstream acceptance may keep some seed fields while adopting conflicting DOI/title/volume/pages from the returned match, producing an entry that mixes two different works
  • this is especially risky for historical references with sparse or abbreviated venue names

Suggested upstream change in citegeist.resolve and any title-search ranking path:

  • do not fall back to the first search hit when no strong title match exists
  • prefer exact or near-exact title matches only
  • reject a candidate when structured seed metadata conflicts on strong fields such as:
    • year
    • venue / journal
    • volume
    • number
    • pages
  • treat those fields as match-validation inputs, not just merge-time metadata
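
A validation gate along those lines might look like the sketch below. All names and the 0.9 similarity threshold are illustrative assumptions, and difflib is used only as a convenient stdlib similarity measure. Venue is deliberately left out of the strict comparison because abbreviated and full venue names would need abbreviation-aware matching first.

```python
import re
from difflib import SequenceMatcher

# Fields compared strictly when both sides provide a value.
STRONG_FIELDS = ("year", "volume", "number", "pages")


def _norm(value):
    """Punctuation-insensitive, case-insensitive form for comparison."""
    return " ".join(re.sub(r"[^\w\s]", " ", str(value)).casefold().split())


def title_similarity(a, b):
    return SequenceMatcher(None, _norm(a), _norm(b)).ratio()


def accept_candidate(seed, candidate, min_title_ratio=0.9):
    """Accept only near-exact titles with no conflicting strong field."""
    if title_similarity(seed.get("title", ""), candidate.get("title", "")) < min_title_ratio:
        return False
    for field in STRONG_FIELDS:
        seed_value, candidate_value = seed.get(field), candidate.get(field)
        if seed_value and candidate_value and _norm(seed_value) != _norm(candidate_value):
            return False
    return True


def pick_match(seed, candidates):
    """Return the best accepted candidate, or None; never the first hit by default."""
    accepted = [c for c in candidates if accept_candidate(seed, c)]
    return max(
        accepted,
        key=lambda c: title_similarity(seed.get("title", ""), c.get("title", "")),
        default=None,
    )
```

Returning None when nothing passes keeps the entry unresolved, which is preferable to adopting a DOI from a merely related work.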

OpenAlex null-source handling

Observed failure shape:

  • some OpenAlex works have primary_location present but source: null
  • downstream mapping can crash if it assumes source is always a dictionary

Suggested upstream change:

  • treat null source payloads as empty dictionaries
  • continue mapping title, year, DOI, and biblio fields even when venue/source is missing
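
A defensive mapping could be as small as the sketch below. map_openalex_work and the output key names are hypothetical; the input keys (title, publication_year, doi, primary_location.source.display_name, biblio) follow the OpenAlex work object as understood here.

```python
def map_openalex_work(work):
    """Map an OpenAlex work to a flat entry, tolerating null nested objects."""
    # "or {}" covers both a missing key and an explicit JSON null.
    location = work.get("primary_location") or {}
    source = location.get("source") or {}
    biblio = work.get("biblio") or {}

    doi = work.get("doi") or ""
    prefix = "https://doi.org/"
    if doi.startswith(prefix):
        doi = doi[len(prefix):]

    return {
        "title": work.get("title") or "",
        "year": work.get("publication_year"),
        "doi": doi,
        "journal": source.get("display_name") or "",
        "volume": biblio.get("volume") or "",
        "number": biblio.get("issue") or "",
    }
```

The `x.get(key) or {}` idiom matters here because `x.get(key, {})` only supplies the default for a missing key; it still returns None when the key is present with a null value.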