CiteGeist Review Notes
These notes capture parser issues seen while integrating CiteGeist-style extraction into EcoSpecies.
Report-style references
Observed failure shape:
- references like `Daniell, W.C. 1872. Letters referring ... Comm. Rept. U.S. Comm. Fish & Fish. 2: 387-390.`
- extracted `title` may contain the full raw bibliography string
- abbreviated venue names such as `Comm. Rept.` are not separated cleanly from the title
Suggested upstream change in citegeist.extract:
- add a report-style parser path after year detection
- prefer sentence-boundary venue detection before naive keyword splits so words like `report` inside a real title do not trigger an early cut
- support abbreviation-heavy venue starters such as:
  - `comm. rept.`
  - `rept.`
  - `proc.`
  - `occas. pap.`
  - `bulletin`
  - `bull.`
  - `memoir`
- strip trailing volume/page blobs like `2: 387-390` from the venue field
- when a first parse leaves a partial venue stub such as `Occas`, reparse the full raw reference line and prefer the fuller repaired venue/title split
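A minimal sketch of the report-style path described above, assuming the author and year have already been stripped off. The names `VENUE_STARTERS`, `VENUE_START`, `VOLUME_PAGES`, and `split_report_reference` are hypothetical and do not come from `citegeist.extract`; the stub-reparse step is not covered.

```python
import re

# Venue starters copied from the list above; the tuple name is an assumption.
VENUE_STARTERS = ("comm. rept.", "occas. pap.", "rept.", "proc.",
                  "bulletin", "bull.", "memoir")

# A starter only counts when it follows a sentence boundary (". "), so a
# keyword inside the title itself does not trigger an early cut.
VENUE_START = re.compile(
    r"(?<=\.\s)(?:" + "|".join(re.escape(s) for s in VENUE_STARTERS) + r")",
    re.IGNORECASE,
)

# Trailing "volume: pages" blob such as "2: 387-390".
VOLUME_PAGES = re.compile(r"\s*(\d+)\s*:\s*(\d+)\s*-\s*(\d+)\.?\s*$")


def split_report_reference(rest):
    """Split the text that follows 'Author. YEAR.' into separate fields."""
    volume = pages = None
    blob = VOLUME_PAGES.search(rest)
    if blob:
        # Peel the volume/page blob off first so it never lands in the venue.
        volume = blob.group(1)
        pages = f"{blob.group(2)}-{blob.group(3)}"
        rest = rest[:blob.start()]
    start = VENUE_START.search(rest)
    if start is None:
        return {"title": rest.strip(), "venue": None,
                "volume": volume, "pages": pages}
    return {
        "title": rest[:start.start()].strip().rstrip("."),
        "venue": rest[start.start():].strip(),
        "volume": volume,
        "pages": pages,
    }
```

With a made-up input such as `"A memoir on river fishes. Bull. Example Soc. 2: 387-390."`, the word `memoir` inside the title is ignored because it does not follow a sentence boundary, and the split happens at `Bull.` instead.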
Placeholder title merge behavior
Observed failure shape:
- a raw bibliography string may survive as `title` even after DOI/title resolution finds a better title
Suggested upstream change in citegeist.resolve.merge_entries_with_conflicts:
- treat titles that look like raw bibliography strings as placeholders
- example heuristic:
  - starts with `Surname, ... YEAR.`
  - unusually long for a title
  - contains a resolved shorter title as a substring after punctuation normalization
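One way the heuristic could be expressed, as a sketch only. The function name, the regular expression, the 250-character threshold, and the rule for combining the three signals are all assumptions, not existing behavior of `merge_entries_with_conflicts`.

```python
import re

# Assumed shape of "Surname, ... YEAR." at the start of a raw reference.
RAW_REFERENCE_START = re.compile(
    r"^[^\W\d_][\w'\-]*,\s.*?\b(?:1[5-9]\d\d|20\d\d)\."
)


def normalize(text):
    """Lowercase and replace punctuation/whitespace runs with single spaces."""
    return re.sub(r"[\W_]+", " ", text.lower()).strip()


def looks_like_placeholder(title, resolved_title=None, max_title_len=250):
    """Return True when `title` looks like a raw bibliography string."""
    starts_like_reference = bool(RAW_REFERENCE_START.match(title))
    too_long = len(title) > max_title_len
    contains_resolved = False
    if resolved_title and len(resolved_title) < len(title):
        # Pad with spaces so only whole-word runs count as a match.
        contains_resolved = (
            f" {normalize(resolved_title)} " in f" {normalize(title)} "
        )
    # Assumption: require the reference-like start plus one supporting signal.
    return starts_like_reference and (too_long or contains_resolved)
```

A merge step could then let the resolved title win whenever `looks_like_placeholder(seed_title, resolved_title)` is true, instead of reporting a title conflict.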
Legacy note deduplication
Observed failure shape:
- note fragments like `ecospecies_reference_number = {160}` can be appended more than once downstream when re-merging enriched metadata
Suggested upstream change:
- when joining note fragments, split on `;`, normalize whitespace, and dedupe per fragment rather than per whole note string
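A small sketch of per-fragment deduplication. `merge_note_fragments` is a hypothetical helper name, and rejoining with `"; "` is an assumption about the desired output format.

```python
def merge_note_fragments(*notes):
    """Join note strings, keeping each ';'-separated fragment only once."""
    seen = set()
    merged = []
    for note in notes:
        if not note:
            continue
        for fragment in note.split(";"):
            # Collapse internal whitespace so spacing variants compare equal.
            cleaned = " ".join(fragment.split())
            if cleaned and cleaned not in seen:
                seen.add(cleaned)
                merged.append(cleaned)
    return "; ".join(merged)
```

Because the comparison happens per fragment, re-merging a note that already contains `ecospecies_reference_number = {160}` leaves a single copy, even when the surrounding note text differs. First-seen order is preserved.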
Unresolved entries should still refresh local parses
Observed failure shape:
- parser improvements may correctly rebuild `title`, `venue`, `volume`, `number`, and `pages`
- but if no remote metadata source matches, the stored draft BibTeX can remain unchanged unless unresolved enrichment also writes the refreshed local seed back out
Suggested upstream change:
- unresolved enrichment should still return the rebuilt local draft entry
- keep `citation_key`, normalized text, and draft BibTeX synchronized with the current local parser even when resolver status remains `unresolved`
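The intended control flow might look like the sketch below. `enrich_entry`, `parse_local`, and `resolve_remote` are hypothetical stand-ins for the real enrichment entry point, the local parser, and the remote resolvers; the result shape is also an assumption.

```python
def enrich_entry(raw_text, parse_local, resolve_remote):
    """Return the freshest entry, even when no remote source matches."""
    draft = parse_local(raw_text)    # always rebuild from the current parser
    remote = resolve_remote(draft)   # assumed to be None when nothing matches
    if remote is None:
        # Hand back the rebuilt draft so stored BibTeX does not go stale.
        return {"status": "unresolved", "entry": draft}
    merged = dict(draft)
    # Only non-empty remote values override the local parse.
    merged.update({k: v for k, v in remote.items() if v})
    return {"status": "resolved", "entry": merged}
```

The caller can then persist `entry` in both branches, which keeps the stored draft aligned with the current parser regardless of resolver status.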
Returned metadata not carried through
Observed concern:
- resolver/source payloads may include bibliographic details such as:
  - `volume`
  - `issue` / BibTeX `number`
  - `page` / BibTeX `pages`
- these should be preserved into the BibTeX entry whenever available
Current note:
- CiteGeist Crossref mapping already includes `volume`, `number`, and `pages`
- verify that all resolver paths, storage round-trips, and exports preserve those fields consistently
- OpenAlex/DataCite mappings should also be checked for analogous bibliographic fields in `biblio` / attribute payloads
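For reference, the field renaming listed above can be sketched as follows. `bibtex_fields_from_payload` is a hypothetical name, the payload is assumed to be a flat dictionary using the `volume` / `issue` / `page` keys, and normalizing the page range to `--` is an assumption about the preferred BibTeX style.

```python
import re


def bibtex_fields_from_payload(payload):
    """Copy volume/issue/page from a resolver payload into BibTeX fields."""
    fields = {}
    if payload.get("volume"):
        fields["volume"] = str(payload["volume"])
    if payload.get("issue"):
        fields["number"] = str(payload["issue"])  # issue -> BibTeX number
    if payload.get("page"):
        # page -> BibTeX pages, with the range separator normalized to "--".
        fields["pages"] = re.sub(r"\s*[-\u2013]+\s*", "--", str(payload["page"]))
    return fields
```

A round-trip check could call this on each resolver's payload and assert that the same three keys are still present after storage and export.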
False-positive title-search acceptance
Observed failure shape:
- title search can return a thematically related but bibliographically different work
- downstream acceptance may keep some seed fields while adopting conflicting DOI/title/volume/pages from the returned match
- this is especially risky for historical references with sparse or abbreviated venue names
Suggested upstream change in citegeist.resolve and any title-search ranking path:
- do not fall back to the first search hit when no strong title match exists
- prefer exact or near-exact title matches only
- reject a candidate when structured seed metadata conflicts on strong fields such as:
  - `year`
  - venue / journal
  - `volume`
  - `number`
  - `pages`
- treat those fields as match-validation inputs, not just merge-time metadata
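A sketch of match validation along these lines, using `difflib.SequenceMatcher` from the standard library for title similarity. The function names, the `0.9` threshold, and the flat dictionary shape are assumptions. Venue is left out of the strict comparison here because abbreviated venue names would need their own fuzzy matching.

```python
import re
from difflib import SequenceMatcher

# Fields compared strictly; venue is omitted in this sketch (see note above).
STRONG_FIELDS = ("year", "volume", "number", "pages")


def _norm(value):
    """Lowercase and collapse punctuation so '1-10' equals '1--10'."""
    return re.sub(r"[\W_]+", " ", str(value).lower()).strip()


def conflicting_fields(seed, candidate):
    """List strong fields where both sides have a value and they differ."""
    return [
        field for field in STRONG_FIELDS
        if seed.get(field) and candidate.get(field)
        and _norm(seed[field]) != _norm(candidate[field])
    ]


def accept_candidate(seed, candidate, min_title_ratio=0.9):
    """Accept only near-exact titles with no strong-field conflict."""
    seed_title = _norm(seed.get("title") or "")
    cand_title = _norm(candidate.get("title") or "")
    if not seed_title or not cand_title:
        return False
    ratio = SequenceMatcher(None, seed_title, cand_title).ratio()
    if ratio < min_title_ratio:
        return False
    return not conflicting_fields(seed, candidate)


def pick_match(seed, candidates):
    """Return the first acceptable candidate, or None (no first-hit fallback)."""
    for candidate in candidates:
        if accept_candidate(seed, candidate):
            return candidate
    return None
```

Returning `None` instead of the first search hit leaves the entry unresolved, which is the safer outcome for sparse historical references.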
OpenAlex null-source handling
Observed failure shape:
- some OpenAlex works have `primary_location` present but `source: null`
- downstream mapping can crash if it assumes `source` is always a dictionary
Suggested upstream change:
- treat null `source` payloads as empty dictionaries
- continue mapping title, year, DOI, and `biblio` fields even when venue/source is missing
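A defensive mapping could look like the sketch below. `map_openalex_work` is a hypothetical name and the output keys are assumptions; the input keys follow the OpenAlex work shape as understood here (`publication_year`, `primary_location.source.display_name`, `biblio.first_page`, and so on) and should be checked against the actual payloads.

```python
def map_openalex_work(work):
    """Map an OpenAlex work dict to flat fields, tolerating null objects."""
    # `or {}` covers both a missing key and an explicit null value.
    location = work.get("primary_location") or {}
    source = location.get("source") or {}
    biblio = work.get("biblio") or {}

    first, last = biblio.get("first_page"), biblio.get("last_page")
    if first and last and first != last:
        pages = f"{first}--{last}"
    else:
        pages = first or last

    return {
        "title": work.get("title"),
        "year": work.get("publication_year"),
        "doi": work.get("doi"),
        "venue": source.get("display_name"),  # None when source is null
        "volume": biblio.get("volume"),
        "number": biblio.get("issue"),
        "pages": pages,
    }
```

Using `work.get(...) or {}` rather than `work.get(..., {})` matters here, because the default argument is only used when the key is absent, not when its value is `null`.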