EcoSpecies Standards Migration Plan
Problem
The current EcoSpecies ingest and document model still treats legacy local fields such as FLELMR code / species_code as if they were primary identifiers. That is useful for historical provenance, but it is the wrong long-term center of gravity for a broader, modern biodiversity knowledge system.
The same problem exists for citations:
- legacy plaintext reference blocks are treated as local document text,
- citation identity is weak or missing,
- bibliography growth is tied to what happened to appear in the historical SLH file.
The new system should preserve legacy local identifiers and references, but it should not be structurally bound to them.
Direction
Treat legacy local codes and freeform references as import-era artifacts, not canonical future-facing identifiers.
Going forward, EcoSpecies should prefer broadly recognized identifiers and registries:
- taxonomic name authority and taxon identifiers:
  - Catalogue of Life IDs and release DOIs
  - GBIF taxon keys
  - WoRMS AphiaIDs for marine taxa
  - ITIS TSNs where relevant
  - optional NCBI Taxonomy IDs for research interoperability
- literature and dataset identifiers:
  - DOI as the primary publication/dataset identifier
  - ISBN/ISSN where DOI is absent
  - OpenAlex IDs and DataCite metadata as enrichment layers
- contributor identity:
  - email-based local contributor accounts now
  - optional ORCID linkage later for editor and contributor identity
The system should be marine-forward because that matches the historical corpus, but not marine-exclusive. Identifier strategy should therefore be authority-aware rather than tied to a single domain-specific registry.
Authority Selection Strategy
Choose the primary taxon authority by best-fit coverage, not by a single global rule.
- marine taxa:
  - prefer WoRMS AphiaID as primary when confidently matched
  - retain GBIF and Catalogue of Life as crosswalks
- non-marine or mixed-domain taxa:
  - prefer Catalogue of Life or GBIF as primary, depending on match quality and coverage
  - retain ITIS and other relevant identifiers as crosswalks
- unresolved or conflicting cases:
  - store all candidate identifiers
  - require editorial review before a primary identifier is asserted
This keeps the project ready for terrestrial expansion without discarding the value of WoRMS for the present corpus.
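The selection rules above can be sketched as a small function. This is only an illustration under stated assumptions: the `Candidate` type, the `domain` label, the 0.9 confidence threshold, and the fallback from WoRMS to Catalogue of Life or GBIF are hypothetical and not part of the current codebase. The function returns a proposal for editorial review, not an asserted primary identifier.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Candidate:
    authority: str      # "worms", "col", "gbif", "itis", ...
    identifier: str
    confidence: float   # assumed 0.0-1.0 score produced by a matcher


def propose_primary(candidates: List[Candidate], domain: str,
                    min_confidence: float = 0.9) -> Optional[Candidate]:
    """Propose a primary identifier, or None when editors must decide."""
    confident = [c for c in candidates if c.confidence >= min_confidence]
    if domain == "marine":
        worms = [c for c in confident if c.authority == "worms"]
        if len(worms) == 1:
            return worms[0]
        if len(worms) > 1:
            return None  # conflicting WoRMS matches: leave to review
    # Non-marine taxa, or marine taxa without a confident WoRMS match:
    # take the better of Catalogue of Life and GBIF by match confidence.
    general = [c for c in confident if c.authority in ("col", "gbif")]
    return max(general, key=lambda c: c.confidence, default=None)
```

All candidates stay stored regardless of the result, so the non-selected identifiers remain available as crosswalks.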
Important Taxonomic Note
PhyloCode governs clade naming; it is not a general-purpose replacement for species-level registry IDs and should not become the primary EcoSpecies species identifier layer. It may be useful later for clade-aware ontology and higher-level phylogenetic naming, but not as the main substitute for local species_code values.
Core Design Rules
- Legacy local identifiers remain preserved exactly as imported.
- Canonical taxon identity becomes multi-authority, not single-local-code.
- Citations become first-class structured entities, not just text inside a section.
- Bibliographies can be extended by topic and citation graph, not only by source-document inheritance.
- Exports keep provenance visible so readers can distinguish legacy source metadata from normalized external identifiers.
Schema Changes
Species metadata
Retain flelmr_code for provenance, but demote it to a legacy metadata field.
Add a taxon-identity layer:
- taxon_name_usage
- taxon_identifier
- taxon_authority
- taxon_match_review
Suggested fields:
- taxon_identifier.authority
- taxon_identifier.identifier
- taxon_identifier.rank
- taxon_identifier.label
- taxon_identifier.is_primary
- taxon_identifier.source_url
- taxon_identifier.asserted_by
- taxon_identifier.match_confidence
- taxon_identifier.review_status
Examples:
- authority = "worms", identifier = "159059", label = "AphiaID"
- authority = "gbif", identifier = "2290910", label = "taxonKey"
- authority = "col", identifier = "5T7L7", label = "taxonID"
- authority = "itis", identifier = "161989", label = "TSN"
- authority = "legacy-ecospecies", identifier = "5192", label = "FLELMR"
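One way to carry these fields in application code is a plain record type. The sketch below is an assumption about shape only: the class name, the defaults, the `"unreviewed"` status value, and the `primary_of` helper are hypothetical. It enforces the rule that a species has at most one primary identifier.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class TaxonIdentifier:
    authority: str
    identifier: str
    label: str
    rank: Optional[str] = None
    is_primary: bool = False
    source_url: Optional[str] = None
    asserted_by: Optional[str] = None
    match_confidence: Optional[float] = None
    review_status: str = "unreviewed"  # assumed vocabulary


def primary_of(identifiers: List[TaxonIdentifier]) -> Optional[TaxonIdentifier]:
    """Return the single primary identifier, or None if none is asserted."""
    primaries = [i for i in identifiers if i.is_primary]
    if len(primaries) > 1:
        raise ValueError("more than one primary taxon identifier")
    return primaries[0] if primaries else None


# Values taken from the examples above.
EXAMPLES = [
    TaxonIdentifier("worms", "159059", "AphiaID", is_primary=True),
    TaxonIdentifier("gbif", "2290910", "taxonKey"),
    TaxonIdentifier("legacy-ecospecies", "5192", "FLELMR"),
]
```

Identifiers are kept as strings even when they look numeric, so that values such as the Catalogue of Life ID fit the same column.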
Citation model
Move from section text to structured bibliography entities:
- citation
- citation_identifier
- citation_relation
- species_citation
- document_node_citation
- bibliography_topic
Suggested citation identifier types:
- DOI
- ISBN
- ISSN
- PMID
- arXiv
- OpenAlex
- URL
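A rough sketch of how a raw identifier string might be sorted into one of these types. The `scheme:value` prefix convention, the function name, and the `"unknown"` fallback are assumptions for illustration. The DOI pattern is a common heuristic rather than a full validator, and the DOI is lowercased because DOIs are case-insensitive.

```python
import re

# Heuristic shape of a bare DOI: "10.<registrant>/<suffix>".
DOI_RE = re.compile(r"^10\.\d{4,9}/\S+$")

# Hypothetical prefix-to-type mapping for "scheme:value" strings.
PREFIXES = {
    "doi": "DOI", "isbn": "ISBN", "issn": "ISSN",
    "pmid": "PMID", "arxiv": "arXiv", "openalex": "OpenAlex",
}


def classify_citation_identifier(raw: str):
    """Return (identifier_type, value) for a raw identifier string."""
    value = raw.strip()
    # Treat doi.org resolver URLs as DOIs rather than generic URLs.
    for url_prefix in ("https://doi.org/", "http://doi.org/"):
        if value.lower().startswith(url_prefix):
            value = "doi:" + value[len(url_prefix):]
    scheme, sep, rest = value.partition(":")
    if sep and scheme.lower() in PREFIXES:
        kind = PREFIXES[scheme.lower()]
        rest = rest.strip()
        return (kind, rest.lower() if kind == "DOI" else rest)
    if DOI_RE.match(value):
        return ("DOI", value.lower())
    if value.lower().startswith(("http://", "https://")):
        return ("URL", value)
    return ("unknown", value)
```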
Markdown / AST Changes
Update the constrained Markdown profile so metadata stops implying that species_code is canonical.
Replace the current front matter recommendation:
species_code: 5192
with a provenance-oriented shape:
legacy_identifiers:
  - authority: legacy-ecospecies
    identifier: 5192
    label: FLELMR
taxon_identifiers:
  - authority: worms
    identifier: 159059
    label: AphiaID
    primary: true
  - authority: gbif
    identifier: 2290910
    label: taxonKey
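Existing documents could be moved to this shape mechanically. The function below is a minimal sketch that assumes the front matter has already been parsed into a dict; the function name is hypothetical. It keeps the legacy value exactly as the parser delivered it and leaves taxon_identifiers empty for later matching.

```python
def migrate_front_matter(front_matter: dict) -> dict:
    """Move species_code into legacy_identifiers without losing other keys."""
    migrated = {k: v for k, v in front_matter.items() if k != "species_code"}
    code = front_matter.get("species_code")
    if code is not None:
        legacy = list(migrated.get("legacy_identifiers", []))
        entry = {
            "authority": "legacy-ecospecies",
            "identifier": code,  # preserved as imported, no normalization
            "label": "FLELMR",
        }
        if entry not in legacy:  # keep the migration idempotent
            legacy.append(entry)
        migrated["legacy_identifiers"] = legacy
    # External identifiers are filled in later by matching and review.
    migrated.setdefault("taxon_identifiers", [])
    return migrated
```

Running the function on an already migrated document returns an equal document, so it can be applied repeatedly during a staged rollout.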
Also add explicit bibliography sections:
## References
- id: doi:10.1000/example
  text: Smith, J. 2024. Example paper...
  relation: cites
## Suggested Reading
- topic: estuarine ecology
The AST should preserve:
- legacy identifiers
- normalized taxon identifiers
- structured references
- topic links used for bibliography expansion
Import Pipeline Changes
Species identity
Import should produce:
- raw imported name fields,
- legacy local identifiers,
- unresolved candidate taxon identifiers,
- optional matched external identifiers,
- a review state for unresolved or conflicting authority matches.
Do not block ingest if no external authority match exists. Store the unresolved state explicitly.
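A sketch of what a non-blocking import record might look like. The state names (`unresolved`, `conflicting`, `matched`), the dict keys, and the 0.9 threshold are assumptions; the point is only that a species with no external match still produces a complete record with an explicit state.

```python
def review_state(candidates, min_confidence=0.9):
    """Summarize candidate matches as a single review state."""
    confident = [c for c in candidates if c["confidence"] >= min_confidence]
    if not confident:
        return "unresolved"
    ids_by_authority = {}
    for c in confident:
        ids_by_authority.setdefault(c["authority"], set()).add(c["identifier"])
    # Two different confident IDs from the same authority is a conflict.
    if any(len(ids) > 1 for ids in ids_by_authority.values()):
        return "conflicting"
    return "matched"


def build_import_record(raw_name, flelmr_code, candidates):
    """Build the import output; never fails for lack of an external match."""
    return {
        "raw_name": raw_name,
        "legacy_identifiers": [{
            "authority": "legacy-ecospecies",
            "identifier": flelmr_code,
            "label": "FLELMR",
        }],
        "candidate_taxon_identifiers": list(candidates),
        "review_state": review_state(candidates),
    }
```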
Primary identifier assignment should be determined by:
- domain fit of the authority
- confidence of the match
- editorial review status
- future ability to crosswalk to other authorities
Citations
Split citation processing into stages:
- detect bibliography/reference sections in the imported SLH text,
- extract plaintext reference strings,
- convert plaintext references into draft structured entries,
- enrich identifiers and metadata,
- assign accepted citations back to species and document nodes,
- optionally expand bibliography by topic and citation graph.
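The first three stages can be sketched as follows. The layout of SLH reference sections is not specified here, so the heading names and the blank-line separation between references are assumptions, as are the function names and the draft entry fields. The later stages (enrichment, assignment, expansion) are out of scope for this sketch.

```python
import re

# Assumed heading spellings for a reference section.
HEADING_RE = re.compile(
    r"^\s*(references|literature cited|bibliography)\s*:?\s*$", re.IGNORECASE
)
YEAR_RE = re.compile(r"\b(1[89]\d{2}|20\d{2})\b")


def extract_reference_strings(text: str):
    """Stages 1-2: find the reference section and split it into strings."""
    lines = text.splitlines()
    start = next((i for i, line in enumerate(lines) if HEADING_RE.match(line)), None)
    if start is None:
        return []
    refs, current = [], []
    for line in lines[start + 1:]:
        if line.strip():
            current.append(line.strip())   # wrapped lines join one reference
        elif current:
            refs.append(" ".join(current))  # blank line ends a reference
            current = []
    if current:
        refs.append(" ".join(current))
    return refs


def to_draft_entry(ref: str) -> dict:
    """Stage 3: wrap a plaintext reference as a draft structured entry."""
    year = YEAR_RE.search(ref)
    return {
        "text": ref,  # original string kept for provenance
        "year": int(year.group(1)) if year else None,
        "status": "draft",
    }
```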
CiteGeist Integration
../CiteGeist is a strong fit for this migration.
Observed capabilities in that repo already cover much of what EcoSpecies needs:
- extracting references from plaintext,
- converting rough references into draft structured entries,
- DOI/Crossref/DataCite/OpenAlex enrichment,
- citation graph expansion,
- topic-based bibliography expansion,
- duplicate clustering and canonicalization.
Recommended integration boundary
Do not embed CiteGeist logic directly into the EcoSpecies parser.
Instead:
- EcoSpecies exports candidate plaintext references and topic phrases.
- CiteGeist processes and enriches them into structured bibliography data.
- EcoSpecies imports reviewed citation outputs into its own citation tables.
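The export side of this boundary might look like the sketch below. It does not describe any format that CiteGeist currently accepts: the `format` tag, the field names, and the `local_id` scheme are all hypothetical placeholders for the interchange format that still has to be defined.

```python
import json


def build_citegeist_export(species_slug, reference_strings, topic_phrases):
    """Serialize candidate references and topics for external processing."""
    payload = {
        "format": "ecospecies-citegeist-draft",  # hypothetical format tag
        "version": 1,
        "species": species_slug,
        "references": [
            # local_id lets reviewed output be matched back to its source
            {"local_id": f"{species_slug}:ref:{i}", "text": text}
            for i, text in enumerate(reference_strings, start=1)
        ],
        "topics": sorted(set(topic_phrases)),
    }
    return json.dumps(payload, indent=2, sort_keys=True)
```

Keeping the original plaintext in every entry means the reviewed output can always be compared against what was imported.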
First integration targets
- species-level bibliography cleanup from References sections
- DOI resolution and identifier assignment
- duplicate detection across species bibliographies
- topic expansion for subject areas such as habitat, trophic ecology, reproduction, invasive biology, and fisheries context
Later integration targets
- node-level citation attachment
- bibliography review UI
- suggested-reading generation per species
- topic-seeded bibliography augmentation for under-cited species drafts
API Changes
Add standards-aware endpoints:
- /api/species/<slug>/identifiers
- /api/species/<slug>/citations
- /api/species/<slug>/bibliography/topics
- /api/editor/species/<slug>/identifier-review
- /api/editor/species/<slug>/citation-review
Do not remove legacy fields immediately. Keep flelmr_code in payloads for compatibility while introducing:
- legacy_identifiers
- taxon_identifiers
- primary_taxon_identifier
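A compatibility payload could carry both shapes side by side. The function name and the exact dict layout below are assumptions; only the field names come from this plan.

```python
def species_identifier_payload(flelmr_code, taxon_identifiers):
    """Build a response body that keeps the legacy field during migration."""
    primary = next((t for t in taxon_identifiers if t.get("primary")), None)
    return {
        # Legacy field retained so existing clients keep working.
        "flelmr_code": flelmr_code,
        "legacy_identifiers": [{
            "authority": "legacy-ecospecies",
            "identifier": flelmr_code,
            "label": "FLELMR",
        }],
        "taxon_identifiers": taxon_identifiers,
        # None until an editor has asserted a primary identifier.
        "primary_taxon_identifier": primary,
    }
```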
UI Changes
The species detail page should distinguish:
- scientific name
- primary external taxon identifier
- legacy local identifiers
- bibliography
- suggested reading
Editors should see:
- unresolved authority matches
- conflicting taxon IDs
- citation enrichment candidates
- duplicate-reference clusters
Contributors should only author content and draft references; identifier normalization and bibliography publication remain editorial functions.
Migration Phases
Phase A: Demote legacy code
- Rename internal presentation from “species code” to “legacy identifier”.
- Keep flelmr_code only as legacy provenance.
- Add legacy_identifiers to Markdown export and AST.
Phase B: Add external taxon identifiers
- Create taxon-identifier tables and API payloads.
- Add editor review workflows for selecting a primary authority identifier.
- Default marine taxa review toward WoRMS where available.
- Default broader cross-domain review toward Catalogue of Life and GBIF where WoRMS is not the right authority.
- Keep the model open to terrestrial species from the beginning rather than treating them as out-of-scope exceptions.
Phase C: Structured bibliography
- Create citation tables.
- Extract plaintext references from imported documents.
- Store draft citations separately from accepted citations.
Phase D: CiteGeist bridge
- Define import/export format between EcoSpecies and CiteGeist.
- Run draft-reference normalization and DOI enrichment.
- Import reviewed structured citations back into EcoSpecies.
Phase E: Topic-aware bibliography growth
- Store species topic phrases.
- Use CiteGeist topic expansion for bibliography augmentation.
- Keep added citations flagged by source type:
  - imported
  - resolved
  - topic-expanded
  - editor-added
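The source-type flag could be modelled as a closed enumeration so that unknown values are rejected at import time. The class name, the `source_type` key, and the counting helper are hypothetical; the four values are the ones listed above.

```python
from collections import Counter
from enum import Enum


class CitationSource(Enum):
    IMPORTED = "imported"
    RESOLVED = "resolved"
    TOPIC_EXPANDED = "topic-expanded"
    EDITOR_ADDED = "editor-added"


def count_by_source(citations):
    """Count citations per source type; raises ValueError on unknown flags."""
    return Counter(CitationSource(c["source_type"]) for c in citations)
```

A per-species count like this would let editors see how much of a bibliography comes from the historical import versus later expansion.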
Immediate Next Steps
- Update the Markdown profile to replace species_code with legacy_identifiers plus taxon_identifiers.
- Add legacy_identifiers and taxon_identifiers to the AST/document model.
- Introduce taxon identifier tables in the PostgreSQL schema.
- Define a minimal EcoSpecies-to-CiteGeist interchange format for plaintext references and topic phrases.
- Add editor-facing citation review before attempting automatic bibliography publication.