EcoSpecies Standards Migration Plan

Problem

The current EcoSpecies ingest and document model still treats legacy local fields such as FLELMR code / species_code as if they were primary identifiers. That is useful for historical provenance, but it is the wrong long-term center of gravity for a broader, modern biodiversity knowledge system.

The same problem exists for citations:

  • legacy plaintext reference blocks are treated as local document text,
  • citation identity is weak or missing,
  • bibliography growth is tied to what happened to appear in the historical SLH file.

The new system should preserve legacy local identifiers and references, but it should not be structurally bound to them.

Direction

Treat legacy local codes and freeform references as import-era artifacts, not canonical future-facing identifiers.

Going forward, EcoSpecies should prefer broadly recognized identifiers and registries:

  • taxonomic name authority and taxon identifiers:
    • Catalogue of Life IDs and release DOIs
    • GBIF taxon keys
    • WoRMS AphiaIDs for marine taxa
    • ITIS TSNs where relevant
    • optional NCBI Taxonomy IDs for research interoperability
  • literature and dataset identifiers:
    • DOI as the primary publication/dataset identifier
    • ISBN/ISSN where DOI is absent
    • OpenAlex IDs and DataCite metadata as enrichment layers
  • contributor identity:
    • email-based local contributor accounts now
    • optional ORCID linkage later for editor and contributor identity

The system should be marine-forward because that matches the historical corpus, but not marine-exclusive. Identifier strategy should therefore be authority-aware rather than tied to a single domain-specific registry.

Authority Selection Strategy

Choose the primary taxon authority by best-fit coverage, not by a single global rule.

  • marine taxa:
    • prefer WoRMS AphiaID as primary when confidently matched
    • retain GBIF and Catalogue of Life as crosswalks
  • non-marine or mixed-domain taxa:
    • prefer Catalogue of Life or GBIF as primary, depending on match quality and coverage
    • retain ITIS and other relevant identifiers as crosswalks
  • unresolved or conflicting cases:
    • store all candidate identifiers
    • require editorial review before a primary identifier is asserted

This keeps the project ready for terrestrial expansion without discarding the value of WoRMS for the present corpus.
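The selection rules above can be sketched as a small function. This is an illustration only: the `Candidate` type, the `is_marine` flag, the 0.9 confidence threshold, and the definition of a conflict (one authority offering more than one confident identifier) are assumptions, not part of the plan.

```python
from dataclasses import dataclass
from typing import Optional

# Assumed threshold: the plan does not define "confidently matched".
CONFIDENT_MATCH = 0.9


@dataclass(frozen=True)
class Candidate:
    authority: str     # e.g. "worms", "gbif", "col", "itis"
    identifier: str
    confidence: float  # assumed 0.0 to 1.0 scale


def propose_primary(candidates: list[Candidate], is_marine: bool) -> Optional[Candidate]:
    """Propose a primary identifier, or return None to signal editorial review."""
    confident = [c for c in candidates if c.confidence >= CONFIDENT_MATCH]
    # Marine taxa prefer WoRMS; everything else prefers Catalogue of Life or GBIF.
    preferred = ("worms",) if is_marine else ("col", "gbif")
    pool = [c for c in confident if c.authority in preferred]
    authorities = [c.authority for c in pool]
    # No confident match, or one authority offers several: leave it to an editor.
    if not pool or len(authorities) != len(set(authorities)):
        return None
    return max(pool, key=lambda c: c.confidence)
```

Returning `None` rather than guessing keeps the "require editorial review" rule explicit. All candidates remain stored as crosswalks whatever the function returns.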

Important Taxonomic Note

PhyloCode governs clade naming; it is not a general-purpose replacement for species-level registry IDs, so it should not become the primary EcoSpecies species identifier layer. It may be useful later for clade-aware ontology and higher-level phylogenetic naming, but not as the main substitute for local species_code values.

Core Design Rules

  1. Legacy local identifiers remain preserved exactly as imported.
  2. Canonical taxon identity becomes multi-authority, not single-local-code.
  3. Citations become first-class structured entities, not just text inside a section.
  4. Bibliographies can be extended by topic and citation graph, not only by source-document inheritance.
  5. Exports keep provenance visible so readers can distinguish legacy source metadata from normalized external identifiers.

Schema Changes

Species metadata

Retain flelmr_code for provenance, but demote it to a legacy metadata field.

Add a taxon-identity layer:

  • taxon_name_usage
  • taxon_identifier
  • taxon_authority
  • taxon_match_review

Suggested fields:

  • taxon_identifier.authority
  • taxon_identifier.identifier
  • taxon_identifier.rank
  • taxon_identifier.label
  • taxon_identifier.is_primary
  • taxon_identifier.source_url
  • taxon_identifier.asserted_by
  • taxon_identifier.match_confidence
  • taxon_identifier.review_status

Examples:

  • authority = "worms", identifier = "159059", label = "AphiaID"
  • authority = "gbif", identifier = "2290910", label = "taxonKey"
  • authority = "col", identifier = "5T7L7", label = "taxonID"
  • authority = "itis", identifier = "161989", label = "TSN"
  • authority = "legacy-ecospecies", identifier = "5192", label = "FLELMR"
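As a rough in-memory shape for the suggested fields, a dataclass along the following lines could work. The field names come from the list above; the defaults, the `"unreviewed"` status value, and the `primary_of` helper are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TaxonIdentifier:
    authority: str
    identifier: str   # kept as a string: IDs such as "5T7L7" are not numeric
    label: str
    rank: Optional[str] = None
    is_primary: bool = False
    source_url: Optional[str] = None
    asserted_by: Optional[str] = None
    match_confidence: Optional[float] = None
    review_status: str = "unreviewed"  # assumed vocabulary


def primary_of(identifiers: list[TaxonIdentifier]) -> Optional[TaxonIdentifier]:
    """Return the one primary identifier, or None if no primary is asserted."""
    primaries = [i for i in identifiers if i.is_primary]
    if len(primaries) > 1:
        raise ValueError("more than one primary taxon identifier")
    return primaries[0] if primaries else None
```

Storing `identifier` as text for every authority avoids special-casing registries whose IDs happen to be numeric.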

Citation model

Move from section text to structured bibliography entities:

  • citation
  • citation_identifier
  • citation_relation
  • species_citation
  • document_node_citation
  • bibliography_topic

Suggested citation identifier types:

  • DOI
  • ISBN
  • ISSN
  • PMID
  • arXiv
  • OpenAlex
  • URL
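Identifiers of these types usually need normalizing before they can serve as keys. As one example, the sketch below reduces common DOI spellings to a bare lowercase form. The function name is hypothetical, and the pattern is a widely used heuristic for modern DOIs rather than a full validation of the DOI syntax.

```python
import re
from typing import Optional

# Common ways a DOI is written in reference text and URLs.
_DOI_PREFIXES = (
    "https://doi.org/",
    "http://doi.org/",
    "https://dx.doi.org/",
    "http://dx.doi.org/",
    "doi:",
)

# Heuristic: "10.", a 4 to 9 digit registrant code, "/", then a suffix.
_DOI_PATTERN = re.compile(r"10\.\d{4,9}/\S+")


def normalize_doi(raw: str) -> Optional[str]:
    """Return a bare lowercase DOI, or None if the input does not look like one."""
    value = raw.strip()
    lowered = value.lower()
    for prefix in _DOI_PREFIXES:
        if lowered.startswith(prefix):
            value = value[len(prefix):]
            break
    # DOIs are case-insensitive, so lowercasing gives a stable comparison key.
    value = value.strip().lower()
    return value if _DOI_PATTERN.fullmatch(value) else None
```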

Markdown / AST Changes

Update the constrained Markdown profile so metadata stops implying that species_code is canonical.

Replace the current front matter recommendation:

species_code: 5192

with a provenance-oriented shape:

legacy_identifiers:
  - authority: legacy-ecospecies
    identifier: "5192"
    label: FLELMR
taxon_identifiers:
  - authority: worms
    identifier: "159059"
    label: AphiaID
    primary: true
  - authority: gbif
    identifier: "2290910"
    label: taxonKey
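Existing documents could be converted to this shape mechanically. The sketch below works on front matter that has already been parsed into a dictionary; the function name is hypothetical, and it deliberately leaves `taxon_identifiers` empty because external matches need review.

```python
def migrate_front_matter(meta: dict) -> dict:
    """Move a bare species_code into legacy_identifiers without altering its value."""
    migrated = {key: value for key, value in meta.items() if key != "species_code"}
    if "species_code" in meta:
        legacy = list(migrated.get("legacy_identifiers", []))
        legacy.append(
            {
                "authority": "legacy-ecospecies",
                # Stored as text so every authority's identifier has the same type.
                "identifier": str(meta["species_code"]),
                "label": "FLELMR",
            }
        )
        migrated["legacy_identifiers"] = legacy
    # External identifiers are added later, after matching and editorial review.
    migrated.setdefault("taxon_identifiers", [])
    return migrated
```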

Also add explicit bibliography sections:

## References

- id: doi:10.1000/example
  text: Smith, J. 2024. Example paper...
  relation: cites

## Suggested Reading

- topic: estuarine ecology

The AST should preserve:

  • legacy identifiers
  • normalized taxon identifiers
  • structured references
  • topic links used for bibliography expansion

Import Pipeline Changes

Species identity

Import should produce:

  1. raw imported name fields,
  2. legacy local identifiers,
  3. unresolved candidate taxon identifiers,
  4. optional matched external identifiers,
  5. a review state for unresolved or conflicting authority matches.

Do not block ingest if no external authority match exists. Store the unresolved state explicitly.
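One way to make the unresolved state explicit is to carry it on the import record itself. The record type and state names in this sketch are assumptions; the plan only requires that a missing match is stored rather than treated as an error.

```python
from dataclasses import dataclass, field
from enum import Enum


class ReviewState(Enum):
    UNRESOLVED = "unresolved"      # no external authority match was found
    NEEDS_REVIEW = "needs_review"  # candidates exist and await an editor


@dataclass
class SpeciesImport:
    raw_name: str
    legacy_code: str
    candidate_identifiers: list = field(default_factory=list)
    review_state: ReviewState = ReviewState.UNRESOLVED


def build_import(raw_name: str, legacy_code: str, candidates: list) -> SpeciesImport:
    """Always return a record: a missing match is a state, not a failure."""
    state = ReviewState.NEEDS_REVIEW if candidates else ReviewState.UNRESOLVED
    return SpeciesImport(
        raw_name=raw_name,
        legacy_code=str(legacy_code),
        candidate_identifiers=list(candidates),
        review_state=state,
    )
```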

Primary identifier assignment should be determined by:

  1. domain fit of the authority
  2. confidence of the match
  3. editorial review status
  4. future ability to crosswalk to other authorities

Citations

Split citation processing into stages:

  1. detect bibliography/reference sections in the imported SLH text,
  2. extract plaintext reference strings,
  3. convert plaintext references into draft structured entries,
  4. enrich identifiers and metadata,
  5. assign accepted citations back to species and document nodes,
  6. optionally expand bibliography by topic and citation graph.
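Stages 1 and 2 can be approximated with simple text handling. The sketch below assumes the reference section starts at a line containing only a heading such as "References" and that entries are separated by blank lines. Both are guesses about the SLH text layout and would need checking against real files.

```python
import re

# Assumed heading spellings; the real SLH files may use others.
_HEADING = re.compile(r"\s*(references|literature cited|bibliography)\s*:?\s*", re.IGNORECASE)


def extract_reference_strings(text: str) -> list[str]:
    """Return plaintext reference strings found after the first reference heading."""
    lines = text.splitlines()
    start = None
    for index, line in enumerate(lines):
        if _HEADING.fullmatch(line):
            start = index + 1
            break
    if start is None:
        return []
    block = "\n".join(lines[start:])
    # Split entries on blank lines, then collapse wrapped lines into one string.
    entries = re.split(r"\n\s*\n", block)
    return [" ".join(entry.split()) for entry in entries if entry.strip()]
```

The output is the input to stage 3; nothing here attempts to parse authors, years, or titles.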

CiteGeist Integration

../CiteGeist is a strong fit for this migration.

Observed capabilities in that repo already cover much of what EcoSpecies needs:

  • extracting references from plaintext,
  • converting rough references into draft structured entries,
  • DOI/Crossref/DataCite/OpenAlex enrichment,
  • citation graph expansion,
  • topic-based bibliography expansion,
  • duplicate clustering and canonicalization.

Do not embed CiteGeist logic directly into the EcoSpecies parser.

Instead:

  1. EcoSpecies exports candidate plaintext references and topic phrases.
  2. CiteGeist processes and enriches them into structured bibliography data.
  3. EcoSpecies imports reviewed citation outputs into its own citation tables.
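Step 1 implies some export document. The interchange format is not defined yet, so every key in the sketch below (`format`, `version`, `ref_id`, and so on) is a placeholder, and nothing here reflects what CiteGeist actually accepts.

```python
import json


def build_export(species_slug: str, references: list[str], topics: list[str]) -> str:
    """Serialize candidate references and topic phrases for external processing."""
    payload = {
        "format": "ecospecies-citegeist-draft",  # placeholder name
        "version": 1,
        "species": species_slug,
        "references": [
            # ref_id lets reviewed results be matched back to the source string.
            {"ref_id": f"{species_slug}:{number}", "text": text}
            for number, text in enumerate(references, start=1)
        ],
        "topics": sorted(set(topics)),
    }
    return json.dumps(payload, indent=2, sort_keys=True)
```

Keeping the exchange as a plain file boundary means the EcoSpecies parser never calls CiteGeist code directly.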

First integration targets

  • species-level bibliography cleanup from References sections
  • DOI resolution and identifier assignment
  • duplicate detection across species bibliographies
  • topic expansion for subject areas such as habitat, trophic ecology, reproduction, invasive biology, and fisheries context

Later integration targets

  • node-level citation attachment
  • bibliography review UI
  • suggested-reading generation per species
  • topic-seeded bibliography augmentation for under-cited species drafts

API Changes

Add standards-aware endpoints:

  • /api/species/<slug>/identifiers
  • /api/species/<slug>/citations
  • /api/species/<slug>/bibliography/topics
  • /api/editor/species/<slug>/identifier-review
  • /api/editor/species/<slug>/citation-review

Do not remove legacy fields immediately. Keep flelmr_code in payloads for compatibility while introducing:

  • legacy_identifiers
  • taxon_identifiers
  • primary_taxon_identifier
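A compatible payload can derive the old field from the new lists, so the legacy value has a single source of truth. The function name and the dictionary shapes are assumptions that follow the front matter example rather than an existing API.

```python
from typing import Optional


def species_identifier_payload(legacy_identifiers: list[dict], taxon_identifiers: list[dict]) -> dict:
    """Build a response that carries both the legacy field and the new structures."""
    # Derive flelmr_code from the legacy list instead of storing it twice.
    flelmr_code: Optional[str] = next(
        (item["identifier"] for item in legacy_identifiers if item.get("label") == "FLELMR"),
        None,
    )
    primary: Optional[dict] = next(
        (item for item in taxon_identifiers if item.get("primary")),
        None,
    )
    return {
        "flelmr_code": flelmr_code,  # kept for existing clients
        "legacy_identifiers": list(legacy_identifiers),
        "taxon_identifiers": list(taxon_identifiers),
        "primary_taxon_identifier": primary,
    }
```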

UI Changes

The species detail page should distinguish:

  • scientific name
  • primary external taxon identifier
  • legacy local identifiers
  • bibliography
  • suggested reading

Editors should see:

  • unresolved authority matches
  • conflicting taxon IDs
  • citation enrichment candidates
  • duplicate-reference clusters

Contributors should only author content and draft references; identifier normalization and bibliography publication remain editorial functions.

Migration Phases

Phase A: Demote legacy code

  • Rename internal presentation from “species code” to “legacy identifier”.
  • Keep flelmr_code only as legacy provenance.
  • Add legacy_identifiers to Markdown export and AST.

Phase B: Add external taxon identifiers

  • Create taxon-identifier tables and API payloads.
  • Add editor review workflows for selecting a primary authority identifier.
  • Default marine taxa review toward WoRMS where available.
  • Default broader cross-domain review toward Catalogue of Life and GBIF where WoRMS is not the right authority.
  • Keep the model open to terrestrial species from the beginning rather than treating them as out-of-scope exceptions.

Phase C: Structured bibliography

  • Create citation tables.
  • Extract plaintext references from imported documents.
  • Store draft citations separately from accepted citations.

Phase D: CiteGeist bridge

  • Define import/export format between EcoSpecies and CiteGeist.
  • Run draft-reference normalization and DOI enrichment.
  • Import reviewed structured citations back into EcoSpecies.

Phase E: Topic-aware bibliography growth

  • Store species topic phrases.
  • Use CiteGeist topic expansion for bibliography augmentation.
  • Keep added citations flagged by source type:
    • imported
    • resolved
    • topic-expanded
    • editor-added
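The four source types map naturally onto an enumeration, which rejects unknown values instead of silently storing them. The class and helper names below are illustrative, and citations are assumed to be dictionaries with a `source` key.

```python
from collections import Counter
from enum import Enum


class CitationSource(Enum):
    IMPORTED = "imported"
    RESOLVED = "resolved"
    TOPIC_EXPANDED = "topic-expanded"
    EDITOR_ADDED = "editor-added"


def count_by_source(citations: list[dict]) -> Counter:
    """Tally citations per source type; an unknown source raises ValueError."""
    return Counter(CitationSource(citation["source"]) for citation in citations)
```

A tally like this lets an editor see at a glance how much of a species bibliography came from the historical import and how much from expansion.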

Immediate Next Steps

  1. Update the Markdown profile to replace species_code with legacy_identifiers plus taxon_identifiers.
  2. Add legacy_identifiers and taxon_identifiers to the AST/document model.
  3. Introduce taxon identifier tables in the PostgreSQL schema.
  4. Define a minimal EcoSpecies-to-CiteGeist interchange format for plaintext references and topic phrases.
  5. Add editor-facing citation review before attempting automatic bibliography publication.