# EcoSpecies Standards Migration Plan

## Problem

The current EcoSpecies ingest and document model still treats legacy local fields such as `FLELMR code` / `species_code` as if they were primary identifiers. That is useful for historical provenance, but it is the wrong long-term center of gravity for a broader, modern biodiversity knowledge system.

The same problem exists for citations:

- legacy plaintext reference blocks are treated as local document text,
- citation identity is weak or missing,
- bibliography growth is tied to what happened to appear in the historical SLH file.

The new system should preserve legacy local identifiers and references, but it should not be structurally bound to them.

## Direction

Treat legacy local codes and freeform references as import-era artifacts, not canonical future-facing identifiers.

Going forward, EcoSpecies should prefer broadly recognized identifiers and registries:

- taxonomic name authority and taxon identifiers:
  - Catalogue of Life IDs and release DOIs
  - GBIF taxon keys
  - WoRMS AphiaIDs for marine taxa
  - ITIS TSNs where relevant
  - optional NCBI Taxonomy IDs for research interoperability
- literature and dataset identifiers:
  - DOI as the primary publication/dataset identifier
  - ISBN/ISSN where DOI is absent
  - OpenAlex IDs and DataCite metadata as enrichment layers
- contributor identity:
  - email-based local contributor accounts now
  - optional ORCID linkage later for editor and contributor identity

The system should be marine-forward because that matches the historical corpus, but not marine-exclusive. Identifier strategy should therefore be authority-aware rather than tied to a single domain-specific registry.

## Authority Selection Strategy

Choose the primary taxon authority by best-fit coverage, not by a single global rule.

- marine taxa:
  - prefer WoRMS AphiaID as primary when confidently matched
  - retain GBIF and Catalogue of Life as crosswalks
- non-marine or mixed-domain taxa:
  - prefer Catalogue of Life or GBIF as primary, depending on match quality and coverage
  - retain ITIS and other relevant identifiers as crosswalks
- unresolved or conflicting cases:
  - store all candidate identifiers
  - require editorial review before a primary identifier is asserted

This keeps the project ready for terrestrial expansion without discarding the value of WoRMS for the present corpus.

## Important Taxonomic Note

PhyloCode is relevant for clade naming, not as a general-purpose replacement for species-level registry IDs. It should not become the primary EcoSpecies species identifier layer. It may be useful later for clade-aware ontology and higher-level phylogenetic naming, but not as the main substitute for local `species_code` values.

## Core Design Rules

1. Legacy local identifiers remain preserved exactly as imported.
2. Canonical taxon identity becomes multi-authority, not single-local-code.
3. Citations become first-class structured entities, not just text inside a section.
4. Bibliographies can be extended by topic and citation graph, not only by source-document inheritance.
5. Exports keep provenance visible so readers can distinguish legacy source metadata from normalized external identifiers.

## Schema Changes

### Species metadata

Retain `flelmr_code` for provenance, but demote it to a legacy metadata field.
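
A minimal sketch of what that demotion could look like at the payload level, assuming species records are handled as plain dictionaries. The function name `demote_legacy_code` is hypothetical; the `legacy-ecospecies` authority and `FLELMR` label are taken from the examples in this plan. The code is copied as a string so that the imported value is preserved exactly, and the original `flelmr_code` key is left in place for compatibility.

```python
from typing import Any


def demote_legacy_code(record: dict[str, Any]) -> dict[str, Any]:
    """Return a copy of `record` with `flelmr_code` mirrored into `legacy_identifiers`.

    Hypothetical helper: the input record is not modified, and the legacy
    code is kept verbatim as a string rather than coerced to a number.
    """
    out = dict(record)
    code = record.get("flelmr_code")
    if code is None:
        # No legacy code was imported; record that explicitly.
        out["legacy_identifiers"] = []
    else:
        out["legacy_identifiers"] = [
            {
                "authority": "legacy-ecospecies",
                "identifier": str(code),
                "label": "FLELMR",
            }
        ]
    return out


if __name__ == "__main__":
    print(demote_legacy_code({"flelmr_code": "5192"}))
```

Keeping the identifier as a string is a design assumption here: it avoids silently changing a code that happens to contain padding or non-numeric characters.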

Add a taxon-identity layer:

- `taxon_name_usage`
- `taxon_identifier`
- `taxon_authority`
- `taxon_match_review`

Suggested fields:

- `taxon_identifier.authority`
- `taxon_identifier.identifier`
- `taxon_identifier.rank`
- `taxon_identifier.label`
- `taxon_identifier.is_primary`
- `taxon_identifier.source_url`
- `taxon_identifier.asserted_by`
- `taxon_identifier.match_confidence`
- `taxon_identifier.review_status`

Examples:

- `authority = "worms", identifier = "159059", label = "AphiaID"`
- `authority = "gbif", identifier = "2290910", label = "taxonKey"`
- `authority = "col", identifier = "5T7L7", label = "taxonID"`
- `authority = "itis", identifier = "161989", label = "TSN"`
- `authority = "legacy-ecospecies", identifier = "5192", label = "FLELMR"`

### Citation model

Move from section text to structured bibliography entities:

- `citation`
- `citation_identifier`
- `citation_relation`
- `species_citation`
- `document_node_citation`
- `bibliography_topic`

Suggested citation identifier types:

- DOI
- ISBN
- ISSN
- PMID
- arXiv
- OpenAlex
- URL

## Markdown / AST Changes

Update the constrained Markdown profile so metadata stops implying that `species_code` is canonical.

Replace the current front matter recommendation:

```md
species_code: 5192
```

with a provenance-oriented shape:

```md
legacy_identifiers:
  - authority: legacy-ecospecies
    identifier: 5192
    label: FLELMR
taxon_identifiers:
  - authority: worms
    identifier: 159059
    label: AphiaID
    primary: true
  - authority: gbif
    identifier: 2290910
    label: taxonKey
```

Also add explicit bibliography sections:

```md
## References

- id: doi:10.1000/example
  text: Smith, J. 2024. Example paper...
  relation: cites

## Suggested Reading

- topic: estuarine ecology
```

The AST should preserve:

- legacy identifiers
- normalized taxon identifiers
- structured references
- topic links used for bibliography expansion

## Import Pipeline Changes

### Species identity

Import should produce:

1. raw imported name fields,
2. legacy local identifiers,
3. unresolved candidate taxon identifiers,
4. optional matched external identifiers,
5. a review state for unresolved or conflicting authority matches.

Do not block ingest if no external authority match exists. Store the unresolved state explicitly.

Primary identifier assignment should be determined by:

1. domain fit of the authority
2. confidence of the match
3. editorial review status
4. future ability to crosswalk to other authorities

### Citations

Split citation processing into stages:

1. detect bibliography/reference sections in the imported SLH text,
2. extract plaintext reference strings,
3. convert plaintext references into draft structured entries,
4. enrich identifiers and metadata,
5. assign accepted citations back to species and document nodes,
6. optionally expand bibliography by topic and citation graph.

## CiteGeist Integration

`../CiteGeist` is a strong fit for this migration.

Observed capabilities in that repo already cover much of what EcoSpecies needs:

- extracting references from plaintext,
- converting rough references into draft structured entries,
- DOI/Crossref/DataCite/OpenAlex enrichment,
- citation graph expansion,
- topic-based bibliography expansion,
- duplicate clustering and canonicalization.

### Recommended integration boundary

Do not embed CiteGeist logic directly into the EcoSpecies parser.

Instead:

1. EcoSpecies exports candidate plaintext references and topic phrases.
2. CiteGeist processes and enriches them into structured bibliography data.
3. EcoSpecies imports reviewed citation outputs into its own `citation` tables.
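
The export half of that boundary (step 1) could be sketched as a small JSON payload builder. Everything in this sketch is an assumption for illustration: the `ReferenceExport` class, its field names, and the `format` / `version` envelope are hypothetical and do not describe any input schema that CiteGeist actually defines. The interchange format itself still has to be agreed between the two projects.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class ReferenceExport:
    """Hypothetical per-species export record; not CiteGeist's real schema."""

    legacy_identifier: str
    plaintext_references: list[str] = field(default_factory=list)
    topic_phrases: list[str] = field(default_factory=list)


def to_interchange_json(records: list[ReferenceExport]) -> str:
    """Serialize export records into a single JSON document.

    The envelope keys ("format", "version", "species") are placeholders.
    """
    payload = {
        "format": "ecospecies-citegeist-draft",
        "version": 0,
        "species": [asdict(r) for r in records],
    }
    return json.dumps(payload, indent=2, sort_keys=True)


if __name__ == "__main__":
    record = ReferenceExport(
        legacy_identifier="5192",
        plaintext_references=["Smith, J. 2024. Example paper..."],
        topic_phrases=["estuarine ecology"],
    )
    print(to_interchange_json([record]))
```

Keeping the export to plain strings and phrases means the EcoSpecies parser needs no knowledge of citation parsing or enrichment, which is the point of the boundary.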

### First integration targets

- species-level bibliography cleanup from `References` sections
- DOI resolution and identifier assignment
- duplicate detection across species bibliographies
- topic expansion for subject areas such as habitat, trophic ecology, reproduction, invasive biology, and fisheries context

### Later integration targets

- node-level citation attachment
- bibliography review UI
- suggested-reading generation per species
- topic-seeded bibliography augmentation for under-cited species drafts

## API Changes

Add standards-aware endpoints:

- `/api/species//identifiers`
- `/api/species//citations`
- `/api/species//bibliography/topics`
- `/api/editor/species//identifier-review`
- `/api/editor/species//citation-review`

Do not remove legacy fields immediately. Keep `flelmr_code` in payloads for compatibility while introducing:

- `legacy_identifiers`
- `taxon_identifiers`
- `primary_taxon_identifier`

## UI Changes

The species detail page should distinguish:

- scientific name
- primary external taxon identifier
- legacy local identifiers
- bibliography
- suggested reading

Editors should see:

- unresolved authority matches
- conflicting taxon IDs
- citation enrichment candidates
- duplicate-reference clusters

Contributors should only author content and draft references; identifier normalization and bibliography publication remain editorial functions.

## Migration Phases

### Phase A: Demote legacy code

- Rename internal presentation from “species code” to “legacy identifier”.
- Keep `flelmr_code` only as legacy provenance.
- Add `legacy_identifiers` to Markdown export and AST.

### Phase B: Add external taxon identifiers

- Create taxon-identifier tables and API payloads.
- Add editor review workflows for selecting a primary authority identifier.
- Default marine taxa review toward WoRMS where available.
- Default broader cross-domain review toward Catalogue of Life and GBIF where WoRMS is not the right authority.
- Keep the model open to terrestrial species from the beginning rather than treating them as out-of-scope exceptions.

### Phase C: Structured bibliography

- Create citation tables.
- Extract plaintext references from imported documents.
- Store draft citations separately from accepted citations.

### Phase D: CiteGeist bridge

- Define import/export format between EcoSpecies and CiteGeist.
- Run draft-reference normalization and DOI enrichment.
- Import reviewed structured citations back into EcoSpecies.

### Phase E: Topic-aware bibliography growth

- Store species topic phrases.
- Use CiteGeist topic expansion for bibliography augmentation.
- Keep added citations flagged by source type:
  - imported
  - resolved
  - topic-expanded
  - editor-added

## Immediate Next Steps

1. Update the Markdown profile to replace `species_code` with `legacy_identifiers` plus `taxon_identifiers`.
2. Add `legacy_identifiers` and `taxon_identifiers` to the AST/document model.
3. Introduce taxon identifier tables in the PostgreSQL schema.
4. Define a minimal EcoSpecies-to-CiteGeist interchange format for plaintext references and topic phrases.
5. Add editor-facing citation review before attempting automatic bibliography publication.
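
As a starting point for steps 2 and 3, the taxon-identifier record and the authority selection strategy described earlier could be sketched roughly as follows. This is an illustration under stated assumptions, not a settled design: the `TaxonIdentifier` class mirrors a subset of the suggested `taxon_identifier` fields, while the function name `suggest_primary`, the 0.0 to 1.0 confidence scale, the `0.9` threshold, and the `review_status` values are all hypothetical. The function only proposes a candidate for editors; it does not assert a primary identifier.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class TaxonIdentifier:
    authority: str           # e.g. "worms", "gbif", "col", "itis"
    identifier: str          # kept as a string, exactly as matched
    label: str               # e.g. "AphiaID", "taxonKey", "TSN"
    match_confidence: float  # assumed scale: 0.0 (no match) to 1.0 (exact)
    review_status: str = "unreviewed"  # assumed values: unreviewed/accepted/rejected


def suggest_primary(
    candidates: list[TaxonIdentifier],
    marine: bool,
    min_confidence: float = 0.9,  # assumed threshold for "confidently matched"
) -> Optional[TaxonIdentifier]:
    """Propose a primary identifier for editorial review, or None if unresolved."""
    confident = [
        c for c in candidates
        if c.match_confidence >= min_confidence and c.review_status != "rejected"
    ]
    if marine:
        # Marine taxa: prefer WoRMS when it is confidently matched.
        worms = [c for c in confident if c.authority == "worms"]
        if worms:
            return max(worms, key=lambda c: c.match_confidence)
    # Otherwise choose between Catalogue of Life and GBIF by match quality.
    general = [c for c in confident if c.authority in ("col", "gbif")]
    if general:
        return max(general, key=lambda c: c.match_confidence)
    # Nothing suitable: leave unresolved so ingest is not blocked.
    return None
```

Other authorities such as ITIS stay in the candidate list as crosswalks and are never proposed as primary in this sketch, which matches the strategy section. Returning `None` corresponds to the explicit unresolved state that the import pipeline is meant to store.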