# EcoSpecies Standards Migration Plan

## Problem

The current EcoSpecies ingest and document model still treats legacy local fields such as `FLELMR code` / `species_code` as if they were primary identifiers. That is useful for historical provenance, but it is the wrong long-term center of gravity for a broader, modern biodiversity knowledge system.

The same problem exists for citations:

- legacy plaintext reference blocks are treated as local document text,
- citation identity is weak or missing,
- bibliography growth is tied to what happened to appear in the historical SLH file.

The new system should preserve legacy local identifiers and references, but it should not be structurally bound to them.

## Direction

Treat legacy local codes and freeform references as import-era artifacts, not canonical future-facing identifiers.

Going forward, EcoSpecies should prefer broadly recognized identifiers and registries:

- taxonomic name authority and taxon identifiers:
  - Catalogue of Life IDs and release DOIs
  - GBIF taxon keys
  - WoRMS AphiaIDs for marine taxa
  - ITIS TSNs where relevant
  - optional NCBI Taxonomy IDs for research interoperability
- literature and dataset identifiers:
  - DOI as the primary publication/dataset identifier
  - ISBN/ISSN where DOI is absent
  - OpenAlex IDs and DataCite metadata as enrichment layers
- contributor identity:
  - email-based local contributor accounts now
  - optional ORCID linkage later for editor and contributor identity

The system should be marine-forward because that matches the historical corpus, but not marine-exclusive. Identifier strategy should therefore be authority-aware rather than tied to a single domain-specific registry.

## Authority Selection Strategy

Choose the primary taxon authority by best-fit coverage, not by a single global rule.

- marine taxa:
  - prefer WoRMS AphiaID as primary when confidently matched
  - retain GBIF and Catalogue of Life as crosswalks
- non-marine or mixed-domain taxa:
  - prefer Catalogue of Life or GBIF as primary, depending on match quality and coverage
  - retain ITIS and other relevant identifiers as crosswalks
- unresolved or conflicting cases:
  - store all candidate identifiers
  - require editorial review before a primary identifier is asserted

This keeps the project ready for terrestrial expansion without discarding the value of WoRMS for the present corpus.

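The selection rule above can be sketched in a few lines. This is only an illustration: the `Candidate` type, the `PREFERRED_PRIMARY` table, the 0.0-1.0 confidence scale, and the `MIN_CONFIDENCE` cut-off are all assumptions, since this plan names the authorities but does not fix thresholds or data shapes.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Set, Tuple

# Assumed mapping from domain to the authorities allowed to become primary.
PREFERRED_PRIMARY: Dict[str, Tuple[str, ...]] = {
    "marine": ("worms",),
    "non-marine": ("col", "gbif"),
}
MIN_CONFIDENCE = 0.9  # assumed cut-off for a "confident" match


@dataclass(frozen=True)
class Candidate:
    authority: str
    identifier: str
    confidence: float  # assumed 0.0-1.0 scale


def propose_primary(domain: str, candidates: List[Candidate]) -> Optional[Candidate]:
    """Propose a primary identifier, or return None to request editorial review."""
    preferred = PREFERRED_PRIMARY.get(domain, ())
    confident = [
        c for c in candidates
        if c.authority in preferred and c.confidence >= MIN_CONFIDENCE
    ]
    if not confident:
        return None  # unresolved: nothing confident from a preferred authority

    # Two different identifiers from the same authority count as a conflict.
    ids_by_authority: Dict[str, Set[str]] = {}
    for c in confident:
        ids_by_authority.setdefault(c.authority, set()).add(c.identifier)
    if any(len(ids) > 1 for ids in ids_by_authority.values()):
        return None

    # Among preferred authorities, match quality decides.
    return max(confident, key=lambda c: c.confidence)
```

The function only proposes a primary; every candidate, including those not chosen, would still be stored as a crosswalk, and a `None` result routes the species to editorial review.
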
## Important Taxonomic Note

PhyloCode is relevant for clade naming, not as a general-purpose replacement for species-level registry IDs. It should not become the primary EcoSpecies species identifier layer. It may be useful later for clade-aware ontology and higher-level phylogenetic naming, but not as the main substitute for local `species_code` values.

## Core Design Rules

1. Legacy local identifiers remain preserved exactly as imported.
2. Canonical taxon identity becomes multi-authority, not single-local-code.
3. Citations become first-class structured entities, not just text inside a section.
4. Bibliographies can be extended by topic and citation graph, not only by source-document inheritance.
5. Exports keep provenance visible so readers can distinguish legacy source metadata from normalized external identifiers.

## Schema Changes

### Species metadata

Retain `flelmr_code` for provenance, but demote it to a legacy metadata field.

Add a taxon-identity layer:

- `taxon_name_usage`
- `taxon_identifier`
- `taxon_authority`
- `taxon_match_review`

Suggested fields:

- `taxon_identifier.authority`
- `taxon_identifier.identifier`
- `taxon_identifier.rank`
- `taxon_identifier.label`
- `taxon_identifier.is_primary`
- `taxon_identifier.source_url`
- `taxon_identifier.asserted_by`
- `taxon_identifier.match_confidence`
- `taxon_identifier.review_status`

Examples:

- `authority = "worms", identifier = "159059", label = "AphiaID"`
- `authority = "gbif", identifier = "2290910", label = "taxonKey"`
- `authority = "col", identifier = "5T7L7", label = "taxonID"`
- `authority = "itis", identifier = "161989", label = "TSN"`
- `authority = "legacy-ecospecies", identifier = "5192", label = "FLELMR"`

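As a rough illustration of the suggested fields, an in-memory record might look like the sketch below. The Python types, the defaults, the `"unreviewed"` status value, and the `primary_of` helper are assumptions made for the example; the rule that a species carries at most one primary identifier is likewise an assumption, not something this plan states outright.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class TaxonIdentifier:
    authority: str
    identifier: str                  # kept as text; not every authority uses numeric IDs
    label: str
    rank: Optional[str] = None
    is_primary: bool = False
    source_url: Optional[str] = None
    asserted_by: Optional[str] = None
    match_confidence: Optional[float] = None
    review_status: str = "unreviewed"  # assumed vocabulary


def primary_of(identifiers: List[TaxonIdentifier]) -> Optional[TaxonIdentifier]:
    """Return the single primary identifier, or None if none is asserted."""
    primaries = [i for i in identifiers if i.is_primary]
    if len(primaries) > 1:
        raise ValueError("more than one primary taxon identifier")
    return primaries[0] if primaries else None


identifiers = [
    TaxonIdentifier("worms", "159059", "AphiaID", is_primary=True),
    TaxonIdentifier("gbif", "2290910", "taxonKey"),
    TaxonIdentifier("legacy-ecospecies", "5192", "FLELMR"),
]
```

Storing `identifier` as text keeps values such as the Catalogue of Life ID `5T7L7` and the numeric AphiaID in one column without special cases.
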
### Citation model

Move from section text to structured bibliography entities:

- `citation`
- `citation_identifier`
- `citation_relation`
- `species_citation`
- `document_node_citation`
- `bibliography_topic`

Suggested citation identifier types:

- DOI
- ISBN
- ISSN
- PMID
- arXiv
- OpenAlex
- URL

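Typed identifiers only help if equivalent spellings collapse to one value. The sketch below shows one way a `citation_identifier` for a DOI might be normalized; the `CitationIdentifier` type, the `normalize_doi` helper, and the choice to lowercase DOIs (which are case-insensitive) are assumptions for illustration, not a description of existing EcoSpecies or CiteGeist code.

```python
import re
from dataclasses import dataclass

# Strips a resolver URL or "doi:" prefix from the start of the string.
_DOI_PREFIX = re.compile(r"^(?:https?://(?:dx\.)?doi\.org/|doi:)\s*", re.IGNORECASE)


@dataclass(frozen=True)
class CitationIdentifier:
    scheme: str  # e.g. "doi", "isbn", "issn", "pmid", "arxiv", "openalex", "url"
    value: str


def normalize_doi(raw: str) -> CitationIdentifier:
    """Reduce common DOI spellings to a bare, lowercase DOI."""
    value = _DOI_PREFIX.sub("", raw.strip()).lower()
    if not value.startswith("10.") or "/" not in value:
        raise ValueError(f"not a DOI: {raw!r}")
    return CitationIdentifier("doi", value)
```

With this shape, `doi:10.1000/example` and `https://doi.org/10.1000/Example` produce equal identifiers, which is what duplicate detection across species bibliographies would need.
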
## Markdown / AST Changes

Update the constrained Markdown profile so metadata stops implying that `species_code` is canonical.

Replace the current front matter recommendation:

```md
species_code: 5192
```

with a provenance-oriented shape:

```md
legacy_identifiers:
  - authority: legacy-ecospecies
    identifier: 5192
    label: FLELMR
taxon_identifiers:
  - authority: worms
    identifier: 159059
    label: AphiaID
    primary: true
  - authority: gbif
    identifier: 2290910
    label: taxonKey
```

Also add explicit bibliography sections:

```md
## References

- id: doi:10.1000/example
  text: Smith, J. 2024. Example paper...
  relation: cites

## Suggested Reading

- topic: estuarine ecology
```

The AST should preserve:

- legacy identifiers
- normalized taxon identifiers
- structured references
- topic links used for bibliography expansion

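The front matter change can be applied mechanically to already-parsed metadata. The following is a minimal sketch, assuming the front matter has been loaded into a plain dictionary; the function name `migrate_front_matter` is hypothetical, and the legacy value is carried over unchanged so that it stays exactly as imported.

```python
from typing import Any, Dict


def migrate_front_matter(meta: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of `meta` with `species_code` moved under `legacy_identifiers`."""
    migrated = {key: value for key, value in meta.items() if key != "species_code"}
    if "species_code" in meta:
        legacy = list(migrated.get("legacy_identifiers", []))
        entry = {
            "authority": "legacy-ecospecies",
            "identifier": meta["species_code"],  # preserved exactly as imported
            "label": "FLELMR",
        }
        if entry not in legacy:
            legacy.append(entry)
        migrated["legacy_identifiers"] = legacy
    # External identifiers start empty until a match is made or reviewed.
    migrated.setdefault("taxon_identifiers", [])
    return migrated
```

The input dictionary is left untouched, so the original front matter can still be written out for comparison during the transition.
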
## Import Pipeline Changes

### Species identity

Import should produce:

1. raw imported name fields,
2. legacy local identifiers,
3. unresolved candidate taxon identifiers,
4. optional matched external identifiers,
5. a review state for unresolved or conflicting authority matches.

Do not block ingest if no external authority match exists. Store the unresolved state explicitly.

Primary identifier assignment should be determined by:

1. domain fit of the authority
2. confidence of the match
3. editorial review status
4. future ability to crosswalk to other authorities

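A sketch of the non-blocking behaviour: the record is always built, and the review state says how much is known. The `ImportedSpecies` type, the `build_import_record` helper, and the state names `unresolved`, `conflicting`, and `needs_review` are assumptions for illustration; this plan does not define a state vocabulary.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class ImportedSpecies:
    raw_names: Dict[str, str]
    legacy_identifiers: List[Dict[str, str]]
    candidate_identifiers: List[Dict[str, str]] = field(default_factory=list)
    matched_identifiers: List[Dict[str, str]] = field(default_factory=list)
    review_state: str = "unresolved"


def build_import_record(
    raw_names: Dict[str, str],
    flelmr_code: str,
    candidates: List[Dict[str, str]],
) -> ImportedSpecies:
    """Build an import record without requiring any external authority match."""
    ids_by_authority: Dict[str, Set[str]] = {}
    for candidate in candidates:
        ids_by_authority.setdefault(candidate["authority"], set()).add(
            candidate["identifier"]
        )

    if not ids_by_authority:
        state = "unresolved"      # no external match at all; ingest still succeeds
    elif any(len(ids) > 1 for ids in ids_by_authority.values()):
        state = "conflicting"     # one authority returned several identifiers
    else:
        state = "needs_review"    # candidates exist; an editor asserts the primary

    return ImportedSpecies(
        raw_names=dict(raw_names),
        legacy_identifiers=[
            {"authority": "legacy-ecospecies", "identifier": flelmr_code, "label": "FLELMR"}
        ],
        candidate_identifiers=list(candidates),
        review_state=state,
    )
```

`matched_identifiers` stays empty here on purpose: in this sketch a candidate only becomes a match once review accepts it.
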
### Citations

Split citation processing into stages:

1. detect bibliography/reference sections in the imported SLH text,
2. extract plaintext reference strings,
3. convert plaintext references into draft structured entries,
4. enrich identifiers and metadata,
5. assign accepted citations back to species and document nodes,
6. optionally expand bibliography by topic and citation graph.

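Stages 1 and 2 can be prototyped with plain text handling before any enrichment exists. The sketch below makes several assumptions about the SLH text that this plan does not confirm: that the reference section starts at a heading line such as `References` or `Literature Cited`, that it runs to the end of the document, and that entries are separated by blank lines. The function name `extract_reference_strings` is hypothetical.

```python
import re
from typing import List

# Assumed heading spellings; the real SLH files may use others.
_REFERENCE_HEADING = re.compile(
    r"^\s*(references|literature cited|bibliography)\s*:?\s*$", re.IGNORECASE
)


def extract_reference_strings(text: str) -> List[str]:
    """Return one string per reference found after the first reference heading."""
    lines = text.splitlines()
    start = next(
        (i + 1 for i, line in enumerate(lines) if _REFERENCE_HEADING.match(line)),
        None,
    )
    if start is None:
        return []  # stage 1 found no reference section

    references: List[str] = []
    current: List[str] = []
    for line in lines[start:]:
        if line.strip():
            current.append(line.strip())  # join wrapped lines of one entry
        elif current:
            references.append(" ".join(current))
            current = []
    if current:
        references.append(" ".join(current))
    return references
```

The output is deliberately unstructured: each string is a draft input for stage 3, not a citation.
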
## CiteGeist Integration

`../CiteGeist` is a strong fit for this migration.

Observed capabilities in that repo already cover much of what EcoSpecies needs:

- extracting references from plaintext,
- converting rough references into draft structured entries,
- DOI/Crossref/DataCite/OpenAlex enrichment,
- citation graph expansion,
- topic-based bibliography expansion,
- duplicate clustering and canonicalization.

### Recommended integration boundary

Do not embed CiteGeist logic directly into the EcoSpecies parser.

Instead:

1. EcoSpecies exports candidate plaintext references and topic phrases.
2. CiteGeist processes and enriches them into structured bibliography data.
3. EcoSpecies imports reviewed citation outputs into its own `citation` tables.

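Step 1 of that boundary could be as small as a JSON document per species. The sketch below is one possible shape only: the interchange format has not been defined yet, so the `format` tag, the `version` field, the `ref_id` scheme, and the function name `build_citegeist_request` are all hypothetical, and nothing here reflects an existing CiteGeist input format.

```python
import json
from typing import List


def build_citegeist_request(
    species_slug: str, references: List[str], topics: List[str]
) -> str:
    """Serialize candidate references and topic phrases for hand-off."""
    payload = {
        "format": "ecospecies-citegeist-draft",  # hypothetical format tag
        "version": 0,
        "species": species_slug,
        "references": [
            # Stable per-species ids let reviewed results be matched back on import.
            {"ref_id": f"{species_slug}:ref:{number}", "text": text}
            for number, text in enumerate(references, start=1)
        ],
        "topics": list(topics),
    }
    return json.dumps(payload, indent=2, sort_keys=True)
```

Keeping the hand-off as a file-like document rather than a function call is what lets the two projects evolve independently.
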
### First integration targets

- species-level bibliography cleanup from `References` sections
- DOI resolution and identifier assignment
- duplicate detection across species bibliographies
- topic expansion for subject areas such as habitat, trophic ecology, reproduction, invasive biology, and fisheries context

### Later integration targets

- node-level citation attachment
- bibliography review UI
- suggested-reading generation per species
- topic-seeded bibliography augmentation for under-cited species drafts

## API Changes

Add standards-aware endpoints:

- `/api/species/<slug>/identifiers`
- `/api/species/<slug>/citations`
- `/api/species/<slug>/bibliography/topics`
- `/api/editor/species/<slug>/identifier-review`
- `/api/editor/species/<slug>/citation-review`

Do not remove legacy fields immediately. Keep `flelmr_code` in payloads for compatibility while introducing:

- `legacy_identifiers`
- `taxon_identifiers`
- `primary_taxon_identifier`

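A transitional payload might carry the old and new fields side by side, as in the sketch below. The function name `species_payload` and the dictionary layout are assumptions for illustration; only the four field names come from this plan.

```python
from typing import Any, Dict, List


def species_payload(
    flelmr_code: str, taxon_identifiers: List[Dict[str, Any]]
) -> Dict[str, Any]:
    """Build a payload that keeps `flelmr_code` while exposing the new fields."""
    primary = next((t for t in taxon_identifiers if t.get("primary")), None)
    return {
        "flelmr_code": flelmr_code,  # kept for existing clients
        "legacy_identifiers": [
            {"authority": "legacy-ecospecies", "identifier": flelmr_code, "label": "FLELMR"}
        ],
        "taxon_identifiers": list(taxon_identifiers),
        "primary_taxon_identifier": primary,  # None until an editor asserts one
    }
```

Existing clients keep reading `flelmr_code`; new clients can rely on `primary_taxon_identifier` and treat `None` as "not yet reviewed".
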
## UI Changes

The species detail page should distinguish:

- scientific name
- primary external taxon identifier
- legacy local identifiers
- bibliography
- suggested reading

Editors should see:

- unresolved authority matches
- conflicting taxon IDs
- citation enrichment candidates
- duplicate-reference clusters

Contributors should only author content and draft references; identifier normalization and bibliography publication remain editorial functions.

## Migration Phases

### Phase A: Demote legacy code

- Rename internal presentation from “species code” to “legacy identifier”.
- Keep `flelmr_code` only as legacy provenance.
- Add `legacy_identifiers` to Markdown export and AST.

### Phase B: Add external taxon identifiers

- Create taxon-identifier tables and API payloads.
- Add editor review workflows for selecting a primary authority identifier.
- Default marine taxa review toward WoRMS where available.
- Default broader cross-domain review toward Catalogue of Life and GBIF where WoRMS is not the right authority.
- Keep the model open to terrestrial species from the beginning rather than treating them as out-of-scope exceptions.

### Phase C: Structured bibliography

- Create citation tables.
- Extract plaintext references from imported documents.
- Store draft citations separately from accepted citations.

### Phase D: CiteGeist bridge

- Define import/export format between EcoSpecies and CiteGeist.
- Run draft-reference normalization and DOI enrichment.
- Import reviewed structured citations back into EcoSpecies.

### Phase E: Topic-aware bibliography growth

- Store species topic phrases.
- Use CiteGeist topic expansion for bibliography augmentation.
- Keep added citations flagged by source type:
  - imported
  - resolved
  - topic-expanded
  - editor-added

## Immediate Next Steps

1. Update the Markdown profile to replace `species_code` with `legacy_identifiers` plus `taxon_identifiers`.
2. Add `legacy_identifiers` and `taxon_identifiers` to the AST/document model.
3. Introduce taxon identifier tables in the PostgreSQL schema.
4. Define a minimal EcoSpecies-to-CiteGeist interchange format for plaintext references and topic phrases.
5. Add editor-facing citation review before attempting automatic bibliography publication.