EcoSpecies Modernization Roadmap

Current Status

As of 2026-03-27, the repo is no longer at the pure planning stage. The following pieces are already implemented and working in the live stack:

Docker Compose deployment with explicit ecospecies-... container names
path-based hosting support for /apps/ecospecies
in-repo-only source directory resolution with safe path validation
legacy SLH ingest into PostgreSQL-backed species, sections, citations, audit, and document records
editor/admin workflows for draft, review, publish, archive, and audit history
contributor registration and draft-authoring workflow with token-based access
structured Markdown document storage and editor/API round-trip
persisted taxon identifier scaffolding with legacy identifiers separated from future-facing external identifiers
citation extraction, review, enrichment, batch enrichment, candidate matching, and reviewed-candidate selection/addition
citation persistence back into the structured Markdown source of truth

The roadmap below has been updated to reflect that actual state.
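The "in-repo-only source directory resolution with safe path validation" item can be illustrated with a minimal sketch. The function name `resolve_source_dir` and its parameters are hypothetical and not taken from the codebase; the sketch only shows the general technique of resolving a candidate path and rejecting anything that escapes the repository root.

```python
from pathlib import Path


def resolve_source_dir(repo_root: Path, candidate: str) -> Path:
    """Resolve `candidate` under `repo_root`, rejecting paths that escape it.

    Hypothetical helper: the name and signature are assumptions for illustration.
    """
    root = repo_root.resolve()
    # Joining with an absolute candidate discards the root, so the containment
    # check below also catches inputs such as "/etc".
    resolved = (root / candidate).resolve()
    if not resolved.is_relative_to(root):  # requires Python 3.9+
        raise ValueError(f"source directory escapes repository root: {candidate!r}")
    return resolved
```

Resolving before comparing matters: it collapses `..` segments and follows symlinks, so the check runs against the real target rather than the string the caller supplied.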

Target Product

Create a Docker Compose-based, open-source EcoSpecies successor that:

ingests legacy SLH text files and future species submissions
exposes a stable API for species, sections, citations, and ecological linkages
provides a responsive public web app
supports researcher/editor workflows for curation and publishing
generates exports aligned with legacy reporting needs and future FLELMR-style outputs

Recommended Stack

Core platform

Backend: Python API service
Primary datastore: PostgreSQL
Search/indexing: PostgreSQL full-text initially, optional Meilisearch/OpenSearch later
Frontend: static SPA or React-based client once requirements stabilize
Deployment/runtime: Docker Compose for development and small-scale deployment

Why this stack

permissive licenses
strong support for text ingestion, APIs, and data processing
easy local development
clear path from prototype to production

Product Capabilities By Phase

Phase 0: Discovery and migration planning

Status: completed

Inventory legacy assets and user-facing capabilities.
Capture the replacement architecture and ingestion strategy.
Define acknowledgements, provenance, and licensing boundaries.

Phase 1: Ingestion foundation

Status: substantially complete, with parser refinement ongoing

Parse legacy .txt SLH inputs into structured JSON records.
Normalize common metadata: title, scientific name, common name, FLELMR/EcoSpecies code, headings, references.
Create ingest diagnostics to flag malformed files and missing metadata.
Continue parser refinement for legacy edge cases in headings, citations, and historical bibliography formats.
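As a rough illustration of the parse-normalize-diagnose flow described in this phase, the sketch below assumes a simplified input layout: `Key: value` metadata lines at the top, followed by all-caps section headings. The real SLH layout is not documented here, so that layout, the `ParsedRecord` type, and the `REQUIRED_FIELDS` names are all assumptions.

```python
import json
from dataclasses import asdict, dataclass, field

# Hypothetical required metadata keys; the real set is defined by the project.
REQUIRED_FIELDS = ("title", "scientific_name", "common_name", "code")


@dataclass
class ParsedRecord:
    metadata: dict = field(default_factory=dict)
    sections: list = field(default_factory=list)
    diagnostics: list = field(default_factory=list)


def parse_slh_text(text: str) -> ParsedRecord:
    """Parse an assumed 'Key: value' header followed by ALL-CAPS headings."""
    record = ParsedRecord()
    current = None  # section currently collecting body lines
    for line_no, raw in enumerate(text.splitlines(), start=1):
        line = raw.strip()
        if not line:
            continue
        if current is None and ":" in line:
            # Metadata is only recognized before the first heading.
            key, _, value = line.partition(":")
            record.metadata[key.strip().lower().replace(" ", "_")] = value.strip()
        elif line.isupper():
            current = {"heading": line.title(), "line": line_no, "body": []}
            record.sections.append(current)
        elif current is not None:
            current["body"].append(line)
        else:
            record.diagnostics.append(f"line {line_no}: text before first heading")
    for name in REQUIRED_FIELDS:
        if not record.metadata.get(name):
            record.diagnostics.append(f"missing metadata: {name}")
    return record


def to_json(record: ParsedRecord) -> str:
    """Serialize the parsed record as a structured JSON document."""
    return json.dumps(asdict(record), indent=2)
```

Keeping diagnostics on the record itself, rather than raising on the first problem, lets an ingest run report every malformed file in one pass.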

Phase 2: Public read experience

Status: implemented baseline

Species listing and search.
Species detail view with section navigation.
Provenance and acknowledgement display.
Summary metrics on corpus coverage.
Path-based deployment under /apps/ecospecies.

Phase 3: Structured persistence and editorial workflow

Status: implemented baseline, with editor UX still maturing

PostgreSQL-backed persistence for species, sections, citations, documents, taxon identifiers, and audit history.
Editor-safe import jobs and audit metadata.
Raw-source preservation alongside normalized records.
Authentication and role-based access for admin/editor/contributor workflows.
Persisted editorial workflow state for draft, review, published, and archived records.
Structured Markdown document storage and round-trip editing.
Citation review, enrichment, candidate selection, and reviewed-candidate addition.
Contributor draft creation and owner-scoped editing.
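The workflow states and owner-scoped editing listed above could be modeled roughly as follows. The four state names come from this roadmap; the transition table, the role strings, and both function names are assumptions made for illustration and may not match the implemented rules.

```python
from enum import Enum


class State(str, Enum):
    DRAFT = "draft"
    REVIEW = "review"
    PUBLISHED = "published"
    ARCHIVED = "archived"


# Assumed transition table; the real workflow may allow other moves.
TRANSITIONS = {
    State.DRAFT: {State.REVIEW},
    State.REVIEW: {State.DRAFT, State.PUBLISHED},
    State.PUBLISHED: {State.ARCHIVED},
    State.ARCHIVED: set(),
}

EDITORIAL_ROLES = {"admin", "editor"}  # assumed role names


def can_transition(role: str, current: State, target: State) -> bool:
    """Editorial roles may make any listed move; contributors may only submit drafts."""
    if target not in TRANSITIONS[current]:
        return False
    if role in EDITORIAL_ROLES:
        return True
    return role == "contributor" and (current, target) == (State.DRAFT, State.REVIEW)


def can_edit(role: str, user_id: str, owner_id: str, state: State) -> bool:
    """Contributors may edit only their own records, and only while in draft."""
    if role in EDITORIAL_ROLES:
        return True
    return role == "contributor" and user_id == owner_id and state is State.DRAFT
```

Holding the transitions in one table keeps the policy auditable: a reviewer can read the allowed moves without tracing conditionals.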

Phase 4: Standards-aware identity and bibliography

Status: partially implemented

Preserve legacy local identifiers as provenance.
Persist taxon identifiers separately from legacy identifiers.
Expose legacy_identifiers, taxon_identifiers, and primary_taxon_* API fields.
Persist structured citation records with DOI/OpenAlex/DataCite-style enrichment fields.
Continue toward multi-authority identifier review, richer citation entities, and CiteGeist-backed bibliography expansion.
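To make the separation between legacy and taxon identifiers concrete, here is a small sketch of how the identifier portion of an API payload might be assembled. Only the `legacy_identifiers`, `taxon_identifiers`, and `primary_taxon_` prefix names come from this roadmap; the record keys (`authority`, `identifier`, `is_primary`) and the function name are hypothetical, and the actual `primary_taxon_*` suffixes are not specified here.

```python
def build_identifier_fields(legacy_identifiers: list, taxon_identifiers: list) -> dict:
    """Assemble identifier fields, flattening the primary taxon identifier.

    The keys inside each identifier dict are assumptions for illustration.
    """
    payload = {
        "legacy_identifiers": list(legacy_identifiers),  # kept as provenance only
        "taxon_identifiers": list(taxon_identifiers),
    }
    primary = next((t for t in taxon_identifiers if t.get("is_primary")), None)
    if primary is not None:
        # Derive primary_taxon_* fields from whatever keys the record carries.
        for key, value in primary.items():
            if key != "is_primary":
                payload[f"primary_taxon_{key}"] = value
    return payload
```

Legacy codes pass through untouched in their own list, so they remain visible as provenance without being mistaken for an external authority's identifier.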

Phase 5: Editor ergonomics and advanced review

Status: in progress

Structured Markdown editor is live.
Citation match-review dialog is live.
Remaining work:
- CodeMirror-based Markdown editor with folding
- inline parser diagnostics in the editor
- richer citation diff/review affordances
- clearer document-node and citation provenance in the UI

Phase 6: Linkages and visualization

Status: not started

Model predator/prey, habitat, and ecological association edges.
Add graph endpoints and species-relationship views.
Support public-friendly visual explanations and expert filters.
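Since this phase has not started, the following is only a sketch of one possible shape for linkage edges and a neighbor query that a graph endpoint could sit on. The `Linkage` type, the `kind` labels, and the `neighbors` function are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Linkage:
    source: str  # species or habitat key the edge starts from
    target: str  # species or habitat key the edge points to
    kind: str    # assumed labels, e.g. "preys_on", "uses_habitat", "associated_with"


def neighbors(edges: list, species: str, kind: str = None) -> list:
    """Return sorted targets linked from `species`, optionally filtered by edge kind."""
    return sorted(
        {e.target for e in edges if e.source == species and (kind is None or e.kind == kind)}
    )
```

A typed edge with a `kind` field would let the same query serve both a simple public view (all neighbors) and an expert filter (one relationship type).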

Phase 7: Reports and export

Status: partially implemented

JSON and Markdown exports exist through the API/document model.
Structured Markdown is now the primary human-readable editor/export format.
Remaining work:
- recreate legacy-like text/RTF export
- support export profiles for legacy compatibility and standards-forward outputs
- improve citation/bibliography export fidelity

Phase 8: Assisted research workflows

Status: planned

Add local-LLM-assisted extraction and drafting in a human-review loop.
Integrate bibliography tooling for citation consolidation and topic expansion.
Support candidate-species intake for records not yet in the historical corpus.
Restrict assisted drafting and publication actions to authenticated editorial roles.

Data Model Direction

Initial core entities:

species
source_document
document_section
citation
taxon_identifier
citation_identifier
bibliography_topic
taxon
linkage
media_asset
ingest_run

Key design rules:

preserve raw source text
retain provenance and import timestamps
separate public published records from draft/editor states
make sections addressable for citation and graph linking
prefer a canonical document AST over direct projection from free-form source text
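The rules about addressable sections and a canonical document AST can be sketched as a small tree of section nodes that renders to Markdown and exposes stable paths for citation and graph linking. The `SectionNode` type, the slug scheme, and both helper functions are assumptions, not the project's actual document model.

```python
import re
from dataclasses import dataclass, field


@dataclass
class SectionNode:
    heading: str
    body: str = ""
    level: int = 2
    children: list = field(default_factory=list)

    @property
    def anchor(self) -> str:
        """Slug derived from the heading (assumed addressing scheme)."""
        return re.sub(r"[^a-z0-9]+", "-", self.heading.lower()).strip("-")


def render_markdown(node: SectionNode) -> str:
    """Project the AST to Markdown; the tree, not the text, is the source of truth."""
    lines = [f"{'#' * node.level} {node.heading}", ""]
    if node.body:
        lines += [node.body, ""]
    for child in node.children:
        lines.append(render_markdown(child))
    return "\n".join(lines)


def section_index(node: SectionNode, prefix: str = "") -> dict:
    """Map slash-separated anchor paths to nodes so sections can be linked to."""
    path = f"{prefix}/{node.anchor}" if prefix else node.anchor
    index = {path: node}
    for child in node.children:
        index.update(section_index(child, path))
    return index
```

Because both the Markdown and the address index are derived from the same tree, an editor round-trip cannot leave them out of sync.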

LLM Extension Strategy

Use local models only for assistive tasks, never for silent publication:

extracting candidate structured fields from new SLH text
suggesting missing headings or linkage labels
clustering similar citations
resolving bibliography entries toward DOI/OpenAlex/DataCite where available
treating local legacy codes as provenance, not canonical identifiers
drafting summaries for editor review

Guardrails:

raw text remains authoritative
all generated content is marked as draft
every automated extraction stores source spans where possible
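One way to encode these guardrails is a record type that always starts as a draft and carries the character span it was drawn from when one can be located. This is a sketch only; the `Extraction` type, its field names, and `record_extraction` are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Extraction:
    field_name: str
    value: str
    start: Optional[int]    # character offset into the raw text, if located
    end: Optional[int]
    status: str = "draft"   # generated content never starts as published
    generated: bool = True


def record_extraction(raw_text: str, field_name: str, value: str) -> Extraction:
    """Attach a source span when the value appears verbatim in the raw text."""
    start = raw_text.find(value)
    if start == -1:
        # No verbatim match: keep the suggestion but leave the span empty.
        return Extraction(field_name, value, None, None)
    return Extraction(field_name, value, start, start + len(value))
```

With the span stored, a reviewer can jump from a suggested value to the exact passage in the authoritative raw text before accepting it.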

Near-Term Priorities

Add CodeMirror-based folding and structure-aware editing to the Markdown document editor.
Expand taxon identifier review workflows for WoRMS, GBIF, Catalogue of Life, and related authorities.
Deepen citation quality controls, including better visibility of parsed fields and stricter or manual review loops where resolver confidence is low.
Add CiteGeist-style topic expansion and bibliography-suggestion review for under-cited species.
Improve document export fidelity so reviewed citations and standards-based identifiers are clearly represented in Markdown and downstream exports.
Begin the first ecological-linkage data model and API endpoints once citation/identifier workflows stabilize.
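The citation-quality priority above mentions stricter or manual review where resolver confidence is weak. A minimal sketch of that routing, assuming the resolver reports a score between 0 and 1, might look like this. The function name, queue labels, and the 0.8 threshold are all hypothetical.

```python
def review_route(confidence: float, threshold: float = 0.8) -> str:
    """Pick a review queue for a citation match; nothing is auto-accepted.

    The threshold and queue names are illustrative assumptions.
    """
    if not 0.0 <= confidence <= 1.0:
        raise ValueError(f"confidence out of range: {confidence}")
    return "standard_review" if confidence >= threshold else "manual_review"
```

Both branches end in human review, in keeping with the rule that assisted output is never published silently; the score only decides how much scrutiny a match receives.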

Definition Of Done For The Initial Milestone

docker compose up starts a working API and frontend.
The system can enumerate the legacy corpus and show parsed species detail for real SLH files.
Editors can curate structured Markdown documents and citations through authenticated workflows.
Contributors can register, create drafts, and edit only their own submissions.
Project docs describe both the implemented modernization state and the next phases.
