EcoSpecies Modernization Roadmap

Current Status

As of 2026-03-27, the repo has moved past the pure planning stage. The following pieces are implemented and working in the live stack:

  • Docker Compose deployment with explicit ecospecies-... container names
  • path-based hosting support for /apps/ecospecies
  • in-repo-only source directory resolution with safe path validation
  • legacy SLH ingest into PostgreSQL-backed species, sections, citations, audit, and document records
  • editor/admin workflows for draft, review, publish, archive, and audit history
  • contributor registration and draft-authoring workflow with token-based access
  • structured Markdown document storage and editor/API round-trip
  • persisted taxon identifier scaffolding with legacy identifiers separated from future-facing external identifiers
  • citation extraction, review, enrichment, batch enrichment, candidate matching, and reviewed-candidate selection/addition
  • citation persistence back into the structured Markdown source of truth

The roadmap below has been updated to reflect that actual state.

Target Product

Create a Docker Compose-based, open-source EcoSpecies successor that:

  • ingests legacy SLH text files and future species submissions
  • exposes a stable API for species, sections, citations, and ecological linkages
  • provides a responsive public web app
  • supports researcher/editor workflows for curation and publishing
  • generates exports aligned with legacy reporting needs and future FLELMR-style outputs

Core platform

  • Backend: Python API service
  • Primary datastore: PostgreSQL
  • Search/indexing: PostgreSQL full-text initially, optional Meilisearch/OpenSearch later
  • Frontend: static SPA or React-based client once requirements stabilize
  • Deployment/runtime: Docker Compose for development and small-scale deployment

Why this stack

  • permissive licenses
  • strong support for text ingestion, APIs, and data processing
  • easy local development
  • clear path from prototype to production

Product Capabilities By Phase

Phase 0: Discovery and migration planning

Status: completed

  • Inventory legacy assets and user-facing capabilities.
  • Capture the replacement architecture and ingestion strategy.
  • Define acknowledgements, provenance, and licensing boundaries.

Phase 1: Ingestion foundation

Status: substantially complete, with parser refinement ongoing

  • Parse legacy .txt SLH inputs into structured JSON records.
  • Normalize common metadata: title, scientific name, common name, FLELMR/EcoSpecies code, headings, references.
  • Create ingest diagnostics to flag malformed files and missing metadata.
  • Continue parser refinement for legacy edge cases in headings, citations, and historical bibliography formats.
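The parse-and-diagnose step above can be sketched as follows. This is a minimal illustration, not the project's parser: the SLH layout is not described in this roadmap, so the sketch assumes a hypothetical format with "Key: value" metadata lines followed by ALL-CAPS section headings. The names `parse_slh_text`, `ParsedRecord`, and `REQUIRED_METADATA` are illustrative only.

```python
import json
import re
from dataclasses import asdict, dataclass, field

# ASSUMPTION: metadata appears as "Key: value" lines before the first heading,
# and section headings are ALL-CAPS lines. The real SLH layout may differ.
REQUIRED_METADATA = ("title", "scientific name", "common name")
META_RE = re.compile(r"^([A-Za-z][A-Za-z /]*?):\s*(.+)$")
HEADING_RE = re.compile(r"^[A-Z][A-Z0-9 /&-]{2,}$")


@dataclass
class ParsedRecord:
    metadata: dict = field(default_factory=dict)
    sections: list = field(default_factory=list)
    diagnostics: list = field(default_factory=list)


def parse_slh_text(text: str) -> ParsedRecord:
    """Split text into metadata and sections, collecting diagnostics as it goes."""
    record = ParsedRecord()
    current = None  # section currently being filled
    for line_no, raw in enumerate(text.splitlines(), start=1):
        line = raw.strip()
        if not line:
            continue
        if HEADING_RE.match(line):
            current = {"heading": line.title(), "line": line_no, "body": []}
            record.sections.append(current)
            continue
        meta = META_RE.match(line)
        if current is None and meta:
            record.metadata[meta.group(1).strip().lower()] = meta.group(2).strip()
            continue
        if current is None:
            record.diagnostics.append(f"line {line_no}: text before first heading")
            continue
        current["body"].append(line)
    # Flag missing metadata and malformed files instead of failing the ingest run.
    for key in REQUIRED_METADATA:
        if key not in record.metadata:
            record.diagnostics.append(f"missing metadata: {key}")
    if not record.sections:
        record.diagnostics.append("no section headings found")
    return record


sample = """Title: Example Species
Scientific name: Genus species

HABITAT
Placeholder body text for the habitat section.
"""
record = parse_slh_text(sample)
print(json.dumps(asdict(record), indent=2))
```

Collecting diagnostics on the record, rather than raising on the first problem, lets an ingest run report every malformed file in one pass.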

Phase 2: Public read experience

Status: implemented baseline

  • Species listing and search.
  • Species detail view with section navigation.
  • Provenance and acknowledgement display.
  • Summary metrics on corpus coverage.
  • Path-based deployment under /apps/ecospecies.

Phase 3: Structured persistence and editorial workflow

Status: implemented baseline, with editor UX still maturing

  • PostgreSQL-backed persistence for species, sections, citations, documents, taxon identifiers, and audit history.
  • Editor-safe import jobs and audit metadata.
  • Raw-source preservation alongside normalized records.
  • Authentication and role-based access for admin/editor/contributor workflows.
  • Persisted editorial workflow state for draft, review, published, and archived records.
  • Structured Markdown document storage and round-trip editing.
  • Citation review, enrichment, candidate selection, and reviewed-candidate addition.
  • Contributor draft creation and owner-scoped editing.
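The workflow states and owner-scoped editing listed above could be modeled roughly as below. The state and role names come from this roadmap; the transition table and the rule that contributors may only submit their own drafts for review are assumptions made for illustration, not a description of the implemented rules.

```python
from enum import Enum


class WorkflowState(str, Enum):
    DRAFT = "draft"
    REVIEW = "review"
    PUBLISHED = "published"
    ARCHIVED = "archived"


EDITORIAL_ROLES = {"admin", "editor"}

# ASSUMPTION: illustrative transition table; the real workflow may allow other moves.
ALLOWED_TRANSITIONS = {
    WorkflowState.DRAFT: {WorkflowState.REVIEW, WorkflowState.ARCHIVED},
    WorkflowState.REVIEW: {WorkflowState.DRAFT, WorkflowState.PUBLISHED},
    WorkflowState.PUBLISHED: {WorkflowState.ARCHIVED},
    WorkflowState.ARCHIVED: {WorkflowState.DRAFT},
}


def can_edit(role: str, user_id: int, owner_id: int, state: WorkflowState) -> bool:
    """Editorial roles edit anything; contributors edit only their own drafts."""
    if role in EDITORIAL_ROLES:
        return True
    return role == "contributor" and user_id == owner_id and state is WorkflowState.DRAFT


def transition(current: WorkflowState, target: WorkflowState, role: str) -> WorkflowState:
    """Return the new state, or raise if the move or the role is not permitted."""
    if target not in ALLOWED_TRANSITIONS[current]:
        raise ValueError(f"{current.value} -> {target.value} is not allowed")
    if role not in EDITORIAL_ROLES and target is not WorkflowState.REVIEW:
        raise PermissionError(f"role {role!r} may only submit drafts for review")
    return target
```

Keeping the transition table as data makes it straightforward to write an audit entry for every accepted state change.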

Phase 4: Standards-aware identity and bibliography

Status: partially implemented

  • Preserve legacy local identifiers as provenance.
  • Persist taxon identifiers separately from legacy identifiers.
  • Expose legacy_identifiers, taxon_identifiers, and primary_taxon_* API fields.
  • Persist structured citation records with DOI/OpenAlex/DataCite-style enrichment fields.
  • Continue toward multi-authority identifier review, richer citation entities, and CiteGeist-backed bibliography expansion.
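One way to keep legacy codes and standards-based identifiers separate in an API payload is sketched below. Only the `legacy_identifiers` and `taxon_identifiers` field names come from this roadmap. The `TaxonIdentifier` shape, the authority preference order, and the `primary` key (a stand-in for the `primary_taxon_*` fields, whose exact names are not listed here) are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional

# ASSUMPTION: hypothetical preference order among authorities.
AUTHORITY_PREFERENCE = ("worms", "gbif", "col")


@dataclass(frozen=True)
class TaxonIdentifier:
    authority: str
    value: str
    reviewed: bool = False


def pick_primary(identifiers: list) -> Optional[TaxonIdentifier]:
    """Pick the first reviewed identifier in preference order, if any."""
    reviewed = [ident for ident in identifiers if ident.reviewed]
    for authority in AUTHORITY_PREFERENCE:
        for ident in reviewed:
            if ident.authority == authority:
                return ident
    return None


def identifier_payload(legacy_codes: list, identifiers: list) -> dict:
    """Build a payload that never mixes legacy codes into taxon identifiers."""
    primary = pick_primary(identifiers)
    return {
        "legacy_identifiers": list(legacy_codes),  # provenance only
        "taxon_identifiers": [
            {"authority": i.authority, "value": i.value, "reviewed": i.reviewed}
            for i in identifiers
        ],
        # Hypothetical stand-in for the primary_taxon_* fields.
        "primary": None
        if primary is None
        else {"authority": primary.authority, "value": primary.value},
    }


payload = identifier_payload(
    ["LEGACY-001"],  # placeholder legacy code
    [
        TaxonIdentifier("gbif", "placeholder-gbif-id", reviewed=True),
        TaxonIdentifier("worms", "placeholder-worms-id", reviewed=False),
    ],
)
```

In this sketch an unreviewed identifier is never promoted to primary, which matches the roadmap's emphasis on identifier review.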

Phase 5: Editor ergonomics and advanced review

Status: in progress

  • Structured Markdown editor is live.
  • Citation match-review dialog is live.
  • Remaining work:
    • CodeMirror-based Markdown editor with folding
    • inline parser diagnostics in the editor
    • richer citation diff/review affordances
    • clearer document-node and citation provenance in the UI

Phase 6: Linkages and visualization

Status: not started

  • Model predator/prey, habitat, and ecological association edges.
  • Add graph endpoints and species-relationship views.
  • Support public-friendly visual explanations and expert filters.
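Since this phase has not started, the following is only a sketch of how linkage edges might be represented and queried. The `Linkage` class, the edge kinds, and the species identifiers are hypothetical placeholders.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Linkage:
    source: str  # species (or habitat) identifier
    target: str
    kind: str    # ASSUMPTION: e.g. "preys_on", "inhabits", "associated_with"


def build_index(linkages: list) -> dict:
    """Group directed edges by their source node for fast neighbor lookup."""
    index = defaultdict(list)
    for link in linkages:
        index[link.source].append(link)
    return index


def neighbors(index: dict, node: str, kind: Optional[str] = None) -> list:
    """Return targets reachable from node, optionally filtered by edge kind."""
    return [
        link.target
        for link in index.get(node, [])
        if kind is None or link.kind == kind
    ]


edges = [
    Linkage("species-a", "species-b", "preys_on"),
    Linkage("species-a", "habitat-x", "inhabits"),
    Linkage("species-b", "habitat-x", "inhabits"),
]
index = build_index(edges)
print(neighbors(index, "species-a", kind="preys_on"))
```

A typed edge list like this maps directly onto a single PostgreSQL table, and the `kind` filter corresponds to the expert filters mentioned above.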

Phase 7: Reports and export

Status: partially implemented

  • JSON and Markdown exports exist through the API/document model.
  • Structured Markdown is now the primary human-readable editor/export format.
  • Remaining work:
    • recreate legacy-like text/RTF export
    • support export profiles for legacy compatibility and standards-forward outputs
    • improve citation/bibliography export fidelity
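Export profiles could be organized as a registry of renderers over one document model, as in this sketch. The dictionary-based document shape and the profile names are assumptions. A legacy-like text or RTF renderer would be added to the registry in the same way.

```python
import json


def export_markdown(doc: dict) -> str:
    """Render a title plus headed sections as Markdown."""
    lines = [f"# {doc['title']}", ""]
    for section in doc["sections"]:
        lines += [f"## {section['heading']}", "", section["body"], ""]
    return "\n".join(lines).rstrip() + "\n"


def export_json(doc: dict) -> str:
    """Render the document as stable, sorted JSON."""
    return json.dumps(doc, indent=2, sort_keys=True)


# ASSUMPTION: hypothetical profile names; new profiles register here.
EXPORT_PROFILES = {
    "markdown": export_markdown,
    "json": export_json,
}


def export(doc: dict, profile: str) -> str:
    """Dispatch to the renderer registered for the requested profile."""
    try:
        renderer = EXPORT_PROFILES[profile]
    except KeyError:
        raise ValueError(f"unknown export profile: {profile}") from None
    return renderer(doc)


doc = {
    "title": "Example species",
    "sections": [{"heading": "Habitat", "body": "Placeholder body text."}],
}
print(export(doc, "markdown"))
```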

Phase 8: Assisted research workflows

Status: planned

  • Add local-LLM-assisted extraction and drafting in a human-review loop.
  • Integrate bibliography tooling for citation consolidation and topic expansion.
  • Support candidate-species intake for records not yet in the historical corpus.
  • Restrict assisted drafting and publication actions to authenticated editorial roles.

Data Model Direction

Initial core entities:

  • species
  • source_document
  • document_section
  • citation
  • taxon_identifier
  • citation_identifier
  • bibliography_topic
  • taxon
  • linkage
  • media_asset
  • ingest_run

Key design rules:

  • preserve raw source text
  • retain provenance and import timestamps
  • separate public published records from draft/editor states
  • make sections addressable for citation and graph linking
  • prefer a canonical document AST over direct projection from free-form source text
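The design rules above can be illustrated with a small document AST that keeps the raw source, records an import timestamp, and gives every section a stable address. This is a sketch under assumed names (`DocumentAst`, `SectionNode`, `section_address`). The address format is hypothetical.

```python
import re
from dataclasses import dataclass, field
from datetime import datetime, timezone


def slugify(text: str) -> str:
    """Lowercase and collapse non-alphanumeric runs into single hyphens."""
    return re.sub(r"[^a-z0-9]+", "-", text.lower()).strip("-")


@dataclass
class SectionNode:
    heading: str
    body: str
    citation_keys: list = field(default_factory=list)


@dataclass
class DocumentAst:
    species_slug: str
    raw_text: str          # raw source is preserved alongside the parsed form
    imported_at: datetime  # provenance: when this source was ingested
    sections: list = field(default_factory=list)

    def section_address(self, section: SectionNode) -> str:
        # ASSUMPTION: "<species>#<heading-slug>" is an illustrative address scheme.
        return f"{self.species_slug}#{slugify(section.heading)}"


section = SectionNode("Habitat and Range", "Placeholder body text.")
ast = DocumentAst(
    species_slug="example-species",
    raw_text="HABITAT AND RANGE\nPlaceholder body text.\n",
    imported_at=datetime.now(timezone.utc),
    sections=[section],
)
print(ast.section_address(section))
```

Stable section addresses are what let citations and linkage edges point at a specific part of a record rather than at the whole document.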

LLM Extension Strategy

Use local models only for assistive tasks, never for silent publication:

  • extracting candidate structured fields from new SLH text
  • suggesting missing headings or linkage labels
  • clustering similar citations
  • resolving bibliography entries toward DOI/OpenAlex/DataCite where available
  • treating local legacy codes as provenance, not canonical identifiers
  • drafting summaries for editor review

Guardrails:

  • raw text remains authoritative
  • all generated content is marked as draft
  • every automated extraction stores source spans where possible
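The guardrails above might translate into a suggestion record like the one sketched here: generated values start as drafts, carry a source span when the value can be located in the raw text, and change status only through an explicit editorial action. All names (`Suggestion`, `make_suggestion`, `accept`) and the status strings are hypothetical.

```python
from dataclasses import dataclass, replace
from typing import Optional, Tuple


@dataclass(frozen=True)
class Suggestion:
    field_name: str
    value: str
    span: Optional[Tuple[int, int]]  # (start, end) offsets into the raw text
    status: str = "draft"            # generated content always starts as draft


def make_suggestion(raw_text: str, field_name: str, value: str) -> Suggestion:
    """Wrap an extracted value, recording its source span where one exists."""
    start = raw_text.find(value)
    span = (start, start + len(value)) if start >= 0 else None
    return Suggestion(field_name, value, span)


def accept(suggestion: Suggestion, reviewer_role: str) -> Suggestion:
    """Only editorial roles may promote a draft suggestion."""
    if reviewer_role not in {"admin", "editor"}:
        raise PermissionError("only editorial roles may accept suggestions")
    return replace(suggestion, status="accepted")


raw = "Common name: Example fish"
suggestion = make_suggestion(raw, "common_name", "Example fish")
accepted = accept(suggestion, "editor")
```

Because the record is immutable, accepting a suggestion produces a new value and leaves the original draft available for the audit history.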

Near-Term Priorities

  1. Add CodeMirror-based folding and structure-aware editing to the Markdown document editor.
  2. Expand taxon identifier review workflows for WoRMS, GBIF, Catalogue of Life, and related authorities.
  3. Deepen citation quality controls, including better parsed-field visibility and stricter/manual review loops where resolver confidence is weak.
  4. Add CiteGeist-style topic expansion and bibliography-suggestion review for under-cited species.
  5. Improve document export fidelity so reviewed citations and standards-based identifiers are clearly represented in Markdown and downstream exports.
  6. Begin the first ecological-linkage data model and API endpoints once citation/identifier workflows stabilize.

Definition Of Done For The Initial Milestone

  • docker compose up starts a working API and frontend.
  • The system can enumerate the legacy corpus and show parsed species detail for real SLH files.
  • Editors can curate structured Markdown documents and citations through authenticated workflows.
  • Contributors can register, create drafts, and edit only their own submissions.
  • Project docs describe both the implemented modernization state and the next phases.