# EcoSpecies Modernization Roadmap

## Current Status

As of 2026-03-27, the repo is no longer at the pure planning stage. The following pieces are already implemented and working in the live stack:

- Docker Compose deployment with explicit `ecospecies-...` container names
- path-based hosting support for `/apps/ecospecies`
- in-repo-only source directory resolution with safe path validation
- legacy SLH ingest into PostgreSQL-backed species, sections, citations, audit, and document records
- editor/admin workflows for draft, review, publish, archive, and audit history
- contributor registration and draft-authoring workflow with token-based access
- structured Markdown document storage and editor/API round-trip
- persisted taxon identifier scaffolding with legacy identifiers separated from future-facing external identifiers
- citation extraction, review, enrichment, batch enrichment, candidate matching, and reviewed-candidate selection/addition
- citation persistence back into the structured Markdown source of truth

The roadmap below has been updated to reflect that actual state.

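
As an illustration of the "in-repo-only source directory resolution with safe path validation" item, the check could look roughly like the sketch below. The function name `resolve_source_dir` and its parameters are hypothetical and are not taken from the EcoSpecies codebase; the actual implementation may differ.

```python
from pathlib import Path


def resolve_source_dir(repo_root: Path, candidate: str) -> Path:
    """Resolve `candidate` against `repo_root` and reject paths outside it.

    Hypothetical helper for illustration only.
    """
    root = repo_root.resolve()
    # resolve() collapses ".." segments and follows symlinks, so the
    # containment check below runs against the real target location.
    resolved = (root / candidate).resolve()
    # An absolute `candidate` replaces `root` in the join above, so it is
    # rejected here as well unless it already points inside the repo.
    if not resolved.is_relative_to(root):  # requires Python 3.9+
        raise ValueError(f"source directory escapes the repository: {candidate!r}")
    return resolved
```

Resolving before comparing is the important design choice in this sketch: a plain string-prefix check on the unresolved path would accept inputs such as `data/../../etc`.
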
## Target Product

Create a Docker Compose-based, open-source EcoSpecies successor that:

- ingests legacy SLH text files and future species submissions
- exposes a stable API for species, sections, citations, and ecological linkages
- provides a responsive public web app
- supports researcher/editor workflows for curation and publishing
- generates exports aligned with legacy reporting needs and future FLELMR-style outputs

## Recommended Stack

### Core platform

- Backend: Python API service
- Primary datastore: PostgreSQL
- Search/indexing: PostgreSQL full-text initially, optional Meilisearch/OpenSearch later
- Frontend: static SPA or React-based client once requirements stabilize
- Deployment/runtime: Docker Compose for development and small-scale deployment

### Why this stack

- permissive licenses
- strong support for text ingestion, APIs, and data processing
- easy local development
- clear path from prototype to production

## Product Capabilities By Phase

### Phase 0: Discovery and migration planning

Status: completed

- Inventory legacy assets and user-facing capabilities.
- Capture the replacement architecture and ingestion strategy.
- Define acknowledgements, provenance, and licensing boundaries.

### Phase 1: Ingestion foundation

Status: substantially complete, with parser refinement ongoing

- Parse legacy `.txt` SLH inputs into structured JSON records.
- Normalize common metadata: title, scientific name, common name, FLELMR/EcoSpecies code, headings, references.
- Create ingest diagnostics to flag malformed files and missing metadata.
- Continue parser refinement for legacy edge cases in headings, citations, and historical bibliography formats.

### Phase 2: Public read experience

Status: implemented baseline

- Species listing and search.
- Species detail view with section navigation.
- Provenance and acknowledgement display.
- Summary metrics on corpus coverage.
- Path-based deployment under `/apps/ecospecies`.

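
Path-based deployment under `/apps/ecospecies` means the backend has to map prefixed request paths onto app-relative routes. A minimal sketch of that mapping follows. Only the `/apps/ecospecies` prefix comes from this roadmap; the names `BASE_PATH` and `strip_base_path` are hypothetical, and the real stack may handle the prefix in a reverse proxy instead.

```python
from typing import Optional

# Deployment prefix named in the roadmap; the constant name is an assumption.
BASE_PATH = "/apps/ecospecies"


def strip_base_path(request_path: str, base_path: str = BASE_PATH) -> Optional[str]:
    """Return the app-relative path, or None if the request is outside the prefix."""
    base = base_path.rstrip("/")
    if request_path == base:
        # A request for the bare prefix maps to the app root.
        return "/"
    # Requiring the trailing "/" keeps sibling paths such as
    # "/apps/ecospecies-old" from matching by accident.
    if request_path.startswith(base + "/"):
        return request_path[len(base):]
    return None
```

With an empty `base_path`, every path that starts with `/` passes through unchanged, so the same helper would also cover a root-level deployment.
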
### Phase 3: Structured persistence and editorial workflow

Status: implemented baseline, with editor UX still maturing

- PostgreSQL-backed persistence for species, sections, citations, documents, taxon identifiers, and audit history.
- Editor-safe import jobs and audit metadata.
- Raw-source preservation alongside normalized records.
- Authentication and role-based access for admin/editor/contributor workflows.
- Persisted editorial workflow state for draft, review, published, and archived records.
- Structured Markdown document storage and round-trip editing.
- Citation review, enrichment, candidate selection, and reviewed-candidate addition.
- Contributor draft creation and owner-scoped editing.

### Phase 4: Standards-aware identity and bibliography

Status: partially implemented

- Preserve legacy local identifiers as provenance.
- Persist taxon identifiers separately from legacy identifiers.
- Expose `legacy_identifiers`, `taxon_identifiers`, and `primary_taxon_*` API fields.
- Persist structured citation records with DOI/OpenAlex/DataCite-style enrichment fields.
- Continue toward multi-authority identifier review, richer citation entities, and CiteGeist-backed bibliography expansion.

### Phase 5: Editor ergonomics and advanced review

Status: in progress

- Structured Markdown editor is live.
- Citation match-review dialog is live.
- Remaining work:
  - CodeMirror-based Markdown editor with folding
  - inline parser diagnostics in the editor
  - richer citation diff/review affordances
  - clearer document-node and citation provenance in the UI

### Phase 6: Linkages and visualization

Status: not started

- Model predator/prey, habitat, and ecological association edges.
- Add graph endpoints and species-relationship views.
- Support public-friendly visual explanations and expert filters.

### Phase 7: Reports and export

Status: partially implemented

- JSON and Markdown exports exist through the API/document model.
- Structured Markdown is now the primary human-readable editor/export format.
- Remaining work:
  - recreate legacy-like text/RTF export
  - support export profiles for legacy compatibility and standards-forward outputs
  - improve citation/bibliography export fidelity

### Phase 8: Assisted research workflows

Status: planned

- Add local-LLM-assisted extraction and drafting in a human-review loop.
- Integrate bibliography tooling for citation consolidation and topic expansion.
- Support candidate-species intake for records not yet in the historical corpus.
- Restrict assisted drafting and publication actions to authenticated editorial roles.

## Data Model Direction

Initial core entities:

- `species`
- `source_document`
- `document_section`
- `citation`
- `taxon_identifier`
- `citation_identifier`
- `bibliography_topic`
- `taxon`
- `linkage`
- `media_asset`
- `ingest_run`

Key design rules:

- preserve raw source text
- retain provenance and import timestamps
- separate public published records from draft/editor states
- make sections addressable for citation and graph linking
- prefer a canonical document AST over direct projection from free-form source text

## LLM Extension Strategy

Use local models only for assistive tasks, never silent publication:

- extracting candidate structured fields from new SLH text
- suggesting missing headings or linkage labels
- clustering similar citations
- resolving bibliography entries toward DOI/OpenAlex/DataCite where available
- treating local legacy codes as provenance, not canonical identifiers
- drafting summaries for editor review

Guardrails:

- raw text remains authoritative
- all generated content is marked as draft
- every automated extraction stores source spans where possible

## Near-Term Priorities

1. Add CodeMirror-based folding and structure-aware editing to the Markdown document editor.
2. Expand taxon identifier review workflows for WoRMS, GBIF, Catalogue of Life, and related authorities.
3. Deepen citation quality controls, including better parsed-field visibility and stricter/manual review loops where resolver confidence is weak.
4. Add CiteGeist-style topic expansion and bibliography-suggestion review for under-cited species.
5. Improve document export fidelity so reviewed citations and standards-based identifiers are clearly represented in Markdown and downstream exports.
6. Begin the first ecological-linkage data model and API endpoints once citation/identifier workflows stabilize.

## Definition Of Done For The Initial Milestone

- `docker compose up` starts a working API and frontend.
- The system can enumerate the legacy corpus and show parsed species detail for real SLH files.
- Editors can curate structured Markdown documents and citations through authenticated workflows.
- Contributors can register, create drafts, and edit only their own submissions.
- Project docs describe both the implemented modernization state and the next phases.
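
The milestone's access rule (editors curate through authenticated workflows, contributors edit only their own submissions) can be expressed as a small permission check. The sketch below is an assumption-laden illustration: the role names come from this roadmap, but the `User` and `Draft` types, their fields, and `can_edit` are hypothetical and do not describe the actual EcoSpecies schema.

```python
from dataclasses import dataclass

# Roles named in the roadmap; treating admin and editor alike is an assumption.
EDITORIAL_ROLES = frozenset({"admin", "editor"})


@dataclass(frozen=True)
class User:
    user_id: int
    role: str  # "admin", "editor", or "contributor"


@dataclass(frozen=True)
class Draft:
    draft_id: int
    owner_id: int  # user_id of the contributor who created the draft


def can_edit(user: User, draft: Draft) -> bool:
    """Editorial roles may edit any draft; contributors only their own."""
    if user.role in EDITORIAL_ROLES:
        return True
    if user.role == "contributor":
        return draft.owner_id == user.user_id
    # Unknown roles are denied by default.
    return False
```

Denying unknown roles by default keeps a misconfigured or newly added role from silently gaining edit access.
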