# EcoSpecies Modernization Roadmap

## Current Status

As of 2026-03-27, the repo is no longer at the pure planning stage. The following pieces are already implemented and working in the live stack:

- Docker Compose deployment with explicit `ecospecies-...` container names
- path-based hosting support for `/apps/ecospecies`
- in-repo-only source directory resolution with safe path validation
- legacy SLH ingest into PostgreSQL-backed species, sections, citations, audit, and document records
- editor/admin workflows for draft, review, publish, archive, and audit history
- contributor registration and draft-authoring workflow with token-based access
- structured Markdown document storage and editor/API round-trip
- persisted taxon identifier scaffolding with legacy identifiers separated from future-facing external identifiers
- citation extraction, review, enrichment, batch enrichment, candidate matching, and reviewed-candidate selection/addition
- citation persistence back into the structured Markdown source of truth

The roadmap below has been updated to reflect that actual state.

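As an illustration of the "in-repo-only source directory resolution with safe path validation" item, the check could look roughly like the sketch below. This is not the project's actual code: the function name `resolve_source_dir` and its parameters are hypothetical, and the only assumption is that a configured source directory must resolve to a location inside the repository root.

```python
from pathlib import Path


def resolve_source_dir(repo_root: str, candidate: str) -> Path:
    """Resolve `candidate` against `repo_root` and reject paths that escape it.

    Hypothetical helper for illustration only.
    """
    root = Path(repo_root).resolve()
    # An absolute `candidate` replaces the root when joined, and resolve()
    # collapses ".." segments and follows symlinks, so the containment
    # check below is made against the real target location.
    target = (root / candidate).resolve()
    if target != root and root not in target.parents:
        raise ValueError(f"source directory escapes the repository: {candidate!r}")
    return target
```

With this shape, a relative path such as `data/legacy` is accepted, while `../outside` or an absolute path elsewhere on the filesystem raises `ValueError`.
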
## Target Product

Create a Docker Compose-based, open-source EcoSpecies successor that:

- ingests legacy SLH text files and future species submissions
- exposes a stable API for species, sections, citations, and ecological linkages
- provides a responsive public web app
- supports researcher/editor workflows for curation and publishing
- generates exports aligned with legacy reporting needs and future FLELMR-style outputs

## Recommended Stack

### Core platform

- Backend: Python API service
- Primary datastore: PostgreSQL
- Search/indexing: PostgreSQL full-text initially, optional Meilisearch/OpenSearch later
- Frontend: static SPA or React-based client once requirements stabilize
- Deployment/runtime: Docker Compose for development and small-scale deployment

### Why this stack

- permissive licenses
- strong support for text ingestion, APIs, and data processing
- easy local development
- clear path from prototype to production

## Product Capabilities By Phase

### Phase 0: Discovery and migration planning

Status: completed

- Inventory legacy assets and user-facing capabilities.
- Capture the replacement architecture and ingestion strategy.
- Define acknowledgements, provenance, and licensing boundaries.

### Phase 1: Ingestion foundation

Status: substantially complete, with parser refinement ongoing

- Parse legacy `.txt` SLH inputs into structured JSON records.
- Normalize common metadata: title, scientific name, common name, FLELMR/EcoSpecies code, headings, references.
- Create ingest diagnostics to flag malformed files and missing metadata.
- Continue parser refinement for legacy edge cases in headings, citations, and historical bibliography formats.

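The parse-and-diagnose flow in this phase can be pictured with a small sketch. The real SLH file layout is not described in this roadmap, so the example assumes, purely for illustration, a file with `Key: value` header lines followed by all-caps section headings. The names `parse_slh_text`, `IngestResult`, and `REQUIRED_FIELDS` are hypothetical.

```python
import json
from dataclasses import dataclass, field

# Hypothetical required metadata; the real set is defined by the ingest code.
REQUIRED_FIELDS = ("title", "scientific_name", "common_name")


@dataclass
class IngestResult:
    record: dict
    diagnostics: list = field(default_factory=list)


def parse_slh_text(text: str) -> IngestResult:
    """Parse an ASSUMED layout: 'Key: value' headers, then ALL-CAPS headings."""
    metadata, sections, diagnostics = {}, [], []
    current = None
    for line_no, raw in enumerate(text.splitlines(), start=1):
        line = raw.strip()
        if not line:
            continue
        if line.isupper():
            # An all-caps line opens a new section.
            current = {"heading": line.title(), "line": line_no, "body": []}
            sections.append(current)
        elif current is not None:
            current["body"].append(line)
        elif ":" in line:
            key, _, value = line.partition(":")
            metadata[key.strip().lower().replace(" ", "_")] = value.strip()
        else:
            diagnostics.append(f"line {line_no}: unrecognized text before first heading")
    # Flag missing metadata instead of failing, so editors can triage files.
    for name in REQUIRED_FIELDS:
        if not metadata.get(name):
            diagnostics.append(f"missing metadata: {name}")
    if not sections:
        diagnostics.append("no section headings found")
    return IngestResult({"metadata": metadata, "sections": sections}, diagnostics)


def to_json(result: IngestResult) -> str:
    """Serialize the structured record for storage or inspection."""
    return json.dumps(result.record, indent=2)
```

The design point is that malformed input produces diagnostics alongside a partial record rather than an exception, which matches the goal of flagging files for review.
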
### Phase 2: Public read experience

Status: implemented baseline

- Species listing and search.
- Species detail view with section navigation.
- Provenance and acknowledgement display.
- Summary metrics on corpus coverage.
- Path-based deployment under `/apps/ecospecies`.

### Phase 3: Structured persistence and editorial workflow

Status: implemented baseline, with editor UX still maturing

- PostgreSQL-backed persistence for species, sections, citations, documents, taxon identifiers, and audit history.
- Editor-safe import jobs and audit metadata.
- Raw-source preservation alongside normalized records.
- Authentication and role-based access for admin/editor/contributor workflows.
- Persisted editorial workflow state for draft, review, published, and archived records.
- Structured Markdown document storage and round-trip editing.
- Citation review, enrichment, candidate selection, and reviewed-candidate addition.
- Contributor draft creation and owner-scoped editing.

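One way to picture the persisted workflow state and role-based access together is a small transition table. The sketch below is an assumption-laden illustration, not the implemented logic: the four states come from this roadmap, but the allowed transitions, the role assignments, and the `transition` helper are hypothetical.

```python
from enum import Enum


class WorkflowState(str, Enum):
    DRAFT = "draft"
    REVIEW = "review"
    PUBLISHED = "published"
    ARCHIVED = "archived"


# Hypothetical table: (from, to) -> roles allowed to perform the move.
ALLOWED_TRANSITIONS = {
    (WorkflowState.DRAFT, WorkflowState.REVIEW): {"contributor", "editor", "admin"},
    (WorkflowState.REVIEW, WorkflowState.DRAFT): {"editor", "admin"},
    (WorkflowState.REVIEW, WorkflowState.PUBLISHED): {"editor", "admin"},
    (WorkflowState.PUBLISHED, WorkflowState.ARCHIVED): {"editor", "admin"},
    (WorkflowState.ARCHIVED, WorkflowState.DRAFT): {"admin"},
}


def transition(current: WorkflowState, target: WorkflowState, role: str, audit_log: list) -> WorkflowState:
    """Validate a state change and append an audit entry when it succeeds."""
    allowed_roles = ALLOWED_TRANSITIONS.get((current, target))
    if allowed_roles is None:
        raise ValueError(f"transition {current.value} -> {target.value} is not defined")
    if role not in allowed_roles:
        raise PermissionError(f"role {role!r} may not move {current.value} -> {target.value}")
    audit_log.append({"from": current.value, "to": target.value, "role": role})
    return target
```

Keeping the rules in one table makes it straightforward to audit which roles can publish, and every successful change leaves an audit record.
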
### Phase 4: Standards-aware identity and bibliography

Status: partially implemented

- Preserve legacy local identifiers as provenance.
- Persist taxon identifiers separately from legacy identifiers.
- Expose `legacy_identifiers`, `taxon_identifiers`, and `primary_taxon_*` API fields.
- Persist structured citation records with DOI/OpenAlex/DataCite-style enrichment fields.
- Continue toward multi-authority identifier review, richer citation entities, and CiteGeist-backed bibliography expansion.

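The separation between legacy and external identifiers might be assembled for an API response along the lines below. Only the field names `legacy_identifiers`, `taxon_identifiers`, and the `primary_taxon_` prefix come from this document. The concrete `primary_taxon_authority` and `primary_taxon_identifier` keys, the authority ordering, and the helper name are assumptions made for the example.

```python
# Hypothetical preference order for choosing the primary external identifier.
AUTHORITY_PREFERENCE = ("worms", "gbif", "col")


def build_identifier_fields(legacy_codes: list, taxon_ids: list) -> dict:
    """Build identifier fields for a species payload.

    `taxon_ids` is a list of dicts with 'authority' and 'identifier' keys.
    Legacy codes are kept as provenance and never promoted to primary.
    """
    ranked = sorted(
        (t for t in taxon_ids if t["authority"] in AUTHORITY_PREFERENCE),
        key=lambda t: AUTHORITY_PREFERENCE.index(t["authority"]),
    )
    primary = ranked[0] if ranked else None
    return {
        "legacy_identifiers": [
            {"value": code, "role": "provenance"} for code in legacy_codes
        ],
        "taxon_identifiers": list(taxon_ids),
        # Assumed expansions of the `primary_taxon_*` field family.
        "primary_taxon_authority": primary["authority"] if primary else None,
        "primary_taxon_identifier": primary["identifier"] if primary else None,
    }
```

A record with no reviewed external identifier simply reports `None` for the primary fields while still carrying its legacy codes.
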
### Phase 5: Editor ergonomics and advanced review

Status: in progress

- Structured Markdown editor is live.
- Citation match-review dialog is live.
- Remaining work:
  - CodeMirror-based Markdown editor with folding
  - inline parser diagnostics in the editor
  - richer citation diff/review affordances
  - clearer document-node and citation provenance in the UI

### Phase 6: Linkages and visualization

Status: not started

- Model predator/prey, habitat, and ecological association edges.
- Add graph endpoints and species-relationship views.
- Support public-friendly visual explanations and expert filters.

### Phase 7: Reports and export

Status: partially implemented

- JSON and Markdown exports exist through the API/document model.
- Structured Markdown is now the primary human-readable editor/export format.
- Remaining work:
  - recreate legacy-like text/RTF export
  - support export profiles for legacy compatibility and standards-forward outputs
  - improve citation/bibliography export fidelity

### Phase 8: Assisted research workflows

Status: planned

- Add local-LLM-assisted extraction and drafting in a human-review loop.
- Integrate bibliography tooling for citation consolidation and topic expansion.
- Support candidate-species intake for records not yet in the historical corpus.
- Restrict assisted drafting and publication actions to authenticated editorial roles.

## Data Model Direction

Initial core entities:

- `species`
- `source_document`
- `document_section`
- `citation`
- `taxon_identifier`
- `citation_identifier`
- `bibliography_topic`
- `taxon`
- `linkage`
- `media_asset`
- `ingest_run`

Key design rules:

- preserve raw source text
- retain provenance and import timestamps
- separate public published records from draft/editor states
- make sections addressable for citation and graph linking
- prefer a canonical document AST over direct projection from free-form source text

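A few of the design rules above can be made concrete with a minimal in-memory sketch. The entity names `source_document` and `document_section` come from the list above. Every field name, the `address` format, and the `public_sections` helper are assumptions for illustration and do not describe the actual PostgreSQL schema.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class SourceDocument:
    """Raw source text is stored verbatim and never mutated (frozen)."""
    species_code: str
    raw_text: str
    imported_at: datetime  # provenance: when the ingest run stored it


@dataclass
class DocumentSection:
    species_code: str
    slug: str
    heading: str
    body: str
    status: str = "draft"  # draft/editor states stay separate from published

    @property
    def address(self) -> str:
        """Stable address so citations and graph edges can point at a section."""
        return f"{self.species_code}#{self.slug}"


def public_sections(sections: list) -> list:
    """Only published sections are visible to the public read API."""
    return [s for s in sections if s.status == "published"]


def ingest(species_code: str, raw_text: str) -> SourceDocument:
    """Record the raw text together with a UTC import timestamp."""
    return SourceDocument(species_code, raw_text, datetime.now(timezone.utc))
```
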
## LLM Extension Strategy

Use local models only for assistive tasks, never silent publication:

- extracting candidate structured fields from new SLH text
- suggesting missing headings or linkage labels
- clustering similar citations
- resolving bibliography entries toward DOI/OpenAlex/DataCite where available
- treating local legacy codes as provenance, not canonical identifiers
- drafting summaries for editor review

Guardrails:

- raw text remains authoritative
- all generated content is marked as draft
- every automated extraction stores source spans where possible

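The guardrails can be expressed as a data shape that generated content must pass through. The sketch below assumes a hypothetical `DraftSuggestion` record and `record_suggestion` helper. It shows only the bookkeeping (draft status and source spans), not any model call.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class DraftSuggestion:
    field_name: str
    value: str
    # (start, end) character offsets into the raw text, or None if not found.
    span: Optional[Tuple[int, int]]
    status: str = "draft"    # generated content is always created as draft
    generated: bool = True   # marks the value as machine-produced


def record_suggestion(raw_text: str, field_name: str, value: str) -> DraftSuggestion:
    """Wrap an extracted value, storing its source span where possible."""
    start = raw_text.find(value)
    span = (start, start + len(value)) if start != -1 else None
    return DraftSuggestion(field_name, value, span)
```

Because the span indexes into the unchanged raw text, an editor can always check a suggestion against the authoritative source before promoting it out of draft.
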
## Near-Term Priorities

1. Add CodeMirror-based folding and structure-aware editing to the Markdown document editor.
2. Expand taxon identifier review workflows for WoRMS, GBIF, Catalogue of Life, and related authorities.
3. Deepen citation quality controls, including better parsed-field visibility and stricter/manual review loops where resolver confidence is weak.
4. Add CiteGeist-style topic expansion and bibliography-suggestion review for under-cited species.
5. Improve document export fidelity so reviewed citations and standards-based identifiers are clearly represented in Markdown and downstream exports.
6. Begin the first ecological-linkage data model and API endpoints once citation/identifier workflows stabilize.

## Definition Of Done For The Initial Milestone

- `docker compose up` starts a working API and frontend.
- The system can enumerate the legacy corpus and show parsed species detail for real SLH files.
- Editors can curate structured Markdown documents and citations through authenticated workflows.
- Contributors can register, create drafts, and edit only their own submissions.
- Project docs describe both the implemented modernization state and the next phases.