Structured Markdown Document Plan
Goal
Replace the current flat, parser-heavy free-form text handling with a document model that is:
- human-readable in plaintext
- editable in the browser with hierarchy folding
- permissive-license friendly
- suitable for first-pass conversion from legacy SLH text files
- suitable as the primary export format for a species life history
- able to project cleanly into a flexible database model with greater hierarchical depth
Recommendation
Adopt a constrained Markdown-based authoring format as the primary human-facing document format, backed by an internal hierarchical document AST and a relational projection layer in PostgreSQL.
Use this three-layer model:
- Source and export format: constrained EcoSpecies Markdown
- Canonical application representation: hierarchical AST
- Database representation: relational projection for querying, indexing, publishing, and editorial workflows
This avoids treating raw free-form text as both the storage format and the parser input.
Why Markdown Instead Of Org
Markdown is the better fit for this codebase and licensing requirement because:
- it is familiar to most users
- it is easier to constrain than Org
- it maps naturally to hierarchical headings
- it works well with CodeMirror folding
- it does not require adopting GPL or AGPL editor code
Org-style authoring remains conceptually attractive, but embedding Org-specific tooling such as organice would introduce copyleft code, which is not aligned with a permissive-only implementation strategy.
EcoSpecies Markdown Profile
The format should be Markdown-like, but intentionally narrower than unrestricted Markdown.
Metadata
Use YAML front matter for canonical metadata fields:
---
title: American Oyster
common_name: American Oyster
scientific_name: Crassostrea virginica
legacy_identifiers:
  - authority: legacy-ecospecies
    identifier: 5192
    label: FLELMR
taxon_identifiers:
  - authority: worms
    identifier: 159059
    label: AphiaID
    primary: true
source_file: American Oyster SLH NOAA SEA.txt
publication_status: published
---
Recommended canonical fields:
- title
- common_name
- scientific_name
- legacy_identifiers
- taxon_identifiers
- primary_taxon_authority
- source_file
- publication_status
- source_format
- legacy_import_id
Hierarchy
Use headings as the sole structure-bearing primitive.
Example:
---
title: American Oyster
common_name: American Oyster
scientific_name: Crassostrea virginica
legacy_identifiers:
  - authority: legacy-ecospecies
    identifier: 5192
    label: FLELMR
---
## Summary
Short editor-reviewed abstract.
## Habitat
### Type
Estuarine.
### Substrate
Hard bottom, shell, mud flats, and other suitable settlement surfaces.
## Reproduction
### Season
Spawning occurs from spring through fall in much of the Gulf.
Rules:
- Heading depth is meaningful.
- Skip-level headings should be rejected or normalized.
- Body text belongs to the nearest preceding heading.
- The # heading level is optional if the document title already exists in front matter.
- Tables, lists, and citations are allowed only where explicitly supported.
- Arbitrary embedded HTML should be disallowed.
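The skip-level rule above can be checked with a small scan over the document body. The following is a minimal sketch, not the project's validator: the function name `find_heading_jumps` is hypothetical, it assumes front matter has already been removed, it treats the first heading as the baseline depth (since the # level is optional), and it does not yet account for heading-like lines inside code blocks.

```python
import re

# ATX-style heading: one to six '#' characters, whitespace, then the title.
HEADING_RE = re.compile(r"^(#{1,6})\s+(.*\S)\s*$")


def find_heading_jumps(markdown_body: str) -> list[tuple[int, str]]:
    """Return (line_number, message) for every heading that skips a level."""
    problems = []
    previous_depth = None
    for line_number, line in enumerate(markdown_body.splitlines(), start=1):
        match = HEADING_RE.match(line)
        if not match:
            continue
        depth = len(match.group(1))
        # Going deeper by more than one level is a skip; going shallower is fine.
        if previous_depth is not None and depth > previous_depth + 1:
            problems.append(
                (line_number,
                 f"heading '{match.group(2)}' jumps from depth "
                 f"{previous_depth} to {depth}")
            )
        previous_depth = depth
    return problems
```

Whether a reported jump is rejected outright or normalized (for example by promoting the heading one level) is left to the caller, matching the "rejected or normalized" wording of the rule.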
Citations
Keep citations readable in Markdown but structured enough to parse.
Preferred first-pass shape:
## Citations
- [7] Ahmed, M. 1975. Speciation in living oysters. Advances in Marine Biology 13:357-397.
- [15] Andrews, J.D. 1979. Pelecypoda: Ostreidae. Reproduction of Marine Invertebrates...
This is intentionally simpler than trying to infer citations from arbitrary prose.
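A bracketed-number list like the one above can be parsed line by line. This is a minimal sketch under the assumption that each entry is a single line of the form "- [N] text"; the names `CITATION_RE` and `parse_citations` are illustrative, and no attempt is made to split authors, year, or title.

```python
import re

# One citation per list item: "- [7] Ahmed, M. 1975. ..."
CITATION_RE = re.compile(r"^-\s+\[(\d+)\]\s+(.+)$")


def parse_citations(lines: list[str]) -> tuple[list[dict], list[str]]:
    """Split citation lines into parsed entries and malformed leftovers."""
    entries, malformed = [], []
    for line in lines:
        stripped = line.strip()
        if not stripped:
            continue  # blank lines between entries are ignored
        match = CITATION_RE.match(stripped)
        if match:
            entries.append(
                {"number": int(match.group(1)), "text": match.group(2).strip()}
            )
        else:
            # Kept verbatim so an editor can review and fix the entry.
            malformed.append(stripped)
    return entries, malformed
```

Returning the malformed lines separately, rather than raising, fits the plan's preference for flagging problems to an editor instead of failing the whole document.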
Canonical AST
Markdown should not be the sole internal representation. Parse it into an AST that preserves hierarchy explicitly.
Example conceptual shape:
{
  "metadata": {
    "title": "American Oyster",
    "common_name": "American Oyster",
    "scientific_name": "Crassostrea virginica",
    "legacy_identifiers": [
      {
        "authority": "legacy-ecospecies",
        "identifier": "5192",
        "label": "FLELMR"
      }
    ]
  },
  "nodes": [
    {
      "id": "n1",
      "type": "section",
      "depth": 2,
      "title": "Summary",
      "body": "Short editor-reviewed abstract.",
      "children": []
    },
    {
      "id": "n2",
      "type": "section",
      "depth": 2,
      "title": "Habitat",
      "body": "",
      "children": [
        {
          "id": "n3",
          "type": "section",
          "depth": 3,
          "title": "Type",
          "body": "Estuarine.",
          "children": []
        }
      ]
    }
  ]
}
Required AST properties:
- arbitrary hierarchical depth
- stable node identifiers
- separate metadata from body structure
- support for editor audit and provenance
- support for extracting source spans from imported legacy text when available
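To illustrate how headings map onto a tree of this shape, here is a minimal sketch of a heading-to-tree parser. The `SectionNode` class and `parse_sections` function are hypothetical names, not existing code. Two simplifications are assumed: identifiers are assigned sequentially in document order (which reproduces n1, n2, n3 for the example but is not stable across edits, so a real implementation would need persistent identifiers), and text before the first heading is ignored.

```python
import re
from dataclasses import dataclass, field

HEADING_RE = re.compile(r"^(#{1,6})\s+(.*\S)\s*$")


@dataclass
class SectionNode:
    id: str
    depth: int
    title: str
    body: str = ""
    children: list["SectionNode"] = field(default_factory=list)
    type: str = "section"


def parse_sections(markdown_body: str) -> list[SectionNode]:
    """Build a section tree; body text attaches to the nearest preceding heading."""
    roots: list[SectionNode] = []
    stack: list[SectionNode] = []      # currently open ancestors, shallowest first
    all_nodes: list[SectionNode] = []  # document order, used for ids and cleanup
    for line in markdown_body.splitlines():
        match = HEADING_RE.match(line)
        if match:
            depth = len(match.group(1))
            # Close every open section at the same or a deeper level.
            while stack and stack[-1].depth >= depth:
                stack.pop()
            node = SectionNode(
                id=f"n{len(all_nodes) + 1}", depth=depth, title=match.group(2)
            )
            (stack[-1].children if stack else roots).append(node)
            stack.append(node)
            all_nodes.append(node)
        elif stack:
            # The top of the stack is always the most recent heading.
            stack[-1].body += line + "\n"
    for node in all_nodes:
        node.body = node.body.strip()
    return roots
```

Because nesting is driven only by relative depth, the same routine handles arbitrary hierarchical depth without a fixed list of section names.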
Database Direction
The current flat document_section model should evolve into a general document tree.
Suggested core tables:
- species_document
- species_document_node
- species_document_node_revision
- species_document_metadata
- citation
- species_document_export
Suggested species_document_node fields:
- id
- document_id
- parent_id
- position
- depth
- node_type
- title
- body_markdown
- body_plaintext
- source_heading
- source_span_start
- source_span_end
This enables:
- greater hierarchical depth
- stable editor operations on subtrees
- future insertion of machine-extracted nested content
- simplified export back to Markdown
Import Flow
The legacy text parser should no longer attempt to infer the final database structure directly.
Instead:
- Parse raw legacy text into a best-effort intermediate tree.
- Normalize extracted metadata.
- Emit constrained Markdown.
- Parse constrained Markdown into AST.
- Persist AST and project relationally.
- Record diagnostics on uncertain conversions.
This changes the parser’s role from “infer final structure perfectly” to “produce a reviewable first draft”.
Editor Flow
The web editor should operate primarily on the Markdown representation, with a structured parse running on save or preview.
Recommended behavior:
- fold by heading depth in CodeMirror
- validate front matter and heading structure
- preview rendered sections
- show parser diagnostics inline
- save both Markdown source and parsed AST
The editor should reject or flag:
- invalid front matter
- duplicate canonical metadata keys
- heading depth jumps
- malformed citation entries in structured sections
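Two of the checks above, invalid front matter and duplicate metadata keys, can be approximated without a YAML library. The sketch below is an assumption-laden illustration: `split_front_matter` and `duplicate_top_level_keys` are hypothetical names, nested keys are assumed to be indented (so only unindented keys count as top-level), and a real implementation would still run a proper YAML parser afterwards.

```python
import re

# An unindented "key:" at the start of a line is treated as a top-level key.
TOP_LEVEL_KEY_RE = re.compile(r"^([A-Za-z_][A-Za-z0-9_]*):")


def split_front_matter(source: str) -> tuple[list[str], str]:
    """Return (front matter lines, body text) or raise ValueError."""
    lines = source.splitlines()
    if not lines or lines[0].strip() != "---":
        raise ValueError("document must start with a '---' delimiter")
    for index in range(1, len(lines)):
        if lines[index].strip() == "---":
            return lines[1:index], "\n".join(lines[index + 1:])
    raise ValueError("front matter is not closed by a second '---' line")


def duplicate_top_level_keys(front_matter_lines: list[str]) -> list[str]:
    """List each top-level key that appears more than once, in first-seen order."""
    seen, duplicates = set(), []
    for line in front_matter_lines:
        match = TOP_LEVEL_KEY_RE.match(line)
        if not match:
            continue  # indented, list-item, or blank line
        key = match.group(1)
        if key in seen and key not in duplicates:
            duplicates.append(key)
        seen.add(key)
    return duplicates
```

The duplicate check runs on raw lines deliberately: a YAML loader may silently keep only the last value for a repeated key, which would hide the problem from the editor.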
Export Policy
Markdown should be the primary export format for a species life history.
Export targets:
- constrained Markdown for editorial interchange
- JSON AST for machine workflows
- derived relational/API payloads for the application
- optional report-oriented exports later
The export path should be:
- database document tree -> canonical AST -> constrained Markdown
This ensures the exported plaintext remains stable and human-readable.
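The last step of that path, canonical AST to constrained Markdown, can be sketched as a recursive render. This is a minimal illustration: `render_markdown` is a hypothetical name, nodes are assumed to be dictionaries in the conceptual AST shape, and front matter output is left out.

```python
def render_markdown(nodes: list[dict]) -> str:
    """Render AST section nodes back to heading-structured Markdown."""
    parts = []
    for node in nodes:
        # Heading depth is taken from the node, not from nesting level,
        # so the output reproduces the depths that were parsed.
        parts.append("#" * node["depth"] + " " + node["title"])
        if node["body"]:
            parts.append(node["body"])
        if node["children"]:
            parts.append(render_markdown(node["children"]))
    return "\n".join(parts)
```

Keeping the renderer deterministic (fixed heading syntax, fixed ordering, no optional whitespace) is what lets an export that is re-imported and exported again produce identical text.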
Migration Strategy
Stage 1: Introduce the document model
- add AST schema and persistence tables
- keep existing section-based reads working
- build Markdown import/export helpers
Stage 2: Convert current parser output
- map current parsed sections into Markdown drafts
- preserve existing metadata and diagnostics
- store generated Markdown alongside current records
Stage 3: Introduce Markdown editor
- add CodeMirror-based editor with heading folding
- add validation for front matter and heading structure
- add round-trip save through AST
Stage 4: Move public reads to the new document model
- generate current API responses from the hierarchical document tree
- keep compatibility shims for legacy flat sections where needed
Stage 5: Expand structured extraction
- add deeper parsing for habitat, reproduction, citations, and linkages
- add richer projections from AST to relational tables
Immediate Implementation Tasks
Recommended first engineering tasks:
- Define the constrained Markdown grammar and validation rules.
- Design the AST schema and PostgreSQL tables.
- Add Markdown import/export utilities in the API service.
- Prototype a CodeMirror editor with heading folding.
- Add a migration command that converts current species records into Markdown drafts.
- Preserve current endpoints while introducing the document-tree backing model.
Non-Goals For The First Pass
- full unrestricted Markdown feature support
- WYSIWYG editing
- arbitrary embedded HTML
- perfect citation parsing from all legacy free text
- replacing every existing API shape immediately
Decision Summary
The planned direction is:
- constrained Markdown as the editable and exportable document format
- internal AST as the canonical application representation
- relational projection for queryable application state
- CodeMirror-based browser editing with heading folding
This is the most practical path toward human-editable hierarchy, permissive-only implementation, cleaner parsing, and deeper long-term document structure.