# Structured Markdown Document Plan

## Goal

Replace the current flat, parser-heavy free-form text handling with a document model that is:

- human-readable in plaintext
- editable in the browser with hierarchy folding
- permissive-license friendly
- suitable for first-pass conversion from legacy species life history (SLH) text files
- suitable as the primary export format for a species life history
- able to project cleanly into a flexible database model with greater hierarchical depth

## Recommendation

Adopt a constrained Markdown-based authoring format as the primary human-facing document format, backed by an internal hierarchical document AST and a relational projection layer in PostgreSQL.

Use this three-layer model:

1. Source and export format: constrained EcoSpecies Markdown
2. Canonical application representation: hierarchical AST
3. Database representation: relational projection for querying, indexing, publishing, and editorial workflows

This avoids treating raw free-form text as both the storage format and the parser input.

## Why Markdown Instead Of Org

Markdown is the better fit for this codebase and licensing requirement because:

- it is familiar to most users
- it is easier to constrain than Org
- it maps naturally to hierarchical headings
- it works well with CodeMirror folding
- it does not require adopting GPL or AGPL editor code

Org-style authoring remains conceptually attractive, but embedding Org-specific tooling such as organice would introduce copyleft code, which is not aligned with a permissive-only implementation strategy.

## EcoSpecies Markdown Profile

The format should be Markdown-like, but intentionally narrower than unrestricted Markdown.

### Metadata

Use YAML front matter for canonical metadata fields:

```md
---
title: American Oyster
common_name: American Oyster
scientific_name: Crassostrea virginica
legacy_identifiers:
  - authority: legacy-ecospecies
    identifier: 5192
    label: FLELMR
taxon_identifiers:
  - authority: worms
    identifier: 159059
    label: AphiaID
    primary: true
source_file: American Oyster SLH NOAA SEA.txt
publication_status: published
---
```

Recommended canonical fields:

- `title`
- `common_name`
- `scientific_name`
- `legacy_identifiers`
- `taxon_identifiers`
- `primary_taxon_authority`
- `source_file`
- `publication_status`
- `source_format`
- `legacy_import_id`
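
As a rough illustration of how the front matter could be separated from the body and checked against the canonical field list, here is a minimal Python sketch (Python 3.9+ assumed). The function names `split_front_matter`, `top_level_keys`, and `unknown_fields` are hypothetical, and the key scan is deliberately shallow: it only looks at unindented `key:` lines and is not a YAML parser. A real implementation would presumably hand the extracted block to a proper YAML library.

```python
# Hypothetical helper names; the field set is copied from the list above.
CANONICAL_FIELDS = {
    "title", "common_name", "scientific_name", "legacy_identifiers",
    "taxon_identifiers", "primary_taxon_authority", "source_file",
    "publication_status", "source_format", "legacy_import_id",
}


def split_front_matter(text: str) -> tuple[str, str]:
    """Return (front_matter, body); front_matter is '' when no block is present."""
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return "", text
    for i in range(1, len(lines)):
        if lines[i].strip() == "---":
            return "\n".join(lines[1:i]), "\n".join(lines[i + 1:])
    raise ValueError("front matter opened with '---' but never closed")


def top_level_keys(front_matter: str) -> list[str]:
    """List top-level keys by scanning unindented 'key:' lines (not a YAML parse)."""
    keys = []
    for line in front_matter.splitlines():
        is_top_level = line and not line[0].isspace() and not line.startswith(("-", "#"))
        if is_top_level and ":" in line:
            keys.append(line.split(":", 1)[0].strip())
    return keys


def unknown_fields(front_matter: str) -> list[str]:
    """Return top-level keys that are not in the canonical field list."""
    return [key for key in top_level_keys(front_matter) if key not in CANONICAL_FIELDS]
```

Whether unknown keys should be rejected outright or only flagged for review is a policy choice this sketch leaves open.
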

### Hierarchy

Use headings as the sole structure-bearing primitive.

Example:

```md
---
title: American Oyster
common_name: American Oyster
scientific_name: Crassostrea virginica
legacy_identifiers:
  - authority: legacy-ecospecies
    identifier: 5192
    label: FLELMR
---

## Summary
Short editor-reviewed abstract.

## Habitat

### Type
Estuarine.

### Substrate
Hard bottom, shell, mud flats, and other suitable settlement surfaces.

## Reproduction

### Season
Spawning occurs from spring through fall in much of the Gulf.
```

Rules:

- Heading depth is meaningful.
- Skip-level headings (for example, `##` followed directly by `####`) should be rejected or normalized.
- Body text belongs to the nearest preceding heading.
- A top-level `#` heading is optional if the document title already exists in front matter.
- Tables, lists, and citations are allowed only where explicitly supported.
- Arbitrary embedded HTML should be disallowed.
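
The heading rules above could be checked with a small line scanner. The following Python sketch is one possible shape, not a settled design: `scan_headings` is a hypothetical name, the first heading in a document is accepted at any depth (since `#` is optional), and lines inside fenced code blocks are never treated as headings.

```python
import re

# ATX-style heading: one to six '#' characters, whitespace, then the title.
HEADING_RE = re.compile(r"^(#{1,6})\s+(.+?)\s*$")


def scan_headings(body: str) -> tuple[list[dict], list[str]]:
    """Return (sections, diagnostics) for a Markdown body without front matter."""
    sections: list[dict] = []
    diagnostics: list[str] = []
    in_fence = False
    prev_depth = None
    for lineno, line in enumerate(body.splitlines(), start=1):
        if line.lstrip().startswith("```"):
            in_fence = not in_fence  # toggle on opening and closing fences
        match = None if in_fence else HEADING_RE.match(line)
        if match:
            depth = len(match.group(1))
            # A heading may go at most one level deeper than the previous one.
            if prev_depth is not None and depth > prev_depth + 1:
                diagnostics.append(
                    f"line {lineno}: heading depth jumps from {prev_depth} to {depth}"
                )
            sections.append({"depth": depth, "title": match.group(2), "body": []})
            prev_depth = depth
        elif sections:
            # Body text belongs to the nearest preceding heading.
            sections[-1]["body"].append(line)
        elif line.strip():
            diagnostics.append(f"line {lineno}: body text before first heading")
    for section in sections:
        section["body"] = "\n".join(section["body"]).strip()
    return sections, diagnostics
```

This version only reports depth jumps. Normalizing them instead (for example, by demoting `####` to `###`) would be a separate step with its own editorial trade-offs.
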

### Citations

Keep citations readable in Markdown but structured enough to parse.

Preferred first-pass shape:

```md
## Citations

- [7] Ahmed, M. 1975. Speciation in living oysters. Advances in Marine Biology 13:357-397.
- [15] Andrews, J.D. 1979. Pelecypoda: Ostreidae. Reproduction of Marine Invertebrates...
```

This is intentionally simpler than trying to infer citations from arbitrary prose.
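
With entries constrained to the `- [N] text` shape, parsing can be a single regular expression per line. The sketch below assumes exactly that shape and nothing more. `parse_citations` is a hypothetical name, and the citation text is kept as an opaque string rather than split into author, year, and title.

```python
import re

# One citation per list item: "- [<number>] <free text>".
CITATION_RE = re.compile(r"^-\s+\[(\d+)\]\s+(.+)$")


def parse_citations(section_body: str) -> tuple[list[dict], list[str]]:
    """Parse the body of a Citations section into entries plus diagnostics."""
    entries: list[dict] = []
    diagnostics: list[str] = []
    for lineno, line in enumerate(section_body.splitlines(), start=1):
        stripped = line.strip()
        if not stripped:
            continue  # blank lines between entries are fine
        match = CITATION_RE.match(stripped)
        if match:
            entries.append({"number": int(match.group(1)), "text": match.group(2).strip()})
        else:
            diagnostics.append(f"line {lineno}: malformed citation entry")
    return entries, diagnostics
```

Lines that do not match are reported rather than guessed at, in keeping with the goal of a reviewable first draft.
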

## Canonical AST

Markdown should not be the sole internal representation. Parse it into an AST that preserves hierarchy explicitly.

Example conceptual shape:

```json
{
  "metadata": {
    "title": "American Oyster",
    "common_name": "American Oyster",
    "scientific_name": "Crassostrea virginica",
    "legacy_identifiers": [
      {
        "authority": "legacy-ecospecies",
        "identifier": "5192",
        "label": "FLELMR"
      }
    ]
  },
  "nodes": [
    {
      "id": "n1",
      "type": "section",
      "depth": 2,
      "title": "Summary",
      "body": "Short editor-reviewed abstract.",
      "children": []
    },
    {
      "id": "n2",
      "type": "section",
      "depth": 2,
      "title": "Habitat",
      "body": "",
      "children": [
        {
          "id": "n3",
          "type": "section",
          "depth": 3,
          "title": "Type",
          "body": "Estuarine.",
          "children": []
        }
      ]
    }
  ]
}
```

Required AST properties:

- arbitrary hierarchical depth
- stable node identifiers
- separate metadata from body structure
- support for editor audit and provenance
- support for extracting source spans from imported legacy text when available
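
To make the nesting concrete, here is a hedged Python sketch (Python 3.9+ assumed) that turns a flat, document-ordered list of `(depth, title, body)` tuples into a tree shaped like the JSON above. The `Node` class and `build_tree` function are illustrative names only. The sequential `n1`, `n2`, ... ids mirror the example but are not stable across edits, so the stable-identifier requirement would need ids that are persisted rather than regenerated.

```python
from dataclasses import dataclass, field, asdict


@dataclass
class Node:
    id: str
    type: str
    depth: int
    title: str
    body: str
    children: list["Node"] = field(default_factory=list)


def build_tree(flat: list[tuple[int, str, str]]) -> list[Node]:
    """Nest (depth, title, body) tuples, given in document order, by heading depth."""
    roots: list[Node] = []
    stack: list[Node] = []  # currently open ancestors, shallowest first
    for index, (depth, title, body) in enumerate(flat, start=1):
        # Illustrative ids only: they change whenever sections are reordered.
        node = Node(id=f"n{index}", type="section", depth=depth, title=title, body=body)
        # Close every open section that is at the same depth or deeper.
        while stack and stack[-1].depth >= depth:
            stack.pop()
        (stack[-1].children if stack else roots).append(node)
        stack.append(node)
    return roots
```

Calling `asdict` on a root node yields nested dictionaries with the same keys as the conceptual JSON, which is convenient for serializing the AST.
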

## Database Direction

The current flat `document_section` model should evolve into a general document tree.

Suggested core tables:

- `species_document`
- `species_document_node`
- `species_document_node_revision`
- `species_document_metadata`
- `citation`
- `species_document_export`

Suggested `species_document_node` fields:

- `id`
- `document_id`
- `parent_id`
- `position`
- `depth`
- `node_type`
- `title`
- `body_markdown`
- `body_plaintext`
- `source_heading`
- `source_span_start`
- `source_span_end`

This enables:

- greater hierarchical depth
- stable editor operations on subtrees
- future insertion of machine-extracted nested content
- simplified export back to Markdown
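
One way to picture the relational projection is as a depth-first flattening of the AST into adjacency-list rows keyed by `parent_id` and `position`. The Python sketch below is an assumption-laden illustration: `project_rows` is a hypothetical name, ids are generated in-process rather than by PostgreSQL, and only a subset of the suggested `species_document_node` fields is populated.

```python
from itertools import count


def project_rows(document_id: int, nodes: list[dict]) -> list[dict]:
    """Flatten nested AST nodes (dicts with 'children') into node-table rows."""
    rows: list[dict] = []
    next_id = count(1)  # stand-in for database-generated ids

    def visit(node: dict, parent_id, position: int) -> None:
        row_id = next(next_id)
        rows.append({
            "id": row_id,
            "document_id": document_id,
            "parent_id": parent_id,   # None for top-level sections
            "position": position,     # order among siblings
            "depth": node["depth"],
            "node_type": node["type"],
            "title": node["title"],
            "body_markdown": node["body"],
        })
        for child_position, child in enumerate(node["children"]):
            visit(child, row_id, child_position)

    for position, node in enumerate(nodes):
        visit(node, None, position)
    return rows
```

Storing `parent_id` together with `position` is what lets a subtree be moved or reordered by updating a handful of rows instead of rewriting the whole document.
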

## Import Flow

The legacy text parser should no longer attempt to infer the final database structure directly.

Instead:

1. Parse raw legacy text into a best-effort intermediate tree.
2. Normalize extracted metadata.
3. Emit constrained Markdown.
4. Parse constrained Markdown into AST.
5. Persist AST and project relationally.
6. Record diagnostics on uncertain conversions.

This changes the parser’s role from “infer final structure perfectly” to “produce a reviewable first draft”.

## Editor Flow

The web editor should operate primarily on the Markdown representation, with a structured parse running on save or preview.

Recommended behavior:

- fold by heading depth in CodeMirror
- validate front matter and heading structure
- preview rendered sections
- show parser diagnostics inline
- save both Markdown source and parsed AST

The editor should reject or flag:

- invalid front matter
- duplicate canonical metadata keys
- heading depth jumps
- malformed citation entries in structured sections
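
Duplicate metadata keys are worth checking on the raw text, since a YAML loader may silently keep only one of the values. The following Python sketch shows one way such a save-time check might look. `front_matter_diagnostics` is a hypothetical name, and the scan only considers unindented `key:` lines, so it is a pre-check rather than a replacement for full YAML validation.

```python
from collections import Counter


def front_matter_diagnostics(text: str) -> list[str]:
    """Return human-readable diagnostics for the front matter of a document."""
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return ["missing front matter"]
    try:
        end = next(i for i in range(1, len(lines)) if lines[i].strip() == "---")
    except StopIteration:
        return ["front matter is not closed"]
    # Collect top-level keys only; nested list items and comments are skipped.
    keys = [
        line.split(":", 1)[0].strip()
        for line in lines[1:end]
        if line and not line[0].isspace() and not line.startswith(("-", "#")) and ":" in line
    ]
    counts = Counter(keys)
    return [f"duplicate metadata key: {key}" for key, n in counts.items() if n > 1]
```

The returned strings could be surfaced inline in the editor alongside heading and citation diagnostics.
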

## Export Policy

Markdown should be the primary export format for a species life history.

Export targets:

- constrained Markdown for editorial interchange
- JSON AST for machine workflows
- derived relational/API payloads for the application
- optional report-oriented exports later

The export path should be:

- database document tree -> canonical AST -> constrained Markdown

This ensures the exported plaintext remains stable and human-readable.
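
The last hop of that path, canonical AST to constrained Markdown, can be a small deterministic renderer. The Python sketch below is illustrative only: `render_markdown` is a hypothetical name, metadata is limited to scalar values (nested fields such as `legacy_identifiers` would need a real YAML emitter), and the spacing follows the hierarchy example earlier in this plan.

```python
def render_markdown(metadata: dict, nodes: list[dict]) -> str:
    """Render scalar metadata plus nested section nodes as constrained Markdown."""
    lines = ["---"]
    for key, value in metadata.items():
        lines.append(f"{key}: {value}")  # scalars only in this sketch
    lines.append("---")

    def emit(node: dict) -> None:
        lines.append("")  # blank line before every heading
        lines.append("#" * node["depth"] + " " + node["title"])
        if node["body"]:
            lines.append(node["body"])
        for child in node["children"]:
            emit(child)

    for node in nodes:
        emit(node)
    return "\n".join(lines) + "\n"
```

Because the output depends only on the AST, exporting the same tree twice yields byte-identical text, which keeps diffs between exports meaningful.
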

## Migration Strategy

### Stage 1: Introduce the document model

- add AST schema and persistence tables
- keep existing section-based reads working
- build Markdown import/export helpers

### Stage 2: Convert current parser output

- map current parsed sections into Markdown drafts
- preserve existing metadata and diagnostics
- store generated Markdown alongside current records

### Stage 3: Introduce Markdown editor

- add CodeMirror-based editor with heading folding
- add validation for front matter and heading structure
- add round-trip save through AST

### Stage 4: Move public reads to the new document model

- generate current API responses from the hierarchical document tree
- keep compatibility shims for legacy flat sections where needed

### Stage 5: Expand structured extraction

- add deeper parsing for habitat, reproduction, citations, and linkages
- add richer projections from AST to relational tables

## Immediate Implementation Tasks

Recommended first engineering tasks:

1. Define the constrained Markdown grammar and validation rules.
2. Design the AST schema and PostgreSQL tables.
3. Add Markdown import/export utilities in the API service.
4. Prototype a CodeMirror editor with heading folding.
5. Add a migration command that converts current species records into Markdown drafts.
6. Preserve current endpoints while introducing the document-tree backing model.

## Non-Goals For The First Pass

- full unrestricted Markdown feature support
- WYSIWYG editing
- arbitrary embedded HTML
- perfect citation parsing from all legacy free text
- replacing every existing API shape immediately

## Decision Summary

The planned direction is:

- constrained Markdown as the editable and exportable document format
- internal AST as the canonical application representation
- relational projection for queryable application state
- CodeMirror-based browser editing with heading folding

This is the most practical path toward human-editable hierarchy, permissive-only implementation, cleaner parsing, and deeper long-term document structure.