# Structured Markdown Document Plan

## Goal

Replace the current flat, parser-heavy free-form text handling with a document model that is:

- human-readable in plaintext
- editable in the browser with hierarchy folding
- permissive-license friendly
- suitable for first-pass conversion from legacy SLH text files
- suitable as the primary export format for a species life history
- able to project cleanly into a flexible database model with greater hierarchical depth

## Recommendation

Adopt a constrained Markdown-based authoring format as the primary human-facing document format, backed by an internal hierarchical document AST and a relational projection layer in PostgreSQL.

Use this three-layer model:

1. Source and export format: constrained EcoSpecies Markdown
2. Canonical application representation: hierarchical AST
3. Database representation: relational projection for querying, indexing, publishing, and editorial workflows

This avoids treating raw free-form text as both the storage format and the parser input.

## Why Markdown Instead Of Org

Markdown is the better fit for this codebase and licensing requirement because:

- it is familiar to most users
- it is easier to constrain than Org
- it maps naturally to hierarchical headings
- it works well with CodeMirror folding
- it does not require adopting GPL or AGPL editor code

Org-style authoring remains conceptually attractive, but embedding Org-specific tooling such as organice would introduce copyleft code, which is not aligned with a permissive-only implementation strategy.

## EcoSpecies Markdown Profile

The format should be Markdown-like, but intentionally narrower than unrestricted Markdown.

### Metadata

Use YAML front matter for canonical metadata fields:

```md
---
title: American Oyster
common_name: American Oyster
scientific_name: Crassostrea virginica
legacy_identifiers:
  - authority: legacy-ecospecies
    identifier: 5192
    label: FLELMR
taxon_identifiers:
  - authority: worms
    identifier: 159059
    label: AphiaID
    primary: true
source_file: American Oyster SLH NOAA SEA.txt
publication_status: published
---
```

Recommended canonical fields:

- `title`
- `common_name`
- `scientific_name`
- `legacy_identifiers`
- `taxon_identifiers`
- `primary_taxon_authority`
- `source_file`
- `publication_status`
- `source_format`
- `legacy_import_id`

### Hierarchy

Use headings as the sole structure-bearing primitive.

Example:

```md
---
title: American Oyster
common_name: American Oyster
scientific_name: Crassostrea virginica
legacy_identifiers:
  - authority: legacy-ecospecies
    identifier: 5192
    label: FLELMR
---

## Summary

Short editor-reviewed abstract.

## Habitat

### Type

Estuarine.

### Substrate

Hard bottom, shell, mud flats, and other suitable settlement surfaces.

## Reproduction

### Season

Spawning occurs from spring through fall in much of the Gulf.
```

Rules:

- Heading depth is meaningful.
- Skip-level headings should be rejected or normalized.
- Body text belongs to the nearest preceding heading.
- `#` level is optional if the document title already exists in front matter.
- Tables, lists, and citations are allowed only where explicitly supported.
- Arbitrary embedded HTML should be disallowed.

### Citations

Keep citations readable in Markdown but structured enough to parse.

Preferred first-pass shape:

```md
## Citations

- [7] Ahmed, M. 1975. Speciation in living oysters. Advances in Marine Biology 13:357-397.
- [15] Andrews, J.D. 1979. Pelecypoda: Ostreidae. Reproduction of Marine Invertebrates...
```

This is intentionally simpler than trying to infer citations from arbitrary prose.

## Canonical AST

Markdown should not be the sole internal representation.
Parse it into an AST that preserves hierarchy explicitly.

Example conceptual shape:

```json
{
  "metadata": {
    "title": "American Oyster",
    "common_name": "American Oyster",
    "scientific_name": "Crassostrea virginica",
    "legacy_identifiers": [
      {
        "authority": "legacy-ecospecies",
        "identifier": "5192",
        "label": "FLELMR"
      }
    ]
  },
  "nodes": [
    {
      "id": "n1",
      "type": "section",
      "depth": 2,
      "title": "Summary",
      "body": "Short editor-reviewed abstract.",
      "children": []
    },
    {
      "id": "n2",
      "type": "section",
      "depth": 2,
      "title": "Habitat",
      "body": "",
      "children": [
        {
          "id": "n3",
          "type": "section",
          "depth": 3,
          "title": "Type",
          "body": "Estuarine.",
          "children": []
        }
      ]
    }
  ]
}
```

Required AST properties:

- arbitrary hierarchical depth
- stable node identifiers
- separate metadata from body structure
- support for editor audit and provenance
- support for extracting source spans from imported legacy text when available

## Database Direction

The current flat `document_section` model should evolve into a general document tree.

Suggested core tables:

- `species_document`
- `species_document_node`
- `species_document_node_revision`
- `species_document_metadata`
- `citation`
- `species_document_export`

Suggested `species_document_node` fields:

- `id`
- `document_id`
- `parent_id`
- `position`
- `depth`
- `node_type`
- `title`
- `body_markdown`
- `body_plaintext`
- `source_heading`
- `source_span_start`
- `source_span_end`

This enables:

- greater hierarchical depth
- stable editor operations on subtrees
- future insertion of machine-extracted nested content
- simplified export back to Markdown

## Import Flow

The legacy text parser should no longer attempt to infer the final database structure directly.

Instead:

1. Parse raw legacy text into a best-effort intermediate tree.
2. Normalize extracted metadata.
3. Emit constrained Markdown.
4. Parse constrained Markdown into AST.
5. Persist AST and project relationally.
6. Record diagnostics on uncertain conversions.

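
Step 4 of the flow above (constrained Markdown to AST) can be sketched roughly as follows. This is a minimal illustration, not a settled design: the function name `parse_sections` is hypothetical, front matter is skipped rather than parsed, the sequential `n1`, `n2`, ... ids are placeholders rather than the stable identifiers the AST ultimately needs, and `#` lines inside fenced code blocks are not handled.

```python
import re

# ATX heading: one to six '#' characters, whitespace, then the title text.
HEADING_RE = re.compile(r"^(#{1,6})\s+(.*\S)\s*$")


def parse_sections(markdown: str) -> list[dict]:
    """Parse heading-structured Markdown into nested section nodes."""
    lines = markdown.splitlines()

    # Skip a leading YAML front matter block (parsing it is out of scope here).
    if lines and lines[0].strip() == "---":
        for i in range(1, len(lines)):
            if lines[i].strip() == "---":
                lines = lines[i + 1:]
                break

    roots: list[dict] = []
    stack: list[dict] = []  # currently open sections, shallowest first
    counter = 0

    for line in lines:
        match = HEADING_RE.match(line)
        if match:
            counter += 1
            node = {
                "id": f"n{counter}",  # placeholder id, not stable across edits
                "type": "section",
                "depth": len(match.group(1)),
                "title": match.group(2),
                "body": "",
                "children": [],
            }
            # Close every open section at the same or a deeper level.
            while stack and stack[-1]["depth"] >= node["depth"]:
                stack.pop()
            (stack[-1]["children"] if stack else roots).append(node)
            stack.append(node)
        elif stack:
            # Body text belongs to the nearest preceding heading.
            stack[-1]["body"] += line + "\n"

    def strip_bodies(nodes: list[dict]) -> None:
        for n in nodes:
            n["body"] = n["body"].strip()
            strip_bodies(n["children"])

    strip_bodies(roots)
    return roots
```

The stack mirrors the rule that body text attaches to the nearest preceding heading; a production parser would more likely build on an existing Markdown library and attach source spans to each node.
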

This flow changes the parser’s role from “infer final structure perfectly” to “produce a reviewable first draft”.

## Editor Flow

The web editor should operate primarily on the Markdown representation, with a structured parse running on save or preview.

Recommended behavior:

- fold by heading depth in CodeMirror
- validate front matter and heading structure
- preview rendered sections
- show parser diagnostics inline
- save both Markdown source and parsed AST

The editor should reject or flag:

- invalid front matter
- duplicate canonical metadata keys
- heading depth jumps
- malformed citation entries in structured sections

## Export Policy

Markdown should be the primary export format for a species life history.

Export targets:

- constrained Markdown for editorial interchange
- JSON AST for machine workflows
- derived relational/API payloads for the application
- optional report-oriented exports later

The export path should be:

- database document tree -> canonical AST -> constrained Markdown

This ensures the exported plaintext remains stable and human-readable.

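
The last hop of that export path (canonical AST to constrained Markdown) might look like the sketch below. The function name `render_markdown` is hypothetical, the node dictionaries are assumed to carry the `depth`, `title`, `body`, and `children` keys from the conceptual AST shape, and front matter emission is left out.

```python
def render_markdown(nodes: list[dict]) -> str:
    """Serialize section nodes to constrained Markdown, depth-first."""
    blocks: list[str] = []

    def visit(node: dict) -> None:
        # The heading level comes straight from the stored depth.
        blocks.append("#" * node["depth"] + " " + node["title"])
        if node["body"]:
            blocks.append(node["body"])
        for child in node["children"]:
            visit(child)

    for node in nodes:
        visit(node)

    # Exactly one blank line between blocks keeps repeated exports diff-friendly.
    return "\n\n".join(blocks) + "\n"
```

Emitting one fixed layout, whatever spacing the author originally typed, is one way to keep exports stable from one save to the next.
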
## Migration Strategy

### Stage 1: Introduce the document model

- add AST schema and persistence tables
- keep existing section-based reads working
- build Markdown import/export helpers

### Stage 2: Convert current parser output

- map current parsed sections into Markdown drafts
- preserve existing metadata and diagnostics
- store generated Markdown alongside current records

### Stage 3: Introduce Markdown editor

- add CodeMirror-based editor with heading folding
- add validation for front matter and heading structure
- add round-trip save through AST

### Stage 4: Move public reads to the new document model

- generate current API responses from the hierarchical document tree
- keep compatibility shims for legacy flat sections where needed

### Stage 5: Expand structured extraction

- add deeper parsing for habitat, reproduction, citations, and linkages
- add richer projections from AST to relational tables

## Immediate Implementation Tasks

Recommended first engineering tasks:

1. Define the constrained Markdown grammar and validation rules.
2. Design the AST schema and PostgreSQL tables.
3. Add Markdown import/export utilities in the API service.
4. Prototype a CodeMirror editor with heading folding.
5. Add a migration command that converts current species records into Markdown drafts.
6. Preserve current endpoints while introducing the document-tree backing model.

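
As a possible starting point for task 1, one of the validation rules (flagging heading depth jumps) could be prototyped as below. The function name `find_heading_jumps` and the message wording are hypothetical. The sketch treats any first heading as valid, since the `#` level is optional, and it does not skip front matter, so a YAML comment line there would be misread as a heading.

```python
import re

# ATX heading: one to six '#' characters followed by whitespace and text.
HEADING_RE = re.compile(r"^(#{1,6})\s+\S")


def find_heading_jumps(markdown: str) -> list[tuple[int, str]]:
    """Return (line_number, message) for each heading that skips a level."""
    problems: list[tuple[int, str]] = []
    previous_depth = None
    in_fence = False

    for number, line in enumerate(markdown.splitlines(), start=1):
        # Ignore '#' lines inside fenced code blocks.
        if line.lstrip().startswith("```"):
            in_fence = not in_fence
            continue
        if in_fence:
            continue

        match = HEADING_RE.match(line)
        if not match:
            continue

        depth = len(match.group(1))
        # Going deeper by more than one level is a jump; going shallower is fine.
        if previous_depth is not None and depth > previous_depth + 1:
            problems.append(
                (number, f"heading jumps from level {previous_depth} to level {depth}")
            )
        previous_depth = depth

    return problems
```

Returning line numbers with each message would let the editor surface these as inline diagnostics instead of rejecting the whole document.
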
## Non-Goals For The First Pass

- full unrestricted Markdown feature support
- WYSIWYG editing
- arbitrary embedded HTML
- perfect citation parsing from all legacy free text
- replacing every existing API shape immediately

## Decision Summary

The planned direction is:

- constrained Markdown as the editable and exportable document format
- internal AST as the canonical application representation
- relational projection for queryable application state
- CodeMirror-based browser editing with heading folding

This is the most practical path toward human-editable hierarchy, permissive-only implementation, cleaner parsing, and deeper long-term document structure.