Structured Markdown Document Plan

Goal

Replace the current flat, parser-heavy free-form text handling with a document model that is:

human-readable in plaintext
editable in the browser with hierarchy folding
permissive-license friendly
suitable for first-pass conversion from legacy SLH text files
suitable as the primary export format for a species life history
able to project cleanly into a flexible database model with greater hierarchical depth

Recommendation

Adopt a constrained Markdown-based authoring format as the primary human-facing document format, backed by an internal hierarchical document AST and a relational projection layer in PostgreSQL.

Use this three-layer model:

Source and export format: constrained EcoSpecies Markdown
Canonical application representation: hierarchical AST
Database representation: relational projection for querying, indexing, publishing, and editorial workflows

This avoids treating raw free-form text as both the storage format and the parser input.

Why Markdown Instead Of Org

Markdown is the better fit for this codebase and licensing requirement because:

it is familiar to most users
it is easier to constrain than Org
it maps naturally to hierarchical headings
it works well with CodeMirror folding
it does not require adopting GPL or AGPL editor code

Org-style authoring remains conceptually attractive, but embedding Org-specific tooling such as organice would introduce copyleft code, which is not aligned with a permissive-only implementation strategy.

EcoSpecies Markdown Profile

The format should be Markdown-like, but intentionally narrower than unrestricted Markdown.

Metadata

Use YAML front matter for canonical metadata fields:

---
title: American Oyster
common_name: American Oyster
scientific_name: Crassostrea virginica
legacy_identifiers:
  - authority: legacy-ecospecies
    identifier: "5192"
    label: FLELMR
taxon_identifiers:
  - authority: worms
    identifier: "159059"
    label: AphiaID
    primary: true
source_file: American Oyster SLH NOAA SEA.txt
publication_status: published
---

Recommended canonical fields:

title
common_name
scientific_name
legacy_identifiers
taxon_identifiers
primary_taxon_authority
source_file
publication_status
source_format
legacy_import_id
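
As a rough illustration of how the front matter block might be separated from the body and checked against the canonical field list, here is a minimal sketch. It deliberately avoids a full YAML parse and only looks at unindented `key:` lines; a real implementation would presumably hand the extracted block to a YAML library. The function names (`split_front_matter`, `top_level_keys`, `unknown_fields`) are hypothetical and not part of any existing code.

```python
# Canonical field names copied from the list above.
CANONICAL_FIELDS = {
    "title", "common_name", "scientific_name", "legacy_identifiers",
    "taxon_identifiers", "primary_taxon_authority", "source_file",
    "publication_status", "source_format", "legacy_import_id",
}


def split_front_matter(text: str) -> tuple[str, str]:
    """Return (front_matter, body); raise ValueError if the block is missing or unterminated."""
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        raise ValueError("document must start with a '---' front matter fence")
    for i in range(1, len(lines)):
        if lines[i].strip() == "---":
            return "\n".join(lines[1:i]), "\n".join(lines[i + 1:])
    raise ValueError("front matter is not terminated by '---'")


def top_level_keys(front_matter: str) -> list[str]:
    """Collect unindented `key:` names (assumption: no full YAML parse is needed for this check)."""
    keys = []
    for line in front_matter.splitlines():
        if not line or line[0].isspace() or line.startswith(("-", "#")):
            continue  # skip blanks, nested values, list items, comments
        if ":" in line:
            keys.append(line.split(":", 1)[0].strip())
    return keys


def unknown_fields(text: str) -> list[str]:
    """List top-level metadata keys that are not in the canonical set."""
    front_matter, _body = split_front_matter(text)
    return [key for key in top_level_keys(front_matter) if key not in CANONICAL_FIELDS]
```

Whether unknown keys should be rejected or merely flagged is a policy choice this sketch leaves open.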

Hierarchy

Use headings as the sole structure-bearing primitive.

Example:

---
title: American Oyster
common_name: American Oyster
scientific_name: Crassostrea virginica
legacy_identifiers:
  - authority: legacy-ecospecies
    identifier: "5192"
    label: FLELMR
---

## Summary
Short editor-reviewed abstract.

## Habitat

### Type
Estuarine.

### Substrate
Hard bottom, shell, mud flats, and other suitable settlement surfaces.

## Reproduction

### Season
Spawning occurs from spring through fall in much of the Gulf.

Rules:

Heading depth is meaningful.
Skip-level headings (for example, a #### placed directly under a ##) should be rejected or normalized.
Body text belongs to the nearest preceding heading.
A top-level # heading is optional when the document title already exists in front matter.
Tables, lists, and citations are allowed only where explicitly supported.
Arbitrary embedded HTML should be disallowed.
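
The heading rules above can be sketched as a small stack-based parser that attaches body text to the nearest preceding heading and reports depth jumps. This is only a sketch under stated assumptions: it handles `#`-style headings only, caps depth at six levels as CommonMark does, does not special-case headings inside code blocks, and flags a jump only relative to an open parent heading. The name `parse_sections` and the diagnostic wording are hypothetical.

```python
import re

HEADING_RE = re.compile(r"^(#{1,6})\s+(.+?)\s*$")


def _finalize(node: dict) -> None:
    """Join collected body lines into a single trimmed string, recursively."""
    node["body"] = "\n".join(node.pop("lines")).strip()
    for child in node["children"]:
        _finalize(child)


def parse_sections(body: str) -> tuple[list[dict], list[str]]:
    """Build a nested section tree from heading lines; return (nodes, diagnostics)."""
    roots: list[dict] = []
    stack: list[dict] = []  # currently open ancestors, shallowest first
    diagnostics: list[str] = []
    for lineno, line in enumerate(body.splitlines(), start=1):
        match = HEADING_RE.match(line)
        if match is None:
            if stack:
                stack[-1]["lines"].append(line)  # text belongs to nearest preceding heading
            elif line.strip():
                diagnostics.append(f"line {lineno}: text before first heading")
            continue
        depth = len(match.group(1))
        while stack and stack[-1]["depth"] >= depth:
            stack.pop()  # close siblings and deeper sections
        if stack and depth > stack[-1]["depth"] + 1:
            diagnostics.append(
                f"line {lineno}: heading depth jumps from {stack[-1]['depth']} to {depth}"
            )
        node = {"depth": depth, "title": match.group(2), "lines": [], "children": []}
        (stack[-1]["children"] if stack else roots).append(node)
        stack.append(node)
    for root in roots:
        _finalize(root)
    return roots, diagnostics
```

This version reports a skip-level heading and still nests it under the open parent; rejecting the document outright, or normalizing the depth, would be equally valid readings of the rule.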

Citations

Keep citations readable in Markdown but structured enough to parse.

Preferred first-pass shape:

## Citations

- [7] Ahmed, M. 1975. Speciation in living oysters. Advances in Marine Biology 13:357-397.
- [15] Andrews, J.D. 1979. Pelecypoda: Ostreidae. Reproduction of Marine Invertebrates...

This is intentionally simpler than trying to infer citations from arbitrary prose.
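
To show why the bracketed-number shape is easy to parse, here is a minimal sketch that reads one citation list item. It assumes each entry is a single `- [n] text` line; the year lookup is a best-effort convenience rather than part of the proposed format, and the names `parse_citation`, `ENTRY_RE`, and `YEAR_RE` are hypothetical.

```python
import re
from typing import Optional

# "- [7] Ahmed, M. 1975. ..." -> citation number and free-text reference
ENTRY_RE = re.compile(r"^-\s+\[(\d+)\]\s+(\S.*)$")
# Best-effort: first standalone four-digit year between 1500 and 2099.
YEAR_RE = re.compile(r"\b(1[5-9]\d{2}|20\d{2})\b")


def parse_citation(line: str) -> Optional[dict]:
    """Parse one citation list item; return None when the line is malformed."""
    match = ENTRY_RE.match(line.strip())
    if match is None:
        return None
    text = match.group(2).strip()
    year = YEAR_RE.search(text)
    return {
        "number": int(match.group(1)),
        "text": text,
        "year": int(year.group(1)) if year else None,
    }
```

A `None` result gives the editor a simple hook for flagging a malformed entry without guessing at its structure.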

Canonical AST

Markdown should not be the sole internal representation. Parse it into an AST that preserves hierarchy explicitly.

Example conceptual shape:

{
  "metadata": {
    "title": "American Oyster",
    "common_name": "American Oyster",
    "scientific_name": "Crassostrea virginica",
    "legacy_identifiers": [
      {
        "authority": "legacy-ecospecies",
        "identifier": "5192",
        "label": "FLELMR"
      }
    ]
  },
  "nodes": [
    {
      "id": "n1",
      "type": "section",
      "depth": 2,
      "title": "Summary",
      "body": "Short editor-reviewed abstract.",
      "children": []
    },
    {
      "id": "n2",
      "type": "section",
      "depth": 2,
      "title": "Habitat",
      "body": "",
      "children": [
        {
          "id": "n3",
          "type": "section",
          "depth": 3,
          "title": "Type",
          "body": "Estuarine.",
          "children": []
        }
      ]
    }
  ]
}

Required AST properties:

arbitrary hierarchical depth
stable node identifiers
separation of metadata from body structure
support for editor audit and provenance
support for extracting source spans from imported legacy text when available
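
One possible in-memory shape for a node that satisfies these properties is sketched below. It is an assumption-laden illustration, not the planned schema: the `SectionNode` class, the `source_span` field, and the use of random UUIDs for identifiers are all hypothetical choices (the conceptual example above uses short ids such as "n1"). The point is only that an id is assigned once at creation and then travels with the node.

```python
import uuid
from dataclasses import asdict, dataclass, field
from typing import Optional


@dataclass
class SectionNode:
    title: str
    depth: int
    body: str = ""
    children: list["SectionNode"] = field(default_factory=list)
    # Assigned once at creation; persisting it is what keeps it stable across edits.
    id: str = field(default_factory=lambda: uuid.uuid4().hex)
    type: str = "section"
    # (start, end) character offsets into the legacy source text, when known.
    source_span: Optional[tuple[int, int]] = None


def to_payload(nodes: list[SectionNode]) -> list[dict]:
    """Convert a node list into plain dicts mirroring the conceptual JSON shape."""
    return [asdict(node) for node in nodes]
```

Audit and provenance data are not modeled here; they could live in a separate revision record keyed by the node id.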

Database Direction

The current flat document_section model should evolve into a general document tree.

Suggested core tables:

species_document
species_document_node
species_document_node_revision
species_document_metadata
citation
species_document_export

Suggested species_document_node fields:

id
document_id
parent_id
position
depth
node_type
title
body_markdown
body_plaintext
source_heading
source_span_start
source_span_end

This enables:

greater hierarchical depth
stable editor operations on subtrees
future insertion of machine-extracted nested content
simplified export back to Markdown
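
As a sketch of how the tree could project into rows for a table like species_document_node, the function below walks the AST depth-first and emits one row per node with `parent_id` and a sibling `position`. The row keys follow the suggested field names above; the function name `flatten_nodes`, the dict-based input, and the omission of the plaintext and source-span columns are simplifying assumptions.

```python
from typing import Optional


def flatten_nodes(
    nodes: list[dict], document_id: str, parent_id: Optional[str] = None
) -> list[dict]:
    """Flatten a nested node list into parent-linked rows, in document order."""
    rows: list[dict] = []
    for position, node in enumerate(nodes):
        rows.append({
            "id": node["id"],
            "document_id": document_id,
            "parent_id": parent_id,   # None marks a top-level section
            "position": position,     # order among siblings
            "depth": node["depth"],
            "node_type": node["type"],
            "title": node["title"],
            "body_markdown": node["body"],
        })
        rows.extend(flatten_nodes(node["children"], document_id, node["id"]))
    return rows
```

Storing `position` per parent rather than per document means a subtree can be moved by updating a single `parent_id` and renumbering only the affected siblings.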

Import Flow

The legacy text parser should no longer attempt to infer the final database structure directly.

Instead:

Parse raw legacy text into a best-effort intermediate tree.
Normalize extracted metadata.
Emit constrained Markdown.
Parse constrained Markdown into AST.
Persist AST and project relationally.
Record diagnostics on uncertain conversions.

This changes the parser’s role from “infer final structure perfectly” to “produce a reviewable first draft”.

Editor Flow

The web editor should operate primarily on the Markdown representation, with a structured parse running on save or preview.

Recommended behavior:

fold by heading depth in CodeMirror
validate front matter and heading structure
preview rendered sections
show parser diagnostics inline
save both Markdown source and parsed AST

The editor should reject or flag:

invalid front matter
duplicate canonical metadata keys
heading depth jumps
malformed citation entries in structured sections
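
Duplicate metadata keys are worth checking on the raw text, since some YAML loaders silently keep only the last value for a repeated key. The sketch below scans the front matter text and returns line-anchored diagnostics that an editor could display inline. The diagnostic shape (`line`, `severity`, `message`) and the function name are hypothetical, and only unindented top-level keys are considered.

```python
from collections import defaultdict


def duplicate_key_diagnostics(front_matter: str) -> list[dict]:
    """Return one diagnostic for each repeated top-level key in the front matter text."""
    seen: dict[str, list[int]] = defaultdict(list)
    for lineno, line in enumerate(front_matter.splitlines(), start=1):
        if not line or line[0].isspace() or line.startswith(("-", "#")):
            continue  # skip blanks, nested values, list items, comments
        if ":" in line:
            seen[line.split(":", 1)[0].strip()].append(lineno)

    diagnostics = []
    for key, linenos in seen.items():
        for lineno in linenos[1:]:  # every occurrence after the first
            diagnostics.append({
                "line": lineno,
                "severity": "error",
                "message": f"duplicate metadata key '{key}' (first defined on line {linenos[0]})",
            })
    return sorted(diagnostics, key=lambda d: d["line"])
```

Line numbers here are relative to the front matter block, so a caller would add an offset of one to account for the opening `---` line.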

Export Policy

Markdown should be the primary export format for a species life history.

Export targets:

constrained Markdown for editorial interchange
JSON AST for machine workflows
derived relational/API payloads for the application
optional report-oriented exports later

The export path should be:

database document tree -> canonical AST -> constrained Markdown

This ensures the exported plaintext remains stable and human-readable.
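
The last step of that path, canonical AST to constrained Markdown, can be sketched as a short recursive serializer. It assumes the dict-shaped nodes from the conceptual AST example and reproduces the layout used in the sample document (body directly under its heading, blank line between sections); front matter emission is left out, and the name `render_sections` is hypothetical.

```python
def render_sections(nodes: list[dict]) -> str:
    """Serialize AST section nodes back into heading-structured Markdown."""
    blocks: list[str] = []
    for node in nodes:
        heading = "#" * node["depth"] + " " + node["title"]
        # Body text sits directly under its heading, as in the sample document.
        blocks.append((heading + "\n" + node["body"]) if node["body"] else heading)
        if node["children"]:
            blocks.append(render_sections(node["children"]))
    return "\n\n".join(blocks)
```

Because the output depends only on the tree, rendering the same tree twice yields identical text, which is what keeps exported plaintext stable across round trips.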

Migration Strategy

Stage 1: Introduce the document model

add AST schema and persistence tables
keep existing section-based reads working
build Markdown import/export helpers

Stage 2: Convert current parser output

map current parsed sections into Markdown drafts
preserve existing metadata and diagnostics
store generated Markdown alongside current records

Stage 3: Introduce Markdown editor

add CodeMirror-based editor with heading folding
add validation for front matter and heading structure
add round-trip save through AST

Stage 4: Move public reads to the new document model

generate current API responses from the hierarchical document tree
keep compatibility shims for legacy flat sections where needed

Stage 5: Expand structured extraction

add deeper parsing for habitat, reproduction, citations, and linkages
add richer projections from AST to relational tables

Immediate Implementation Tasks

Recommended first engineering tasks:

Define the constrained Markdown grammar and validation rules.
Design the AST schema and PostgreSQL tables.
Add Markdown import/export utilities in the API service.
Prototype a CodeMirror editor with heading folding.
Add a migration command that converts current species records into Markdown drafts.
Preserve current endpoints while introducing the document-tree backing model.

Non-Goals For The First Pass

full unrestricted Markdown feature support
WYSIWYG editing
arbitrary embedded HTML
perfect citation parsing from all legacy free text
replacing every existing API shape immediately

Decision Summary

The planned direction is:

constrained Markdown as the editable and exportable document format
internal AST as the canonical application representation
relational projection for queryable application state
CodeMirror-based browser editing with heading folding

This is the most practical path toward human-editable hierarchy, permissive-only implementation, cleaner parsing, and deeper long-term document structure.
