doclift/docs/bundle-format.md

42 lines
1.0 KiB
Markdown
Executable File

# Bundle Format
## Top-level
`manifest.json`
- bundle version
- source root
- converter summary
- document list
`conversion_report.json`
- per-document conversion metrics
- counts for tables, figure references, and errors
`assets/figure_asset_inventory.json`
- optional inventory of external image/figure files discovered under an asset root
## Per-document
Each normalized document lives under `documents/<document-id>/`.
`document.md`
- readable normalized text
- extracted table and figure sections when available
`document.layout.json`
- line-oriented layout manifest
- indentation, tabs, and coarse line classification
`document.tables.json`
- table references found in text
- recovered tables with captions, raw lines, parsed rows, and source line ranges
`document.figures.json`
- explicit figure references from text
- related external assets when available
## Stability
The schema should be stable enough for downstream adapters.
Converters may improve row parsing or figure linking without breaking field names.