# Architecture
`doclift` is intended to sit between raw legacy sources and downstream domain-specific systems.
## Layers
1. Format detection
2. Format-specific extraction
3. Structural recovery
4. Normalized bundle emission
5. Downstream import by applications such as Didactopus or GroundRecall
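
As a rough illustration of how layers 1 through 4 might chain together, here is a minimal sketch. Every function name and the bundle shape are hypothetical and are not taken from `doclift` itself; the extractor and structure-recovery steps are deliberately trivial stand-ins.

```python
from pathlib import Path

# Hypothetical names throughout; this is not doclift's actual API.

# Well-known container signatures. OLE2 is shared by other legacy Office
# formats and ZIP by many formats, so a real detector would look deeper.
OLE2_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")
ZIP_MAGIC = b"PK\x03\x04"


def detect_format(data: bytes, suffix: str) -> str:
    """Layer 1: guess the format from magic bytes, then fall back to the suffix."""
    if data.startswith(OLE2_MAGIC):
        return "doc"
    if data.startswith(ZIP_MAGIC):
        return "docx"
    return suffix.lower().lstrip(".") or "unknown"


def extract_text(data: bytes, fmt: str) -> str:
    """Layer 2: stand-in extractor; a real one would dispatch on fmt."""
    return data.decode("utf-8", errors="replace")


def recover_structure(text: str) -> list[dict]:
    """Layer 3: naive structural recovery that splits on blank lines."""
    blocks = []
    for chunk in text.split("\n\n"):
        chunk = chunk.strip()
        if chunk:
            blocks.append({"kind": "paragraph", "text": chunk})
    return blocks


def emit_bundle(source_name: str, fmt: str, blocks: list[dict]) -> dict:
    """Layer 4: assemble the normalized bundle as a plain dictionary."""
    return {"source": source_name, "format": fmt, "blocks": blocks}


def run_pipeline(data: bytes, source_name: str) -> dict:
    """Run layers 1 to 4 in order; layer 5 belongs to downstream applications."""
    fmt = detect_format(data, Path(source_name).suffix)
    text = extract_text(data, fmt)
    blocks = recover_structure(text)
    return emit_bundle(source_name, fmt, blocks)
```

Keeping each layer as a separate function with a narrow input and output is one way to let a format-specific extractor be swapped without touching detection or emission.
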
## Design constraints
- deterministic outputs
- explicit provenance
- structured sidecars for non-prose information
- graceful degradation when exact layout cannot be recovered
- container-friendly execution to reduce cross-platform variance
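
The first two constraints can be sketched in a few lines. The example below assumes the bundle is serialized as JSON and that provenance is recorded as a content hash plus the extractor name; both choices, and the function names, are illustrative assumptions rather than `doclift`'s documented behavior.

```python
import hashlib
import json


def provenance_record(source_name: str, data: bytes, extractor: str) -> dict:
    """Describe where extracted content came from (hypothetical record shape).

    No timestamp is included, so identical input yields an identical record.
    """
    return {
        "source": source_name,
        "sha256": hashlib.sha256(data).hexdigest(),
        "extractor": extractor,
    }


def serialize_bundle(bundle: dict) -> str:
    """Serialize with sorted keys so key insertion order cannot change the output."""
    return json.dumps(bundle, sort_keys=True, indent=2, ensure_ascii=False) + "\n"
```

With this approach, two runs over the same bytes produce byte-identical output, which makes diffs between bundle versions meaningful.
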
## Output philosophy
The primary artifact is not a page-faithful rendering. It is a normalized bundle:
- readable by humans
- structured enough for agents and pipelines
- explicit about uncertainty and extraction limits
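
One possible shape for such a bundle is sketched below. The `Block` and `Bundle` classes, the `confidence` field, and the note format are all hypothetical; they only illustrate how a single object could serve a human reader, a pipeline, and an honest account of extraction limits at once.

```python
from dataclasses import asdict, dataclass, field


@dataclass
class Block:
    kind: str                  # e.g. "paragraph" or "table"
    text: str
    confidence: float = 1.0    # 1.0 means the extractor has no known doubts
    notes: list[str] = field(default_factory=list)


@dataclass
class Bundle:
    source: str
    blocks: list[Block] = field(default_factory=list)
    limits: list[str] = field(default_factory=list)  # bundle-wide extraction limits

    def to_markdown(self) -> str:
        """Human-readable view; per-block notes are rendered as blockquotes."""
        parts = []
        for block in self.blocks:
            parts.append(block.text)
            for note in block.notes:
                parts.append(f"> extraction note: {note}")
        return "\n\n".join(parts) + "\n"

    def to_dict(self) -> dict:
        """Structured view for agents and pipelines."""
        return asdict(self)
```

For example, a table whose column boundaries were guessed could carry `confidence=0.5` and a note saying so, instead of being presented as exact.
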
## Initial format strategy
- `.doc`: implemented through `catdoc`, with layout/table recovery on extracted text
- `.docx`: planned as a higher-fidelity path
- `.wpd`: planned as a plugin/adapter target, not hard-coded into core assumptions
## Why separate from Didactopus
`doclift` owns document rescue and normalization complexity.
`Didactopus` should stay focused on course ingestion, concept extraction, and learning-path generation.