doclift/docs/architecture.md

# Architecture

`doclift` is intended to sit between raw legacy sources and downstream domain-specific systems.

## Layers

1. Format detection
2. Format-specific extraction
3. Structural recovery
4. Normalized bundle emission
5. Downstream import by applications such as Didactopus or GroundRecall

## Design constraints

- deterministic outputs
- explicit provenance
- structured sidecars for non-prose information
- graceful degradation when exact layout cannot be recovered
- container-friendly execution to reduce cross-platform variance

## Output philosophy

The primary artifact is not a page-faithful rendering. It is a normalized bundle:

- readable by humans
- structured enough for agents and pipelines
- explicit about uncertainty and extraction limits

## Initial format strategy

- `.doc`: implemented through `catdoc`, with layout/table recovery on extracted text
- `.docx`: planned as a higher-fidelity path
- `.wpd`: planned as a plugin/adapter target, not hard-coded into core assumptions

## Why separate from Didactopus

`doclift` owns document rescue and normalization complexity.
`Didactopus` should stay focused on course ingestion, concept extraction, and learning-path generation.