doclift/docs/architecture.md

Architecture

doclift is intended to sit between raw legacy sources and downstream domain-specific systems.

Layers

  1. Format detection
  2. Format-specific extraction
  3. Structural recovery
  4. Normalized bundle emission
  5. Downstream import by applications such as Didactopus or GroundRecall
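As a rough illustration of how the first layer could hand off to the rest, the sketch below guesses a format from well-known leading bytes and falls back to the file extension. The names `detect_format` and `run_pipeline`, and the `extract`/`recover`/`emit` callables, are hypothetical and not taken from doclift itself.

```python
from pathlib import Path

# Leading "magic" bytes for the container types behind the formats named here.
# These are coarse signals: OLE2 and ZIP containers are shared by other formats.
OLE2_MAGIC = bytes.fromhex("d0cf11e0a1b11ae1")  # OLE2 container used by legacy .doc
ZIP_MAGIC = b"PK\x03\x04"                        # ZIP container used by .docx
WPD_MAGIC = b"\xffWPC"                           # WordPerfect file prefix


def detect_format(head: bytes, filename: str = "") -> str:
    """Layer 1: guess the format from leading bytes, then from the extension."""
    if head.startswith(OLE2_MAGIC):
        return "doc"
    if head.startswith(ZIP_MAGIC):
        return "docx"
    if head.startswith(WPD_MAGIC):
        return "wpd"
    suffix = Path(filename).suffix.lower().lstrip(".")
    return suffix if suffix in {"doc", "docx", "wpd"} else "unknown"


def run_pipeline(data: bytes, filename: str, extract, recover, emit):
    """Layers 1-4 chained; the three callables are stand-ins for real stages."""
    fmt = detect_format(data[:8], filename)   # 1. format detection
    raw = extract(fmt, data)                  # 2. format-specific extraction
    structure = recover(raw)                  # 3. structural recovery
    return emit(fmt, structure)               # 4. normalized bundle emission
```

Layer 5 is deliberately absent from the sketch: downstream applications consume the emitted bundle and are outside this pipeline.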

Design constraints

  • deterministic outputs
  • explicit provenance
  • structured sidecars for non-prose information
  • graceful degradation when exact layout cannot be recovered
  • container-friendly execution to reduce cross-platform variance
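Two of these constraints, deterministic outputs and explicit provenance, can be sketched together. The example below is a minimal illustration under assumed names: the `provenance` and `serialize_bundle` helpers and the field names are hypothetical, not doclift's actual schema.

```python
import hashlib
import json


def provenance(source_name: str, source_bytes: bytes, extractor: str) -> dict:
    """Record where the content came from, using only reproducible values."""
    return {
        "source_name": source_name,
        # A content hash identifies the exact input without relying on paths.
        "source_sha256": hashlib.sha256(source_bytes).hexdigest(),
        "extractor": extractor,
        # Deliberately no timestamp or hostname: those would vary between runs.
    }


def serialize_bundle(bundle: dict) -> str:
    """Serialize with sorted keys and fixed formatting so identical input
    always yields byte-identical output."""
    return json.dumps(bundle, sort_keys=True, indent=2, ensure_ascii=False) + "\n"
```

Leaving wall-clock values out of the record is one way to keep repeated runs diffable; a real implementation might store run metadata in a separate file instead.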

Output philosophy

The primary artifact is a normalized bundle rather than a page-faithful rendering. The bundle is:

  • readable by humans
  • structured enough for agents and pipelines
  • explicit about uncertainty and extraction limits
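One possible shape for such a bundle is sketched below. The `Block` and `Bundle` classes, their fields, and the 0.8 confidence threshold are all assumptions made for illustration; the actual bundle layout is not specified in this document.

```python
from dataclasses import asdict, dataclass, field
from typing import List


@dataclass
class Block:
    kind: str                 # e.g. "heading", "paragraph", "table"
    text: str
    confidence: float = 1.0   # how sure the recovery step is about this block
    notes: List[str] = field(default_factory=list)


@dataclass
class Bundle:
    title: str
    blocks: List[Block] = field(default_factory=list)
    limits: List[str] = field(default_factory=list)  # known extraction limits

    def to_text(self) -> str:
        """Human-readable view that flags low-confidence blocks inline."""
        lines = [self.title, ""]
        for block in self.blocks:
            marker = ""
            if block.confidence < 0.8:  # hypothetical threshold
                detail = ", ".join(block.notes) or "no detail"
                marker = f" [uncertain: {detail}]"
            lines.append(block.text + marker)
        return "\n".join(lines)

    def to_dict(self) -> dict:
        """Structured view for agents and pipelines."""
        return asdict(self)
```

The same data serves both audiences: `to_text` for readers, `to_dict` for tools, with uncertainty carried in each view rather than dropped.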

Initial format strategy

  • .doc: implemented through catdoc, with layout and table recovery applied to the extracted text
  • .docx: planned as a higher-fidelity path
  • .wpd: planned as a plugin/adapter target, not hard-coded into core assumptions

Why separate from Didactopus

doclift owns document rescue and normalization complexity. Didactopus should stay focused on course ingestion, concept extraction, and learning-path generation.