Architecture

doclift is intended to sit between raw legacy sources and downstream domain-specific systems.

Layers

Format detection
Format-specific extraction
Structural recovery
Normalized bundle emission
Downstream import by applications such as Didactopus or GroundRecall
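
The layers above can be sketched as a small pipeline. This is only an illustration under stated assumptions, not doclift's actual code: the function names (detect_format, recover_structure, emit_bundle, run_pipeline) and the bundle's dict shape are hypothetical. The magic-byte signatures themselves are the standard ones for OLE2 compound files, ZIP archives, and WordPerfect documents.

```python
# Leading-byte signatures for the three formats named in this document.
OLE2_MAGIC = bytes.fromhex("d0cf11e0a1b11ae1")  # OLE2 container used by legacy .doc
ZIP_MAGIC = b"PK\x03\x04"                        # .docx is a ZIP archive
WPD_MAGIC = b"\xffWPC"                           # WordPerfect document


def detect_format(data: bytes) -> str:
    """Layer 1: coarse guess from leading bytes rather than the file extension.

    OLE2 and ZIP are shared by other file types, so a real detector
    would need to look inside the container as well.
    """
    if data.startswith(OLE2_MAGIC):
        return "doc"
    if data.startswith(ZIP_MAGIC):
        return "docx"
    if data.startswith(WPD_MAGIC):
        return "wpd"
    return "unknown"


def recover_structure(text: str) -> list[dict]:
    """Layer 3: split extracted text into paragraph blocks on blank lines."""
    blocks = []
    for chunk in text.split("\n\n"):
        chunk = chunk.strip()
        if chunk:
            blocks.append({"kind": "paragraph", "text": chunk})
    return blocks


def emit_bundle(fmt: str, blocks: list[dict]) -> dict:
    """Layer 4: wrap recovered blocks in a plain dict for downstream import."""
    return {"source_format": fmt, "blocks": blocks}


def run_pipeline(data: bytes, extract) -> dict:
    """Chain the layers; `extract` is a caller-supplied callable (Layer 2)."""
    fmt = detect_format(data)
    text = extract(fmt, data)
    return emit_bundle(fmt, recover_structure(text))
```

Passing the extractor in as a callable keeps format-specific code out of the pipeline itself, which matches the layering described above.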

Design constraints

deterministic outputs
explicit provenance
structured sidecars for non-prose information
graceful degradation when exact layout cannot be recovered
container-friendly execution to reduce cross-platform variance
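
As a minimal sketch of how the first two constraints might be met together, the snippet below builds a provenance record and serializes it in a stable form. The field names and the build_manifest and serialize helpers are assumptions made for illustration; the source does not specify a manifest format.

```python
import hashlib
import json


def build_manifest(source_name: str, source_bytes: bytes, extractor: str) -> dict:
    """Record where a bundle came from.

    Timestamps and hostnames are left out on purpose, since they would
    make two runs over the same input produce different output.
    """
    return {
        "source_name": source_name,
        "source_sha256": hashlib.sha256(source_bytes).hexdigest(),
        "extractor": extractor,
    }


def serialize(obj: dict) -> str:
    """Stable JSON: sorted keys and fixed indentation, ending in one newline."""
    return json.dumps(obj, sort_keys=True, indent=2, ensure_ascii=False) + "\n"
```

Hashing the source bytes ties each bundle to one exact input, and sorting keys means the serialized text does not depend on the order in which fields were added.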

Output philosophy

The primary artifact is not a page-faithful rendering. It is a normalized bundle:

readable by humans
structured enough for agents and pipelines
explicit about uncertainty and extraction limits
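
One way such a bundle could carry uncertainty alongside its text is sketched below. The Block and Bundle classes, the three-level confidence scale, and the bracketed annotation style are all hypothetical; they only illustrate the three properties listed above.

```python
from dataclasses import dataclass, field


@dataclass
class Block:
    kind: str                 # e.g. "paragraph" or "table"
    text: str
    confidence: str = "high"  # hypothetical scale: "high", "medium", "low"
    note: str = ""            # why confidence was lowered, if it was


@dataclass
class Bundle:
    blocks: list[Block] = field(default_factory=list)
    limits: list[str] = field(default_factory=list)  # bundle-wide extraction limits

    def render(self) -> str:
        """Human-readable view that keeps uncertainty visible inline."""
        lines = []
        for block in self.blocks:
            lines.append(block.text)
            if block.confidence != "high":
                lines.append(
                    f"[{block.kind}, confidence={block.confidence}: {block.note}]"
                )
            lines.append("")
        for limit in self.limits:
            lines.append(f"[limit: {limit}]")
        return "\n".join(lines).rstrip() + "\n"
```

The same objects serve both audiences: pipelines read the typed fields, while render() gives a person the text with any caveats attached to the block they apply to.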

Initial format strategy

.doc: implemented through catdoc, with layout and table recovery applied to the extracted text
.docx: planned as a higher-fidelity path
.wpd: planned as a plugin/adapter target, not hard-coded into core assumptions
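
The adapter idea could look roughly like the registry below, where each format is added without touching the core. The EXTRACTORS table and the register and extract helpers are hypothetical names, and the catdoc call is the plainest possible invocation (the file path as the only argument, text read from stdout); doclift's real invocation may differ.

```python
import subprocess
from typing import Callable

Extractor = Callable[[str], str]  # file path in, extracted text out
EXTRACTORS: dict[str, Extractor] = {}


def register(fmt: str):
    """Decorator that adds an extractor to the registry under a format name."""
    def wrap(fn: Extractor) -> Extractor:
        EXTRACTORS[fmt] = fn
        return fn
    return wrap


@register("doc")
def extract_doc(path: str) -> str:
    # Assumes the catdoc CLI is on PATH; raises CalledProcessError on failure.
    result = subprocess.run(
        ["catdoc", path], capture_output=True, text=True, check=True
    )
    return result.stdout


def extract(fmt: str, path: str) -> str:
    """Dispatch to the registered extractor, or fail with a clear message."""
    try:
        extractor = EXTRACTORS[fmt]
    except KeyError:
        raise LookupError(f"no extractor registered for format {fmt!r}") from None
    return extractor(path)
```

Under this sketch, a later .docx or .wpd adapter would be one more decorated function, and formats without an adapter fail at dispatch instead of deep inside extraction.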

Why separate from Didactopus

doclift owns document rescue and normalization complexity. Didactopus should stay focused on course ingestion, concept extraction, and learning-path generation.
