doclift

doclift is a legacy-document normalization toolkit for turning old office documents into reviewable, structured bundles.

The initial target is legacy Word .doc files, but the repository's scope is intentionally broader:

extract legacy document text and metadata
preserve layout cues that survive extraction
recover tables, figure references, and other structural signals
emit normalized Markdown plus JSON sidecars
produce deterministic conversion reports for downstream systems such as Didactopus and GroundRecall

Scope

doclift is not a learner-facing system. It is a source-normalization layer that other projects can consume.

Current implementation:

legacy Word .doc conversion through catdoc
bundle emission with:
- document.md
- document.layout.json
- document.tables.json
- document.figures.json
- manifest.json
- conversion_report.json
course/workspace-level external figure asset inventory
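
As a rough illustration of the catdoc-based path listed above, the sketch below shells out to catdoc and captures its plain-text output. It is not doclift's actual code: the function names `build_catdoc_command` and `extract_text` are hypothetical, and the choice of the `-w` and `-d utf-8` options is an assumption about how one might invoke catdoc, not something this README specifies.

```python
import shutil
import subprocess
from pathlib import Path


def build_catdoc_command(source: Path) -> list[str]:
    """Build the catdoc argument list (hypothetical helper, not doclift API)."""
    # Assumed options: -w disables catdoc's line wrapping,
    # -d utf-8 asks for UTF-8 output.
    return ["catdoc", "-w", "-d", "utf-8", str(source)]


def extract_text(source: Path) -> str:
    """Return the plain text that catdoc extracts from a legacy .doc file."""
    if shutil.which("catdoc") is None:
        raise RuntimeError("catdoc is not installed or not on PATH")
    result = subprocess.run(
        build_catdoc_command(source),
        capture_output=True,
        check=True,  # raise CalledProcessError on a non-zero exit status
    )
    # Decode leniently: legacy files can yield bytes that are not valid UTF-8.
    return result.stdout.decode("utf-8", errors="replace")
```

Keeping the command construction separate from the subprocess call makes the invocation easy to inspect or log without needing catdoc installed.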

Planned follow-on formats:

WordPerfect
RTF
DOCX as a higher-fidelity path
old HTML
OCR-assisted scanned documents

Install

pip install -e .
doclift --help

Quick Start

Inspect a source:

doclift inspect /path/to/legacy.doc

Convert one document:

doclift convert /path/to/legacy.doc /tmp/doclift-out

Convert a directory tree and inventory external figure assets:

doclift convert-dir /path/to/source-tree /tmp/doclift-bundle --asset-root /path/to/source-tree
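
To give a feel for what inventorying external figure assets under an asset root might involve, here is a minimal sketch. Everything in it is an assumption made for illustration: the function name `inventory_figure_assets`, the list of file extensions, and the `path`/`bytes`/`sha256` fields are hypothetical and are not taken from doclift's `figure_asset_inventory.json` format.

```python
import hashlib
from pathlib import Path

# Assumed set of figure extensions; the real tool may recognise a different set.
FIGURE_SUFFIXES = {".bmp", ".gif", ".jpg", ".jpeg", ".png", ".tif", ".tiff", ".wmf"}


def inventory_figure_assets(asset_root: Path) -> list[dict]:
    """List figure-like files under asset_root in a stable, sorted order."""
    entries = []
    # Sorting the walk keeps the output identical across runs and filesystems.
    for path in sorted(asset_root.rglob("*")):
        if path.is_file() and path.suffix.lower() in FIGURE_SUFFIXES:
            entries.append({
                # Hypothetical fields: relative path, size, and content hash.
                "path": path.relative_to(asset_root).as_posix(),
                "bytes": path.stat().st_size,
                "sha256": hashlib.sha256(path.read_bytes()).hexdigest(),
            })
    return entries
```

Recording paths relative to the asset root, rather than absolute paths, would keep such an inventory portable between machines.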

Bundle Layout

out/
  conversion_report.json
  manifest.json
  assets/
    figure_asset_inventory.json
  documents/
    some-doc/
      document.md
      document.layout.json
      document.tables.json
      document.figures.json
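
A downstream consumer could walk this layout along the following lines. This is a sketch under assumptions: `load_bundle` is a hypothetical helper, not part of doclift, and because the JSON schemas are not described here the code only parses each file without interpreting its contents.

```python
import json
from pathlib import Path

# Per-document JSON sidecars named in the bundle layout above.
SIDECARS = ("document.layout.json", "document.tables.json", "document.figures.json")


def read_json(path: Path):
    """Parse one UTF-8 JSON file."""
    return json.loads(path.read_text(encoding="utf-8"))


def load_bundle(bundle_root: Path) -> dict:
    """Load the top-level reports and every document directory in a bundle."""
    bundle = {
        "manifest": read_json(bundle_root / "manifest.json"),
        "report": read_json(bundle_root / "conversion_report.json"),
        "documents": {},
    }
    # Sorted iteration gives a stable document order.
    for doc_dir in sorted((bundle_root / "documents").iterdir()):
        if not doc_dir.is_dir():
            continue
        doc = {"markdown": (doc_dir / "document.md").read_text(encoding="utf-8")}
        for name in SIDECARS:
            doc[name] = read_json(doc_dir / name)
        bundle["documents"][doc_dir.name] = doc
    return bundle
```

A missing file raises `FileNotFoundError`, which surfaces an incomplete bundle early instead of silently skipping it.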

Relationship To Other Projects

Didactopus should consume doclift bundles rather than own legacy format handling.
GroundRecall can use the same bundles for provenance-aware import.
Other archival or scholarly tooling can reuse the same normalization path without depending on Didactopus.
