doclift

doclift is a legacy-document normalization toolkit for turning old office documents into reviewable, structured bundles.

The initial target is legacy Word .doc files, but the project's scope is intentionally broader:

  • extract legacy document text and metadata
  • preserve layout cues that survive extraction
  • recover tables, figure references, and other structural signals
  • emit normalized Markdown plus JSON sidecars
  • produce deterministic conversion reports for downstream systems such as Didactopus and GroundRecall
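A downstream consumer may want to confirm the determinism property for itself. The sketch below is not part of doclift. It assumes that "deterministic" means converting the same source twice yields a byte-identical conversion_report.json, which this README does not spell out. The helper names file_digest and reports_match are hypothetical.

```python
import hashlib
from pathlib import Path


def file_digest(path: Path) -> str:
    """Return the SHA-256 hex digest of a file's bytes."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def reports_match(bundle_a: Path, bundle_b: Path,
                  name: str = "conversion_report.json") -> bool:
    """Compare the same report file in two bundle directories.

    Assumption: both bundles were produced from the same source, and a
    deterministic report is one whose bytes do not change between runs.
    """
    return file_digest(bundle_a / name) == file_digest(bundle_b / name)
```

Running the same conversion into two output directories and calling reports_match on them gives a quick regression check before a consumer caches or diffs reports.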

Scope

doclift is not a learner-facing system. It is a source-normalization layer that other projects can consume.

Current implementation:

  • legacy Word .doc conversion through catdoc
  • bundle emission with:
    • document.md
    • document.layout.json
    • document.tables.json
    • document.figures.json
    • manifest.json
    • conversion_report.json
  • course/workspace-level external figure asset inventory

Planned follow-on formats:

  • WordPerfect
  • RTF
  • DOCX as a higher-fidelity path
  • old HTML
  • OCR-assisted scanned documents

Install

pip install -e .
doclift --help

Quick Start

Inspect a source:

doclift inspect /path/to/legacy.doc

Convert one document:

doclift convert /path/to/legacy.doc /tmp/doclift-out

Convert a directory tree and inventory external figure assets:

doclift convert-dir /path/to/source-tree /tmp/doclift-bundle --asset-root /path/to/source-tree
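For pipelines that drive doclift from Python, the commands above can be wrapped with subprocess. This is a minimal sketch, not an API that doclift provides: the function names are hypothetical, only the subcommands and the --asset-root flag shown above are used, and it assumes the CLI exits with a non-zero status on failure.

```python
from __future__ import annotations

import subprocess
from pathlib import Path


def convert_command(source: Path, out_dir: Path) -> list[str]:
    """Build the argv for `doclift convert SOURCE OUT_DIR`."""
    return ["doclift", "convert", str(source), str(out_dir)]


def convert_dir_command(source_tree: Path, out_dir: Path,
                        asset_root: Path) -> list[str]:
    """Build the argv for `doclift convert-dir` with an asset root."""
    return [
        "doclift", "convert-dir", str(source_tree), str(out_dir),
        "--asset-root", str(asset_root),
    ]


def run(argv: list[str]) -> None:
    """Run a doclift command.

    Assumption: doclift signals failure through its exit status, so
    check=True turns a failed conversion into CalledProcessError.
    """
    subprocess.run(argv, check=True)
```

Building the argument list separately from running it keeps the command easy to log and to test without invoking the CLI.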

Bundle Layout

out/
  conversion_report.json
  manifest.json
  assets/
    figure_asset_inventory.json
  documents/
    some-doc/
      document.md
      document.layout.json
      document.tables.json
      document.figures.json
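A consumer can check a bundle against this layout before reading it. The sketch below is illustrative and not shipped with doclift: the function name missing_files is hypothetical, and the file names are taken directly from the tree above. It deliberately does not inspect the JSON contents, since their schema is not described here, and it does not require assets/figure_asset_inventory.json, on the assumption that the inventory is only written when an asset root is supplied.

```python
from __future__ import annotations

from pathlib import Path

# File names copied from the bundle layout in this README.
BUNDLE_FILES = ("conversion_report.json", "manifest.json")
DOCUMENT_FILES = (
    "document.md",
    "document.layout.json",
    "document.tables.json",
    "document.figures.json",
)


def missing_files(bundle_root: Path) -> list[str]:
    """Return bundle-relative paths of expected files that are absent."""
    missing = [
        name for name in BUNDLE_FILES
        if not (bundle_root / name).is_file()
    ]
    documents = bundle_root / "documents"
    if documents.is_dir():
        # Each subdirectory of documents/ is treated as one converted document.
        for doc_dir in sorted(p for p in documents.iterdir() if p.is_dir()):
            for name in DOCUMENT_FILES:
                if not (doc_dir / name).is_file():
                    missing.append(f"documents/{doc_dir.name}/{name}")
    return missing
```

An empty return value means every file named in the layout is present; anything else lists what to investigate before import.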

Relationship To Other Projects

  • Didactopus should consume doclift bundles rather than own legacy format handling.
  • GroundRecall can use the same bundles for provenance-aware import.
  • Other archival or scholarly tooling can reuse the same normalization path without depending on Didactopus.