doclift
doclift is a legacy-document normalization toolkit for turning old office documents into reviewable, structured bundles.
The initial target is legacy Word .doc files, but the repository boundary is intentionally broader:
- extract legacy document text and metadata
- preserve layout cues that survive extraction
- recover tables, figure references, and other structural signals
- emit normalized Markdown plus JSON sidecars
- produce deterministic conversion reports for downstream systems such as Didactopus and GroundRecall
Scope
doclift is not a learner-facing system. It is a source-normalization layer that other projects can consume.
Project planning and lifecycle notes live in:
- docs/architecture.md
- docs/bundle-format.md
- docs/roadmap.md
Current implementation:
- legacy Word .doc conversion through catdoc
- bundle emission with:
  - document.md
  - document.layout.json
  - document.tables.json
  - document.figures.json
  - manifest.json
  - conversion_report.json
- course/workspace-level external figure asset inventory
Planned follow-on formats:
- WordPerfect
- RTF
- DOCX as a higher-fidelity path
- old HTML
- OCR-assisted scanned documents
Install
pip install -e .
doclift --help
Quick Start
Inspect a source:
doclift inspect /path/to/legacy.doc
Convert one document:
doclift convert /path/to/legacy.doc /tmp/doclift-out
Convert a directory tree and inventory external figure assets:
doclift convert-dir /path/to/source-tree /tmp/doclift-bundle --asset-root /path/to/source-tree
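For scripted batch runs, the convert-dir call above can be wrapped from Python. This is a minimal sketch, not part of doclift itself: the helper names `build_convert_dir_cmd` and `run_convert_dir` are hypothetical, and treating `--asset-root` as optional is an assumption. Only the command and flag shown in Quick Start are used.

```python
import shutil
import subprocess


def build_convert_dir_cmd(source_tree, out_dir, asset_root=None):
    """Build the argv list for `doclift convert-dir` as shown in Quick Start.

    Hypothetical helper. Assumes --asset-root may be left out; the README
    only shows the form where it is passed.
    """
    cmd = ["doclift", "convert-dir", str(source_tree), str(out_dir)]
    if asset_root is not None:
        cmd += ["--asset-root", str(asset_root)]
    return cmd


def run_convert_dir(source_tree, out_dir, asset_root=None):
    """Run the conversion and raise if the CLI is missing or exits non-zero."""
    if shutil.which("doclift") is None:
        raise RuntimeError("doclift is not on PATH; run `pip install -e .` first")
    cmd = build_convert_dir_cmd(source_tree, out_dir, asset_root)
    # check=True turns a non-zero exit status into CalledProcessError.
    return subprocess.run(cmd, check=True)
```

Passing the arguments as a list rather than a shell string avoids quoting problems with source paths that contain spaces, which are common in old document archives.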
Bundle Layout
out/
  conversion_report.json
  manifest.json
  assets/
    figure_asset_inventory.json
  documents/
    some-doc/
      document.md
      document.layout.json
      document.tables.json
      document.figures.json
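A downstream consumer may want to confirm that a bundle matches this layout before importing it. The sketch below is one possible check, assuming only the directory and file names listed above; the function names are hypothetical, and nothing is assumed about the contents of the JSON sidecars, whose format is described in docs/bundle-format.md.

```python
import json
from pathlib import Path

# Per-document files named in the Bundle Layout section.
EXPECTED_SIDECARS = (
    "document.md",
    "document.layout.json",
    "document.tables.json",
    "document.figures.json",
)


def missing_sidecars(bundle_root):
    """Map each document directory name to the expected files it lacks.

    Hypothetical helper: documents with every expected file are omitted,
    so an empty dict means the per-document layout is complete.
    """
    documents_dir = Path(bundle_root) / "documents"
    report = {}
    if not documents_dir.is_dir():
        return report
    for doc_dir in sorted(p for p in documents_dir.iterdir() if p.is_dir()):
        missing = [n for n in EXPECTED_SIDECARS if not (doc_dir / n).is_file()]
        if missing:
            report[doc_dir.name] = missing
    return report


def load_json_if_present(path):
    """Parse a JSON file, or return None when it does not exist."""
    path = Path(path)
    if not path.is_file():
        return None
    with path.open(encoding="utf-8") as handle:
        return json.load(handle)
```

For example, `missing_sidecars("/tmp/doclift-bundle")` could be run after a convert-dir call, and `load_json_if_present` used to read `manifest.json` or `conversion_report.json` from the bundle root.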
Relationship To Other Projects
- Didactopus should consume doclift bundles rather than own legacy format handling.
- GroundRecall can use the same bundles for provenance-aware import.
- other archival or scholarly tooling can reuse the same normalization path without depending on Didactopus.
License
doclift is licensed under the MIT license. See LICENSE.