# doclift

`doclift` is a legacy-document normalization toolkit for turning old office documents into reviewable, structured bundles.

The initial target is legacy Word `.doc` files, but the repository boundary is intentionally broader:

- extract legacy document text and metadata
- preserve layout cues that survive extraction
- recover tables, figure references, and other structural signals
- emit normalized Markdown plus JSON sidecars
- produce deterministic conversion reports for downstream systems such as Didactopus and GroundRecall

## Scope

`doclift` is not a learner-facing system. It is a source-normalization layer that other projects can consume.

Project planning and lifecycle notes live in:

- `docs/architecture.md`
- `docs/bundle-format.md`
- `docs/roadmap.md`

Current implementation:

- legacy Word `.doc` conversion through `catdoc`
- bundle emission with:
  - `document.md`
  - `document.layout.json`
  - `document.tables.json`
  - `document.figures.json`
  - `manifest.json`
  - `conversion_report.json`
- course/workspace-level external figure asset inventory

Planned follow-on formats:

- WordPerfect
- RTF
- DOCX as a higher-fidelity path
- old HTML
- OCR-assisted scanned documents

## Install

```bash
pip install -e .
doclift --help
```

## Quick Start

Inspect a source:

```bash
doclift inspect /path/to/legacy.doc
```

Convert one document:

```bash
doclift convert /path/to/legacy.doc /tmp/doclift-out
```

Convert a directory tree and inventory external figure assets:

```bash
doclift convert-dir /path/to/source-tree /tmp/doclift-bundle --asset-root /path/to/source-tree
```

## Downstream Workflow

`doclift` is meant to hand off a normalized bundle to downstream systems rather than to own review, pedagogy, or canonical knowledge storage.

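
As a rough illustration of what a downstream consumer might do on receiving a bundle, the sketch below walks a bundle directory and reports which sidecar files sit beside each `document.md`. The file names come from the bundle file list in this README; the helper name `find_documents` is hypothetical and not part of `doclift`, and the sketch deliberately makes no assumption about the JSON schema inside the sidecars or about how deeply documents are nested.

```python
from pathlib import Path

# Sidecar names as listed in this README; their JSON contents are not inspected here.
SIDECARS = (
    "document.layout.json",
    "document.tables.json",
    "document.figures.json",
)


def find_documents(bundle_root):
    """Map each directory holding a document.md to the sidecars present beside it.

    Hypothetical helper for illustration only; not part of the doclift package.
    """
    root = Path(bundle_root)
    found = {}
    # rglob avoids assuming a fixed nesting depth inside the bundle.
    for md_path in sorted(root.rglob("document.md")):
        doc_dir = md_path.parent
        present = [name for name in SIDECARS if (doc_dir / name).is_file()]
        found[doc_dir.relative_to(root).as_posix()] = present
    return found
```

Called as `find_documents("/tmp/doclift-bundle")`, it returns a dict keyed by bundle-relative document directory, which a consumer could use to skip documents that lack the sidecars it depends on.
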
Minimal end-to-end flow:

```bash
doclift convert-dir /path/to/legacy-course /tmp/doclift-bundle --asset-root /path/to/legacy-course
didactopus doclift-bundle /tmp/doclift-bundle /tmp/didactopus-pack --course-title "Example Course"
groundrecall import /tmp/doclift-bundle --mode quick
```

Use the Didactopus step when you want a learner-facing pack and review workflow. Use the GroundRecall step when you want to import the normalized source bundle directly into a canonical knowledge store.

## Bundle Layout

```text
out/
  conversion_report.json
  manifest.json
  assets/
    figure_asset_inventory.json
  documents/
    some-doc/
      document.md
      document.layout.json
      document.tables.json
      document.figures.json
```

## Relationship To Other Projects

- `Didactopus` should consume `doclift` bundles rather than own legacy format handling.
- `GroundRecall` can use the same bundles for provenance-aware import.
- other archival or scholarly tooling can reuse the same normalization path without depending on Didactopus.

## License

`doclift` is licensed under the MIT license. See `LICENSE`.