doclift/README.md

85 lines
2.0 KiB
Markdown
Executable File

# doclift
`doclift` is a legacy-document normalization toolkit for turning old office documents into reviewable, structured bundles.
The initial target is legacy Word `.doc` files, but the repository boundary is intentionally broader:
- extract legacy document text and metadata
- preserve layout cues that survive extraction
- recover tables, figure references, and other structural signals
- emit normalized Markdown plus JSON sidecars
- produce deterministic conversion reports for downstream systems such as Didactopus and GroundRecall
## Scope
`doclift` is not a learner-facing system. It is a source-normalization layer that other projects can consume.
Current implementation:
- legacy Word `.doc` conversion through `catdoc`
- bundle emission with:
- `document.md`
- `document.layout.json`
- `document.tables.json`
- `document.figures.json`
- `manifest.json`
- `conversion_report.json`
- course/workspace-level external figure asset inventory
Planned follow-on formats:
- WordPerfect
- RTF
- DOCX as a higher-fidelity path
- old HTML
- OCR-assisted scanned documents
## Install
```bash
pip install -e .
doclift --help
```
## Quick Start
Inspect a source:
```bash
doclift inspect /path/to/legacy.doc
```
Convert one document:
```bash
doclift convert /path/to/legacy.doc /tmp/doclift-out
```
Convert a directory tree and inventory external figure assets:
```bash
doclift convert-dir /path/to/source-tree /tmp/doclift-bundle --asset-root /path/to/source-tree
```
## Bundle Layout
```text
out/
conversion_report.json
manifest.json
assets/
figure_asset_inventory.json
documents/
some-doc/
document.md
document.layout.json
document.tables.json
document.figures.json
```
## Relationship To Other Projects
- `Didactopus` should consume `doclift` bundles rather than own legacy format handling.
- `GroundRecall` can use the same bundles for provenance-aware import.
- other archival or scholarly tooling can reuse the same normalization path without depending on Didactopus.