# doclift

`doclift` is a legacy-document normalization toolkit for turning old office documents into reviewable, structured bundles.

The initial target is legacy Word `.doc` files, but the repository boundary is intentionally broader:

- extract legacy document text and metadata
- preserve layout cues that survive extraction
- recover tables, figure references, and other structural signals
- emit normalized Markdown plus JSON sidecars
- produce deterministic conversion reports for downstream systems such as Didactopus and GroundRecall

## Scope

`doclift` is not a learner-facing system. It is a source-normalization layer that other projects can consume.

Project planning and lifecycle notes live in:

- `docs/architecture.md`
- `docs/bundle-format.md`
- `docs/roadmap.md`

Current implementation:

- legacy Word `.doc` conversion through `catdoc`
- bundle emission with:
  - `document.md`
  - `document.layout.json`
  - `document.tables.json`
  - `document.figures.json`
  - `manifest.json`
  - `conversion_report.json`
- course/workspace-level external figure asset inventory

Planned follow-on formats:

- WordPerfect
- RTF
- DOCX as a higher-fidelity path
- old HTML
- OCR-assisted scanned documents

## Install

```bash
pip install -e .
doclift --help
```

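The current `.doc` path goes through `catdoc`, so it may be worth confirming that a `catdoc` executable is reachable before converting. The sketch below assumes `doclift` calls `catdoc` as an external command found on `PATH`, which this README does not state explicitly. `find_tool` is a hypothetical helper for illustration, not part of `doclift`.

```python
import shutil
from typing import Optional


def find_tool(name: str = "catdoc") -> Optional[str]:
    """Return the path of an executable found on PATH, or None if absent."""
    # shutil.which performs the same lookup a shell would do for a bare command name.
    return shutil.which(name)


if __name__ == "__main__":
    path = find_tool()
    if path is None:
        # Assumption: without catdoc, legacy .doc conversion cannot run.
        print("catdoc not found on PATH")
    else:
        print(f"catdoc found at {path}")
```
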
## Quick Start

Inspect a source:

```bash
doclift inspect /path/to/legacy.doc
```

Convert one document:

```bash
doclift convert /path/to/legacy.doc /tmp/doclift-out
```

Convert a directory tree and inventory external figure assets:

```bash
doclift convert-dir /path/to/source-tree /tmp/doclift-bundle --asset-root /path/to/source-tree
```

## Downstream Workflow

`doclift` is meant to hand off a normalized bundle to downstream systems rather
than to own review, pedagogy, or canonical knowledge storage.

Minimal end-to-end flow:

```bash
doclift convert-dir /path/to/legacy-course /tmp/doclift-bundle --asset-root /path/to/legacy-course
didactopus doclift-bundle /tmp/doclift-bundle /tmp/didactopus-pack --course-title "Example Course"
groundrecall import /tmp/doclift-bundle --mode quick
```

Use the Didactopus step when you want a learner-facing pack and review workflow.
Use the GroundRecall step when you want to import the normalized source bundle
directly into a canonical knowledge store.

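When this flow is driven from a script rather than typed by hand, one option is to build the three commands as argument lists and run them in order. The sketch below copies the subcommands and flags verbatim from the flow in this section. `build_pipeline` and `run_pipeline` are hypothetical names for illustration, and the sketch assumes all three CLIs are installed and on `PATH`.

```python
import subprocess
from typing import List


def build_pipeline(source: str, bundle: str, pack: str, title: str) -> List[List[str]]:
    """Return the three hand-off commands as argv lists."""
    return [
        # Normalize the source tree and inventory external figure assets.
        ["doclift", "convert-dir", source, bundle, "--asset-root", source],
        # Optional: build a learner-facing pack from the bundle.
        ["didactopus", "doclift-bundle", bundle, pack, "--course-title", title],
        # Optional: import the bundle into the knowledge store.
        ["groundrecall", "import", bundle, "--mode", "quick"],
    ]


def run_pipeline(commands: List[List[str]]) -> None:
    """Run each command in order, stopping at the first failure."""
    for argv in commands:
        # check=True raises CalledProcessError on a non-zero exit status.
        subprocess.run(argv, check=True)
```

Passing argument lists instead of a single shell string keeps a course title with spaces, such as `"Example Course"`, as one argument without extra quoting. To skip either downstream step, drop its entry from the list before calling `run_pipeline`.
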
## Bundle Layout

```text
out/
  conversion_report.json
  manifest.json
  assets/
    figure_asset_inventory.json
  documents/
    some-doc/
      document.md
      document.layout.json
      document.tables.json
      document.figures.json
```

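A downstream consumer may want to confirm that a bundle has this shape before reading it. The sketch below checks only for the presence of the file names listed in this README. It makes no assumption about the JSON schemas, which are described in `docs/bundle-format.md`, and it skips `assets/figure_asset_inventory.json` because this README does not say whether that file is always emitted. `missing_files` is a hypothetical helper, not part of `doclift`.

```python
from pathlib import Path
from typing import Dict, List

# Bundle-level files named in the layout.
BUNDLE_FILES = ("manifest.json", "conversion_report.json")

# Per-document files named in the layout.
DOCUMENT_FILES = (
    "document.md",
    "document.layout.json",
    "document.tables.json",
    "document.figures.json",
)


def missing_files(bundle_root: str) -> Dict[str, List[str]]:
    """Map each location in the bundle to the expected files it lacks."""
    root = Path(bundle_root)
    missing: Dict[str, List[str]] = {}

    absent = [name for name in BUNDLE_FILES if not (root / name).is_file()]
    if absent:
        missing["."] = absent

    documents = root / "documents"
    if documents.is_dir():
        # Each subdirectory of documents/ is treated as one converted document.
        for doc_dir in sorted(p for p in documents.iterdir() if p.is_dir()):
            absent = [n for n in DOCUMENT_FILES if not (doc_dir / n).is_file()]
            if absent:
                missing[doc_dir.name] = absent
    return missing
```

An empty result means every expected file name was found. For example, `missing_files("/tmp/doclift-bundle")` could be called after `doclift convert-dir` and before handing the bundle to another tool.
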
## Relationship To Other Projects

- `Didactopus` should consume `doclift` bundles rather than own legacy format handling.
- `GroundRecall` can use the same bundles for provenance-aware import.
- other archival or scholarly tooling can reuse the same normalization path without depending on Didactopus.

## License

`doclift` is licensed under the MIT license. See `LICENSE`.