# Course-to-Pack Pipeline

The course-to-pack pipeline turns source material into a Didactopus draft domain pack.

## Current code path

The main building blocks are:

- `didactopus.document_adapters`: normalizes source files into `NormalizedDocument`.
- `didactopus.topic_ingest` and `didactopus.course_ingest`: build `NormalizedCourse` data and extract concept candidates.
- `didactopus.rule_policy`: applies deterministic cleanup and heuristic rules.
- `didactopus.pack_emitter`: emits pack files and review/conflict artifacts.

## Supported source types

The repository currently accepts:

- Markdown
- plain text
- HTML
- PDF-ish text
- DOCX-ish text
- PPTX-ish text

Binary-format adapters are interface-stable but still intentionally simple.

## Intermediate structures

The ingestion path works through these data shapes:

- `NormalizedDocument`
- `NormalizedCourse`
- `TopicBundle`
- `ConceptCandidate`
- `DraftPack`

## Current emitted artifacts

The pack emitter writes:

- `pack.yaml`
- `concepts.yaml`
- `roadmap.yaml`
- `projects.yaml`
- `rubrics.yaml`
- `review_report.md`
- `conflict_report.md`
- `license_attribution.json`
- `source_corpus.json`
- `knowledge_graph.json`

`source_corpus.json` is the main grounded-text artifact. It preserves lesson bodies, objectives, exercises, and source references from the ingested material, so downstream tutoring or evaluation can rely on source-derived text instead of only the distilled concept graph.

`knowledge_graph.json` is the graph-first artifact. It preserves typed nodes and justified edges for sources, modules, lessons, concepts, assessment signals, and prerequisite/support relations. Later Didactopus retrieval and tutoring flows can use this graph to explain why a concept appears, what supports it, and which source material grounds it.
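As a rough illustration of how a downstream flow might consume `knowledge_graph.json`, here is a minimal sketch. The field names (`nodes`, `edges`, `src`, `dst`, `type`, `justification`) and node IDs are illustrative assumptions about the shape, not the actual schema:

```python
# Illustrative graph in an assumed shape; a real consumer would load it with
# json.load() from the emitted domain-pack directory.
graph = {
    "nodes": [
        {"id": "concept:entropy", "type": "concept"},
        {"id": "concept:probability", "type": "concept"},
        {"id": "lesson:unit-1", "type": "lesson"},
    ],
    "edges": [
        {"src": "concept:probability", "dst": "concept:entropy",
         "type": "prerequisite", "justification": "appears earlier in content order"},
        {"src": "lesson:unit-1", "dst": "concept:entropy",
         "type": "supports", "justification": "lesson body defines entropy"},
    ],
}

def explain(graph, concept_id):
    """Collect the justified edges pointing at a concept: why it appears,
    what supports it, and which source material grounds it."""
    return [
        (edge["type"], edge["src"], edge["justification"])
        for edge in graph["edges"]
        if edge["dst"] == concept_id
    ]

for kind, src, why in explain(graph, "concept:entropy"):
    print(f"{kind} <- {src}: {why}")
```

The point is only that a typed, justified edge list makes "why is this concept here?" answerable by a plain lookup, with no model in the loop.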
## Rule layer

The current default rules:

- infer prerequisites from content order
- merge duplicate concept candidates by title
- flag modules that look project-like
- flag modules or concepts with weak extracted assessment signals

These rules are intentionally small and deterministic, so they are easy to inspect and patch.

## Known limitations

- title-cased phrases can still become noisy concept candidates
- extracted mastery signals remain weak for many source styles
- project extraction is conservative
- document parsing for PDF/DOCX/PPTX is still lightweight

## Reference demo

The end-to-end reference flow in this repository is:

```bash
python -m didactopus.ocw_information_entropy_demo
```

That command ingests the MIT OCW Information and Entropy source file or directory tree in `examples/ocw-information-entropy/`, emits a draft pack into `domain-packs/mit-ocw-information-entropy/`, writes a grounded `source_corpus.json`, runs a deterministic agentic learner over the generated path, and writes downstream skill/visualization artifacts.
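To give a feel for how small the rule layer's deterministic rules are, the merge-duplicates-by-title rule could be sketched as below. The dict-based candidate shape and the helper name `merge_by_title` are hypothetical, not the actual `didactopus.rule_policy` API:

```python
def merge_by_title(candidates):
    """Merge concept candidates whose titles match case-insensitively,
    keeping the first-seen title and pooling source references."""
    merged = {}
    for cand in candidates:
        key = cand["title"].strip().lower()
        if key in merged:
            merged[key]["sources"].extend(cand["sources"])
        else:
            # Copy the sources list so the input candidates stay untouched.
            merged[key] = {"title": cand["title"], "sources": list(cand["sources"])}
    # Dicts preserve insertion order, so output order is deterministic too.
    return list(merged.values())

concepts = merge_by_title([
    {"title": "Entropy", "sources": ["unit-1"]},
    {"title": "entropy", "sources": ["unit-3"]},
    {"title": "Huffman Coding", "sources": ["unit-2"]},
])
# concepts[0] keeps the first-seen title "Entropy" with both source refs pooled.
```

A pure function over plain data like this is trivial to inspect, patch, and unit-test, which is the stated design goal of the rule layer.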