# Course-to-Pack Pipeline

The course-to-pack pipeline turns source material into a Didactopus draft domain pack.

## Current code path

The main building blocks are:

- `didactopus.document_adapters`: normalizes source files into `NormalizedDocument`.
- `didactopus.topic_ingest` and `didactopus.course_ingest`: build `NormalizedCourse` data and extract concept candidates.
- `didactopus.rule_policy`: applies deterministic cleanup and heuristic rules.
- `didactopus.pack_emitter`: emits pack files and review/conflict artifacts.

## Supported source types

The repository currently accepts:

- Markdown
- plain text
- HTML
- PDF-ish text
- DOCX-ish text
- PPTX-ish text

Binary-format adapters are interface-stable but still intentionally simple.

## Intermediate structures

The ingestion path works through these data shapes:

- `NormalizedDocument`
- `NormalizedCourse`
- `TopicBundle`
- `ConceptCandidate`
- `DraftPack`

## Current emitted artifacts

The pack emitter writes:

- `pack.yaml`
- `concepts.yaml`
- `roadmap.yaml`
- `projects.yaml`
- `rubrics.yaml`
- `review_report.md`
- `conflict_report.md`
- `license_attribution.json`
- `source_corpus.json`
- `knowledge_graph.json`

`source_corpus.json` is the main grounded-text artifact. It preserves lesson bodies, objectives, exercises, and source references from the ingested material, so downstream tutoring or evaluation can rely on source-derived text instead of only the distilled concept graph.

`knowledge_graph.json` is the graph-first artifact. It preserves typed nodes and justified edges for sources, modules, lessons, concepts, assessment signals, and prerequisite/support relations. Later Didactopus retrieval and tutoring flows can use this graph to explain why a concept appears, what supports it, and which source material grounds it.
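As a rough illustration of how a downstream flow might consume `knowledge_graph.json`, here is a minimal sketch. The field names (`nodes`, `edges`, `src`, `dst`, `type`, `justification`) and node IDs are illustrative assumptions about the shape, not the actual schema:

```python
# Illustrative graph in an assumed shape; a real consumer would load it with
# json.load() from the emitted domain-pack directory.
graph = {
    "nodes": [
        {"id": "concept:entropy", "type": "concept"},
        {"id": "concept:probability", "type": "concept"},
        {"id": "lesson:unit-1", "type": "lesson"},
    ],
    "edges": [
        {"src": "concept:probability", "dst": "concept:entropy",
         "type": "prerequisite", "justification": "appears earlier in content order"},
        {"src": "lesson:unit-1", "dst": "concept:entropy",
         "type": "supports", "justification": "lesson body defines entropy"},
    ],
}

def explain(graph, concept_id):
    """Collect the justified edges pointing at a concept: why it appears,
    what supports it, and which source material grounds it."""
    return [
        (edge["type"], edge["src"], edge["justification"])
        for edge in graph["edges"]
        if edge["dst"] == concept_id
    ]

for kind, src, why in explain(graph, "concept:entropy"):
    print(f"{kind} <- {src}: {why}")
```

The point is only that a typed, justified edge list makes "why is this concept here?" answerable by a plain lookup, with no model in the loop.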
## Rule layer

The current default rules:

- infer prerequisites from content order
- merge duplicate concept candidates by title
- flag modules that look project-like
- flag modules or concepts with weak extracted assessment signals

These rules are intentionally small and deterministic, so they are easy to inspect and patch.

## Known limitations

- title-cased phrases can still become noisy concept candidates
- extracted mastery signals remain weak for many source styles
- project extraction is conservative
- document parsing for PDF/DOCX/PPTX is still lightweight

## Reference demo

The end-to-end reference flow in this repository is:

```bash
python -m didactopus.ocw_information_entropy_demo
```

That command ingests the MIT OCW Information and Entropy source file or directory tree in `examples/ocw-information-entropy/`, emits a draft pack into `domain-packs/mit-ocw-information-entropy/`, writes a grounded `source_corpus.json`, runs a deterministic agentic learner over the generated path, and writes downstream skill/visualization artifacts.
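To give a feel for how small the rule layer's deterministic rules are, the merge-duplicates-by-title rule could be sketched as below. The dict-based candidate shape and the helper name `merge_by_title` are hypothetical, not the actual `didactopus.rule_policy` API:

```python
def merge_by_title(candidates):
    """Merge concept candidates whose titles match case-insensitively,
    keeping the first-seen title and pooling source references."""
    merged = {}
    for cand in candidates:
        key = cand["title"].strip().lower()
        if key in merged:
            merged[key]["sources"].extend(cand["sources"])
        else:
            # Copy the sources list so the input candidates stay untouched.
            merged[key] = {"title": cand["title"], "sources": list(cand["sources"])}
    # Dicts preserve insertion order, so output order is deterministic too.
    return list(merged.values())

concepts = merge_by_title([
    {"title": "Entropy", "sources": ["unit-1"]},
    {"title": "entropy", "sources": ["unit-3"]},
    {"title": "Huffman Coding", "sources": ["unit-2"]},
])
# concepts[0] keeps the first-seen title "Entropy" with both source refs pooled.
```

A pure function over plain data like this is trivial to inspect, patch, and unit-test, which is the stated design goal of the rule layer.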