# Course-to-Pack Pipeline
The course-to-pack pipeline turns source material into a Didactopus draft domain pack.
## Current code path
The main building blocks are:
- `didactopus.document_adapters`: normalizes source files into `NormalizedDocument`.
- `didactopus.topic_ingest` and `didactopus.course_ingest`: build `NormalizedCourse` data and extract concept candidates.
- `didactopus.rule_policy`: applies deterministic cleanup and heuristic rules.
- `didactopus.pack_emitter`: emits pack files and review/conflict artifacts.
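The four stages compose into one pass from raw files to a draft pack. The sketch below is a minimal, illustrative version of that flow; the function bodies and field choices are assumptions, not the real module internals.

```python
from dataclasses import dataclass


@dataclass
class NormalizedDocument:
    title: str
    body: str


def adapt(path: str, text: str) -> NormalizedDocument:
    """document_adapters stage: normalize one source file (assumed logic)."""
    return NormalizedDocument(title=path.rsplit(".", 1)[0], body=text.strip())


def build_course(docs: list) -> dict:
    """course_ingest stage: collect documents and naive concept candidates."""
    concepts = []
    for doc in docs:
        words = [w.strip(".,;:") for w in doc.body.split()]
        concepts.extend(w for w in words if w.istitle())
    return {"documents": docs, "concept_candidates": concepts}


def apply_rules(course: dict) -> dict:
    """rule_policy stage: deterministic cleanup (dedupe candidates by title)."""
    seen, merged = set(), []
    for c in course["concept_candidates"]:
        if c.lower() not in seen:
            seen.add(c.lower())
            merged.append(c)
    course["concept_candidates"] = merged
    return course


def emit_pack(course: dict) -> dict:
    """pack_emitter stage: produce a draft-pack dict instead of files."""
    return {"concepts": course["concept_candidates"]}


docs = [adapt("entropy.md", "Entropy measures Uncertainty. Entropy is additive.")]
pack = emit_pack(apply_rules(build_course(docs)))
print(pack)  # → {'concepts': ['Entropy', 'Uncertainty']}
```

The point of the sketch is the staging, not the heuristics: each stage takes the previous stage's output, so any stage can be inspected or swapped in isolation.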
## Supported source types
The repository currently accepts:
- Markdown
- plain text
- HTML
- PDF-ish text
- DOCX-ish text
- PPTX-ish text
Binary-format adapters are interface-stable but still intentionally simple.
## Intermediate structures
The ingestion path works through these data shapes:
- `NormalizedDocument`
- `NormalizedCourse`
- `TopicBundle`
- `ConceptCandidate`
- `DraftPack`
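One way to picture how these shapes nest is as plain dataclasses. The field names below are illustrative assumptions; only the class names come from the docs.

```python
from dataclasses import dataclass


@dataclass
class NormalizedDocument:
    source_path: str
    text: str


@dataclass
class TopicBundle:
    topic: str
    documents: list  # list[NormalizedDocument]


@dataclass
class NormalizedCourse:
    title: str
    topics: list  # list[TopicBundle]


@dataclass
class ConceptCandidate:
    title: str
    evidence: list  # source snippets backing the candidate


@dataclass
class DraftPack:
    course: NormalizedCourse
    concepts: list  # list[ConceptCandidate]


doc = NormalizedDocument("lesson1.md", "Entropy quantifies uncertainty.")
bundle = TopicBundle("entropy", [doc])
course = NormalizedCourse("Information and Entropy", [bundle])
candidate = ConceptCandidate("Entropy", [doc.text])
pack = DraftPack(course, [candidate])
print(pack.concepts[0].title)  # → Entropy
```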
## Current emitted artifacts
The pack emitter writes:
- `pack.yaml`
- `concepts.yaml`
- `roadmap.yaml`
- `projects.yaml`
- `rubrics.yaml`
- `review_report.md`
- `conflict_report.md`
- `license_attribution.json`
- `source_corpus.json`
- `knowledge_graph.json`
`source_corpus.json` is the main grounded-text artifact. It preserves lesson bodies, objectives,
exercises, and source references from the ingested material, so downstream tutoring or evaluation
can rely on source-derived text instead of only the distilled concept graph.
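To make the grounding concrete, here is one plausible per-lesson record and how a downstream consumer might read it. The exact schema is not documented here, so this layout is an assumption built only from the fields named above (bodies, objectives, exercises, source references).

```python
import json

# Assumed source_corpus.json layout; every key below is illustrative.
corpus_text = json.dumps({
    "lessons": [
        {
            "title": "Entropy",
            "body": "Entropy quantifies uncertainty in a distribution.",
            "objectives": ["Define entropy"],
            "exercises": ["Compute H(X) for a fair coin."],
            "source_refs": ["lesson1.md"],
        }
    ]
})

corpus = json.loads(corpus_text)
# Downstream tutoring can quote source-derived text directly:
for lesson in corpus["lessons"]:
    print(lesson["title"], "->", lesson["body"])
```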
`knowledge_graph.json` is the graph-first artifact. It preserves typed nodes and justified edges
for sources, modules, lessons, concepts, assessment signals, and prerequisite/support relations.
Later Didactopus retrieval and tutoring flows can use this graph to explain why a concept appears,
what supports it, and which source material grounds it.
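The "explain why a concept appears" use case amounts to walking the concept's incoming justified edges. The graph schema below is a guess consistent with the node and edge kinds named above, not the real file format.

```python
# Assumed knowledge_graph.json layout: typed nodes plus justified edges.
graph = {
    "nodes": [
        {"id": "src:1", "type": "source", "label": "lesson1.md"},
        {"id": "lesson:1", "type": "lesson", "label": "Entropy basics"},
        {"id": "concept:entropy", "type": "concept", "label": "Entropy"},
    ],
    "edges": [
        {"src": "lesson:1", "dst": "concept:entropy",
         "type": "supports", "justification": "lesson defines entropy"},
        {"src": "src:1", "dst": "lesson:1",
         "type": "grounds", "justification": "lesson extracted from source"},
    ],
}


def explain(concept_id: str) -> list:
    """Collect incoming edges to say why a concept node appears."""
    return [
        (e["src"], e["type"], e["justification"])
        for e in graph["edges"]
        if e["dst"] == concept_id
    ]


print(explain("concept:entropy"))
# → [('lesson:1', 'supports', 'lesson defines entropy')]
```

A second hop from `lesson:1` back along its `grounds` edge would recover the source node, which is the "which source material grounds it" part of the question.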
## Rule layer
The current default rules:
- infer prerequisites from content order
- merge duplicate concept candidates by title
- flag modules that look project-like
- flag modules or concepts with weak extracted assessment signals
These rules are intentionally small and deterministic. They are meant to be easy to inspect and patch.
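As an example of how small and deterministic these rules are, the order-based prerequisite rule can be stated in a few lines. This is an illustrative reconstruction, not the actual `rule_policy` code; the real rule may use a different window than "the immediately preceding concept".

```python
def infer_prerequisites(concepts_in_order: list) -> dict:
    """Assume each concept's prerequisite is the concept just before it."""
    prereqs = {}
    for i, concept in enumerate(concepts_in_order):
        prereqs[concept] = concepts_in_order[:i][-1:]  # at most one: the previous
    return prereqs


order = ["Probability", "Entropy", "Mutual Information"]
print(infer_prerequisites(order))
# → {'Probability': [], 'Entropy': ['Probability'], 'Mutual Information': ['Entropy']}
```

A rule this shape is trivially inspectable and patchable: changing the slice changes the prerequisite window, and nothing else in the pipeline needs to know.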
## Known limitations
- title-cased phrases can still become noisy concept candidates
- extracted mastery signals remain weak for many source styles
- project extraction is conservative
- document parsing for PDF/DOCX/PPTX is still lightweight
## Reference demo
The end-to-end reference flow in this repository is:
```
python -m didactopus.ocw_information_entropy_demo
```
That command:

- ingests the MIT OCW Information and Entropy source file or directory tree in `examples/ocw-information-entropy/`
- emits a draft pack into `domain-packs/mit-ocw-information-entropy/`
- writes a grounded `source_corpus.json`
- runs a deterministic agentic learner over the generated path
- writes downstream skill/visualization artifacts
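After the demo runs, the emitted artifacts can be inspected directly from the pack directory named above. The snippet only assumes that directory and the `source_corpus.json` filename from this document; it degrades gracefully if the demo has not been run yet.

```python
import json
from pathlib import Path

# Pack directory as named in the docs; run the demo first to populate it.
pack_dir = Path("domain-packs/mit-ocw-information-entropy")
corpus_path = pack_dir / "source_corpus.json"

if corpus_path.exists():
    corpus = json.loads(corpus_path.read_text())
    print("corpus top-level keys:", sorted(corpus)[:5])
else:
    print(corpus_path, "not found; run the demo first")
```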