Course-to-Pack Pipeline

The course-to-pack pipeline turns source material into a Didactopus draft domain pack.

Current code path

The main building blocks are:

  • didactopus.document_adapters: normalize source files into NormalizedDocument.
  • didactopus.topic_ingest and didactopus.course_ingest: build NormalizedCourse data and extract concept candidates.
  • didactopus.rule_policy: apply deterministic cleanup and heuristic rules.
  • didactopus.pack_emitter: emit pack files and review/conflict artifacts.
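
The stages compose linearly. The sketch below illustrates that flow only; every function name and data shape in it is a hypothetical stand-in, not the API of the modules listed above.

```python
# Hypothetical stand-ins for the four stages. The real Didactopus modules
# expose their own functions and richer data types.

def normalize(raw: str) -> dict:
    """Adapter stage: raw text -> a simplified normalized document."""
    lines = [ln.strip() for ln in raw.splitlines() if ln.strip()]
    return {"title": lines[0] if lines else "", "body": lines[1:]}

def extract_candidates(doc: dict) -> list[str]:
    """Ingest stage: treat every body line as a naive concept candidate."""
    return list(doc["body"])

def apply_rules(candidates: list[str]) -> list[str]:
    """Rule stage: deterministic cleanup (drop repeats, keep first-seen order)."""
    seen: set[str] = set()
    kept: list[str] = []
    for cand in candidates:
        if cand not in seen:
            seen.add(cand)
            kept.append(cand)
    return kept

def emit(title: str, concepts: list[str]) -> dict[str, str]:
    """Emitter stage: map output file names to their text content."""
    return {
        "pack.yaml": f"name: {title}\n",
        "concepts.yaml": "".join(f"- {c}\n" for c in concepts),
    }

def run_pipeline(raw: str) -> dict[str, str]:
    doc = normalize(raw)
    return emit(doc["title"], apply_rules(extract_candidates(doc)))
```

Calling `run_pipeline("Entropy\nBits\nCodes\nBits\n")` yields a `concepts.yaml` entry that lists `Bits` and `Codes` once each.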

Supported source types

The repository currently accepts:

  • Markdown
  • plain text
  • HTML
  • PDF-ish text
  • DOCX-ish text
  • PPTX-ish text

The binary-format adapters (PDF, DOCX, PPTX) have stable interfaces but are still intentionally simple.
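
One plausible way to route a file to an adapter is by its extension. This is a minimal sketch under that assumption; the `SOURCE_KINDS` table and `detect_source_kind` are hypothetical names, and the actual adapter selection in didactopus.document_adapters may work differently.

```python
from pathlib import Path

# Hypothetical extension table covering the source types listed above.
SOURCE_KINDS = {
    ".md": "markdown",
    ".markdown": "markdown",
    ".txt": "text",
    ".html": "html",
    ".htm": "html",
    ".pdf": "pdf",
    ".docx": "docx",
    ".pptx": "pptx",
}

def detect_source_kind(path: str) -> str:
    """Return the source kind for a path, or raise ValueError if unsupported."""
    suffix = Path(path).suffix.lower()
    try:
        return SOURCE_KINDS[suffix]
    except KeyError:
        raise ValueError(f"unsupported source type: {suffix or path!r}") from None
```

Failing loudly on an unknown extension keeps unsupported files from silently entering the pipeline as empty documents.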

Intermediate structures

The ingestion path works through these data shapes:

  • NormalizedDocument
  • NormalizedCourse
  • TopicBundle
  • ConceptCandidate
  • DraftPack
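
To make the shapes concrete, here is a rough sketch of how they might nest. Only the class names come from this document; every field name and every containment relationship below is an illustrative assumption, not the actual Didactopus definition.

```python
from dataclasses import dataclass, field

# All fields are assumptions made for illustration.

@dataclass
class NormalizedDocument:
    source_path: str
    title: str
    text: str

@dataclass
class NormalizedCourse:
    title: str
    documents: list[NormalizedDocument] = field(default_factory=list)

@dataclass
class TopicBundle:
    topic: str
    courses: list[NormalizedCourse] = field(default_factory=list)

@dataclass
class ConceptCandidate:
    title: str
    source_path: str

@dataclass
class DraftPack:
    name: str
    concepts: list[ConceptCandidate] = field(default_factory=list)

doc = NormalizedDocument("unit1.md", "Bits", "A bit is a binary choice.")
course = NormalizedCourse("Information and Entropy", [doc])
bundle = TopicBundle("information-theory", [course])
pack = DraftPack(
    "draft",
    [ConceptCandidate(d.title, d.source_path) for c in bundle.courses for d in c.documents],
)
```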

Current emitted artifacts

The pack emitter writes:

  • pack.yaml
  • concepts.yaml
  • roadmap.yaml
  • projects.yaml
  • rubrics.yaml
  • review_report.md
  • conflict_report.md
  • license_attribution.json
  • source_corpus.json
  • knowledge_graph.json
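
A quick way to sanity-check an emitted pack is to confirm that each of the files listed above is present. The helper below is a small sketch for that purpose; `missing_artifacts` is a hypothetical name and not part of Didactopus.

```python
from pathlib import Path

# File names taken from the list above.
EXPECTED_ARTIFACTS = (
    "pack.yaml",
    "concepts.yaml",
    "roadmap.yaml",
    "projects.yaml",
    "rubrics.yaml",
    "review_report.md",
    "conflict_report.md",
    "license_attribution.json",
    "source_corpus.json",
    "knowledge_graph.json",
)

def missing_artifacts(pack_dir: str) -> list[str]:
    """Return the expected artifact names that are not files in pack_dir."""
    root = Path(pack_dir)
    return [name for name in EXPECTED_ARTIFACTS if not (root / name).is_file()]
```

An empty return value means the directory contains every expected file; it says nothing about whether their contents are valid.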

source_corpus.json is the main grounded-text artifact. It preserves lesson bodies, objectives, exercises, and source references from the ingested material so downstream tutoring or evaluation can rely on source-derived text instead of only the distilled concept graph.

knowledge_graph.json is the graph-first artifact. It preserves typed nodes and justified edges for sources, modules, lessons, concepts, assessment signals, and prerequisite/support relations. Later Didactopus retrieval and tutoring flows can use this graph to explain why a concept appears, what supports it, and which source material grounds it.
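
As an illustration of the "explain why a concept appears" use case, the sketch below walks the incoming edges of a node. The node and edge keys (`id`, `type`, `source`, `target`, `relation`, `justification`) and the sample data are assumptions made for this example; the real knowledge_graph.json schema may differ.

```python
# Assumed shape for illustration only, standing in for a parsed knowledge_graph.json.
graph = {
    "nodes": [
        {"id": "lesson:1", "type": "lesson"},
        {"id": "concept:entropy", "type": "concept"},
        {"id": "concept:channel-capacity", "type": "concept"},
    ],
    "edges": [
        {
            "source": "lesson:1",
            "target": "concept:entropy",
            "relation": "introduces",
            "justification": "appears in lesson heading",
        },
        {
            "source": "concept:entropy",
            "target": "concept:channel-capacity",
            "relation": "prerequisite",
            "justification": "content order",
        },
    ],
}

def explain(graph: dict, node_id: str) -> list[str]:
    """Describe every edge that points at node_id, including its justification."""
    return [
        f'{e["source"]} {e["relation"]} {node_id}: {e["justification"]}'
        for e in graph["edges"]
        if e["target"] == node_id
    ]
```

Because each edge carries its own justification, a tutoring flow can surface the reason for a link without re-deriving it from the source text.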

Rule layer

The current default rules:

  • infer prerequisites from content order
  • merge duplicate concept candidates by title
  • flag modules that look project-like
  • flag modules or concepts with weak extracted assessment signals

These rules are intentionally small and deterministic. They are meant to be easy to inspect and patch.
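
Two of these rules are simple enough to sketch in a few lines. The version below is an approximation, not the code in didactopus.rule_policy: it assumes titles are compared after case and whitespace normalization, and that each concept depends only on the one immediately before it. The function names and the dict-based candidate shape are hypothetical.

```python
def merge_by_title(candidates: list[dict]) -> list[dict]:
    """Merge candidates whose titles match after case/whitespace normalization."""
    merged: dict[str, dict] = {}
    for cand in candidates:
        key = " ".join(cand["title"].lower().split())
        if key in merged:
            # Keep the first-seen title, accumulate sources from later duplicates.
            merged[key]["sources"].extend(cand["sources"])
        else:
            merged[key] = {"title": cand["title"], "sources": list(cand["sources"])}
    return list(merged.values())

def prerequisites_from_order(concepts: list[dict]) -> dict[str, list[str]]:
    """Assume each concept requires the concept that precedes it in content order."""
    titles = [c["title"] for c in concepts]
    return {
        title: ([titles[i - 1]] if i > 0 else [])
        for i, title in enumerate(titles)
    }

candidates = [
    {"title": "Entropy", "sources": ["lesson-1"]},
    {"title": "Channel Capacity", "sources": ["lesson-2"]},
    {"title": "entropy ", "sources": ["lesson-3"]},
]
concepts = merge_by_title(candidates)
prereqs = prerequisites_from_order(concepts)
```

Both functions are pure and order-preserving, so the same input always produces the same pack, which is what makes rules like these easy to inspect and patch.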

Known limitations

  • title-cased phrases can still become noisy concept candidates
  • extracted mastery signals remain weak for many source styles
  • project extraction is conservative
  • document parsing for PDF/DOCX/PPTX is still lightweight

Reference demo

The end-to-end reference flow in this repository is:

python -m didactopus.ocw_information_entropy_demo

That command ingests the MIT OCW Information and Entropy source material (a single file or a directory tree) from examples/ocw-information-entropy/ and emits a draft pack into domain-packs/mit-ocw-information-entropy/, including a grounded source_corpus.json. It then runs a deterministic agentic learner over the generated path and writes downstream skill and visualization artifacts.