Didactopus/docs/multisource-ingestion.md

35 lines
966 B
Markdown

# Multi-Source Ingestion
The multi-source ingestion layer lets Didactopus build one draft domain pack from
several heterogeneous inputs describing the same course or topic.
## Why this matters
Real course material is often scattered across:
- syllabus files
- lesson notes
- transcripts
- assignment sheets
- HTML pages
- supplemental markdown
A single-source parser is too narrow for serious curriculum distillation.
## Pipeline
1. detect adapter by file extension or naming convention
2. normalize each source into a `NormalizedSourceRecord`
3. merge sources into a `NormalizedCourse`
4. extract concept candidates
5. run rule-policy passes
6. emit merged draft pack
7. emit conflict report and attribution manifest
## Conflict report categories
- duplicate lesson titles across sources
- repeated key terms with different local contexts
- modules with no explicit exercises
- project-like content needing manual review
- lessons with thin mastery signals