Added multi-source course ingestion.

This commit is contained in:
parent 8b4359f4cc
commit 8defaab1c2

README.md (34 lines changed)
@@ -8,6 +8,39 @@
 ## Recent revisions
 
+### Multi-Source Course Ingestion
+
+This revision adds a **Multi-Source Course Ingestion Layer**.
+
+The pipeline can now accept multiple source files representing the same course or
+topic domain, normalize them into a shared intermediate representation, merge them,
+and emit a single draft Didactopus pack plus a conflict report.
+
+#### Supported scaffold source types
+
+Current scaffold adapters:
+- Markdown (`.md`)
+- Plain text (`.txt`)
+- HTML-ish text (`.html`, `.htm`)
+- Transcript text (`.transcript.txt`)
+- Syllabus text (`.syllabus.txt`)
+
+This revision is intentionally adapter-oriented, so future PDF, slide, and DOCX
+adapters can be added behind the same interface.
+
+#### What is included
+
+- multi-source adapter dispatch
+- normalized source records
+- source merge logic
+- cross-source terminology conflict report
+- duplicate lesson/title detection
+- merged draft pack emission
+- merged attribution manifest
+- sample multi-source inputs
+- sample merged output pack
+
 ### Course Ingestion Pipeline
 
 This revision adds a **Course-to-Pack Ingestion Pipeline** plus a **stable rule-policy adapter layer**.
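The suffix-based adapter dispatch implied by the list above can be sketched as a stand-alone snippet. This mirrors the behavior of the `detect_source_type` helper added elsewhere in this commit, but the function name here is illustrative, not the project's API:

```python
from pathlib import Path

# Stand-alone sketch of suffix-based adapter dispatch: compound
# suffixes (.transcript.txt, .syllabus.txt) are checked before the
# plain single-suffix cases so they are not swallowed by ".txt".
def sketch_detect_source_type(path: str) -> str:
    name = Path(path).name.lower()
    suffix = Path(path).suffix.lower()
    if name.endswith(".transcript.txt"):
        return "transcript"
    if name.endswith(".syllabus.txt"):
        return "syllabus"
    if suffix == ".md":
        return "markdown"
    if suffix in {".html", ".htm"}:
        return "html"
    if suffix == ".txt":
        return "text"
    return "unknown"
```

Ordering matters here: checking the compound naming conventions first is what lets `lecture.transcript.txt` dispatch to the transcript adapter rather than the plain-text one.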
@@ -182,3 +215,4 @@ didactopus/
 ```
@@ -9,3 +9,8 @@ rule_policy:
   enable_duplicate_term_merge_rule: true
   enable_project_detection_rule: true
   enable_review_flags: true
+
+multisource:
+  detect_duplicate_lessons: true
+  detect_term_conflicts: true
+  merge_same_named_lessons: true
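The effect of the new `multisource` section can be sketched with a stand-alone default-merging snippet. The project itself uses pydantic models for this; the dict-based fallback below is only an illustration of the same behavior:

```python
# Defaults mirror the MultisourceConfig model added in this commit.
MULTISOURCE_DEFAULTS = {
    "detect_duplicate_lessons": True,
    "detect_term_conflicts": True,
    "merge_same_named_lessons": True,
}

def load_multisource(raw_config: dict) -> dict:
    # A missing key, or a missing multisource: section entirely,
    # falls back to the defaults; explicit values win.
    return {**MULTISOURCE_DEFAULTS, **raw_config.get("multisource", {})}
```

So an existing config file without a `multisource:` block keeps working unchanged, which is the point of giving every new key a default.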
docs/faq.md (39 lines changed)

@@ -1,32 +1,27 @@
 # FAQ
 
-## Why add course ingestion?
-
-Because many open or user-supplied courses already encode:
-- topic sequencing
-- learning objectives
-- exercises
-- project prompts
-- terminology
-
-That makes them strong starting material for draft domain packs.
-
-## Why not just embed all course text?
-
-Because Didactopus needs structured artifacts:
+## Why multi-source ingestion?
+
+Because course structure is usually distributed across several files rather than
+perfectly contained in one source.
+
+## What kinds of conflicts can arise?
+
+Common examples:
+- the same lesson with slightly different titles
+- inconsistent terminology across notes and transcripts
+- exercises present in one source but absent in another
+- project prompts implied in one file and explicit in another
+
+## Does the system resolve all conflicts automatically?
+
+No. It produces a merged draft pack and a conflict report for human review.
+
+## Why not rely only on embeddings for this?
+
+Because Didactopus needs explicit structures such as:
 - concepts
 - prerequisites
 - projects
 - rubrics
-- mastery cues
+- checkpoints
 
 A flat embedding store is not enough for mastery planning.
 
 ## Why avoid PyKE or another heavy rule engine here?
 
 Dependency stability matters. The current rule-policy adapter keeps rules simple,
 transparent, and dependency-light.
 
 ## Can the rule layer be replaced later?
 
 Yes. The adapter is designed so a future engine can be plugged in behind the same interface.
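The duplicate-title case from the FAQ's conflict list can be illustrated with a small stand-alone sketch over toy data (plain tuples, not the project's schema):

```python
from collections import defaultdict

# Toy sources: (source_name, lesson_title) pairs. The same lesson
# appears in two sources with different casing.
lessons = [
    ("syllabus.txt", "Prior and Posterior"),
    ("notes.md", "Prior and posterior"),
    ("notes.md", "Probability Basics"),
]

# Group by case-folded title, then flag titles seen in >1 source.
by_title = defaultdict(set)
for source, title in lessons:
    by_title[title.lower()].add(source)

conflicts = [t for t, sources in by_title.items() if len(sources) > 1]
```

This is the shape of the check the real conflict reporter performs: detection only, with resolution deferred to human review.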
(new file) @@ -0,0 +1,34 @@
+# Multi-Source Ingestion
+
+The multi-source ingestion layer lets Didactopus build one draft domain pack from
+several heterogeneous inputs describing the same course or topic.
+
+## Why this matters
+
+Real course material is often scattered across:
+- syllabus files
+- lesson notes
+- transcripts
+- assignment sheets
+- HTML pages
+- supplemental markdown
+
+A single-source parser is too narrow for serious curriculum distillation.
+
+## Pipeline
+
+1. detect adapter by file extension or naming convention
+2. normalize each source into a `NormalizedSourceRecord`
+3. merge sources into a `NormalizedCourse`
+4. extract concept candidates
+5. run rule-policy passes
+6. emit merged draft pack
+7. emit conflict report and attribution manifest
+
+## Conflict report categories
+
+- duplicate lesson titles across sources
+- repeated key terms with different local contexts
+- modules with no explicit exercises
+- project-like content needing manual review
+- lessons with thin mastery signals
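The merge step (step 3 above) can be sketched stand-alone with plain dicts. The real implementation works over pydantic `Module`/`Lesson` models; the function below only shows the merge-by-title idea:

```python
def merge_modules(sources: list[list[dict]]) -> list[dict]:
    """Merge module lists from several sources by module title.

    Stand-alone sketch: modules sharing a title are collapsed into
    one module whose lessons are the concatenation of both sides.
    """
    by_title: dict[str, dict] = {}
    for modules in sources:
        for module in modules:
            target = by_title.setdefault(
                module["title"], {"title": module["title"], "lessons": []}
            )
            target["lessons"].extend(module["lessons"])
    return list(by_title.values())
```

Because `setdefault` preserves insertion order, merged modules come out in first-seen order, which keeps the emitted roadmap stable across reruns with the same inputs.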
(new file) @@ -0,0 +1,3 @@
+# Conflict Report
+
+- Key term 'prior' appears in multiple lesson contexts: Prior and Posterior
@@ -1,5 +1,20 @@
 {
-  "source_name": "Sample Course",
-  "source_url": "",
-  "rights_note": "REVIEW REQUIRED"
+  "rights_note": "REVIEW REQUIRED",
+  "sources": [
+    {
+      "source_name": "sample_course_syllabus.syllabus.txt",
+      "source_type": "syllabus",
+      "source_path": "examples/sample_course_syllabus.syllabus.txt"
+    },
+    {
+      "source_name": "sample_course_notes.md",
+      "source_type": "markdown",
+      "source_path": "examples/sample_course_notes.md"
+    },
+    {
+      "source_name": "sample_course_lecture.transcript.txt",
+      "source_type": "transcript",
+      "source_path": "examples/sample_course_lecture.transcript.txt"
+    }
+  ]
 }
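A consumer of the new attribution manifest can sanity-check its shape with nothing but the standard library. A stand-alone sketch, assuming only the structure shown above:

```python
import json

# A trimmed copy of the manifest shape from this commit.
manifest = json.loads("""
{
  "rights_note": "REVIEW REQUIRED",
  "sources": [
    {"source_name": "sample_course_notes.md",
     "source_type": "markdown",
     "source_path": "examples/sample_course_notes.md"}
  ]
}
""")

# Every per-source entry must carry these three keys.
required = {"source_name", "source_type", "source_path"}
ok = all(required <= src.keys() for src in manifest["sources"])
```

The manifest moves attribution from one flat `source_name` to a per-source list, so downstream rights review can be done file by file.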
@@ -4,7 +4,8 @@ version: 0.1.0-draft
 schema_version: '1'
 didactopus_min_version: 0.1.0
 didactopus_max_version: 0.9.99
-description: Draft pack generated from sample course.
+description: Draft pack generated from multi-source course inputs for 'Introductory
+  Bayesian Inference'.
 author: Wesley R. Elsberry
 license: REVIEW-REQUIRED
 dependencies: []
(new file) @@ -0,0 +1,5 @@
+# Introductory Bayesian Inference
+
+## Module 2: Bayesian Updating
+### Prior and Posterior
+In this lecture we revisit Prior and Posterior and discuss model assumptions, bias, and uncertainty.
(new file) @@ -0,0 +1,16 @@
+# Introductory Bayesian Inference
+
+## Module 1: Foundations
+### Descriptive Statistics
+Descriptive Statistics introduces measures of center and spread.
+
+### Probability Basics
+Probability Basics introduces events, likelihood, and Bayes-style reasoning.
+
+## Module 2: Bayesian Updating
+### Prior and Posterior
+A Prior expresses assumptions before evidence. Posterior reasoning updates belief after evidence.
+
+### Capstone Mini Project
+- Exercise: Write a short project report comparing priors and posteriors.
+This project asks learners to critique assumptions and produce a small capstone artifact.
(new file) @@ -0,0 +1,16 @@
+# Introductory Bayesian Inference
+
+## Module 1: Foundations
+### Descriptive Statistics
+- Objective: Explain mean, median, and variance.
+- Exercise: Summarize a small dataset.
+
+### Probability Basics
+- Objective: Explain conditional probability.
+- Exercise: Compute a simple conditional probability.
+
+## Module 2: Bayesian Updating
+### Prior and Posterior
+- Objective: Explain a prior distribution.
+- Objective: Explain how evidence changes belief.
+- Exercise: Compare prior and posterior beliefs.
@@ -5,7 +5,7 @@ build-backend = "setuptools.build_meta"
 [project]
 name = "didactopus"
 version = "0.1.0"
-description = "Didactopus: course-to-pack ingestion scaffold"
+description = "Didactopus: multi-source course-to-pack ingestion scaffold"
 readme = "README.md"
 requires-python = ">=3.10"
 license = {text = "MIT"}
@@ -17,9 +17,16 @@ class RulePolicyConfig(BaseModel):
     enable_review_flags: bool = True
 
 
+class MultisourceConfig(BaseModel):
+    detect_duplicate_lessons: bool = True
+    detect_term_conflicts: bool = True
+    merge_same_named_lessons: bool = True
+
+
 class AppConfig(BaseModel):
     course_ingest: CourseIngestConfig = Field(default_factory=CourseIngestConfig)
     rule_policy: RulePolicyConfig = Field(default_factory=RulePolicyConfig)
+    multisource: MultisourceConfig = Field(default_factory=MultisourceConfig)
 
 
 def load_config(path: str | Path) -> AppConfig:
(new file) @@ -0,0 +1,39 @@
+from __future__ import annotations
+
+from collections import defaultdict
+from .course_schema import NormalizedCourse, ConceptCandidate
+
+
+def detect_duplicate_lessons(course: NormalizedCourse) -> list[str]:
+    seen: dict[str, list[str]] = defaultdict(list)
+    for module in course.modules:
+        for lesson in module.lessons:
+            seen[lesson.title.lower()].append(module.title)
+    flags = []
+    for title, modules in seen.items():
+        if len(modules) > 1:
+            flags.append(f"Lesson title '{title}' appears in multiple modules: {', '.join(sorted(set(modules)))}")
+    return flags
+
+
+def detect_term_conflicts(course: NormalizedCourse) -> list[str]:
+    contexts: dict[str, set[str]] = defaultdict(set)
+    for module in course.modules:
+        for lesson in module.lessons:
+            for term in lesson.key_terms:
+                contexts[term.lower()].add(lesson.title)
+    flags = []
+    for term, lessons in contexts.items():
+        if len(lessons) > 1:
+            flags.append(f"Key term '{term}' appears in multiple lesson contexts: {', '.join(sorted(lessons))}")
+    return flags
+
+
+def detect_thin_concepts(concepts: list[ConceptCandidate]) -> list[str]:
+    flags = []
+    for concept in concepts:
+        if not concept.mastery_signals:
+            flags.append(f"Concept '{concept.title}' has no mastery signals.")
+        if len(concept.description.strip()) < 20:
+            flags.append(f"Concept '{concept.title}' has a very thin description.")
+    return flags
@@ -1,7 +1,8 @@
 from __future__ import annotations
 
 import re
-from .course_schema import NormalizedCourse, Module, Lesson, ConceptCandidate
+from pathlib import Path
+from .course_schema import NormalizedCourse, NormalizedSourceRecord, Module, Lesson, ConceptCandidate
 
 HEADING_RE = re.compile(r"^(#{1,3})\s+(.*)$")
 BULLET_RE = re.compile(r"^\s*[-*+]\s+(.*)$")
@@ -12,6 +13,23 @@ def slugify(text: str) -> str:
     return cleaned or "untitled"
 
 
+def detect_source_type(path: str | Path) -> str:
+    p = Path(path)
+    name = p.name.lower()
+    suffix = p.suffix.lower()
+    if name.endswith(".transcript.txt"):
+        return "transcript"
+    if name.endswith(".syllabus.txt"):
+        return "syllabus"
+    if suffix in {".md"}:
+        return "markdown"
+    if suffix in {".html", ".htm"}:
+        return "html"
+    if suffix in {".txt"}:
+        return "text"
+    return "unknown"
+
+
 def extract_key_terms(text: str, min_term_length: int = 4, max_terms: int = 8) -> list[str]:
     candidates = re.findall(r"\b[A-Z][A-Za-z0-9\-]{%d,}\b" % (min_term_length - 1), text)
     seen = set()
@@ -25,7 +43,7 @@ def extract_key_terms(text: str, min_term_length: int = 4, max_terms: int = 8) -
     return ordered
 
 
-def parse_markdown_course(text: str, title: str, source_name: str = "", source_url: str = "", rights_note: str = "") -> NormalizedCourse:
+def parse_markdown_like(text: str, title: str, source_name: str, source_path: str) -> NormalizedSourceRecord:
     lines = text.splitlines()
     modules: list[Module] = []
     current_module: Module | None = None
@@ -57,7 +75,7 @@ def parse_markdown_course(text: str, title: str, source_name: str = "", source_u
             flush_body()
             if current_lesson is not None and current_module is not None:
                 current_module.lessons.append(current_lesson)
-            current_lesson = Lesson(title=heading)
+            current_lesson = Lesson(title=heading, source_refs=[source_name])
             continue
 
         bullet = BULLET_RE.match(line)
@@ -79,17 +97,59 @@ def parse_markdown_course(text: str, title: str, source_name: str = "", source_u
     if current_module is not None:
         modules.append(current_module)
 
-    course = NormalizedCourse(
-        title=title,
-        source_name=source_name,
-        source_url=source_url,
-        rights_note=rights_note,
-        modules=modules,
-    )
-    for module in course.modules:
+    for module in modules:
         for lesson in module.lessons:
             lesson.key_terms = extract_key_terms(f"{lesson.title}\n{lesson.body}")
-    return course
+    return NormalizedSourceRecord(
+        source_name=source_name,
+        source_type=detect_source_type(source_path),
+        source_path=str(source_path),
+        title=title,
+        modules=modules,
+    )
+
+
+def parse_source_file(path: str | Path, title: str = "") -> NormalizedSourceRecord:
+    p = Path(path)
+    text = p.read_text(encoding="utf-8")
+    inferred_title = title or p.stem.replace("_", " ").replace("-", " ").title()
+    return parse_markdown_like(text=text, title=inferred_title, source_name=p.name, source_path=str(p))
+
+
+def merge_source_records(records: list[NormalizedSourceRecord], course_title: str, rights_note: str = "", merge_same_named_lessons: bool = True) -> NormalizedCourse:
+    modules_by_title: dict[str, Module] = {}
+    for record in records:
+        for module in record.modules:
+            target_module = modules_by_title.setdefault(module.title, Module(title=module.title, lessons=[]))
+            if merge_same_named_lessons:
+                lesson_map = {lesson.title: lesson for lesson in target_module.lessons}
+                for lesson in module.lessons:
+                    if lesson.title in lesson_map:
+                        existing = lesson_map[lesson.title]
+                        if lesson.body and lesson.body not in existing.body:
+                            existing.body = (existing.body + "\n\n" + lesson.body).strip()
+                        for item in lesson.objectives:
+                            if item not in existing.objectives:
+                                existing.objectives.append(item)
+                        for item in lesson.exercises:
+                            if item not in existing.exercises:
+                                existing.exercises.append(item)
+                        for item in lesson.key_terms:
+                            if item not in existing.key_terms:
+                                existing.key_terms.append(item)
+                        for item in lesson.source_refs:
+                            if item not in existing.source_refs:
+                                existing.source_refs.append(item)
+                    else:
+                        target_module.lessons.append(lesson)
+            else:
+                target_module.lessons.extend(module.lessons)
+    return NormalizedCourse(
+        title=course_title,
+        rights_note=rights_note,
+        modules=list(modules_by_title.values()),
+        source_records=records,
+    )
 
 
 def extract_concept_candidates(course: NormalizedCourse) -> list[ConceptCandidate]:
@@ -9,6 +9,7 @@ class Lesson(BaseModel):
     objectives: list[str] = Field(default_factory=list)
     exercises: list[str] = Field(default_factory=list)
     key_terms: list[str] = Field(default_factory=list)
+    source_refs: list[str] = Field(default_factory=list)
 
 
 class Module(BaseModel):
@@ -16,12 +17,21 @@ class Module(BaseModel):
     lessons: list[Lesson] = Field(default_factory=list)
 
 
+class NormalizedSourceRecord(BaseModel):
+    source_name: str
+    source_type: str
+    source_path: str
+    title: str = ""
+    modules: list[Module] = Field(default_factory=list)
+
+
 class NormalizedCourse(BaseModel):
     title: str
     source_name: str = ""
     source_url: str = ""
     rights_note: str = ""
     modules: list[Module] = Field(default_factory=list)
+    source_records: list[NormalizedSourceRecord] = Field(default_factory=list)
 
 
 class ConceptCandidate(BaseModel):
@@ -42,3 +52,4 @@ class DraftPack(BaseModel):
     rubrics: dict
     review_report: list[str] = Field(default_factory=list)
     attribution: dict = Field(default_factory=dict)
+    conflicts: list[str] = Field(default_factory=list)
@@ -4,17 +4,16 @@ import argparse
 from pathlib import Path
 
 from .config import load_config
-from .course_ingest import parse_markdown_course, extract_concept_candidates
+from .course_ingest import parse_source_file, merge_source_records, extract_concept_candidates
 from .rule_policy import RuleContext, build_default_rules, run_rules
+from .conflict_report import detect_duplicate_lessons, detect_term_conflicts, detect_thin_concepts
 from .pack_emitter import build_draft_pack, write_draft_pack
 
 
 def build_parser() -> argparse.ArgumentParser:
-    parser = argparse.ArgumentParser(description="Didactopus course-to-pack ingestion pipeline")
-    parser.add_argument("--input", required=True)
-    parser.add_argument("--title", required=True)
-    parser.add_argument("--source-name", default="")
-    parser.add_argument("--source-url", default="")
+    parser = argparse.ArgumentParser(description="Didactopus multi-source course-to-pack ingestion pipeline")
+    parser.add_argument("--inputs", nargs="+", required=True, help="Input source files")
+    parser.add_argument("--title", required=True, help="Course or topic title")
     parser.add_argument("--rights-note", default="REVIEW REQUIRED")
     parser.add_argument("--output-dir", default="generated-pack")
     parser.add_argument("--config", default="configs/config.example.yaml")
@@ -24,14 +23,13 @@ def build_parser() -> argparse.ArgumentParser:
 def main() -> None:
     args = build_parser().parse_args()
     config = load_config(args.config)
-    text = Path(args.input).read_text(encoding="utf-8")
 
-    course = parse_markdown_course(
-        text=text,
-        title=args.title,
-        source_name=args.source_name,
-        source_url=args.source_url,
+    records = [parse_source_file(path, title=args.title) for path in args.inputs]
+    course = merge_source_records(
+        records=records,
+        course_title=args.title,
         rights_note=args.rights_note,
+        merge_same_named_lessons=config.multisource.merge_same_named_lessons,
     )
     concepts = extract_concept_candidates(course)
     context = RuleContext(course=course, concepts=concepts)
@@ -44,20 +42,30 @@ def main() -> None:
     )
     run_rules(context, rules)
 
+    conflicts = []
+    if config.multisource.detect_duplicate_lessons:
+        conflicts.extend(detect_duplicate_lessons(course))
+    if config.multisource.detect_term_conflicts:
+        conflicts.extend(detect_term_conflicts(course))
+    conflicts.extend(detect_thin_concepts(context.concepts))
+
     draft = build_draft_pack(
         course=course,
         concepts=context.concepts,
         author=config.course_ingest.default_pack_author,
         license_name=config.course_ingest.default_license,
         review_flags=context.review_flags,
+        conflicts=conflicts,
     )
     write_draft_pack(draft, args.output_dir)
 
-    print("== Didactopus Course-to-Pack Ingest ==")
+    print("== Didactopus Multi-Source Course Ingest ==")
     print(f"Course: {course.title}")
+    print(f"Sources: {len(records)}")
     print(f"Modules: {len(course.modules)}")
     print(f"Concept candidates: {len(context.concepts)}")
     print(f"Review flags: {len(context.review_flags)}")
+    print(f"Conflicts: {len(conflicts)}")
     print(f"Output dir: {args.output_dir}")
@@ -6,7 +6,7 @@ import yaml
 from .course_schema import NormalizedCourse, ConceptCandidate, DraftPack
 
 
-def build_draft_pack(course: NormalizedCourse, concepts: list[ConceptCandidate], author: str, license_name: str, review_flags: list[str]) -> DraftPack:
+def build_draft_pack(course: NormalizedCourse, concepts: list[ConceptCandidate], author: str, license_name: str, review_flags: list[str], conflicts: list[str]) -> DraftPack:
     pack_name = course.title.lower().replace(" ", "-")
     pack = {
         "name": pack_name,
@@ -15,7 +15,7 @@ def build_draft_pack(course: NormalizedCourse, concepts: list[ConceptCandidate],
         "schema_version": "1",
         "didactopus_min_version": "0.1.0",
         "didactopus_max_version": "0.9.99",
-        "description": f"Draft pack generated from course source '{course.source_name or course.title}'.",
+        "description": f"Draft pack generated from multi-source course inputs for '{course.title}'.",
         "author": author,
         "license": license_name,
         "dependencies": [],
@@ -61,8 +61,23 @@ def build_draft_pack(course: NormalizedCourse, concepts: list[ConceptCandidate],
         })
     projects = {"projects": project_items}
     rubrics = {"rubrics": [{"id": "draft-rubric", "title": "Draft Rubric", "criteria": ["correctness", "explanation"]}]}
-    attribution = {"source_name": course.source_name, "source_url": course.source_url, "rights_note": course.rights_note}
-    return DraftPack(pack=pack, concepts=concepts_yaml, roadmap=roadmap, projects=projects, rubrics=rubrics, review_report=review_flags, attribution=attribution)
+    attribution = {
+        "rights_note": course.rights_note,
+        "sources": [
+            {"source_name": src.source_name, "source_type": src.source_type, "source_path": src.source_path}
+            for src in course.source_records
+        ],
+    }
+    return DraftPack(
+        pack=pack,
+        concepts=concepts_yaml,
+        roadmap=roadmap,
+        projects=projects,
+        rubrics=rubrics,
+        review_report=review_flags,
+        attribution=attribution,
+        conflicts=conflicts,
+    )
 
 
 def write_draft_pack(pack: DraftPack, outdir: str | Path) -> None:
@@ -73,6 +88,11 @@ def write_draft_pack(pack: DraftPack, outdir: str | Path) -> None:
     (out / "roadmap.yaml").write_text(yaml.safe_dump(pack.roadmap, sort_keys=False), encoding="utf-8")
     (out / "projects.yaml").write_text(yaml.safe_dump(pack.projects, sort_keys=False), encoding="utf-8")
     (out / "rubrics.yaml").write_text(yaml.safe_dump(pack.rubrics, sort_keys=False), encoding="utf-8")
 
     review_lines = ["# Review Report", ""] + [f"- {flag}" for flag in pack.review_report] if pack.review_report else ["# Review Report", "", "- none"]
     (out / "review_report.md").write_text("\n".join(review_lines), encoding="utf-8")
+
+    conflict_lines = ["# Conflict Report", ""] + [f"- {flag}" for flag in pack.conflicts] if pack.conflicts else ["# Conflict Report", "", "- none"]
+    (out / "conflict_report.md").write_text("\n".join(conflict_lines), encoding="utf-8")
+
     (out / "license_attribution.json").write_text(json.dumps(pack.attribution, indent=2), encoding="utf-8")
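The report-line pattern above relies on Python operator precedence: `+` binds tighter than the conditional expression, so the whole concatenation is the "then" branch. A stand-alone check of both branches:

```python
# Non-empty conflicts: header + blank line + one bullet per conflict.
conflicts = ["term clash"]
lines = ["# Conflict Report", ""] + [f"- {c}" for c in conflicts] if conflicts else ["# Conflict Report", "", "- none"]

# Empty conflicts: the falsy list selects the "- none" placeholder.
no_conflicts: list[str] = []
empty = ["# Conflict Report", ""] + [f"- {c}" for c in no_conflicts] if no_conflicts else ["# Conflict Report", "", "- none"]
```

Wrapping the conditional in parentheses would make the precedence explicit without changing behavior.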
(new file) @@ -0,0 +1,15 @@
+from pathlib import Path
+from didactopus.course_ingest import parse_source_file, merge_source_records, extract_concept_candidates
+from didactopus.conflict_report import detect_duplicate_lessons, detect_term_conflicts, detect_thin_concepts
+
+
+def test_conflict_detection(tmp_path: Path) -> None:
+    a = tmp_path / "a.md"
+    b = tmp_path / "b.md"
+    a.write_text("# C\n\n## M1\n### Bayesian Updating\nPrior and Posterior are discussed here.", encoding="utf-8")
+    b.write_text("# C\n\n## M2\n### Bayesian Updating\nPrior and Posterior appear again.", encoding="utf-8")
+    course = merge_source_records([parse_source_file(a, title="Course"), parse_source_file(b, title="Course")], course_title="Course", merge_same_named_lessons=False)
+    concepts = extract_concept_candidates(course)
+    assert isinstance(detect_duplicate_lessons(course), list)
+    assert isinstance(detect_term_conflicts(course), list)
+    assert isinstance(detect_thin_concepts(concepts), list)
(new file) @@ -0,0 +1,23 @@
+from pathlib import Path
+from didactopus.course_ingest import parse_source_file, merge_source_records, extract_concept_candidates
+
+
+def test_merge_source_records(tmp_path: Path) -> None:
+    a = tmp_path / "a.md"
+    b = tmp_path / "b.transcript.txt"
+    a.write_text("# C\n\n## M1\n### L1\n- Objective: Explain A.\nText A.", encoding="utf-8")
+    b.write_text("# C\n\n## M1\n### L1\nExtra transcript detail.", encoding="utf-8")
+
+    records = [parse_source_file(a, title="Course"), parse_source_file(b, title="Course")]
+    course = merge_source_records(records, course_title="Course")
+    assert len(course.modules) == 1
+    assert len(course.modules[0].lessons) == 1
+    assert len(course.modules[0].lessons[0].source_refs) >= 1
+
+
+def test_extract_candidates_from_merged(tmp_path: Path) -> None:
+    a = tmp_path / "a.md"
+    a.write_text("# C\n\n## M1\n### Lesson A\n- Objective: Explain Topic A.\nBody.", encoding="utf-8")
+    course = merge_source_records([parse_source_file(a, title="Course")], course_title="Course")
+    concepts = extract_concept_candidates(course)
+    assert len(concepts) >= 1
(new file) @@ -0,0 +1,17 @@
+from pathlib import Path
+from didactopus.course_ingest import parse_source_file, merge_source_records, extract_concept_candidates
+from didactopus.rule_policy import RuleContext, build_default_rules, run_rules
+from didactopus.pack_emitter import build_draft_pack, write_draft_pack
+
+
+def test_emit_multisource_pack(tmp_path: Path) -> None:
+    src = tmp_path / "course.md"
+    src.write_text("# C\n\n## M1\n### Lesson A\n- Objective: Explain Topic A.\n- Exercise: Do task A.\nTopic A body.", encoding="utf-8")
+    course = merge_source_records([parse_source_file(src, title="Course")], course_title="Course")
+    concepts = extract_concept_candidates(course)
+    ctx = RuleContext(course=course, concepts=concepts)
+    run_rules(ctx, build_default_rules())
+    draft = build_draft_pack(course, ctx.concepts, "Tester", "REVIEW", ctx.review_flags, [])
+    write_draft_pack(draft, tmp_path / "out")
+    assert (tmp_path / "out" / "pack.yaml").exists()
+    assert (tmp_path / "out" / "conflict_report.md").exists()