Added cross-course merger.

welsberr 2026-03-13 06:36:27 -04:00
parent 8defaab1c2
commit 0656f7bbe8
31 changed files with 753 additions and 90 deletions

View File

@ -8,6 +8,41 @@
## Recent revisions
### Course-to-course merger
This revision adds two major capabilities:
- **real document adapter scaffolds** for PDF, DOCX, PPTX, and HTML
- a **cross-course merger** for combining multiple course-derived packs into one stronger domain draft
These additions extend the earlier multi-source ingestion layer from "multiple files for one course"
to "multiple courses or course-like sources for one topic domain."
## What is included
- adapter registry for:
- PDF
- DOCX
- PPTX
- HTML
- Markdown
- text
- normalized document extraction interface
- course bundle ingestion across multiple source documents
- cross-course terminology and overlap analysis
- merged topic-pack emitter
- cross-course conflict report
- example source files and example merged output
## Design stance
This is still scaffold-level extraction. The purpose is to define stable interfaces and emitted artifacts,
not to claim perfect semantic parsing of every teaching document.
The implementation is designed so stronger parsers can later replace the stub extractors without changing
the surrounding pipeline.
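The replace-the-stub design can be sketched as a small extractor registry. The names below (`stub_pdf_extract`, `register_extractor`, `EXTRACTORS`) are illustrative, not the shipped API; the point is only that swapping a stub for a real parser never touches callers:

```python
from typing import Callable

def stub_pdf_extract(raw: bytes) -> str:
    # Stand-in: a real implementation would parse the PDF stream properly.
    return raw.decode("utf-8", errors="ignore")

# All extractors live behind one registry keyed by file suffix.
EXTRACTORS: dict[str, Callable[[bytes], str]] = {".pdf": stub_pdf_extract}

def register_extractor(suffix: str, fn: Callable[[bytes], str]) -> None:
    # Later: register_extractor(".pdf", real_pdf_extract) once one exists.
    EXTRACTORS[suffix] = fn

def extract(suffix: str, raw: bytes) -> str:
    return EXTRACTORS[suffix](raw)
```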
### Multi-Source Course Ingestion
This revision adds a **Multi-Source Course Ingestion Layer**.
@ -216,3 +251,4 @@ didactopus/

View File

@ -1,16 +1,19 @@
document_adapters:
  allow_pdf: true
  allow_docx: true
  allow_pptx: true
  allow_html: true
  allow_markdown: true
  allow_text: true
course_ingest:
  default_pack_author: "Wesley R. Elsberry"
  default_license: "REVIEW-REQUIRED"
  min_term_length: 4
  max_terms_per_lesson: 8
cross_course:
  detect_title_overlaps: true
  detect_term_conflicts: true
  detect_order_conflicts: true
  merge_same_named_lessons: true

View File

@ -0,0 +1,31 @@
# Cross-Course Merger
The cross-course merger combines multiple course-like inputs covering the same subject area.
## Goal
Build a stronger draft topic pack from several partially overlapping sources.
## What it does
- merges normalized source records into course bundles
- merges course bundles into one topic bundle
- compares repeated concepts across courses
- flags terminology conflicts and overlap
- emits a merged draft pack
- emits a cross-course conflict report
## Why this matters
A single course is rarely ideal for mastery-oriented domain construction.
Combining multiple sources can improve:
- concept coverage
- exercise diversity
- project identification
- terminology mapping
- prerequisite robustness
## Important caveat
This merger is draft-oriented.
Human review remains necessary before trusting the result as a final domain pack.
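As a rough illustration of the draft-merge idea (the function and data shapes here are hypothetical, not the shipped API): overlapping concept descriptions can be unioned under a normalized key, keeping the longest draft text and flagging single-source concepts for review.

```python
def merge_concept_maps(courses: list[dict[str, str]]) -> tuple[dict[str, str], list[str]]:
    # Hypothetical sketch: each course maps concept title -> description.
    merged: dict[str, str] = {}
    counts: dict[str, int] = {}
    for course in courses:
        for title, desc in course.items():
            key = title.strip().lower()
            counts[key] = counts.get(key, 0) + 1
            # Keep the longer description as the draft text.
            if len(desc) > len(merged.get(key, "")):
                merged[key] = desc
    review = [f"'{key}' appears in only one source" for key, n in sorted(counts.items()) if n == 1]
    return merged, review
```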

docs/document-adapters.md Normal file
View File

@ -0,0 +1,42 @@
# Document Adapters
Didactopus now includes adapter scaffolds for several common educational document types.
## Supported adapter interfaces
- PDF adapter
- DOCX adapter
- PPTX adapter
- HTML adapter
- Markdown adapter
- text adapter
## Current status
The current implementation is intentionally conservative:
- it focuses on stable interfaces
- it extracts text in a simplified way
- it normalizes results into shared internal structures
## Why this matters
Educational material commonly lives in:
- syllabi PDFs
- DOCX notes
- PowerPoint slide decks
- LMS HTML exports
- markdown lesson files
A useful curriculum distiller must be able to treat these as first-class inputs.
## Adapter contract
Each adapter returns a normalized document record with:
- source path
- source type
- title
- extracted text
- sections
- metadata
This record is then passed into higher-level course/topic distillation logic.
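The record shape above can be sketched with stdlib dataclasses. The repository itself uses pydantic models for this; the dataclass below only shows the contract's shape:

```python
from dataclasses import dataclass, field

@dataclass
class Section:
    heading: str
    body: str = ""

@dataclass
class NormalizedDocument:
    source_path: str
    source_type: str
    title: str = ""
    text: str = ""
    sections: list[Section] = field(default_factory=list)
    metadata: dict = field(default_factory=dict)

# One record per source document, regardless of original format.
doc = NormalizedDocument(
    source_path="examples/intro_bayes_notes.docx",
    source_type="docx",
    title="Intro Bayes Notes",
    sections=[Section(heading="Model Checking", body="Exercise: Critique a simple inference model.")],
)
```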

View File

@ -1,27 +1,25 @@
# FAQ
## Why add document adapters now?
Because real educational material is rarely provided in only one plain-text format.
## Are these full-fidelity parsers?
Not yet. The current implementation is a stable scaffold for extraction and normalization.
## Why add cross-course merging?
Because one course often under-specifies a domain, while multiple sources together can produce a better draft pack.
## Does the merger resolve every concept conflict automatically?
No. It produces a merged draft plus a conflict report for human review.
## What kinds of issues are flagged?
Examples:
- repeated concepts with different names
- same term used with different local contexts
- courses that introduce topics in conflicting orders
- weak or thin concept descriptions
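The "same term used with different local contexts" check boils down to an inverted index from term to lesson titles. This is a simplified, self-contained sketch of what the repository's `detect_term_conflicts` does over real lesson objects:

```python
from collections import defaultdict

def term_context_flags(lessons: dict[str, list[str]]) -> list[str]:
    # lessons: lesson title -> key terms extracted from that lesson.
    term_to_lessons: dict[str, set[str]] = defaultdict(set)
    for lesson_title, terms in lessons.items():
        for term in terms:
            term_to_lessons[term.lower()].add(lesson_title)
    # A term shared by more than one lesson context gets a review flag.
    return [
        f"Key term '{term}' appears in multiple lesson contexts: {', '.join(sorted(titles))}"
        for term, titles in sorted(term_to_lessons.items())
        if len(titles) > 1
    ]
```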

View File

@ -0,0 +1,42 @@
concepts:
- id: descriptive-statistics
  title: Descriptive Statistics
  description: 'Objective: Explain mean, median, and variance.
    Exercise: Summarize a small dataset.
    Descriptive Statistics introduces center and spread.'
  prerequisites: []
  mastery_signals:
  - Summarize a small dataset.
  mastery_profile: {}
- id: probability-basics
  title: Probability Basics
  description: 'Objective: Explain conditional probability.
    Exercise: Compute a simple conditional probability.
    Probability Basics introduces events and likelihood.'
  prerequisites:
  - descriptive-statistics
  mastery_signals:
  - Compute a simple conditional probability.
  mastery_profile: {}
- id: prior-and-posterior
  title: Prior And Posterior
  description: 'Prior and Posterior are central concepts. Prior reflects assumptions
    before evidence. Exercise: Compare prior and posterior beliefs.'
  prerequisites:
  - probability-basics
  mastery_signals:
  - Compare prior and posterior beliefs.
  mastery_profile: {}
- id: model-checking
  title: Model Checking
  description: 'A weakness is hidden assumptions. A limitation is poor fit. Uncertainty
    remains. Exercise: Critique a simple inference model.'
  prerequisites:
  - prior-and-posterior
  mastery_signals:
  - Critique a simple inference model.
  mastery_profile: {}
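A quick way to sanity-check a concepts list like the one above: every prerequisite id should refer to a concept defined earlier in the file. This is a hedged sketch, not part of the pipeline:

```python
def prereqs_resolve_in_order(concepts: list[dict]) -> bool:
    # True when each concept's prerequisites were all defined before it.
    seen: set[str] = set()
    for concept in concepts:
        if any(p not in seen for p in concept.get("prerequisites", [])):
            return False
        seen.add(concept["id"])
    return True

# The prerequisite chain from the example file above.
chain = [
    {"id": "descriptive-statistics", "prerequisites": []},
    {"id": "probability-basics", "prerequisites": ["descriptive-statistics"]},
    {"id": "prior-and-posterior", "prerequisites": ["probability-basics"]},
    {"id": "model-checking", "prerequisites": ["prior-and-posterior"]},
]
```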

View File

@ -0,0 +1,3 @@
# Conflict Report
- Lesson 'prior and posterior' was merged from multiple sources; review ordering assumptions.

View File

@ -0,0 +1,30 @@
{
  "rights_note": "REVIEW REQUIRED",
  "sources": [
    {
      "source_path": "examples/intro_bayes_outline.md",
      "source_type": "markdown",
      "title": "Intro Bayes Outline"
    },
    {
      "source_path": "examples/intro_bayes_lecture.html",
      "source_type": "html",
      "title": "Intro Bayes Lecture"
    },
    {
      "source_path": "examples/intro_bayes_slides.pptx",
      "source_type": "pptx",
      "title": "Intro Bayes Slides"
    },
    {
      "source_path": "examples/intro_bayes_notes.docx",
      "source_type": "docx",
      "title": "Intro Bayes Notes"
    },
    {
      "source_path": "examples/intro_bayes_syllabus.pdf",
      "source_type": "pdf",
      "title": "Intro Bayes Syllabus"
    }
  ]
}

View File

@ -0,0 +1,14 @@
name: introductory-bayesian-inference
display_name: Introductory Bayesian Inference
version: 0.1.0-draft
schema_version: '1'
didactopus_min_version: 0.1.0
didactopus_max_version: 0.9.99
description: Draft topic pack generated from multi-course inputs for 'Introductory
  Bayesian Inference'.
author: Wesley R. Elsberry
license: REVIEW-REQUIRED
dependencies: []
overrides: []
profile_templates: {}
cross_pack_links: []

View File

@ -0,0 +1,7 @@
projects:
- id: prior-and-posterior
  title: Prior And Posterior
  difficulty: review-required
  prerequisites: []
  deliverables:
  - project artifact

View File

@ -0,0 +1,3 @@
# Review Report
- Module 'Imported from PPTX' appears to contain project-like material; review project extraction.

View File

@ -0,0 +1,17 @@
stages:
- id: stage-1
  title: Imported from MARKDOWN
  concepts:
  - descriptive-statistics
  - probability-basics
  checkpoint: []
- id: stage-2
  title: Imported from HTML
  concepts:
  - prior-and-posterior
  checkpoint: []
- id: stage-3
  title: Imported from DOCX
  concepts:
  - model-checking
  checkpoint: []
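A small check over a roadmap like the one above: each concept id should be assigned to exactly one stage. The stage dicts below are illustrative literals mirroring the example file:

```python
# Stage layout from the example roadmap above.
stages = [
    {"id": "stage-1", "concepts": ["descriptive-statistics", "probability-basics"]},
    {"id": "stage-2", "concepts": ["prior-and-posterior"]},
    {"id": "stage-3", "concepts": ["model-checking"]},
]

# Flatten stage concepts in order; duplicates would mean a concept
# was scheduled twice across stages.
ordered = [c for stage in stages for c in stage["concepts"]]
assert len(ordered) == len(set(ordered))
```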

View File

@ -0,0 +1,6 @@
rubrics:
- id: draft-rubric
  title: Draft Rubric
  criteria:
  - correctness
  - explanation

View File

@ -0,0 +1,7 @@
<html><body>
<h1>Introductory Bayesian Inference</h1>
<h2>Bayesian Updating</h2>
<h3>Prior and Posterior</h3>
<p>Prior and Posterior are central concepts. Prior reflects assumptions before evidence.</p>
<p>Exercise: Compare prior and posterior beliefs.</p>
</body></html>

View File

@ -0,0 +1,6 @@
# Bayesian Notes
## Model Critique
### Model Checking
A weakness is hidden assumptions. A limitation is poor fit. Uncertainty remains.
Exercise: Critique a simple inference model.

View File

@ -0,0 +1,12 @@
# Introductory Bayesian Inference
## Foundations
### Descriptive Statistics
Objective: Explain mean, median, and variance.
Exercise: Summarize a small dataset.
Descriptive Statistics introduces center and spread.
### Probability Basics
Objective: Explain conditional probability.
Exercise: Compute a simple conditional probability.
Probability Basics introduces events and likelihood.

View File

@ -0,0 +1,7 @@
# Bayesian Slides
## Bayesian Updating
### Prior and Posterior
Prior and Posterior summary slide text.
Capstone Mini Project
Exercise: Write a short project report comparing priors and posteriors.

View File

@ -0,0 +1,5 @@
# Bayesian Syllabus
## Schedule
### Foundations
Objective: Explain descriptive statistics and conditional probability.

View File

@ -5,7 +5,7 @@ build-backend = "setuptools.build_meta"
[project]
name = "didactopus"
version = "0.1.0"
description = "Didactopus: document-adapter and cross-course merger scaffold"
readme = "README.md"
requires-python = ">=3.10"
license = {text = "MIT"}
@ -16,7 +16,7 @@ dependencies = ["pydantic>=2.7", "pyyaml>=6.0"]
dev = ["pytest>=8.0", "ruff>=0.6"]

[project.scripts]
didactopus-topic-ingest = "didactopus.main:main"

[tool.setuptools.packages.find]
where = ["src"]

View File

@ -3,6 +3,15 @@ from pydantic import BaseModel, Field
import yaml

class DocumentAdaptersConfig(BaseModel):
    allow_pdf: bool = True
    allow_docx: bool = True
    allow_pptx: bool = True
    allow_html: bool = True
    allow_markdown: bool = True
    allow_text: bool = True

class CourseIngestConfig(BaseModel):
    default_pack_author: str = "Unknown"
    default_license: str = "REVIEW-REQUIRED"
@ -10,23 +19,17 @@ class CourseIngestConfig(BaseModel):
    max_terms_per_lesson: int = 8

class CrossCourseConfig(BaseModel):
    detect_title_overlaps: bool = True
    detect_term_conflicts: bool = True
    detect_order_conflicts: bool = True
    merge_same_named_lessons: bool = True

class AppConfig(BaseModel):
    document_adapters: DocumentAdaptersConfig = Field(default_factory=DocumentAdaptersConfig)
    course_ingest: CourseIngestConfig = Field(default_factory=CourseIngestConfig)
    cross_course: CrossCourseConfig = Field(default_factory=CrossCourseConfig)

def load_config(path: str | Path) -> AppConfig:

View File

@ -1,8 +1,21 @@
from __future__ import annotations
from pydantic import BaseModel, Field

class Section(BaseModel):
    heading: str
    body: str = ""

class NormalizedDocument(BaseModel):
    source_path: str
    source_type: str
    title: str = ""
    text: str = ""
    sections: list[Section] = Field(default_factory=list)
    metadata: dict = Field(default_factory=dict)

class Lesson(BaseModel):
    title: str
    body: str = ""
@ -17,21 +30,18 @@ class Module(BaseModel):
    lessons: list[Lesson] = Field(default_factory=list)

class NormalizedCourse(BaseModel):
    title: str
    source_name: str = ""
    source_url: str = ""
    rights_note: str = ""
    modules: list[Module] = Field(default_factory=list)
    source_records: list[NormalizedDocument] = Field(default_factory=list)

class TopicBundle(BaseModel):
    topic_title: str
    courses: list[NormalizedCourse] = Field(default_factory=list)

class ConceptCandidate(BaseModel):
@ -40,6 +50,7 @@ class ConceptCandidate(BaseModel):
    description: str = ""
    source_modules: list[str] = Field(default_factory=list)
    source_lessons: list[str] = Field(default_factory=list)
    source_courses: list[str] = Field(default_factory=list)
    prerequisites: list[str] = Field(default_factory=list)
    mastery_signals: list[str] = Field(default_factory=list)

View File

@ -0,0 +1,50 @@
from __future__ import annotations
from collections import defaultdict
from .course_schema import NormalizedCourse, ConceptCandidate

def detect_title_overlaps(course: NormalizedCourse) -> list[str]:
    lesson_to_sources = defaultdict(set)
    for module in course.modules:
        for lesson in module.lessons:
            for src in lesson.source_refs:
                lesson_to_sources[lesson.title.lower()].add(src)
    flags = []
    for title, sources in lesson_to_sources.items():
        if len(sources) > 1:
            flags.append(f"Lesson title '{title}' appears across multiple sources: {', '.join(sorted(sources))}")
    return flags

def detect_term_conflicts(course: NormalizedCourse) -> list[str]:
    term_to_lessons = defaultdict(set)
    for module in course.modules:
        for lesson in module.lessons:
            for term in lesson.key_terms:
                term_to_lessons[term.lower()].add(lesson.title)
    flags = []
    for term, lessons in term_to_lessons.items():
        if len(lessons) > 1:
            flags.append(f"Key term '{term}' appears in multiple lesson contexts: {', '.join(sorted(lessons))}")
    return flags

def detect_order_conflicts(course: NormalizedCourse) -> list[str]:
    # Placeholder heuristic: if the same lesson title carries multiple source_refs, flag it for order review.
    flags = []
    for module in course.modules:
        for lesson in module.lessons:
            if len(set(lesson.source_refs)) > 1:
                flags.append(f"Lesson '{lesson.title}' was merged from multiple sources; review ordering assumptions.")
    return flags

def detect_thin_concepts(concepts: list[ConceptCandidate]) -> list[str]:
    flags = []
    for concept in concepts:
        if len(concept.description.strip()) < 20:
            flags.append(f"Concept '{concept.title}' has a very thin description.")
        if not concept.mastery_signals:
            flags.append(f"Concept '{concept.title}' has no extracted mastery signals.")
    return flags

View File

@ -0,0 +1,141 @@
from __future__ import annotations
from pathlib import Path
import re
from .course_schema import NormalizedDocument, Section

def _title_from_path(path: str | Path) -> str:
    p = Path(path)
    return p.stem.replace("_", " ").replace("-", " ").title()

def _simple_section_split(text: str) -> list[Section]:
    sections = []
    current_heading = "Main"
    current_lines = []
    for line in text.splitlines():
        if re.match(r"^(#{1,3})\s+", line):
            if current_lines:
                sections.append(Section(heading=current_heading, body="\n".join(current_lines).strip()))
            current_heading = re.sub(r"^(#{1,3})\s+", "", line).strip()
            current_lines = []
        else:
            current_lines.append(line)
    if current_lines:
        sections.append(Section(heading=current_heading, body="\n".join(current_lines).strip()))
    return sections

def read_textish(path: str | Path) -> str:
    return Path(path).read_text(encoding="utf-8")

def adapt_markdown(path: str | Path) -> NormalizedDocument:
    text = read_textish(path)
    return NormalizedDocument(
        source_path=str(path),
        source_type="markdown",
        title=_title_from_path(path),
        text=text,
        sections=_simple_section_split(text),
        metadata={},
    )

def adapt_text(path: str | Path) -> NormalizedDocument:
    text = read_textish(path)
    return NormalizedDocument(
        source_path=str(path),
        source_type="text",
        title=_title_from_path(path),
        text=text,
        sections=_simple_section_split(text),
        metadata={},
    )

def adapt_html(path: str | Path) -> NormalizedDocument:
    raw = read_textish(path)
    text = re.sub(r"<[^>]+>", " ", raw)
    text = re.sub(r"\s+", " ", text).strip()
    return NormalizedDocument(
        source_path=str(path),
        source_type="html",
        title=_title_from_path(path),
        text=text,
        sections=[Section(heading="HTML Extract", body=text)],
        metadata={"extraction": "stub-html-strip"},
    )

def adapt_pdf(path: str | Path) -> NormalizedDocument:
    # Stub: in a real implementation, plug in PDF text extraction here.
    text = read_textish(path)
    return NormalizedDocument(
        source_path=str(path),
        source_type="pdf",
        title=_title_from_path(path),
        text=text,
        sections=_simple_section_split(text),
        metadata={"extraction": "stub-pdf-text"},
    )

def adapt_docx(path: str | Path) -> NormalizedDocument:
    # Stub: in a real implementation, plug in DOCX extraction here.
    text = read_textish(path)
    return NormalizedDocument(
        source_path=str(path),
        source_type="docx",
        title=_title_from_path(path),
        text=text,
        sections=_simple_section_split(text),
        metadata={"extraction": "stub-docx-text"},
    )

def adapt_pptx(path: str | Path) -> NormalizedDocument:
    # Stub: in a real implementation, plug in PPTX extraction here.
    text = read_textish(path)
    return NormalizedDocument(
        source_path=str(path),
        source_type="pptx",
        title=_title_from_path(path),
        text=text,
        sections=_simple_section_split(text),
        metadata={"extraction": "stub-pptx-text"},
    )

def detect_adapter(path: str | Path) -> str:
    p = Path(path)
    suffix = p.suffix.lower()
    if suffix == ".md":
        return "markdown"
    if suffix in {".txt"}:
        return "text"
    if suffix in {".html", ".htm"}:
        return "html"
    if suffix == ".pdf":
        return "pdf"
    if suffix == ".docx":
        return "docx"
    if suffix == ".pptx":
        return "pptx"
    return "text"

def adapt_document(path: str | Path) -> NormalizedDocument:
    adapter = detect_adapter(path)
    if adapter == "markdown":
        return adapt_markdown(path)
    if adapter == "html":
        return adapt_html(path)
    if adapter == "pdf":
        return adapt_pdf(path)
    if adapter == "docx":
        return adapt_docx(path)
    if adapter == "pptx":
        return adapt_pptx(path)
    return adapt_text(path)

View File

@ -4,18 +4,19 @@ import argparse
from pathlib import Path
from .config import load_config
from .document_adapters import adapt_document
from .topic_ingest import document_to_course, build_topic_bundle, merge_courses_into_topic_course, extract_concept_candidates
from .cross_course_conflicts import detect_title_overlaps, detect_term_conflicts, detect_order_conflicts, detect_thin_concepts
from .rule_policy import RuleContext, build_default_rules, run_rules
from .pack_emitter import build_draft_pack, write_draft_pack

def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="Didactopus document-adapter and cross-course topic ingestion")
    parser.add_argument("--inputs", nargs="+", required=True, help="Document inputs")
    parser.add_argument("--title", required=True, help="Topic title")
    parser.add_argument("--rights-note", default="REVIEW REQUIRED")
    parser.add_argument("--output-dir", default="generated-topic-pack")
    parser.add_argument("--config", default="configs/config.example.yaml")
    return parser
@ -24,33 +25,30 @@ def main() -> None:
    args = build_parser().parse_args()
    config = load_config(args.config)

    docs = [adapt_document(path) for path in args.inputs]
    courses = [document_to_course(doc, course_title=args.title) for doc in docs]
    topic = build_topic_bundle(args.title, courses)
    merged_course = merge_courses_into_topic_course(
        topic_bundle=topic,
        merge_same_named_lessons=config.cross_course.merge_same_named_lessons,
    )
    concepts = extract_concept_candidates(merged_course)

    context = RuleContext(course=merged_course, concepts=concepts)
    rules = build_default_rules()
    run_rules(context, rules)

    conflicts = []
    if config.cross_course.detect_title_overlaps:
        conflicts.extend(detect_title_overlaps(merged_course))
    if config.cross_course.detect_term_conflicts:
        conflicts.extend(detect_term_conflicts(merged_course))
    if config.cross_course.detect_order_conflicts:
        conflicts.extend(detect_order_conflicts(merged_course))
    conflicts.extend(detect_thin_concepts(context.concepts))

    draft = build_draft_pack(
        course=merged_course,
        concepts=context.concepts,
        author=config.course_ingest.default_pack_author,
        license_name=config.course_ingest.default_license,
@ -59,10 +57,11 @@ def main() -> None:
    )
    write_draft_pack(draft, args.output_dir)

    print("== Didactopus Cross-Course Topic Ingest ==")
    print(f"Topic: {args.title}")
    print(f"Documents: {len(docs)}")
    print(f"Courses: {len(courses)}")
    print(f"Merged modules: {len(merged_course.modules)}")
    print(f"Concept candidates: {len(context.concepts)}")
    print(f"Review flags: {len(context.review_flags)}")
    print(f"Conflicts: {len(conflicts)}")

View File

@ -15,7 +15,7 @@ def build_draft_pack(course: NormalizedCourse, concepts: list[ConceptCandidate],
        "schema_version": "1",
        "didactopus_min_version": "0.1.0",
        "didactopus_max_version": "0.9.99",
        "description": f"Draft topic pack generated from multi-course inputs for '{course.title}'.",
        "author": author,
        "license": license_name,
        "dependencies": [],
@ -64,7 +64,7 @@ def build_draft_pack(course: NormalizedCourse, concepts: list[ConceptCandidate],
    attribution = {
        "rights_note": course.rights_note,
        "sources": [
            {"source_path": src.source_path, "source_type": src.source_type, "title": src.title}
            for src in course.source_records
        ],
    }
@ -88,11 +88,8 @@ def write_draft_pack(pack: DraftPack, outdir: str | Path) -> None:
    (out / "roadmap.yaml").write_text(yaml.safe_dump(pack.roadmap, sort_keys=False), encoding="utf-8")
    (out / "projects.yaml").write_text(yaml.safe_dump(pack.projects, sort_keys=False), encoding="utf-8")
    (out / "rubrics.yaml").write_text(yaml.safe_dump(pack.rubrics, sort_keys=False), encoding="utf-8")
    review_lines = ["# Review Report", ""] + [f"- {flag}" for flag in pack.review_report] if pack.review_report else ["# Review Report", "", "- none"]
    (out / "review_report.md").write_text("\n".join(review_lines), encoding="utf-8")
    conflict_lines = ["# Conflict Report", ""] + [f"- {flag}" for flag in pack.conflicts] if pack.conflicts else ["# Conflict Report", "", "- none"]
    (out / "conflict_report.md").write_text("\n".join(conflict_lines), encoding="utf-8")
    (out / "license_attribution.json").write_text(json.dumps(pack.attribution, indent=2), encoding="utf-8")

View File

@ -39,6 +39,7 @@ def duplicate_term_merge_rule(context: RuleContext) -> None:
        if key in seen:
            seen[key].source_modules.extend(x for x in concept.source_modules if x not in seen[key].source_modules)
            seen[key].source_lessons.extend(x for x in concept.source_lessons if x not in seen[key].source_lessons)
            seen[key].source_courses.extend(x for x in concept.source_courses if x not in seen[key].source_courses)
            if concept.description and len(seen[key].description) < len(concept.description):
                seen[key].description = concept.description
        else:

View File

@ -0,0 +1,126 @@
from __future__ import annotations
import re
from collections import defaultdict
from .course_schema import NormalizedDocument, NormalizedCourse, Module, Lesson, TopicBundle, ConceptCandidate
def slugify(text: str) -> str:
cleaned = re.sub(r"[^a-zA-Z0-9]+", "-", text.strip().lower()).strip("-")
return cleaned or "untitled"
def extract_key_terms(text: str, min_term_length: int = 4, max_terms: int = 8) -> list[str]:
candidates = re.findall(r"\b[A-Z][A-Za-z0-9\-]{%d,}\b" % (min_term_length - 1), text)
seen = set()
out = []
for term in candidates:
if term not in seen:
seen.add(term)
out.append(term)
if len(out) >= max_terms:
break
return out
def document_to_course(doc: NormalizedDocument, course_title: str) -> NormalizedCourse:
# Conservative mapping: each section becomes a lesson; all lessons go into one module.
lessons = []
for section in doc.sections:
body = section.body.strip()
lines = body.splitlines()
objectives = []
exercises = []
for line in lines:
low = line.lower().strip()
if low.startswith("objective:"):
objectives.append(line.split(":", 1)[1].strip())
if low.startswith("exercise:"):
exercises.append(line.split(":", 1)[1].strip())
lessons.append(
Lesson(
title=section.heading.strip() or "Untitled Lesson",
body=body,
objectives=objectives,
exercises=exercises,
key_terms=extract_key_terms(section.heading + "\n" + body),
source_refs=[doc.source_path],
)
)
module = Module(title=f"Imported from {doc.source_type.upper()}", lessons=lessons)
return NormalizedCourse(title=course_title, modules=[module], source_records=[doc])
def build_topic_bundle(topic_title: str, courses: list[NormalizedCourse]) -> TopicBundle:
return TopicBundle(topic_title=topic_title, courses=courses)
def merge_courses_into_topic_course(topic_bundle: TopicBundle, merge_same_named_lessons: bool = True) -> NormalizedCourse:
modules_by_title: dict[str, Module] = {}
source_records = []
for course in topic_bundle.courses:
source_records.extend(course.source_records)
for module in course.modules:
target_module = modules_by_title.setdefault(module.title, Module(title=module.title, lessons=[]))
if merge_same_named_lessons:
lesson_map = {lesson.title: lesson for lesson in target_module.lessons}
for lesson in module.lessons:
if lesson.title in lesson_map:
existing = lesson_map[lesson.title]
if lesson.body and lesson.body not in existing.body:
existing.body = (existing.body + "\n\n" + lesson.body).strip()
for x in lesson.objectives:
if x not in existing.objectives:
existing.objectives.append(x)
for x in lesson.exercises:
if x not in existing.exercises:
existing.exercises.append(x)
for x in lesson.key_terms:
if x not in existing.key_terms:
existing.key_terms.append(x)
for x in lesson.source_refs:
if x not in existing.source_refs:
existing.source_refs.append(x)
else:
target_module.lessons.append(lesson)
else:
target_module.lessons.extend(module.lessons)
return NormalizedCourse(title=topic_bundle.topic_title, modules=list(modules_by_title.values()), source_records=source_records)
def extract_concept_candidates(course: NormalizedCourse) -> list[ConceptCandidate]:
    concepts = []
    seen_ids = set()
    for module in course.modules:
        for lesson in module.lessons:
            cid = slugify(lesson.title)
            if cid not in seen_ids:
                seen_ids.add(cid)
                concepts.append(
                    ConceptCandidate(
                        id=cid,
                        title=lesson.title,
                        description=lesson.body[:240].strip(),
                        source_modules=[module.title],
                        source_lessons=[lesson.title],
                        source_courses=list(lesson.source_refs),
                        mastery_signals=list(lesson.objectives[:3] or lesson.exercises[:2]),
                    )
                )
            for term in lesson.key_terms:
                tid = slugify(term)
                if tid in seen_ids:
                    continue
                seen_ids.add(tid)
                concepts.append(
                    ConceptCandidate(
                        id=tid,
                        title=term,
                        description=f"Candidate concept extracted from lesson '{lesson.title}'.",
                        source_modules=[module.title],
                        source_lessons=[lesson.title],
                        source_courses=list(lesson.source_refs),
                        mastery_signals=list(lesson.objectives[:2]),
                    )
                )
    return concepts
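The same-title merge policy in `merge_courses_into_topic_course` boils down to two moves: append body text that has not been seen yet, and deduplicate list fields while preserving order. A minimal self-contained sketch of that policy (`LessonStub` here is a simplified stand-in, not the package's `Lesson` model):

```python
from dataclasses import dataclass, field


@dataclass
class LessonStub:
    # Simplified stand-in for the package's Lesson model.
    title: str
    body: str = ""
    objectives: list[str] = field(default_factory=list)


def merge_same_titled(existing: LessonStub, incoming: LessonStub) -> None:
    # Concatenate unseen body text; deduplicate objectives in first-seen order.
    if incoming.body and incoming.body not in existing.body:
        existing.body = (existing.body + "\n\n" + incoming.body).strip()
    for obj in incoming.objectives:
        if obj not in existing.objectives:
            existing.objectives.append(obj)


a = LessonStub("L1", "Body A", ["Explain A"])
b = LessonStub("L1", "Body B", ["Explain A", "Apply A"])
merge_same_titled(a, b)
print(a.body)        # bodies joined with a blank line
print(a.objectives)  # ['Explain A', 'Apply A']
```

Note the substring check (`incoming.body not in existing.body`) means a body that happens to be contained in an already-merged body is silently dropped, which is acceptable for scaffold-level dedup.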


@ -0,0 +1,19 @@
from pathlib import Path
from didactopus.document_adapters import adapt_document
from didactopus.topic_ingest import document_to_course, build_topic_bundle, merge_courses_into_topic_course, extract_concept_candidates
from didactopus.cross_course_conflicts import detect_title_overlaps, detect_term_conflicts, detect_order_conflicts, detect_thin_concepts


def test_conflict_detection(tmp_path: Path) -> None:
    a = tmp_path / "a.md"
    b = tmp_path / "b.md"
    a.write_text("# T\n\n## M1\n### Bayesian Updating\nPrior and Posterior appear here.", encoding="utf-8")
    b.write_text("# T\n\n## M2\n### Bayesian Updating\nPrior and Posterior appear again.", encoding="utf-8")
    docs = [adapt_document(a), adapt_document(b)]
    courses = [document_to_course(doc, "Topic") for doc in docs]
    merged = merge_courses_into_topic_course(build_topic_bundle("Topic", courses), merge_same_named_lessons=False)
    concepts = extract_concept_candidates(merged)
    assert isinstance(detect_title_overlaps(merged), list)
    assert isinstance(detect_term_conflicts(merged), list)
    assert isinstance(detect_order_conflicts(merged), list)
    assert isinstance(detect_thin_concepts(concepts), list)
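The conflict detectors are imported above but their bodies are not part of this diff. A plausible sketch of the title-overlap check, assuming it simply reports lesson titles that occur more than once (the signature is simplified to a list of titles here; the real function takes the merged course and may differ):

```python
from collections import Counter


def detect_title_overlaps(titles: list[str]) -> list[str]:
    # Flag any lesson title seen more than once, ordered by first appearance.
    counts = Counter(titles)
    return [t for t, n in counts.items() if n > 1]


overlaps = detect_title_overlaps(["Bayesian Updating", "Priors", "Bayesian Updating"])
print(overlaps)  # ['Bayesian Updating']
```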


@ -0,0 +1,18 @@
from pathlib import Path
from didactopus.document_adapters import adapt_document, detect_adapter


def test_detect_adapter() -> None:
    assert detect_adapter("a.md") == "markdown"
    assert detect_adapter("b.html") == "html"
    assert detect_adapter("c.pdf") == "pdf"
    assert detect_adapter("d.docx") == "docx"
    assert detect_adapter("e.pptx") == "pptx"


def test_adapt_markdown(tmp_path: Path) -> None:
    p = tmp_path / "x.md"
    p.write_text("# T\n\n## A\nBody", encoding="utf-8")
    doc = adapt_document(p)
    assert doc.source_type == "markdown"
    assert len(doc.sections) >= 1
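The test above implies `detect_adapter` dispatches on file extension. A hedged sketch of how such a registry could look (the table and fallback are assumptions; the real registry lives in `didactopus.document_adapters` and may carry extractor callables rather than plain names):

```python
from pathlib import Path

# Hypothetical extension-to-adapter table.
ADAPTERS = {
    ".md": "markdown",
    ".html": "html",
    ".pdf": "pdf",
    ".docx": "docx",
    ".pptx": "pptx",
    ".txt": "text",
}


def detect_adapter(path: str) -> str:
    # Dispatch on the (case-insensitive) file extension;
    # unknown extensions fall back to the plain-text adapter.
    return ADAPTERS.get(Path(path).suffix.lower(), "text")
```

A fallback to `"text"` keeps ingestion total: every file yields some normalized document, at worst a raw-text one.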


@ -1,17 +1,20 @@
 from pathlib import Path
-from didactopus.course_ingest import parse_source_file, merge_source_records, extract_concept_candidates
+from didactopus.document_adapters import adapt_document
+from didactopus.topic_ingest import document_to_course, build_topic_bundle, merge_courses_into_topic_course, extract_concept_candidates
 from didactopus.rule_policy import RuleContext, build_default_rules, run_rules
 from didactopus.pack_emitter import build_draft_pack, write_draft_pack
-def test_emit_multisource_pack(tmp_path: Path) -> None:
+def test_emit_topic_pack(tmp_path: Path) -> None:
     src = tmp_path / "course.md"
-    src.write_text("# C\n\n## M1\n### Lesson A\n- Objective: Explain Topic A.\n- Exercise: Do task A.\nTopic A body.", encoding="utf-8")
-    course = merge_source_records([parse_source_file(src, title="Course")], course_title="Course")
-    concepts = extract_concept_candidates(course)
-    ctx = RuleContext(course=course, concepts=concepts)
+    src.write_text("# T\n\n## M\n### L\nExercise: Do task A.\nTopic A body.", encoding="utf-8")
+    doc = adapt_document(src)
+    course = document_to_course(doc, "Topic")
+    merged = merge_courses_into_topic_course(build_topic_bundle("Topic", [course]))
+    concepts = extract_concept_candidates(merged)
+    ctx = RuleContext(course=merged, concepts=concepts)
     run_rules(ctx, build_default_rules())
-    draft = build_draft_pack(course, ctx.concepts, "Tester", "REVIEW", ctx.review_flags, [])
+    draft = build_draft_pack(merged, ctx.concepts, "Tester", "REVIEW", ctx.review_flags, [])
     write_draft_pack(draft, tmp_path / "out")
     assert (tmp_path / "out" / "pack.yaml").exists()
     assert (tmp_path / "out" / "conflict_report.md").exists()


@ -0,0 +1,26 @@
from pathlib import Path
from didactopus.document_adapters import adapt_document
from didactopus.topic_ingest import document_to_course, build_topic_bundle, merge_courses_into_topic_course, extract_concept_candidates


def test_cross_course_merge(tmp_path: Path) -> None:
    a = tmp_path / "a.md"
    b = tmp_path / "b.docx"
    a.write_text("# T\n\n## M\n### L1\nBody A", encoding="utf-8")
    b.write_text("# T\n\n## M\n### L1\nBody B", encoding="utf-8")
    docs = [adapt_document(a), adapt_document(b)]
    courses = [document_to_course(doc, "Topic") for doc in docs]
    topic = build_topic_bundle("Topic", courses)
    merged = merge_courses_into_topic_course(topic)
    assert len(merged.modules) >= 1
    assert len(merged.modules[0].lessons) == 1


def test_extract_concepts(tmp_path: Path) -> None:
    a = tmp_path / "a.md"
    a.write_text("# T\n\n## M\n### Lesson A\nObjective: Explain Topic A.\nBody.", encoding="utf-8")
    doc = adapt_document(a)
    course = document_to_course(doc, "Topic")
    concepts = extract_concept_candidates(course)
    assert len(concepts) >= 1