diff --git a/README.md b/README.md index d35d32c..e378a84 100644 --- a/README.md +++ b/README.md @@ -8,6 +8,41 @@ ## Recent revisions +### Course-to-course merger + +This revision adds two major capabilities: + +- **real document adapter scaffolds** for PDF, DOCX, PPTX, and HTML +- a **cross-course merger** for combining multiple course-derived packs into one stronger domain draft + +These additions extend the earlier multi-source ingestion layer from "multiple files for one course" +to "multiple courses or course-like sources for one topic domain." + +## What is included + +- adapter registry for: + - PDF + - DOCX + - PPTX + - HTML + - Markdown + - text +- normalized document extraction interface +- course bundle ingestion across multiple source documents +- cross-course terminology and overlap analysis +- merged topic-pack emitter +- cross-course conflict report +- example source files and example merged output + +## Design stance + +This is still scaffold-level extraction. The purpose is to define stable interfaces and emitted artifacts, +not to claim perfect semantic parsing of every teaching document. + +The implementation is designed so stronger parsers can later replace the stub extractors without changing +the surrounding pipeline. + + ### Multi-Source Course Ingestion This revision adds a **Multi-Source Course Ingestion Layer**. @@ -216,3 +251,4 @@ didactopus/ + diff --git a/configs/config.example.yaml b/configs/config.example.yaml index 6047dab..320a6d0 100644 --- a/configs/config.example.yaml +++ b/configs/config.example.yaml @@ -1,16 +1,19 @@ +document_adapters: + allow_pdf: true + allow_docx: true + allow_pptx: true + allow_html: true + allow_markdown: true + allow_text: true + course_ingest: default_pack_author: "Wesley R. 
Elsberry" default_license: "REVIEW-REQUIRED" min_term_length: 4 max_terms_per_lesson: 8 -rule_policy: - enable_prerequisite_order_rule: true - enable_duplicate_term_merge_rule: true - enable_project_detection_rule: true - enable_review_flags: true - -multisource: - detect_duplicate_lessons: true +cross_course: + detect_title_overlaps: true detect_term_conflicts: true + detect_order_conflicts: true merge_same_named_lessons: true diff --git a/docs/cross-course-merger.md b/docs/cross-course-merger.md new file mode 100644 index 0000000..d549664 --- /dev/null +++ b/docs/cross-course-merger.md @@ -0,0 +1,31 @@ +# Cross-Course Merger + +The cross-course merger combines multiple course-like inputs covering the same subject area. + +## Goal + +Build a stronger draft topic pack from several partially overlapping sources. + +## What it does + +- merges normalized source records into course bundles +- merges course bundles into one topic bundle +- compares repeated concepts across courses +- flags terminology conflicts and lesson-title overlap +- emits a merged draft pack +- emits a cross-course conflict report + +## Why this matters + +A single course is rarely ideal on its own for mastery-oriented domain construction. +Combining multiple sources can improve: +- concept coverage +- exercise diversity +- project identification +- terminology mapping +- prerequisite robustness + +## Important caveat + +This merger is draft-oriented. +Human review remains necessary before trusting the result as a final domain pack. diff --git a/docs/document-adapters.md b/docs/document-adapters.md new file mode 100644 index 0000000..6336430 --- /dev/null +++ b/docs/document-adapters.md @@ -0,0 +1,42 @@ +# Document Adapters + +Didactopus now includes adapter scaffolds for several common educational document types. 
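The adapter contract described in this document can be sketched as a small self-contained model. Field names mirror the `NormalizedDocument` and `Section` schemas from this diff; `adapt_text_string` is a hypothetical in-memory stand-in for the file-based adapters, and plain dataclasses stand in for the pydantic models the scaffold actually uses:

```python
from dataclasses import dataclass, field
from pathlib import Path


@dataclass
class Section:
    heading: str
    body: str = ""


@dataclass
class NormalizedDocument:
    # The shared record every adapter returns, regardless of input format.
    source_path: str
    source_type: str
    title: str = ""
    text: str = ""
    sections: list[Section] = field(default_factory=list)
    metadata: dict = field(default_factory=dict)


def adapt_text_string(path: str, raw: str) -> NormalizedDocument:
    # Minimal "text" adapter over an in-memory string: derive the title
    # from the file stem, keep the whole body as a single section.
    title = Path(path).stem.replace("_", " ").replace("-", " ").title()
    return NormalizedDocument(
        source_path=path,
        source_type="text",
        title=title,
        text=raw,
        sections=[Section(heading="Main", body=raw.strip())],
    )
```

Because every adapter emits this same shape, downstream course and topic distillation never needs to know which source format a record came from.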
+ +## Supported adapter interfaces + +- PDF adapter +- DOCX adapter +- PPTX adapter +- HTML adapter +- Markdown adapter +- text adapter + +## Current status + +The current implementation is intentionally conservative: +- it focuses on stable interfaces +- it extracts text in a simplified way +- it normalizes results into shared internal structures + +## Why this matters + +Educational material commonly lives in: +- syllabi PDFs +- DOCX notes +- PowerPoint slide decks +- LMS HTML exports +- markdown lesson files + +A useful curriculum distiller must be able to treat these as first-class inputs. + +## Adapter contract + +Each adapter returns a normalized document record with: +- source path +- source type +- title +- extracted text +- sections +- metadata + +This record is then passed into higher-level course/topic distillation logic. diff --git a/docs/faq.md b/docs/faq.md index b5165bf..805941f 100644 --- a/docs/faq.md +++ b/docs/faq.md @@ -1,27 +1,25 @@ # FAQ -## Why multi-source ingestion? +## Why add document adapters now? -Because course structure is usually distributed across several files rather than -perfectly contained in one source. +Because real educational material is rarely provided in only one plain-text format. -## What kinds of conflicts can arise? +## Are these full-fidelity parsers? -Common examples: -- the same lesson with slightly different titles -- inconsistent terminology across notes and transcripts -- exercises present in one source but absent in another -- project prompts implied in one file and explicit in another +Not yet. The current implementation is a stable scaffold for extraction and normalization. -## Does the system resolve all conflicts automatically? +## Why add cross-course merging? -No. It produces a merged draft pack and a conflict report for human review. +Because one course often under-specifies a domain, while multiple sources together can produce a better draft pack. -## Why not rely only on embeddings for this? 
+## Does the merger resolve every concept conflict automatically? -Because Didactopus needs explicit structures such as: -- concepts -- prerequisites -- projects -- rubrics -- checkpoints +No. It produces a merged draft plus a conflict report for human review. + +## What kinds of issues are flagged? + +Examples: +- repeated concepts with different names +- same term used with different local contexts +- courses that introduce topics in conflicting orders +- weak or thin concept descriptions diff --git a/examples/generated_topic_pack/concepts.yaml b/examples/generated_topic_pack/concepts.yaml new file mode 100644 index 0000000..1e89fd5 --- /dev/null +++ b/examples/generated_topic_pack/concepts.yaml @@ -0,0 +1,42 @@ +concepts: +- id: descriptive-statistics + title: Descriptive Statistics + description: 'Objective: Explain mean, median, and variance. + + Exercise: Summarize a small dataset. + + Descriptive Statistics introduces center and spread.' + prerequisites: [] + mastery_signals: + - Summarize a small dataset. + mastery_profile: {} +- id: probability-basics + title: Probability Basics + description: 'Objective: Explain conditional probability. + + Exercise: Compute a simple conditional probability. + + Probability Basics introduces events and likelihood.' + prerequisites: + - descriptive-statistics + mastery_signals: + - Compute a simple conditional probability. + mastery_profile: {} +- id: prior-and-posterior + title: Prior And Posterior + description: 'Prior and Posterior are central concepts. Prior reflects assumptions + before evidence. Exercise: Compare prior and posterior beliefs.' + prerequisites: + - probability-basics + mastery_signals: + - Compare prior and posterior beliefs. + mastery_profile: {} +- id: model-checking + title: Model Checking + description: 'A weakness is hidden assumptions. A limitation is poor fit. Uncertainty + remains. Exercise: Critique a simple inference model.' 
+ prerequisites: + - prior-and-posterior + mastery_signals: + - Critique a simple inference model. + mastery_profile: {} diff --git a/examples/generated_topic_pack/conflict_report.md b/examples/generated_topic_pack/conflict_report.md new file mode 100644 index 0000000..a34681b --- /dev/null +++ b/examples/generated_topic_pack/conflict_report.md @@ -0,0 +1,3 @@ +# Conflict Report + +- Lesson 'prior and posterior' was merged from multiple sources; review ordering assumptions. diff --git a/examples/generated_topic_pack/license_attribution.json b/examples/generated_topic_pack/license_attribution.json new file mode 100644 index 0000000..bf16872 --- /dev/null +++ b/examples/generated_topic_pack/license_attribution.json @@ -0,0 +1,30 @@ +{ + "rights_note": "REVIEW REQUIRED", + "sources": [ + { + "source_path": "examples/intro_bayes_outline.md", + "source_type": "markdown", + "title": "Intro Bayes Outline" + }, + { + "source_path": "examples/intro_bayes_lecture.html", + "source_type": "html", + "title": "Intro Bayes Lecture" + }, + { + "source_path": "examples/intro_bayes_slides.pptx", + "source_type": "pptx", + "title": "Intro Bayes Slides" + }, + { + "source_path": "examples/intro_bayes_notes.docx", + "source_type": "docx", + "title": "Intro Bayes Notes" + }, + { + "source_path": "examples/intro_bayes_syllabus.pdf", + "source_type": "pdf", + "title": "Intro Bayes Syllabus" + } + ] +} \ No newline at end of file diff --git a/examples/generated_topic_pack/pack.yaml b/examples/generated_topic_pack/pack.yaml new file mode 100644 index 0000000..44480bf --- /dev/null +++ b/examples/generated_topic_pack/pack.yaml @@ -0,0 +1,14 @@ +name: introductory-bayesian-inference +display_name: Introductory Bayesian Inference +version: 0.1.0-draft +schema_version: '1' +didactopus_min_version: 0.1.0 +didactopus_max_version: 0.9.99 +description: Draft topic pack generated from multi-course inputs for 'Introductory + Bayesian Inference'. +author: Wesley R. 
Elsberry +license: REVIEW-REQUIRED +dependencies: [] +overrides: [] +profile_templates: {} +cross_pack_links: [] diff --git a/examples/generated_topic_pack/projects.yaml b/examples/generated_topic_pack/projects.yaml new file mode 100644 index 0000000..a3b94ce --- /dev/null +++ b/examples/generated_topic_pack/projects.yaml @@ -0,0 +1,7 @@ +projects: +- id: prior-and-posterior + title: Prior And Posterior + difficulty: review-required + prerequisites: [] + deliverables: + - project artifact diff --git a/examples/generated_topic_pack/review_report.md b/examples/generated_topic_pack/review_report.md new file mode 100644 index 0000000..103662c --- /dev/null +++ b/examples/generated_topic_pack/review_report.md @@ -0,0 +1,3 @@ +# Review Report + +- Module 'Imported from PPTX' appears to contain project-like material; review project extraction. diff --git a/examples/generated_topic_pack/roadmap.yaml b/examples/generated_topic_pack/roadmap.yaml new file mode 100644 index 0000000..6db939a --- /dev/null +++ b/examples/generated_topic_pack/roadmap.yaml @@ -0,0 +1,17 @@ +stages: +- id: stage-1 + title: Imported from MARKDOWN + concepts: + - descriptive-statistics + - probability-basics + checkpoint: [] +- id: stage-2 + title: Imported from HTML + concepts: + - prior-and-posterior + checkpoint: [] +- id: stage-3 + title: Imported from DOCX + concepts: + - model-checking + checkpoint: [] diff --git a/examples/generated_topic_pack/rubrics.yaml b/examples/generated_topic_pack/rubrics.yaml new file mode 100644 index 0000000..65aee54 --- /dev/null +++ b/examples/generated_topic_pack/rubrics.yaml @@ -0,0 +1,6 @@ +rubrics: +- id: draft-rubric + title: Draft Rubric + criteria: + - correctness + - explanation diff --git a/examples/intro_bayes_lecture.html b/examples/intro_bayes_lecture.html new file mode 100644 index 0000000..a7dfe88 --- /dev/null +++ b/examples/intro_bayes_lecture.html @@ -0,0 +1,7 @@ +
+Prior and Posterior are central concepts. Prior reflects assumptions before evidence.
+Exercise: Compare prior and posterior beliefs.
+ diff --git a/examples/intro_bayes_notes.docx b/examples/intro_bayes_notes.docx new file mode 100644 index 0000000..227f00d --- /dev/null +++ b/examples/intro_bayes_notes.docx @@ -0,0 +1,6 @@ +# Bayesian Notes + +## Model Critique +### Model Checking +A weakness is hidden assumptions. A limitation is poor fit. Uncertainty remains. +Exercise: Critique a simple inference model. diff --git a/examples/intro_bayes_outline.md b/examples/intro_bayes_outline.md new file mode 100644 index 0000000..785b94a --- /dev/null +++ b/examples/intro_bayes_outline.md @@ -0,0 +1,12 @@ +# Introductory Bayesian Inference + +## Foundations +### Descriptive Statistics +Objective: Explain mean, median, and variance. +Exercise: Summarize a small dataset. +Descriptive Statistics introduces center and spread. + +### Probability Basics +Objective: Explain conditional probability. +Exercise: Compute a simple conditional probability. +Probability Basics introduces events and likelihood. diff --git a/examples/intro_bayes_slides.pptx b/examples/intro_bayes_slides.pptx new file mode 100644 index 0000000..7f27865 --- /dev/null +++ b/examples/intro_bayes_slides.pptx @@ -0,0 +1,7 @@ +# Bayesian Slides + +## Bayesian Updating +### Prior and Posterior +Prior and Posterior summary slide text. +Capstone Mini Project +Exercise: Write a short project report comparing priors and posteriors. diff --git a/examples/intro_bayes_syllabus.pdf b/examples/intro_bayes_syllabus.pdf new file mode 100644 index 0000000..52e7b77 --- /dev/null +++ b/examples/intro_bayes_syllabus.pdf @@ -0,0 +1,5 @@ +# Bayesian Syllabus + +## Schedule +### Foundations +Objective: Explain descriptive statistics and conditional probability. 
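The example sources above are the inputs that the key-term heuristic in `topic_ingest.py` runs against. For reference, that extractor (reproduced from this diff: capitalized tokens of a minimum length, deduplicated in first-seen order, capped per lesson) behaves like:

```python
import re


def extract_key_terms(text: str, min_term_length: int = 4, max_terms: int = 8) -> list[str]:
    # Capitalized tokens at least min_term_length characters long,
    # deduplicated in first-seen order and capped at max_terms.
    candidates = re.findall(r"\b[A-Z][A-Za-z0-9\-]{%d,}\b" % (min_term_length - 1), text)
    seen: set[str] = set()
    out: list[str] = []
    for term in candidates:
        if term not in seen:
            seen.add(term)
            out.append(term)
            if len(out) >= max_terms:
                break
    return out
```

On the HTML lecture example this yields `["Prior", "Posterior"]`, which is how those terms end up as candidate concepts in the generated pack.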
diff --git a/pyproject.toml b/pyproject.toml index 0b95f2f..8716c9e 100644 --- a/pyproject.toml +++ b/pyproject.toml @@ -5,7 +5,7 @@ build-backend = "setuptools.build_meta" [project] name = "didactopus" version = "0.1.0" -description = "Didactopus: multi-source course-to-pack ingestion scaffold" +description = "Didactopus: document-adapter and cross-course merger scaffold" readme = "README.md" requires-python = ">=3.10" license = {text = "MIT"} @@ -16,7 +16,7 @@ dependencies = ["pydantic>=2.7", "pyyaml>=6.0"] dev = ["pytest>=8.0", "ruff>=0.6"] [project.scripts] -didactopus-course-ingest = "didactopus.main:main" +didactopus-topic-ingest = "didactopus.main:main" [tool.setuptools.packages.find] where = ["src"] diff --git a/src/didactopus/config.py b/src/didactopus/config.py index c42b9e7..6dfcf72 100644 --- a/src/didactopus/config.py +++ b/src/didactopus/config.py @@ -3,6 +3,15 @@ from pydantic import BaseModel, Field import yaml +class DocumentAdaptersConfig(BaseModel): + allow_pdf: bool = True + allow_docx: bool = True + allow_pptx: bool = True + allow_html: bool = True + allow_markdown: bool = True + allow_text: bool = True + + class CourseIngestConfig(BaseModel): default_pack_author: str = "Unknown" default_license: str = "REVIEW-REQUIRED" @@ -10,23 +19,17 @@ class CourseIngestConfig(BaseModel): max_terms_per_lesson: int = 8 -class RulePolicyConfig(BaseModel): - enable_prerequisite_order_rule: bool = True - enable_duplicate_term_merge_rule: bool = True - enable_project_detection_rule: bool = True - enable_review_flags: bool = True - - -class MultisourceConfig(BaseModel): - detect_duplicate_lessons: bool = True +class CrossCourseConfig(BaseModel): + detect_title_overlaps: bool = True detect_term_conflicts: bool = True + detect_order_conflicts: bool = True merge_same_named_lessons: bool = True class AppConfig(BaseModel): + document_adapters: DocumentAdaptersConfig = Field(default_factory=DocumentAdaptersConfig) course_ingest: CourseIngestConfig = 
Field(default_factory=CourseIngestConfig) - rule_policy: RulePolicyConfig = Field(default_factory=RulePolicyConfig) - multisource: MultisourceConfig = Field(default_factory=MultisourceConfig) + cross_course: CrossCourseConfig = Field(default_factory=CrossCourseConfig) def load_config(path: str | Path) -> AppConfig: diff --git a/src/didactopus/course_schema.py b/src/didactopus/course_schema.py index ed5803d..98d6743 100644 --- a/src/didactopus/course_schema.py +++ b/src/didactopus/course_schema.py @@ -1,8 +1,21 @@ from __future__ import annotations - from pydantic import BaseModel, Field +class Section(BaseModel): + heading: str + body: str = "" + + +class NormalizedDocument(BaseModel): + source_path: str + source_type: str + title: str = "" + text: str = "" + sections: list[Section] = Field(default_factory=list) + metadata: dict = Field(default_factory=dict) + + class Lesson(BaseModel): title: str body: str = "" @@ -17,21 +30,18 @@ class Module(BaseModel): lessons: list[Lesson] = Field(default_factory=list) -class NormalizedSourceRecord(BaseModel): - source_name: str - source_type: str - source_path: str - title: str = "" - modules: list[Module] = Field(default_factory=list) - - class NormalizedCourse(BaseModel): title: str source_name: str = "" source_url: str = "" rights_note: str = "" modules: list[Module] = Field(default_factory=list) - source_records: list[NormalizedSourceRecord] = Field(default_factory=list) + source_records: list[NormalizedDocument] = Field(default_factory=list) + + +class TopicBundle(BaseModel): + topic_title: str + courses: list[NormalizedCourse] = Field(default_factory=list) class ConceptCandidate(BaseModel): @@ -40,6 +50,7 @@ class ConceptCandidate(BaseModel): description: str = "" source_modules: list[str] = Field(default_factory=list) source_lessons: list[str] = Field(default_factory=list) + source_courses: list[str] = Field(default_factory=list) prerequisites: list[str] = Field(default_factory=list) mastery_signals: list[str] = 
Field(default_factory=list) diff --git a/src/didactopus/cross_course_conflicts.py b/src/didactopus/cross_course_conflicts.py new file mode 100644 index 0000000..78240c2 --- /dev/null +++ b/src/didactopus/cross_course_conflicts.py @@ -0,0 +1,50 @@ +from __future__ import annotations + +from collections import defaultdict +from .course_schema import NormalizedCourse, ConceptCandidate + + +def detect_title_overlaps(course: NormalizedCourse) -> list[str]: + lesson_to_sources = defaultdict(set) + for module in course.modules: + for lesson in module.lessons: + for src in lesson.source_refs: + lesson_to_sources[lesson.title.lower()].add(src) + flags = [] + for title, sources in lesson_to_sources.items(): + if len(sources) > 1: + flags.append(f"Lesson title '{title}' appears across multiple sources: {', '.join(sorted(sources))}") + return flags + + +def detect_term_conflicts(course: NormalizedCourse) -> list[str]: + term_to_lessons = defaultdict(set) + for module in course.modules: + for lesson in module.lessons: + for term in lesson.key_terms: + term_to_lessons[term.lower()].add(lesson.title) + flags = [] + for term, lessons in term_to_lessons.items(): + if len(lessons) > 1: + flags.append(f"Key term '{term}' appears in multiple lesson contexts: {', '.join(sorted(lessons))}") + return flags + + +def detect_order_conflicts(course: NormalizedCourse) -> list[str]: + # Placeholder heuristic: if same lesson title appears in multiple source_refs, flag for order review. 
+ flags = [] + for module in course.modules: + for lesson in module.lessons: + if len(set(lesson.source_refs)) > 1: + flags.append(f"Lesson '{lesson.title}' was merged from multiple sources; review ordering assumptions.") + return flags + + +def detect_thin_concepts(concepts: list[ConceptCandidate]) -> list[str]: + flags = [] + for concept in concepts: + if len(concept.description.strip()) < 20: + flags.append(f"Concept '{concept.title}' has a very thin description.") + if not concept.mastery_signals: + flags.append(f"Concept '{concept.title}' has no extracted mastery signals.") + return flags diff --git a/src/didactopus/document_adapters.py b/src/didactopus/document_adapters.py new file mode 100644 index 0000000..a759207 --- /dev/null +++ b/src/didactopus/document_adapters.py @@ -0,0 +1,141 @@ +from __future__ import annotations + +from pathlib import Path +import re +from .course_schema import NormalizedDocument, Section + + +def _title_from_path(path: str | Path) -> str: + p = Path(path) + return p.stem.replace("_", " ").replace("-", " ").title() + + +def _simple_section_split(text: str) -> list[Section]: + sections = [] + current_heading = "Main" + current_lines = [] + for line in text.splitlines(): + if re.match(r"^(#{1,3})\s+", line): + if current_lines: + sections.append(Section(heading=current_heading, body="\n".join(current_lines).strip())) + current_heading = re.sub(r"^(#{1,3})\s+", "", line).strip() + current_lines = [] + else: + current_lines.append(line) + if current_lines: + sections.append(Section(heading=current_heading, body="\n".join(current_lines).strip())) + return sections + + +def read_textish(path: str | Path) -> str: + return Path(path).read_text(encoding="utf-8") + + +def adapt_markdown(path: str | Path) -> NormalizedDocument: + text = read_textish(path) + return NormalizedDocument( + source_path=str(path), + source_type="markdown", + title=_title_from_path(path), + text=text, + sections=_simple_section_split(text), + metadata={}, + ) + + 
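The `_simple_section_split` helper above drives every text-like adapter. A standalone sketch of the same heading-splitting behavior (same regex; `simple_section_split` is a hypothetical public rename, and a plain dataclass stands in for the `Section` model):

```python
import re
from dataclasses import dataclass


@dataclass
class Section:
    heading: str
    body: str = ""


def simple_section_split(text: str) -> list[Section]:
    # Split on ATX headings of level 1-3; text before the first heading
    # falls under a default "Main" heading. A heading immediately followed
    # by another heading is dropped, matching the scaffold helper.
    sections: list[Section] = []
    current_heading = "Main"
    current_lines: list[str] = []
    for line in text.splitlines():
        if re.match(r"^(#{1,3})\s+", line):
            if current_lines:
                sections.append(Section(current_heading, "\n".join(current_lines).strip()))
            current_heading = re.sub(r"^(#{1,3})\s+", "", line).strip()
            current_lines = []
        else:
            current_lines.append(line)
    if current_lines:
        sections.append(Section(current_heading, "\n".join(current_lines).strip()))
    return sections
```

Since each resulting section later becomes one lesson, the quality of the whole pipeline currently hinges on sources using heading-like structure.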
+def adapt_text(path: str | Path) -> NormalizedDocument: + text = read_textish(path) + return NormalizedDocument( + source_path=str(path), + source_type="text", + title=_title_from_path(path), + text=text, + sections=_simple_section_split(text), + metadata={}, + ) + + +def adapt_html(path: str | Path) -> NormalizedDocument: + raw = read_textish(path) + text = re.sub(r"<[^>]+>", " ", raw) + text = re.sub(r"\s+", " ", text).strip() + return NormalizedDocument( + source_path=str(path), + source_type="html", + title=_title_from_path(path), + text=text, + sections=[Section(heading="HTML Extract", body=text)], + metadata={"extraction": "stub-html-strip"}, + ) + + +def adapt_pdf(path: str | Path) -> NormalizedDocument: + # Stub: in a real implementation, plug in PDF text extraction here. + text = read_textish(path) + return NormalizedDocument( + source_path=str(path), + source_type="pdf", + title=_title_from_path(path), + text=text, + sections=_simple_section_split(text), + metadata={"extraction": "stub-pdf-text"}, + ) + + +def adapt_docx(path: str | Path) -> NormalizedDocument: + # Stub: in a real implementation, plug in DOCX extraction here. + text = read_textish(path) + return NormalizedDocument( + source_path=str(path), + source_type="docx", + title=_title_from_path(path), + text=text, + sections=_simple_section_split(text), + metadata={"extraction": "stub-docx-text"}, + ) + + +def adapt_pptx(path: str | Path) -> NormalizedDocument: + # Stub: in a real implementation, plug in PPTX extraction here. 
+ text = read_textish(path) + return NormalizedDocument( + source_path=str(path), + source_type="pptx", + title=_title_from_path(path), + text=text, + sections=_simple_section_split(text), + metadata={"extraction": "stub-pptx-text"}, + ) + + +def detect_adapter(path: str | Path) -> str: + p = Path(path) + suffix = p.suffix.lower() + if suffix == ".md": + return "markdown" + if suffix in {".txt"}: + return "text" + if suffix in {".html", ".htm"}: + return "html" + if suffix == ".pdf": + return "pdf" + if suffix == ".docx": + return "docx" + if suffix == ".pptx": + return "pptx" + return "text" + + +def adapt_document(path: str | Path) -> NormalizedDocument: + adapter = detect_adapter(path) + if adapter == "markdown": + return adapt_markdown(path) + if adapter == "html": + return adapt_html(path) + if adapter == "pdf": + return adapt_pdf(path) + if adapter == "docx": + return adapt_docx(path) + if adapter == "pptx": + return adapt_pptx(path) + return adapt_text(path) diff --git a/src/didactopus/main.py b/src/didactopus/main.py index 253b5b6..95a1076 100644 --- a/src/didactopus/main.py +++ b/src/didactopus/main.py @@ -4,18 +4,19 @@ import argparse from pathlib import Path from .config import load_config -from .course_ingest import parse_source_file, merge_source_records, extract_concept_candidates +from .document_adapters import adapt_document +from .topic_ingest import document_to_course, build_topic_bundle, merge_courses_into_topic_course, extract_concept_candidates +from .cross_course_conflicts import detect_title_overlaps, detect_term_conflicts, detect_order_conflicts, detect_thin_concepts from .rule_policy import RuleContext, build_default_rules, run_rules -from .conflict_report import detect_duplicate_lessons, detect_term_conflicts, detect_thin_concepts from .pack_emitter import build_draft_pack, write_draft_pack def build_parser() -> argparse.ArgumentParser: - parser = argparse.ArgumentParser(description="Didactopus multi-source course-to-pack ingestion 
pipeline") - parser.add_argument("--inputs", nargs="+", required=True, help="Input source files") - parser.add_argument("--title", required=True, help="Course or topic title") + parser = argparse.ArgumentParser(description="Didactopus document-adapter and cross-course topic ingestion") + parser.add_argument("--inputs", nargs="+", required=True, help="Document inputs") + parser.add_argument("--title", required=True, help="Topic title") parser.add_argument("--rights-note", default="REVIEW REQUIRED") - parser.add_argument("--output-dir", default="generated-pack") + parser.add_argument("--output-dir", default="generated-topic-pack") parser.add_argument("--config", default="configs/config.example.yaml") return parser @@ -24,33 +25,30 @@ def main() -> None: args = build_parser().parse_args() config = load_config(args.config) - records = [parse_source_file(path, title=args.title) for path in args.inputs] - course = merge_source_records( - records=records, - course_title=args.title, - rights_note=args.rights_note, - merge_same_named_lessons=config.multisource.merge_same_named_lessons, + docs = [adapt_document(path) for path in args.inputs] + courses = [document_to_course(doc, course_title=args.title) for doc in docs] + topic = build_topic_bundle(args.title, courses) + merged_course = merge_courses_into_topic_course( + topic_bundle=topic, + merge_same_named_lessons=config.cross_course.merge_same_named_lessons, ) - concepts = extract_concept_candidates(course) - context = RuleContext(course=course, concepts=concepts) + concepts = extract_concept_candidates(merged_course) - rules = build_default_rules( - enable_prereq=config.rule_policy.enable_prerequisite_order_rule, - enable_merge=config.rule_policy.enable_duplicate_term_merge_rule, - enable_projects=config.rule_policy.enable_project_detection_rule, - enable_review=config.rule_policy.enable_review_flags, - ) + context = RuleContext(course=merged_course, concepts=concepts) + rules = build_default_rules() run_rules(context, 
rules) conflicts = [] - if config.multisource.detect_duplicate_lessons: - conflicts.extend(detect_duplicate_lessons(course)) - if config.multisource.detect_term_conflicts: - conflicts.extend(detect_term_conflicts(course)) + if config.cross_course.detect_title_overlaps: + conflicts.extend(detect_title_overlaps(merged_course)) + if config.cross_course.detect_term_conflicts: + conflicts.extend(detect_term_conflicts(merged_course)) + if config.cross_course.detect_order_conflicts: + conflicts.extend(detect_order_conflicts(merged_course)) conflicts.extend(detect_thin_concepts(context.concepts)) draft = build_draft_pack( - course=course, + course=merged_course, concepts=context.concepts, author=config.course_ingest.default_pack_author, license_name=config.course_ingest.default_license, @@ -59,10 +57,11 @@ def main() -> None: ) write_draft_pack(draft, args.output_dir) - print("== Didactopus Multi-Source Course Ingest ==") - print(f"Course: {course.title}") - print(f"Sources: {len(records)}") - print(f"Modules: {len(course.modules)}") + print("== Didactopus Cross-Course Topic Ingest ==") + print(f"Topic: {args.title}") + print(f"Documents: {len(docs)}") + print(f"Courses: {len(courses)}") + print(f"Merged modules: {len(merged_course.modules)}") print(f"Concept candidates: {len(context.concepts)}") print(f"Review flags: {len(context.review_flags)}") print(f"Conflicts: {len(conflicts)}") diff --git a/src/didactopus/pack_emitter.py b/src/didactopus/pack_emitter.py index 87899e7..e3058c0 100644 --- a/src/didactopus/pack_emitter.py +++ b/src/didactopus/pack_emitter.py @@ -15,7 +15,7 @@ def build_draft_pack(course: NormalizedCourse, concepts: list[ConceptCandidate], "schema_version": "1", "didactopus_min_version": "0.1.0", "didactopus_max_version": "0.9.99", - "description": f"Draft pack generated from multi-source course inputs for '{course.title}'.", + "description": f"Draft topic pack generated from multi-course inputs for '{course.title}'.", "author": author, "license": 
license_name, "dependencies": [], @@ -64,7 +64,7 @@ def build_draft_pack(course: NormalizedCourse, concepts: list[ConceptCandidate], attribution = { "rights_note": course.rights_note, "sources": [ - {"source_name": src.source_name, "source_type": src.source_type, "source_path": src.source_path} + {"source_path": src.source_path, "source_type": src.source_type, "title": src.title} for src in course.source_records ], } @@ -88,11 +88,11 @@ def write_draft_pack(pack: DraftPack, outdir: str | Path) -> None: (out / "roadmap.yaml").write_text(yaml.safe_dump(pack.roadmap, sort_keys=False), encoding="utf-8") (out / "projects.yaml").write_text(yaml.safe_dump(pack.projects, sort_keys=False), encoding="utf-8") (out / "rubrics.yaml").write_text(yaml.safe_dump(pack.rubrics, sort_keys=False), encoding="utf-8") review_lines = ["# Review Report", ""] + [f"- {flag}" for flag in pack.review_report] if pack.review_report else ["# Review Report", "", "- none"] (out / "review_report.md").write_text("\n".join(review_lines), encoding="utf-8") conflict_lines = ["# Conflict Report", ""] + [f"- {flag}" for flag in pack.conflicts] if pack.conflicts else ["# Conflict Report", "", "- none"] (out / "conflict_report.md").write_text("\n".join(conflict_lines), encoding="utf-8") (out / "license_attribution.json").write_text(json.dumps(pack.attribution, indent=2), encoding="utf-8") diff --git a/src/didactopus/rule_policy.py index 8f7747b..dcf4b31 100644 --- a/src/didactopus/rule_policy.py +++ b/src/didactopus/rule_policy.py @@ -39,6 +39,7 @@ def duplicate_term_merge_rule(context: RuleContext) -> None: if key in seen: seen[key].source_modules.extend(x for x in concept.source_modules if x not in seen[key].source_modules) seen[key].source_lessons.extend(x for x in concept.source_lessons if x not in seen[key].source_lessons) + seen[key].source_courses.extend(x for x in concept.source_courses if x not in seen[key].source_courses) if concept.description and 
len(seen[key].description) < len(concept.description): seen[key].description = concept.description else: diff --git a/src/didactopus/topic_ingest.py b/src/didactopus/topic_ingest.py new file mode 100644 index 0000000..a725555 --- /dev/null +++ b/src/didactopus/topic_ingest.py @@ -0,0 +1,126 @@ +from __future__ import annotations + +import re +from collections import defaultdict +from .course_schema import NormalizedDocument, NormalizedCourse, Module, Lesson, TopicBundle, ConceptCandidate + + +def slugify(text: str) -> str: + cleaned = re.sub(r"[^a-zA-Z0-9]+", "-", text.strip().lower()).strip("-") + return cleaned or "untitled" + + +def extract_key_terms(text: str, min_term_length: int = 4, max_terms: int = 8) -> list[str]: + candidates = re.findall(r"\b[A-Z][A-Za-z0-9\-]{%d,}\b" % (min_term_length - 1), text) + seen = set() + out = [] + for term in candidates: + if term not in seen: + seen.add(term) + out.append(term) + if len(out) >= max_terms: + break + return out + + +def document_to_course(doc: NormalizedDocument, course_title: str) -> NormalizedCourse: + # Conservative mapping: each section becomes a lesson; all lessons go into one module. 
+ lessons = [] + for section in doc.sections: + body = section.body.strip() + lines = body.splitlines() + objectives = [] + exercises = [] + for line in lines: + low = line.lower().strip() + if low.startswith("objective:"): + objectives.append(line.split(":", 1)[1].strip()) + if low.startswith("exercise:"): + exercises.append(line.split(":", 1)[1].strip()) + lessons.append( + Lesson( + title=section.heading.strip() or "Untitled Lesson", + body=body, + objectives=objectives, + exercises=exercises, + key_terms=extract_key_terms(section.heading + "\n" + body), + source_refs=[doc.source_path], + ) + ) + module = Module(title=f"Imported from {doc.source_type.upper()}", lessons=lessons) + return NormalizedCourse(title=course_title, modules=[module], source_records=[doc]) + + +def build_topic_bundle(topic_title: str, courses: list[NormalizedCourse]) -> TopicBundle: + return TopicBundle(topic_title=topic_title, courses=courses) + + +def merge_courses_into_topic_course(topic_bundle: TopicBundle, merge_same_named_lessons: bool = True) -> NormalizedCourse: + modules_by_title: dict[str, Module] = {} + source_records = [] + for course in topic_bundle.courses: + source_records.extend(course.source_records) + for module in course.modules: + target_module = modules_by_title.setdefault(module.title, Module(title=module.title, lessons=[])) + if merge_same_named_lessons: + lesson_map = {lesson.title: lesson for lesson in target_module.lessons} + for lesson in module.lessons: + if lesson.title in lesson_map: + existing = lesson_map[lesson.title] + if lesson.body and lesson.body not in existing.body: + existing.body = (existing.body + "\n\n" + lesson.body).strip() + for x in lesson.objectives: + if x not in existing.objectives: + existing.objectives.append(x) + for x in lesson.exercises: + if x not in existing.exercises: + existing.exercises.append(x) + for x in lesson.key_terms: + if x not in existing.key_terms: + existing.key_terms.append(x) + for x in lesson.source_refs: + if x not 
in existing.source_refs: + existing.source_refs.append(x) + else: + target_module.lessons.append(lesson) + # index the appended lesson so later same-titled lessons in this pass merge into it + lesson_map[lesson.title] = lesson + else: + target_module.lessons.extend(module.lessons) + return NormalizedCourse(title=topic_bundle.topic_title, modules=list(modules_by_title.values()), source_records=source_records) + + +def extract_concept_candidates(course: NormalizedCourse) -> list[ConceptCandidate]: + concepts: list[ConceptCandidate] = [] + seen_ids = set() + for module in course.modules: + for lesson in module.lessons: + cid = slugify(lesson.title) + if cid not in seen_ids: + seen_ids.add(cid) + concepts.append( + ConceptCandidate( + id=cid, + title=lesson.title, + description=lesson.body[:240].strip(), + source_modules=[module.title], + source_lessons=[lesson.title], + source_courses=list(lesson.source_refs), + mastery_signals=list(lesson.objectives[:3] or lesson.exercises[:2]), + ) + ) + for term in lesson.key_terms: + tid = slugify(term) + if tid in seen_ids: + continue + seen_ids.add(tid) + concepts.append( + ConceptCandidate( + id=tid, + title=term, + description=f"Candidate concept extracted from lesson '{lesson.title}'.", + source_modules=[module.title], + source_lessons=[lesson.title], + source_courses=list(lesson.source_refs), + mastery_signals=list(lesson.objectives[:2]), + ) + ) + return concepts diff --git a/tests/test_cross_course_conflicts.py b/tests/test_cross_course_conflicts.py new file mode 100644 index 0000000..8a3099d --- /dev/null +++ b/tests/test_cross_course_conflicts.py @@ -0,0 +1,19 @@ +from pathlib import Path +from didactopus.document_adapters import adapt_document +from didactopus.topic_ingest import document_to_course, build_topic_bundle, merge_courses_into_topic_course, extract_concept_candidates +from didactopus.cross_course_conflicts import detect_title_overlaps, detect_term_conflicts, detect_order_conflicts, detect_thin_concepts + + +def test_conflict_detection(tmp_path: Path) -> None: + a = tmp_path / "a.md" + b = tmp_path / "b.md" + a.write_text("# T\n\n## M1\n### Bayesian
Updating\nPrior and Posterior appear here.", encoding="utf-8") + b.write_text("# T\n\n## M2\n### Bayesian Updating\nPrior and Posterior appear again.", encoding="utf-8") + docs = [adapt_document(a), adapt_document(b)] + courses = [document_to_course(doc, "Topic") for doc in docs] + merged = merge_courses_into_topic_course(build_topic_bundle("Topic", courses), merge_same_named_lessons=False) + concepts = extract_concept_candidates(merged) + assert isinstance(detect_title_overlaps(merged), list) + assert isinstance(detect_term_conflicts(merged), list) + assert isinstance(detect_order_conflicts(merged), list) + assert isinstance(detect_thin_concepts(concepts), list) diff --git a/tests/test_document_adapters.py b/tests/test_document_adapters.py new file mode 100644 index 0000000..2d020eb --- /dev/null +++ b/tests/test_document_adapters.py @@ -0,0 +1,18 @@ +from pathlib import Path +from didactopus.document_adapters import adapt_document, detect_adapter + + +def test_detect_adapter() -> None: + assert detect_adapter("a.md") == "markdown" + assert detect_adapter("b.html") == "html" + assert detect_adapter("c.pdf") == "pdf" + assert detect_adapter("d.docx") == "docx" + assert detect_adapter("e.pptx") == "pptx" + + +def test_adapt_markdown(tmp_path: Path) -> None: + p = tmp_path / "x.md" + p.write_text("# T\n\n## A\nBody", encoding="utf-8") + doc = adapt_document(p) + assert doc.source_type == "markdown" + assert len(doc.sections) >= 1 diff --git a/tests/test_pack_output.py b/tests/test_pack_output.py index 6e4ad11..87711ba 100644 --- a/tests/test_pack_output.py +++ b/tests/test_pack_output.py @@ -1,17 +1,20 @@ from pathlib import Path -from didactopus.course_ingest import parse_source_file, merge_source_records, extract_concept_candidates +from didactopus.document_adapters import adapt_document +from didactopus.topic_ingest import document_to_course, build_topic_bundle, merge_courses_into_topic_course, extract_concept_candidates from didactopus.rule_policy import 
RuleContext, build_default_rules, run_rules from didactopus.pack_emitter import build_draft_pack, write_draft_pack -def test_emit_multisource_pack(tmp_path: Path) -> None: +def test_emit_topic_pack(tmp_path: Path) -> None: src = tmp_path / "course.md" - src.write_text("# C\n\n## M1\n### Lesson A\n- Objective: Explain Topic A.\n- Exercise: Do task A.\nTopic A body.", encoding="utf-8") - course = merge_source_records([parse_source_file(src, title="Course")], course_title="Course") - concepts = extract_concept_candidates(course) - ctx = RuleContext(course=course, concepts=concepts) + src.write_text("# T\n\n## M\n### L\nExercise: Do task A.\nTopic A body.", encoding="utf-8") + doc = adapt_document(src) + course = document_to_course(doc, "Topic") + merged = merge_courses_into_topic_course(build_topic_bundle("Topic", [course])) + concepts = extract_concept_candidates(merged) + ctx = RuleContext(course=merged, concepts=concepts) run_rules(ctx, build_default_rules()) - draft = build_draft_pack(course, ctx.concepts, "Tester", "REVIEW", ctx.review_flags, []) + draft = build_draft_pack(merged, ctx.concepts, "Tester", "REVIEW", ctx.review_flags, []) write_draft_pack(draft, tmp_path / "out") assert (tmp_path / "out" / "pack.yaml").exists() assert (tmp_path / "out" / "conflict_report.md").exists() diff --git a/tests/test_topic_ingest.py b/tests/test_topic_ingest.py new file mode 100644 index 0000000..9ca6c25 --- /dev/null +++ b/tests/test_topic_ingest.py @@ -0,0 +1,26 @@ +from pathlib import Path +from didactopus.document_adapters import adapt_document +from didactopus.topic_ingest import document_to_course, build_topic_bundle, merge_courses_into_topic_course, extract_concept_candidates + + +def test_cross_course_merge(tmp_path: Path) -> None: + a = tmp_path / "a.md" + b = tmp_path / "b.docx" + a.write_text("# T\n\n## M\n### L1\nBody A", encoding="utf-8") + b.write_text("# T\n\n## M\n### L1\nBody B", encoding="utf-8") + + docs = [adapt_document(a), adapt_document(b)] + courses = 
[document_to_course(doc, "Topic") for doc in docs] + topic = build_topic_bundle("Topic", courses) + merged = merge_courses_into_topic_course(topic) + assert len(merged.modules) >= 1 + assert len(merged.modules[0].lessons) == 1 + + +def test_extract_concepts(tmp_path: Path) -> None: + a = tmp_path / "a.md" + a.write_text("# T\n\n## M\n### Lesson A\nObjective: Explain Topic A.\nBody.", encoding="utf-8") + doc = adapt_document(a) + course = document_to_course(doc, "Topic") + concepts = extract_concept_candidates(course) + assert len(concepts) >= 1
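The same-named-lesson merge exercised by `test_cross_course_merge` can be illustrated with a minimal, self-contained sketch. Plain dicts stand in for the `Lesson` dataclass, and `merge_lessons` is an illustrative name, not part of the didactopus API: bodies of same-titled lessons are concatenated (duplicates skipped) and list fields are unioned in first-seen order.

```python
def merge_lessons(lessons: list[dict]) -> list[dict]:
    """Merge lesson dicts that share a title, preserving first-seen order.

    Sketch only: plain dicts stand in for didactopus Lesson objects.
    """
    merged: dict[str, dict] = {}
    for lesson in lessons:
        existing = merged.get(lesson["title"])
        if existing is None:
            # Copy list fields so merging never mutates the caller's input.
            merged[lesson["title"]] = {
                "title": lesson["title"],
                "body": lesson["body"],
                "key_terms": list(lesson["key_terms"]),
            }
            continue
        # Append a non-duplicate body as a new paragraph.
        if lesson["body"] and lesson["body"] not in existing["body"]:
            existing["body"] = (existing["body"] + "\n\n" + lesson["body"]).strip()
        # Union key terms, keeping first-seen order.
        for term in lesson["key_terms"]:
            if term not in existing["key_terms"]:
                existing["key_terms"].append(term)
    return list(merged.values())


merged = merge_lessons([
    {"title": "L1", "body": "Body A", "key_terms": ["prior"]},
    {"title": "L1", "body": "Body B", "key_terms": ["prior", "posterior"]},
])
print(merged)  # one lesson: bodies joined, key terms unioned in order
```

This mirrors why the test above expects exactly one lesson after merging two same-titled sources; dict insertion order (guaranteed since Python 3.7) is what preserves the first-seen ordering.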