Added cross-course merger.
parent
8defaab1c2
commit
0656f7bbe8
36	README.md

@@ -8,6 +8,41 @@
## Recent revisions

### Course-to-course merger

This revision adds two major capabilities:

- **real document adapter scaffolds** for PDF, DOCX, PPTX, and HTML
- a **cross-course merger** for combining multiple course-derived packs into one stronger domain draft

These additions extend the earlier multi-source ingestion layer from "multiple files for one course"
to "multiple courses or course-like sources for one topic domain."

## What is included

- adapter registry for:
  - PDF
  - DOCX
  - PPTX
  - HTML
  - Markdown
  - text
- normalized document extraction interface
- course bundle ingestion across multiple source documents
- cross-course terminology and overlap analysis
- merged topic-pack emitter
- cross-course conflict report
- example source files and example merged output

## Design stance

This is still scaffold-level extraction. The purpose is to define stable interfaces and emitted artifacts,
not to claim perfect semantic parsing of every teaching document.

The implementation is designed so stronger parsers can later replace the stub extractors without changing
the surrounding pipeline.
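That replacement seam can be pictured as a small registry of extractor callables. The sketch below is illustrative only — the names (`register_extractor`, `extract`) are hypothetical, not the actual Didactopus API:

```python
# Minimal sketch of a pluggable-extractor seam: stub extractors today,
# stronger parsers swapped in later without touching the pipeline.
# All names here are illustrative, not the real Didactopus interfaces.
from typing import Callable

EXTRACTORS: dict[str, Callable[[str], str]] = {}


def register_extractor(source_type: str, fn: Callable[[str], str]) -> None:
    """Replace the extractor for one source type; callers are unaffected."""
    EXTRACTORS[source_type] = fn


def extract(source_type: str, raw: str) -> str:
    # Fall back to a trivial pass-through stub when nothing better is registered.
    return EXTRACTORS.get(source_type, lambda text: text)(raw)


# Today's stub behavior; a real PDF parser could be registered here later.
register_extractor("pdf", lambda raw: raw.strip())
```

The pipeline only ever calls `extract`, so upgrading one format is a one-line registration.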

### Multi-Source Course Ingestion

This revision adds a **Multi-Source Course Ingestion Layer**.

@@ -216,3 +251,4 @@ didactopus/

@@ -1,16 +1,19 @@
document_adapters:
  allow_pdf: true
  allow_docx: true
  allow_pptx: true
  allow_html: true
  allow_markdown: true
  allow_text: true

course_ingest:
  default_pack_author: "Wesley R. Elsberry"
  default_license: "REVIEW-REQUIRED"
  min_term_length: 4
  max_terms_per_lesson: 8

rule_policy:
  enable_prerequisite_order_rule: true
  enable_duplicate_term_merge_rule: true
  enable_project_detection_rule: true
  enable_review_flags: true

multisource:
  detect_duplicate_lessons: true

cross_course:
  detect_title_overlaps: true
  detect_term_conflicts: true
  detect_order_conflicts: true
  merge_same_named_lessons: true

@@ -0,0 +1,31 @@

# Cross-Course Merger

The cross-course merger combines multiple course-like inputs covering the same subject area.

## Goal

Build a stronger draft topic pack from several partially overlapping sources.

## What it does

- merges normalized source records into course bundles
- merges course bundles into one topic bundle
- compares repeated concepts across courses
- flags terminology conflicts and overlap
- emits a merged draft pack
- emits a cross-course conflict report

## Why this matters

A single course is rarely ideal for mastery-oriented domain construction.
Combining multiple sources can improve:

- concept coverage
- exercise diversity
- project identification
- terminology mapping
- prerequisite robustness

## Important caveat

This merger is draft-oriented.
Human review remains necessary before trusting the result as a final domain pack.
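The lesson-merge behavior described above can be sketched with plain dictionaries. This is a simplified illustration under assumed data shapes, not the actual Didactopus models or merge code:

```python
# Simplified sketch of merging same-named lessons across courses:
# union the exercises, keep every source reference.
# The dict shapes ("title", "exercises", "source") are hypothetical stand-ins.

def merge_lessons(courses: list[list[dict]]) -> list[dict]:
    """Merge lessons that share a title across several course lesson lists."""
    merged: dict[str, dict] = {}
    for course in courses:
        for lesson in course:
            key = lesson["title"].lower()
            if key not in merged:
                merged[key] = {"title": lesson["title"], "exercises": [], "sources": []}
            target = merged[key]
            for ex in lesson["exercises"]:
                if ex not in target["exercises"]:
                    target["exercises"].append(ex)
            if lesson["source"] not in target["sources"]:
                target["sources"].append(lesson["source"])
    return list(merged.values())
```

A lesson that ends up with more than one entry in `sources` is exactly the kind of merge the conflict report flags for human review.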

@@ -0,0 +1,42 @@

# Document Adapters

Didactopus now includes adapter scaffolds for several common educational document types.

## Supported adapter interfaces

- PDF adapter
- DOCX adapter
- PPTX adapter
- HTML adapter
- Markdown adapter
- text adapter

## Current status

The current implementation is intentionally conservative:

- it focuses on stable interfaces
- it extracts text in a simplified way
- it normalizes results into shared internal structures

## Why this matters

Educational material commonly lives in:

- syllabi PDFs
- DOCX notes
- PowerPoint slide decks
- LMS HTML exports
- markdown lesson files

A useful curriculum distiller must be able to treat these as first-class inputs.

## Adapter contract

Each adapter returns a normalized document record with:

- source path
- source type
- title
- extracted text
- sections
- metadata

This record is then passed into higher-level course/topic distillation logic.
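The contract can be pictured as a small record type. The dataclass below mirrors the field list above for illustration; the real implementation uses pydantic models, so this is a sketch, not the exact definition:

```python
# Illustrative record type mirroring the adapter contract listed above.
# The real Didactopus code defines these as pydantic models.
from dataclasses import dataclass, field


@dataclass
class Section:
    heading: str
    body: str = ""


@dataclass
class DocumentRecord:
    source_path: str          # where the file came from
    source_type: str          # "pdf", "docx", "pptx", "html", "markdown", "text"
    title: str = ""
    text: str = ""            # extracted text
    sections: list[Section] = field(default_factory=list)
    metadata: dict = field(default_factory=dict)


record = DocumentRecord(
    source_path="examples/intro_bayes_outline.md",
    source_type="markdown",
    title="Intro Bayes Outline",
    sections=[Section(heading="Descriptive Statistics")],
)
```

Every adapter emits the same shape, so downstream course/topic distillation never branches on the input format.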

34	docs/faq.md

@@ -1,27 +1,25 @@
 # FAQ
 
-## Why multi-source ingestion?
+## Why add document adapters now?
 
-Because course structure is usually distributed across several files rather than
-perfectly contained in one source.
+Because real educational material is rarely provided in only one plain-text format.
 
-## What kinds of conflicts can arise?
+## Are these full-fidelity parsers?
 
-Common examples:
-- the same lesson with slightly different titles
-- inconsistent terminology across notes and transcripts
-- exercises present in one source but absent in another
-- project prompts implied in one file and explicit in another
+Not yet. The current implementation is a stable scaffold for extraction and normalization.
 
-## Does the system resolve all conflicts automatically?
+## Why add cross-course merging?
 
-No. It produces a merged draft pack and a conflict report for human review.
+Because one course often under-specifies a domain, while multiple sources together can produce a better draft pack.
 
-## Why not rely only on embeddings for this?
+## Does the merger resolve every concept conflict automatically?
 
-Because Didactopus needs explicit structures such as:
-- concepts
-- prerequisites
-- projects
-- rubrics
-- checkpoints
+No. It produces a merged draft plus a conflict report for human review.
+
+## What kinds of issues are flagged?
+
+Examples:
+- repeated concepts with different names
+- same term used with different local contexts
+- courses that introduce topics in conflicting orders
+- weak or thin concept descriptions

@@ -0,0 +1,42 @@

concepts:
- id: descriptive-statistics
  title: Descriptive Statistics
  description: 'Objective: Explain mean, median, and variance.

    Exercise: Summarize a small dataset.

    Descriptive Statistics introduces center and spread.'
  prerequisites: []
  mastery_signals:
  - Summarize a small dataset.
  mastery_profile: {}
- id: probability-basics
  title: Probability Basics
  description: 'Objective: Explain conditional probability.

    Exercise: Compute a simple conditional probability.

    Probability Basics introduces events and likelihood.'
  prerequisites:
  - descriptive-statistics
  mastery_signals:
  - Compute a simple conditional probability.
  mastery_profile: {}
- id: prior-and-posterior
  title: Prior And Posterior
  description: 'Prior and Posterior are central concepts. Prior reflects assumptions
    before evidence. Exercise: Compare prior and posterior beliefs.'
  prerequisites:
  - probability-basics
  mastery_signals:
  - Compare prior and posterior beliefs.
  mastery_profile: {}
- id: model-checking
  title: Model Checking
  description: 'A weakness is hidden assumptions. A limitation is poor fit. Uncertainty
    remains. Exercise: Critique a simple inference model.'
  prerequisites:
  - prior-and-posterior
  mastery_signals:
  - Critique a simple inference model.
  mastery_profile: {}

@@ -0,0 +1,3 @@

# Conflict Report

- Lesson 'prior and posterior' was merged from multiple sources; review ordering assumptions.

@@ -0,0 +1,30 @@

{
  "rights_note": "REVIEW REQUIRED",
  "sources": [
    {
      "source_path": "examples/intro_bayes_outline.md",
      "source_type": "markdown",
      "title": "Intro Bayes Outline"
    },
    {
      "source_path": "examples/intro_bayes_lecture.html",
      "source_type": "html",
      "title": "Intro Bayes Lecture"
    },
    {
      "source_path": "examples/intro_bayes_slides.pptx",
      "source_type": "pptx",
      "title": "Intro Bayes Slides"
    },
    {
      "source_path": "examples/intro_bayes_notes.docx",
      "source_type": "docx",
      "title": "Intro Bayes Notes"
    },
    {
      "source_path": "examples/intro_bayes_syllabus.pdf",
      "source_type": "pdf",
      "title": "Intro Bayes Syllabus"
    }
  ]
}

@@ -0,0 +1,14 @@

name: introductory-bayesian-inference
display_name: Introductory Bayesian Inference
version: 0.1.0-draft
schema_version: '1'
didactopus_min_version: 0.1.0
didactopus_max_version: 0.9.99
description: Draft topic pack generated from multi-course inputs for 'Introductory
  Bayesian Inference'.
author: Wesley R. Elsberry
license: REVIEW-REQUIRED
dependencies: []
overrides: []
profile_templates: {}
cross_pack_links: []

@@ -0,0 +1,7 @@

projects:
- id: prior-and-posterior
  title: Prior And Posterior
  difficulty: review-required
  prerequisites: []
  deliverables:
  - project artifact

@@ -0,0 +1,3 @@

# Review Report

- Module 'Imported from PPTX' appears to contain project-like material; review project extraction.

@@ -0,0 +1,17 @@

stages:
- id: stage-1
  title: Imported from MARKDOWN
  concepts:
  - descriptive-statistics
  - probability-basics
  checkpoint: []
- id: stage-2
  title: Imported from HTML
  concepts:
  - prior-and-posterior
  checkpoint: []
- id: stage-3
  title: Imported from DOCX
  concepts:
  - model-checking
  checkpoint: []

@@ -0,0 +1,6 @@

rubrics:
- id: draft-rubric
  title: Draft Rubric
  criteria:
  - correctness
  - explanation

@@ -0,0 +1,7 @@

<html><body>
<h1>Introductory Bayesian Inference</h1>
<h2>Bayesian Updating</h2>
<h3>Prior and Posterior</h3>
<p>Prior and Posterior are central concepts. Prior reflects assumptions before evidence.</p>
<p>Exercise: Compare prior and posterior beliefs.</p>
</body></html>

@@ -0,0 +1,6 @@

# Bayesian Notes

## Model Critique
### Model Checking
A weakness is hidden assumptions. A limitation is poor fit. Uncertainty remains.
Exercise: Critique a simple inference model.

@@ -0,0 +1,12 @@

# Introductory Bayesian Inference

## Foundations
### Descriptive Statistics
Objective: Explain mean, median, and variance.
Exercise: Summarize a small dataset.
Descriptive Statistics introduces center and spread.

### Probability Basics
Objective: Explain conditional probability.
Exercise: Compute a simple conditional probability.
Probability Basics introduces events and likelihood.

@@ -0,0 +1,7 @@

# Bayesian Slides

## Bayesian Updating
### Prior and Posterior
Prior and Posterior summary slide text.
Capstone Mini Project
Exercise: Write a short project report comparing priors and posteriors.

@@ -0,0 +1,5 @@

# Bayesian Syllabus

## Schedule
### Foundations
Objective: Explain descriptive statistics and conditional probability.

@@ -5,7 +5,7 @@ build-backend = "setuptools.build_meta"
 [project]
 name = "didactopus"
 version = "0.1.0"
-description = "Didactopus: multi-source course-to-pack ingestion scaffold"
+description = "Didactopus: document-adapter and cross-course merger scaffold"
 readme = "README.md"
 requires-python = ">=3.10"
 license = {text = "MIT"}

@@ -16,7 +16,7 @@ dependencies = ["pydantic>=2.7", "pyyaml>=6.0"]
 dev = ["pytest>=8.0", "ruff>=0.6"]
 
 [project.scripts]
-didactopus-course-ingest = "didactopus.main:main"
+didactopus-topic-ingest = "didactopus.main:main"
 
 [tool.setuptools.packages.find]
 where = ["src"]
@@ -3,6 +3,15 @@ from pydantic import BaseModel, Field
 import yaml
 
 
+class DocumentAdaptersConfig(BaseModel):
+    allow_pdf: bool = True
+    allow_docx: bool = True
+    allow_pptx: bool = True
+    allow_html: bool = True
+    allow_markdown: bool = True
+    allow_text: bool = True
+
+
 class CourseIngestConfig(BaseModel):
     default_pack_author: str = "Unknown"
     default_license: str = "REVIEW-REQUIRED"
@@ -10,23 +19,17 @@ class CourseIngestConfig(BaseModel):
    max_terms_per_lesson: int = 8


class RulePolicyConfig(BaseModel):
    enable_prerequisite_order_rule: bool = True
    enable_duplicate_term_merge_rule: bool = True
    enable_project_detection_rule: bool = True
    enable_review_flags: bool = True


class MultisourceConfig(BaseModel):
    detect_duplicate_lessons: bool = True


class CrossCourseConfig(BaseModel):
    detect_title_overlaps: bool = True
    detect_term_conflicts: bool = True
    detect_order_conflicts: bool = True
    merge_same_named_lessons: bool = True


class AppConfig(BaseModel):
    document_adapters: DocumentAdaptersConfig = Field(default_factory=DocumentAdaptersConfig)
    course_ingest: CourseIngestConfig = Field(default_factory=CourseIngestConfig)
    rule_policy: RulePolicyConfig = Field(default_factory=RulePolicyConfig)
    multisource: MultisourceConfig = Field(default_factory=MultisourceConfig)
    cross_course: CrossCourseConfig = Field(default_factory=CrossCourseConfig)


def load_config(path: str | Path) -> AppConfig:

@@ -1,8 +1,21 @@
 from __future__ import annotations
 
 from pydantic import BaseModel, Field
 
 
+class Section(BaseModel):
+    heading: str
+    body: str = ""
+
+
+class NormalizedDocument(BaseModel):
+    source_path: str
+    source_type: str
+    title: str = ""
+    text: str = ""
+    sections: list[Section] = Field(default_factory=list)
+    metadata: dict = Field(default_factory=dict)
+
+
 class Lesson(BaseModel):
     title: str
     body: str = ""
@@ -17,21 +30,18 @@ class Module(BaseModel):
     lessons: list[Lesson] = Field(default_factory=list)
 
 
-class NormalizedSourceRecord(BaseModel):
-    source_name: str
-    source_type: str
-    source_path: str
-    title: str = ""
-    modules: list[Module] = Field(default_factory=list)
-
-
 class NormalizedCourse(BaseModel):
     title: str
     source_name: str = ""
     source_url: str = ""
     rights_note: str = ""
     modules: list[Module] = Field(default_factory=list)
-    source_records: list[NormalizedSourceRecord] = Field(default_factory=list)
+    source_records: list[NormalizedDocument] = Field(default_factory=list)
 
 
+class TopicBundle(BaseModel):
+    topic_title: str
+    courses: list[NormalizedCourse] = Field(default_factory=list)
+
+
 class ConceptCandidate(BaseModel):
@@ -40,6 +50,7 @@ class ConceptCandidate(BaseModel):
     description: str = ""
     source_modules: list[str] = Field(default_factory=list)
     source_lessons: list[str] = Field(default_factory=list)
+    source_courses: list[str] = Field(default_factory=list)
     prerequisites: list[str] = Field(default_factory=list)
     mastery_signals: list[str] = Field(default_factory=list)

@@ -0,0 +1,50 @@

from __future__ import annotations

from collections import defaultdict
from .course_schema import NormalizedCourse, ConceptCandidate


def detect_title_overlaps(course: NormalizedCourse) -> list[str]:
    lesson_to_sources = defaultdict(set)
    for module in course.modules:
        for lesson in module.lessons:
            for src in lesson.source_refs:
                lesson_to_sources[lesson.title.lower()].add(src)
    flags = []
    for title, sources in lesson_to_sources.items():
        if len(sources) > 1:
            flags.append(f"Lesson title '{title}' appears across multiple sources: {', '.join(sorted(sources))}")
    return flags


def detect_term_conflicts(course: NormalizedCourse) -> list[str]:
    term_to_lessons = defaultdict(set)
    for module in course.modules:
        for lesson in module.lessons:
            for term in lesson.key_terms:
                term_to_lessons[term.lower()].add(lesson.title)
    flags = []
    for term, lessons in term_to_lessons.items():
        if len(lessons) > 1:
            flags.append(f"Key term '{term}' appears in multiple lesson contexts: {', '.join(sorted(lessons))}")
    return flags


def detect_order_conflicts(course: NormalizedCourse) -> list[str]:
    # Placeholder heuristic: if same lesson title appears in multiple source_refs, flag for order review.
    flags = []
    for module in course.modules:
        for lesson in module.lessons:
            if len(set(lesson.source_refs)) > 1:
                flags.append(f"Lesson '{lesson.title}' was merged from multiple sources; review ordering assumptions.")
    return flags


def detect_thin_concepts(concepts: list[ConceptCandidate]) -> list[str]:
    flags = []
    for concept in concepts:
        if len(concept.description.strip()) < 20:
            flags.append(f"Concept '{concept.title}' has a very thin description.")
        if not concept.mastery_signals:
            flags.append(f"Concept '{concept.title}' has no extracted mastery signals.")
    return flags
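The thin-concept heuristic above is easy to check in isolation. Here it is re-implemented standalone on plain dicts (the real function takes `ConceptCandidate` models), purely for demonstration:

```python
# Standalone re-implementation of the thin-concept heuristic for illustration;
# the repo version operates on ConceptCandidate pydantic models instead of dicts.

def flag_thin(concepts: list[dict]) -> list[str]:
    flags = []
    for c in concepts:
        # Descriptions shorter than 20 characters are considered "thin".
        if len(c["description"].strip()) < 20:
            flags.append(f"Concept '{c['title']}' has a very thin description.")
        # A concept with no mastery signals cannot drive checkpoints.
        if not c["mastery_signals"]:
            flags.append(f"Concept '{c['title']}' has no extracted mastery signals.")
    return flags
```

A concept can trip both checks at once, so one weak concept may contribute two conflict-report lines.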
@@ -0,0 +1,141 @@

from __future__ import annotations

from pathlib import Path
import re

from .course_schema import NormalizedDocument, Section


def _title_from_path(path: str | Path) -> str:
    p = Path(path)
    return p.stem.replace("_", " ").replace("-", " ").title()


def _simple_section_split(text: str) -> list[Section]:
    sections = []
    current_heading = "Main"
    current_lines = []
    for line in text.splitlines():
        if re.match(r"^(#{1,3})\s+", line):
            if current_lines:
                sections.append(Section(heading=current_heading, body="\n".join(current_lines).strip()))
            current_heading = re.sub(r"^(#{1,3})\s+", "", line).strip()
            current_lines = []
        else:
            current_lines.append(line)
    if current_lines:
        sections.append(Section(heading=current_heading, body="\n".join(current_lines).strip()))
    return sections


def read_textish(path: str | Path) -> str:
    return Path(path).read_text(encoding="utf-8")


def adapt_markdown(path: str | Path) -> NormalizedDocument:
    text = read_textish(path)
    return NormalizedDocument(
        source_path=str(path),
        source_type="markdown",
        title=_title_from_path(path),
        text=text,
        sections=_simple_section_split(text),
        metadata={},
    )


def adapt_text(path: str | Path) -> NormalizedDocument:
    text = read_textish(path)
    return NormalizedDocument(
        source_path=str(path),
        source_type="text",
        title=_title_from_path(path),
        text=text,
        sections=_simple_section_split(text),
        metadata={},
    )


def adapt_html(path: str | Path) -> NormalizedDocument:
    raw = read_textish(path)
    text = re.sub(r"<[^>]+>", " ", raw)
    text = re.sub(r"\s+", " ", text).strip()
    return NormalizedDocument(
        source_path=str(path),
        source_type="html",
        title=_title_from_path(path),
        text=text,
        sections=[Section(heading="HTML Extract", body=text)],
        metadata={"extraction": "stub-html-strip"},
    )


def adapt_pdf(path: str | Path) -> NormalizedDocument:
    # Stub: in a real implementation, plug in PDF text extraction here.
    text = read_textish(path)
    return NormalizedDocument(
        source_path=str(path),
        source_type="pdf",
        title=_title_from_path(path),
        text=text,
        sections=_simple_section_split(text),
        metadata={"extraction": "stub-pdf-text"},
    )


def adapt_docx(path: str | Path) -> NormalizedDocument:
    # Stub: in a real implementation, plug in DOCX extraction here.
    text = read_textish(path)
    return NormalizedDocument(
        source_path=str(path),
        source_type="docx",
        title=_title_from_path(path),
        text=text,
        sections=_simple_section_split(text),
        metadata={"extraction": "stub-docx-text"},
    )


def adapt_pptx(path: str | Path) -> NormalizedDocument:
    # Stub: in a real implementation, plug in PPTX extraction here.
    text = read_textish(path)
    return NormalizedDocument(
        source_path=str(path),
        source_type="pptx",
        title=_title_from_path(path),
        text=text,
        sections=_simple_section_split(text),
        metadata={"extraction": "stub-pptx-text"},
    )


def detect_adapter(path: str | Path) -> str:
    p = Path(path)
    suffix = p.suffix.lower()
    if suffix == ".md":
        return "markdown"
    if suffix in {".txt"}:
        return "text"
    if suffix in {".html", ".htm"}:
        return "html"
    if suffix == ".pdf":
        return "pdf"
    if suffix == ".docx":
        return "docx"
    if suffix == ".pptx":
        return "pptx"
    return "text"


def adapt_document(path: str | Path) -> NormalizedDocument:
    adapter = detect_adapter(path)
    if adapter == "markdown":
        return adapt_markdown(path)
    if adapter == "html":
        return adapt_html(path)
    if adapter == "pdf":
        return adapt_pdf(path)
    if adapter == "docx":
        return adapt_docx(path)
    if adapter == "pptx":
        return adapt_pptx(path)
    return adapt_text(path)
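The heading-split heuristic in `_simple_section_split` above is worth a standalone check. Here is the same logic re-implemented without the `Section` model (tuples instead of pydantic objects), purely for demonstration:

```python
import re

# Re-implementation of the _simple_section_split heuristic from the diff above,
# returning (heading, body) tuples so its behavior is easy to verify in isolation.
def split_sections(text: str) -> list[tuple[str, str]]:
    sections, heading, lines = [], "Main", []
    for line in text.splitlines():
        if re.match(r"^(#{1,3})\s+", line):
            # Flush the accumulated body before starting a new section.
            if lines:
                sections.append((heading, "\n".join(lines).strip()))
            heading = re.sub(r"^(#{1,3})\s+", "", line).strip()
            lines = []
        else:
            lines.append(line)
    if lines:
        sections.append((heading, "\n".join(lines).strip()))
    return sections
```

Note one consequence of the `if lines:` guard: a heading immediately followed by another heading produces no section for the first, which is part of why this is labeled a conservative stub.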
@@ -4,18 +4,19 @@ import argparse
 from pathlib import Path
 
 from .config import load_config
-from .course_ingest import parse_source_file, merge_source_records, extract_concept_candidates
+from .document_adapters import adapt_document
+from .topic_ingest import document_to_course, build_topic_bundle, merge_courses_into_topic_course, extract_concept_candidates
+from .cross_course_conflicts import detect_title_overlaps, detect_term_conflicts, detect_order_conflicts, detect_thin_concepts
 from .rule_policy import RuleContext, build_default_rules, run_rules
-from .conflict_report import detect_duplicate_lessons, detect_term_conflicts, detect_thin_concepts
 from .pack_emitter import build_draft_pack, write_draft_pack
 
 
 def build_parser() -> argparse.ArgumentParser:
-    parser = argparse.ArgumentParser(description="Didactopus multi-source course-to-pack ingestion pipeline")
-    parser.add_argument("--inputs", nargs="+", required=True, help="Input source files")
-    parser.add_argument("--title", required=True, help="Course or topic title")
+    parser = argparse.ArgumentParser(description="Didactopus document-adapter and cross-course topic ingestion")
+    parser.add_argument("--inputs", nargs="+", required=True, help="Document inputs")
+    parser.add_argument("--title", required=True, help="Topic title")
     parser.add_argument("--rights-note", default="REVIEW REQUIRED")
-    parser.add_argument("--output-dir", default="generated-pack")
+    parser.add_argument("--output-dir", default="generated-topic-pack")
     parser.add_argument("--config", default="configs/config.example.yaml")
     return parser
@@ -24,33 +25,30 @@ def main() -> None:
     args = build_parser().parse_args()
     config = load_config(args.config)
 
-    records = [parse_source_file(path, title=args.title) for path in args.inputs]
-    course = merge_source_records(
-        records=records,
-        course_title=args.title,
-        rights_note=args.rights_note,
-        merge_same_named_lessons=config.multisource.merge_same_named_lessons,
+    docs = [adapt_document(path) for path in args.inputs]
+    courses = [document_to_course(doc, course_title=args.title) for doc in docs]
+    topic = build_topic_bundle(args.title, courses)
+    merged_course = merge_courses_into_topic_course(
+        topic_bundle=topic,
+        merge_same_named_lessons=config.cross_course.merge_same_named_lessons,
     )
-    concepts = extract_concept_candidates(course)
-    context = RuleContext(course=course, concepts=concepts)
+    concepts = extract_concept_candidates(merged_course)
 
-    rules = build_default_rules(
-        enable_prereq=config.rule_policy.enable_prerequisite_order_rule,
-        enable_merge=config.rule_policy.enable_duplicate_term_merge_rule,
-        enable_projects=config.rule_policy.enable_project_detection_rule,
-        enable_review=config.rule_policy.enable_review_flags,
-    )
+    context = RuleContext(course=merged_course, concepts=concepts)
+    rules = build_default_rules()
     run_rules(context, rules)
 
     conflicts = []
-    if config.multisource.detect_duplicate_lessons:
-        conflicts.extend(detect_duplicate_lessons(course))
-    if config.multisource.detect_term_conflicts:
-        conflicts.extend(detect_term_conflicts(course))
+    if config.cross_course.detect_title_overlaps:
+        conflicts.extend(detect_title_overlaps(merged_course))
+    if config.cross_course.detect_term_conflicts:
+        conflicts.extend(detect_term_conflicts(merged_course))
+    if config.cross_course.detect_order_conflicts:
+        conflicts.extend(detect_order_conflicts(merged_course))
     conflicts.extend(detect_thin_concepts(context.concepts))
 
     draft = build_draft_pack(
-        course=course,
+        course=merged_course,
         concepts=context.concepts,
         author=config.course_ingest.default_pack_author,
         license_name=config.course_ingest.default_license,
@@ -59,10 +57,11 @@ def main() -> None:
     )
     write_draft_pack(draft, args.output_dir)
 
-    print("== Didactopus Multi-Source Course Ingest ==")
-    print(f"Course: {course.title}")
-    print(f"Sources: {len(records)}")
-    print(f"Modules: {len(course.modules)}")
+    print("== Didactopus Cross-Course Topic Ingest ==")
+    print(f"Topic: {args.title}")
+    print(f"Documents: {len(docs)}")
+    print(f"Courses: {len(courses)}")
+    print(f"Merged modules: {len(merged_course.modules)}")
     print(f"Concept candidates: {len(context.concepts)}")
     print(f"Review flags: {len(context.review_flags)}")
     print(f"Conflicts: {len(conflicts)}")
@@ -15,7 +15,7 @@ def build_draft_pack(course: NormalizedCourse, concepts: list[ConceptCandidate],
         "schema_version": "1",
         "didactopus_min_version": "0.1.0",
         "didactopus_max_version": "0.9.99",
-        "description": f"Draft pack generated from multi-source course inputs for '{course.title}'.",
+        "description": f"Draft topic pack generated from multi-course inputs for '{course.title}'.",
         "author": author,
         "license": license_name,
         "dependencies": [],
@@ -64,7 +64,7 @@ def build_draft_pack(course: NormalizedCourse, concepts: list[ConceptCandidate],
     attribution = {
         "rights_note": course.rights_note,
         "sources": [
-            {"source_name": src.source_name, "source_type": src.source_type, "source_path": src.source_path}
+            {"source_path": src.source_path, "source_type": src.source_type, "title": src.title}
             for src in course.source_records
         ],
     }
@@ -88,11 +88,8 @@ def write_draft_pack(pack: DraftPack, outdir: str | Path) -> None:
    (out / "roadmap.yaml").write_text(yaml.safe_dump(pack.roadmap, sort_keys=False), encoding="utf-8")
    (out / "projects.yaml").write_text(yaml.safe_dump(pack.projects, sort_keys=False), encoding="utf-8")
    (out / "rubrics.yaml").write_text(yaml.safe_dump(pack.rubrics, sort_keys=False), encoding="utf-8")

    review_lines = ["# Review Report", ""] + [f"- {flag}" for flag in pack.review_report] if pack.review_report else ["# Review Report", "", "- none"]
    (out / "review_report.md").write_text("\n".join(review_lines), encoding="utf-8")

    conflict_lines = ["# Conflict Report", ""] + [f"- {flag}" for flag in pack.conflicts] if pack.conflicts else ["# Conflict Report", "", "- none"]
    (out / "conflict_report.md").write_text("\n".join(conflict_lines), encoding="utf-8")

    (out / "license_attribution.json").write_text(json.dumps(pack.attribution, indent=2), encoding="utf-8")
@@ -39,6 +39,7 @@ def duplicate_term_merge_rule(context: RuleContext) -> None:
         if key in seen:
             seen[key].source_modules.extend(x for x in concept.source_modules if x not in seen[key].source_modules)
             seen[key].source_lessons.extend(x for x in concept.source_lessons if x not in seen[key].source_lessons)
+            seen[key].source_courses.extend(x for x in concept.source_courses if x not in seen[key].source_courses)
             if concept.description and len(seen[key].description) < len(concept.description):
                 seen[key].description = concept.description
         else:
@@ -0,0 +1,126 @@
+from __future__ import annotations
+
+import re
+from collections import defaultdict
+from .course_schema import NormalizedDocument, NormalizedCourse, Module, Lesson, TopicBundle, ConceptCandidate
+
+
+def slugify(text: str) -> str:
+    cleaned = re.sub(r"[^a-zA-Z0-9]+", "-", text.strip().lower()).strip("-")
+    return cleaned or "untitled"
+
+
+def extract_key_terms(text: str, min_term_length: int = 4, max_terms: int = 8) -> list[str]:
+    candidates = re.findall(r"\b[A-Z][A-Za-z0-9\-]{%d,}\b" % (min_term_length - 1), text)
+    seen = set()
+    out = []
+    for term in candidates:
+        if term not in seen:
+            seen.add(term)
+            out.append(term)
+        if len(out) >= max_terms:
+            break
+    return out
+
+
+def document_to_course(doc: NormalizedDocument, course_title: str) -> NormalizedCourse:
+    # Conservative mapping: each section becomes a lesson; all lessons go into one module.
+    lessons = []
+    for section in doc.sections:
+        body = section.body.strip()
+        lines = body.splitlines()
+        objectives = []
+        exercises = []
+        for line in lines:
+            low = line.lower().strip()
+            if low.startswith("objective:"):
+                objectives.append(line.split(":", 1)[1].strip())
+            if low.startswith("exercise:"):
+                exercises.append(line.split(":", 1)[1].strip())
+        lessons.append(
+            Lesson(
+                title=section.heading.strip() or "Untitled Lesson",
+                body=body,
+                objectives=objectives,
+                exercises=exercises,
+                key_terms=extract_key_terms(section.heading + "\n" + body),
+                source_refs=[doc.source_path],
+            )
+        )
+    module = Module(title=f"Imported from {doc.source_type.upper()}", lessons=lessons)
+    return NormalizedCourse(title=course_title, modules=[module], source_records=[doc])
+
+
+def build_topic_bundle(topic_title: str, courses: list[NormalizedCourse]) -> TopicBundle:
+    return TopicBundle(topic_title=topic_title, courses=courses)
+
+
+def merge_courses_into_topic_course(topic_bundle: TopicBundle, merge_same_named_lessons: bool = True) -> NormalizedCourse:
+    modules_by_title: dict[str, Module] = {}
+    source_records = []
+    for course in topic_bundle.courses:
+        source_records.extend(course.source_records)
+        for module in course.modules:
+            target_module = modules_by_title.setdefault(module.title, Module(title=module.title, lessons=[]))
+            if merge_same_named_lessons:
+                lesson_map = {lesson.title: lesson for lesson in target_module.lessons}
+                for lesson in module.lessons:
+                    if lesson.title in lesson_map:
+                        existing = lesson_map[lesson.title]
+                        if lesson.body and lesson.body not in existing.body:
+                            existing.body = (existing.body + "\n\n" + lesson.body).strip()
+                        for x in lesson.objectives:
+                            if x not in existing.objectives:
+                                existing.objectives.append(x)
+                        for x in lesson.exercises:
+                            if x not in existing.exercises:
+                                existing.exercises.append(x)
+                        for x in lesson.key_terms:
+                            if x not in existing.key_terms:
+                                existing.key_terms.append(x)
+                        for x in lesson.source_refs:
+                            if x not in existing.source_refs:
+                                existing.source_refs.append(x)
+                    else:
+                        target_module.lessons.append(lesson)
+            else:
+                target_module.lessons.extend(module.lessons)
+    return NormalizedCourse(title=topic_bundle.topic_title, modules=list(modules_by_title.values()), source_records=source_records)
+
+
+def extract_concept_candidates(course: NormalizedCourse) -> list[ConceptCandidate]:
+    concepts = []
+    seen_ids = set()
+    for module in course.modules:
+        for lesson in module.lessons:
+            cid = slugify(lesson.title)
+            if cid not in seen_ids:
+                seen_ids.add(cid)
+                concepts.append(
+                    ConceptCandidate(
+                        id=cid,
+                        title=lesson.title,
+                        description=lesson.body[:240].strip(),
+                        source_modules=[module.title],
+                        source_lessons=[lesson.title],
+                        source_courses=list(lesson.source_refs),
+                        mastery_signals=list(lesson.objectives[:3] or lesson.exercises[:2]),
+                    )
+                )
+            for term in lesson.key_terms:
+                tid = slugify(term)
+                if tid in seen_ids:
+                    continue
+                seen_ids.add(tid)
+                concepts.append(
+                    ConceptCandidate(
+                        id=tid,
+                        title=term,
+                        description=f"Candidate concept extracted from lesson '{lesson.title}'.",
+                        source_modules=[module.title],
+                        source_lessons=[lesson.title],
+                        source_courses=list(lesson.source_refs),
+                        mastery_signals=list(lesson.objectives[:2]),
+                    )
+                )
+    return concepts

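The two helpers at the top of this file depend only on `re`, so their behavior can be checked directly. A minimal sketch reproducing `slugify` and `extract_key_terms` as defined above, with example calls:

```python
import re

def slugify(text: str) -> str:
    # Collapse runs of non-alphanumerics into single hyphens.
    cleaned = re.sub(r"[^a-zA-Z0-9]+", "-", text.strip().lower()).strip("-")
    return cleaned or "untitled"

def extract_key_terms(text: str, min_term_length: int = 4, max_terms: int = 8) -> list:
    # Capitalized words of at least min_term_length characters, deduplicated in order.
    candidates = re.findall(r"\b[A-Z][A-Za-z0-9\-]{%d,}\b" % (min_term_length - 1), text)
    seen, out = set(), []
    for term in candidates:
        if term not in seen:
            seen.add(term)
            out.append(term)
        if len(out) >= max_terms:
            break
    return out

print(slugify("Bayesian Updating!"))                    # bayesian-updating
print(extract_key_terms("Prior and Posterior meet Bayes."))  # ['Prior', 'Posterior', 'Bayes']
```

Note the capitalization heuristic: lowercase words such as "and" and "meet" never become key terms, which keeps the candidate list short but also misses uncapitalized jargon — consistent with the scaffold-level stance stated in the README.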
@@ -0,0 +1,19 @@
+from pathlib import Path
+from didactopus.document_adapters import adapt_document
+from didactopus.topic_ingest import document_to_course, build_topic_bundle, merge_courses_into_topic_course, extract_concept_candidates
+from didactopus.cross_course_conflicts import detect_title_overlaps, detect_term_conflicts, detect_order_conflicts, detect_thin_concepts
+
+
+def test_conflict_detection(tmp_path: Path) -> None:
+    a = tmp_path / "a.md"
+    b = tmp_path / "b.md"
+    a.write_text("# T\n\n## M1\n### Bayesian Updating\nPrior and Posterior appear here.", encoding="utf-8")
+    b.write_text("# T\n\n## M2\n### Bayesian Updating\nPrior and Posterior appear again.", encoding="utf-8")
+    docs = [adapt_document(a), adapt_document(b)]
+    courses = [document_to_course(doc, "Topic") for doc in docs]
+    merged = merge_courses_into_topic_course(build_topic_bundle("Topic", courses), merge_same_named_lessons=False)
+    concepts = extract_concept_candidates(merged)
+    assert isinstance(detect_title_overlaps(merged), list)
+    assert isinstance(detect_term_conflicts(merged), list)
+    assert isinstance(detect_order_conflicts(merged), list)
+    assert isinstance(detect_thin_concepts(concepts), list)

@@ -0,0 +1,18 @@
+from pathlib import Path
+from didactopus.document_adapters import adapt_document, detect_adapter
+
+
+def test_detect_adapter() -> None:
+    assert detect_adapter("a.md") == "markdown"
+    assert detect_adapter("b.html") == "html"
+    assert detect_adapter("c.pdf") == "pdf"
+    assert detect_adapter("d.docx") == "docx"
+    assert detect_adapter("e.pptx") == "pptx"
+
+
+def test_adapt_markdown(tmp_path: Path) -> None:
+    p = tmp_path / "x.md"
+    p.write_text("# T\n\n## A\nBody", encoding="utf-8")
+    doc = adapt_document(p)
+    assert doc.source_type == "markdown"
+    assert len(doc.sections) >= 1

@@ -1,17 +1,20 @@
 from pathlib import Path
-from didactopus.course_ingest import parse_source_file, merge_source_records, extract_concept_candidates
+from didactopus.document_adapters import adapt_document
+from didactopus.topic_ingest import document_to_course, build_topic_bundle, merge_courses_into_topic_course, extract_concept_candidates
 from didactopus.rule_policy import RuleContext, build_default_rules, run_rules
 from didactopus.pack_emitter import build_draft_pack, write_draft_pack
 
 
-def test_emit_multisource_pack(tmp_path: Path) -> None:
+def test_emit_topic_pack(tmp_path: Path) -> None:
     src = tmp_path / "course.md"
-    src.write_text("# C\n\n## M1\n### Lesson A\n- Objective: Explain Topic A.\n- Exercise: Do task A.\nTopic A body.", encoding="utf-8")
-    course = merge_source_records([parse_source_file(src, title="Course")], course_title="Course")
-    concepts = extract_concept_candidates(course)
-    ctx = RuleContext(course=course, concepts=concepts)
+    src.write_text("# T\n\n## M\n### L\nExercise: Do task A.\nTopic A body.", encoding="utf-8")
+    doc = adapt_document(src)
+    course = document_to_course(doc, "Topic")
+    merged = merge_courses_into_topic_course(build_topic_bundle("Topic", [course]))
+    concepts = extract_concept_candidates(merged)
+    ctx = RuleContext(course=merged, concepts=concepts)
     run_rules(ctx, build_default_rules())
-    draft = build_draft_pack(course, ctx.concepts, "Tester", "REVIEW", ctx.review_flags, [])
+    draft = build_draft_pack(merged, ctx.concepts, "Tester", "REVIEW", ctx.review_flags, [])
     write_draft_pack(draft, tmp_path / "out")
     assert (tmp_path / "out" / "pack.yaml").exists()
     assert (tmp_path / "out" / "conflict_report.md").exists()

@@ -0,0 +1,26 @@
+from pathlib import Path
+from didactopus.document_adapters import adapt_document
+from didactopus.topic_ingest import document_to_course, build_topic_bundle, merge_courses_into_topic_course, extract_concept_candidates
+
+
+def test_cross_course_merge(tmp_path: Path) -> None:
+    a = tmp_path / "a.md"
+    b = tmp_path / "b.docx"
+    a.write_text("# T\n\n## M\n### L1\nBody A", encoding="utf-8")
+    b.write_text("# T\n\n## M\n### L1\nBody B", encoding="utf-8")
+
+    docs = [adapt_document(a), adapt_document(b)]
+    courses = [document_to_course(doc, "Topic") for doc in docs]
+    topic = build_topic_bundle("Topic", courses)
+    merged = merge_courses_into_topic_course(topic)
+    assert len(merged.modules) >= 1
+    assert len(merged.modules[0].lessons) == 1
+
+
+def test_extract_concepts(tmp_path: Path) -> None:
+    a = tmp_path / "a.md"
+    a.write_text("# T\n\n## M\n### Lesson A\nObjective: Explain Topic A.\nBody.", encoding="utf-8")
+    doc = adapt_document(a)
+    course = document_to_course(doc, "Topic")
+    concepts = extract_concept_candidates(course)
+    assert len(concepts) >= 1
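These tests lean on the `Objective:` / `Exercise:` prefix scan inside `document_to_course`. A standalone sketch of just that scan, assuming (as the implementation does) that matching is case-insensitive and everything after the first colon is the payload:

```python
# Hypothetical excerpt mirroring the line scan in document_to_course.
body = "Objective: Explain Topic A.\nExercise: Do task A.\nTopic A body."
objectives, exercises = [], []
for line in body.splitlines():
    low = line.lower().strip()
    if low.startswith("objective:"):
        # split(":", 1) keeps any later colons inside the payload intact.
        objectives.append(line.split(":", 1)[1].strip())
    if low.startswith("exercise:"):
        exercises.append(line.split(":", 1)[1].strip())

print(objectives)  # ['Explain Topic A.']
print(exercises)   # ['Do task A.']
```

Plain narrative lines such as "Topic A body." match neither prefix and stay in the lesson body only.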