Added cross-course merger.

This commit is contained in:
parent 8defaab1c2
commit 0656f7bbe8

36	README.md
@@ -8,6 +8,41 @@
 ## Recent revisions
 
+### Course-to-course merger
+
+This revision adds two major capabilities:
+
+- **real document adapter scaffolds** for PDF, DOCX, PPTX, and HTML
+- a **cross-course merger** for combining multiple course-derived packs into one stronger domain draft
+
+These additions extend the earlier multi-source ingestion layer from "multiple files for one course"
+to "multiple courses or course-like sources for one topic domain."
+
+## What is included
+
+- adapter registry for:
+  - PDF
+  - DOCX
+  - PPTX
+  - HTML
+  - Markdown
+  - text
+- normalized document extraction interface
+- course bundle ingestion across multiple source documents
+- cross-course terminology and overlap analysis
+- merged topic-pack emitter
+- cross-course conflict report
+- example source files and example merged output
+
+## Design stance
+
+This is still scaffold-level extraction. The purpose is to define stable interfaces and emitted artifacts,
+not to claim perfect semantic parsing of every teaching document.
+
+The implementation is designed so stronger parsers can later replace the stub extractors without changing
+the surrounding pipeline.
+
 ### Multi-Source Course Ingestion
 
 This revision adds a **Multi-Source Course Ingestion Layer**.
@@ -216,3 +251,4 @@ didactopus/
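The README's design stance promises that stronger parsers can later replace the stub extractors without changing the surrounding pipeline. A minimal sketch of that seam — the names `register` and `extract` are illustrative only, not the project's actual registry API:

```python
from typing import Callable

# Hypothetical registry mapping file suffixes to extractor callables.
# Swapping in a stronger parser is a one-line re-registration; the
# surrounding pipeline only ever calls extract().
_EXTRACTORS: dict[str, Callable[[str], str]] = {}


def register(suffix: str, fn: Callable[[str], str]) -> None:
    _EXTRACTORS[suffix.lower()] = fn


def extract(path: str, raw: str) -> str:
    suffix = "." + path.rsplit(".", 1)[-1].lower()
    # Fall back to treating unknown formats as plain text.
    fn = _EXTRACTORS.get(suffix, lambda text: text)
    return fn(raw)


# Stub extractor: pass text through unchanged.
register(".pdf", lambda text: text)

# Later, a real PDF parser can replace the stub without touching extract().
register(".pdf", lambda text: text.replace("\x0c", "\n"))
```

Because callers depend only on `extract()`, upgrading an adapter never ripples into the ingestion or merging stages.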
@ -1,16 +1,19 @@
|
||||||
|
document_adapters:
|
||||||
|
allow_pdf: true
|
||||||
|
allow_docx: true
|
||||||
|
allow_pptx: true
|
||||||
|
allow_html: true
|
||||||
|
allow_markdown: true
|
||||||
|
allow_text: true
|
||||||
|
|
||||||
course_ingest:
|
course_ingest:
|
||||||
default_pack_author: "Wesley R. Elsberry"
|
default_pack_author: "Wesley R. Elsberry"
|
||||||
default_license: "REVIEW-REQUIRED"
|
default_license: "REVIEW-REQUIRED"
|
||||||
min_term_length: 4
|
min_term_length: 4
|
||||||
max_terms_per_lesson: 8
|
max_terms_per_lesson: 8
|
||||||
|
|
||||||
rule_policy:
|
cross_course:
|
||||||
enable_prerequisite_order_rule: true
|
detect_title_overlaps: true
|
||||||
enable_duplicate_term_merge_rule: true
|
|
||||||
enable_project_detection_rule: true
|
|
||||||
enable_review_flags: true
|
|
||||||
|
|
||||||
multisource:
|
|
||||||
detect_duplicate_lessons: true
|
|
||||||
detect_term_conflicts: true
|
detect_term_conflicts: true
|
||||||
|
detect_order_conflicts: true
|
||||||
merge_same_named_lessons: true
|
merge_same_named_lessons: true
@@ -0,0 +1,31 @@
+# Cross-Course Merger
+
+The cross-course merger combines multiple course-like inputs covering the same subject area.
+
+## Goal
+
+Build a stronger draft topic pack from several partially overlapping sources.
+
+## What it does
+
+- merges normalized source records into course bundles
+- merges course bundles into one topic bundle
+- compares repeated concepts across courses
+- flags terminology conflicts and overlap
+- emits a merged draft pack
+- emits a cross-course conflict report
+
+## Why this matters
+
+No single course is usually ideal for mastery-oriented domain construction.
+Combining multiple sources can improve:
+- concept coverage
+- exercise diversity
+- project identification
+- terminology mapping
+- prerequisite robustness
+
+## Important caveat
+
+This merger is draft-oriented.
+Human review remains necessary before trusting the result as a final domain pack.
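The merge steps listed under "What it does" can be sketched on plain dicts — a hypothetical simplification (the real merger operates on pydantic course bundles, and `merge_courses` here is an illustrative name):

```python
def merge_courses(courses: list[list[dict]]) -> list[dict]:
    """Merge lessons with the same (case-insensitive) title across courses.

    Each course is a list of lessons: {"title": str, "body": str, "source": str}.
    Sketch of the merger's same-named-lesson step only.
    """
    merged: dict[str, dict] = {}
    order: list[str] = []
    for lessons in courses:
        for lesson in lessons:
            key = lesson["title"].lower()
            if key not in merged:
                # First sighting: keep the lesson and remember insertion order.
                merged[key] = {
                    "title": lesson["title"],
                    "body": lesson["body"],
                    "sources": [lesson["source"]],
                }
                order.append(key)
            else:
                # Repeat sighting: append the body and record the extra source
                # so a conflict report can flag the merge for human review.
                merged[key]["body"] += "\n\n" + lesson["body"]
                merged[key]["sources"].append(lesson["source"])
    return [merged[k] for k in order]
```

Lessons carrying more than one entry in `sources` are exactly the ones the conflict report asks a human to review.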
@@ -0,0 +1,42 @@
+# Document Adapters
+
+Didactopus now includes adapter scaffolds for several common educational document types.
+
+## Supported adapter interfaces
+
+- PDF adapter
+- DOCX adapter
+- PPTX adapter
+- HTML adapter
+- Markdown adapter
+- text adapter
+
+## Current status
+
+The current implementation is intentionally conservative:
+- it focuses on stable interfaces
+- it extracts text in a simplified way
+- it normalizes results into shared internal structures
+
+## Why this matters
+
+Educational material commonly lives in:
+- syllabi PDFs
+- DOCX notes
+- PowerPoint slide decks
+- LMS HTML exports
+- markdown lesson files
+
+A useful curriculum distiller must be able to treat these as first-class inputs.
+
+## Adapter contract
+
+Each adapter returns a normalized document record with:
+- source path
+- source type
+- title
+- extracted text
+- sections
+- metadata
+
+This record is then passed into higher-level course/topic distillation logic.
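The adapter contract above can be sketched as a single function returning a dict with those six fields — a plain-dict stand-in for the project's pydantic `NormalizedDocument`; the title heuristic mirrors the scaffold's `_title_from_path`:

```python
import re


def normalize_markdown(source_path: str, text: str) -> dict:
    """Build a normalized document record (dict stand-in) from markdown text."""
    # Split on ATX headings, keeping body text grouped under its heading.
    sections = []
    heading, lines = "Main", []
    for line in text.splitlines():
        m = re.match(r"^#{1,3}\s+(.*)", line)
        if m:
            if lines:
                sections.append({"heading": heading, "body": "\n".join(lines).strip()})
            heading, lines = m.group(1).strip(), []
        else:
            lines.append(line)
    if lines:
        sections.append({"heading": heading, "body": "\n".join(lines).strip()})
    # Title falls back to a prettified file stem, as in the scaffold.
    stem = source_path.rsplit("/", 1)[-1].rsplit(".", 1)[0]
    return {
        "source_path": source_path,
        "source_type": "markdown",
        "title": stem.replace("_", " ").replace("-", " ").title(),
        "text": text,
        "sections": sections,
        "metadata": {},
    }
```

Anything downstream that consumes these six fields is insulated from how a given format was parsed.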
34	docs/faq.md
@@ -1,27 +1,25 @@
 # FAQ
 
-## Why multi-source ingestion?
+## Why add document adapters now?
 
-Because course structure is usually distributed across several files rather than
-perfectly contained in one source.
+Because real educational material is rarely provided in only one plain-text format.
 
-## What kinds of conflicts can arise?
+## Are these full-fidelity parsers?
 
-Common examples:
-- the same lesson with slightly different titles
-- inconsistent terminology across notes and transcripts
-- exercises present in one source but absent in another
-- project prompts implied in one file and explicit in another
+Not yet. The current implementation is a stable scaffold for extraction and normalization.
 
-## Does the system resolve all conflicts automatically?
+## Why add cross-course merging?
 
-No. It produces a merged draft pack and a conflict report for human review.
+Because one course often under-specifies a domain, while multiple sources together can produce a better draft pack.
 
-## Why not rely only on embeddings for this?
+## Does the merger resolve every concept conflict automatically?
 
-Because Didactopus needs explicit structures such as:
-- concepts
-- prerequisites
-- projects
-- rubrics
-- checkpoints
+No. It produces a merged draft plus a conflict report for human review.
+
+## What kinds of issues are flagged?
+
+Examples:
+- repeated concepts with different names
+- same term used with different local contexts
+- courses that introduce topics in conflicting orders
+- weak or thin concept descriptions
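The last flagged issue — weak or thin concept descriptions — reduces to a simple length-and-signals check. A sketch on plain dicts; the 20-character threshold mirrors the scaffold's `detect_thin_concepts` heuristic:

```python
def flag_thin_concepts(concepts: list[dict]) -> list[str]:
    """Flag concepts whose descriptions or mastery signals look too thin to trust."""
    flags = []
    for concept in concepts:
        # Very short descriptions usually mean extraction recovered little.
        if len(concept.get("description", "").strip()) < 20:
            flags.append(f"Concept '{concept['title']}' has a very thin description.")
        # No mastery signals means no exercise was found for the concept.
        if not concept.get("mastery_signals"):
            flags.append(f"Concept '{concept['title']}' has no extracted mastery signals.")
    return flags
```

Each flag is a plain string, so the conflict report can simply concatenate the output of every detector.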
@@ -0,0 +1,42 @@
+concepts:
+- id: descriptive-statistics
+  title: Descriptive Statistics
+  description: 'Objective: Explain mean, median, and variance.
+
+    Exercise: Summarize a small dataset.
+
+    Descriptive Statistics introduces center and spread.'
+  prerequisites: []
+  mastery_signals:
+  - Summarize a small dataset.
+  mastery_profile: {}
+- id: probability-basics
+  title: Probability Basics
+  description: 'Objective: Explain conditional probability.
+
+    Exercise: Compute a simple conditional probability.
+
+    Probability Basics introduces events and likelihood.'
+  prerequisites:
+  - descriptive-statistics
+  mastery_signals:
+  - Compute a simple conditional probability.
+  mastery_profile: {}
+- id: prior-and-posterior
+  title: Prior And Posterior
+  description: 'Prior and Posterior are central concepts. Prior reflects assumptions
+    before evidence. Exercise: Compare prior and posterior beliefs.'
+  prerequisites:
+  - probability-basics
+  mastery_signals:
+  - Compare prior and posterior beliefs.
+  mastery_profile: {}
+- id: model-checking
+  title: Model Checking
+  description: 'A weakness is hidden assumptions. A limitation is poor fit. Uncertainty
+    remains. Exercise: Critique a simple inference model.'
+  prerequisites:
+  - prior-and-posterior
+  mastery_signals:
+  - Critique a simple inference model.
+  mastery_profile: {}
@@ -0,0 +1,3 @@
+# Conflict Report
+
+- Lesson 'prior and posterior' was merged from multiple sources; review ordering assumptions.
@@ -0,0 +1,30 @@
+{
+  "rights_note": "REVIEW REQUIRED",
+  "sources": [
+    {
+      "source_path": "examples/intro_bayes_outline.md",
+      "source_type": "markdown",
+      "title": "Intro Bayes Outline"
+    },
+    {
+      "source_path": "examples/intro_bayes_lecture.html",
+      "source_type": "html",
+      "title": "Intro Bayes Lecture"
+    },
+    {
+      "source_path": "examples/intro_bayes_slides.pptx",
+      "source_type": "pptx",
+      "title": "Intro Bayes Slides"
+    },
+    {
+      "source_path": "examples/intro_bayes_notes.docx",
+      "source_type": "docx",
+      "title": "Intro Bayes Notes"
+    },
+    {
+      "source_path": "examples/intro_bayes_syllabus.pdf",
+      "source_type": "pdf",
+      "title": "Intro Bayes Syllabus"
+    }
+  ]
+}
@@ -0,0 +1,14 @@
+name: introductory-bayesian-inference
+display_name: Introductory Bayesian Inference
+version: 0.1.0-draft
+schema_version: '1'
+didactopus_min_version: 0.1.0
+didactopus_max_version: 0.9.99
+description: Draft topic pack generated from multi-course inputs for 'Introductory
+  Bayesian Inference'.
+author: Wesley R. Elsberry
+license: REVIEW-REQUIRED
+dependencies: []
+overrides: []
+profile_templates: {}
+cross_pack_links: []
@@ -0,0 +1,7 @@
+projects:
+- id: prior-and-posterior
+  title: Prior And Posterior
+  difficulty: review-required
+  prerequisites: []
+  deliverables:
+  - project artifact
@@ -0,0 +1,3 @@
+# Review Report
+
+- Module 'Imported from PPTX' appears to contain project-like material; review project extraction.
@@ -0,0 +1,17 @@
+stages:
+- id: stage-1
+  title: Imported from MARKDOWN
+  concepts:
+  - descriptive-statistics
+  - probability-basics
+  checkpoint: []
+- id: stage-2
+  title: Imported from HTML
+  concepts:
+  - prior-and-posterior
+  checkpoint: []
+- id: stage-3
+  title: Imported from DOCX
+  concepts:
+  - model-checking
+  checkpoint: []
@@ -0,0 +1,6 @@
+rubrics:
+- id: draft-rubric
+  title: Draft Rubric
+  criteria:
+  - correctness
+  - explanation
@@ -0,0 +1,7 @@
+<html><body>
+<h1>Introductory Bayesian Inference</h1>
+<h2>Bayesian Updating</h2>
+<h3>Prior and Posterior</h3>
+<p>Prior and Posterior are central concepts. Prior reflects assumptions before evidence.</p>
+<p>Exercise: Compare prior and posterior beliefs.</p>
+</body></html>
@@ -0,0 +1,6 @@
+# Bayesian Notes
+
+## Model Critique
+### Model Checking
+A weakness is hidden assumptions. A limitation is poor fit. Uncertainty remains.
+Exercise: Critique a simple inference model.
@@ -0,0 +1,12 @@
+# Introductory Bayesian Inference
+
+## Foundations
+### Descriptive Statistics
+Objective: Explain mean, median, and variance.
+Exercise: Summarize a small dataset.
+Descriptive Statistics introduces center and spread.
+
+### Probability Basics
+Objective: Explain conditional probability.
+Exercise: Compute a simple conditional probability.
+Probability Basics introduces events and likelihood.
@@ -0,0 +1,7 @@
+# Bayesian Slides
+
+## Bayesian Updating
+### Prior and Posterior
+Prior and Posterior summary slide text.
+Capstone Mini Project
+Exercise: Write a short project report comparing priors and posteriors.
@@ -0,0 +1,5 @@
+# Bayesian Syllabus
+
+## Schedule
+### Foundations
+Objective: Explain descriptive statistics and conditional probability.
@@ -5,7 +5,7 @@ build-backend = "setuptools.build_meta"
 [project]
 name = "didactopus"
 version = "0.1.0"
-description = "Didactopus: multi-source course-to-pack ingestion scaffold"
+description = "Didactopus: document-adapter and cross-course merger scaffold"
 readme = "README.md"
 requires-python = ">=3.10"
 license = {text = "MIT"}
@@ -16,7 +16,7 @@ dependencies = ["pydantic>=2.7", "pyyaml>=6.0"]
 dev = ["pytest>=8.0", "ruff>=0.6"]
 
 [project.scripts]
-didactopus-course-ingest = "didactopus.main:main"
+didactopus-topic-ingest = "didactopus.main:main"
 
 [tool.setuptools.packages.find]
 where = ["src"]
@@ -3,6 +3,15 @@ from pydantic import BaseModel, Field
 import yaml
 
 
+class DocumentAdaptersConfig(BaseModel):
+    allow_pdf: bool = True
+    allow_docx: bool = True
+    allow_pptx: bool = True
+    allow_html: bool = True
+    allow_markdown: bool = True
+    allow_text: bool = True
+
+
 class CourseIngestConfig(BaseModel):
     default_pack_author: str = "Unknown"
     default_license: str = "REVIEW-REQUIRED"
@@ -10,23 +19,17 @@ class CourseIngestConfig(BaseModel):
     max_terms_per_lesson: int = 8
 
 
-class RulePolicyConfig(BaseModel):
-    enable_prerequisite_order_rule: bool = True
-    enable_duplicate_term_merge_rule: bool = True
-    enable_project_detection_rule: bool = True
-    enable_review_flags: bool = True
-
-
-class MultisourceConfig(BaseModel):
-    detect_duplicate_lessons: bool = True
+class CrossCourseConfig(BaseModel):
+    detect_title_overlaps: bool = True
     detect_term_conflicts: bool = True
+    detect_order_conflicts: bool = True
     merge_same_named_lessons: bool = True
 
 
 class AppConfig(BaseModel):
+    document_adapters: DocumentAdaptersConfig = Field(default_factory=DocumentAdaptersConfig)
     course_ingest: CourseIngestConfig = Field(default_factory=CourseIngestConfig)
-    rule_policy: RulePolicyConfig = Field(default_factory=RulePolicyConfig)
-    multisource: MultisourceConfig = Field(default_factory=MultisourceConfig)
+    cross_course: CrossCourseConfig = Field(default_factory=CrossCourseConfig)
 
 
 def load_config(path: str | Path) -> AppConfig:
@@ -1,8 +1,21 @@
 from __future__ import annotations
 
 from pydantic import BaseModel, Field
 
 
+class Section(BaseModel):
+    heading: str
+    body: str = ""
+
+
+class NormalizedDocument(BaseModel):
+    source_path: str
+    source_type: str
+    title: str = ""
+    text: str = ""
+    sections: list[Section] = Field(default_factory=list)
+    metadata: dict = Field(default_factory=dict)
+
+
 class Lesson(BaseModel):
     title: str
     body: str = ""
@@ -17,21 +30,18 @@ class Module(BaseModel):
     lessons: list[Lesson] = Field(default_factory=list)
 
 
-class NormalizedSourceRecord(BaseModel):
-    source_name: str
-    source_type: str
-    source_path: str
-    title: str = ""
-    modules: list[Module] = Field(default_factory=list)
-
-
 class NormalizedCourse(BaseModel):
     title: str
     source_name: str = ""
     source_url: str = ""
     rights_note: str = ""
     modules: list[Module] = Field(default_factory=list)
-    source_records: list[NormalizedSourceRecord] = Field(default_factory=list)
+    source_records: list[NormalizedDocument] = Field(default_factory=list)
 
 
+class TopicBundle(BaseModel):
+    topic_title: str
+    courses: list[NormalizedCourse] = Field(default_factory=list)
+
+
 class ConceptCandidate(BaseModel):
@@ -40,6 +50,7 @@ class ConceptCandidate(BaseModel):
     description: str = ""
     source_modules: list[str] = Field(default_factory=list)
     source_lessons: list[str] = Field(default_factory=list)
+    source_courses: list[str] = Field(default_factory=list)
     prerequisites: list[str] = Field(default_factory=list)
     mastery_signals: list[str] = Field(default_factory=list)
@@ -0,0 +1,50 @@
+from __future__ import annotations
+
+from collections import defaultdict
+
+from .course_schema import NormalizedCourse, ConceptCandidate
+
+
+def detect_title_overlaps(course: NormalizedCourse) -> list[str]:
+    lesson_to_sources = defaultdict(set)
+    for module in course.modules:
+        for lesson in module.lessons:
+            for src in lesson.source_refs:
+                lesson_to_sources[lesson.title.lower()].add(src)
+    flags = []
+    for title, sources in lesson_to_sources.items():
+        if len(sources) > 1:
+            flags.append(f"Lesson title '{title}' appears across multiple sources: {', '.join(sorted(sources))}")
+    return flags
+
+
+def detect_term_conflicts(course: NormalizedCourse) -> list[str]:
+    term_to_lessons = defaultdict(set)
+    for module in course.modules:
+        for lesson in module.lessons:
+            for term in lesson.key_terms:
+                term_to_lessons[term.lower()].add(lesson.title)
+    flags = []
+    for term, lessons in term_to_lessons.items():
+        if len(lessons) > 1:
+            flags.append(f"Key term '{term}' appears in multiple lesson contexts: {', '.join(sorted(lessons))}")
+    return flags
+
+
+def detect_order_conflicts(course: NormalizedCourse) -> list[str]:
+    # Placeholder heuristic: if same lesson title appears in multiple source_refs, flag for order review.
+    flags = []
+    for module in course.modules:
+        for lesson in module.lessons:
+            if len(set(lesson.source_refs)) > 1:
+                flags.append(f"Lesson '{lesson.title}' was merged from multiple sources; review ordering assumptions.")
+    return flags
+
+
+def detect_thin_concepts(concepts: list[ConceptCandidate]) -> list[str]:
+    flags = []
+    for concept in concepts:
+        if len(concept.description.strip()) < 20:
+            flags.append(f"Concept '{concept.title}' has a very thin description.")
+        if not concept.mastery_signals:
+            flags.append(f"Concept '{concept.title}' has no extracted mastery signals.")
+    return flags
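For readers without the pydantic models handy, the overlap check above can be exercised on plain dicts. The hypothetical `detect_overlaps` below mirrors `detect_title_overlaps`, with each lesson as `{"title": str, "source_refs": [str, ...]}`:

```python
from collections import defaultdict


def detect_overlaps(lessons: list[dict]) -> list[str]:
    """Flag lesson titles that appear across more than one source document."""
    # Group source references under the lowercased lesson title.
    lesson_to_sources = defaultdict(set)
    for lesson in lessons:
        for src in lesson["source_refs"]:
            lesson_to_sources[lesson["title"].lower()].add(src)
    # Any title drawn from two or more distinct sources gets a review flag.
    return [
        f"Lesson title '{title}' appears across multiple sources: {', '.join(sorted(sources))}"
        for title, sources in lesson_to_sources.items()
        if len(sources) > 1
    ]
```

The string output matches the conflict-report style used elsewhere in the scaffold, so detectors compose by simple list concatenation.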
@@ -0,0 +1,141 @@
+from __future__ import annotations
+
+from pathlib import Path
+import re
+
+from .course_schema import NormalizedDocument, Section
+
+
+def _title_from_path(path: str | Path) -> str:
+    p = Path(path)
+    return p.stem.replace("_", " ").replace("-", " ").title()
+
+
+def _simple_section_split(text: str) -> list[Section]:
+    sections = []
+    current_heading = "Main"
+    current_lines = []
+    for line in text.splitlines():
+        if re.match(r"^(#{1,3})\s+", line):
+            if current_lines:
+                sections.append(Section(heading=current_heading, body="\n".join(current_lines).strip()))
+            current_heading = re.sub(r"^(#{1,3})\s+", "", line).strip()
+            current_lines = []
+        else:
+            current_lines.append(line)
+    if current_lines:
+        sections.append(Section(heading=current_heading, body="\n".join(current_lines).strip()))
+    return sections
+
+
+def read_textish(path: str | Path) -> str:
+    return Path(path).read_text(encoding="utf-8")
+
+
+def adapt_markdown(path: str | Path) -> NormalizedDocument:
+    text = read_textish(path)
+    return NormalizedDocument(
+        source_path=str(path),
+        source_type="markdown",
+        title=_title_from_path(path),
+        text=text,
+        sections=_simple_section_split(text),
+        metadata={},
+    )
+
+
+def adapt_text(path: str | Path) -> NormalizedDocument:
+    text = read_textish(path)
+    return NormalizedDocument(
+        source_path=str(path),
+        source_type="text",
+        title=_title_from_path(path),
+        text=text,
+        sections=_simple_section_split(text),
+        metadata={},
+    )
+
+
+def adapt_html(path: str | Path) -> NormalizedDocument:
+    raw = read_textish(path)
+    text = re.sub(r"<[^>]+>", " ", raw)
+    text = re.sub(r"\s+", " ", text).strip()
+    return NormalizedDocument(
+        source_path=str(path),
+        source_type="html",
+        title=_title_from_path(path),
+        text=text,
+        sections=[Section(heading="HTML Extract", body=text)],
+        metadata={"extraction": "stub-html-strip"},
+    )
+
+
+def adapt_pdf(path: str | Path) -> NormalizedDocument:
+    # Stub: in a real implementation, plug in PDF text extraction here.
+    text = read_textish(path)
+    return NormalizedDocument(
+        source_path=str(path),
+        source_type="pdf",
+        title=_title_from_path(path),
+        text=text,
+        sections=_simple_section_split(text),
+        metadata={"extraction": "stub-pdf-text"},
+    )
+
+
+def adapt_docx(path: str | Path) -> NormalizedDocument:
+    # Stub: in a real implementation, plug in DOCX extraction here.
+    text = read_textish(path)
+    return NormalizedDocument(
+        source_path=str(path),
+        source_type="docx",
+        title=_title_from_path(path),
+        text=text,
+        sections=_simple_section_split(text),
+        metadata={"extraction": "stub-docx-text"},
+    )
+
+
+def adapt_pptx(path: str | Path) -> NormalizedDocument:
+    # Stub: in a real implementation, plug in PPTX extraction here.
+    text = read_textish(path)
+    return NormalizedDocument(
+        source_path=str(path),
+        source_type="pptx",
+        title=_title_from_path(path),
+        text=text,
+        sections=_simple_section_split(text),
+        metadata={"extraction": "stub-pptx-text"},
+    )
+
+
+def detect_adapter(path: str | Path) -> str:
+    p = Path(path)
+    suffix = p.suffix.lower()
+    if suffix == ".md":
+        return "markdown"
+    if suffix in {".txt"}:
+        return "text"
+    if suffix in {".html", ".htm"}:
+        return "html"
+    if suffix == ".pdf":
+        return "pdf"
+    if suffix == ".docx":
+        return "docx"
+    if suffix == ".pptx":
+        return "pptx"
+    return "text"
+
+
+def adapt_document(path: str | Path) -> NormalizedDocument:
+    adapter = detect_adapter(path)
+    if adapter == "markdown":
+        return adapt_markdown(path)
+    if adapter == "html":
+        return adapt_html(path)
+    if adapter == "pdf":
+        return adapt_pdf(path)
+    if adapter == "docx":
+        return adapt_docx(path)
+    if adapter == "pptx":
+        return adapt_pptx(path)
+    return adapt_text(path)
@ -4,18 +4,19 @@ import argparse
|
||||||
from pathlib import Path
|
from pathlib import Path
|
||||||
|
|
||||||
from .config import load_config
|
from .config import load_config
|
||||||
from .course_ingest import parse_source_file, merge_source_records, extract_concept_candidates
|
from .document_adapters import adapt_document
|
||||||
|
from .topic_ingest import document_to_course, build_topic_bundle, merge_courses_into_topic_course, extract_concept_candidates
|
||||||
|
from .cross_course_conflicts import detect_title_overlaps, detect_term_conflicts, detect_order_conflicts, detect_thin_concepts
|
||||||
from .rule_policy import RuleContext, build_default_rules, run_rules
|
from .rule_policy import RuleContext, build_default_rules, run_rules
|
||||||
from .conflict_report import detect_duplicate_lessons, detect_term_conflicts, detect_thin_concepts
|
|
||||||
from .pack_emitter import build_draft_pack, write_draft_pack
|
from .pack_emitter import build_draft_pack, write_draft_pack
|
||||||
|
|
||||||
|
|
||||||
def build_parser() -> argparse.ArgumentParser:
|
def build_parser() -> argparse.ArgumentParser:
|
||||||
parser = argparse.ArgumentParser(description="Didactopus multi-source course-to-pack ingestion pipeline")
|
parser = argparse.ArgumentParser(description="Didactopus document-adapter and cross-course topic ingestion")
|
||||||
parser.add_argument("--inputs", nargs="+", required=True, help="Input source files")
|
parser.add_argument("--inputs", nargs="+", required=True, help="Document inputs")
|
||||||
parser.add_argument("--title", required=True, help="Course or topic title")
|
parser.add_argument("--title", required=True, help="Topic title")
|
||||||
parser.add_argument("--rights-note", default="REVIEW REQUIRED")
|
parser.add_argument("--rights-note", default="REVIEW REQUIRED")
|
||||||
parser.add_argument("--output-dir", default="generated-pack")
|
parser.add_argument("--output-dir", default="generated-topic-pack")
|
||||||
parser.add_argument("--config", default="configs/config.example.yaml")
|
parser.add_argument("--config", default="configs/config.example.yaml")
|
||||||
return parser
|
return parser
|
||||||
|
|
||||||
|
|
@ -24,33 +25,30 @@ def main() -> None:
|
||||||
args = build_parser().parse_args()
|
args = build_parser().parse_args()
|
||||||
config = load_config(args.config)
|
config = load_config(args.config)
|
||||||
|
|
||||||
records = [parse_source_file(path, title=args.title) for path in args.inputs]
|
docs = [adapt_document(path) for path in args.inputs]
|
||||||
course = merge_source_records(
|
courses = [document_to_course(doc, course_title=args.title) for doc in docs]
|
||||||
records=records,
|
topic = build_topic_bundle(args.title, courses)
|
||||||
course_title=args.title,
|
merged_course = merge_courses_into_topic_course(
|
||||||
```diff
-        rights_note=args.rights_note,
-        merge_same_named_lessons=config.multisource.merge_same_named_lessons,
+        topic_bundle=topic,
+        merge_same_named_lessons=config.cross_course.merge_same_named_lessons,
     )
-    concepts = extract_concept_candidates(course)
-    context = RuleContext(course=course, concepts=concepts)
-
-    rules = build_default_rules(
-        enable_prereq=config.rule_policy.enable_prerequisite_order_rule,
-        enable_merge=config.rule_policy.enable_duplicate_term_merge_rule,
-        enable_projects=config.rule_policy.enable_project_detection_rule,
-        enable_review=config.rule_policy.enable_review_flags,
-    )
+    concepts = extract_concept_candidates(merged_course)
+    context = RuleContext(course=merged_course, concepts=concepts)
+    rules = build_default_rules()
     run_rules(context, rules)
 
     conflicts = []
-    if config.multisource.detect_duplicate_lessons:
-        conflicts.extend(detect_duplicate_lessons(course))
-    if config.multisource.detect_term_conflicts:
-        conflicts.extend(detect_term_conflicts(course))
+    if config.cross_course.detect_title_overlaps:
+        conflicts.extend(detect_title_overlaps(merged_course))
+    if config.cross_course.detect_term_conflicts:
+        conflicts.extend(detect_term_conflicts(merged_course))
+    if config.cross_course.detect_order_conflicts:
+        conflicts.extend(detect_order_conflicts(merged_course))
     conflicts.extend(detect_thin_concepts(context.concepts))
 
     draft = build_draft_pack(
-        course=course,
+        course=merged_course,
         concepts=context.concepts,
         author=config.course_ingest.default_pack_author,
         license_name=config.course_ingest.default_license,
```
```diff
@@ -59,10 +57,11 @@ def main() -> None:
     )
     write_draft_pack(draft, args.output_dir)
 
-    print("== Didactopus Multi-Source Course Ingest ==")
-    print(f"Course: {course.title}")
-    print(f"Sources: {len(records)}")
-    print(f"Modules: {len(course.modules)}")
+    print("== Didactopus Cross-Course Topic Ingest ==")
+    print(f"Topic: {args.title}")
+    print(f"Documents: {len(docs)}")
+    print(f"Courses: {len(courses)}")
+    print(f"Merged modules: {len(merged_course.modules)}")
     print(f"Concept candidates: {len(context.concepts)}")
     print(f"Review flags: {len(context.review_flags)}")
     print(f"Conflicts: {len(conflicts)}")
```
```diff
@@ -15,7 +15,7 @@ def build_draft_pack(course: NormalizedCourse, concepts: list[ConceptCandidate],
         "schema_version": "1",
         "didactopus_min_version": "0.1.0",
         "didactopus_max_version": "0.9.99",
-        "description": f"Draft pack generated from multi-source course inputs for '{course.title}'.",
+        "description": f"Draft topic pack generated from multi-course inputs for '{course.title}'.",
         "author": author,
         "license": license_name,
         "dependencies": [],
```
```diff
@@ -64,7 +64,7 @@ def build_draft_pack(course: NormalizedCourse, concepts: list[ConceptCandidate],
     attribution = {
         "rights_note": course.rights_note,
         "sources": [
-            {"source_name": src.source_name, "source_type": src.source_type, "source_path": src.source_path}
+            {"source_path": src.source_path, "source_type": src.source_type, "title": src.title}
             for src in course.source_records
         ],
     }
```
```diff
@@ -88,11 +88,8 @@ def write_draft_pack(pack: DraftPack, outdir: str | Path) -> None:
     (out / "roadmap.yaml").write_text(yaml.safe_dump(pack.roadmap, sort_keys=False), encoding="utf-8")
     (out / "projects.yaml").write_text(yaml.safe_dump(pack.projects, sort_keys=False), encoding="utf-8")
     (out / "rubrics.yaml").write_text(yaml.safe_dump(pack.rubrics, sort_keys=False), encoding="utf-8")
 
     review_lines = ["# Review Report", ""] + [f"- {flag}" for flag in pack.review_report] if pack.review_report else ["# Review Report", "", "- none"]
     (out / "review_report.md").write_text("\n".join(review_lines), encoding="utf-8")
 
     conflict_lines = ["# Conflict Report", ""] + [f"- {flag}" for flag in pack.conflicts] if pack.conflicts else ["# Conflict Report", "", "- none"]
     (out / "conflict_report.md").write_text("\n".join(conflict_lines), encoding="utf-8")
 
     (out / "license_attribution.json").write_text(json.dumps(pack.attribution, indent=2), encoding="utf-8")
```
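The report writers in `write_draft_pack` repeat one pattern for both markdown reports: a heading, a blank line, then either one bullet per flag or a `- none` placeholder. A minimal self-contained sketch of that pattern (`write_report` and the sample flag text are illustrative, not names from the codebase):

```python
from pathlib import Path
import tempfile

def write_report(path: Path, title: str, flags: list[str]) -> None:
    # Heading, blank line, then one bullet per flag -- or a "- none"
    # placeholder when the flag list is empty.
    lines = [f"# {title}", ""] + ([f"- {flag}" for flag in flags] if flags else ["- none"])
    path.write_text("\n".join(lines), encoding="utf-8")

out = Path(tempfile.mkdtemp())
write_report(out / "review_report.md", "Review Report", ["thin concept: slugify"])
write_report(out / "conflict_report.md", "Conflict Report", [])
print((out / "conflict_report.md").read_text(encoding="utf-8"))
```

The one-liner in the diff relies on the conditional expression binding looser than `+`, i.e. `(head + bullets) if flags else fallback`; factoring it into a helper makes that precedence explicit.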
```diff
@@ -39,6 +39,7 @@ def duplicate_term_merge_rule(context: RuleContext) -> None:
         if key in seen:
             seen[key].source_modules.extend(x for x in concept.source_modules if x not in seen[key].source_modules)
             seen[key].source_lessons.extend(x for x in concept.source_lessons if x not in seen[key].source_lessons)
+            seen[key].source_courses.extend(x for x in concept.source_courses if x not in seen[key].source_courses)
             if concept.description and len(seen[key].description) < len(concept.description):
                 seen[key].description = concept.description
         else:
```
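The `extend(x for x in … if x not in …)` idiom that `duplicate_term_merge_rule` applies to `source_modules`, `source_lessons`, and now `source_courses` is an order-preserving, de-duplicating append. A tiny sketch (`extend_unique` is an illustrative name, not a project function):

```python
def extend_unique(target: list, items) -> None:
    # Append each item not already in target, preserving first-seen order.
    # Membership is checked against the growing list, so duplicates inside
    # items are collapsed too (list.extend consumes the generator lazily).
    target.extend(x for x in items if x not in target)

mods = ["M1", "M2"]
extend_unique(mods, ["M2", "M3", "M3"])
print(mods)  # ['M1', 'M2', 'M3']
```

The membership test is linear, so this is O(n²) per merge; fine for a handful of source lists, worth a set-based index if the lists ever grow large.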
```diff
@@ -0,0 +1,126 @@
+from __future__ import annotations
+
+import re
+from collections import defaultdict
+from .course_schema import NormalizedDocument, NormalizedCourse, Module, Lesson, TopicBundle, ConceptCandidate
+
+
+def slugify(text: str) -> str:
+    cleaned = re.sub(r"[^a-zA-Z0-9]+", "-", text.strip().lower()).strip("-")
+    return cleaned or "untitled"
+
+
+def extract_key_terms(text: str, min_term_length: int = 4, max_terms: int = 8) -> list[str]:
+    candidates = re.findall(r"\b[A-Z][A-Za-z0-9\-]{%d,}\b" % (min_term_length - 1), text)
+    seen = set()
+    out = []
+    for term in candidates:
+        if term not in seen:
+            seen.add(term)
+            out.append(term)
+            if len(out) >= max_terms:
+                break
+    return out
```
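`extract_key_terms` keeps the first `max_terms` capitalized tokens of at least `min_term_length` characters, first occurrence winning. The function below is copied from the new module (the sample sentence is illustrative):

```python
import re

def extract_key_terms(text: str, min_term_length: int = 4, max_terms: int = 8) -> list[str]:
    # Capitalized tokens of at least min_term_length characters; first
    # occurrence wins, capped at max_terms.
    candidates = re.findall(r"\b[A-Z][A-Za-z0-9\-]{%d,}\b" % (min_term_length - 1), text)
    seen, out = set(), []
    for term in candidates:
        if term not in seen:
            seen.add(term)
            out.append(term)
            if len(out) >= max_terms:
                break
    return out

print(extract_key_terms("Bayesian updating turns a Prior into a Posterior given new data."))
# ['Bayesian', 'Prior', 'Posterior']
```

Capitalization as a term signal is deliberately crude scaffold behavior: it over-selects sentence-initial words and misses lowercase jargon, which is consistent with the stated stance that stronger extractors can replace these stubs later.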
```diff
+
+
+def document_to_course(doc: NormalizedDocument, course_title: str) -> NormalizedCourse:
+    # Conservative mapping: each section becomes a lesson; all lessons go into one module.
+    lessons = []
+    for section in doc.sections:
+        body = section.body.strip()
+        lines = body.splitlines()
+        objectives = []
+        exercises = []
+        for line in lines:
+            low = line.lower().strip()
+            if low.startswith("objective:"):
+                objectives.append(line.split(":", 1)[1].strip())
+            if low.startswith("exercise:"):
+                exercises.append(line.split(":", 1)[1].strip())
+        lessons.append(
+            Lesson(
+                title=section.heading.strip() or "Untitled Lesson",
+                body=body,
+                objectives=objectives,
+                exercises=exercises,
+                key_terms=extract_key_terms(section.heading + "\n" + body),
+                source_refs=[doc.source_path],
+            )
+        )
+    module = Module(title=f"Imported from {doc.source_type.upper()}", lessons=lessons)
+    return NormalizedCourse(title=course_title, modules=[module], source_records=[doc])
+
+
+def build_topic_bundle(topic_title: str, courses: list[NormalizedCourse]) -> TopicBundle:
+    return TopicBundle(topic_title=topic_title, courses=courses)
+
+
+def merge_courses_into_topic_course(topic_bundle: TopicBundle, merge_same_named_lessons: bool = True) -> NormalizedCourse:
+    modules_by_title: dict[str, Module] = {}
+    source_records = []
+    for course in topic_bundle.courses:
+        source_records.extend(course.source_records)
+        for module in course.modules:
+            target_module = modules_by_title.setdefault(module.title, Module(title=module.title, lessons=[]))
+            if merge_same_named_lessons:
+                lesson_map = {lesson.title: lesson for lesson in target_module.lessons}
+                for lesson in module.lessons:
+                    if lesson.title in lesson_map:
+                        existing = lesson_map[lesson.title]
+                        if lesson.body and lesson.body not in existing.body:
+                            existing.body = (existing.body + "\n\n" + lesson.body).strip()
+                        for x in lesson.objectives:
+                            if x not in existing.objectives:
+                                existing.objectives.append(x)
+                        for x in lesson.exercises:
+                            if x not in existing.exercises:
+                                existing.exercises.append(x)
+                        for x in lesson.key_terms:
+                            if x not in existing.key_terms:
+                                existing.key_terms.append(x)
+                        for x in lesson.source_refs:
+                            if x not in existing.source_refs:
+                                existing.source_refs.append(x)
+                    else:
+                        target_module.lessons.append(lesson)
+            else:
+                target_module.lessons.extend(module.lessons)
+    return NormalizedCourse(title=topic_bundle.topic_title, modules=list(modules_by_title.values()), source_records=source_records)
+
+
+def extract_concept_candidates(course: NormalizedCourse) -> list[ConceptCandidate]:
+    concepts = []
+    seen_ids = set()
+    for module in course.modules:
+        for lesson in module.lessons:
+            cid = slugify(lesson.title)
+            if cid not in seen_ids:
+                seen_ids.add(cid)
+                concepts.append(
+                    ConceptCandidate(
+                        id=cid,
+                        title=lesson.title,
+                        description=lesson.body[:240].strip(),
+                        source_modules=[module.title],
+                        source_lessons=[lesson.title],
+                        source_courses=list(lesson.source_refs),
+                        mastery_signals=list(lesson.objectives[:3] or lesson.exercises[:2]),
+                    )
+                )
+            for term in lesson.key_terms:
+                tid = slugify(term)
+                if tid in seen_ids:
+                    continue
+                seen_ids.add(tid)
+                concepts.append(
+                    ConceptCandidate(
+                        id=tid,
+                        title=term,
+                        description=f"Candidate concept extracted from lesson '{lesson.title}'.",
+                        source_modules=[module.title],
+                        source_lessons=[lesson.title],
+                        source_courses=list(lesson.source_refs),
+                        mastery_signals=list(lesson.objectives[:2]),
+                    )
+                )
+    return concepts
```
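The same-named-lesson merge at the heart of `merge_courses_into_topic_course` can be exercised in isolation. The sketch below mirrors that logic using simplified stand-in dataclasses (`Lesson` and `Module` here model only the fields the merge path touches, not the real `course_schema` types):

```python
from dataclasses import dataclass, field

@dataclass
class Lesson:
    title: str
    body: str = ""
    objectives: list = field(default_factory=list)

@dataclass
class Module:
    title: str
    lessons: list = field(default_factory=list)

def merge_modules(modules):
    # Same-named modules are pooled; within a pooled module, same-named
    # lessons get their bodies concatenated and objectives union-merged.
    by_title = {}
    for module in modules:
        target = by_title.setdefault(module.title, Module(title=module.title))
        lesson_map = {lesson.title: lesson for lesson in target.lessons}
        for lesson in module.lessons:
            if lesson.title in lesson_map:
                existing = lesson_map[lesson.title]
                if lesson.body and lesson.body not in existing.body:
                    existing.body = (existing.body + "\n\n" + lesson.body).strip()
                for obj in lesson.objectives:
                    if obj not in existing.objectives:
                        existing.objectives.append(obj)
            else:
                target.lessons.append(lesson)
    return list(by_title.values())

a = Module("M", [Lesson("L1", "Body A", ["Explain A"])])
b = Module("M", [Lesson("L1", "Body B", ["Explain A", "Apply A"])])
merged = merge_modules([a, b])
print(merged[0].lessons[0].body)        # Body A, blank line, Body B
print(merged[0].lessons[0].objectives)  # ['Explain A', 'Apply A']
```

Note that, as in the diff, `lesson_map` is built once per incoming module and not updated when a new lesson is appended, so the merge keys off lessons already present in the target before the pass began.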
```diff
@@ -0,0 +1,19 @@
+from pathlib import Path
+from didactopus.document_adapters import adapt_document
+from didactopus.topic_ingest import document_to_course, build_topic_bundle, merge_courses_into_topic_course, extract_concept_candidates
+from didactopus.cross_course_conflicts import detect_title_overlaps, detect_term_conflicts, detect_order_conflicts, detect_thin_concepts
+
+
+def test_conflict_detection(tmp_path: Path) -> None:
+    a = tmp_path / "a.md"
+    b = tmp_path / "b.md"
+    a.write_text("# T\n\n## M1\n### Bayesian Updating\nPrior and Posterior appear here.", encoding="utf-8")
+    b.write_text("# T\n\n## M2\n### Bayesian Updating\nPrior and Posterior appear again.", encoding="utf-8")
+    docs = [adapt_document(a), adapt_document(b)]
+    courses = [document_to_course(doc, "Topic") for doc in docs]
+    merged = merge_courses_into_topic_course(build_topic_bundle("Topic", courses), merge_same_named_lessons=False)
+    concepts = extract_concept_candidates(merged)
+    assert isinstance(detect_title_overlaps(merged), list)
+    assert isinstance(detect_term_conflicts(merged), list)
+    assert isinstance(detect_order_conflicts(merged), list)
+    assert isinstance(detect_thin_concepts(concepts), list)
```
```diff
@@ -0,0 +1,18 @@
+from pathlib import Path
+from didactopus.document_adapters import adapt_document, detect_adapter
+
+
+def test_detect_adapter() -> None:
+    assert detect_adapter("a.md") == "markdown"
+    assert detect_adapter("b.html") == "html"
+    assert detect_adapter("c.pdf") == "pdf"
+    assert detect_adapter("d.docx") == "docx"
+    assert detect_adapter("e.pptx") == "pptx"
+
+
+def test_adapt_markdown(tmp_path: Path) -> None:
+    p = tmp_path / "x.md"
+    p.write_text("# T\n\n## A\nBody", encoding="utf-8")
+    doc = adapt_document(p)
+    assert doc.source_type == "markdown"
+    assert len(doc.sections) >= 1
```
```diff
@@ -1,17 +1,20 @@
 from pathlib import Path
-from didactopus.course_ingest import parse_source_file, merge_source_records, extract_concept_candidates
+from didactopus.document_adapters import adapt_document
+from didactopus.topic_ingest import document_to_course, build_topic_bundle, merge_courses_into_topic_course, extract_concept_candidates
 from didactopus.rule_policy import RuleContext, build_default_rules, run_rules
 from didactopus.pack_emitter import build_draft_pack, write_draft_pack
 
 
-def test_emit_multisource_pack(tmp_path: Path) -> None:
+def test_emit_topic_pack(tmp_path: Path) -> None:
     src = tmp_path / "course.md"
-    src.write_text("# C\n\n## M1\n### Lesson A\n- Objective: Explain Topic A.\n- Exercise: Do task A.\nTopic A body.", encoding="utf-8")
-    course = merge_source_records([parse_source_file(src, title="Course")], course_title="Course")
-    concepts = extract_concept_candidates(course)
-    ctx = RuleContext(course=course, concepts=concepts)
+    src.write_text("# T\n\n## M\n### L\nExercise: Do task A.\nTopic A body.", encoding="utf-8")
+    doc = adapt_document(src)
+    course = document_to_course(doc, "Topic")
+    merged = merge_courses_into_topic_course(build_topic_bundle("Topic", [course]))
+    concepts = extract_concept_candidates(merged)
+    ctx = RuleContext(course=merged, concepts=concepts)
     run_rules(ctx, build_default_rules())
-    draft = build_draft_pack(course, ctx.concepts, "Tester", "REVIEW", ctx.review_flags, [])
+    draft = build_draft_pack(merged, ctx.concepts, "Tester", "REVIEW", ctx.review_flags, [])
     write_draft_pack(draft, tmp_path / "out")
     assert (tmp_path / "out" / "pack.yaml").exists()
     assert (tmp_path / "out" / "conflict_report.md").exists()
```
```diff
@@ -0,0 +1,26 @@
+from pathlib import Path
+from didactopus.document_adapters import adapt_document
+from didactopus.topic_ingest import document_to_course, build_topic_bundle, merge_courses_into_topic_course, extract_concept_candidates
+
+
+def test_cross_course_merge(tmp_path: Path) -> None:
+    a = tmp_path / "a.md"
+    b = tmp_path / "b.docx"
+    a.write_text("# T\n\n## M\n### L1\nBody A", encoding="utf-8")
+    b.write_text("# T\n\n## M\n### L1\nBody B", encoding="utf-8")
+
+    docs = [adapt_document(a), adapt_document(b)]
+    courses = [document_to_course(doc, "Topic") for doc in docs]
+    topic = build_topic_bundle("Topic", courses)
+    merged = merge_courses_into_topic_course(topic)
+    assert len(merged.modules) >= 1
+    assert len(merged.modules[0].lessons) == 1
+
+
+def test_extract_concepts(tmp_path: Path) -> None:
+    a = tmp_path / "a.md"
+    a.write_text("# T\n\n## M\n### Lesson A\nObjective: Explain Topic A.\nBody.", encoding="utf-8")
+    doc = adapt_document(a)
+    course = document_to_course(doc, "Topic")
+    concepts = extract_concept_candidates(course)
+    assert len(concepts) >= 1
```