Added cross-course merger.
parent
8defaab1c2
commit
0656f7bbe8
36	README.md

@@ -8,6 +8,41 @@
## Recent revisions

### Course-to-course merger

This revision adds two major capabilities:

- **real document adapter scaffolds** for PDF, DOCX, PPTX, and HTML
- a **cross-course merger** for combining multiple course-derived packs into one stronger domain draft

These additions extend the earlier multi-source ingestion layer from "multiple files for one course"
to "multiple courses or course-like sources for one topic domain."

## What is included

- adapter registry for:
  - PDF
  - DOCX
  - PPTX
  - HTML
  - Markdown
  - text
- normalized document extraction interface
- course bundle ingestion across multiple source documents
- cross-course terminology and overlap analysis
- merged topic-pack emitter
- cross-course conflict report
- example source files and example merged output

## Design stance

This is still scaffold-level extraction. The purpose is to define stable interfaces and emitted artifacts,
not to claim perfect semantic parsing of every teaching document.

The implementation is designed so stronger parsers can later replace the stub extractors without changing
the surrounding pipeline.
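That replacement seam can be pictured as a small registry of extractor callables. The sketch below is illustrative only — the names (`register_extractor`, `extract`) are hypothetical, not the actual Didactopus API:

```python
# Minimal sketch of a pluggable-extractor seam: stub extractors today,
# stronger parsers swapped in later without touching the pipeline.
# All names here are illustrative, not the real Didactopus interfaces.
from typing import Callable

EXTRACTORS: dict[str, Callable[[str], str]] = {}


def register_extractor(source_type: str, fn: Callable[[str], str]) -> None:
    """Replace the extractor for one source type; callers are unaffected."""
    EXTRACTORS[source_type] = fn


def extract(source_type: str, raw: str) -> str:
    # Fall back to a trivial pass-through stub when nothing better is registered.
    return EXTRACTORS.get(source_type, lambda text: text)(raw)


# Today's stub behavior; a real PDF parser could be registered here later.
register_extractor("pdf", lambda raw: raw.strip())
```

The pipeline only ever calls `extract`, so upgrading one format is a one-line registration.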

### Multi-Source Course Ingestion

This revision adds a **Multi-Source Course Ingestion Layer**.

@@ -216,3 +251,4 @@ didactopus/

@@ -1,16 +1,19 @@
document_adapters:
  allow_pdf: true
  allow_docx: true
  allow_pptx: true
  allow_html: true
  allow_markdown: true
  allow_text: true

course_ingest:
  default_pack_author: "Wesley R. Elsberry"
  default_license: "REVIEW-REQUIRED"
  min_term_length: 4
  max_terms_per_lesson: 8

rule_policy:
  enable_prerequisite_order_rule: true
  enable_duplicate_term_merge_rule: true
  enable_project_detection_rule: true
  enable_review_flags: true

multisource:
  detect_duplicate_lessons: true

cross_course:
  detect_title_overlaps: true
  detect_term_conflicts: true
  detect_order_conflicts: true
  merge_same_named_lessons: true

@@ -0,0 +1,31 @@

# Cross-Course Merger

The cross-course merger combines multiple course-like inputs covering the same subject area.

## Goal

Build a stronger draft topic pack from several partially overlapping sources.

## What it does

- merges normalized source records into course bundles
- merges course bundles into one topic bundle
- compares repeated concepts across courses
- flags terminology conflicts and overlap
- emits a merged draft pack
- emits a cross-course conflict report

## Why this matters

A single course is rarely ideal for mastery-oriented domain construction.
Combining multiple sources can improve:

- concept coverage
- exercise diversity
- project identification
- terminology mapping
- prerequisite robustness

## Important caveat

This merger is draft-oriented.
Human review remains necessary before trusting the result as a final domain pack.
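The lesson-merge behavior described above can be sketched with plain dictionaries. This is a simplified illustration under assumed data shapes, not the actual Didactopus models or merge code:

```python
# Simplified sketch of merging same-named lessons across courses:
# union the exercises, keep every source reference.
# The dict shapes ("title", "exercises", "source") are hypothetical stand-ins.

def merge_lessons(courses: list[list[dict]]) -> list[dict]:
    """Merge lessons that share a title across several course lesson lists."""
    merged: dict[str, dict] = {}
    for course in courses:
        for lesson in course:
            key = lesson["title"].lower()
            if key not in merged:
                merged[key] = {"title": lesson["title"], "exercises": [], "sources": []}
            target = merged[key]
            for ex in lesson["exercises"]:
                if ex not in target["exercises"]:
                    target["exercises"].append(ex)
            if lesson["source"] not in target["sources"]:
                target["sources"].append(lesson["source"])
    return list(merged.values())
```

A lesson that ends up with more than one entry in `sources` is exactly the kind of merge the conflict report flags for human review.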

@@ -0,0 +1,42 @@

# Document Adapters

Didactopus now includes adapter scaffolds for several common educational document types.

## Supported adapter interfaces

- PDF adapter
- DOCX adapter
- PPTX adapter
- HTML adapter
- Markdown adapter
- text adapter

## Current status

The current implementation is intentionally conservative:

- it focuses on stable interfaces
- it extracts text in a simplified way
- it normalizes results into shared internal structures

## Why this matters

Educational material commonly lives in:

- syllabi PDFs
- DOCX notes
- PowerPoint slide decks
- LMS HTML exports
- markdown lesson files

A useful curriculum distiller must be able to treat these as first-class inputs.

## Adapter contract

Each adapter returns a normalized document record with:

- source path
- source type
- title
- extracted text
- sections
- metadata

This record is then passed into higher-level course/topic distillation logic.
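The contract can be pictured as a small record type. The dataclass below mirrors the field list above for illustration; the real implementation uses pydantic models, so this is a sketch, not the exact definition:

```python
# Illustrative record type mirroring the adapter contract listed above.
# The real Didactopus code defines these as pydantic models.
from dataclasses import dataclass, field


@dataclass
class Section:
    heading: str
    body: str = ""


@dataclass
class DocumentRecord:
    source_path: str          # where the file came from
    source_type: str          # "pdf", "docx", "pptx", "html", "markdown", "text"
    title: str = ""
    text: str = ""            # extracted text
    sections: list[Section] = field(default_factory=list)
    metadata: dict = field(default_factory=dict)


record = DocumentRecord(
    source_path="examples/intro_bayes_outline.md",
    source_type="markdown",
    title="Intro Bayes Outline",
    sections=[Section(heading="Descriptive Statistics")],
)
```

Every adapter emits the same shape, so downstream course/topic distillation never branches on the input format.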

34	docs/faq.md

@@ -1,27 +1,25 @@
 # FAQ
 
-## Why multi-source ingestion?
+## Why add document adapters now?
 
-Because course structure is usually distributed across several files rather than
-perfectly contained in one source.
+Because real educational material is rarely provided in only one plain-text format.
 
-## What kinds of conflicts can arise?
+## Are these full-fidelity parsers?
 
-Common examples:
-- the same lesson with slightly different titles
-- inconsistent terminology across notes and transcripts
-- exercises present in one source but absent in another
-- project prompts implied in one file and explicit in another
+Not yet. The current implementation is a stable scaffold for extraction and normalization.
 
-## Does the system resolve all conflicts automatically?
+## Why add cross-course merging?
 
-No. It produces a merged draft pack and a conflict report for human review.
+Because one course often under-specifies a domain, while multiple sources together can produce a better draft pack.
 
-## Why not rely only on embeddings for this?
+## Does the merger resolve every concept conflict automatically?
 
-Because Didactopus needs explicit structures such as:
-- concepts
-- prerequisites
-- projects
-- rubrics
-- checkpoints
+No. It produces a merged draft plus a conflict report for human review.
+
+## What kinds of issues are flagged?
+
+Examples:
+- repeated concepts with different names
+- same term used with different local contexts
+- courses that introduce topics in conflicting orders
+- weak or thin concept descriptions

@@ -0,0 +1,42 @@

concepts:
- id: descriptive-statistics
  title: Descriptive Statistics
  description: 'Objective: Explain mean, median, and variance.

    Exercise: Summarize a small dataset.

    Descriptive Statistics introduces center and spread.'
  prerequisites: []
  mastery_signals:
  - Summarize a small dataset.
  mastery_profile: {}
- id: probability-basics
  title: Probability Basics
  description: 'Objective: Explain conditional probability.

    Exercise: Compute a simple conditional probability.

    Probability Basics introduces events and likelihood.'
  prerequisites:
  - descriptive-statistics
  mastery_signals:
  - Compute a simple conditional probability.
  mastery_profile: {}
- id: prior-and-posterior
  title: Prior And Posterior
  description: 'Prior and Posterior are central concepts. Prior reflects assumptions
    before evidence. Exercise: Compare prior and posterior beliefs.'
  prerequisites:
  - probability-basics
  mastery_signals:
  - Compare prior and posterior beliefs.
  mastery_profile: {}
- id: model-checking
  title: Model Checking
  description: 'A weakness is hidden assumptions. A limitation is poor fit. Uncertainty
    remains. Exercise: Critique a simple inference model.'
  prerequisites:
  - prior-and-posterior
  mastery_signals:
  - Critique a simple inference model.
  mastery_profile: {}

@@ -0,0 +1,3 @@

# Conflict Report

- Lesson 'prior and posterior' was merged from multiple sources; review ordering assumptions.

@@ -0,0 +1,30 @@

{
  "rights_note": "REVIEW REQUIRED",
  "sources": [
    {
      "source_path": "examples/intro_bayes_outline.md",
      "source_type": "markdown",
      "title": "Intro Bayes Outline"
    },
    {
      "source_path": "examples/intro_bayes_lecture.html",
      "source_type": "html",
      "title": "Intro Bayes Lecture"
    },
    {
      "source_path": "examples/intro_bayes_slides.pptx",
      "source_type": "pptx",
      "title": "Intro Bayes Slides"
    },
    {
      "source_path": "examples/intro_bayes_notes.docx",
      "source_type": "docx",
      "title": "Intro Bayes Notes"
    },
    {
      "source_path": "examples/intro_bayes_syllabus.pdf",
      "source_type": "pdf",
      "title": "Intro Bayes Syllabus"
    }
  ]
}

@@ -0,0 +1,14 @@

name: introductory-bayesian-inference
display_name: Introductory Bayesian Inference
version: 0.1.0-draft
schema_version: '1'
didactopus_min_version: 0.1.0
didactopus_max_version: 0.9.99
description: Draft topic pack generated from multi-course inputs for 'Introductory
  Bayesian Inference'.
author: Wesley R. Elsberry
license: REVIEW-REQUIRED
dependencies: []
overrides: []
profile_templates: {}
cross_pack_links: []

@@ -0,0 +1,7 @@

projects:
- id: prior-and-posterior
  title: Prior And Posterior
  difficulty: review-required
  prerequisites: []
  deliverables:
  - project artifact

@@ -0,0 +1,3 @@

# Review Report

- Module 'Imported from PPTX' appears to contain project-like material; review project extraction.

@@ -0,0 +1,17 @@

stages:
- id: stage-1
  title: Imported from MARKDOWN
  concepts:
  - descriptive-statistics
  - probability-basics
  checkpoint: []
- id: stage-2
  title: Imported from HTML
  concepts:
  - prior-and-posterior
  checkpoint: []
- id: stage-3
  title: Imported from DOCX
  concepts:
  - model-checking
  checkpoint: []

@@ -0,0 +1,6 @@

rubrics:
- id: draft-rubric
  title: Draft Rubric
  criteria:
  - correctness
  - explanation

@@ -0,0 +1,7 @@

<html><body>
<h1>Introductory Bayesian Inference</h1>
<h2>Bayesian Updating</h2>
<h3>Prior and Posterior</h3>
<p>Prior and Posterior are central concepts. Prior reflects assumptions before evidence.</p>
<p>Exercise: Compare prior and posterior beliefs.</p>
</body></html>

@@ -0,0 +1,6 @@

# Bayesian Notes

## Model Critique
### Model Checking
A weakness is hidden assumptions. A limitation is poor fit. Uncertainty remains.
Exercise: Critique a simple inference model.

@@ -0,0 +1,12 @@

# Introductory Bayesian Inference

## Foundations
### Descriptive Statistics
Objective: Explain mean, median, and variance.
Exercise: Summarize a small dataset.
Descriptive Statistics introduces center and spread.

### Probability Basics
Objective: Explain conditional probability.
Exercise: Compute a simple conditional probability.
Probability Basics introduces events and likelihood.

@@ -0,0 +1,7 @@

# Bayesian Slides

## Bayesian Updating
### Prior and Posterior
Prior and Posterior summary slide text.
Capstone Mini Project
Exercise: Write a short project report comparing priors and posteriors.

@@ -0,0 +1,5 @@

# Bayesian Syllabus

## Schedule
### Foundations
Objective: Explain descriptive statistics and conditional probability.

@@ -5,7 +5,7 @@ build-backend = "setuptools.build_meta"
 [project]
 name = "didactopus"
 version = "0.1.0"
-description = "Didactopus: multi-source course-to-pack ingestion scaffold"
+description = "Didactopus: document-adapter and cross-course merger scaffold"
 readme = "README.md"
 requires-python = ">=3.10"
 license = {text = "MIT"}

@@ -16,7 +16,7 @@ dependencies = ["pydantic>=2.7", "pyyaml>=6.0"]
 dev = ["pytest>=8.0", "ruff>=0.6"]
 
 [project.scripts]
-didactopus-course-ingest = "didactopus.main:main"
+didactopus-topic-ingest = "didactopus.main:main"
 
 [tool.setuptools.packages.find]
 where = ["src"]
@@ -3,6 +3,15 @@ from pydantic import BaseModel, Field
 import yaml
 
 
+class DocumentAdaptersConfig(BaseModel):
+    allow_pdf: bool = True
+    allow_docx: bool = True
+    allow_pptx: bool = True
+    allow_html: bool = True
+    allow_markdown: bool = True
+    allow_text: bool = True
+
+
 class CourseIngestConfig(BaseModel):
     default_pack_author: str = "Unknown"
     default_license: str = "REVIEW-REQUIRED"
@@ -10,23 +19,17 @@ class CourseIngestConfig(BaseModel):
    max_terms_per_lesson: int = 8


class RulePolicyConfig(BaseModel):
    enable_prerequisite_order_rule: bool = True
    enable_duplicate_term_merge_rule: bool = True
    enable_project_detection_rule: bool = True
    enable_review_flags: bool = True


class MultisourceConfig(BaseModel):
    detect_duplicate_lessons: bool = True


class CrossCourseConfig(BaseModel):
    detect_title_overlaps: bool = True
    detect_term_conflicts: bool = True
    detect_order_conflicts: bool = True
    merge_same_named_lessons: bool = True


class AppConfig(BaseModel):
    document_adapters: DocumentAdaptersConfig = Field(default_factory=DocumentAdaptersConfig)
    course_ingest: CourseIngestConfig = Field(default_factory=CourseIngestConfig)
    rule_policy: RulePolicyConfig = Field(default_factory=RulePolicyConfig)
    multisource: MultisourceConfig = Field(default_factory=MultisourceConfig)
    cross_course: CrossCourseConfig = Field(default_factory=CrossCourseConfig)


def load_config(path: str | Path) -> AppConfig:

@@ -1,8 +1,21 @@
 from __future__ import annotations
 
 from pydantic import BaseModel, Field
 
 
+class Section(BaseModel):
+    heading: str
+    body: str = ""
+
+
+class NormalizedDocument(BaseModel):
+    source_path: str
+    source_type: str
+    title: str = ""
+    text: str = ""
+    sections: list[Section] = Field(default_factory=list)
+    metadata: dict = Field(default_factory=dict)
+
+
 class Lesson(BaseModel):
     title: str
     body: str = ""
@@ -17,21 +30,18 @@ class Module(BaseModel):
     lessons: list[Lesson] = Field(default_factory=list)
 
 
-class NormalizedSourceRecord(BaseModel):
-    source_name: str
-    source_type: str
-    source_path: str
-    title: str = ""
-    modules: list[Module] = Field(default_factory=list)
-
-
 class NormalizedCourse(BaseModel):
     title: str
     source_name: str = ""
     source_url: str = ""
     rights_note: str = ""
     modules: list[Module] = Field(default_factory=list)
-    source_records: list[NormalizedSourceRecord] = Field(default_factory=list)
+    source_records: list[NormalizedDocument] = Field(default_factory=list)
 
 
+class TopicBundle(BaseModel):
+    topic_title: str
+    courses: list[NormalizedCourse] = Field(default_factory=list)
+
+
 class ConceptCandidate(BaseModel):
@@ -40,6 +50,7 @@ class ConceptCandidate(BaseModel):
     description: str = ""
     source_modules: list[str] = Field(default_factory=list)
     source_lessons: list[str] = Field(default_factory=list)
+    source_courses: list[str] = Field(default_factory=list)
     prerequisites: list[str] = Field(default_factory=list)
     mastery_signals: list[str] = Field(default_factory=list)

@@ -0,0 +1,50 @@

from __future__ import annotations

from collections import defaultdict
from .course_schema import NormalizedCourse, ConceptCandidate


def detect_title_overlaps(course: NormalizedCourse) -> list[str]:
    lesson_to_sources = defaultdict(set)
    for module in course.modules:
        for lesson in module.lessons:
            for src in lesson.source_refs:
                lesson_to_sources[lesson.title.lower()].add(src)
    flags = []
    for title, sources in lesson_to_sources.items():
        if len(sources) > 1:
            flags.append(f"Lesson title '{title}' appears across multiple sources: {', '.join(sorted(sources))}")
    return flags


def detect_term_conflicts(course: NormalizedCourse) -> list[str]:
    term_to_lessons = defaultdict(set)
    for module in course.modules:
        for lesson in module.lessons:
            for term in lesson.key_terms:
                term_to_lessons[term.lower()].add(lesson.title)
    flags = []
    for term, lessons in term_to_lessons.items():
        if len(lessons) > 1:
            flags.append(f"Key term '{term}' appears in multiple lesson contexts: {', '.join(sorted(lessons))}")
    return flags


def detect_order_conflicts(course: NormalizedCourse) -> list[str]:
    # Placeholder heuristic: if same lesson title appears in multiple source_refs, flag for order review.
    flags = []
    for module in course.modules:
        for lesson in module.lessons:
            if len(set(lesson.source_refs)) > 1:
                flags.append(f"Lesson '{lesson.title}' was merged from multiple sources; review ordering assumptions.")
    return flags


def detect_thin_concepts(concepts: list[ConceptCandidate]) -> list[str]:
    flags = []
    for concept in concepts:
        if len(concept.description.strip()) < 20:
            flags.append(f"Concept '{concept.title}' has a very thin description.")
        if not concept.mastery_signals:
            flags.append(f"Concept '{concept.title}' has no extracted mastery signals.")
    return flags
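The thin-concept heuristic above is easy to check in isolation. Here it is re-implemented standalone on plain dicts (the real function takes `ConceptCandidate` models), purely for demonstration:

```python
# Standalone re-implementation of the thin-concept heuristic for illustration;
# the repo version operates on ConceptCandidate pydantic models instead of dicts.

def flag_thin(concepts: list[dict]) -> list[str]:
    flags = []
    for c in concepts:
        # Descriptions shorter than 20 characters are considered "thin".
        if len(c["description"].strip()) < 20:
            flags.append(f"Concept '{c['title']}' has a very thin description.")
        # A concept with no mastery signals cannot drive checkpoints.
        if not c["mastery_signals"]:
            flags.append(f"Concept '{c['title']}' has no extracted mastery signals.")
    return flags
```

A concept can trip both checks at once, so one weak concept may contribute two conflict-report lines.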
@@ -0,0 +1,141 @@

from __future__ import annotations

from pathlib import Path
import re

from .course_schema import NormalizedDocument, Section


def _title_from_path(path: str | Path) -> str:
    p = Path(path)
    return p.stem.replace("_", " ").replace("-", " ").title()


def _simple_section_split(text: str) -> list[Section]:
    sections = []
    current_heading = "Main"
    current_lines = []
    for line in text.splitlines():
        if re.match(r"^(#{1,3})\s+", line):
            if current_lines:
                sections.append(Section(heading=current_heading, body="\n".join(current_lines).strip()))
            current_heading = re.sub(r"^(#{1,3})\s+", "", line).strip()
            current_lines = []
        else:
            current_lines.append(line)
    if current_lines:
        sections.append(Section(heading=current_heading, body="\n".join(current_lines).strip()))
    return sections


def read_textish(path: str | Path) -> str:
    return Path(path).read_text(encoding="utf-8")


def adapt_markdown(path: str | Path) -> NormalizedDocument:
    text = read_textish(path)
    return NormalizedDocument(
        source_path=str(path),
        source_type="markdown",
        title=_title_from_path(path),
        text=text,
        sections=_simple_section_split(text),
        metadata={},
    )


def adapt_text(path: str | Path) -> NormalizedDocument:
    text = read_textish(path)
    return NormalizedDocument(
        source_path=str(path),
        source_type="text",
        title=_title_from_path(path),
        text=text,
        sections=_simple_section_split(text),
        metadata={},
    )


def adapt_html(path: str | Path) -> NormalizedDocument:
    raw = read_textish(path)
    text = re.sub(r"<[^>]+>", " ", raw)
    text = re.sub(r"\s+", " ", text).strip()
    return NormalizedDocument(
        source_path=str(path),
        source_type="html",
        title=_title_from_path(path),
        text=text,
        sections=[Section(heading="HTML Extract", body=text)],
        metadata={"extraction": "stub-html-strip"},
    )


def adapt_pdf(path: str | Path) -> NormalizedDocument:
    # Stub: in a real implementation, plug in PDF text extraction here.
    text = read_textish(path)
    return NormalizedDocument(
        source_path=str(path),
        source_type="pdf",
        title=_title_from_path(path),
        text=text,
        sections=_simple_section_split(text),
        metadata={"extraction": "stub-pdf-text"},
    )


def adapt_docx(path: str | Path) -> NormalizedDocument:
    # Stub: in a real implementation, plug in DOCX extraction here.
    text = read_textish(path)
    return NormalizedDocument(
        source_path=str(path),
        source_type="docx",
        title=_title_from_path(path),
        text=text,
        sections=_simple_section_split(text),
        metadata={"extraction": "stub-docx-text"},
    )


def adapt_pptx(path: str | Path) -> NormalizedDocument:
    # Stub: in a real implementation, plug in PPTX extraction here.
    text = read_textish(path)
    return NormalizedDocument(
        source_path=str(path),
        source_type="pptx",
        title=_title_from_path(path),
        text=text,
        sections=_simple_section_split(text),
        metadata={"extraction": "stub-pptx-text"},
    )


def detect_adapter(path: str | Path) -> str:
    p = Path(path)
    suffix = p.suffix.lower()
    if suffix == ".md":
        return "markdown"
    if suffix in {".txt"}:
        return "text"
    if suffix in {".html", ".htm"}:
        return "html"
    if suffix == ".pdf":
        return "pdf"
    if suffix == ".docx":
        return "docx"
    if suffix == ".pptx":
        return "pptx"
    return "text"


def adapt_document(path: str | Path) -> NormalizedDocument:
    adapter = detect_adapter(path)
    if adapter == "markdown":
        return adapt_markdown(path)
    if adapter == "html":
        return adapt_html(path)
    if adapter == "pdf":
        return adapt_pdf(path)
    if adapter == "docx":
        return adapt_docx(path)
    if adapter == "pptx":
        return adapt_pptx(path)
    return adapt_text(path)
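The heading-split heuristic in `_simple_section_split` above is worth a standalone check. Here is the same logic re-implemented without the `Section` model (tuples instead of pydantic objects), purely for demonstration:

```python
import re

# Re-implementation of the _simple_section_split heuristic from the diff above,
# returning (heading, body) tuples so its behavior is easy to verify in isolation.
def split_sections(text: str) -> list[tuple[str, str]]:
    sections, heading, lines = [], "Main", []
    for line in text.splitlines():
        if re.match(r"^(#{1,3})\s+", line):
            # Flush the accumulated body before starting a new section.
            if lines:
                sections.append((heading, "\n".join(lines).strip()))
            heading = re.sub(r"^(#{1,3})\s+", "", line).strip()
            lines = []
        else:
            lines.append(line)
    if lines:
        sections.append((heading, "\n".join(lines).strip()))
    return sections
```

Note one consequence of the `if lines:` guard: a heading immediately followed by another heading produces no section for the first, which is part of why this is labeled a conservative stub.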
@@ -4,18 +4,19 @@ import argparse
 from pathlib import Path
 
 from .config import load_config
-from .course_ingest import parse_source_file, merge_source_records, extract_concept_candidates
+from .document_adapters import adapt_document
+from .topic_ingest import document_to_course, build_topic_bundle, merge_courses_into_topic_course, extract_concept_candidates
+from .cross_course_conflicts import detect_title_overlaps, detect_term_conflicts, detect_order_conflicts, detect_thin_concepts
 from .rule_policy import RuleContext, build_default_rules, run_rules
-from .conflict_report import detect_duplicate_lessons, detect_term_conflicts, detect_thin_concepts
 from .pack_emitter import build_draft_pack, write_draft_pack
 
 
 def build_parser() -> argparse.ArgumentParser:
-    parser = argparse.ArgumentParser(description="Didactopus multi-source course-to-pack ingestion pipeline")
-    parser.add_argument("--inputs", nargs="+", required=True, help="Input source files")
-    parser.add_argument("--title", required=True, help="Course or topic title")
+    parser = argparse.ArgumentParser(description="Didactopus document-adapter and cross-course topic ingestion")
+    parser.add_argument("--inputs", nargs="+", required=True, help="Document inputs")
+    parser.add_argument("--title", required=True, help="Topic title")
     parser.add_argument("--rights-note", default="REVIEW REQUIRED")
-    parser.add_argument("--output-dir", default="generated-pack")
+    parser.add_argument("--output-dir", default="generated-topic-pack")
     parser.add_argument("--config", default="configs/config.example.yaml")
     return parser
@@ -24,33 +25,30 @@ def main() -> None:
     args = build_parser().parse_args()
     config = load_config(args.config)
 
-    records = [parse_source_file(path, title=args.title) for path in args.inputs]
-    course = merge_source_records(
-        records=records,
-        course_title=args.title,
-        rights_note=args.rights_note,
-        merge_same_named_lessons=config.multisource.merge_same_named_lessons,
+    docs = [adapt_document(path) for path in args.inputs]
+    courses = [document_to_course(doc, course_title=args.title) for doc in docs]
+    topic = build_topic_bundle(args.title, courses)
+    merged_course = merge_courses_into_topic_course(
+        topic_bundle=topic,
+        merge_same_named_lessons=config.cross_course.merge_same_named_lessons,
     )
-    concepts = extract_concept_candidates(course)
-    context = RuleContext(course=course, concepts=concepts)
+    concepts = extract_concept_candidates(merged_course)
 
-    rules = build_default_rules(
-        enable_prereq=config.rule_policy.enable_prerequisite_order_rule,
-        enable_merge=config.rule_policy.enable_duplicate_term_merge_rule,
-        enable_projects=config.rule_policy.enable_project_detection_rule,
-        enable_review=config.rule_policy.enable_review_flags,
-    )
+    context = RuleContext(course=merged_course, concepts=concepts)
+    rules = build_default_rules()
     run_rules(context, rules)
 
     conflicts = []
-    if config.multisource.detect_duplicate_lessons:
-        conflicts.extend(detect_duplicate_lessons(course))
-    if config.multisource.detect_term_conflicts:
-        conflicts.extend(detect_term_conflicts(course))
+    if config.cross_course.detect_title_overlaps:
+        conflicts.extend(detect_title_overlaps(merged_course))
+    if config.cross_course.detect_term_conflicts:
+        conflicts.extend(detect_term_conflicts(merged_course))
+    if config.cross_course.detect_order_conflicts:
+        conflicts.extend(detect_order_conflicts(merged_course))
     conflicts.extend(detect_thin_concepts(context.concepts))
 
     draft = build_draft_pack(
-        course=course,
+        course=merged_course,
         concepts=context.concepts,
         author=config.course_ingest.default_pack_author,
         license_name=config.course_ingest.default_license,
@@ -59,10 +57,11 @@ def main() -> None:
     )
     write_draft_pack(draft, args.output_dir)
 
-    print("== Didactopus Multi-Source Course Ingest ==")
-    print(f"Course: {course.title}")
-    print(f"Sources: {len(records)}")
-    print(f"Modules: {len(course.modules)}")
+    print("== Didactopus Cross-Course Topic Ingest ==")
+    print(f"Topic: {args.title}")
+    print(f"Documents: {len(docs)}")
+    print(f"Courses: {len(courses)}")
+    print(f"Merged modules: {len(merged_course.modules)}")
     print(f"Concept candidates: {len(context.concepts)}")
     print(f"Review flags: {len(context.review_flags)}")
     print(f"Conflicts: {len(conflicts)}")
@@ -15,7 +15,7 @@ def build_draft_pack(course: NormalizedCourse, concepts: list[ConceptCandidate],
         "schema_version": "1",
         "didactopus_min_version": "0.1.0",
         "didactopus_max_version": "0.9.99",
-        "description": f"Draft pack generated from multi-source course inputs for '{course.title}'.",
+        "description": f"Draft topic pack generated from multi-course inputs for '{course.title}'.",
         "author": author,
         "license": license_name,
         "dependencies": [],
@@ -64,7 +64,7 @@ def build_draft_pack(course: NormalizedCourse, concepts: list[ConceptCandidate],
     attribution = {
         "rights_note": course.rights_note,
         "sources": [
-            {"source_name": src.source_name, "source_type": src.source_type, "source_path": src.source_path}
+            {"source_path": src.source_path, "source_type": src.source_type, "title": src.title}
             for src in course.source_records
         ],
     }
@@ -88,11 +88,8 @@ def write_draft_pack(pack: DraftPack, outdir: str | Path) -> None:
    (out / "roadmap.yaml").write_text(yaml.safe_dump(pack.roadmap, sort_keys=False), encoding="utf-8")
    (out / "projects.yaml").write_text(yaml.safe_dump(pack.projects, sort_keys=False), encoding="utf-8")
    (out / "rubrics.yaml").write_text(yaml.safe_dump(pack.rubrics, sort_keys=False), encoding="utf-8")

    review_lines = ["# Review Report", ""] + [f"- {flag}" for flag in pack.review_report] if pack.review_report else ["# Review Report", "", "- none"]
    (out / "review_report.md").write_text("\n".join(review_lines), encoding="utf-8")

    conflict_lines = ["# Conflict Report", ""] + [f"- {flag}" for flag in pack.conflicts] if pack.conflicts else ["# Conflict Report", "", "- none"]
    (out / "conflict_report.md").write_text("\n".join(conflict_lines), encoding="utf-8")

    (out / "license_attribution.json").write_text(json.dumps(pack.attribution, indent=2), encoding="utf-8")
@@ -39,6 +39,7 @@ def duplicate_term_merge_rule(context: RuleContext) -> None:
         if key in seen:
             seen[key].source_modules.extend(x for x in concept.source_modules if x not in seen[key].source_modules)
             seen[key].source_lessons.extend(x for x in concept.source_lessons if x not in seen[key].source_lessons)
+            seen[key].source_courses.extend(x for x in concept.source_courses if x not in seen[key].source_courses)
             if concept.description and len(seen[key].description) < len(concept.description):
                 seen[key].description = concept.description
         else:
@@ -0,0 +1,126 @@
+from __future__ import annotations
+
+import re
+from collections import defaultdict
+from .course_schema import NormalizedDocument, NormalizedCourse, Module, Lesson, TopicBundle, ConceptCandidate
+
+
+def slugify(text: str) -> str:
+    cleaned = re.sub(r"[^a-zA-Z0-9]+", "-", text.strip().lower()).strip("-")
+    return cleaned or "untitled"
+
+
+def extract_key_terms(text: str, min_term_length: int = 4, max_terms: int = 8) -> list[str]:
+    candidates = re.findall(r"\b[A-Z][A-Za-z0-9\-]{%d,}\b" % (min_term_length - 1), text)
+    seen = set()
+    out = []
+    for term in candidates:
+        if term not in seen:
+            seen.add(term)
+            out.append(term)
+        if len(out) >= max_terms:
+            break
+    return out
+
+
+def document_to_course(doc: NormalizedDocument, course_title: str) -> NormalizedCourse:
+    # Conservative mapping: each section becomes a lesson; all lessons go into one module.
+    lessons = []
+    for section in doc.sections:
+        body = section.body.strip()
+        lines = body.splitlines()
+        objectives = []
+        exercises = []
+        for line in lines:
+            low = line.lower().strip()
+            if low.startswith("objective:"):
+                objectives.append(line.split(":", 1)[1].strip())
+            if low.startswith("exercise:"):
+                exercises.append(line.split(":", 1)[1].strip())
+        lessons.append(
+            Lesson(
+                title=section.heading.strip() or "Untitled Lesson",
+                body=body,
+                objectives=objectives,
+                exercises=exercises,
+                key_terms=extract_key_terms(section.heading + "\n" + body),
+                source_refs=[doc.source_path],
+            )
+        )
+    module = Module(title=f"Imported from {doc.source_type.upper()}", lessons=lessons)
+    return NormalizedCourse(title=course_title, modules=[module], source_records=[doc])
+
+
+def build_topic_bundle(topic_title: str, courses: list[NormalizedCourse]) -> TopicBundle:
+    return TopicBundle(topic_title=topic_title, courses=courses)
+
+
+def merge_courses_into_topic_course(topic_bundle: TopicBundle, merge_same_named_lessons: bool = True) -> NormalizedCourse:
+    modules_by_title: dict[str, Module] = {}
+    source_records = []
+    for course in topic_bundle.courses:
+        source_records.extend(course.source_records)
+        for module in course.modules:
+            target_module = modules_by_title.setdefault(module.title, Module(title=module.title, lessons=[]))
+            if merge_same_named_lessons:
+                lesson_map = {lesson.title: lesson for lesson in target_module.lessons}
+                for lesson in module.lessons:
+                    if lesson.title in lesson_map:
+                        existing = lesson_map[lesson.title]
+                        if lesson.body and lesson.body not in existing.body:
+                            existing.body = (existing.body + "\n\n" + lesson.body).strip()
+                        for x in lesson.objectives:
+                            if x not in existing.objectives:
+                                existing.objectives.append(x)
+                        for x in lesson.exercises:
+                            if x not in existing.exercises:
+                                existing.exercises.append(x)
+                        for x in lesson.key_terms:
+                            if x not in existing.key_terms:
+                                existing.key_terms.append(x)
+                        for x in lesson.source_refs:
+                            if x not in existing.source_refs:
+                                existing.source_refs.append(x)
+                    else:
+                        target_module.lessons.append(lesson)
+            else:
+                target_module.lessons.extend(module.lessons)
+    return NormalizedCourse(title=topic_bundle.topic_title, modules=list(modules_by_title.values()), source_records=source_records)
+
+
+def extract_concept_candidates(course: NormalizedCourse) -> list[ConceptCandidate]:
+    concepts = []
+    seen_ids = set()
+    for module in course.modules:
+        for lesson in module.lessons:
+            cid = slugify(lesson.title)
+            if cid not in seen_ids:
+                seen_ids.add(cid)
+                concepts.append(
+                    ConceptCandidate(
+                        id=cid,
+                        title=lesson.title,
+                        description=lesson.body[:240].strip(),
+                        source_modules=[module.title],
+                        source_lessons=[lesson.title],
+                        source_courses=list(lesson.source_refs),
+                        mastery_signals=list(lesson.objectives[:3] or lesson.exercises[:2]),
+                    )
+                )
+            for term in lesson.key_terms:
+                tid = slugify(term)
+                if tid in seen_ids:
+                    continue
+                seen_ids.add(tid)
+                concepts.append(
+                    ConceptCandidate(
+                        id=tid,
+                        title=term,
+                        description=f"Candidate concept extracted from lesson '{lesson.title}'.",
+                        source_modules=[module.title],
+                        source_lessons=[lesson.title],
+                        source_courses=list(lesson.source_refs),
+                        mastery_signals=list(lesson.objectives[:2]),
+                    )
+                )
+    return concepts

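The two helpers at the top of this file depend only on `re`, so their behavior can be checked directly. A minimal sketch reproducing `slugify` and `extract_key_terms` as defined above, with example calls:

```python
import re

def slugify(text: str) -> str:
    # Collapse runs of non-alphanumerics into single hyphens.
    cleaned = re.sub(r"[^a-zA-Z0-9]+", "-", text.strip().lower()).strip("-")
    return cleaned or "untitled"

def extract_key_terms(text: str, min_term_length: int = 4, max_terms: int = 8) -> list:
    # Capitalized words of at least min_term_length characters, deduplicated in order.
    candidates = re.findall(r"\b[A-Z][A-Za-z0-9\-]{%d,}\b" % (min_term_length - 1), text)
    seen, out = set(), []
    for term in candidates:
        if term not in seen:
            seen.add(term)
            out.append(term)
        if len(out) >= max_terms:
            break
    return out

print(slugify("Bayesian Updating!"))                    # bayesian-updating
print(extract_key_terms("Prior and Posterior meet Bayes."))  # ['Prior', 'Posterior', 'Bayes']
```

Note the capitalization heuristic: lowercase words such as "and" and "meet" never become key terms, which keeps the candidate list short but also misses uncapitalized jargon — consistent with the scaffold-level stance stated in the README.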
@@ -0,0 +1,19 @@
+from pathlib import Path
+from didactopus.document_adapters import adapt_document
+from didactopus.topic_ingest import document_to_course, build_topic_bundle, merge_courses_into_topic_course, extract_concept_candidates
+from didactopus.cross_course_conflicts import detect_title_overlaps, detect_term_conflicts, detect_order_conflicts, detect_thin_concepts
+
+
+def test_conflict_detection(tmp_path: Path) -> None:
+    a = tmp_path / "a.md"
+    b = tmp_path / "b.md"
+    a.write_text("# T\n\n## M1\n### Bayesian Updating\nPrior and Posterior appear here.", encoding="utf-8")
+    b.write_text("# T\n\n## M2\n### Bayesian Updating\nPrior and Posterior appear again.", encoding="utf-8")
+    docs = [adapt_document(a), adapt_document(b)]
+    courses = [document_to_course(doc, "Topic") for doc in docs]
+    merged = merge_courses_into_topic_course(build_topic_bundle("Topic", courses), merge_same_named_lessons=False)
+    concepts = extract_concept_candidates(merged)
+    assert isinstance(detect_title_overlaps(merged), list)
+    assert isinstance(detect_term_conflicts(merged), list)
+    assert isinstance(detect_order_conflicts(merged), list)
+    assert isinstance(detect_thin_concepts(concepts), list)

@@ -0,0 +1,18 @@
+from pathlib import Path
+from didactopus.document_adapters import adapt_document, detect_adapter
+
+
+def test_detect_adapter() -> None:
+    assert detect_adapter("a.md") == "markdown"
+    assert detect_adapter("b.html") == "html"
+    assert detect_adapter("c.pdf") == "pdf"
+    assert detect_adapter("d.docx") == "docx"
+    assert detect_adapter("e.pptx") == "pptx"
+
+
+def test_adapt_markdown(tmp_path: Path) -> None:
+    p = tmp_path / "x.md"
+    p.write_text("# T\n\n## A\nBody", encoding="utf-8")
+    doc = adapt_document(p)
+    assert doc.source_type == "markdown"
+    assert len(doc.sections) >= 1

@@ -1,17 +1,20 @@
 from pathlib import Path
-from didactopus.course_ingest import parse_source_file, merge_source_records, extract_concept_candidates
+from didactopus.document_adapters import adapt_document
+from didactopus.topic_ingest import document_to_course, build_topic_bundle, merge_courses_into_topic_course, extract_concept_candidates
 from didactopus.rule_policy import RuleContext, build_default_rules, run_rules
 from didactopus.pack_emitter import build_draft_pack, write_draft_pack
 
 
-def test_emit_multisource_pack(tmp_path: Path) -> None:
+def test_emit_topic_pack(tmp_path: Path) -> None:
     src = tmp_path / "course.md"
-    src.write_text("# C\n\n## M1\n### Lesson A\n- Objective: Explain Topic A.\n- Exercise: Do task A.\nTopic A body.", encoding="utf-8")
-    course = merge_source_records([parse_source_file(src, title="Course")], course_title="Course")
-    concepts = extract_concept_candidates(course)
-    ctx = RuleContext(course=course, concepts=concepts)
+    src.write_text("# T\n\n## M\n### L\nExercise: Do task A.\nTopic A body.", encoding="utf-8")
+    doc = adapt_document(src)
+    course = document_to_course(doc, "Topic")
+    merged = merge_courses_into_topic_course(build_topic_bundle("Topic", [course]))
+    concepts = extract_concept_candidates(merged)
+    ctx = RuleContext(course=merged, concepts=concepts)
     run_rules(ctx, build_default_rules())
-    draft = build_draft_pack(course, ctx.concepts, "Tester", "REVIEW", ctx.review_flags, [])
+    draft = build_draft_pack(merged, ctx.concepts, "Tester", "REVIEW", ctx.review_flags, [])
     write_draft_pack(draft, tmp_path / "out")
     assert (tmp_path / "out" / "pack.yaml").exists()
     assert (tmp_path / "out" / "conflict_report.md").exists()

@@ -0,0 +1,26 @@
+from pathlib import Path
+from didactopus.document_adapters import adapt_document
+from didactopus.topic_ingest import document_to_course, build_topic_bundle, merge_courses_into_topic_course, extract_concept_candidates
+
+
+def test_cross_course_merge(tmp_path: Path) -> None:
+    a = tmp_path / "a.md"
+    b = tmp_path / "b.docx"
+    a.write_text("# T\n\n## M\n### L1\nBody A", encoding="utf-8")
+    b.write_text("# T\n\n## M\n### L1\nBody B", encoding="utf-8")
+
+    docs = [adapt_document(a), adapt_document(b)]
+    courses = [document_to_course(doc, "Topic") for doc in docs]
+    topic = build_topic_bundle("Topic", courses)
+    merged = merge_courses_into_topic_course(topic)
+    assert len(merged.modules) >= 1
+    assert len(merged.modules[0].lessons) == 1
+
+
+def test_extract_concepts(tmp_path: Path) -> None:
+    a = tmp_path / "a.md"
+    a.write_text("# T\n\n## M\n### Lesson A\nObjective: Explain Topic A.\nBody.", encoding="utf-8")
+    doc = adapt_document(a)
+    course = document_to_course(doc, "Topic")
+    concepts = extract_concept_candidates(course)
+    assert len(concepts) >= 1
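These tests lean on the `Objective:` / `Exercise:` prefix scan inside `document_to_course`. A standalone sketch of just that scan, assuming (as the implementation does) that matching is case-insensitive and everything after the first colon is the payload:

```python
# Hypothetical excerpt mirroring the line scan in document_to_course.
body = "Objective: Explain Topic A.\nExercise: Do task A.\nTopic A body."
objectives, exercises = [], []
for line in body.splitlines():
    low = line.lower().strip()
    if low.startswith("objective:"):
        # split(":", 1) keeps any later colons inside the payload intact.
        objectives.append(line.split(":", 1)[1].strip())
    if low.startswith("exercise:"):
        exercises.append(line.split(":", 1)[1].strip())

print(objectives)  # ['Explain Topic A.']
print(exercises)   # ['Do task A.']
```

Plain narrative lines such as "Topic A body." match neither prefix and stay in the lesson body only.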