Initial pass at making Didactopus multilingual.

This commit is contained in:
welsberr 2026-03-17 19:29:12 -04:00
parent aac1e5c8bc
commit 8862e00194
25 changed files with 1215 additions and 4 deletions

View File

@@ -173,6 +173,7 @@ The main mentor-style backend now has a dedicated demo entry point:
```bash
python -m didactopus.learner_session_demo
python -m didactopus.learner_session_demo --language es
```
That demo builds a graph-grounded session from the MIT OCW skill bundle and emits:
@@ -183,6 +184,8 @@ That demo builds a graph-grounded session from the MIT OCW skill bundle and emit
- evaluator feedback
- a recommended next step
The learner-facing CLI now treats language as a first-class parameter, so the same session flow can target another output language while preserving the English source-grounding context.
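A minimal sketch of how a language parameter can be threaded into a role prompt while keeping grounding in English. The repository's real helper lives in `didactopus.language_support`; the implementation and wording below are illustrative assumptions, not the shipped code:

```python
# Illustrative sketch only: map a language code to an explicit response-language
# instruction appended to the role prompt. Grounding context stays in English.
LANGUAGE_NAMES = {"en": "English", "es": "Spanish", "fr": "French"}

def response_language_instruction(language: str) -> str:
    """Return an instruction line requesting the given output language."""
    name = LANGUAGE_NAMES.get(language, language)
    return (f"Respond to the learner in {name}. "
            "Keep quoted source fragments in their original language.")

def build_role_prompt(base_prompt: str, language: str = "en") -> str:
    """Append the language instruction unless the default language is requested."""
    if language == "en":
        return base_prompt
    return f"{base_prompt}\n\n{response_language_instruction(language)}"
```

This keeps language an explicit session parameter rather than an ad-hoc prompt edit.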
The point of this module is as much architectural as demonstrative: it is the session core that future accessibility, model-benchmarking, and voice-interaction work should build on.
The learner-session demo also writes accessible companion outputs:
@@ -198,6 +201,14 @@ python -m didactopus.model_bench
It evaluates local-model adequacy for the `mentor`, `practice`, and `evaluator` roles using the MIT OCW skill bundle as grounded context.
There is also now a Didactopus-specific arena for comparing provider/model/prompt combinations:
```bash
python -m didactopus.arena --arena-spec configs/arena.example.yaml
```
That produces rankings, a human review queue, and an optional LLM-written comparative summary for reviewer triage.
### Easiest LLM setup paths
If you want live LLM-backed Didactopus behavior without the complexity of RoleMesh, start with one of these:
@@ -466,6 +477,7 @@ What remains heuristic or lightweight:
## Recommended Reading
- [docs/roadmap.md](docs/roadmap.md)
- [docs/arena.md](docs/arena.md)
- [docs/learner-accessibility.md](docs/learner-accessibility.md)
- [docs/local-model-benchmark.md](docs/local-model-benchmark.md)
- [docs/model-provider-setup.md](docs/model-provider-setup.md)

View File

@@ -0,0 +1,18 @@
candidates:
- name: "stub-baseline"
config: "configs/config.example.yaml"
prompt_variant: "baseline"
language: "en"
- name: "stub-strict-grounding"
config: "configs/config.example.yaml"
prompt_variant: "strict_grounding"
language: "es"
- name: "stub-trust-preserving"
config: "configs/config.example.yaml"
prompt_variant: "trust_preserving"
language: "fr"
review:
enabled: true
config: "configs/config.example.yaml"
role: "mentor"

docs/arena.md Normal file (96 lines)
View File

@@ -0,0 +1,96 @@
# Didactopus Arena
The Didactopus arena compares candidate combinations of:
- provider configuration
- model choice
- role prompt variant
- output language
It is not a generic chatbot arena. It is a Didactopus-specific behavior arena for grounded learner tasks.
## What It Does
For each candidate, the arena runs the current graph-grounded learner-task shape for:
- `mentor`
- `practice`
- `evaluator`
It then produces:
- deterministic role scores
- candidate rankings
- a human review queue
- an optional LLM-written review summary to help the human reviewer triage results
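The deterministic scoring can be sketched as follows. Averaging the per-role adequacy scores is consistent with the example outputs in this commit (0.65, 1.0, and 0.35 average to 0.667, the reported overall score); the rating thresholds below are assumptions, not the shipped values:

```python
def overall_score(role_scores: list[float]) -> float:
    """Mean of the per-role adequacy scores, rounded as in the example outputs."""
    return round(sum(role_scores) / len(role_scores), 3)

def overall_rating(score: float, adequate: float = 0.8, borderline: float = 0.5) -> str:
    # Threshold values here are illustrative assumptions.
    if score >= adequate:
        return "adequate"
    if score >= borderline:
        return "borderline"
    return "inadequate"
```

With the stub candidates' role scores, this reproduces the `borderline (0.667)` rankings shown in the example report.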
## Why This Exists
Didactopus needs a practical way to improve:
- local model choice
- prompt variants
- trust-preserving behavior
- source-grounded behavior
This is an aid to benchmarking and review, not an automatic certification system.
## How To Run It
Use the example spec:
```bash
python -m didactopus.arena --arena-spec configs/arena.example.yaml
```
That writes outputs under:
- `examples/arena/`
## Spec Shape
The arena spec is a YAML file with:
- `candidates`
- `review`
Example candidate fields:
- `name`
- `config`
- `prompt_variant`
- `language`
Example review fields:
- `enabled`
- `config`
- `role`
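Once parsed, the spec is a plain mapping, so a small validation pass is straightforward. The field names below come from the example spec; the validation logic itself is a sketch, not the arena's actual behavior:

```python
REQUIRED_CANDIDATE_FIELDS = {"name", "config", "prompt_variant", "language"}

def validate_spec(spec: dict) -> list[str]:
    """Return a list of problems; an empty list means the spec looks usable."""
    problems = []
    for i, candidate in enumerate(spec.get("candidates", [])):
        missing = REQUIRED_CANDIDATE_FIELDS - candidate.keys()
        if missing:
            problems.append(f"candidate {i}: missing {sorted(missing)}")
    review = spec.get("review", {})
    if review.get("enabled") and "role" not in review:
        problems.append("review: enabled but no role set")
    return problems
```

Running this before launching candidate runs catches spec typos early instead of midway through an arena pass.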
## Current Prompt Variants
- `baseline`
- `strict_grounding`
- `trust_preserving`
- `concise`
These are applied to Didactopus role prompts, not to arbitrary raw prompt strings.
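One way to picture the layering: a variant contributes an instruction that wraps the role prompt. Only the variant names below come from this document; the instruction wording and the prefixing scheme are invented for illustration:

```python
# Illustrative sketch: variants as instructions layered onto a role prompt.
# The instruction text here is invented; only the variant names are real.
PROMPT_VARIANTS = {
    "baseline": "",
    "strict_grounding": "Use only the supplied source fragments; say so when they are insufficient.",
    "trust_preserving": "State uncertainty explicitly and never overclaim beyond the sources.",
    "concise": "Keep the response short and focused on the learner's immediate question.",
}

def apply_variant(role_prompt: str, variant: str) -> str:
    """Prefix the role prompt with the variant's instruction, if any."""
    instruction = PROMPT_VARIANTS[variant]
    return f"{instruction}\n\n{role_prompt}" if instruction else role_prompt
```

Because the variant wraps a role prompt rather than replacing it, every candidate still runs the same grounded learner-task shape.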
## Outputs
The arena currently writes:
- `arena_results.json`
- `arena_review_queue.json`
- `arena_report.md`
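These outputs are plain JSON, so downstream tooling can consume them directly. A minimal sketch that formats rankings from `arena_results.json` (field names match the example output in this commit; in practice you would `json.load()` the file from `examples/arena/`):

```python
def format_rankings(results: dict) -> list[str]:
    """One line per ranked candidate; field names match arena_results.json."""
    return [
        f'{c["candidate_name"]}: {c["overall_rating"]} ({c["overall_score"]}), language {c["language"]}'
        for c in results["ranked_candidates"]
    ]

# Inline sample shaped like examples/arena/arena_results.json.
sample = {"ranked_candidates": [
    {"candidate_name": "stub-baseline", "overall_rating": "borderline",
     "overall_score": 0.667, "language": "en"},
]}
print("\n".join(format_rankings(sample)))
```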
## Human Review Position
The LLM review summary should be treated as initial triage support only.
The intended order of trust is:
1. deterministic checks
2. arena comparison results
3. LLM comparative summary
4. human reviewer decision

View File

@@ -153,6 +153,7 @@ Run:
```bash
python -m didactopus.learner_session_demo
python -m didactopus.learner_session_demo --language es
```
That demo loads the MIT OCW skill bundle, retrieves grounded concept neighborhoods and source fragments, and emits a single learner session containing:
@@ -166,6 +167,8 @@ That demo loads the MIT OCW skill bundle, retrieves grounded concept neighborhoo
This is the backend shape the repository should now treat as the base for future accessibility, benchmarking, and voice-interaction work.
The learner-facing commands are also starting to expose `--language`, so output language becomes an explicit session parameter rather than an implicit prompt tweak.
## How should I use it if I am taking a course and do not want to hire a tutor?
Use it as a structured study companion:

View File

@@ -106,3 +106,7 @@ As the learner session backend grows, the benchmark should expand to include:
- first-token delay and tokens-per-second capture
- memory and thermal observations on constrained hardware
- accessibility-specific checks for structure and spoken-output quality
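The two timing items above can be captured with `time.perf_counter` around a streamed response. This is an illustrative shape under the assumption that the provider yields tokens as an iterator, not the benchmark's actual implementation:

```python
from time import perf_counter
from typing import Iterator, Optional

def measure_streaming(token_iter: Iterator[str]) -> tuple[Optional[float], float]:
    """Return (first-token delay in seconds, tokens per second) for a token stream."""
    start = perf_counter()
    first_token_delay = None
    count = 0
    for _ in token_iter:
        now = perf_counter()
        if first_token_delay is None:
            first_token_delay = now - start  # delay until the first token arrives
        count += 1
    total = perf_counter() - start
    tokens_per_second = count / total if total > 0 else 0.0
    return first_token_delay, tokens_per_second
```

On constrained hardware these two numbers tend to diverge: first-token delay reflects prompt processing, tokens-per-second reflects sustained generation.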
For model-and-prompt comparison across multiple candidates, use:
- `docs/arena.md`

View File

@@ -37,6 +37,7 @@ Example commands:
```bash
ollama pull llama3.2:3b
python -m didactopus.learner_session_demo --config configs/config.ollama.example.yaml
python -m didactopus.learner_session_demo --config configs/config.ollama.example.yaml --language es
```
If you want a different local model, change:
@@ -69,6 +70,7 @@ Example:
```bash
python -m didactopus.learner_session_demo --config configs/config.openai-compatible.example.yaml
python -m didactopus.learner_session_demo --config configs/config.openai-compatible.example.yaml --language fr
```
## Option 3: RoleMesh Gateway

View File

@@ -0,0 +1,16 @@
# Didactopus Arena Report
- Candidates: 3
## Rankings
- `stub-baseline` via `stub` / prompt variant `baseline`: borderline (0.667), language `en`
- `stub-strict-grounding` via `stub` / prompt variant `strict_grounding`: borderline (0.667), language `es`
- `stub-trust-preserving` via `stub` / prompt variant `trust_preserving`: borderline (0.667), language `fr`
## Human Review Queue
- `stub-baseline`: needs_human_review=True, weak_roles=['mentor', 'evaluator']
- `stub-strict-grounding`: needs_human_review=True, weak_roles=['mentor', 'evaluator']
- `stub-trust-preserving`: needs_human_review=True, weak_roles=['mentor', 'evaluator']
## LLM Review Summary
[stubbed-response] [mentor] Review these Didactopus arena results for a human reviewer. Rank the strongest candidates, identify likely prompt improv

View File

@@ -0,0 +1,202 @@
{
"arena": {
"name": "didactopus-behavior-arena",
"candidate_count": 3
},
"ranked_candidates": [
{
"candidate_name": "stub-baseline",
"config": "configs/config.example.yaml",
"prompt_variant": "baseline",
"language": "en",
"provider": "stub",
"overall_score": 0.667,
"overall_rating": "borderline",
"role_results": [
{
"role": "mentor",
"provider": "stub",
"model_name": "local-demo",
"prompt_variant": "baseline",
"language": "en",
"latency_ms": 0.027,
"adequacy_score": 0.65,
"adequacy_rating": "borderline",
"response_preview": "[stubbed-response] [mentor] Concept: Independent Reasoning and Careful Comparison Prerequisites: Course Notes and Reference Texts Supporting lessons",
"notes": [
"Did not ask a focused learner question."
]
},
{
"role": "practice",
"provider": "stub",
"model_name": "local-demo",
"prompt_variant": "baseline",
"language": "en",
"latency_ms": 0.006,
"adequacy_score": 1.0,
"adequacy_rating": "adequate",
"response_preview": "[stubbed-response] [practice] Concept: Independent Reasoning and Careful Comparison Prerequisites: Course Notes and Reference Texts Supporting lessons",
"notes": []
},
{
"role": "evaluator",
"provider": "stub",
"model_name": "local-demo",
"prompt_variant": "baseline",
"language": "en",
"latency_ms": 0.005,
"adequacy_score": 0.35,
"adequacy_rating": "inadequate",
"response_preview": "[stubbed-response] [evaluator] Concept: Independent Reasoning and Careful Comparison Prerequisites: Course Notes and Reference Texts Supporting lessons",
"notes": [
"Did not acknowledge learner strengths.",
"Did not provide a concrete next step."
]
}
]
},
{
"candidate_name": "stub-strict-grounding",
"config": "configs/config.example.yaml",
"prompt_variant": "strict_grounding",
"language": "es",
"provider": "stub",
"overall_score": 0.667,
"overall_rating": "borderline",
"role_results": [
{
"role": "mentor",
"provider": "stub",
"model_name": "local-demo",
"prompt_variant": "strict_grounding",
"language": "es",
"latency_ms": 0.019,
"adequacy_score": 0.65,
"adequacy_rating": "borderline",
"response_preview": "[stubbed-response] [mentor] Concept: Independent Reasoning and Careful Comparison Prerequisites: Course Notes and Reference Texts Supporting lessons",
"notes": [
"Did not ask a focused learner question."
]
},
{
"role": "practice",
"provider": "stub",
"model_name": "local-demo",
"prompt_variant": "strict_grounding",
"language": "es",
"latency_ms": 0.005,
"adequacy_score": 1.0,
"adequacy_rating": "adequate",
"response_preview": "[stubbed-response] [practice] Concept: Independent Reasoning and Careful Comparison Prerequisites: Course Notes and Reference Texts Supporting lessons",
"notes": []
},
{
"role": "evaluator",
"provider": "stub",
"model_name": "local-demo",
"prompt_variant": "strict_grounding",
"language": "es",
"latency_ms": 0.004,
"adequacy_score": 0.35,
"adequacy_rating": "inadequate",
"response_preview": "[stubbed-response] [evaluator] Concept: Independent Reasoning and Careful Comparison Prerequisites: Course Notes and Reference Texts Supporting lessons",
"notes": [
"Did not acknowledge learner strengths.",
"Did not provide a concrete next step."
]
}
]
},
{
"candidate_name": "stub-trust-preserving",
"config": "configs/config.example.yaml",
"prompt_variant": "trust_preserving",
"language": "fr",
"provider": "stub",
"overall_score": 0.667,
"overall_rating": "borderline",
"role_results": [
{
"role": "mentor",
"provider": "stub",
"model_name": "local-demo",
"prompt_variant": "trust_preserving",
"language": "fr",
"latency_ms": 0.025,
"adequacy_score": 0.65,
"adequacy_rating": "borderline",
"response_preview": "[stubbed-response] [mentor] Concept: Independent Reasoning and Careful Comparison Prerequisites: Course Notes and Reference Texts Supporting lessons",
"notes": [
"Did not ask a focused learner question."
]
},
{
"role": "practice",
"provider": "stub",
"model_name": "local-demo",
"prompt_variant": "trust_preserving",
"language": "fr",
"latency_ms": 0.005,
"adequacy_score": 1.0,
"adequacy_rating": "adequate",
"response_preview": "[stubbed-response] [practice] Concept: Independent Reasoning and Careful Comparison Prerequisites: Course Notes and Reference Texts Supporting lessons",
"notes": []
},
{
"role": "evaluator",
"provider": "stub",
"model_name": "local-demo",
"prompt_variant": "trust_preserving",
"language": "fr",
"latency_ms": 0.005,
"adequacy_score": 0.35,
"adequacy_rating": "inadequate",
"response_preview": "[stubbed-response] [evaluator] Concept: Independent Reasoning and Careful Comparison Prerequisites: Course Notes and Reference Texts Supporting lessons",
"notes": [
"Did not acknowledge learner strengths.",
"Did not provide a concrete next step."
]
}
]
}
],
"review_queue": [
{
"candidate_name": "stub-baseline",
"overall_rating": "borderline",
"overall_score": 0.667,
"needs_human_review": true,
"weak_roles": [
"mentor",
"evaluator"
]
},
{
"candidate_name": "stub-strict-grounding",
"overall_rating": "borderline",
"overall_score": 0.667,
"needs_human_review": true,
"weak_roles": [
"mentor",
"evaluator"
]
},
{
"candidate_name": "stub-trust-preserving",
"overall_rating": "borderline",
"overall_score": 0.667,
"needs_human_review": true,
"weak_roles": [
"mentor",
"evaluator"
]
}
],
"llm_review": {
"provider": "stub",
"model_name": "local-demo",
"role": "mentor",
"summary": "[stubbed-response] [mentor] Review these Didactopus arena results for a human reviewer. Rank the strongest candidates, identify likely prompt improv"
}
}

View File

@@ -0,0 +1,32 @@
[
{
"candidate_name": "stub-baseline",
"overall_rating": "borderline",
"overall_score": 0.667,
"needs_human_review": true,
"weak_roles": [
"mentor",
"evaluator"
]
},
{
"candidate_name": "stub-strict-grounding",
"overall_rating": "borderline",
"overall_score": 0.667,
"needs_human_review": true,
"weak_roles": [
"mentor",
"evaluator"
]
},
{
"candidate_name": "stub-trust-preserving",
"overall_rating": "borderline",
"overall_score": 0.667,
"needs_human_review": true,
"weak_roles": [
"mentor",
"evaluator"
]
}
]

View File

@@ -0,0 +1,118 @@
<!doctype html>
<html lang="en">
<head>
<meta charset="utf-8">
<meta name="viewport" content="width=device-width, initial-scale=1">
<title>Didactopus Learner Session</title>
<style>
:root { color-scheme: light; --bg: #f7f4ed; --panel: #fffdf8; --ink: #1e2b31; --muted: #53656d; --line: #d3c8b7; --accent: #155e63; }
body { margin: 0; font-family: Georgia, 'Times New Roman', serif; background: var(--bg); color: var(--ink); line-height: 1.55; }
a { color: var(--accent); }
.skip { position: absolute; left: 12px; top: 12px; background: #fff; padding: 8px 10px; border: 1px solid var(--line); }
main { max-width: 980px; margin: 0 auto; padding: 24px; }
section { background: var(--panel); border: 1px solid var(--line); border-radius: 16px; padding: 20px; margin-bottom: 18px; }
h1, h2, h3 { line-height: 1.2; }
ol, ul { padding-left: 22px; }
.meta { color: var(--muted); }
.turn { border-top: 1px solid var(--line); padding-top: 12px; margin-top: 12px; }
.turn:first-of-type { border-top: 0; padding-top: 0; margin-top: 0; }
.fragment { background: #f3efe5; padding: 10px; border-radius: 10px; margin: 8px 0; }
.sr-note { color: var(--muted); font-size: 0.95rem; }
</style>
</head>
<body>
<a class="skip" href="#session-main">Skip to learner session</a>
<main id="session-main" aria-label="Didactopus learner session">
<section aria-labelledby="session-title">
<h1 id="session-title">Didactopus Learner Session</h1>
<p class="sr-note">This page is structured for keyboard and screen-reader use. It presents the learner goal, study plan, grounded source fragments, and conversation turns in reading order.</p>
<p><strong>Learner goal:</strong> Help me understand how Shannon entropy leads into channel capacity and thermodynamic entropy.</p>
<p><strong>Source language:</strong> en</p>
<p><strong>Output language:</strong> es</p>
</section>
<section aria-labelledby="study-plan-title">
<h2 id="study-plan-title">Study Plan</h2>
<ol>
<li>
<h3>Independent Reasoning and Careful Comparison</h3>
<p><strong>Status:</strong> mastered</p>
<p><strong>Prerequisites:</strong> Course Notes and Reference Texts</p>
<p><strong>Supporting lessons:</strong> Independent Reasoning and Careful Comparison</p>
<p><strong>Grounding fragments:</strong></p>
<ul>
<li><div class="fragment"><strong>Independent Reasoning and Careful Comparison</strong> (lesson_body)<br>- Objective: Explain why the course requires precise comparison of related but non-identical concepts.
- Exercise: Write a short note distinguishing Shannon entropy, channel capacity, and thermodynamic entropy.
The syllabus framing implies a style of work where analogy is useful but dangerous when used loosely. Learners must compare models carefully, state assumptions, and notice where similar mathematics does not imply identical interpretation.</div></li>
<li><div class="fragment"><strong>Independent Reasoning and Careful Comparison</strong> (objective)<br>Explain why the course requires precise comparison of related but non-identical concepts.</div></li>
</ul>
</li>
<li>
<h3>Thermodynamics and Entropy</h3>
<p><strong>Status:</strong> mastered</p>
<p><strong>Prerequisites:</strong> Cryptography and Information Hiding</p>
<p><strong>Supporting lessons:</strong> Thermodynamics and Entropy</p>
<p><strong>Grounding fragments:</strong></p>
<ul>
<li><div class="fragment"><strong>Thermodynamics and Entropy</strong> (lesson_body)<br>- Objective: Explain how thermodynamic entropy relates to, and differs from, Shannon entropy.
- Exercise: Compare the two entropy notions and identify what is preserved across the analogy.
The course uses entropy as a bridge concept between communication theory and physics while insisting on careful interpretation.</div></li>
<li><div class="fragment"><strong>Thermodynamics and Entropy</strong> (objective)<br>Explain how thermodynamic entropy relates to, and differs from, Shannon entropy.</div></li>
</ul>
</li>
<li>
<h3>Shannon Entropy</h3>
<p><strong>Status:</strong> mastered</p>
<p><strong>Prerequisites:</strong> Counting and Probability</p>
<p><strong>Supporting lessons:</strong> Shannon Entropy</p>
<p><strong>Grounding fragments:</strong></p>
<ul>
<li><div class="fragment"><strong>Shannon Entropy</strong> (lesson_body)<br>- Objective: Explain Shannon entropy as a measure of uncertainty and compare high-entropy and low-entropy sources.
- Exercise: Compute the entropy of a Bernoulli source and interpret the result.
The course then introduces entropy as a quantitative measure of uncertainty for a source model and uses it to reason about representation cost and surprise.</div></li>
<li><div class="fragment"><strong>Shannon Entropy</strong> (objective)<br>Explain Shannon entropy as a measure of uncertainty and compare high-entropy and low-entropy sources.</div></li>
</ul>
</li>
</ol>
</section>
<section aria-labelledby="conversation-title">
<h2 id="conversation-title">Conversation</h2>
<article class="turn" aria-label="Conversation turn">
<h3>Learner Goal</h3>
<p class="meta">Role: user</p>
<p>Help me understand how Shannon entropy leads into channel capacity and thermodynamic entropy.</p>
</article>
<article class="turn" aria-label="Conversation turn">
<h3>Didactopus Mentor</h3>
<p class="meta">Role: assistant</p>
<p>[stubbed-response] [mentor] Concept: Independent Reasoning and Careful Comparison Prerequisites: Course Notes and Reference Texts Supporting lessons</p>
</article>
<article class="turn" aria-label="Conversation turn">
<h3>Didactopus Practice Designer</h3>
<p class="meta">Role: assistant</p>
<p>[stubbed-response] [practice] Concept: Independent Reasoning and Careful Comparison Prerequisites: Course Notes and Reference Texts Supporting lessons</p>
</article>
<article class="turn" aria-label="Conversation turn">
<h3>Learner Submission</h3>
<p class="meta">Role: user</p>
<p>Entropy measures uncertainty because more possible outcomes require more information to describe, but one limitation is that thermodynamic entropy is not identical to Shannon entropy.</p>
</article>
<article class="turn" aria-label="Conversation turn">
<h3>Didactopus Evaluator</h3>
<p class="meta">Role: assistant</p>
<p>[stubbed-response] [evaluator] Concept: Independent Reasoning and Careful Comparison Prerequisites: Course Notes and Reference Texts Supporting lessons</p>
</article>
<article class="turn" aria-label="Conversation turn">
<h3>Didactopus Mentor</h3>
<p class="meta">Role: assistant</p>
<p>[stubbed-response] [mentor] Concept: Independent Reasoning and Careful Comparison Prerequisites: Course Notes and Reference Texts Supporting lessons</p>
</article>
</section>
<section aria-labelledby="evaluation-title">
<h2 id="evaluation-title">Evaluation Summary</h2>
<p><strong>Verdict:</strong> needs_revision</p>
<p><strong>Aggregated dimensions:</strong> {&quot;correctness&quot;: 0.6000000000000001, &quot;critique&quot;: 0.6499999999999999, &quot;explanation&quot;: 0.85}</p>
<p><strong>Follow-up:</strong> Rework the answer so it states the equality/relationship explicitly and explains why it matters.</p>
</section>
</main>
</body>
</html>

View File

@@ -0,0 +1,244 @@
{
"goal": "Help me understand how Shannon entropy leads into channel capacity and thermodynamic entropy.",
"output_language": "es",
"source_language": "en",
"study_plan": {
"skill": "ocw-information-entropy-agent",
"task": "Help me understand how Shannon entropy leads into channel capacity and thermodynamic entropy.",
"source_language": "en",
"steps": [
{
"concept_key": "mit-ocw-information-and-entropy::independent-reasoning-and-careful-comparison",
"title": "Independent Reasoning and Careful Comparison",
"status": "mastered",
"prerequisites": [
"mit-ocw-information-and-entropy::course-notes-and-reference-texts"
],
"prerequisite_titles": [
"Course Notes and Reference Texts"
],
"supporting_lessons": [
"Independent Reasoning and Careful Comparison"
],
"source_fragments": [
{
"lesson_title": "Independent Reasoning and Careful Comparison",
"kind": "lesson_body",
"text": "- Objective: Explain why the course requires precise comparison of related but non-identical concepts.\n- Exercise: Write a short note distinguishing Shannon entropy, channel capacity, and thermodynamic entropy.\nThe syllabus framing implies a style of work where analogy is useful but dangerous when used loosely. Learners must compare models carefully, state assumptions, and notice where similar mathematics does not imply identical interpretation."
},
{
"lesson_title": "Independent Reasoning and Careful Comparison",
"kind": "objective",
"text": "Explain why the course requires precise comparison of related but non-identical concepts."
}
],
"recommended_action": "Use Independent Reasoning and Careful Comparison as the primary teaching anchor."
},
{
"concept_key": "mit-ocw-information-and-entropy::thermodynamics-and-entropy",
"title": "Thermodynamics and Entropy",
"status": "mastered",
"prerequisites": [
"mit-ocw-information-and-entropy::cryptography-and-information-hiding"
],
"prerequisite_titles": [
"Cryptography and Information Hiding"
],
"supporting_lessons": [
"Thermodynamics and Entropy"
],
"source_fragments": [
{
"lesson_title": "Thermodynamics and Entropy",
"kind": "lesson_body",
"text": "- Objective: Explain how thermodynamic entropy relates to, and differs from, Shannon entropy.\n- Exercise: Compare the two entropy notions and identify what is preserved across the analogy.\nThe course uses entropy as a bridge concept between communication theory and physics while insisting on careful interpretation."
},
{
"lesson_title": "Thermodynamics and Entropy",
"kind": "objective",
"text": "Explain how thermodynamic entropy relates to, and differs from, Shannon entropy."
}
],
"recommended_action": "Use Thermodynamics and Entropy as the primary teaching anchor."
},
{
"concept_key": "mit-ocw-information-and-entropy::shannon-entropy",
"title": "Shannon Entropy",
"status": "mastered",
"prerequisites": [
"mit-ocw-information-and-entropy::counting-and-probability"
],
"prerequisite_titles": [
"Counting and Probability"
],
"supporting_lessons": [
"Shannon Entropy"
],
"source_fragments": [
{
"lesson_title": "Shannon Entropy",
"kind": "lesson_body",
"text": "- Objective: Explain Shannon entropy as a measure of uncertainty and compare high-entropy and low-entropy sources.\n- Exercise: Compute the entropy of a Bernoulli source and interpret the result.\nThe course then introduces entropy as a quantitative measure of uncertainty for a source model and uses it to reason about representation cost and surprise."
},
{
"lesson_title": "Shannon Entropy",
"kind": "objective",
"text": "Explain Shannon entropy as a measure of uncertainty and compare high-entropy and low-entropy sources."
}
],
"recommended_action": "Use Shannon Entropy as the primary teaching anchor."
}
],
"guided_path_reference": [
"mit-ocw-information-and-entropy::mit-ocw-6-050j-information-and-entropy-course-home",
"mit-ocw-information-and-entropy::information-and-entropy",
"mit-ocw-information-and-entropy::ultimate-limits-to-communication-and-computation",
"mit-ocw-information-and-entropy::open-textbooks-problem-sets-and-programming-work",
"mit-ocw-information-and-entropy::mit-ocw-6-050j-information-and-entropy-syllabus",
"mit-ocw-information-and-entropy::prerequisites-and-mathematical-background",
"mit-ocw-information-and-entropy::assessment-structure",
"mit-ocw-information-and-entropy::course-notes-and-reference-texts",
"mit-ocw-information-and-entropy::independent-reasoning-and-careful-comparison",
"mit-ocw-information-and-entropy::mit-ocw-6-050j-information-and-entropy-unit-sequence",
"mit-ocw-information-and-entropy::counting-and-probability",
"mit-ocw-information-and-entropy::shannon-entropy",
"mit-ocw-information-and-entropy::mutual-information",
"mit-ocw-information-and-entropy::source-coding-and-compression",
"mit-ocw-information-and-entropy::huffman-coding",
"mit-ocw-information-and-entropy::channel-capacity",
"mit-ocw-information-and-entropy::channel-coding",
"mit-ocw-information-and-entropy::error-correcting-codes",
"mit-ocw-information-and-entropy::cryptography-and-information-hiding",
"mit-ocw-information-and-entropy::thermodynamics-and-entropy"
]
},
"primary_concept": {
"concept_key": "mit-ocw-information-and-entropy::independent-reasoning-and-careful-comparison",
"title": "Independent Reasoning and Careful Comparison",
"status": "mastered",
"prerequisites": [
"mit-ocw-information-and-entropy::course-notes-and-reference-texts"
],
"prerequisite_titles": [
"Course Notes and Reference Texts"
],
"supporting_lessons": [
"Independent Reasoning and Careful Comparison"
],
"source_fragments": [
{
"lesson_title": "Independent Reasoning and Careful Comparison",
"kind": "lesson_body",
"text": "- Objective: Explain why the course requires precise comparison of related but non-identical concepts.\n- Exercise: Write a short note distinguishing Shannon entropy, channel capacity, and thermodynamic entropy.\nThe syllabus framing implies a style of work where analogy is useful but dangerous when used loosely. Learners must compare models carefully, state assumptions, and notice where similar mathematics does not imply identical interpretation."
},
{
"lesson_title": "Independent Reasoning and Careful Comparison",
"kind": "objective",
"text": "Explain why the course requires precise comparison of related but non-identical concepts."
}
],
"recommended_action": "Use Independent Reasoning and Careful Comparison as the primary teaching anchor."
},
"secondary_concept": {
"concept_key": "mit-ocw-information-and-entropy::thermodynamics-and-entropy",
"title": "Thermodynamics and Entropy",
"status": "mastered",
"prerequisites": [
"mit-ocw-information-and-entropy::cryptography-and-information-hiding"
],
"prerequisite_titles": [
"Cryptography and Information Hiding"
],
"supporting_lessons": [
"Thermodynamics and Entropy"
],
"source_fragments": [
{
"lesson_title": "Thermodynamics and Entropy",
"kind": "lesson_body",
"text": "- Objective: Explain how thermodynamic entropy relates to, and differs from, Shannon entropy.\n- Exercise: Compare the two entropy notions and identify what is preserved across the analogy.\nThe course uses entropy as a bridge concept between communication theory and physics while insisting on careful interpretation."
},
{
"lesson_title": "Thermodynamics and Entropy",
"kind": "objective",
"text": "Explain how thermodynamic entropy relates to, and differs from, Shannon entropy."
}
],
"recommended_action": "Use Thermodynamics and Entropy as the primary teaching anchor."
},
"practice_task": "[stubbed-response] [practice] Concept: Independent Reasoning and Careful Comparison Prerequisites: Course Notes and Reference Texts Supporting lessons",
"evaluation": {
"concept_key": "mit-ocw-information-and-entropy::independent-reasoning-and-careful-comparison",
"submission": "Entropy measures uncertainty because more possible outcomes require more information to describe, but one limitation is that thermodynamic entropy is not identical to Shannon entropy.",
"verdict": "needs_revision",
"aggregated": {
"correctness": 0.6000000000000001,
"explanation": 0.85,
"critique": 0.6499999999999999
},
"evaluators": [
{
"name": "rubric",
"dimensions": {
"correctness": 0.8,
"explanation": 0.85
},
"notes": "Heuristic scaffold rubric score."
},
{
"name": "symbolic_rule",
"dimensions": {
"correctness": 0.4
},
"notes": "Stub symbolic evaluator."
},
{
"name": "critique",
"dimensions": {
"critique": 0.6499999999999999
},
"notes": "Stub critique evaluator."
}
],
"skill_reference": {
"skill_name": "ocw-information-entropy-agent",
"mastered_by_demo_agent": true,
"supporting_lessons": [
"Independent Reasoning and Careful Comparison"
]
},
"follow_up": "Rework the answer so it states the equality/relationship explicitly and explains why it matters."
},
"turns": [
{
"role": "user",
"label": "Learner Goal",
"content": "Help me understand how Shannon entropy leads into channel capacity and thermodynamic entropy."
},
{
"role": "assistant",
"label": "Didactopus Mentor",
"content": "[stubbed-response] [mentor] Concept: Independent Reasoning and Careful Comparison Prerequisites: Course Notes and Reference Texts Supporting lessons"
},
{
"role": "assistant",
"label": "Didactopus Practice Designer",
"content": "[stubbed-response] [practice] Concept: Independent Reasoning and Careful Comparison Prerequisites: Course Notes and Reference Texts Supporting lessons"
},
{
"role": "user",
"label": "Learner Submission",
"content": "Entropy measures uncertainty because more possible outcomes require more information to describe, but one limitation is that thermodynamic entropy is not identical to Shannon entropy."
},
{
"role": "assistant",
"label": "Didactopus Evaluator",
"content": "[stubbed-response] [evaluator] Concept: Independent Reasoning and Careful Comparison Prerequisites: Course Notes and Reference Texts Supporting lessons"
},
{
"role": "assistant",
"label": "Didactopus Mentor",
"content": "[stubbed-response] [mentor] Concept: Independent Reasoning and Careful Comparison Prerequisites: Course Notes and Reference Texts Supporting lessons"
}
]
}

View File

@@ -0,0 +1,55 @@
Didactopus Learner Session
Learner goal: Help me understand how Shannon entropy leads into channel capacity and thermodynamic entropy.
Source language: en
Output language: es
Study plan:
1. Independent Reasoning and Careful Comparison
Status: mastered
Prerequisites: Course Notes and Reference Texts
Supporting lessons: Independent Reasoning and Careful Comparison
Source fragment (lesson_body): - Objective: Explain why the course requires precise comparison of related but non-identical concepts.
- Exercise: Write a short note distinguishing Shannon entropy, channel capacity, and thermodynamic entropy.
The syllabus framing implies a style of work where analogy is useful but dangerous when used loosely. Learners must compare models carefully, state assumptions, and notice where similar mathematics does not imply identical interpretation.
Source fragment (objective): Explain why the course requires precise comparison of related but non-identical concepts.
2. Thermodynamics and Entropy
Status: mastered
Prerequisites: Cryptography and Information Hiding
Supporting lessons: Thermodynamics and Entropy
Source fragment (lesson_body): - Objective: Explain how thermodynamic entropy relates to, and differs from, Shannon entropy.
- Exercise: Compare the two entropy notions and identify what is preserved across the analogy.
The course uses entropy as a bridge concept between communication theory and physics while insisting on careful interpretation.
Source fragment (objective): Explain how thermodynamic entropy relates to, and differs from, Shannon entropy.
3. Shannon Entropy
Status: mastered
Prerequisites: Counting and Probability
Supporting lessons: Shannon Entropy
Source fragment (lesson_body): - Objective: Explain Shannon entropy as a measure of uncertainty and compare high-entropy and low-entropy sources.
- Exercise: Compute the entropy of a Bernoulli source and interpret the result.
The course then introduces entropy as a quantitative measure of uncertainty for a source model and uses it to reason about representation cost and surprise.
Source fragment (objective): Explain Shannon entropy as a measure of uncertainty and compare high-entropy and low-entropy sources.
Conversation:
Learner Goal:
Help me understand how Shannon entropy leads into channel capacity and thermodynamic entropy.
Didactopus Mentor:
[stubbed-response] [mentor] Concept: Independent Reasoning and Careful Comparison Prerequisites: Course Notes and Reference Texts Supporting lessons
Didactopus Practice Designer:
[stubbed-response] [practice] Concept: Independent Reasoning and Careful Comparison Prerequisites: Course Notes and Reference Texts Supporting lessons
Learner Submission:
Entropy measures uncertainty because more possible outcomes require more information to describe, but one limitation is that thermodynamic entropy is not identical to Shannon entropy.
Didactopus Evaluator:
[stubbed-response] [evaluator] Concept: Independent Reasoning and Careful Comparison Prerequisites: Course Notes and Reference Texts Supporting lessons
Didactopus Mentor:
[stubbed-response] [mentor] Concept: Independent Reasoning and Careful Comparison Prerequisites: Course Notes and Reference Texts Supporting lessons
Evaluation summary:
Verdict: needs_revision
Aggregated dimensions: {"correctness": 0.6000000000000001, "critique": 0.6499999999999999, "explanation": 0.85}
Follow-up: Rework the answer so it states the equality/relationship explicitly and explains why it matters.

262
src/didactopus/arena.py Normal file
View File

@@ -0,0 +1,262 @@
from __future__ import annotations
import json
from pathlib import Path
from time import perf_counter
import yaml
from .config import load_config
from .language_support import response_language_instruction
from .learner_session import _grounding_block
from .model_bench import _adequacy_rating, _score_evaluator_response, _score_mentor_response, _score_practice_response
from .model_provider import ModelProvider
from .ocw_skill_agent_demo import build_skill_grounded_study_plan, evaluate_submission_with_skill, load_ocw_skill_context
from .role_prompts import system_prompt_for_role_variant
def _default_arena_spec() -> dict:
return {
"candidates": [
{"name": "stub-baseline", "config": "configs/config.example.yaml", "prompt_variant": "baseline", "language": "en"},
{"name": "stub-strict-grounding", "config": "configs/config.example.yaml", "prompt_variant": "strict_grounding", "language": "en"},
{"name": "stub-trust-preserving", "config": "configs/config.example.yaml", "prompt_variant": "trust_preserving", "language": "en"},
],
"review": {
"enabled": True,
"config": "configs/config.example.yaml",
"role": "mentor",
},
}
def load_arena_spec(path: str | Path | None) -> dict:
if path is None:
return _default_arena_spec()
data = yaml.safe_load(Path(path).read_text(encoding="utf-8")) or {}
if "candidates" not in data:
data["candidates"] = _default_arena_spec()["candidates"]
return data
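The loader's fallback behavior can be exercised in isolation. A minimal sketch that inlines the default-spec logic above and replaces the YAML read with a plain dict so it runs standalone (it does not import the didactopus package):

```python
from __future__ import annotations

# Sketch of load_arena_spec's fallback: no spec supplied (None), or a spec
# without a "candidates" key, falls back to the stub candidates.
def _default_arena_spec() -> dict:
    return {
        "candidates": [
            {"name": "stub-baseline", "prompt_variant": "baseline", "language": "en"},
            {"name": "stub-strict-grounding", "prompt_variant": "strict_grounding", "language": "en"},
            {"name": "stub-trust-preserving", "prompt_variant": "trust_preserving", "language": "en"},
        ],
    }

def load_arena_spec(data: dict | None) -> dict:
    # Stands in for the YAML-reading version; None means "no spec path supplied".
    if data is None:
        return _default_arena_spec()
    if "candidates" not in data:
        data["candidates"] = _default_arena_spec()["candidates"]
    return data

print(len(load_arena_spec(None)["candidates"]))  # → 3
```

A spec that only customizes `review` still gets the three stub candidates injected, which is what lets the arena run without a hand-written candidate list.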
def _arena_tasks(context) -> tuple[list[dict], str, dict]:
study_plan = build_skill_grounded_study_plan(
context,
"Help a learner connect Shannon entropy, channel capacity, and thermodynamic entropy.",
)
steps = study_plan.get("steps", [])
if len(steps) < 2:
raise ValueError("Arena requires at least two grounded study-plan steps.")
learner_submission = (
"Entropy measures uncertainty because more possible outcomes require more information to describe, "
"but thermodynamic entropy is not identical to Shannon entropy without careful interpretation."
)
deterministic = evaluate_submission_with_skill(
context,
steps[0]["concept_key"].split("::", 1)[-1],
learner_submission,
)
return steps[:2], learner_submission, deterministic
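The `concept_key.split("::", 1)[-1]` call above assumes namespaced step keys; only the trailing concept id is handed to the deterministic evaluator. A quick illustration (the key value here is hypothetical, not taken from the skill bundle):

```python
# Hypothetical key of the "<skill>::<concept>" shape the split above expects.
concept_key = "ocw-information-entropy::shannon-entropy"
concept_id = concept_key.split("::", 1)[-1]
print(concept_id)  # → shannon-entropy

# Keys without a "::" separator pass through unchanged, since [-1] on a
# one-element split result is the whole string.
print("shannon-entropy".split("::", 1)[-1])  # → shannon-entropy
```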
def _arena_role_prompts(primary: dict, secondary: dict, learner_submission: str, deterministic: dict) -> dict[str, str]:
return {
"mentor": (
f"{_grounding_block(primary)}\n\n"
f"{_grounding_block(secondary)}\n\n"
"Give a grounded mentor response that orients the learner, explains the sequence, and asks one focused question."
),
"practice": (
f"{_grounding_block(primary)}\n\n"
"Create one reasoning-heavy practice task. Keep it grounded and do not provide the full solution."
),
"evaluator": (
f"{_grounding_block(primary)}\n\n"
f"Learner submission: {learner_submission}\n"
f"Deterministic evaluator result: verdict={deterministic['verdict']}, aggregated={deterministic['aggregated']}\n"
"Respond as evaluator. Acknowledge what the learner already did correctly, preserve existing caveats, and give one next revision target."
),
}
def _scorer_for_role(role: str):
return {
"mentor": _score_mentor_response,
"practice": _score_practice_response,
"evaluator": _score_evaluator_response,
}[role]
def _run_candidate(candidate: dict, skill_dir: str | Path) -> dict:
config = load_config(candidate["config"])
provider = ModelProvider(config.model_provider)
context = load_ocw_skill_context(skill_dir)
steps, learner_submission, deterministic = _arena_tasks(context)
primary, secondary = steps
prompts = _arena_role_prompts(primary, secondary, learner_submission, deterministic)
variant = candidate.get("prompt_variant", "baseline")
language = candidate.get("language", "en")
role_results = []
overall = 0.0
for role, prompt in prompts.items():
started = perf_counter()
response = provider.generate(
f"{prompt}{response_language_instruction(language, 'en')}",
role=role,
system_prompt=system_prompt_for_role_variant(role, variant),
temperature=0.2,
max_tokens=220,
)
elapsed_ms = round((perf_counter() - started) * 1000.0, 3)
score, notes = _scorer_for_role(role)(response.text)
overall += score
role_results.append(
{
"role": role,
"provider": response.provider,
"model_name": response.model_name,
"prompt_variant": variant,
"language": language,
"latency_ms": elapsed_ms,
"adequacy_score": round(score, 3),
"adequacy_rating": _adequacy_rating(score),
"response_preview": response.text[:280],
"notes": notes,
}
)
overall /= len(role_results)
return {
"candidate_name": candidate["name"],
"config": candidate["config"],
"prompt_variant": variant,
"language": language,
"provider": config.model_provider.provider,
"overall_score": round(overall, 3),
"overall_rating": _adequacy_rating(overall),
"role_results": role_results,
}
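The per-role latency measurement in the loop above follows the usual `perf_counter` pattern, isolated here with a sleep standing in for the provider call:

```python
from time import perf_counter, sleep

# Same timing pattern as the role loop above: wall-clock latency in
# milliseconds, rounded to three decimal places.
started = perf_counter()
sleep(0.01)  # stands in for provider.generate(...)
elapsed_ms = round((perf_counter() - started) * 1000.0, 3)
print(elapsed_ms)
```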
def _build_review_queue(candidate_results: list[dict]) -> list[dict]:
queue = []
for result in candidate_results:
weak_roles = [item["role"] for item in result["role_results"] if item["adequacy_rating"] != "adequate"]
queue.append(
{
"candidate_name": result["candidate_name"],
"overall_rating": result["overall_rating"],
"overall_score": result["overall_score"],
"needs_human_review": result["overall_rating"] != "adequate" or bool(weak_roles),
"weak_roles": weak_roles,
}
)
return queue
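The flagging rule above, exercised against a fabricated result (illustrative data only): any role rated below "adequate", or a non-"adequate" overall rating, requests human review.

```python
# Same flagging rule as _build_review_queue above, run on fabricated results.
def build_review_queue(candidate_results: list[dict]) -> list[dict]:
    queue = []
    for result in candidate_results:
        weak_roles = [r["role"] for r in result["role_results"] if r["adequacy_rating"] != "adequate"]
        queue.append({
            "candidate_name": result["candidate_name"],
            "needs_human_review": result["overall_rating"] != "adequate" or bool(weak_roles),
            "weak_roles": weak_roles,
        })
    return queue

fake = [{
    "candidate_name": "stub-baseline",
    "overall_rating": "adequate",
    "role_results": [
        {"role": "mentor", "adequacy_rating": "adequate"},
        {"role": "evaluator", "adequacy_rating": "borderline"},
    ],
}]
print(build_review_queue(fake)[0])
# → {'candidate_name': 'stub-baseline', 'needs_human_review': True, 'weak_roles': ['evaluator']}
```

Note the asymmetry: one weak role is enough to queue a candidate even when its averaged overall rating is "adequate", which biases the arena toward over-reviewing rather than under-reviewing.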
def _llm_review_summary(candidate_results: list[dict], spec: dict) -> dict | None:
review_spec = spec.get("review", {})
if not review_spec.get("enabled", False):
return None
config = load_config(review_spec.get("config", "configs/config.example.yaml"))
provider = ModelProvider(config.model_provider)
ranked = sorted(candidate_results, key=lambda item: item["overall_score"], reverse=True)
summary_lines = []
for result in ranked[:3]:
summary_lines.append(
f"- {result['candidate_name']}: overall {result['overall_rating']} ({result['overall_score']}), "
f"language={result['language']}, roles={[(item['role'], item['adequacy_rating'], item['adequacy_score']) for item in result['role_results']]}"
)
prompt = "\n".join(
[
"Review these Didactopus arena results for a human reviewer.",
"Rank the strongest candidates, identify likely prompt improvements, and state uncertainty clearly.",
"Do not claim that any candidate is fully validated. Treat this as initial review support only.",
"",
"Arena results:",
*summary_lines,
]
)
role = review_spec.get("role", "mentor")
response = provider.generate(
prompt,
role=role,
system_prompt=system_prompt_for_role_variant(role, "trust_preserving"),
temperature=0.2,
max_tokens=260,
)
return {
"provider": response.provider,
"model_name": response.model_name,
"role": role,
"summary": response.text,
}
def run_didactopus_arena(
*,
arena_spec_path: str | Path | None,
skill_dir: str | Path,
out_dir: str | Path,
) -> dict:
spec = load_arena_spec(arena_spec_path)
candidate_results = [_run_candidate(candidate, skill_dir) for candidate in spec.get("candidates", [])]
ranked = sorted(candidate_results, key=lambda item: item["overall_score"], reverse=True)
payload = {
"arena": {
"name": "didactopus-behavior-arena",
"candidate_count": len(candidate_results),
},
"ranked_candidates": ranked,
"review_queue": _build_review_queue(ranked),
"llm_review": _llm_review_summary(ranked, spec),
}
out_dir = Path(out_dir)
out_dir.mkdir(parents=True, exist_ok=True)
(out_dir / "arena_results.json").write_text(json.dumps(payload, indent=2), encoding="utf-8")
(out_dir / "arena_review_queue.json").write_text(json.dumps(payload["review_queue"], indent=2), encoding="utf-8")
lines = [
"# Didactopus Arena Report",
"",
f"- Candidates: {payload['arena']['candidate_count']}",
"",
"## Rankings",
]
for result in ranked:
lines.append(
f"- `{result['candidate_name']}` via `{result['provider']}` / prompt variant `{result['prompt_variant']}`: "
f"{result['overall_rating']} ({result['overall_score']}), language `{result['language']}`"
)
lines.extend(["", "## Human Review Queue"])
for item in payload["review_queue"]:
lines.append(
f"- `{item['candidate_name']}`: needs_human_review={item['needs_human_review']}, weak_roles={item['weak_roles']}"
)
if payload["llm_review"] is not None:
lines.extend(["", "## LLM Review Summary", payload["llm_review"]["summary"]])
(out_dir / "arena_report.md").write_text("\n".join(lines), encoding="utf-8")
return payload
def main() -> None:
import argparse
root = Path(__file__).resolve().parents[2]
parser = argparse.ArgumentParser(description="Run the Didactopus model-and-prompt arena.")
parser.add_argument("--arena-spec", default=None)
parser.add_argument("--skill-dir", default=str(root / "skills" / "ocw-information-entropy-agent"))
parser.add_argument("--out-dir", default=str(root / "examples" / "arena"))
args = parser.parse_args()
payload = run_didactopus_arena(
arena_spec_path=args.arena_spec,
skill_dir=args.skill_dir,
out_dir=args.out_dir,
)
print(json.dumps(payload, indent=2))
if __name__ == "__main__":
main()

View File

@@ -0,0 +1,28 @@
from __future__ import annotations
LANGUAGE_LABELS = {
"en": "English",
"es": "Spanish",
"fr": "French",
"de": "German",
"it": "Italian",
"pt": "Portuguese",
"ar": "Arabic",
"sw": "Swahili",
"zh": "Chinese",
"ja": "Japanese",
}
def language_label(language: str) -> str:
return LANGUAGE_LABELS.get(language, language)
def response_language_instruction(language: str, source_language: str = "en") -> str:
if language == source_language:
return ""
return (
f" Respond in {language_label(language)}. Preserve key source-grounded concepts and caveats faithfully, "
f"and make clear when you are explaining material whose source language is {language_label(source_language)}."
)
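A behavior sketch for the helper above, with the logic inlined (and the label table shortened) so it runs standalone: same-language requests add nothing to the prompt, cross-language requests append a translation directive, and unknown codes fall back to the raw code.

```python
# Inlined copy of response_language_instruction's behavior for illustration.
LANGUAGE_LABELS = {"en": "English", "es": "Spanish"}

def response_language_instruction(language: str, source_language: str = "en") -> str:
    if language == source_language:
        return ""  # no directive needed when output matches the source language
    label = LANGUAGE_LABELS.get(language, language)
    source_label = LANGUAGE_LABELS.get(source_language, source_language)
    return (
        f" Respond in {label}. Preserve key source-grounded concepts and caveats faithfully, "
        f"and make clear when you are explaining material whose source language is {source_label}."
    )

print(response_language_instruction("en"))  # prints an empty line
print(response_language_instruction("es"))  # directive mentioning "Respond in Spanish."
```

The leading space in the non-empty return value matters: callers append the instruction directly to an existing prompt string with no separator of their own.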

View File

@@ -14,6 +14,8 @@ def build_accessible_session_text(session: dict) -> str:
"Didactopus Learner Session",
"",
f"Learner goal: {session.get('goal', '')}",
f"Source language: {session.get('source_language', 'en')}",
f"Output language: {session.get('output_language', 'en')}",
"",
"Study plan:",
]
@@ -88,6 +90,8 @@ def build_accessible_session_html(session: dict) -> str:
'<h1 id="session-title">Didactopus Learner Session</h1>',
'<p class="sr-note">This page is structured for keyboard and screen-reader use. It presents the learner goal, study plan, grounded source fragments, and conversation turns in reading order.</p>',
f"<p><strong>Learner goal:</strong> {_escape(session.get('goal', ''))}</p>",
f"<p><strong>Source language:</strong> {_escape(session.get('source_language', 'en'))}</p>",
f"<p><strong>Output language:</strong> {_escape(session.get('output_language', 'en'))}</p>",
"</section>",
'<section aria-labelledby="study-plan-title">',
'<h2 id="study-plan-title">Study Plan</h2>',

View File

@@ -2,6 +2,7 @@ from __future__ import annotations
from dataclasses import dataclass
from .language_support import response_language_instruction
from .model_provider import ModelProvider
from .ocw_skill_agent_demo import (
SkillContext,
@@ -31,11 +32,13 @@ def _generate_role_text(
*,
role: str,
prompt: str,
language: str = "en",
source_language: str = "en",
temperature: float = 0.2,
max_tokens: int = 220,
) -> str:
return provider.generate(
prompt,
f"{prompt}{response_language_instruction(language, source_language)}",
role=role,
system_prompt=system_prompt_for_role(role),
temperature=temperature,
@@ -55,6 +58,8 @@ def build_graph_grounded_session(
provider: ModelProvider,
learner_goal: str,
learner_submission: str,
language: str = "en",
source_language: str = "en",
) -> dict:
study_plan = build_skill_grounded_study_plan(context, learner_goal)
steps = study_plan.get("steps", [])
@@ -74,6 +79,8 @@ def build_graph_grounded_session(
provider,
role="mentor",
prompt=mentor_prompt,
language=language,
source_language=source_language,
temperature=0.2,
max_tokens=260,
)
@@ -87,6 +94,8 @@ def build_graph_grounded_session(
provider,
role="practice",
prompt=practice_prompt,
language=language,
source_language=source_language,
temperature=0.3,
max_tokens=220,
)
@@ -103,6 +112,8 @@ def build_graph_grounded_session(
provider,
role="evaluator",
prompt=evaluator_prompt,
language=language,
source_language=source_language,
temperature=0.2,
max_tokens=240,
)
@@ -117,6 +128,8 @@ def build_graph_grounded_session(
provider,
role="mentor",
prompt=next_step_prompt,
language=language,
source_language=source_language,
temperature=0.2,
max_tokens=220,
)
@@ -132,6 +145,8 @@ def build_graph_grounded_session(
return {
"goal": learner_goal,
"output_language": language,
"source_language": source_language,
"study_plan": study_plan,
"primary_concept": primary,
"secondary_concept": secondary,

View File

@@ -16,6 +16,7 @@ def run_learner_session_demo(
out_path: str | Path | None = None,
accessible_html_path: str | Path | None = None,
accessible_text_path: str | Path | None = None,
language: str = "en",
) -> dict:
config = load_config(config_path)
provider = ModelProvider(config.model_provider)
@@ -25,6 +26,8 @@
provider=provider,
learner_goal="Help me understand how Shannon entropy leads into channel capacity and thermodynamic entropy.",
learner_submission="Entropy measures uncertainty because more possible outcomes require more information to describe, but one limitation is that thermodynamic entropy is not identical to Shannon entropy.",
language=language,
source_language="en",
)
if out_path is not None:
out_path = Path(out_path)
@@ -45,6 +48,7 @@ def main() -> None:
parser.add_argument("--out", default=str(root / "examples" / "ocw-information-entropy-session.json"))
parser.add_argument("--accessible-html", default=None)
parser.add_argument("--accessible-text", default=None)
parser.add_argument("--language", default="en")
args = parser.parse_args()
payload = run_learner_session_demo(
args.config,
@@ -52,6 +56,7 @@
args.out,
args.accessible_html,
args.accessible_text,
args.language,
)
print(json.dumps(payload, indent=2))

View File

@@ -5,6 +5,7 @@ from pathlib import Path
from time import perf_counter
from .config import load_config
from .language_support import response_language_instruction
from .learner_session import _grounding_block
from .model_provider import ModelProvider
from .ocw_skill_agent_demo import build_skill_grounded_study_plan, evaluate_submission_with_skill, load_ocw_skill_context
@@ -100,6 +101,7 @@ def run_model_benchmark(
hardware_cpu: str = "unknown",
hardware_ram_gb: float | None = None,
hardware_notes: str | None = None,
language: str = "en",
) -> dict:
config = load_config(config_path)
provider = ModelProvider(config.model_provider)
@@ -153,7 +155,7 @@
for role, prompt in prompts.items():
started = perf_counter()
response = provider.generate(
prompt,
f"{prompt}{response_language_instruction(language, 'en')}",
role=role,
system_prompt=system_prompt_for_role(role),
temperature=0.2,
@@ -193,6 +195,8 @@
"study_plan_task": study_plan["task"],
"primary_concept": primary["title"],
"secondary_concept": secondary["title"],
"source_language": "en",
"output_language": language,
},
"role_results": role_results,
"summary": {
@@ -248,6 +252,7 @@ def main() -> None:
parser.add_argument("--hardware-cpu", default="unknown")
parser.add_argument("--hardware-ram-gb", type=float, default=None)
parser.add_argument("--hardware-notes", default="")
parser.add_argument("--language", default="en")
args = parser.parse_args()
payload = run_model_benchmark(
config_path=args.config,
@@ -257,6 +262,7 @@ def main() -> None:
hardware_cpu=args.hardware_cpu,
hardware_ram_gb=args.hardware_ram_gb,
hardware_notes=args.hardware_notes,
language=args.language,
)
print(json.dumps(payload, indent=2))

View File

@@ -116,6 +116,7 @@ def build_skill_grounded_study_plan(context: SkillContext, target_task: str) ->
return {
"skill": context.skill_name,
"task": target_task,
"source_language": "en",
"steps": steps,
"guided_path_reference": list(context.run_summary.get("curriculum_path", [])),
}
@@ -193,7 +194,7 @@ def evaluate_submission_with_skill(context: SkillContext, concept_id: str, submi
}
def run_ocw_skill_agent_demo(skill_dir: str | Path, out_dir: str | Path) -> dict:
def run_ocw_skill_agent_demo(skill_dir: str | Path, out_dir: str | Path, language: str = "en") -> dict:
context = load_ocw_skill_context(skill_dir)
out_dir = Path(out_dir)
out_dir.mkdir(parents=True, exist_ok=True)
@@ -214,6 +215,8 @@ def run_ocw_skill_agent_demo(skill_dir: str | Path, out_dir: str | Path) -> dict
"name": context.skill_name,
"description": context.skill_description,
},
"source_language": "en",
"output_language": language,
"study_plan": study_plan,
"explanation": explanation,
"evaluation": evaluation,
@@ -225,6 +228,8 @@ def run_ocw_skill_agent_demo(skill_dir: str | Path, out_dir: str | Path) -> dict
"",
f"- Skill: `{context.skill_name}`",
f"- Description: {context.skill_description}",
f"- Source language: en",
f"- Output language: {language}",
"",
"## Study Plan",
]
@@ -260,8 +265,9 @@ def main() -> None:
"--out-dir",
default=str(root / "examples" / "ocw-information-entropy-skill-demo"),
)
parser.add_argument("--language", default="en")
args = parser.parse_args()
payload = run_ocw_skill_agent_demo(args.skill_dir, args.out_dir)
payload = run_ocw_skill_agent_demo(args.skill_dir, args.out_dir, args.language)
print(json.dumps(payload, indent=2))

View File

@@ -1,6 +1,37 @@
from __future__ import annotations
def _variant_suffix(role: str, variant: str) -> str:
variant_map = {
"baseline": "",
"strict_grounding": {
"mentor": " Ground every major claim in the supplied concept structure or source fragments, and say when you are inferring beyond them.",
"practice": " Keep every exercise tightly tied to the supplied grounded material and avoid introducing outside topic drift.",
"learner": " Keep the reflection tied to the supplied grounded material and avoid importing outside claims unless you mark them as inference.",
"project_advisor": " Keep project suggestions anchored to the supplied grounded material and state assumptions explicitly.",
"evaluator": " Quote or paraphrase the learner text before judging gaps, and distinguish grounded criticism from inference.",
},
"trust_preserving": {
"mentor": " Be especially careful to preserve learner trust: acknowledge what is already correct before redirecting, and avoid overstating errors.",
"practice": " Prefer clear, calm task framing that emphasizes exploration over performance pressure.",
"learner": " Preserve an honest, effortful learner voice and explicitly note uncertainty without collapsing into self-dismissal.",
"project_advisor": " Emphasize realistic next steps and avoid grandiose scope.",
"evaluator": " Preserve learner trust by naming strengths first, avoiding invented omissions, and framing revisions as specific improvements rather than blanket criticism.",
},
"concise": {
"mentor": " Keep the response compact: no more than four short paragraphs or bullets worth of content.",
"practice": " Keep the task compact and direct.",
"learner": " Keep the reflection short and direct.",
"project_advisor": " Keep the advice short and concrete.",
"evaluator": " Keep the evaluation compact and specific.",
},
}
entry = variant_map.get(variant, "")
if isinstance(entry, str):
return entry
return entry.get(role, "")
def mentor_system_prompt() -> str:
return (
"You are Didactopus in mentor mode. Help the learner think through the topic without doing the work for them. "
@@ -53,3 +84,9 @@ def system_prompt_for_role(role: str) -> str:
if factory is None:
raise KeyError(f"Unknown Didactopus role: {role}")
return factory()
def system_prompt_for_role_variant(role: str, variant: str = "baseline") -> str:
base = system_prompt_for_role(role)
suffix = _variant_suffix(role, variant)
return f"{base}{suffix}" if suffix else base
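The role/variant composition above can be seen end to end in a standalone sketch. The base prompt here is a shortened stand-in (the real base prompts live earlier in role_prompts.py), and the variant map is trimmed to one entry:

```python
# Sketch of the role/variant composition: "baseline" maps to an empty string,
# other variants map to per-role suffix strings appended to the base prompt.
_VARIANT_MAP = {
    "baseline": "",
    "concise": {"mentor": " Keep the response compact."},
}

def system_prompt_for_role_variant(role: str, variant: str = "baseline") -> str:
    base = f"You are Didactopus in {role} mode."  # stand-in for the real base prompt
    entry = _VARIANT_MAP.get(variant, "")
    suffix = entry if isinstance(entry, str) else entry.get(role, "")
    return f"{base}{suffix}" if suffix else base

print(system_prompt_for_role_variant("mentor", "concise"))
# → You are Didactopus in mentor mode. Keep the response compact.
```

Unknown variants, and roles absent from a variant's dict, both degrade to the base prompt, so a misspelled variant name silently behaves like "baseline".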

38
tests/test_arena.py Normal file
View File

@@ -0,0 +1,38 @@
import json
from pathlib import Path
from didactopus.arena import load_arena_spec, run_didactopus_arena
from didactopus.role_prompts import system_prompt_for_role_variant
def test_system_prompt_for_role_variant_changes_prompt() -> None:
baseline = system_prompt_for_role_variant("mentor", "baseline")
strict = system_prompt_for_role_variant("mentor", "strict_grounding")
trust = system_prompt_for_role_variant("evaluator", "trust_preserving")
assert baseline != strict
assert "supplied concept structure" in strict
assert "preserve learner trust" in trust.lower()
def test_load_arena_spec_reads_candidates() -> None:
spec = load_arena_spec("configs/arena.example.yaml")
assert len(spec["candidates"]) == 3
assert spec["review"]["enabled"] is True
assert {candidate["language"] for candidate in spec["candidates"]} == {"en", "es", "fr"}
def test_run_didactopus_arena_writes_outputs(tmp_path: Path) -> None:
payload = run_didactopus_arena(
arena_spec_path="configs/arena.example.yaml",
skill_dir="skills/ocw-information-entropy-agent",
out_dir=tmp_path,
)
assert payload["arena"]["name"] == "didactopus-behavior-arena"
assert len(payload["ranked_candidates"]) == 3
assert (tmp_path / "arena_results.json").exists()
assert (tmp_path / "arena_review_queue.json").exists()
assert (tmp_path / "arena_report.md").exists()
queue = json.loads((tmp_path / "arena_review_queue.json").read_text(encoding="utf-8"))
assert queue
assert payload["ranked_candidates"][0]["language"] in {"en", "es", "fr"}
assert "LLM Review Summary" in (tmp_path / "arena_report.md").read_text(encoding="utf-8")

View File

@@ -28,6 +28,8 @@ def test_accessible_session_html_has_landmarks() -> None:
def test_accessible_session_text_is_linearized() -> None:
text = build_accessible_session_text(_session_payload())
assert "Learner goal:" in text
assert "Source language:" in text
assert "Output language:" in text
assert "Study plan:" in text
assert "Conversation:" in text
assert "Evaluation summary:" in text

View File

@@ -17,12 +17,14 @@ def test_build_graph_grounded_session_uses_grounded_steps() -> None:
provider=provider,
learner_goal="Help me connect Shannon entropy and channel capacity.",
learner_submission="Entropy measures uncertainty because unlikely outcomes carry more information, but one limitation is that idealized source models may not match physical systems.",
language="es",
)
assert payload["study_plan"]["steps"]
assert payload["primary_concept"]["supporting_lessons"]
assert payload["evaluation"]["verdict"] in {"acceptable", "needs_revision"}
assert len(payload["turns"]) == 6
assert payload["output_language"] == "es"
assert any("Grounding fragments" in turn["content"] or "Concept:" in turn["content"] for turn in payload["turns"])
@@ -39,3 +41,4 @@ def test_run_learner_session_demo_writes_output(tmp_path: Path) -> None:
assert (tmp_path / "session.txt").exists()
assert payload["practice_task"]
assert payload["evaluation"]["aggregated"]
assert payload["output_language"] == "en"

View File

@@ -19,6 +19,7 @@ def test_run_model_benchmark_writes_reports(tmp_path) -> None:
assert len(payload["role_results"]) == 3
assert {result["role"] for result in payload["role_results"]} == {"mentor", "practice", "evaluator"}
assert payload["summary"]["overall_adequacy_rating"] in {"adequate", "borderline", "inadequate"}
assert payload["context"]["output_language"] == "en"
json_path = tmp_path / "model_benchmark.json"
md_path = tmp_path / "model_benchmark.md"

View File

@@ -21,6 +21,8 @@ def test_run_ocw_skill_agent_demo(tmp_path: Path) -> None:
assert "grounding" in payload["explanation"]
assert payload["explanation"]["grounding"]["supporting_lessons"]
assert payload["evaluation"]["verdict"] in {"acceptable", "needs_revision"}
assert payload["output_language"] == "en"
assert payload["source_language"] == "en"
def test_skill_demo_flags_weak_submission() -> None: