Revised multilingual support with round-trip warnings, updated philosophy.

welsberr 2026-03-17 20:36:38 -04:00
parent 34b60ac529
commit 58466bbf9f
29 changed files with 1208 additions and 137 deletions


@ -49,6 +49,8 @@ In practice, that means Didactopus tries to help with:
It explicitly tries not to become a silent answer surrogate.
The project is also being advanced with a future-compatibility constraint: avoid choices that assume abundant compute, fluent English, expert supervision, or only mature learners. That keeps the current roadmap moving while preserving eventual usefulness for more constrained and equity-sensitive educational settings.
## Who It Is For
Didactopus has several real audiences:
@ -74,6 +76,13 @@ Current priorities are:
The live detailed roadmap is in:
- `docs/roadmap.md`
- `docs/multilingual-qa.md`
Didactopus can also generate a starter multilingual QA draft from a pack:
```bash
python -m didactopus.multilingual_qa_seed domain-packs/mit-ocw-information-entropy
```
## Start Here If You Just Want To Learn


@ -84,6 +84,12 @@ The arena currently writes:
- `arena_review_queue.json`
- `arena_report.md`
When a candidate sets a non-English `language`, the arena now also tracks a heuristic `multilingual_score` alongside the grounded behavior score. This is meant to catch obvious failures where a model ignores the requested output language or drops key grounded terms.
If the pack provides `multilingual_qa.yaml`, the arena also uses that spec to check required terms, required caveats, and forbidden confusions for the target language.
For non-English candidates, the arena now also records round-trip warnings by back-translating outputs into English and checking whether required source phrases remain recoverable.
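The round-trip layer reduces to a simple recoverability test. A minimal sketch, assuming the caller supplies a `back_translate` callable (the name and the arena's real wiring are assumptions, not the project's actual API):

```python
def round_trip_warnings(output_text, required_phrases, back_translate):
    """Back-translate a non-English output into English and flag required
    source phrases that are no longer recoverable in the result."""
    english = back_translate(output_text).lower()
    warnings = []
    for phrase in required_phrases:
        if phrase.lower() not in english:
            warnings.append(
                f"Round-trip translation did not preserve source phrase '{phrase}'."
            )
    return warnings

# Identity "translator" used here only to illustrate the interface.
demo = round_trip_warnings(
    "Shannon entropy is not identical to thermodynamic entropy.",
    ["Shannon entropy", "channel capacity"],
    back_translate=lambda text: text,
)
```

As the document stresses, an empty warning list is not proof of correctness; it only means the required phrases survived one translation cycle.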
## Human Review Position
The LLM review summary should be treated as initial triage support only.


@ -26,6 +26,7 @@ The HTML output is meant to be screen-reader-friendly and keyboard-friendly:
- semantic headings
- reading-order sections for study plan, conversation, and evaluation
- grounded source fragments rendered as ordinary text instead of only visual diagrams
- deterministic learner-facing labels localized for supported output languages
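A deterministic label layer can be as small as a lookup table keyed by language with an English fallback. The keys and strings below are illustrative, not Didactopus's actual label tables:

```python
LABELS = {
    "en": {"study_plan": "Study Plan", "conversation": "Conversation"},
    "es": {"study_plan": "Plan de estudio", "conversation": "Conversacion"},
}

def localized_label(key, language):
    """Deterministic learner-facing label: exact table lookup, with an
    English fallback for unsupported languages or missing keys."""
    return LABELS.get(language, LABELS["en"]).get(key, LABELS["en"][key])
```

Because the table is static, the same session always renders the same labels, which keeps screen-reader navigation predictable.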
The plain-text output is a linearized learner-session transcript that is suitable for:


@ -89,6 +89,15 @@ The current heuristic scoring asks whether each role does the right kind of work
This is deliberately narrower than a general-purpose benchmark. Didactopus cares about trustworthy learner guidance, not maximal generic fluency.
When `--language` is set to a non-English value, the benchmark now also applies a heuristic multilingual check:
- does the response appear to actually be in the target language?
- does it still preserve key grounded concept terms and caveats?
If the pack provides `multilingual_qa.yaml`, the benchmark also applies per-pack preservation checks from that spec.
For non-English runs, the benchmark now also records a round-trip warning layer by back-translating role outputs into English and checking whether required source phrases are still recoverable. This is a warning-oriented signal, not a proof of correctness.
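The "does the response appear to be in the target language?" question can be answered heuristically, for example by stopword overlap. This is a crude sketch with illustrative word sets and threshold, not the benchmark's actual detector:

```python
# Tiny stopword sets; a real check would use a larger list or a language-ID model.
STOPWORDS = {
    "es": {"el", "la", "de", "que", "y", "no", "es", "una", "los", "del"},
    "fr": {"le", "la", "de", "que", "et", "ne", "est", "une", "les", "des"},
}

def appears_to_be(text, language, threshold=2):
    """Return True if the text shares at least `threshold` stopwords
    with the target language's stopword set."""
    words = set(text.lower().split())
    return len(words & STOPWORDS.get(language, set())) >= threshold
```

Like the other layers, this yields a warning signal, not a verdict: short or term-dense responses can legitimately contain few stopwords.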
## Interpreting Ratings
- `adequate`

docs/multilingual-qa.md (new file, 80 lines)

@ -0,0 +1,80 @@
# Multilingual QA
Didactopus now supports an optional per-pack multilingual QA spec.
The goal is not to certify perfect translation quality. The goal is to make multilingual evaluation less dependent on vague fluency judgments by checking whether key terms, caveats, and forbidden confusions survive across languages.
## Spec File
Place this file in a pack directory:
- `multilingual_qa.yaml`
It is currently optional.
## Current Shape
```yaml
source_language: en
targets:
es:
required_terms:
- id: shannon-entropy
accepted:
- "entropía de shannon"
required_caveats:
- id: shannon-vs-thermo-not-identical
accepted:
- "no es idéntica"
forbidden_confusions:
- id: shannon-equals-thermodynamic-entropy
patterns:
- "es idéntica a la entropía termodinámica"
```
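Applied to a candidate output, a spec of this shape reduces to substring checks. A minimal sketch with the spec mirrored as a Python dict (the real arena/benchmark check code may differ):

```python
SPEC = {
    "targets": {
        "es": {
            "required_terms": [
                {"id": "shannon-entropy", "accepted": ["entropía de shannon"]},
            ],
            "forbidden_confusions": [
                {"id": "shannon-equals-thermodynamic-entropy",
                 "patterns": ["es idéntica a la entropía termodinámica"]},
            ],
        },
    },
}

def check_output(text, language, spec):
    """Return note strings for missing required terms and matched
    forbidden-confusion patterns in the target language."""
    target = spec["targets"].get(language, {})
    lowered = text.lower()
    notes = []
    for term in target.get("required_terms", []):
        if not any(a.lower() in lowered for a in term["accepted"]):
            notes.append(
                f"Missing required multilingual term '{term['id']}' "
                f"for language '{language}'."
            )
    for confusion in target.get("forbidden_confusions", []):
        if any(p.lower() in lowered for p in confusion["patterns"]):
            notes.append(
                f"Forbidden confusion '{confusion['id']}' detected "
                f"for language '{language}'."
            )
    return notes

ok_notes = check_output("La entropía de Shannon mide la incertidumbre.", "es", SPEC)
bad_notes = check_output("Dice que es idéntica a la entropía termodinámica.", "es", SPEC)
```

Note one limitation of plain substring matching: a correct negation such as "no es idéntica a la entropía termodinámica" still contains the forbidden pattern, which is one reason these checks are treated as warnings for review rather than hard failures.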
## Starter Generation
Didactopus can now generate a draft starter spec for reviewer refinement:
```bash
python -m didactopus.multilingual_qa_seed domain-packs/mit-ocw-information-entropy \
--out domain-packs/mit-ocw-information-entropy/multilingual_qa.seed.yaml \
--languages es fr
```
The generated `multilingual_qa.seed.yaml` is not meant to be trusted as-is. It is a reviewer aid that pulls:
- multi-word concept titles as draft required terms
- likely caveat candidates from grounded source fragments
- likely forbidden confusions derived from negated caveat language
## What It Checks
For a target language, the QA layer can check:
- required terms that should appear in acceptable translated or multilingual output
- required caveats that must survive explanation
- forbidden confusions that should trigger warnings
## Where It Is Used
This spec now feeds:
- the local model benchmark
- the Didactopus arena
Those tools still use heuristic scoring, but multilingual QA spec checks now contribute an explicit preservation signal.
## Why This Helps
This gives Didactopus a layered multilingual evaluation model:
1. language-alignment heuristics
2. term and caveat preservation checks
3. round-trip warning checks on required phrases
4. arena comparison and LLM review support
5. human bilingual review for promoted or disputed outputs
## Current Limitation
This is still a lightweight preservation framework. It does not yet prove semantic equivalence across whole explanations. It is best treated as an early QA filter and promotion aid.


@ -190,6 +190,7 @@ Examples:
- Prefer role-adequate local models over chasing a single best model.
- Keep accessibility and low-cost deployment in scope from the start, not as cleanup work.
- Preserve provenance and license compliance as first-class constraints.
- Advance the current roadmap without assuming abundant compute, fluent English, expert supervision, or mature learners.
## Suggested Implementation Sequence


@ -0,0 +1,60 @@
source_language: en
generated_by: didactopus.multilingual_qa_seed
review_status: draft-seed
targets:
es:
required_terms: &id001
- id: mit-ocw-6-050j-information-and-entropy-course-home
accepted:
- MIT OCW 6.050J Information and Entropy Course Home
- id: information-and-entropy
accepted:
- Information and Entropy
- id: ultimate-limits-to-communication-and-computation
accepted:
- Ultimate Limits to Communication and Computation
- id: open-textbooks-problem-sets-and-programming-work
accepted:
- Open Textbooks, Problem Sets, and Programming Work
- id: mit-ocw-6-050j-information-and-entropy-syllabus
accepted:
- MIT OCW 6.050J Information and Entropy Syllabus
- id: prerequisites-and-mathematical-background
accepted:
- Prerequisites and Mathematical Background
- id: assessment-structure
accepted:
- Assessment Structure
- id: course-notes-and-reference-texts
accepted:
- Course Notes and Reference Texts
- id: independent-reasoning-and-careful-comparison
accepted:
- Independent Reasoning and Careful Comparison
- id: mit-ocw-6-050j-information-and-entropy-unit-sequence
accepted:
- MIT OCW 6.050J Information and Entropy Unit Sequence
- id: counting-and-probability
accepted:
- Counting and Probability
- id: shannon-entropy
accepted:
- Shannon Entropy
required_caveats: &id002
- id: thermodynamics-and-entropy
accepted:
- Objective Explain how thermodynamic entropy relates to, and differs from,
Shannon entropy. Exercise Compare the two entropy notions and identify what
is preserved across the analogy. The course uses entropy as a bridge concept
between communication theory and physics while insisting on careful interpretation.
forbidden_confusions: &id003
- id: thermodynamics-and-entropy-confusion
patterns:
- Objective Explain how thermodynamic entropy relates to, and is identical to,
Shannon entropy. Exercise Compare the two entropy notions and identify what
is preserved across the analogy. The course uses entropy as a bridge concept
between communication theory and physics while insisting on careful interpretation.
fr:
required_terms: *id001
required_caveats: *id002
forbidden_confusions: *id003


@ -0,0 +1,59 @@
source_language: en
targets:
es:
required_terms:
- id: shannon-entropy
accepted:
- "entropia"
- "entropía"
- "entropia de shannon"
- "entropía de shannon"
- id: channel-capacity
accepted:
- "capacidad del canal"
- "capacidad de canal"
- id: thermodynamic-entropy
accepted:
- "entropia termodinamica"
- "entropía termodinámica"
required_caveats:
- id: shannon-vs-thermo-not-identical
accepted:
- "no es identica"
- "no es idéntica"
- "no son identicas"
- "no son idénticas"
- "no equivale exactamente"
forbidden_confusions:
- id: shannon-equals-thermodynamic-entropy
patterns:
- "es identica a la entropia termodinamica"
- "es idéntica a la entropía termodinámica"
- "son identicas"
- "son idénticas"
fr:
required_terms:
- id: shannon-entropy
accepted:
- "entropie"
- "entropie de shannon"
- id: channel-capacity
accepted:
- "capacite du canal"
- "capacité du canal"
- id: thermodynamic-entropy
accepted:
- "entropie thermodynamique"
required_caveats:
- id: shannon-vs-thermo-not-identical
accepted:
- "n'est pas identique"
- "ne sont pas identiques"
- "n'est pas equivalente"
- "n'est pas équivalente"
forbidden_confusions:
- id: shannon-equals-thermodynamic-entropy
patterns:
- "est identique a l'entropie thermodynamique"
- "est identique à l'entropie thermodynamique"
- "sont identiques"


@ -3,9 +3,9 @@
- Candidates: 3
## Rankings
- `stub-baseline` via `stub` / prompt variant `baseline`: borderline (0.667), language `en`
- `stub-strict-grounding` via `stub` / prompt variant `strict_grounding`: borderline (0.667), language `es`
- `stub-trust-preserving` via `stub` / prompt variant `trust_preserving`: borderline (0.667), language `fr`
- `stub-baseline` via `stub` / prompt variant `baseline`: borderline (0.733), language `en`
- `stub-strict-grounding` via `stub` / prompt variant `strict_grounding`: inadequate (0.547), language `es`
- `stub-trust-preserving` via `stub` / prompt variant `trust_preserving`: inadequate (0.547), language `fr`
## Human Review Queue
- `stub-baseline`: needs_human_review=True, weak_roles=['mentor', 'evaluator']


@ -10,7 +10,7 @@
"prompt_variant": "baseline",
"language": "en",
"provider": "stub",
"overall_score": 0.667,
"overall_score": 0.733,
"overall_rating": "borderline",
"role_results": [
{
@ -19,9 +19,11 @@
"model_name": "local-demo",
"prompt_variant": "baseline",
"language": "en",
"latency_ms": 0.027,
"adequacy_score": 0.65,
"latency_ms": 0.021,
"adequacy_score": 0.72,
"adequacy_rating": "borderline",
"grounded_score": 0.65,
"multilingual_score": 1.0,
"response_preview": "[stubbed-response] [mentor] Concept: Independent Reasoning and Careful Comparison Prerequisites: Course Notes and Reference Texts Supporting lessons",
"notes": [
"Did not ask a focused learner question."
@ -33,9 +35,11 @@
"model_name": "local-demo",
"prompt_variant": "baseline",
"language": "en",
"latency_ms": 0.006,
"latency_ms": 0.005,
"adequacy_score": 1.0,
"adequacy_rating": "adequate",
"grounded_score": 1.0,
"multilingual_score": 1.0,
"response_preview": "[stubbed-response] [practice] Concept: Independent Reasoning and Careful Comparison Prerequisites: Course Notes and Reference Texts Supporting lessons",
"notes": []
},
@ -45,9 +49,11 @@
"model_name": "local-demo",
"prompt_variant": "baseline",
"language": "en",
"latency_ms": 0.005,
"adequacy_score": 0.35,
"latency_ms": 0.004,
"adequacy_score": 0.48,
"adequacy_rating": "inadequate",
"grounded_score": 0.35,
"multilingual_score": 1.0,
"response_preview": "[stubbed-response] [evaluator] Concept: Independent Reasoning and Careful Comparison Prerequisites: Course Notes and Reference Texts Supporting lessons",
"notes": [
"Did not acknowledge learner strengths.",
@ -62,8 +68,8 @@
"prompt_variant": "strict_grounding",
"language": "es",
"provider": "stub",
"overall_score": 0.667,
"overall_rating": "borderline",
"overall_score": 0.547,
"overall_rating": "inadequate",
"role_results": [
{
"role": "mentor",
@ -71,12 +77,20 @@
"model_name": "local-demo",
"prompt_variant": "strict_grounding",
"language": "es",
"latency_ms": 0.019,
"adequacy_score": 0.65,
"adequacy_rating": "borderline",
"latency_ms": 0.028,
"adequacy_score": 0.52,
"adequacy_rating": "inadequate",
"grounded_score": 0.65,
"multilingual_score": 0.0,
"response_preview": "[stubbed-response] [mentor] Concept: Independent Reasoning and Careful Comparison Prerequisites: Course Notes and Reference Texts Supporting lessons",
"notes": [
"Did not ask a focused learner question."
"Did not ask a focused learner question.",
"Response does not appear to be in Spanish.",
"Missing required multilingual term 'shannon-entropy' for language 'es'.",
"Missing required multilingual term 'channel-capacity' for language 'es'.",
"Missing required multilingual term 'thermodynamic-entropy' for language 'es'.",
"Missing required multilingual caveat 'shannon-vs-thermo-not-identical' for language 'es'.",
"Did not visibly preserve a key grounded concept term in multilingual output."
]
},
{
@ -85,11 +99,19 @@
"model_name": "local-demo",
"prompt_variant": "strict_grounding",
"language": "es",
"latency_ms": 0.005,
"adequacy_score": 1.0,
"latency_ms": 0.006,
"adequacy_score": 0.82,
"adequacy_rating": "adequate",
"grounded_score": 1.0,
"multilingual_score": 0.1,
"response_preview": "[stubbed-response] [practice] Concept: Independent Reasoning and Careful Comparison Prerequisites: Course Notes and Reference Texts Supporting lessons",
"notes": []
"notes": [
"Response does not appear to be in Spanish.",
"Missing required multilingual term 'shannon-entropy' for language 'es'.",
"Missing required multilingual term 'channel-capacity' for language 'es'.",
"Missing required multilingual term 'thermodynamic-entropy' for language 'es'.",
"Missing required multilingual caveat 'shannon-vs-thermo-not-identical' for language 'es'."
]
},
{
"role": "evaluator",
@ -97,13 +119,20 @@
"model_name": "local-demo",
"prompt_variant": "strict_grounding",
"language": "es",
"latency_ms": 0.004,
"adequacy_score": 0.35,
"latency_ms": 0.006,
"adequacy_score": 0.3,
"adequacy_rating": "inadequate",
"grounded_score": 0.35,
"multilingual_score": 0.1,
"response_preview": "[stubbed-response] [evaluator] Concept: Independent Reasoning and Careful Comparison Prerequisites: Course Notes and Reference Texts Supporting lessons",
"notes": [
"Did not acknowledge learner strengths.",
"Did not provide a concrete next step."
"Did not provide a concrete next step.",
"Response does not appear to be in Spanish.",
"Missing required multilingual term 'shannon-entropy' for language 'es'.",
"Missing required multilingual term 'channel-capacity' for language 'es'.",
"Missing required multilingual term 'thermodynamic-entropy' for language 'es'.",
"Missing required multilingual caveat 'shannon-vs-thermo-not-identical' for language 'es'."
]
}
]
@ -114,8 +143,8 @@
"prompt_variant": "trust_preserving",
"language": "fr",
"provider": "stub",
"overall_score": 0.667,
"overall_rating": "borderline",
"overall_score": 0.547,
"overall_rating": "inadequate",
"role_results": [
{
"role": "mentor",
@ -123,12 +152,20 @@
"model_name": "local-demo",
"prompt_variant": "trust_preserving",
"language": "fr",
"latency_ms": 0.025,
"adequacy_score": 0.65,
"adequacy_rating": "borderline",
"latency_ms": 0.024,
"adequacy_score": 0.52,
"adequacy_rating": "inadequate",
"grounded_score": 0.65,
"multilingual_score": 0.0,
"response_preview": "[stubbed-response] [mentor] Concept: Independent Reasoning and Careful Comparison Prerequisites: Course Notes and Reference Texts Supporting lessons",
"notes": [
"Did not ask a focused learner question."
"Did not ask a focused learner question.",
"Response does not appear to be in French.",
"Missing required multilingual term 'shannon-entropy' for language 'fr'.",
"Missing required multilingual term 'channel-capacity' for language 'fr'.",
"Missing required multilingual term 'thermodynamic-entropy' for language 'fr'.",
"Missing required multilingual caveat 'shannon-vs-thermo-not-identical' for language 'fr'.",
"Did not visibly preserve a key grounded concept term in multilingual output."
]
},
{
@ -137,11 +174,19 @@
"model_name": "local-demo",
"prompt_variant": "trust_preserving",
"language": "fr",
"latency_ms": 0.005,
"adequacy_score": 1.0,
"latency_ms": 0.006,
"adequacy_score": 0.82,
"adequacy_rating": "adequate",
"grounded_score": 1.0,
"multilingual_score": 0.1,
"response_preview": "[stubbed-response] [practice] Concept: Independent Reasoning and Careful Comparison Prerequisites: Course Notes and Reference Texts Supporting lessons",
"notes": []
"notes": [
"Response does not appear to be in French.",
"Missing required multilingual term 'shannon-entropy' for language 'fr'.",
"Missing required multilingual term 'channel-capacity' for language 'fr'.",
"Missing required multilingual term 'thermodynamic-entropy' for language 'fr'.",
"Missing required multilingual caveat 'shannon-vs-thermo-not-identical' for language 'fr'."
]
},
{
"role": "evaluator",
@ -150,12 +195,19 @@
"prompt_variant": "trust_preserving",
"language": "fr",
"latency_ms": 0.005,
"adequacy_score": 0.35,
"adequacy_score": 0.3,
"adequacy_rating": "inadequate",
"grounded_score": 0.35,
"multilingual_score": 0.1,
"response_preview": "[stubbed-response] [evaluator] Concept: Independent Reasoning and Careful Comparison Prerequisites: Course Notes and Reference Texts Supporting lessons",
"notes": [
"Did not acknowledge learner strengths.",
"Did not provide a concrete next step."
"Did not provide a concrete next step.",
"Response does not appear to be in French.",
"Missing required multilingual term 'shannon-entropy' for language 'fr'.",
"Missing required multilingual term 'channel-capacity' for language 'fr'.",
"Missing required multilingual term 'thermodynamic-entropy' for language 'fr'.",
"Missing required multilingual caveat 'shannon-vs-thermo-not-identical' for language 'fr'."
]
}
]
@ -165,7 +217,7 @@
{
"candidate_name": "stub-baseline",
"overall_rating": "borderline",
"overall_score": 0.667,
"overall_score": 0.733,
"needs_human_review": true,
"weak_roles": [
"mentor",
@ -174,8 +226,8 @@
},
{
"candidate_name": "stub-strict-grounding",
"overall_rating": "borderline",
"overall_score": 0.667,
"overall_rating": "inadequate",
"overall_score": 0.547,
"needs_human_review": true,
"weak_roles": [
"mentor",
@ -184,8 +236,8 @@
},
{
"candidate_name": "stub-trust-preserving",
"overall_rating": "borderline",
"overall_score": 0.667,
"overall_rating": "inadequate",
"overall_score": 0.547,
"needs_human_review": true,
"weak_roles": [
"mentor",


@ -2,7 +2,7 @@
{
"candidate_name": "stub-baseline",
"overall_rating": "borderline",
"overall_score": 0.667,
"overall_score": 0.733,
"needs_human_review": true,
"weak_roles": [
"mentor",
@ -11,8 +11,8 @@
},
{
"candidate_name": "stub-strict-grounding",
"overall_rating": "borderline",
"overall_score": 0.667,
"overall_rating": "inadequate",
"overall_score": 0.547,
"needs_human_review": true,
"weak_roles": [
"mentor",
@ -21,8 +21,8 @@
},
{
"candidate_name": "stub-trust-preserving",
"overall_rating": "borderline",
"overall_score": 0.667,
"overall_rating": "inadequate",
"overall_score": 0.547,
"needs_human_review": true,
"weak_roles": [
"mentor",


@ -0,0 +1,152 @@
{
"benchmark": {
"name": "didactopus-local-model-adequacy",
"task_family": "graph-grounded-mentor-loop",
"provider": "stub",
"hardware_profile": {
"profile_name": "unspecified-local",
"cpu": "unknown",
"ram_gb": null,
"notes": ""
}
},
"context": {
"skill_name": "ocw-information-entropy-agent",
"study_plan_task": "Help a learner connect Shannon entropy, channel capacity, and thermodynamic entropy.",
"primary_concept": "Independent Reasoning and Careful Comparison",
"secondary_concept": "Thermodynamics and Entropy",
"source_language": "en",
"output_language": "es"
},
"role_results": [
{
"role": "mentor",
"provider": "stub",
"model_name": "local-demo",
"latency_ms": 0.025,
"response_preview": "[stubbed-response] [mentor] Concept: Independent Reasoning and Careful Comparison Prerequisites: Course Notes and Reference Texts Supporting lessons",
"adequacy_score": 0.52,
"adequacy_rating": "inadequate",
"grounded_score": 0.65,
"multilingual_score": 0.0,
"round_trip": {
"warnings": [
"Round-trip translation did not preserve source phrase 'entropia'.",
"Round-trip translation did not preserve source phrase 'capacidad del canal'.",
"Round-trip translation did not preserve source phrase 'entropia termodinamica'.",
"Round-trip translation did not preserve source phrase 'no es identica'."
],
"summary": {
"source_phrase_count": 4,
"round_trip_warning_count": 4,
"drifted_phrases": [
"entropia",
"capacidad del canal",
"entropia termodinamica",
"no es identica"
]
}
},
"notes": [
"Did not ask a focused learner question.",
"Response does not appear to be in Spanish.",
"Missing required multilingual term 'shannon-entropy' for language 'es'.",
"Missing required multilingual term 'channel-capacity' for language 'es'.",
"Missing required multilingual term 'thermodynamic-entropy' for language 'es'.",
"Missing required multilingual caveat 'shannon-vs-thermo-not-identical' for language 'es'.",
"Did not visibly preserve a key grounded concept term in multilingual output.",
"Round-trip translation did not preserve source phrase 'entropia'.",
"Round-trip translation did not preserve source phrase 'capacidad del canal'.",
"Round-trip translation did not preserve source phrase 'entropia termodinamica'.",
"Round-trip translation did not preserve source phrase 'no es identica'."
]
},
{
"role": "practice",
"provider": "stub",
"model_name": "local-demo",
"latency_ms": 0.004,
"response_preview": "[stubbed-response] [practice] Concept: Independent Reasoning and Careful Comparison Prerequisites: Course Notes and Reference Texts Supporting lessons",
"adequacy_score": 0.82,
"adequacy_rating": "adequate",
"grounded_score": 1.0,
"multilingual_score": 0.1,
"round_trip": {
"warnings": [
"Round-trip translation did not preserve source phrase 'entropia'.",
"Round-trip translation did not preserve source phrase 'capacidad del canal'.",
"Round-trip translation did not preserve source phrase 'entropia termodinamica'.",
"Round-trip translation did not preserve source phrase 'no es identica'."
],
"summary": {
"source_phrase_count": 4,
"round_trip_warning_count": 4,
"drifted_phrases": [
"entropia",
"capacidad del canal",
"entropia termodinamica",
"no es identica"
]
}
},
"notes": [
"Response does not appear to be in Spanish.",
"Missing required multilingual term 'shannon-entropy' for language 'es'.",
"Missing required multilingual term 'channel-capacity' for language 'es'.",
"Missing required multilingual term 'thermodynamic-entropy' for language 'es'.",
"Missing required multilingual caveat 'shannon-vs-thermo-not-identical' for language 'es'.",
"Round-trip translation did not preserve source phrase 'entropia'.",
"Round-trip translation did not preserve source phrase 'capacidad del canal'.",
"Round-trip translation did not preserve source phrase 'entropia termodinamica'.",
"Round-trip translation did not preserve source phrase 'no es identica'."
]
},
{
"role": "evaluator",
"provider": "stub",
"model_name": "local-demo",
"latency_ms": 0.004,
"response_preview": "[stubbed-response] [evaluator] Concept: Independent Reasoning and Careful Comparison Prerequisites: Course Notes and Reference Texts Supporting lessons",
"adequacy_score": 0.3,
"adequacy_rating": "inadequate",
"grounded_score": 0.35,
"multilingual_score": 0.1,
"round_trip": {
"warnings": [
"Round-trip translation did not preserve source phrase 'entropia'.",
"Round-trip translation did not preserve source phrase 'capacidad del canal'.",
"Round-trip translation did not preserve source phrase 'entropia termodinamica'.",
"Round-trip translation did not preserve source phrase 'no es identica'."
],
"summary": {
"source_phrase_count": 4,
"round_trip_warning_count": 4,
"drifted_phrases": [
"entropia",
"capacidad del canal",
"entropia termodinamica",
"no es identica"
]
}
},
"notes": [
"Did not acknowledge learner strengths.",
"Did not provide a concrete next step.",
"Response does not appear to be in Spanish.",
"Missing required multilingual term 'shannon-entropy' for language 'es'.",
"Missing required multilingual term 'channel-capacity' for language 'es'.",
"Missing required multilingual term 'thermodynamic-entropy' for language 'es'.",
"Missing required multilingual caveat 'shannon-vs-thermo-not-identical' for language 'es'.",
"Round-trip translation did not preserve source phrase 'entropia'.",
"Round-trip translation did not preserve source phrase 'capacidad del canal'.",
"Round-trip translation did not preserve source phrase 'entropia termodinamica'.",
"Round-trip translation did not preserve source phrase 'no es identica'."
]
}
],
"summary": {
"overall_adequacy_score": 0.547,
"overall_adequacy_rating": "inadequate",
"recommended_use": "Not recommended for learner-facing local deployment."
}
}


@ -0,0 +1,16 @@
# Didactopus Local Model Benchmark
- Provider: `stub`
- Hardware profile: `unspecified-local`
- Primary concept: Independent Reasoning and Careful Comparison
- Secondary concept: Thermodynamics and Entropy
- Overall adequacy: inadequate (0.547)
- Recommended use: Not recommended for learner-facing local deployment.
## Role Results
- `mentor` via `local-demo`: inadequate (0.52), latency 0.025 ms
Notes: Did not ask a focused learner question.; Response does not appear to be in Spanish.; Missing required multilingual term 'shannon-entropy' for language 'es'.; Missing required multilingual term 'channel-capacity' for language 'es'.; Missing required multilingual term 'thermodynamic-entropy' for language 'es'.; Missing required multilingual caveat 'shannon-vs-thermo-not-identical' for language 'es'.; Did not visibly preserve a key grounded concept term in multilingual output.; Round-trip translation did not preserve source phrase 'entropia'.; Round-trip translation did not preserve source phrase 'capacidad del canal'.; Round-trip translation did not preserve source phrase 'entropia termodinamica'.; Round-trip translation did not preserve source phrase 'no es identica'.
- `practice` via `local-demo`: adequate (0.82), latency 0.004 ms
Notes: Response does not appear to be in Spanish.; Missing required multilingual term 'shannon-entropy' for language 'es'.; Missing required multilingual term 'channel-capacity' for language 'es'.; Missing required multilingual term 'thermodynamic-entropy' for language 'es'.; Missing required multilingual caveat 'shannon-vs-thermo-not-identical' for language 'es'.; Round-trip translation did not preserve source phrase 'entropia'.; Round-trip translation did not preserve source phrase 'capacidad del canal'.; Round-trip translation did not preserve source phrase 'entropia termodinamica'.; Round-trip translation did not preserve source phrase 'no es identica'.
- `evaluator` via `local-demo`: inadequate (0.3), latency 0.004 ms
Notes: Did not acknowledge learner strengths.; Did not provide a concrete next step.; Response does not appear to be in Spanish.; Missing required multilingual term 'shannon-entropy' for language 'es'.; Missing required multilingual term 'channel-capacity' for language 'es'.; Missing required multilingual term 'thermodynamic-entropy' for language 'es'.; Missing required multilingual caveat 'shannon-vs-thermo-not-identical' for language 'es'.; Round-trip translation did not preserve source phrase 'entropia'.; Round-trip translation did not preserve source phrase 'capacidad del canal'.; Round-trip translation did not preserve source phrase 'entropia termodinamica'.; Round-trip translation did not preserve source phrase 'no es identica'.


@ -1,9 +1,9 @@
<!doctype html>
<html lang="en">
<html lang="es">
<head>
<meta charset="utf-8">
<meta name="viewport" content="width=device-width, initial-scale=1">
<title>Didactopus Learner Session</title>
<title>Sesion de aprendizaje de Didactopus</title>
<style>
:root { color-scheme: light; --bg: #f7f4ed; --panel: #fffdf8; --ink: #1e2b31; --muted: #53656d; --line: #d3c8b7; --accent: #155e63; }
body { margin: 0; font-family: Georgia, 'Times New Roman', serif; background: var(--bg); color: var(--ink); line-height: 1.55; }
@ -21,24 +21,24 @@ ol, ul { padding-left: 22px; }
</style>
</head>
<body>
<a class="skip" href="#session-main">Skip to learner session</a>
<a class="skip" href="#session-main">Saltar a la sesion de aprendizaje</a>
<main id="session-main" aria-label="Didactopus learner session">
<section aria-labelledby="session-title">
<h1 id="session-title">Didactopus Learner Session</h1>
<p class="sr-note">This page is structured for keyboard and screen-reader use. It presents the learner goal, study plan, grounded source fragments, and conversation turns in reading order.</p>
<p><strong>Learner goal:</strong> Help me understand how Shannon entropy leads into channel capacity and thermodynamic entropy.</p>
<p><strong>Source language:</strong> en</p>
<p><strong>Output language:</strong> es</p>
<h1 id="session-title">Sesion de aprendizaje de Didactopus</h1>
<p class="sr-note">Esta pagina esta estructurada para uso con teclado y lector de pantalla. Presenta el objetivo del aprendiz, el plan de estudio, los fragmentos de fundamento y los turnos de conversacion en orden de lectura.</p>
<p><strong>Objetivo del aprendiz:</strong> Help me understand how Shannon entropy leads into channel capacity and thermodynamic entropy.</p>
<p><strong>Idioma de origen:</strong> en</p>
<p><strong>Idioma de salida:</strong> es</p>
</section>
<section aria-labelledby="study-plan-title">
<h2 id="study-plan-title">Study Plan</h2>
<h2 id="study-plan-title">Plan de estudio</h2>
<ol>
<li>
<h3>Independent Reasoning and Careful Comparison</h3>
<p><strong>Status:</strong> mastered</p>
<p><strong>Prerequisites:</strong> Course Notes and Reference Texts</p>
<p><strong>Supporting lessons:</strong> Independent Reasoning and Careful Comparison</p>
<p><strong>Grounding fragments:</strong></p>
<p><strong>Estado:</strong> mastered</p>
<p><strong>Prerrequisitos:</strong> Course Notes and Reference Texts</p>
<p><strong>Lecciones de apoyo:</strong> Independent Reasoning and Careful Comparison</p>
<p><strong>Fragmentos de fundamento:</strong></p>
<ul>
<li><div class="fragment"><strong>Independent Reasoning and Careful Comparison</strong> (lesson_body)<br>- Objective: Explain why the course requires precise comparison of related but non-identical concepts.
- Exercise: Write a short note distinguishing Shannon entropy, channel capacity, and thermodynamic entropy.
@ -48,10 +48,10 @@ The syllabus framing implies a style of work where analogy is useful but dangero
</li>
<li>
<h3>Thermodynamics and Entropy</h3>
<p><strong>Status:</strong> mastered</p>
<p><strong>Prerequisites:</strong> Cryptography and Information Hiding</p>
<p><strong>Supporting lessons:</strong> Thermodynamics and Entropy</p>
<p><strong>Grounding fragments:</strong></p>
<p><strong>Estado:</strong> mastered</p>
<p><strong>Prerrequisitos:</strong> Cryptography and Information Hiding</p>
<p><strong>Lecciones de apoyo:</strong> Thermodynamics and Entropy</p>
<p><strong>Fragmentos de fundamento:</strong></p>
<ul>
<li><div class="fragment"><strong>Thermodynamics and Entropy</strong> (lesson_body)<br>- Objective: Explain how thermodynamic entropy relates to, and differs from, Shannon entropy.
- Exercise: Compare the two entropy notions and identify what is preserved across the analogy.
@ -61,10 +61,10 @@ The course uses entropy as a bridge concept between communication theory and phy
</li>
<li>
<h3>Shannon Entropy</h3>
<p><strong>Status:</strong> mastered</p>
<p><strong>Prerequisites:</strong> Counting and Probability</p>
<p><strong>Supporting lessons:</strong> Shannon Entropy</p>
<p><strong>Grounding fragments:</strong></p>
<p><strong>Estado:</strong> mastered</p>
<p><strong>Prerrequisitos:</strong> Counting and Probability</p>
<p><strong>Lecciones de apoyo:</strong> Shannon Entropy</p>
<p><strong>Fragmentos de fundamento:</strong></p>
<ul>
<li><div class="fragment"><strong>Shannon Entropy</strong> (lesson_body)<br>- Objective: Explain Shannon entropy as a measure of uncertainty and compare high-entropy and low-entropy sources.
- Exercise: Compute the entropy of a Bernoulli source and interpret the result.
@ -75,7 +75,7 @@ The course then introduces entropy as a quantitative measure of uncertainty for
</ol>
</section>
<section aria-labelledby="conversation-title">
<h2 id="conversation-title">Conversation</h2>
<h2 id="conversation-title">Conversacion</h2>
<article class="turn" aria-label="Conversation turn">
<h3>Learner Goal</h3>
<p class="meta">Role: user</p>
@ -108,10 +108,10 @@ The course then introduces entropy as a quantitative measure of uncertainty for
</article>
</section>
<section aria-labelledby="evaluation-title">
<h2 id="evaluation-title">Evaluation Summary</h2>
<p><strong>Verdict:</strong> needs_revision</p>
<p><strong>Aggregated dimensions:</strong> {&quot;correctness&quot;: 0.6000000000000001, &quot;critique&quot;: 0.6499999999999999, &quot;explanation&quot;: 0.85}</p>
<p><strong>Follow-up:</strong> Rework the answer so it states the equality/relationship explicitly and explains why it matters.</p>
<h2 id="evaluation-title">Resumen de evaluacion</h2>
<p><strong>Veredicto:</strong> needs_revision</p>
<p><strong>Dimensiones agregadas:</strong> {&quot;correctness&quot;: 0.6000000000000001, &quot;critique&quot;: 0.6499999999999999, &quot;explanation&quot;: 0.85}</p>
<p><strong>Siguiente paso:</strong> Rework the answer so it states the equality/relationship explicitly and explains why it matters.</p>
</section>
</main>
</body>

View File

@ -1,36 +1,36 @@
Didactopus Learner Session
Sesion de aprendizaje de Didactopus
Learner goal: Help me understand how Shannon entropy leads into channel capacity and thermodynamic entropy.
Source language: en
Output language: es
Objetivo del aprendiz: Help me understand how Shannon entropy leads into channel capacity and thermodynamic entropy.
Idioma de origen: en
Idioma de salida: es
Study plan:
Plan de estudio:
1. Independent Reasoning and Careful Comparison
Status: mastered
Prerequisites: Course Notes and Reference Texts
Supporting lessons: Independent Reasoning and Careful Comparison
Source fragment (lesson_body): - Objective: Explain why the course requires precise comparison of related but non-identical concepts.
Estado: mastered
Prerrequisitos: Course Notes and Reference Texts
Lecciones de apoyo: Independent Reasoning and Careful Comparison
Fragmento de fuente (lesson_body): - Objective: Explain why the course requires precise comparison of related but non-identical concepts.
- Exercise: Write a short note distinguishing Shannon entropy, channel capacity, and thermodynamic entropy.
The syllabus framing implies a style of work where analogy is useful but dangerous when used loosely. Learners must compare models carefully, state assumptions, and notice where similar mathematics does not imply identical interpretation.
Source fragment (objective): Explain why the course requires precise comparison of related but non-identical concepts.
Fragmento de fuente (objective): Explain why the course requires precise comparison of related but non-identical concepts.
2. Thermodynamics and Entropy
Status: mastered
Prerequisites: Cryptography and Information Hiding
Supporting lessons: Thermodynamics and Entropy
Source fragment (lesson_body): - Objective: Explain how thermodynamic entropy relates to, and differs from, Shannon entropy.
Estado: mastered
Prerrequisitos: Cryptography and Information Hiding
Lecciones de apoyo: Thermodynamics and Entropy
Fragmento de fuente (lesson_body): - Objective: Explain how thermodynamic entropy relates to, and differs from, Shannon entropy.
- Exercise: Compare the two entropy notions and identify what is preserved across the analogy.
The course uses entropy as a bridge concept between communication theory and physics while insisting on careful interpretation.
Source fragment (objective): Explain how thermodynamic entropy relates to, and differs from, Shannon entropy.
Fragmento de fuente (objective): Explain how thermodynamic entropy relates to, and differs from, Shannon entropy.
3. Shannon Entropy
Status: mastered
Prerequisites: Counting and Probability
Supporting lessons: Shannon Entropy
Source fragment (lesson_body): - Objective: Explain Shannon entropy as a measure of uncertainty and compare high-entropy and low-entropy sources.
Estado: mastered
Prerrequisitos: Counting and Probability
Lecciones de apoyo: Shannon Entropy
Fragmento de fuente (lesson_body): - Objective: Explain Shannon entropy as a measure of uncertainty and compare high-entropy and low-entropy sources.
- Exercise: Compute the entropy of a Bernoulli source and interpret the result.
The course then introduces entropy as a quantitative measure of uncertainty for a source model and uses it to reason about representation cost and surprise.
Source fragment (objective): Explain Shannon entropy as a measure of uncertainty and compare high-entropy and low-entropy sources.
Fragmento de fuente (objective): Explain Shannon entropy as a measure of uncertainty and compare high-entropy and low-entropy sources.
Conversation:
Conversacion:
Learner Goal:
Help me understand how Shannon entropy leads into channel capacity and thermodynamic entropy.
@ -49,7 +49,7 @@ Didactopus Evaluator:
Didactopus Mentor:
[stubbed-response] [mentor] Concept: Independent Reasoning and Careful Comparison Prerequisites: Course Notes and Reference Texts Supporting lessons
Evaluation summary:
Verdict: needs_revision
Aggregated dimensions: {"correctness": 0.6000000000000001, "critique": 0.6499999999999999, "explanation": 0.85}
Follow-up: Rework the answer so it states the equality/relationship explicitly and explains why it matters.
Resumen de evaluacion:
Veredicto: needs_revision
Dimensiones agregadas: {"correctness": 0.6000000000000001, "critique": 0.6499999999999999, "explanation": 0.85}
Siguiente paso: Rework the answer so it states the equality/relationship explicitly and explains why it matters.

View File

@ -0,0 +1,59 @@
source_language: en
targets:
es:
required_terms:
- id: shannon-entropy
accepted:
- "entropia"
- "entropía"
- "entropia de shannon"
- "entropía de shannon"
- id: channel-capacity
accepted:
- "capacidad del canal"
- "capacidad de canal"
- id: thermodynamic-entropy
accepted:
- "entropia termodinamica"
- "entropía termodinámica"
required_caveats:
- id: shannon-vs-thermo-not-identical
accepted:
- "no es identica"
- "no es idéntica"
- "no son identicas"
- "no son idénticas"
- "no equivale exactamente"
forbidden_confusions:
- id: shannon-equals-thermodynamic-entropy
patterns:
- "es identica a la entropia termodinamica"
- "es idéntica a la entropía termodinámica"
- "son identicas"
- "son idénticas"
fr:
required_terms:
- id: shannon-entropy
accepted:
- "entropie"
- "entropie de shannon"
- id: channel-capacity
accepted:
- "capacite du canal"
- "capacité du canal"
- id: thermodynamic-entropy
accepted:
- "entropie thermodynamique"
required_caveats:
- id: shannon-vs-thermo-not-identical
accepted:
- "n'est pas identique"
- "ne sont pas identiques"
- "n'est pas equivalente"
- "n'est pas équivalente"
forbidden_confusions:
- id: shannon-equals-thermodynamic-entropy
patterns:
- "est identique a l'entropie thermodynamique"
- "est identique à l'entropie thermodynamique"
- "sont identiques"

View File

@ -9,8 +9,16 @@ import yaml
from .config import load_config
from .language_support import response_language_instruction
from .learner_session import _grounding_block
from .model_bench import _adequacy_rating, _score_evaluator_response, _score_mentor_response, _score_practice_response
from .model_bench import (
_adequacy_rating,
_multilingual_score,
_round_trip_phrases,
_score_evaluator_response,
_score_mentor_response,
_score_practice_response,
)
from .model_provider import ModelProvider
from .multilingual_qa import round_trip_warning_for_phrases
from .ocw_skill_agent_demo import build_skill_grounded_study_plan, evaluate_submission_with_skill, load_ocw_skill_context
from .role_prompts import system_prompt_for_role_variant
@ -110,7 +118,24 @@ def _run_candidate(candidate: dict, skill_dir: str | Path) -> dict:
)
elapsed_ms = round((perf_counter() - started) * 1000.0, 3)
score, notes = _scorer_for_role(role)(response.text)
overall += score
multilingual_score, multilingual_notes = _multilingual_score(role, response.text, language, context.multilingual_qa)
combined_score = (score * 0.8) + (multilingual_score * 0.2)
round_trip = {"warnings": [], "summary": {"source_phrase_count": 0, "round_trip_warning_count": 0, "drifted_phrases": []}}
if language != "en":
source_phrases = _round_trip_phrases(context.multilingual_qa, language)
if source_phrases:
back_translation = provider.generate(
(
"Translate the following text into English as faithfully as possible, preserving technical meaning and caveats.\n\n"
f"{response.text}"
),
role=role,
system_prompt=system_prompt_for_role_variant(role, variant),
temperature=0.0,
max_tokens=220,
).text
round_trip = round_trip_warning_for_phrases(source_phrases, back_translation)
overall += combined_score
role_results.append(
{
"role": role,
@ -119,10 +144,13 @@ def _run_candidate(candidate: dict, skill_dir: str | Path) -> dict:
"prompt_variant": variant,
"language": language,
"latency_ms": elapsed_ms,
"adequacy_score": round(score, 3),
"adequacy_rating": _adequacy_rating(score),
"adequacy_score": round(combined_score, 3),
"adequacy_rating": _adequacy_rating(combined_score),
"grounded_score": round(score, 3),
"multilingual_score": round(multilingual_score, 3),
"round_trip": round_trip,
"response_preview": response.text[:280],
"notes": notes,
"notes": [*notes, *multilingual_notes, *round_trip["warnings"]],
}
)

View File

@ -14,11 +14,84 @@ LANGUAGE_LABELS = {
"ja": "Japanese",
}
UI_STRINGS = {
"en": {
"didactopus_learner_session": "Didactopus Learner Session",
"learner_goal": "Learner goal",
"source_language": "Source language",
"output_language": "Output language",
"study_plan": "Study Plan",
"conversation": "Conversation",
"evaluation_summary": "Evaluation Summary",
"verdict": "Verdict",
"aggregated_dimensions": "Aggregated dimensions",
"follow_up": "Follow-up",
"status": "Status",
"prerequisites": "Prerequisites",
"supporting_lessons": "Supporting lessons",
"grounding_fragments": "Grounding fragments",
"source_fragment": "Source fragment",
"skip_to_session": "Skip to learner session",
"screen_reader_note": "This page is structured for keyboard and screen-reader use. It presents the learner goal, study plan, grounded source fragments, and conversation turns in reading order.",
},
"es": {
"didactopus_learner_session": "Sesion de aprendizaje de Didactopus",
"learner_goal": "Objetivo del aprendiz",
"source_language": "Idioma de origen",
"output_language": "Idioma de salida",
"study_plan": "Plan de estudio",
"conversation": "Conversacion",
"evaluation_summary": "Resumen de evaluacion",
"verdict": "Veredicto",
"aggregated_dimensions": "Dimensiones agregadas",
"follow_up": "Siguiente paso",
"status": "Estado",
"prerequisites": "Prerrequisitos",
"supporting_lessons": "Lecciones de apoyo",
"grounding_fragments": "Fragmentos de fundamento",
"source_fragment": "Fragmento de fuente",
"skip_to_session": "Saltar a la sesion de aprendizaje",
"screen_reader_note": "Esta pagina esta estructurada para uso con teclado y lector de pantalla. Presenta el objetivo del aprendiz, el plan de estudio, los fragmentos de fundamento y los turnos de conversacion en orden de lectura.",
},
"fr": {
"didactopus_learner_session": "Session d'apprentissage Didactopus",
"learner_goal": "Objectif de l'apprenant",
"source_language": "Langue source",
"output_language": "Langue de sortie",
"study_plan": "Plan d'etude",
"conversation": "Conversation",
"evaluation_summary": "Resume de l'evaluation",
"verdict": "Verdict",
"aggregated_dimensions": "Dimensions agregees",
"follow_up": "Etape suivante",
"status": "Statut",
"prerequisites": "Prerquis",
"supporting_lessons": "Lecons de soutien",
"grounding_fragments": "Fragments d'ancrage",
"source_fragment": "Fragment source",
"skip_to_session": "Aller a la session d'apprentissage",
"screen_reader_note": "Cette page est structuree pour une utilisation au clavier et avec un lecteur d'ecran. Elle presente l'objectif de l'apprenant, le plan d'etude, les fragments d'ancrage et les tours de conversation dans l'ordre de lecture.",
},
}
LANGUAGE_MARKERS = {
"es": (" el ", " la ", " de ", " y ", " que ", " para ", " no ", "una ", "un "),
"fr": (" le ", " la ", " de ", " et ", " que ", " pour ", " pas ", "une ", "un "),
"de": (" der ", " die ", " und ", " nicht ", " ist ", " fur "),
"pt": (" o ", " a ", " de ", " e ", " para ", " nao "),
"it": (" il ", " la ", " di ", " e ", " per ", " non "),
}
def language_label(language: str) -> str:
return LANGUAGE_LABELS.get(language, language)
def ui_text(key: str, language: str) -> str:
table = UI_STRINGS.get(language, UI_STRINGS["en"])
return table.get(key, UI_STRINGS["en"].get(key, key))
def response_language_instruction(language: str, source_language: str = "en") -> str:
if language == source_language:
return ""
@ -26,3 +99,18 @@ def response_language_instruction(language: str, source_language: str = "en") ->
f" Respond in {language_label(language)}. Preserve key source-grounded concepts and caveats faithfully, "
f"and make clear when you are explaining material whose source language is {language_label(source_language)}."
)
def language_alignment_score(text: str, language: str) -> tuple[float, list[str]]:
if language == "en":
return 1.0, []
lowered = f" {text.lower()} "
markers = LANGUAGE_MARKERS.get(language)
if markers is None:
return 0.5, [f"No language-specific heuristic markers are defined for {language} yet."]
marker_hits = sum(1 for marker in markers if marker in lowered)
if marker_hits >= 2:
return 1.0, []
if marker_hits == 1:
return 0.6, [f"Only weak evidence that the response is actually in {language_label(language)}."]
return 0.0, [f"Response does not appear to be in {language_label(language)}."]

View File

@ -4,36 +4,38 @@ import html
import json
from pathlib import Path
from .language_support import ui_text
def _escape(value: object) -> str:
return html.escape(str(value))
def build_accessible_session_text(session: dict) -> str:
language = str(session.get("output_language", "en"))
lines = [
"Didactopus Learner Session",
ui_text("didactopus_learner_session", language),
"",
f"Learner goal: {session.get('goal', '')}",
f"Source language: {session.get('source_language', 'en')}",
f"Output language: {session.get('output_language', 'en')}",
f"{ui_text('learner_goal', language)}: {session.get('goal', '')}",
f"{ui_text('source_language', language)}: {session.get('source_language', 'en')}",
f"{ui_text('output_language', language)}: {session.get('output_language', 'en')}",
"",
"Study plan:",
f"{ui_text('study_plan', language)}:",
]
for index, step in enumerate(session.get("study_plan", {}).get("steps", []), start=1):
lines.extend(
[
f"{index}. {step.get('title', '')}",
f" Status: {step.get('status', '')}",
f" Prerequisites: {', '.join(step.get('prerequisite_titles', []) or ['none explicit'])}",
f" Supporting lessons: {', '.join(step.get('supporting_lessons', []) or ['none listed'])}",
f" {ui_text('status', language)}: {step.get('status', '')}",
f" {ui_text('prerequisites', language)}: {', '.join(step.get('prerequisite_titles', []) or ['none explicit'])}",
f" {ui_text('supporting_lessons', language)}: {', '.join(step.get('supporting_lessons', []) or ['none listed'])}",
]
)
for fragment in step.get("source_fragments", [])[:2]:
lines.append(f" Source fragment ({fragment.get('kind', 'fragment')}): {fragment.get('text', '')}")
lines.append(f" {ui_text('source_fragment', language)} ({fragment.get('kind', 'fragment')}): {fragment.get('text', '')}")
lines.extend(
[
"",
"Conversation:",
f"{ui_text('conversation', language)}:",
]
)
for turn in session.get("turns", []):
@ -47,26 +49,27 @@ def build_accessible_session_text(session: dict) -> str:
evaluation = session.get("evaluation", {})
lines.extend(
[
"Evaluation summary:",
f"Verdict: {evaluation.get('verdict', '')}",
f"Aggregated dimensions: {json.dumps(evaluation.get('aggregated', {}), sort_keys=True)}",
f"Follow-up: {evaluation.get('follow_up', '')}",
f"{ui_text('evaluation_summary', language)}:",
f"{ui_text('verdict', language)}: {evaluation.get('verdict', '')}",
f"{ui_text('aggregated_dimensions', language)}: {json.dumps(evaluation.get('aggregated', {}), sort_keys=True)}",
f"{ui_text('follow_up', language)}: {evaluation.get('follow_up', '')}",
]
)
return "\n".join(lines).strip() + "\n"
def build_accessible_session_html(session: dict) -> str:
language = str(session.get("output_language", "en"))
steps = session.get("study_plan", {}).get("steps", [])
turns = session.get("turns", [])
evaluation = session.get("evaluation", {})
body = [
"<!doctype html>",
'<html lang="en">',
f'<html lang="{_escape(language)}">',
"<head>",
'<meta charset="utf-8">',
'<meta name="viewport" content="width=device-width, initial-scale=1">',
"<title>Didactopus Learner Session</title>",
f"<title>{_escape(ui_text('didactopus_learner_session', language))}</title>",
"<style>",
":root { color-scheme: light; --bg: #f7f4ed; --panel: #fffdf8; --ink: #1e2b31; --muted: #53656d; --line: #d3c8b7; --accent: #155e63; }",
"body { margin: 0; font-family: Georgia, 'Times New Roman', serif; background: var(--bg); color: var(--ink); line-height: 1.55; }",
@ -84,32 +87,32 @@ def build_accessible_session_html(session: dict) -> str:
"</style>",
"</head>",
"<body>",
'<a class="skip" href="#session-main">Skip to learner session</a>',
f'<a class="skip" href="#session-main">{_escape(ui_text("skip_to_session", language))}</a>',
'<main id="session-main" aria-label="Didactopus learner session">',
'<section aria-labelledby="session-title">',
'<h1 id="session-title">Didactopus Learner Session</h1>',
'<p class="sr-note">This page is structured for keyboard and screen-reader use. It presents the learner goal, study plan, grounded source fragments, and conversation turns in reading order.</p>',
f"<p><strong>Learner goal:</strong> {_escape(session.get('goal', ''))}</p>",
f"<p><strong>Source language:</strong> {_escape(session.get('source_language', 'en'))}</p>",
f"<p><strong>Output language:</strong> {_escape(session.get('output_language', 'en'))}</p>",
f'<h1 id="session-title">{_escape(ui_text("didactopus_learner_session", language))}</h1>',
f'<p class="sr-note">{_escape(ui_text("screen_reader_note", language))}</p>',
f"<p><strong>{_escape(ui_text('learner_goal', language))}:</strong> {_escape(session.get('goal', ''))}</p>",
f"<p><strong>{_escape(ui_text('source_language', language))}:</strong> {_escape(session.get('source_language', 'en'))}</p>",
f"<p><strong>{_escape(ui_text('output_language', language))}:</strong> {_escape(session.get('output_language', 'en'))}</p>",
"</section>",
'<section aria-labelledby="study-plan-title">',
'<h2 id="study-plan-title">Study Plan</h2>',
f'<h2 id="study-plan-title">{_escape(ui_text("study_plan", language))}</h2>',
'<ol>',
]
for step in steps:
body.append("<li>")
body.append(f"<h3>{_escape(step.get('title', ''))}</h3>")
body.append(f"<p><strong>Status:</strong> {_escape(step.get('status', ''))}</p>")
body.append(f"<p><strong>{_escape(ui_text('status', language))}:</strong> {_escape(step.get('status', ''))}</p>")
body.append(
f"<p><strong>Prerequisites:</strong> {_escape(', '.join(step.get('prerequisite_titles', []) or ['none explicit']))}</p>"
f"<p><strong>{_escape(ui_text('prerequisites', language))}:</strong> {_escape(', '.join(step.get('prerequisite_titles', []) or ['none explicit']))}</p>"
)
body.append(
f"<p><strong>Supporting lessons:</strong> {_escape(', '.join(step.get('supporting_lessons', []) or ['none listed']))}</p>"
f"<p><strong>{_escape(ui_text('supporting_lessons', language))}:</strong> {_escape(', '.join(step.get('supporting_lessons', []) or ['none listed']))}</p>"
)
fragments = step.get("source_fragments", [])[:2]
if fragments:
body.append("<p><strong>Grounding fragments:</strong></p>")
body.append(f"<p><strong>{_escape(ui_text('grounding_fragments', language))}:</strong></p>")
body.append("<ul>")
for fragment in fragments:
body.append(
@ -123,7 +126,7 @@ def build_accessible_session_html(session: dict) -> str:
"</ol>",
"</section>",
'<section aria-labelledby="conversation-title">',
'<h2 id="conversation-title">Conversation</h2>',
f'<h2 id="conversation-title">{_escape(ui_text("conversation", language))}</h2>',
]
)
for turn in turns:
@ -136,10 +139,10 @@ def build_accessible_session_html(session: dict) -> str:
[
"</section>",
'<section aria-labelledby="evaluation-title">',
'<h2 id="evaluation-title">Evaluation Summary</h2>',
f"<p><strong>Verdict:</strong> {_escape(evaluation.get('verdict', ''))}</p>",
f"<p><strong>Aggregated dimensions:</strong> {_escape(json.dumps(evaluation.get('aggregated', {}), sort_keys=True))}</p>",
f"<p><strong>Follow-up:</strong> {_escape(evaluation.get('follow_up', ''))}</p>",
f'<h2 id="evaluation-title">{_escape(ui_text("evaluation_summary", language))}</h2>',
f"<p><strong>{_escape(ui_text('verdict', language))}:</strong> {_escape(evaluation.get('verdict', ''))}</p>",
f"<p><strong>{_escape(ui_text('aggregated_dimensions', language))}:</strong> {_escape(json.dumps(evaluation.get('aggregated', {}), sort_keys=True))}</p>",
f"<p><strong>{_escape(ui_text('follow_up', language))}:</strong> {_escape(evaluation.get('follow_up', ''))}</p>",
"</section>",
"</main>",
"</body>",

View File

@ -5,9 +5,10 @@ from pathlib import Path
from time import perf_counter
from .config import load_config
from .language_support import response_language_instruction
from .language_support import language_alignment_score, response_language_instruction
from .learner_session import _grounding_block
from .model_provider import ModelProvider
from .multilingual_qa import multilingual_qa_for_text, round_trip_warning_for_phrases
from .ocw_skill_agent_demo import build_skill_grounded_study_plan, evaluate_submission_with_skill, load_ocw_skill_context
from .role_prompts import system_prompt_for_role
@ -77,6 +78,47 @@ def _adequacy_rating(score: float) -> str:
return "inadequate"
def _multilingual_score(role: str, text: str, language: str, qa_spec: dict | None = None) -> tuple[float, list[str]]:
score, notes = language_alignment_score(text, language)
if language == "en":
return score, notes
qa_score = 1.0
qa_notes: list[str] = []
if qa_spec:
qa_result = multilingual_qa_for_text(qa_spec, language=language, text=text)
qa_notes = list(qa_result["warnings"])
summary = qa_result["summary"]
denominator = summary["required_term_count"] + summary["required_caveat_count"] + summary["forbidden_confusion_count"]
numerator = summary["matched_term_count"] + summary["matched_caveat_count"] + (
summary["forbidden_confusion_count"] - summary["confusion_hit_count"]
)
if denominator > 0:
qa_score = max(0.0, min(1.0, numerator / denominator))
role_lower = role.lower()
if role_lower == "mentor" and "entropy" not in text.lower():
qa_notes = list(qa_notes)
qa_notes.append("Did not visibly preserve a key grounded concept term in multilingual output.")
qa_score = max(0.0, qa_score - 0.2)
combined = (score * 0.5) + (qa_score * 0.5)
return combined, [*notes, *qa_notes]
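Numerically, the two weightings compose like this (the component scores here are hypothetical, chosen only to make the blend concrete):

```python
# Hypothetical component scores, only to illustrate the weighting above.
alignment_score = 1.0   # language_alignment_score result
qa_score = 0.75         # QA-spec term/caveat/confusion ratio
multilingual = 0.5 * alignment_score + 0.5 * qa_score  # inside _multilingual_score
grounded = 0.9          # role scorer result
combined = 0.8 * grounded + 0.2 * multilingual         # reported adequacy score
print(round(multilingual, 3), round(combined, 3))  # 0.875 0.895
```

The grounded behavior score still dominates at 80%, so the multilingual check nudges rankings rather than overriding them.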
def _round_trip_phrases(qa_spec: dict | None, language: str) -> list[str]:
if not qa_spec or language == "en":
return []
target = (qa_spec.get("targets", {}) or {}).get(language, {}) or {}
phrases: list[str] = []
for entry in target.get("required_terms", []) or []:
accepted = entry.get("accepted", []) or []
if accepted:
phrases.append(str(accepted[0]))
for entry in target.get("required_caveats", []) or []:
accepted = entry.get("accepted", []) or []
if accepted:
phrases.append(str(accepted[0]))
return phrases[:6]
def _hardware_profile(
*,
profile_name: str,
@ -163,7 +205,24 @@ def run_model_benchmark(
)
elapsed_ms = round((perf_counter() - started) * 1000.0, 3)
score, notes = scorers[role](response.text)
adequacy_scores.append(score)
multilingual_score, multilingual_notes = _multilingual_score(role, response.text, language, context.multilingual_qa)
combined_score = (score * 0.8) + (multilingual_score * 0.2)
round_trip = {"warnings": [], "summary": {"source_phrase_count": 0, "round_trip_warning_count": 0, "drifted_phrases": []}}
if language != "en":
source_phrases = _round_trip_phrases(context.multilingual_qa, language)
if source_phrases:
back_translation = provider.generate(
(
"Translate the following text into English as faithfully as possible, preserving technical meaning and caveats.\n\n"
f"{response.text}"
),
role=role,
system_prompt=system_prompt_for_role(role),
temperature=0.0,
max_tokens=220,
).text
round_trip = round_trip_warning_for_phrases(source_phrases, back_translation)
adequacy_scores.append(combined_score)
role_results.append(
{
"role": role,
@ -171,9 +230,12 @@ def run_model_benchmark(
"model_name": response.model_name,
"latency_ms": elapsed_ms,
"response_preview": response.text[:280],
"adequacy_score": round(score, 3),
"adequacy_rating": _adequacy_rating(score),
"notes": notes,
"adequacy_score": round(combined_score, 3),
"adequacy_rating": _adequacy_rating(combined_score),
"grounded_score": round(score, 3),
"multilingual_score": round(multilingual_score, 3),
"round_trip": round_trip,
"notes": [*notes, *multilingual_notes, *round_trip["warnings"]],
}
)

View File

@ -0,0 +1,100 @@
from __future__ import annotations
from pathlib import Path
import yaml
def _contains_non_negated_pattern(lowered: str, pattern: str) -> bool:
start = lowered.find(pattern)
while start != -1:
prefix = lowered[max(0, start - 4):start]
if not prefix.endswith("no "):
return True
start = lowered.find(pattern, start + 1)
return False
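The negation guard can be exercised standalone. This inline copy mirrors `_contains_non_negated_pattern` above; note that only the Spanish "no " prefix is treated as negation (the French caveats in the spec negate inside the phrase itself, so they never match the forbidden patterns as substrings).

```python
def contains_non_negated(lowered: str, pattern: str) -> bool:
    # Mirrors _contains_non_negated_pattern above: a match only counts when the
    # four characters before it do not end with the Spanish negation "no ".
    start = lowered.find(pattern)
    while start != -1:
        if not lowered[max(0, start - 4):start].endswith("no "):
            return True
        start = lowered.find(pattern, start + 1)
    return False

pattern = "es identica a la entropia termodinamica"
print(contains_non_negated("la entropia no es identica a la entropia termodinamica", pattern))  # False
print(contains_non_negated("la entropia es identica a la entropia termodinamica", pattern))     # True
```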
def load_multilingual_qa_spec(source_dir: str | Path) -> dict:
source = Path(source_dir)
path = source / "multilingual_qa.yaml"
if not path.exists():
return {}
return yaml.safe_load(path.read_text(encoding="utf-8")) or {}
def multilingual_qa_for_text(spec: dict, *, language: str, text: str) -> dict:
targets = spec.get("targets", {}) or {}
target = targets.get(language, {}) or {}
warnings: list[str] = []
summary = {
"language": language,
"required_term_count": 0,
"matched_term_count": 0,
"required_caveat_count": 0,
"matched_caveat_count": 0,
"forbidden_confusion_count": 0,
"confusion_hit_count": 0,
}
if not target:
warnings.append(f"No multilingual QA spec is defined for language '{language}'.")
return {"warnings": warnings, "summary": summary}
lowered = text.lower()
required_terms = target.get("required_terms", []) or []
summary["required_term_count"] = len(required_terms)
for term in required_terms:
accepted = [str(item).lower() for item in term.get("accepted", []) or []]
if any(candidate in lowered for candidate in accepted):
summary["matched_term_count"] += 1
else:
warnings.append(f"Missing required multilingual term '{term.get('id', 'unknown')}' for language '{language}'.")
required_caveats = target.get("required_caveats", []) or []
summary["required_caveat_count"] = len(required_caveats)
for caveat in required_caveats:
accepted = [str(item).lower() for item in caveat.get("accepted", []) or []]
if any(candidate in lowered for candidate in accepted):
summary["matched_caveat_count"] += 1
else:
warnings.append(f"Missing required multilingual caveat '{caveat.get('id', 'unknown')}' for language '{language}'.")
forbidden_confusions = target.get("forbidden_confusions", []) or []
summary["forbidden_confusion_count"] = len(forbidden_confusions)
for confusion in forbidden_confusions:
patterns = [str(item).lower() for item in confusion.get("patterns", []) or []]
if any(_contains_non_negated_pattern(lowered, pattern) for pattern in patterns):
summary["confusion_hit_count"] += 1
warnings.append(f"Detected forbidden multilingual confusion '{confusion.get('id', 'unknown')}' for language '{language}'.")
return {"warnings": warnings, "summary": summary}
def multilingual_qa_for_pack(source_dir: str | Path, *, language: str, text: str) -> dict:
spec = load_multilingual_qa_spec(source_dir)
return multilingual_qa_for_text(spec, language=language, text=text)
def round_trip_warning_for_phrases(
source_phrases: list[str],
back_translated_text: str,
) -> dict:
lowered = back_translated_text.lower()
warnings: list[str] = []
drifted: list[str] = []
for phrase in source_phrases:
normalized = str(phrase).strip().lower()
if not normalized:
continue
if normalized not in lowered:
warnings.append(f"Round-trip translation did not preserve source phrase '{phrase}'.")
drifted.append(phrase)
return {
"warnings": warnings,
"summary": {
"source_phrase_count": len([phrase for phrase in source_phrases if str(phrase).strip()]),
"round_trip_warning_count": len(warnings),
"drifted_phrases": drifted,
},
}
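The substring check that drives the round-trip warnings can be sketched compactly (inlined here so it runs standalone; the behavior mirrors `round_trip_warning_for_phrases` above):

```python
def drifted_phrases(source_phrases: list[str], back_translated_text: str) -> list[str]:
    # Mirrors round_trip_warning_for_phrases above: a phrase "drifts" when its
    # lowercase form is no longer a substring of the back-translated text.
    lowered = back_translated_text.lower()
    return [
        phrase for phrase in source_phrases
        if str(phrase).strip() and str(phrase).strip().lower() not in lowered
    ]

back = "Shannon entropy measures uncertainty and differs from thermodynamic entropy."
print(drifted_phrases(["Shannon entropy", "channel capacity"], back))  # ['channel capacity']
```

Because this is exact-substring matching, legitimate paraphrases in the back-translation will also be flagged; the warnings are triage signals, not verdicts.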

View File

@ -0,0 +1,148 @@
from __future__ import annotations
import json
from pathlib import Path
import yaml
from .pack_validator import load_pack_artifacts
def _normalize_phrase(text: str) -> str:
return " ".join(str(text).replace(":", " ").replace("-", " ").split()).strip()
def _candidate_languages(languages: list[str] | None) -> list[str]:
return list(languages) if languages else ["es", "fr"]
def _seed_required_terms(concepts: list[dict]) -> list[dict]:
seeded = []
seen = set()
for concept in concepts:
title = str(concept.get("title", "")).strip()
concept_id = str(concept.get("id", "")).strip()
if not title or not concept_id:
continue
normalized = _normalize_phrase(title)
if len(normalized.split()) < 2:
continue
if concept_id in seen:
continue
seen.add(concept_id)
seeded.append(
{
"id": concept_id,
"accepted": [normalized],
}
)
return seeded[:12]
def _seed_required_caveats(source_corpus: dict) -> list[dict]:
caveats = []
seen = set()
for fragment in source_corpus.get("fragments", []) or []:
texts = [fragment.get("text", "")]
texts.extend(fragment.get("objectives", []) or [])
texts.extend(fragment.get("exercises", []) or [])
for text in texts:
lowered = str(text).lower()
if "not identical" in lowered or "differs from" in lowered or "careful interpretation" in lowered:
lesson_title = _normalize_phrase(fragment.get("lesson_title", "lesson"))
caveat_id = lesson_title.lower().replace(" ", "-")[:48] or "caveat"
if caveat_id in seen:
continue
seen.add(caveat_id)
caveats.append(
{
"id": caveat_id,
"accepted": [_normalize_phrase(text)],
}
)
return caveats[:6]
def _seed_forbidden_confusions(required_caveats: list[dict]) -> list[dict]:
confusions = []
for caveat in required_caveats:
accepted = caveat.get("accepted", []) or []
if not accepted:
continue
phrase = str(accepted[0])
lowered = phrase.lower()
if "not identical" in lowered:
confusion = phrase.replace("not identical", "identical")
elif "differs from" in lowered:
confusion = phrase.replace("differs from", "is identical to")
else:
continue
confusions.append(
{
"id": f"{caveat['id']}-confusion",
"patterns": [_normalize_phrase(confusion)],
}
)
return confusions[:6]


def generate_multilingual_qa_seed(
    source_dir: str | Path,
    *,
    languages: list[str] | None = None,
) -> dict:
    """Build a draft multilingual QA spec from a pack's concepts and source corpus."""
    source_dir = Path(source_dir)
    loaded = load_pack_artifacts(source_dir)
    if not loaded["ok"]:
        raise ValueError(f"Cannot seed multilingual QA for invalid pack directory: {source_dir}")
    concepts = loaded["artifacts"]["concepts"].get("concepts", []) or []
    source_corpus_path = source_dir / "source_corpus.json"
    source_corpus = (
        json.loads(source_corpus_path.read_text(encoding="utf-8"))
        if source_corpus_path.exists()
        else {"fragments": []}
    )
    required_terms = _seed_required_terms(concepts)
    required_caveats = _seed_required_caveats(source_corpus)
    forbidden_confusions = _seed_forbidden_confusions(required_caveats)
    targets = {}
    for language in _candidate_languages(languages):
        targets[language] = {
            "required_terms": required_terms,
            "required_caveats": required_caveats,
            "forbidden_confusions": forbidden_confusions,
        }
    return {
        "source_language": "en",
        "generated_by": "didactopus.multilingual_qa_seed",
        "review_status": "draft-seed",
        "targets": targets,
    }


def write_multilingual_qa_seed(
    source_dir: str | Path,
    *,
    out_path: str | Path | None = None,
    languages: list[str] | None = None,
) -> Path:
    """Generate the seed spec and write it next to the pack as YAML."""
    source_dir = Path(source_dir)
    payload = generate_multilingual_qa_seed(source_dir, languages=languages)
    out_path = Path(out_path) if out_path is not None else source_dir / "multilingual_qa.seed.yaml"
    out_path.write_text(yaml.safe_dump(payload, sort_keys=False, allow_unicode=False), encoding="utf-8")
    return out_path


def main() -> None:
    import argparse

    parser = argparse.ArgumentParser(description="Generate a starter multilingual QA spec from a Didactopus pack.")
    parser.add_argument("pack_dir")
    parser.add_argument("--out", default=None)
    parser.add_argument("--languages", nargs="*", default=None)
    args = parser.parse_args()
    out_path = write_multilingual_qa_seed(args.pack_dir, out_path=args.out, languages=args.languages)
    print(json.dumps({"written": str(out_path)}, indent=2))


if __name__ == "__main__":
    main()

View File

@@ -8,6 +8,7 @@ import yaml
from .evaluator_pipeline import CritiqueEvaluator, LearnerAttempt, RubricEvaluator, SymbolicRuleEvaluator, aggregate, run_pipeline
from .graph_retrieval import GraphBundle, lesson_titles_for_concept, prerequisite_titles, source_fragments_for_concept
from .multilingual_qa import load_multilingual_qa_spec


@dataclass
@@ -21,6 +22,7 @@ class SkillContext:
    graph_bundle: GraphBundle
    capability_profile: dict
    run_summary: dict
    multilingual_qa: dict


def load_ocw_skill_context(skill_dir: str | Path) -> SkillContext:
@@ -54,6 +56,7 @@ def load_ocw_skill_context(skill_dir: str | Path) -> SkillContext:
        ),
        capability_profile=json.loads((run_dir / "capability_profile.json").read_text(encoding="utf-8")),
        run_summary=json.loads((run_dir / "run_summary.json").read_text(encoding="utf-8")),
        multilingual_qa=load_multilingual_qa_spec(pack_dir),
    )

View File

@@ -35,4 +35,6 @@ def test_run_didactopus_arena_writes_outputs(tmp_path: Path) -> None:
    queue = json.loads((tmp_path / "arena_review_queue.json").read_text(encoding="utf-8"))
    assert queue
    assert payload["ranked_candidates"][0]["language"] in {"en", "es", "fr"}
    assert "multilingual_score" in payload["ranked_candidates"][0]["role_results"][0]
    assert "round_trip" in payload["ranked_candidates"][0]["role_results"][0]
    assert "LLM Review Summary" in (tmp_path / "arena_report.md").read_text(encoding="utf-8")

View File

@@ -0,0 +1,28 @@
from didactopus.language_support import language_alignment_score, response_language_instruction, ui_text


def test_response_language_instruction_is_empty_for_source_language() -> None:
    assert response_language_instruction("en", "en") == ""


def test_response_language_instruction_mentions_target_language() -> None:
    instruction = response_language_instruction("es", "en")
    assert "Spanish" in instruction
    assert "English" in instruction


def test_ui_text_uses_translated_labels() -> None:
    assert ui_text("study_plan", "es") == "Plan de estudio"
    assert ui_text("evaluation_summary", "fr") == "Resume de l'evaluation"


def test_language_alignment_score_detects_non_english_markers() -> None:
    score, notes = language_alignment_score("La entropia y la capacidad del canal se comparan para el aprendiz.", "es")
    assert score == 1.0
    assert notes == []


def test_language_alignment_score_flags_wrong_language() -> None:
    score, notes = language_alignment_score("This response is still entirely in English.", "es")
    assert score == 0.0
    assert notes

View File

@@ -30,9 +30,23 @@ def test_accessible_session_text_is_linearized() -> None:
    assert "Learner goal:" in text
    assert "Source language:" in text
    assert "Output language:" in text
    assert "Study plan:" in text
    assert "Study Plan:" in text
    assert "Conversation:" in text
    assert "Evaluation summary:" in text
    assert "Evaluation Summary:" in text


def test_accessible_session_outputs_localize_fixed_labels() -> None:
    root = Path(__file__).resolve().parents[1]
    payload = run_learner_session_demo(
        root / "configs" / "config.example.yaml",
        root / "skills" / "ocw-information-entropy-agent",
        language="es",
    )
    html = build_accessible_session_html(payload)
    text = build_accessible_session_text(payload)
    assert "Sesion de aprendizaje de Didactopus" in html
    assert "Plan de estudio" in html
    assert "Objetivo del aprendiz:" in text


def test_render_accessible_session_outputs_writes_files(tmp_path: Path) -> None:

View File

@@ -43,3 +43,15 @@ def test_model_benchmark_captures_response_preview_and_latency(tmp_path) -> None:
    assert result["latency_ms"] >= 0.0
    assert result["response_preview"]
    assert "adequacy_score" in result
    assert "round_trip" in result


def test_model_benchmark_penalizes_stub_for_non_english_output(tmp_path) -> None:
    payload = run_model_benchmark(
        config_path="configs/config.example.yaml",
        skill_dir="skills/ocw-information-entropy-agent",
        out_dir=tmp_path,
        language="es",
    )
    assert payload["context"]["output_language"] == "es"
    assert any(result["multilingual_score"] < 1.0 for result in payload["role_results"])

View File

@@ -0,0 +1,52 @@
from pathlib import Path

from didactopus.multilingual_qa import (
    load_multilingual_qa_spec,
    multilingual_qa_for_pack,
    multilingual_qa_for_text,
    round_trip_warning_for_phrases,
)


def test_load_multilingual_qa_spec_reads_ocw_pack() -> None:
    spec = load_multilingual_qa_spec("domain-packs/mit-ocw-information-entropy")
    assert spec["source_language"] == "en"
    assert "es" in spec["targets"]
    assert "fr" in spec["targets"]


def test_multilingual_qa_for_text_accepts_spanish_preservation() -> None:
    spec = load_multilingual_qa_spec("domain-packs/mit-ocw-information-entropy")
    result = multilingual_qa_for_text(
        spec,
        language="es",
        text="La entropía de Shannon no es idéntica a la entropía termodinámica, y la capacidad del canal impone otro límite.",
    )
    assert result["summary"]["matched_term_count"] >= 2
    assert result["summary"]["matched_caveat_count"] == 1
    assert result["summary"]["confusion_hit_count"] == 0


def test_multilingual_qa_for_text_flags_confusion() -> None:
    spec = load_multilingual_qa_spec("domain-packs/mit-ocw-information-entropy")
    result = multilingual_qa_for_text(
        spec,
        language="es",
        text="La entropía de Shannon es idéntica a la entropía termodinámica.",
    )
    assert result["summary"]["confusion_hit_count"] == 1
    assert any("forbidden multilingual confusion" in warning.lower() for warning in result["warnings"])


def test_multilingual_qa_for_pack_handles_missing_spec(tmp_path: Path) -> None:
    result = multilingual_qa_for_pack(tmp_path, language="es", text="Texto de prueba.")
    assert any("no multilingual qa spec" in warning.lower() for warning in result["warnings"])


def test_round_trip_warning_for_phrases_flags_drift() -> None:
    result = round_trip_warning_for_phrases(
        ["Shannon entropy", "channel capacity"],
        "This back translation only preserved Shannon entropy.",
    )
    assert result["summary"]["round_trip_warning_count"] == 1
    assert result["summary"]["drifted_phrases"] == ["channel capacity"]

View File

@@ -0,0 +1,27 @@
from pathlib import Path

import yaml

from didactopus.multilingual_qa_seed import generate_multilingual_qa_seed, write_multilingual_qa_seed


def test_generate_multilingual_qa_seed_uses_pack_content() -> None:
    payload = generate_multilingual_qa_seed("domain-packs/mit-ocw-information-entropy", languages=["es"])
    assert payload["source_language"] == "en"
    assert payload["review_status"] == "draft-seed"
    assert "es" in payload["targets"]
    target = payload["targets"]["es"]
    assert target["required_terms"]
    assert any(item["id"] == "shannon-entropy" for item in target["required_terms"])
    assert target["required_caveats"]


def test_write_multilingual_qa_seed_writes_yaml(tmp_path: Path) -> None:
    out = write_multilingual_qa_seed(
        "domain-packs/mit-ocw-information-entropy",
        out_path=tmp_path / "multilingual_qa.seed.yaml",
        languages=["es", "fr"],
    )
    assert out.exists()
    written = yaml.safe_load(out.read_text(encoding="utf-8"))
    assert set(written["targets"]) == {"es", "fr"}