Revised multilingual support with round-trip warnings; updated philosophy.

This commit is contained in:
welsberr 2026-03-17 20:36:38 -04:00
parent 34b60ac529
commit 58466bbf9f
29 changed files with 1208 additions and 137 deletions

View File

@@ -49,6 +49,8 @@ In practice, that means Didactopus tries to help with:
It explicitly tries not to become a silent answer surrogate.
The project is also being advanced with a future-compatibility constraint: avoid choices that assume abundant compute, fluent English, expert supervision, or only mature learners. That keeps the current roadmap moving while preserving eventual usefulness for more constrained and equity-sensitive educational settings.
## Who It Is For
Didactopus has several real audiences:
@@ -74,6 +76,13 @@ Current priorities are:
The live detailed roadmap is in:
- `docs/roadmap.md`
- `docs/multilingual-qa.md`
Didactopus can also generate a starter multilingual QA draft from a pack:
```bash
python -m didactopus.multilingual_qa_seed domain-packs/mit-ocw-information-entropy
```
## Start Here If You Just Want To Learn

View File

@@ -84,6 +84,12 @@ The arena currently writes:
- `arena_review_queue.json`
- `arena_report.md`
When a candidate sets a non-English `language`, the arena now also tracks a heuristic `multilingual_score` alongside the grounded behavior score. This is meant to catch obvious failures where a model ignores the requested output language or drops key grounded terms.
If the pack provides `multilingual_qa.yaml`, the arena also uses that spec to check required terms, required caveats, and forbidden confusions for the target language.
For non-English candidates, the arena now also records round-trip warnings by back-translating outputs into English and checking whether required source phrases remain recoverable.
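The round-trip layer described above can be pictured as a simple recoverability test over back-translated text. This is an illustrative sketch under assumed shapes, not the arena's actual implementation: the function name is hypothetical, and recoverability is approximated as a case-insensitive substring match, which fits the warning-oriented (not proof-oriented) intent.

```python
def round_trip_warnings(back_translated_en: str, required_phrases: list[str]) -> list[str]:
    """Warn for each required source phrase that is no longer recoverable
    after back-translating a candidate output into English.

    Recoverability here is a case-insensitive substring match; anything
    subtler (paraphrase, reordering) would need a stronger comparison.
    """
    haystack = back_translated_en.lower()
    return [
        f"Round-trip translation did not preserve source phrase '{phrase}'."
        for phrase in required_phrases
        if phrase.lower() not in haystack
    ]
```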
## Human Review Position
The LLM review summary should be treated as initial triage support only.

View File

@@ -26,6 +26,7 @@ The HTML output is meant to be screen-reader-friendly and keyboard-friendly:
- semantic headings
- reading-order sections for study plan, conversation, and evaluation
- grounded source fragments rendered as ordinary text instead of only visual diagrams
- deterministic learner-facing labels localized for supported output languages
The plain-text output is a linearized learner-session transcript that is suitable for:

View File

@@ -89,6 +89,15 @@ The current heuristic scoring asks whether each role does the right kind of work
This is deliberately narrower than a general-purpose benchmark. Didactopus cares about trustworthy learner guidance, not maximal generic fluency.
When `--language` is set to a non-English value, the benchmark now also applies a heuristic multilingual check:
- does the response appear to actually be in the target language?
- does it still preserve key grounded concept terms and caveats?
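A minimal version of the language-alignment part of that check can be sketched with stopword counting. The stopword sets, threshold, and function name below are assumptions for illustration, not the benchmark's actual heuristic:

```python
# Tiny illustrative stopword sets; a real heuristic would use larger lists
# and possibly a proper language-identification library.
STOPWORDS = {
    "es": {"el", "la", "los", "las", "de", "que", "es", "un", "una", "no"},
    "fr": {"le", "la", "les", "des", "de", "que", "est", "un", "une", "pas"},
}

def appears_in_language(text: str, language: str, threshold: float = 0.05) -> bool:
    """Return True when the share of target-language stopwords among the
    response tokens clears a small threshold."""
    tokens = [t.strip(".,;:!?()").lower() for t in text.split()]
    if not tokens or language not in STOPWORDS:
        return False
    hits = sum(1 for t in tokens if t in STOPWORDS[language])
    return hits / len(tokens) >= threshold
```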
If the pack provides `multilingual_qa.yaml`, the benchmark also applies per-pack preservation checks from that spec.
For non-English runs, the benchmark now also records a round-trip warning layer by back-translating role outputs into English and checking whether required source phrases are still recoverable. This is a warning-oriented signal, not a proof of correctness.
## Interpreting Ratings
- `adequate`

docs/multilingual-qa.md Normal file
View File

@@ -0,0 +1,80 @@
# Multilingual QA
Didactopus now supports an optional per-pack multilingual QA spec.
The goal is not to certify perfect translation quality. The goal is to make multilingual evaluation less dependent on vague fluency judgments by checking whether key terms, caveats, and forbidden confusions survive across languages.
## Spec File
Place this file in a pack directory:
- `multilingual_qa.yaml`
It is currently optional.
## Current Shape
```yaml
source_language: en
targets:
es:
required_terms:
- id: shannon-entropy
accepted:
- "entropía de shannon"
required_caveats:
- id: shannon-vs-thermo-not-identical
accepted:
- "no es idéntica"
forbidden_confusions:
- id: shannon-equals-thermodynamic-entropy
patterns:
- "es idéntica a la entropía termodinámica"
```
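After parsing the YAML (for example with PyYAML's `yaml.safe_load`), the spec can be normalized into per-language check lists. The function below is a hedged sketch of that step; its name and its defaulting behavior are assumptions, not Didactopus code:

```python
def index_spec(spec: dict) -> dict:
    """Normalize a parsed multilingual_qa.yaml mapping into per-language
    check lists, defaulting each check kind to an empty list."""
    targets = {}
    for lang, checks in (spec.get("targets") or {}).items():
        checks = checks or {}
        targets[lang] = {
            "required_terms": checks.get("required_terms") or [],
            "required_caveats": checks.get("required_caveats") or [],
            "forbidden_confusions": checks.get("forbidden_confusions") or [],
        }
    return {"source_language": spec.get("source_language", "en"), "targets": targets}
```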
## Starter Generation
Didactopus can now generate a draft starter spec for reviewer refinement:
```bash
python -m didactopus.multilingual_qa_seed domain-packs/mit-ocw-information-entropy \
--out domain-packs/mit-ocw-information-entropy/multilingual_qa.seed.yaml \
--languages es fr
```
The generated `multilingual_qa.seed.yaml` is a draft, not something to trust as-is. It is a reviewer aid that pulls:
- multi-word concept titles as draft required terms
- likely caveat candidates from grounded source fragments
- likely forbidden confusions derived from negated caveat language
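The concept-title pull can be pictured as slugging multi-word titles into draft entries. This sketch assumes the slugged-id convention visible in generated seed files; the function itself is hypothetical:

```python
import re

def draft_required_terms(concept_titles: list[str]) -> list[dict]:
    """Turn multi-word concept titles into draft required_terms entries,
    skipping single-word titles as too generic to be useful checks."""
    entries = []
    for title in concept_titles:
        if len(title.split()) < 2:
            continue  # single-word titles skipped
        slug = re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")
        entries.append({"id": slug, "accepted": [title]})
    return entries
```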
## What It Checks
For a target language, the QA layer can check:
- required terms that should appear in acceptable translated or multilingual output
- required caveats that must survive explanation
- forbidden confusions that should trigger warnings
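Applied to one output string, the three check kinds reduce to substring scans. The sketch below mirrors the note wording seen in benchmark output for missing terms and caveats; the forbidden-confusion wording and the function name are assumptions:

```python
def preservation_notes(output: str, checks: dict, language: str) -> list[str]:
    """Scan one output for required terms, required caveats, and forbidden
    confusions, emitting a note per failed check."""
    text = output.lower()
    notes = []
    for term in checks.get("required_terms", []):
        if not any(a.lower() in text for a in term["accepted"]):
            notes.append(
                f"Missing required multilingual term '{term['id']}' for language '{language}'."
            )
    for caveat in checks.get("required_caveats", []):
        if not any(a.lower() in text for a in caveat["accepted"]):
            notes.append(
                f"Missing required multilingual caveat '{caveat['id']}' for language '{language}'."
            )
    for confusion in checks.get("forbidden_confusions", []):
        if any(p.lower() in text for p in confusion["patterns"]):
            notes.append(
                f"Forbidden confusion '{confusion['id']}' detected for language '{language}'."
            )
    return notes
```

One caveat of plain substring scans: a forbidden pattern like "es idéntica a la entropía termodinámica" also matches inside the negated caveat "no es idéntica a la entropía termodinámica", so pattern lists need careful wording.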
## Where It Is Used
This spec now feeds:
- the local model benchmark
- the Didactopus arena
Those tools still use heuristic scoring, but multilingual QA spec checks now contribute an explicit preservation signal.
## Why This Helps
This gives Didactopus a better layered multilingual evaluation model:
1. language-alignment heuristics
2. term and caveat preservation checks
3. round-trip warning checks on required phrases
4. arena comparison and LLM review support
5. human bilingual review for promoted or disputed outputs
## Current Limitation
This is still a lightweight preservation framework. It does not yet prove semantic equivalence across whole explanations. It is best treated as an early QA filter and promotion aid.

View File

@@ -190,6 +190,7 @@ Examples:
- Prefer role-adequate local models over chasing a single best model.
- Keep accessibility and low-cost deployment in scope from the start, not as cleanup work.
- Preserve provenance and license compliance as first-class constraints.
- Advance the current roadmap without assuming abundant compute, fluent English, expert supervision, or mature learners.
## Suggested Implementation Sequence

View File

@@ -0,0 +1,60 @@
source_language: en
generated_by: didactopus.multilingual_qa_seed
review_status: draft-seed
targets:
es:
required_terms: &id001
- id: mit-ocw-6-050j-information-and-entropy-course-home
accepted:
- MIT OCW 6.050J Information and Entropy Course Home
- id: information-and-entropy
accepted:
- Information and Entropy
- id: ultimate-limits-to-communication-and-computation
accepted:
- Ultimate Limits to Communication and Computation
- id: open-textbooks-problem-sets-and-programming-work
accepted:
- Open Textbooks, Problem Sets, and Programming Work
- id: mit-ocw-6-050j-information-and-entropy-syllabus
accepted:
- MIT OCW 6.050J Information and Entropy Syllabus
- id: prerequisites-and-mathematical-background
accepted:
- Prerequisites and Mathematical Background
- id: assessment-structure
accepted:
- Assessment Structure
- id: course-notes-and-reference-texts
accepted:
- Course Notes and Reference Texts
- id: independent-reasoning-and-careful-comparison
accepted:
- Independent Reasoning and Careful Comparison
- id: mit-ocw-6-050j-information-and-entropy-unit-sequence
accepted:
- MIT OCW 6.050J Information and Entropy Unit Sequence
- id: counting-and-probability
accepted:
- Counting and Probability
- id: shannon-entropy
accepted:
- Shannon Entropy
required_caveats: &id002
- id: thermodynamics-and-entropy
accepted:
- Objective Explain how thermodynamic entropy relates to, and differs from,
Shannon entropy. Exercise Compare the two entropy notions and identify what
is preserved across the analogy. The course uses entropy as a bridge concept
between communication theory and physics while insisting on careful interpretation.
forbidden_confusions: &id003
- id: thermodynamics-and-entropy-confusion
patterns:
- Objective Explain how thermodynamic entropy relates to, and is identical to,
Shannon entropy. Exercise Compare the two entropy notions and identify what
is preserved across the analogy. The course uses entropy as a bridge concept
between communication theory and physics while insisting on careful interpretation.
fr:
required_terms: *id001
required_caveats: *id002
forbidden_confusions: *id003

View File

@@ -0,0 +1,59 @@
source_language: en
targets:
es:
required_terms:
- id: shannon-entropy
accepted:
- "entropia"
- "entropía"
- "entropia de shannon"
- "entropía de shannon"
- id: channel-capacity
accepted:
- "capacidad del canal"
- "capacidad de canal"
- id: thermodynamic-entropy
accepted:
- "entropia termodinamica"
- "entropía termodinámica"
required_caveats:
- id: shannon-vs-thermo-not-identical
accepted:
- "no es identica"
- "no es idéntica"
- "no son identicas"
- "no son idénticas"
- "no equivale exactamente"
forbidden_confusions:
- id: shannon-equals-thermodynamic-entropy
patterns:
- "es identica a la entropia termodinamica"
- "es idéntica a la entropía termodinámica"
- "son identicas"
- "son idénticas"
fr:
required_terms:
- id: shannon-entropy
accepted:
- "entropie"
- "entropie de shannon"
- id: channel-capacity
accepted:
- "capacite du canal"
- "capacité du canal"
- id: thermodynamic-entropy
accepted:
- "entropie thermodynamique"
required_caveats:
- id: shannon-vs-thermo-not-identical
accepted:
- "n'est pas identique"
- "ne sont pas identiques"
- "n'est pas equivalente"
- "n'est pas équivalente"
forbidden_confusions:
- id: shannon-equals-thermodynamic-entropy
patterns:
- "est identique a l'entropie thermodynamique"
- "est identique à l'entropie thermodynamique"
- "sont identiques"

View File

@@ -3,9 +3,9 @@
- Candidates: 3
## Rankings
- - `stub-baseline` via `stub` / prompt variant `baseline`: borderline (0.667), language `en`
+ - `stub-baseline` via `stub` / prompt variant `baseline`: borderline (0.733), language `en`
- - `stub-strict-grounding` via `stub` / prompt variant `strict_grounding`: borderline (0.667), language `es`
+ - `stub-strict-grounding` via `stub` / prompt variant `strict_grounding`: inadequate (0.547), language `es`
- - `stub-trust-preserving` via `stub` / prompt variant `trust_preserving`: borderline (0.667), language `fr`
+ - `stub-trust-preserving` via `stub` / prompt variant `trust_preserving`: inadequate (0.547), language `fr`
## Human Review Queue
- `stub-baseline`: needs_human_review=True, weak_roles=['mentor', 'evaluator']

View File

@@ -10,7 +10,7 @@
"prompt_variant": "baseline",
"language": "en",
"provider": "stub",
- "overall_score": 0.667,
+ "overall_score": 0.733,
"overall_rating": "borderline",
"role_results": [
{
@@ -19,9 +19,11 @@
"model_name": "local-demo",
"prompt_variant": "baseline",
"language": "en",
- "latency_ms": 0.027,
+ "latency_ms": 0.021,
- "adequacy_score": 0.65,
+ "adequacy_score": 0.72,
"adequacy_rating": "borderline",
"grounded_score": 0.65,
"multilingual_score": 1.0,
"response_preview": "[stubbed-response] [mentor] Concept: Independent Reasoning and Careful Comparison Prerequisites: Course Notes and Reference Texts Supporting lessons",
"notes": [
"Did not ask a focused learner question."
@@ -33,9 +35,11 @@
"model_name": "local-demo",
"prompt_variant": "baseline",
"language": "en",
- "latency_ms": 0.006,
+ "latency_ms": 0.005,
"adequacy_score": 1.0,
"adequacy_rating": "adequate",
"grounded_score": 1.0,
"multilingual_score": 1.0,
"response_preview": "[stubbed-response] [practice] Concept: Independent Reasoning and Careful Comparison Prerequisites: Course Notes and Reference Texts Supporting lessons",
"notes": []
},
@@ -45,9 +49,11 @@
"model_name": "local-demo",
"prompt_variant": "baseline",
"language": "en",
- "latency_ms": 0.005,
+ "latency_ms": 0.004,
- "adequacy_score": 0.35,
+ "adequacy_score": 0.48,
"adequacy_rating": "inadequate",
"grounded_score": 0.35,
"multilingual_score": 1.0,
"response_preview": "[stubbed-response] [evaluator] Concept: Independent Reasoning and Careful Comparison Prerequisites: Course Notes and Reference Texts Supporting lessons",
"notes": [
"Did not acknowledge learner strengths.",
@@ -62,8 +68,8 @@
"prompt_variant": "strict_grounding",
"language": "es",
"provider": "stub",
- "overall_score": 0.667,
+ "overall_score": 0.547,
- "overall_rating": "borderline",
+ "overall_rating": "inadequate",
"role_results": [
{
"role": "mentor",
@@ -71,12 +77,20 @@
"model_name": "local-demo",
"prompt_variant": "strict_grounding",
"language": "es",
- "latency_ms": 0.019,
+ "latency_ms": 0.028,
- "adequacy_score": 0.65,
+ "adequacy_score": 0.52,
- "adequacy_rating": "borderline",
+ "adequacy_rating": "inadequate",
"grounded_score": 0.65,
"multilingual_score": 0.0,
"response_preview": "[stubbed-response] [mentor] Concept: Independent Reasoning and Careful Comparison Prerequisites: Course Notes and Reference Texts Supporting lessons",
"notes": [
- "Did not ask a focused learner question."
+ "Did not ask a focused learner question.",
"Response does not appear to be in Spanish.",
"Missing required multilingual term 'shannon-entropy' for language 'es'.",
"Missing required multilingual term 'channel-capacity' for language 'es'.",
"Missing required multilingual term 'thermodynamic-entropy' for language 'es'.",
"Missing required multilingual caveat 'shannon-vs-thermo-not-identical' for language 'es'.",
"Did not visibly preserve a key grounded concept term in multilingual output."
]
},
{
@@ -85,11 +99,19 @@
"model_name": "local-demo",
"prompt_variant": "strict_grounding",
"language": "es",
- "latency_ms": 0.005,
+ "latency_ms": 0.006,
- "adequacy_score": 1.0,
+ "adequacy_score": 0.82,
"adequacy_rating": "adequate",
"grounded_score": 1.0,
"multilingual_score": 0.1,
"response_preview": "[stubbed-response] [practice] Concept: Independent Reasoning and Careful Comparison Prerequisites: Course Notes and Reference Texts Supporting lessons",
- "notes": []
+ "notes": [
"Response does not appear to be in Spanish.",
"Missing required multilingual term 'shannon-entropy' for language 'es'.",
"Missing required multilingual term 'channel-capacity' for language 'es'.",
"Missing required multilingual term 'thermodynamic-entropy' for language 'es'.",
"Missing required multilingual caveat 'shannon-vs-thermo-not-identical' for language 'es'."
]
},
{
"role": "evaluator",
@@ -97,13 +119,20 @@
"model_name": "local-demo",
"prompt_variant": "strict_grounding",
"language": "es",
- "latency_ms": 0.004,
+ "latency_ms": 0.006,
- "adequacy_score": 0.35,
+ "adequacy_score": 0.3,
"adequacy_rating": "inadequate",
"grounded_score": 0.35,
"multilingual_score": 0.1,
"response_preview": "[stubbed-response] [evaluator] Concept: Independent Reasoning and Careful Comparison Prerequisites: Course Notes and Reference Texts Supporting lessons",
"notes": [
"Did not acknowledge learner strengths.",
- "Did not provide a concrete next step."
+ "Did not provide a concrete next step.",
"Response does not appear to be in Spanish.",
"Missing required multilingual term 'shannon-entropy' for language 'es'.",
"Missing required multilingual term 'channel-capacity' for language 'es'.",
"Missing required multilingual term 'thermodynamic-entropy' for language 'es'.",
"Missing required multilingual caveat 'shannon-vs-thermo-not-identical' for language 'es'."
]
}
]
@@ -114,8 +143,8 @@
"prompt_variant": "trust_preserving",
"language": "fr",
"provider": "stub",
- "overall_score": 0.667,
+ "overall_score": 0.547,
- "overall_rating": "borderline",
+ "overall_rating": "inadequate",
"role_results": [
{
"role": "mentor",
@@ -123,12 +152,20 @@
"model_name": "local-demo",
"prompt_variant": "trust_preserving",
"language": "fr",
- "latency_ms": 0.025,
+ "latency_ms": 0.024,
- "adequacy_score": 0.65,
+ "adequacy_score": 0.52,
- "adequacy_rating": "borderline",
+ "adequacy_rating": "inadequate",
"grounded_score": 0.65,
"multilingual_score": 0.0,
"response_preview": "[stubbed-response] [mentor] Concept: Independent Reasoning and Careful Comparison Prerequisites: Course Notes and Reference Texts Supporting lessons",
"notes": [
- "Did not ask a focused learner question."
+ "Did not ask a focused learner question.",
"Response does not appear to be in French.",
"Missing required multilingual term 'shannon-entropy' for language 'fr'.",
"Missing required multilingual term 'channel-capacity' for language 'fr'.",
"Missing required multilingual term 'thermodynamic-entropy' for language 'fr'.",
"Missing required multilingual caveat 'shannon-vs-thermo-not-identical' for language 'fr'.",
"Did not visibly preserve a key grounded concept term in multilingual output."
]
},
{
@@ -137,11 +174,19 @@
"model_name": "local-demo",
"prompt_variant": "trust_preserving",
"language": "fr",
- "latency_ms": 0.005,
+ "latency_ms": 0.006,
- "adequacy_score": 1.0,
+ "adequacy_score": 0.82,
"adequacy_rating": "adequate",
"grounded_score": 1.0,
"multilingual_score": 0.1,
"response_preview": "[stubbed-response] [practice] Concept: Independent Reasoning and Careful Comparison Prerequisites: Course Notes and Reference Texts Supporting lessons",
- "notes": []
+ "notes": [
"Response does not appear to be in French.",
"Missing required multilingual term 'shannon-entropy' for language 'fr'.",
"Missing required multilingual term 'channel-capacity' for language 'fr'.",
"Missing required multilingual term 'thermodynamic-entropy' for language 'fr'.",
"Missing required multilingual caveat 'shannon-vs-thermo-not-identical' for language 'fr'."
]
},
{
"role": "evaluator",
@@ -150,12 +195,19 @@
"prompt_variant": "trust_preserving",
"language": "fr",
"latency_ms": 0.005,
- "adequacy_score": 0.35,
+ "adequacy_score": 0.3,
"adequacy_rating": "inadequate",
"grounded_score": 0.35,
"multilingual_score": 0.1,
"response_preview": "[stubbed-response] [evaluator] Concept: Independent Reasoning and Careful Comparison Prerequisites: Course Notes and Reference Texts Supporting lessons",
"notes": [
"Did not acknowledge learner strengths.",
- "Did not provide a concrete next step."
+ "Did not provide a concrete next step.",
"Response does not appear to be in French.",
"Missing required multilingual term 'shannon-entropy' for language 'fr'.",
"Missing required multilingual term 'channel-capacity' for language 'fr'.",
"Missing required multilingual term 'thermodynamic-entropy' for language 'fr'.",
"Missing required multilingual caveat 'shannon-vs-thermo-not-identical' for language 'fr'."
]
}
]
@@ -165,7 +217,7 @@
{
"candidate_name": "stub-baseline",
"overall_rating": "borderline",
- "overall_score": 0.667,
+ "overall_score": 0.733,
"needs_human_review": true,
"weak_roles": [
"mentor",
@@ -174,8 +226,8 @@
},
{
"candidate_name": "stub-strict-grounding",
- "overall_rating": "borderline",
+ "overall_rating": "inadequate",
- "overall_score": 0.667,
+ "overall_score": 0.547,
"needs_human_review": true,
"weak_roles": [
"mentor",
@@ -184,8 +236,8 @@
},
{
"candidate_name": "stub-trust-preserving",
- "overall_rating": "borderline",
+ "overall_rating": "inadequate",
- "overall_score": 0.667,
+ "overall_score": 0.547,
"needs_human_review": true,
"weak_roles": [
"mentor",

View File

@@ -2,7 +2,7 @@
{
"candidate_name": "stub-baseline",
"overall_rating": "borderline",
- "overall_score": 0.667,
+ "overall_score": 0.733,
"needs_human_review": true,
"weak_roles": [
"mentor",
@@ -11,8 +11,8 @@
},
{
"candidate_name": "stub-strict-grounding",
- "overall_rating": "borderline",
+ "overall_rating": "inadequate",
- "overall_score": 0.667,
+ "overall_score": 0.547,
"needs_human_review": true,
"weak_roles": [
"mentor",
@@ -21,8 +21,8 @@
},
{
"candidate_name": "stub-trust-preserving",
- "overall_rating": "borderline",
+ "overall_rating": "inadequate",
- "overall_score": 0.667,
+ "overall_score": 0.547,
"needs_human_review": true,
"weak_roles": [
"mentor",

View File

@@ -0,0 +1,152 @@
{
"benchmark": {
"name": "didactopus-local-model-adequacy",
"task_family": "graph-grounded-mentor-loop",
"provider": "stub",
"hardware_profile": {
"profile_name": "unspecified-local",
"cpu": "unknown",
"ram_gb": null,
"notes": ""
}
},
"context": {
"skill_name": "ocw-information-entropy-agent",
"study_plan_task": "Help a learner connect Shannon entropy, channel capacity, and thermodynamic entropy.",
"primary_concept": "Independent Reasoning and Careful Comparison",
"secondary_concept": "Thermodynamics and Entropy",
"source_language": "en",
"output_language": "es"
},
"role_results": [
{
"role": "mentor",
"provider": "stub",
"model_name": "local-demo",
"latency_ms": 0.025,
"response_preview": "[stubbed-response] [mentor] Concept: Independent Reasoning and Careful Comparison Prerequisites: Course Notes and Reference Texts Supporting lessons",
"adequacy_score": 0.52,
"adequacy_rating": "inadequate",
"grounded_score": 0.65,
"multilingual_score": 0.0,
"round_trip": {
"warnings": [
"Round-trip translation did not preserve source phrase 'entropia'.",
"Round-trip translation did not preserve source phrase 'capacidad del canal'.",
"Round-trip translation did not preserve source phrase 'entropia termodinamica'.",
"Round-trip translation did not preserve source phrase 'no es identica'."
],
"summary": {
"source_phrase_count": 4,
"round_trip_warning_count": 4,
"drifted_phrases": [
"entropia",
"capacidad del canal",
"entropia termodinamica",
"no es identica"
]
}
},
"notes": [
"Did not ask a focused learner question.",
"Response does not appear to be in Spanish.",
"Missing required multilingual term 'shannon-entropy' for language 'es'.",
"Missing required multilingual term 'channel-capacity' for language 'es'.",
"Missing required multilingual term 'thermodynamic-entropy' for language 'es'.",
"Missing required multilingual caveat 'shannon-vs-thermo-not-identical' for language 'es'.",
"Did not visibly preserve a key grounded concept term in multilingual output.",
"Round-trip translation did not preserve source phrase 'entropia'.",
"Round-trip translation did not preserve source phrase 'capacidad del canal'.",
"Round-trip translation did not preserve source phrase 'entropia termodinamica'.",
"Round-trip translation did not preserve source phrase 'no es identica'."
]
},
{
"role": "practice",
"provider": "stub",
"model_name": "local-demo",
"latency_ms": 0.004,
"response_preview": "[stubbed-response] [practice] Concept: Independent Reasoning and Careful Comparison Prerequisites: Course Notes and Reference Texts Supporting lessons",
"adequacy_score": 0.82,
"adequacy_rating": "adequate",
"grounded_score": 1.0,
"multilingual_score": 0.1,
"round_trip": {
"warnings": [
"Round-trip translation did not preserve source phrase 'entropia'.",
"Round-trip translation did not preserve source phrase 'capacidad del canal'.",
"Round-trip translation did not preserve source phrase 'entropia termodinamica'.",
"Round-trip translation did not preserve source phrase 'no es identica'."
],
"summary": {
"source_phrase_count": 4,
"round_trip_warning_count": 4,
"drifted_phrases": [
"entropia",
"capacidad del canal",
"entropia termodinamica",
"no es identica"
]
}
},
"notes": [
"Response does not appear to be in Spanish.",
"Missing required multilingual term 'shannon-entropy' for language 'es'.",
"Missing required multilingual term 'channel-capacity' for language 'es'.",
"Missing required multilingual term 'thermodynamic-entropy' for language 'es'.",
"Missing required multilingual caveat 'shannon-vs-thermo-not-identical' for language 'es'.",
"Round-trip translation did not preserve source phrase 'entropia'.",
"Round-trip translation did not preserve source phrase 'capacidad del canal'.",
"Round-trip translation did not preserve source phrase 'entropia termodinamica'.",
"Round-trip translation did not preserve source phrase 'no es identica'."
]
},
{
"role": "evaluator",
"provider": "stub",
"model_name": "local-demo",
"latency_ms": 0.004,
"response_preview": "[stubbed-response] [evaluator] Concept: Independent Reasoning and Careful Comparison Prerequisites: Course Notes and Reference Texts Supporting lessons",
"adequacy_score": 0.3,
"adequacy_rating": "inadequate",
"grounded_score": 0.35,
"multilingual_score": 0.1,
"round_trip": {
"warnings": [
"Round-trip translation did not preserve source phrase 'entropia'.",
"Round-trip translation did not preserve source phrase 'capacidad del canal'.",
"Round-trip translation did not preserve source phrase 'entropia termodinamica'.",
"Round-trip translation did not preserve source phrase 'no es identica'."
],
"summary": {
"source_phrase_count": 4,
"round_trip_warning_count": 4,
"drifted_phrases": [
"entropia",
"capacidad del canal",
"entropia termodinamica",
"no es identica"
]
}
},
"notes": [
"Did not acknowledge learner strengths.",
"Did not provide a concrete next step.",
"Response does not appear to be in Spanish.",
"Missing required multilingual term 'shannon-entropy' for language 'es'.",
"Missing required multilingual term 'channel-capacity' for language 'es'.",
"Missing required multilingual term 'thermodynamic-entropy' for language 'es'.",
"Missing required multilingual caveat 'shannon-vs-thermo-not-identical' for language 'es'.",
"Round-trip translation did not preserve source phrase 'entropia'.",
"Round-trip translation did not preserve source phrase 'capacidad del canal'.",
"Round-trip translation did not preserve source phrase 'entropia termodinamica'.",
"Round-trip translation did not preserve source phrase 'no es identica'."
]
}
],
"summary": {
"overall_adequacy_score": 0.547,
"overall_adequacy_rating": "inadequate",
"recommended_use": "Not recommended for learner-facing local deployment."
}
}

View File

@@ -0,0 +1,16 @@
# Didactopus Local Model Benchmark
- Provider: `stub`
- Hardware profile: `unspecified-local`
- Primary concept: Independent Reasoning and Careful Comparison
- Secondary concept: Thermodynamics and Entropy
- Overall adequacy: inadequate (0.547)
- Recommended use: Not recommended for learner-facing local deployment.
## Role Results
- `mentor` via `local-demo`: inadequate (0.52), latency 0.025 ms
Notes: Did not ask a focused learner question.; Response does not appear to be in Spanish.; Missing required multilingual term 'shannon-entropy' for language 'es'.; Missing required multilingual term 'channel-capacity' for language 'es'.; Missing required multilingual term 'thermodynamic-entropy' for language 'es'.; Missing required multilingual caveat 'shannon-vs-thermo-not-identical' for language 'es'.; Did not visibly preserve a key grounded concept term in multilingual output.; Round-trip translation did not preserve source phrase 'entropia'.; Round-trip translation did not preserve source phrase 'capacidad del canal'.; Round-trip translation did not preserve source phrase 'entropia termodinamica'.; Round-trip translation did not preserve source phrase 'no es identica'.
- `practice` via `local-demo`: adequate (0.82), latency 0.004 ms
Notes: Response does not appear to be in Spanish.; Missing required multilingual term 'shannon-entropy' for language 'es'.; Missing required multilingual term 'channel-capacity' for language 'es'.; Missing required multilingual term 'thermodynamic-entropy' for language 'es'.; Missing required multilingual caveat 'shannon-vs-thermo-not-identical' for language 'es'.; Round-trip translation did not preserve source phrase 'entropia'.; Round-trip translation did not preserve source phrase 'capacidad del canal'.; Round-trip translation did not preserve source phrase 'entropia termodinamica'.; Round-trip translation did not preserve source phrase 'no es identica'.
- `evaluator` via `local-demo`: inadequate (0.3), latency 0.004 ms
Notes: Did not acknowledge learner strengths.; Did not provide a concrete next step.; Response does not appear to be in Spanish.; Missing required multilingual term 'shannon-entropy' for language 'es'.; Missing required multilingual term 'channel-capacity' for language 'es'.; Missing required multilingual term 'thermodynamic-entropy' for language 'es'.; Missing required multilingual caveat 'shannon-vs-thermo-not-identical' for language 'es'.; Round-trip translation did not preserve source phrase 'entropia'.; Round-trip translation did not preserve source phrase 'capacidad del canal'.; Round-trip translation did not preserve source phrase 'entropia termodinamica'.; Round-trip translation did not preserve source phrase 'no es identica'.

View File

@ -1,9 +1,9 @@
<!doctype html> <!doctype html>
<html lang="en"> <html lang="es">
<head> <head>
<meta charset="utf-8"> <meta charset="utf-8">
<meta name="viewport" content="width=device-width, initial-scale=1"> <meta name="viewport" content="width=device-width, initial-scale=1">
<title>Didactopus Learner Session</title> <title>Sesion de aprendizaje de Didactopus</title>
<style> <style>
:root { color-scheme: light; --bg: #f7f4ed; --panel: #fffdf8; --ink: #1e2b31; --muted: #53656d; --line: #d3c8b7; --accent: #155e63; } :root { color-scheme: light; --bg: #f7f4ed; --panel: #fffdf8; --ink: #1e2b31; --muted: #53656d; --line: #d3c8b7; --accent: #155e63; }
body { margin: 0; font-family: Georgia, 'Times New Roman', serif; background: var(--bg); color: var(--ink); line-height: 1.55; } body { margin: 0; font-family: Georgia, 'Times New Roman', serif; background: var(--bg); color: var(--ink); line-height: 1.55; }
@ -21,24 +21,24 @@ ol, ul { padding-left: 22px; }
</style> </style>
</head> </head>
<body> <body>
<a class="skip" href="#session-main">Skip to learner session</a> <a class="skip" href="#session-main">Saltar a la sesion de aprendizaje</a>
<main id="session-main" aria-label="Didactopus learner session"> <main id="session-main" aria-label="Didactopus learner session">
<section aria-labelledby="session-title"> <section aria-labelledby="session-title">
<h1 id="session-title">Didactopus Learner Session</h1> <h1 id="session-title">Sesion de aprendizaje de Didactopus</h1>
<p class="sr-note">This page is structured for keyboard and screen-reader use. It presents the learner goal, study plan, grounded source fragments, and conversation turns in reading order.</p> <p class="sr-note">Esta pagina esta estructurada para uso con teclado y lector de pantalla. Presenta el objetivo del aprendiz, el plan de estudio, los fragmentos de fundamento y los turnos de conversacion en orden de lectura.</p>
<p><strong>Learner goal:</strong> Help me understand how Shannon entropy leads into channel capacity and thermodynamic entropy.</p> <p><strong>Objetivo del aprendiz:</strong> Help me understand how Shannon entropy leads into channel capacity and thermodynamic entropy.</p>
<p><strong>Source language:</strong> en</p> <p><strong>Idioma de origen:</strong> en</p>
<p><strong>Output language:</strong> es</p> <p><strong>Idioma de salida:</strong> es</p>
</section> </section>
<section aria-labelledby="study-plan-title"> <section aria-labelledby="study-plan-title">
<h2 id="study-plan-title">Study Plan</h2> <h2 id="study-plan-title">Plan de estudio</h2>
<ol> <ol>
<li> <li>
<h3>Independent Reasoning and Careful Comparison</h3> <h3>Independent Reasoning and Careful Comparison</h3>
<p><strong>Status:</strong> mastered</p> <p><strong>Estado:</strong> mastered</p>
<p><strong>Prerequisites:</strong> Course Notes and Reference Texts</p> <p><strong>Prerrequisitos:</strong> Course Notes and Reference Texts</p>
<p><strong>Supporting lessons:</strong> Independent Reasoning and Careful Comparison</p> <p><strong>Lecciones de apoyo:</strong> Independent Reasoning and Careful Comparison</p>
<p><strong>Grounding fragments:</strong></p> <p><strong>Fragmentos de fundamento:</strong></p>
<ul> <ul>
<li><div class="fragment"><strong>Independent Reasoning and Careful Comparison</strong> (lesson_body)<br>- Objective: Explain why the course requires precise comparison of related but non-identical concepts. <li><div class="fragment"><strong>Independent Reasoning and Careful Comparison</strong> (lesson_body)<br>- Objective: Explain why the course requires precise comparison of related but non-identical concepts.
- Exercise: Write a short note distinguishing Shannon entropy, channel capacity, and thermodynamic entropy. - Exercise: Write a short note distinguishing Shannon entropy, channel capacity, and thermodynamic entropy.
@ -48,10 +48,10 @@ The syllabus framing implies a style of work where analogy is useful but dangero
</li> </li>
<li> <li>
<h3>Thermodynamics and Entropy</h3> <h3>Thermodynamics and Entropy</h3>
<p><strong>Status:</strong> mastered</p> <p><strong>Estado:</strong> mastered</p>
<p><strong>Prerequisites:</strong> Cryptography and Information Hiding</p> <p><strong>Prerrequisitos:</strong> Cryptography and Information Hiding</p>
<p><strong>Supporting lessons:</strong> Thermodynamics and Entropy</p> <p><strong>Lecciones de apoyo:</strong> Thermodynamics and Entropy</p>
<p><strong>Grounding fragments:</strong></p> <p><strong>Fragmentos de fundamento:</strong></p>
<ul> <ul>
<li><div class="fragment"><strong>Thermodynamics and Entropy</strong> (lesson_body)<br>- Objective: Explain how thermodynamic entropy relates to, and differs from, Shannon entropy. <li><div class="fragment"><strong>Thermodynamics and Entropy</strong> (lesson_body)<br>- Objective: Explain how thermodynamic entropy relates to, and differs from, Shannon entropy.
- Exercise: Compare the two entropy notions and identify what is preserved across the analogy. - Exercise: Compare the two entropy notions and identify what is preserved across the analogy.
@ -61,10 +61,10 @@ The course uses entropy as a bridge concept between communication theory and phy
</li> </li>
<li> <li>
<h3>Shannon Entropy</h3> <h3>Shannon Entropy</h3>
<p><strong>Status:</strong> mastered</p> <p><strong>Estado:</strong> mastered</p>
<p><strong>Prerequisites:</strong> Counting and Probability</p> <p><strong>Prerrequisitos:</strong> Counting and Probability</p>
<p><strong>Supporting lessons:</strong> Shannon Entropy</p> <p><strong>Lecciones de apoyo:</strong> Shannon Entropy</p>
<p><strong>Grounding fragments:</strong></p> <p><strong>Fragmentos de fundamento:</strong></p>
<ul> <ul>
<li><div class="fragment"><strong>Shannon Entropy</strong> (lesson_body)<br>- Objective: Explain Shannon entropy as a measure of uncertainty and compare high-entropy and low-entropy sources. <li><div class="fragment"><strong>Shannon Entropy</strong> (lesson_body)<br>- Objective: Explain Shannon entropy as a measure of uncertainty and compare high-entropy and low-entropy sources.
- Exercise: Compute the entropy of a Bernoulli source and interpret the result. - Exercise: Compute the entropy of a Bernoulli source and interpret the result.
@ -75,7 +75,7 @@ The course then introduces entropy as a quantitative measure of uncertainty for
</ol> </ol>
</section> </section>
<section aria-labelledby="conversation-title"> <section aria-labelledby="conversation-title">
<h2 id="conversation-title">Conversation</h2> <h2 id="conversation-title">Conversacion</h2>
<article class="turn" aria-label="Conversation turn"> <article class="turn" aria-label="Conversation turn">
<h3>Learner Goal</h3> <h3>Learner Goal</h3>
<p class="meta">Role: user</p> <p class="meta">Role: user</p>
@ -108,10 +108,10 @@ The course then introduces entropy as a quantitative measure of uncertainty for
</article> </article>
</section> </section>
<section aria-labelledby="evaluation-title"> <section aria-labelledby="evaluation-title">
<h2 id="evaluation-title">Evaluation Summary</h2> <h2 id="evaluation-title">Resumen de evaluacion</h2>
<p><strong>Verdict:</strong> needs_revision</p> <p><strong>Veredicto:</strong> needs_revision</p>
<p><strong>Aggregated dimensions:</strong> {&quot;correctness&quot;: 0.6000000000000001, &quot;critique&quot;: 0.6499999999999999, &quot;explanation&quot;: 0.85}</p> <p><strong>Dimensiones agregadas:</strong> {&quot;correctness&quot;: 0.6000000000000001, &quot;critique&quot;: 0.6499999999999999, &quot;explanation&quot;: 0.85}</p>
<p><strong>Follow-up:</strong> Rework the answer so it states the equality/relationship explicitly and explains why it matters.</p> <p><strong>Siguiente paso:</strong> Rework the answer so it states the equality/relationship explicitly and explains why it matters.</p>
</section> </section>
</main> </main>
</body> </body>

View File

@ -1,36 +1,36 @@
Didactopus Learner Session Sesion de aprendizaje de Didactopus
Learner goal: Help me understand how Shannon entropy leads into channel capacity and thermodynamic entropy. Objetivo del aprendiz: Help me understand how Shannon entropy leads into channel capacity and thermodynamic entropy.
Source language: en Idioma de origen: en
Output language: es Idioma de salida: es
Study plan: Plan de estudio:
1. Independent Reasoning and Careful Comparison 1. Independent Reasoning and Careful Comparison
Status: mastered Estado: mastered
Prerequisites: Course Notes and Reference Texts Prerrequisitos: Course Notes and Reference Texts
Supporting lessons: Independent Reasoning and Careful Comparison Lecciones de apoyo: Independent Reasoning and Careful Comparison
Source fragment (lesson_body): - Objective: Explain why the course requires precise comparison of related but non-identical concepts. Fragmento de fuente (lesson_body): - Objective: Explain why the course requires precise comparison of related but non-identical concepts.
- Exercise: Write a short note distinguishing Shannon entropy, channel capacity, and thermodynamic entropy. - Exercise: Write a short note distinguishing Shannon entropy, channel capacity, and thermodynamic entropy.
The syllabus framing implies a style of work where analogy is useful but dangerous when used loosely. Learners must compare models carefully, state assumptions, and notice where similar mathematics does not imply identical interpretation. The syllabus framing implies a style of work where analogy is useful but dangerous when used loosely. Learners must compare models carefully, state assumptions, and notice where similar mathematics does not imply identical interpretation.
Source fragment (objective): Explain why the course requires precise comparison of related but non-identical concepts. Fragmento de fuente (objective): Explain why the course requires precise comparison of related but non-identical concepts.
2. Thermodynamics and Entropy 2. Thermodynamics and Entropy
Status: mastered Estado: mastered
Prerequisites: Cryptography and Information Hiding Prerrequisitos: Cryptography and Information Hiding
Supporting lessons: Thermodynamics and Entropy Lecciones de apoyo: Thermodynamics and Entropy
Source fragment (lesson_body): - Objective: Explain how thermodynamic entropy relates to, and differs from, Shannon entropy. Fragmento de fuente (lesson_body): - Objective: Explain how thermodynamic entropy relates to, and differs from, Shannon entropy.
- Exercise: Compare the two entropy notions and identify what is preserved across the analogy. - Exercise: Compare the two entropy notions and identify what is preserved across the analogy.
The course uses entropy as a bridge concept between communication theory and physics while insisting on careful interpretation. The course uses entropy as a bridge concept between communication theory and physics while insisting on careful interpretation.
Source fragment (objective): Explain how thermodynamic entropy relates to, and differs from, Shannon entropy. Fragmento de fuente (objective): Explain how thermodynamic entropy relates to, and differs from, Shannon entropy.
3. Shannon Entropy 3. Shannon Entropy
Status: mastered Estado: mastered
Prerequisites: Counting and Probability Prerrequisitos: Counting and Probability
Supporting lessons: Shannon Entropy Lecciones de apoyo: Shannon Entropy
Source fragment (lesson_body): - Objective: Explain Shannon entropy as a measure of uncertainty and compare high-entropy and low-entropy sources. Fragmento de fuente (lesson_body): - Objective: Explain Shannon entropy as a measure of uncertainty and compare high-entropy and low-entropy sources.
- Exercise: Compute the entropy of a Bernoulli source and interpret the result. - Exercise: Compute the entropy of a Bernoulli source and interpret the result.
The course then introduces entropy as a quantitative measure of uncertainty for a source model and uses it to reason about representation cost and surprise. The course then introduces entropy as a quantitative measure of uncertainty for a source model and uses it to reason about representation cost and surprise.
Source fragment (objective): Explain Shannon entropy as a measure of uncertainty and compare high-entropy and low-entropy sources. Fragmento de fuente (objective): Explain Shannon entropy as a measure of uncertainty and compare high-entropy and low-entropy sources.
Conversation: Conversacion:
Learner Goal: Learner Goal:
Help me understand how Shannon entropy leads into channel capacity and thermodynamic entropy. Help me understand how Shannon entropy leads into channel capacity and thermodynamic entropy.
@ -49,7 +49,7 @@ Didactopus Evaluator:
Didactopus Mentor: Didactopus Mentor:
[stubbed-response] [mentor] Concept: Independent Reasoning and Careful Comparison Prerequisites: Course Notes and Reference Texts Supporting lessons [stubbed-response] [mentor] Concept: Independent Reasoning and Careful Comparison Prerequisites: Course Notes and Reference Texts Supporting lessons
Evaluation summary: Resumen de evaluacion:
Verdict: needs_revision Veredicto: needs_revision
Aggregated dimensions: {"correctness": 0.6000000000000001, "critique": 0.6499999999999999, "explanation": 0.85} Dimensiones agregadas: {"correctness": 0.6000000000000001, "critique": 0.6499999999999999, "explanation": 0.85}
Follow-up: Rework the answer so it states the equality/relationship explicitly and explains why it matters. Siguiente paso: Rework the answer so it states the equality/relationship explicitly and explains why it matters.

View File

@ -0,0 +1,59 @@
source_language: en
targets:
es:
required_terms:
- id: shannon-entropy
accepted:
- "entropia"
- "entropía"
- "entropia de shannon"
- "entropía de shannon"
- id: channel-capacity
accepted:
- "capacidad del canal"
- "capacidad de canal"
- id: thermodynamic-entropy
accepted:
- "entropia termodinamica"
- "entropía termodinámica"
required_caveats:
- id: shannon-vs-thermo-not-identical
accepted:
- "no es identica"
- "no es idéntica"
- "no son identicas"
- "no son idénticas"
- "no equivale exactamente"
forbidden_confusions:
- id: shannon-equals-thermodynamic-entropy
patterns:
- "es identica a la entropia termodinamica"
- "es idéntica a la entropía termodinámica"
- "son identicas"
- "son idénticas"
fr:
required_terms:
- id: shannon-entropy
accepted:
- "entropie"
- "entropie de shannon"
- id: channel-capacity
accepted:
- "capacite du canal"
- "capacité du canal"
- id: thermodynamic-entropy
accepted:
- "entropie thermodynamique"
required_caveats:
- id: shannon-vs-thermo-not-identical
accepted:
- "n'est pas identique"
- "ne sont pas identiques"
- "n'est pas equivalente"
- "n'est pas équivalente"
forbidden_confusions:
- id: shannon-equals-thermodynamic-entropy
patterns:
- "est identique a l'entropie thermodynamique"
- "est identique à l'entropie thermodynamique"
- "sont identiques"
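The spec above can be exercised with a small standalone checker. This is a hedged sketch only: the real `multilingual_qa_for_text` helper in `didactopus.multilingual_qa` may differ in matching details, and the `check_against_spec` name and inline spec below are hypothetical illustrations of the same `required_terms` / `required_caveats` / `forbidden_confusions` structure:

```python
# Hypothetical, minimal re-implementation of the QA-spec check; the real
# multilingual_qa_for_text helper in didactopus may differ in detail.
def check_against_spec(spec: dict, language: str, text: str) -> list[str]:
    target = spec.get("targets", {}).get(language, {})
    lowered = text.lower()
    warnings: list[str] = []
    for term in target.get("required_terms", []):
        # A term counts as present if any accepted surface form appears.
        if not any(form in lowered for form in term["accepted"]):
            warnings.append(f"missing required term '{term['id']}'")
    for caveat in target.get("required_caveats", []):
        if not any(form in lowered for form in caveat["accepted"]):
            warnings.append(f"missing required caveat '{caveat['id']}'")
    for confusion in target.get("forbidden_confusions", []):
        # Any matching pattern flags the response as confusing the concepts.
        if any(pattern in lowered for pattern in confusion["patterns"]):
            warnings.append(f"forbidden confusion '{confusion['id']}'")
    return warnings

spec = {
    "targets": {
        "es": {
            "required_terms": [
                {"id": "shannon-entropy", "accepted": ["entropia", "entropía"]},
            ],
            "required_caveats": [
                {"id": "not-identical", "accepted": ["no es identica"]},
            ],
            "forbidden_confusions": [
                {"id": "equals-thermo", "patterns": ["son identicas"]},
            ],
        }
    }
}

good = "La entropia de Shannon no es identica a la termodinamica."
bad = "Ambas entropias son identicas."
print(check_against_spec(spec, "es", good))  # []
print(check_against_spec(spec, "es", bad))
```

Keeping the matching to plain lowercase substring checks is what makes listing both accented and unaccented surface forms in the YAML necessary.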

View File

@ -9,8 +9,16 @@ import yaml
from .config import load_config from .config import load_config
from .language_support import response_language_instruction from .language_support import response_language_instruction
from .learner_session import _grounding_block from .learner_session import _grounding_block
from .model_bench import _adequacy_rating, _score_evaluator_response, _score_mentor_response, _score_practice_response from .model_bench import (
_adequacy_rating,
_multilingual_score,
_round_trip_phrases,
_score_evaluator_response,
_score_mentor_response,
_score_practice_response,
)
from .model_provider import ModelProvider from .model_provider import ModelProvider
from .multilingual_qa import round_trip_warning_for_phrases
from .ocw_skill_agent_demo import build_skill_grounded_study_plan, evaluate_submission_with_skill, load_ocw_skill_context from .ocw_skill_agent_demo import build_skill_grounded_study_plan, evaluate_submission_with_skill, load_ocw_skill_context
from .role_prompts import system_prompt_for_role_variant from .role_prompts import system_prompt_for_role_variant
@ -110,7 +118,24 @@ def _run_candidate(candidate: dict, skill_dir: str | Path) -> dict:
) )
elapsed_ms = round((perf_counter() - started) * 1000.0, 3) elapsed_ms = round((perf_counter() - started) * 1000.0, 3)
score, notes = _scorer_for_role(role)(response.text) score, notes = _scorer_for_role(role)(response.text)
overall += score multilingual_score, multilingual_notes = _multilingual_score(role, response.text, language, context.multilingual_qa)
combined_score = (score * 0.8) + (multilingual_score * 0.2)
round_trip = {"warnings": [], "summary": {"source_phrase_count": 0, "round_trip_warning_count": 0, "drifted_phrases": []}}
if language != "en":
source_phrases = _round_trip_phrases(context.multilingual_qa, language)
if source_phrases:
back_translation = provider.generate(
(
"Translate the following text into English as faithfully as possible, preserving technical meaning and caveats.\n\n"
f"{response.text}"
),
role=role,
system_prompt=system_prompt_for_role_variant(role, variant),
temperature=0.0,
max_tokens=220,
).text
round_trip = round_trip_warning_for_phrases(source_phrases, back_translation)
overall += combined_score
role_results.append( role_results.append(
{ {
"role": role, "role": role,
@ -119,10 +144,13 @@ def _run_candidate(candidate: dict, skill_dir: str | Path) -> dict:
"prompt_variant": variant, "prompt_variant": variant,
"language": language, "language": language,
"latency_ms": elapsed_ms, "latency_ms": elapsed_ms,
"adequacy_score": round(score, 3), "adequacy_score": round(combined_score, 3),
"adequacy_rating": _adequacy_rating(score), "adequacy_rating": _adequacy_rating(combined_score),
"grounded_score": round(score, 3),
"multilingual_score": round(multilingual_score, 3),
"round_trip": round_trip,
"response_preview": response.text[:280], "response_preview": response.text[:280],
"notes": notes, "notes": [*notes, *multilingual_notes, *round_trip["warnings"]],
} }
) )
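The round-trip guard wired in above can be approximated by a simple phrase-recovery check. This sketch is an assumption about how `round_trip_warning_for_phrases` behaves, inferred from the report output and the `warnings`/`summary` shape used by the arena; the real helper may do fuzzier matching:

```python
# Hypothetical sketch of the round-trip phrase check; the actual
# round_trip_warning_for_phrases helper may implement fuzzier matching.
def round_trip_warning_for_phrases(source_phrases: list[str], back_translation: str) -> dict:
    lowered = back_translation.lower()
    warnings: list[str] = []
    drifted: list[str] = []
    for phrase in source_phrases:
        # A phrase "survives" the round trip if it is still recoverable verbatim.
        if phrase.lower() not in lowered:
            warnings.append(
                f"Round-trip translation did not preserve source phrase '{phrase}'."
            )
            drifted.append(phrase)
    return {
        "warnings": warnings,
        "summary": {
            "source_phrase_count": len(source_phrases),
            "round_trip_warning_count": len(warnings),
            "drifted_phrases": drifted,
        },
    }

phrases = ["entropy", "channel capacity"]
back = "Shannon entropy measures uncertainty in a source."
result = round_trip_warning_for_phrases(phrases, back)
print(result["summary"]["round_trip_warning_count"])  # 1
```

Because the back-translation itself comes from the candidate model at temperature 0, a failed recovery can mean either genuine drift in the translated answer or a weak back-translation, which is why these surface only as warnings rather than hard failures.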

View File

@ -14,11 +14,84 @@ LANGUAGE_LABELS = {
"ja": "Japanese", "ja": "Japanese",
} }
UI_STRINGS = {
"en": {
"didactopus_learner_session": "Didactopus Learner Session",
"learner_goal": "Learner goal",
"source_language": "Source language",
"output_language": "Output language",
"study_plan": "Study Plan",
"conversation": "Conversation",
"evaluation_summary": "Evaluation Summary",
"verdict": "Verdict",
"aggregated_dimensions": "Aggregated dimensions",
"follow_up": "Follow-up",
"status": "Status",
"prerequisites": "Prerequisites",
"supporting_lessons": "Supporting lessons",
"grounding_fragments": "Grounding fragments",
"source_fragment": "Source fragment",
"skip_to_session": "Skip to learner session",
"screen_reader_note": "This page is structured for keyboard and screen-reader use. It presents the learner goal, study plan, grounded source fragments, and conversation turns in reading order.",
},
"es": {
"didactopus_learner_session": "Sesion de aprendizaje de Didactopus",
"learner_goal": "Objetivo del aprendiz",
"source_language": "Idioma de origen",
"output_language": "Idioma de salida",
"study_plan": "Plan de estudio",
"conversation": "Conversacion",
"evaluation_summary": "Resumen de evaluacion",
"verdict": "Veredicto",
"aggregated_dimensions": "Dimensiones agregadas",
"follow_up": "Siguiente paso",
"status": "Estado",
"prerequisites": "Prerrequisitos",
"supporting_lessons": "Lecciones de apoyo",
"grounding_fragments": "Fragmentos de fundamento",
"source_fragment": "Fragmento de fuente",
"skip_to_session": "Saltar a la sesion de aprendizaje",
"screen_reader_note": "Esta pagina esta estructurada para uso con teclado y lector de pantalla. Presenta el objetivo del aprendiz, el plan de estudio, los fragmentos de fundamento y los turnos de conversacion en orden de lectura.",
},
"fr": {
"didactopus_learner_session": "Session d'apprentissage Didactopus",
"learner_goal": "Objectif de l'apprenant",
"source_language": "Langue source",
"output_language": "Langue de sortie",
"study_plan": "Plan d'etude",
"conversation": "Conversation",
"evaluation_summary": "Resume de l'evaluation",
"verdict": "Verdict",
"aggregated_dimensions": "Dimensions agregees",
"follow_up": "Etape suivante",
"status": "Statut",
        "prerequisites": "Prerequis",
"supporting_lessons": "Lecons de soutien",
"grounding_fragments": "Fragments d'ancrage",
"source_fragment": "Fragment source",
"skip_to_session": "Aller a la session d'apprentissage",
"screen_reader_note": "Cette page est structuree pour une utilisation au clavier et avec un lecteur d'ecran. Elle presente l'objectif de l'apprenant, le plan d'etude, les fragments d'ancrage et les tours de conversation dans l'ordre de lecture.",
},
}
LANGUAGE_MARKERS = {
"es": (" el ", " la ", " de ", " y ", " que ", " para ", " no ", "una ", "un "),
"fr": (" le ", " la ", " de ", " et ", " que ", " pour ", " pas ", "une ", "un "),
    "de": (" der ", " die ", " und ", " nicht ", " ist ", " fur ", " für "),
    "pt": (" o ", " a ", " de ", " e ", " para ", " nao ", " não "),
"it": (" il ", " la ", " di ", " e ", " per ", " non "),
}
def language_label(language: str) -> str: def language_label(language: str) -> str:
return LANGUAGE_LABELS.get(language, language) return LANGUAGE_LABELS.get(language, language)
def ui_text(key: str, language: str) -> str:
table = UI_STRINGS.get(language, UI_STRINGS["en"])
return table.get(key, UI_STRINGS["en"].get(key, key))
def response_language_instruction(language: str, source_language: str = "en") -> str: def response_language_instruction(language: str, source_language: str = "en") -> str:
if language == source_language: if language == source_language:
return "" return ""
@ -26,3 +99,18 @@ def response_language_instruction(language: str, source_language: str = "en") ->
f" Respond in {language_label(language)}. Preserve key source-grounded concepts and caveats faithfully, " f" Respond in {language_label(language)}. Preserve key source-grounded concepts and caveats faithfully, "
f"and make clear when you are explaining material whose source language is {language_label(source_language)}." f"and make clear when you are explaining material whose source language is {language_label(source_language)}."
) )
def language_alignment_score(text: str, language: str) -> tuple[float, list[str]]:
if language == "en":
return 1.0, []
lowered = f" {text.lower()} "
markers = LANGUAGE_MARKERS.get(language)
if markers is None:
return 0.5, [f"No language-specific heuristic markers are defined for {language} yet."]
marker_hits = sum(1 for marker in markers if marker in lowered)
if marker_hits >= 2:
return 1.0, []
if marker_hits == 1:
return 0.6, [f"Only weak evidence that the response is actually in {language_label(language)}."]
return 0.0, [f"Response does not appear to be in {language_label(language)}."]
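The marker heuristic can be demonstrated in isolation. This is a standalone mirror of `language_alignment_score` for illustration (the canonical version lives in `didactopus.language_support`; the trimmed marker table and note strings here are abbreviations, not the module's exact values):

```python
# Standalone mirror of the marker heuristic, for illustration only.
LANGUAGE_MARKERS = {
    "es": (" el ", " la ", " de ", " y ", " que ", " para ", " no ", "una ", "un "),
}

def language_alignment_score(text: str, language: str) -> tuple[float, list[str]]:
    if language == "en":
        return 1.0, []
    # Pad with spaces so markers can match at the start and end of the text.
    lowered = f" {text.lower()} "
    markers = LANGUAGE_MARKERS.get(language)
    if markers is None:
        return 0.5, [f"No heuristic markers defined for {language} yet."]
    hits = sum(1 for marker in markers if marker in lowered)
    if hits >= 2:
        return 1.0, []
    if hits == 1:
        return 0.6, ["Only weak evidence of the target language."]
    return 0.0, ["Response does not appear to be in the target language."]

score, notes = language_alignment_score(
    "La entropia de Shannon mide la incertidumbre.", "es"
)
print(score)  # 1.0
```

Common function words make cheap, surprisingly robust markers: an English response to a Spanish-language request almost never contains two of ` la `, ` de `, ` que `, so it scores 0.0 and triggers the "does not appear to be in Spanish" note seen in the benchmark report.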

View File

@ -4,36 +4,38 @@ import html
import json import json
from pathlib import Path from pathlib import Path
from .language_support import ui_text
def _escape(value: object) -> str: def _escape(value: object) -> str:
return html.escape(str(value)) return html.escape(str(value))
def build_accessible_session_text(session: dict) -> str: def build_accessible_session_text(session: dict) -> str:
language = str(session.get("output_language", "en"))
lines = [ lines = [
"Didactopus Learner Session", ui_text("didactopus_learner_session", language),
"", "",
f"Learner goal: {session.get('goal', '')}", f"{ui_text('learner_goal', language)}: {session.get('goal', '')}",
f"Source language: {session.get('source_language', 'en')}", f"{ui_text('source_language', language)}: {session.get('source_language', 'en')}",
f"Output language: {session.get('output_language', 'en')}", f"{ui_text('output_language', language)}: {session.get('output_language', 'en')}",
"", "",
"Study plan:", f"{ui_text('study_plan', language)}:",
] ]
for index, step in enumerate(session.get("study_plan", {}).get("steps", []), start=1): for index, step in enumerate(session.get("study_plan", {}).get("steps", []), start=1):
lines.extend( lines.extend(
[ [
f"{index}. {step.get('title', '')}", f"{index}. {step.get('title', '')}",
f" Status: {step.get('status', '')}", f" {ui_text('status', language)}: {step.get('status', '')}",
f" Prerequisites: {', '.join(step.get('prerequisite_titles', []) or ['none explicit'])}", f" {ui_text('prerequisites', language)}: {', '.join(step.get('prerequisite_titles', []) or ['none explicit'])}",
f" Supporting lessons: {', '.join(step.get('supporting_lessons', []) or ['none listed'])}", f" {ui_text('supporting_lessons', language)}: {', '.join(step.get('supporting_lessons', []) or ['none listed'])}",
] ]
) )
for fragment in step.get("source_fragments", [])[:2]: for fragment in step.get("source_fragments", [])[:2]:
lines.append(f" Source fragment ({fragment.get('kind', 'fragment')}): {fragment.get('text', '')}") lines.append(f" {ui_text('source_fragment', language)} ({fragment.get('kind', 'fragment')}): {fragment.get('text', '')}")
lines.extend( lines.extend(
[ [
"", "",
"Conversation:", f"{ui_text('conversation', language)}:",
] ]
) )
for turn in session.get("turns", []): for turn in session.get("turns", []):
@ -47,26 +49,27 @@ def build_accessible_session_text(session: dict) -> str:
evaluation = session.get("evaluation", {}) evaluation = session.get("evaluation", {})
lines.extend( lines.extend(
[ [
"Evaluation summary:", f"{ui_text('evaluation_summary', language)}:",
f"Verdict: {evaluation.get('verdict', '')}", f"{ui_text('verdict', language)}: {evaluation.get('verdict', '')}",
f"Aggregated dimensions: {json.dumps(evaluation.get('aggregated', {}), sort_keys=True)}", f"{ui_text('aggregated_dimensions', language)}: {json.dumps(evaluation.get('aggregated', {}), sort_keys=True)}",
f"Follow-up: {evaluation.get('follow_up', '')}", f"{ui_text('follow_up', language)}: {evaluation.get('follow_up', '')}",
] ]
) )
return "\n".join(lines).strip() + "\n" return "\n".join(lines).strip() + "\n"
def build_accessible_session_html(session: dict) -> str: def build_accessible_session_html(session: dict) -> str:
language = str(session.get("output_language", "en"))
steps = session.get("study_plan", {}).get("steps", []) steps = session.get("study_plan", {}).get("steps", [])
turns = session.get("turns", []) turns = session.get("turns", [])
evaluation = session.get("evaluation", {}) evaluation = session.get("evaluation", {})
body = [ body = [
"<!doctype html>", "<!doctype html>",
'<html lang="en">', f'<html lang="{_escape(language)}">',
"<head>", "<head>",
'<meta charset="utf-8">', '<meta charset="utf-8">',
'<meta name="viewport" content="width=device-width, initial-scale=1">', '<meta name="viewport" content="width=device-width, initial-scale=1">',
"<title>Didactopus Learner Session</title>", f"<title>{_escape(ui_text('didactopus_learner_session', language))}</title>",
"<style>", "<style>",
":root { color-scheme: light; --bg: #f7f4ed; --panel: #fffdf8; --ink: #1e2b31; --muted: #53656d; --line: #d3c8b7; --accent: #155e63; }", ":root { color-scheme: light; --bg: #f7f4ed; --panel: #fffdf8; --ink: #1e2b31; --muted: #53656d; --line: #d3c8b7; --accent: #155e63; }",
"body { margin: 0; font-family: Georgia, 'Times New Roman', serif; background: var(--bg); color: var(--ink); line-height: 1.55; }", "body { margin: 0; font-family: Georgia, 'Times New Roman', serif; background: var(--bg); color: var(--ink); line-height: 1.55; }",
@ -84,32 +87,32 @@ def build_accessible_session_html(session: dict) -> str:
"</style>", "</style>",
"</head>", "</head>",
"<body>", "<body>",
'<a class="skip" href="#session-main">Skip to learner session</a>', f'<a class="skip" href="#session-main">{_escape(ui_text("skip_to_session", language))}</a>',
'<main id="session-main" aria-label="Didactopus learner session">', '<main id="session-main" aria-label="Didactopus learner session">',
'<section aria-labelledby="session-title">', '<section aria-labelledby="session-title">',
'<h1 id="session-title">Didactopus Learner Session</h1>', f'<h1 id="session-title">{_escape(ui_text("didactopus_learner_session", language))}</h1>',
'<p class="sr-note">This page is structured for keyboard and screen-reader use. It presents the learner goal, study plan, grounded source fragments, and conversation turns in reading order.</p>', f'<p class="sr-note">{_escape(ui_text("screen_reader_note", language))}</p>',
f"<p><strong>Learner goal:</strong> {_escape(session.get('goal', ''))}</p>", f"<p><strong>{_escape(ui_text('learner_goal', language))}:</strong> {_escape(session.get('goal', ''))}</p>",
f"<p><strong>Source language:</strong> {_escape(session.get('source_language', 'en'))}</p>", f"<p><strong>{_escape(ui_text('source_language', language))}:</strong> {_escape(session.get('source_language', 'en'))}</p>",
f"<p><strong>Output language:</strong> {_escape(session.get('output_language', 'en'))}</p>", f"<p><strong>{_escape(ui_text('output_language', language))}:</strong> {_escape(session.get('output_language', 'en'))}</p>",
"</section>", "</section>",
'<section aria-labelledby="study-plan-title">', '<section aria-labelledby="study-plan-title">',
'<h2 id="study-plan-title">Study Plan</h2>', f'<h2 id="study-plan-title">{_escape(ui_text("study_plan", language))}</h2>',
'<ol>', '<ol>',
] ]
for step in steps: for step in steps:
body.append("<li>") body.append("<li>")
body.append(f"<h3>{_escape(step.get('title', ''))}</h3>") body.append(f"<h3>{_escape(step.get('title', ''))}</h3>")
-        body.append(f"<p><strong>Status:</strong> {_escape(step.get('status', ''))}</p>")
+        body.append(f"<p><strong>{_escape(ui_text('status', language))}:</strong> {_escape(step.get('status', ''))}</p>")
         body.append(
-            f"<p><strong>Prerequisites:</strong> {_escape(', '.join(step.get('prerequisite_titles', []) or ['none explicit']))}</p>"
+            f"<p><strong>{_escape(ui_text('prerequisites', language))}:</strong> {_escape(', '.join(step.get('prerequisite_titles', []) or ['none explicit']))}</p>"
         )
         body.append(
-            f"<p><strong>Supporting lessons:</strong> {_escape(', '.join(step.get('supporting_lessons', []) or ['none listed']))}</p>"
+            f"<p><strong>{_escape(ui_text('supporting_lessons', language))}:</strong> {_escape(', '.join(step.get('supporting_lessons', []) or ['none listed']))}</p>"
         )
         fragments = step.get("source_fragments", [])[:2]
         if fragments:
-            body.append("<p><strong>Grounding fragments:</strong></p>")
+            body.append(f"<p><strong>{_escape(ui_text('grounding_fragments', language))}:</strong></p>")
             body.append("<ul>")
             for fragment in fragments:
                 body.append(
@@ -123,7 +126,7 @@ def build_accessible_session_html(session: dict) -> str:
             "</ol>",
             "</section>",
             '<section aria-labelledby="conversation-title">',
-            '<h2 id="conversation-title">Conversation</h2>',
+            f'<h2 id="conversation-title">{_escape(ui_text("conversation", language))}</h2>',
         ]
     )
     for turn in turns:
@@ -136,10 +139,10 @@ def build_accessible_session_html(session: dict) -> str:
         [
             "</section>",
             '<section aria-labelledby="evaluation-title">',
-            '<h2 id="evaluation-title">Evaluation Summary</h2>',
-            f"<p><strong>Verdict:</strong> {_escape(evaluation.get('verdict', ''))}</p>",
-            f"<p><strong>Aggregated dimensions:</strong> {_escape(json.dumps(evaluation.get('aggregated', {}), sort_keys=True))}</p>",
-            f"<p><strong>Follow-up:</strong> {_escape(evaluation.get('follow_up', ''))}</p>",
+            f'<h2 id="evaluation-title">{_escape(ui_text("evaluation_summary", language))}</h2>',
+            f"<p><strong>{_escape(ui_text('verdict', language))}:</strong> {_escape(evaluation.get('verdict', ''))}</p>",
+            f"<p><strong>{_escape(ui_text('aggregated_dimensions', language))}:</strong> {_escape(json.dumps(evaluation.get('aggregated', {}), sort_keys=True))}</p>",
+            f"<p><strong>{_escape(ui_text('follow_up', language))}:</strong> {_escape(evaluation.get('follow_up', ''))}</p>",
             "</section>",
             "</main>",
             "</body>",

View File

@@ -5,9 +5,10 @@ from pathlib import Path
 from time import perf_counter

 from .config import load_config
-from .language_support import response_language_instruction
+from .language_support import language_alignment_score, response_language_instruction
 from .learner_session import _grounding_block
 from .model_provider import ModelProvider
+from .multilingual_qa import multilingual_qa_for_text, round_trip_warning_for_phrases
 from .ocw_skill_agent_demo import build_skill_grounded_study_plan, evaluate_submission_with_skill, load_ocw_skill_context
 from .role_prompts import system_prompt_for_role
@@ -77,6 +78,47 @@ def _adequacy_rating(score: float) -> str:
     return "inadequate"


+def _multilingual_score(role: str, text: str, language: str, qa_spec: dict | None = None) -> tuple[float, list[str]]:
+    score, notes = language_alignment_score(text, language)
+    if language == "en":
+        return score, notes
+    qa_score = 1.0
+    qa_notes: list[str] = []
+    if qa_spec:
+        qa_result = multilingual_qa_for_text(qa_spec, language=language, text=text)
+        qa_notes = list(qa_result["warnings"])
+        summary = qa_result["summary"]
+        denominator = summary["required_term_count"] + summary["required_caveat_count"] + summary["forbidden_confusion_count"]
+        numerator = summary["matched_term_count"] + summary["matched_caveat_count"] + (
+            summary["forbidden_confusion_count"] - summary["confusion_hit_count"]
+        )
+        if denominator > 0:
+            qa_score = max(0.0, min(1.0, numerator / denominator))
+    role_lower = role.lower()
+    if role_lower == "mentor" and "entropy" not in text.lower():
+        qa_notes = list(qa_notes)
+        qa_notes.append("Did not visibly preserve a key grounded concept term in multilingual output.")
+        qa_score = max(0.0, qa_score - 0.2)
+    combined = (score * 0.5) + (qa_score * 0.5)
+    return combined, [*notes, *qa_notes]
+
+
+def _round_trip_phrases(qa_spec: dict | None, language: str) -> list[str]:
+    if not qa_spec or language == "en":
+        return []
+    target = (qa_spec.get("targets", {}) or {}).get(language, {}) or {}
+    phrases: list[str] = []
+    for entry in target.get("required_terms", []) or []:
+        accepted = entry.get("accepted", []) or []
+        if accepted:
+            phrases.append(str(accepted[0]))
+    for entry in target.get("required_caveats", []) or []:
+        accepted = entry.get("accepted", []) or []
+        if accepted:
+            phrases.append(str(accepted[0]))
+    return phrases[:6]
+
+
 def _hardware_profile(
     *,
     profile_name: str,
@@ -163,7 +205,24 @@ def run_model_benchmark(
             )
             elapsed_ms = round((perf_counter() - started) * 1000.0, 3)
             score, notes = scorers[role](response.text)
-            adequacy_scores.append(score)
+            multilingual_score, multilingual_notes = _multilingual_score(role, response.text, language, context.multilingual_qa)
+            combined_score = (score * 0.8) + (multilingual_score * 0.2)
+            round_trip = {"warnings": [], "summary": {"source_phrase_count": 0, "round_trip_warning_count": 0, "drifted_phrases": []}}
+            if language != "en":
+                source_phrases = _round_trip_phrases(context.multilingual_qa, language)
+                if source_phrases:
+                    back_translation = provider.generate(
+                        (
+                            "Translate the following text into English as faithfully as possible, preserving technical meaning and caveats.\n\n"
+                            f"{response.text}"
+                        ),
+                        role=role,
+                        system_prompt=system_prompt_for_role(role),
+                        temperature=0.0,
+                        max_tokens=220,
+                    ).text
+                    round_trip = round_trip_warning_for_phrases(source_phrases, back_translation)
+            adequacy_scores.append(combined_score)
             role_results.append(
                 {
                     "role": role,
@@ -171,9 +230,12 @@ def run_model_benchmark(
                     "model_name": response.model_name,
                     "latency_ms": elapsed_ms,
                     "response_preview": response.text[:280],
-                    "adequacy_score": round(score, 3),
-                    "adequacy_rating": _adequacy_rating(score),
-                    "notes": notes,
+                    "adequacy_score": round(combined_score, 3),
+                    "adequacy_rating": _adequacy_rating(combined_score),
+                    "grounded_score": round(score, 3),
+                    "multilingual_score": round(multilingual_score, 3),
+                    "round_trip": round_trip,
+                    "notes": [*notes, *multilingual_notes, *round_trip["warnings"]],
                 }
             )
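To make the weighting in this hunk concrete, here is a minimal worked example of the blend; the numeric values are invented for illustration, but the weights match the code above:

```python
# Invented example values; weights mirror the benchmark code above.
grounded_score = 0.9      # existing per-role grounded scorer
alignment_score = 1.0     # output is in the requested language
qa_score = 0.75           # e.g. 3 of 4 required terms/caveats satisfied

# _multilingual_score: equal blend of language alignment and QA-spec coverage
multilingual_score = (alignment_score * 0.5) + (qa_score * 0.5)

# run_model_benchmark: grounded behavior still dominates reported adequacy
combined_score = (grounded_score * 0.8) + (multilingual_score * 0.2)
print(round(multilingual_score, 3), round(combined_score, 3))
```

So a fully grounded answer with patchy terminology coverage is nudged down but not dominated by the multilingual check.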

View File

@@ -0,0 +1,100 @@
from __future__ import annotations

from pathlib import Path

import yaml


def _contains_non_negated_pattern(lowered: str, pattern: str) -> bool:
    start = lowered.find(pattern)
    while start != -1:
        prefix = lowered[max(0, start - 4):start]
        if not prefix.endswith("no "):
            return True
        start = lowered.find(pattern, start + 1)
    return False


def load_multilingual_qa_spec(source_dir: str | Path) -> dict:
    source = Path(source_dir)
    path = source / "multilingual_qa.yaml"
    if not path.exists():
        return {}
    return yaml.safe_load(path.read_text(encoding="utf-8")) or {}


def multilingual_qa_for_text(spec: dict, *, language: str, text: str) -> dict:
    targets = spec.get("targets", {}) or {}
    target = targets.get(language, {}) or {}
    warnings: list[str] = []
    summary = {
        "language": language,
        "required_term_count": 0,
        "matched_term_count": 0,
        "required_caveat_count": 0,
        "matched_caveat_count": 0,
        "forbidden_confusion_count": 0,
        "confusion_hit_count": 0,
    }
    if not target:
        warnings.append(f"No multilingual QA spec is defined for language '{language}'.")
        return {"warnings": warnings, "summary": summary}
    lowered = text.lower()
    required_terms = target.get("required_terms", []) or []
    summary["required_term_count"] = len(required_terms)
    for term in required_terms:
        accepted = [str(item).lower() for item in term.get("accepted", []) or []]
        if any(candidate in lowered for candidate in accepted):
            summary["matched_term_count"] += 1
        else:
            warnings.append(f"Missing required multilingual term '{term.get('id', 'unknown')}' for language '{language}'.")
    required_caveats = target.get("required_caveats", []) or []
    summary["required_caveat_count"] = len(required_caveats)
    for caveat in required_caveats:
        accepted = [str(item).lower() for item in caveat.get("accepted", []) or []]
        if any(candidate in lowered for candidate in accepted):
            summary["matched_caveat_count"] += 1
        else:
            warnings.append(f"Missing required multilingual caveat '{caveat.get('id', 'unknown')}' for language '{language}'.")
    forbidden_confusions = target.get("forbidden_confusions", []) or []
    summary["forbidden_confusion_count"] = len(forbidden_confusions)
    for confusion in forbidden_confusions:
        patterns = [str(item).lower() for item in confusion.get("patterns", []) or []]
        if any(_contains_non_negated_pattern(lowered, pattern) for pattern in patterns):
            summary["confusion_hit_count"] += 1
            warnings.append(f"Detected forbidden multilingual confusion '{confusion.get('id', 'unknown')}' for language '{language}'.")
    return {"warnings": warnings, "summary": summary}


def multilingual_qa_for_pack(source_dir: str | Path, *, language: str, text: str) -> dict:
    spec = load_multilingual_qa_spec(source_dir)
    return multilingual_qa_for_text(spec, language=language, text=text)


def round_trip_warning_for_phrases(
    source_phrases: list[str],
    back_translated_text: str,
) -> dict:
    lowered = back_translated_text.lower()
    warnings: list[str] = []
    drifted: list[str] = []
    for phrase in source_phrases:
        normalized = str(phrase).strip().lower()
        if not normalized:
            continue
        if normalized not in lowered:
            warnings.append(f"Round-trip translation did not preserve source phrase '{phrase}'.")
            drifted.append(phrase)
    return {
        "warnings": warnings,
        "summary": {
            "source_phrase_count": len([phrase for phrase in source_phrases if str(phrase).strip()]),
            "round_trip_warning_count": len(warnings),
            "drifted_phrases": drifted,
        },
    }
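For orientation, a `multilingual_qa.yaml` consumed by `load_multilingual_qa_spec` above would follow this shape. The sketch below is hypothetical — the IDs and Spanish phrases are illustrative, not copied from the real pack, which ships its own reviewed spec:

```yaml
# Hypothetical sketch of the spec structure the loader above expects.
source_language: en
targets:
  es:
    required_terms:
      - id: shannon-entropy
        accepted: ["entropía de shannon", "entropia de shannon"]
    required_caveats:
      - id: entropy-not-thermodynamic
        accepted: ["no es idéntica a la entropía termodinámica"]
    forbidden_confusions:
      - id: entropy-not-thermodynamic-confusion
        patterns: ["es idéntica a la entropía termodinámica"]
```

Matching is lowercase substring matching, so `accepted` and `patterns` entries should be written in lowercase normalized form.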

View File

@@ -0,0 +1,148 @@
from __future__ import annotations

import json
from pathlib import Path

import yaml

from .pack_validator import load_pack_artifacts


def _normalize_phrase(text: str) -> str:
    return " ".join(str(text).replace(":", " ").replace("-", " ").split()).strip()


def _candidate_languages(languages: list[str] | None) -> list[str]:
    return list(languages) if languages else ["es", "fr"]


def _seed_required_terms(concepts: list[dict]) -> list[dict]:
    seeded = []
    seen = set()
    for concept in concepts:
        title = str(concept.get("title", "")).strip()
        concept_id = str(concept.get("id", "")).strip()
        if not title or not concept_id:
            continue
        normalized = _normalize_phrase(title)
        if len(normalized.split()) < 2:
            continue
        if concept_id in seen:
            continue
        seen.add(concept_id)
        seeded.append(
            {
                "id": concept_id,
                "accepted": [normalized],
            }
        )
    return seeded[:12]


def _seed_required_caveats(source_corpus: dict) -> list[dict]:
    caveats = []
    seen = set()
    for fragment in source_corpus.get("fragments", []) or []:
        texts = [fragment.get("text", "")]
        texts.extend(fragment.get("objectives", []) or [])
        texts.extend(fragment.get("exercises", []) or [])
        for text in texts:
            lowered = str(text).lower()
            if "not identical" in lowered or "differs from" in lowered or "careful interpretation" in lowered:
                lesson_title = _normalize_phrase(fragment.get("lesson_title", "lesson"))
                caveat_id = lesson_title.lower().replace(" ", "-")[:48] or "caveat"
                if caveat_id in seen:
                    continue
                seen.add(caveat_id)
                caveats.append(
                    {
                        "id": caveat_id,
                        "accepted": [_normalize_phrase(text)],
                    }
                )
    return caveats[:6]


def _seed_forbidden_confusions(required_caveats: list[dict]) -> list[dict]:
    confusions = []
    for caveat in required_caveats:
        accepted = caveat.get("accepted", []) or []
        if not accepted:
            continue
        phrase = str(accepted[0])
        lowered = phrase.lower()
        if "not identical" in lowered:
            confusion = phrase.replace("not identical", "identical")
        elif "differs from" in lowered:
            confusion = phrase.replace("differs from", "is identical to")
        else:
            continue
        confusions.append(
            {
                "id": f"{caveat['id']}-confusion",
                "patterns": [_normalize_phrase(confusion)],
            }
        )
    return confusions[:6]


def generate_multilingual_qa_seed(
    source_dir: str | Path,
    *,
    languages: list[str] | None = None,
) -> dict:
    source_dir = Path(source_dir)
    loaded = load_pack_artifacts(source_dir)
    if not loaded["ok"]:
        raise ValueError(f"Cannot seed multilingual QA for invalid pack directory: {source_dir}")
    concepts = loaded["artifacts"]["concepts"].get("concepts", []) or []
    source_corpus_path = source_dir / "source_corpus.json"
    source_corpus = json.loads(source_corpus_path.read_text(encoding="utf-8")) if source_corpus_path.exists() else {"fragments": []}
    required_terms = _seed_required_terms(concepts)
    required_caveats = _seed_required_caveats(source_corpus)
    forbidden_confusions = _seed_forbidden_confusions(required_caveats)
    targets = {}
    for language in _candidate_languages(languages):
        targets[language] = {
            "required_terms": required_terms,
            "required_caveats": required_caveats,
            "forbidden_confusions": forbidden_confusions,
        }
    return {
        "source_language": "en",
        "generated_by": "didactopus.multilingual_qa_seed",
        "review_status": "draft-seed",
        "targets": targets,
    }


def write_multilingual_qa_seed(
    source_dir: str | Path,
    *,
    out_path: str | Path | None = None,
    languages: list[str] | None = None,
) -> Path:
    source_dir = Path(source_dir)
    payload = generate_multilingual_qa_seed(source_dir, languages=languages)
    out_path = Path(out_path) if out_path is not None else source_dir / "multilingual_qa.seed.yaml"
    out_path.write_text(yaml.safe_dump(payload, sort_keys=False, allow_unicode=False), encoding="utf-8")
    return out_path


def main() -> None:
    import argparse

    parser = argparse.ArgumentParser(description="Generate a starter multilingual QA spec from a Didactopus pack.")
    parser.add_argument("pack_dir")
    parser.add_argument("--out", default=None)
    parser.add_argument("--languages", nargs="*", default=None)
    args = parser.parse_args()
    out_path = write_multilingual_qa_seed(args.pack_dir, out_path=args.out, languages=args.languages)
    print(json.dumps({"written": str(out_path)}, indent=2))


if __name__ == "__main__":
    main()

View File

@@ -8,6 +8,7 @@ import yaml

 from .evaluator_pipeline import CritiqueEvaluator, LearnerAttempt, RubricEvaluator, SymbolicRuleEvaluator, aggregate, run_pipeline
 from .graph_retrieval import GraphBundle, lesson_titles_for_concept, prerequisite_titles, source_fragments_for_concept
+from .multilingual_qa import load_multilingual_qa_spec


 @dataclass
@@ -21,6 +22,7 @@ class SkillContext:
     graph_bundle: GraphBundle
     capability_profile: dict
     run_summary: dict
+    multilingual_qa: dict


 def load_ocw_skill_context(skill_dir: str | Path) -> SkillContext:
@@ -54,6 +56,7 @@ def load_ocw_skill_context(skill_dir: str | Path) -> SkillContext:
         ),
         capability_profile=json.loads((run_dir / "capability_profile.json").read_text(encoding="utf-8")),
         run_summary=json.loads((run_dir / "run_summary.json").read_text(encoding="utf-8")),
+        multilingual_qa=load_multilingual_qa_spec(pack_dir),
     )

View File

@@ -35,4 +35,6 @@ def test_run_didactopus_arena_writes_outputs(tmp_path: Path) -> None:
     queue = json.loads((tmp_path / "arena_review_queue.json").read_text(encoding="utf-8"))
     assert queue
     assert payload["ranked_candidates"][0]["language"] in {"en", "es", "fr"}
+    assert "multilingual_score" in payload["ranked_candidates"][0]["role_results"][0]
+    assert "round_trip" in payload["ranked_candidates"][0]["role_results"][0]
     assert "LLM Review Summary" in (tmp_path / "arena_report.md").read_text(encoding="utf-8")

View File

@@ -0,0 +1,28 @@
from didactopus.language_support import language_alignment_score, response_language_instruction, ui_text


def test_response_language_instruction_is_empty_for_source_language() -> None:
    assert response_language_instruction("en", "en") == ""


def test_response_language_instruction_mentions_target_language() -> None:
    instruction = response_language_instruction("es", "en")
    assert "Spanish" in instruction
    assert "English" in instruction


def test_ui_text_uses_translated_labels() -> None:
    assert ui_text("study_plan", "es") == "Plan de estudio"
    assert ui_text("evaluation_summary", "fr") == "Resume de l'evaluation"


def test_language_alignment_score_detects_non_english_markers() -> None:
    score, notes = language_alignment_score("La entropia y la capacidad del canal se comparan para el aprendiz.", "es")
    assert score == 1.0
    assert notes == []


def test_language_alignment_score_flags_wrong_language() -> None:
    score, notes = language_alignment_score("This response is still entirely in English.", "es")
    assert score == 0.0
    assert notes

View File

@@ -30,9 +30,23 @@ def test_accessible_session_text_is_linearized() -> None:
     assert "Learner goal:" in text
     assert "Source language:" in text
     assert "Output language:" in text
-    assert "Study plan:" in text
+    assert "Study Plan:" in text
     assert "Conversation:" in text
-    assert "Evaluation summary:" in text
+    assert "Evaluation Summary:" in text


+def test_accessible_session_outputs_localize_fixed_labels() -> None:
+    root = Path(__file__).resolve().parents[1]
+    payload = run_learner_session_demo(
+        root / "configs" / "config.example.yaml",
+        root / "skills" / "ocw-information-entropy-agent",
+        language="es",
+    )
+    html = build_accessible_session_html(payload)
+    text = build_accessible_session_text(payload)
+    assert "Sesion de aprendizaje de Didactopus" in html
+    assert "Plan de estudio" in html
+    assert "Objetivo del aprendiz:" in text
+
+
 def test_render_accessible_session_outputs_writes_files(tmp_path: Path) -> None:

View File

@@ -43,3 +43,15 @@ def test_model_benchmark_captures_response_preview_and_latency(tmp_path) -> None
     assert result["latency_ms"] >= 0.0
     assert result["response_preview"]
     assert "adequacy_score" in result
+    assert "round_trip" in result
+
+
+def test_model_benchmark_penalizes_stub_for_non_english_output(tmp_path) -> None:
+    payload = run_model_benchmark(
+        config_path="configs/config.example.yaml",
+        skill_dir="skills/ocw-information-entropy-agent",
+        out_dir=tmp_path,
+        language="es",
+    )
+    assert payload["context"]["output_language"] == "es"
+    assert any(result["multilingual_score"] < 1.0 for result in payload["role_results"])

View File

@@ -0,0 +1,52 @@
from pathlib import Path

from didactopus.multilingual_qa import (
    load_multilingual_qa_spec,
    multilingual_qa_for_pack,
    multilingual_qa_for_text,
    round_trip_warning_for_phrases,
)


def test_load_multilingual_qa_spec_reads_ocw_pack() -> None:
    spec = load_multilingual_qa_spec("domain-packs/mit-ocw-information-entropy")
    assert spec["source_language"] == "en"
    assert "es" in spec["targets"]
    assert "fr" in spec["targets"]


def test_multilingual_qa_for_text_accepts_spanish_preservation() -> None:
    spec = load_multilingual_qa_spec("domain-packs/mit-ocw-information-entropy")
    result = multilingual_qa_for_text(
        spec,
        language="es",
        text="La entropía de Shannon no es idéntica a la entropía termodinámica, y la capacidad del canal impone otro límite.",
    )
    assert result["summary"]["matched_term_count"] >= 2
    assert result["summary"]["matched_caveat_count"] == 1
    assert result["summary"]["confusion_hit_count"] == 0


def test_multilingual_qa_for_text_flags_confusion() -> None:
    spec = load_multilingual_qa_spec("domain-packs/mit-ocw-information-entropy")
    result = multilingual_qa_for_text(
        spec,
        language="es",
        text="La entropía de Shannon es idéntica a la entropía termodinámica.",
    )
    assert result["summary"]["confusion_hit_count"] == 1
    assert any("forbidden multilingual confusion" in warning.lower() for warning in result["warnings"])


def test_multilingual_qa_for_pack_handles_missing_spec(tmp_path: Path) -> None:
    result = multilingual_qa_for_pack(tmp_path, language="es", text="Texto de prueba.")
    assert any("no multilingual qa spec" in warning.lower() for warning in result["warnings"])


def test_round_trip_warning_for_phrases_flags_drift() -> None:
    result = round_trip_warning_for_phrases(
        ["Shannon entropy", "channel capacity"],
        "This back translation only preserved Shannon entropy.",
    )
    assert result["summary"]["round_trip_warning_count"] == 1
    assert result["summary"]["drifted_phrases"] == ["channel capacity"]

View File

@@ -0,0 +1,27 @@
from pathlib import Path

import yaml

from didactopus.multilingual_qa_seed import generate_multilingual_qa_seed, write_multilingual_qa_seed


def test_generate_multilingual_qa_seed_uses_pack_content() -> None:
    payload = generate_multilingual_qa_seed("domain-packs/mit-ocw-information-entropy", languages=["es"])
    assert payload["source_language"] == "en"
    assert payload["review_status"] == "draft-seed"
    assert "es" in payload["targets"]
    target = payload["targets"]["es"]
    assert target["required_terms"]
    assert any(item["id"] == "shannon-entropy" for item in target["required_terms"])
    assert target["required_caveats"]


def test_write_multilingual_qa_seed_writes_yaml(tmp_path: Path) -> None:
    out = write_multilingual_qa_seed(
        "domain-packs/mit-ocw-information-entropy",
        out_path=tmp_path / "multilingual_qa.seed.yaml",
        languages=["es", "fr"],
    )
    assert out.exists()
    written = yaml.safe_load(out.read_text(encoding="utf-8"))
    assert set(written["targets"]) == {"es", "fr"}