# Didactopus Arena
The Didactopus arena compares candidate combinations of:
- provider configuration
- model choice
- role prompt variant
- output language
It is not a generic chatbot arena. It is a Didactopus-specific behavior arena for grounded learner tasks.
## What It Does
For each candidate, the arena runs the current graph-grounded learner-task shape for each role:
- mentor
- practice
- evaluator
It then produces:
- deterministic role scores
- candidate rankings
- a human review queue
- an optional LLM-written review summary to help the human reviewer triage results
## Why This Exists
Didactopus needs a practical way to improve:
- local model choice
- prompt variants
- trust-preserving behavior
- source-grounded behavior
This is an aid to benchmarking and review, not an automatic certification system.
## How To Run It
Use the example spec:

```shell
python -m didactopus.arena --arena-spec configs/arena.example.yaml
```

That writes outputs under `examples/arena/`.
## Spec Shape
The arena spec is a YAML file with two top-level sections:
- `candidates`
- `review`
Example candidate fields:
- `name`
- `config`
- `prompt_variant`
- `language`
Example review fields:
- `enabled`
- `config`
- `role`
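Under those assumptions, a minimal spec might look like the following sketch. The file paths and field values here are illustrative only, not shipped defaults:

```yaml
# Hypothetical arena spec sketch; paths and values are illustrative.
candidates:
  - name: local-strict-en
    config: configs/local.yaml        # provider/model configuration (assumed path)
    prompt_variant: strict_grounding
    language: en
  - name: local-baseline-de
    config: configs/local.yaml
    prompt_variant: baseline
    language: de
review:
  enabled: true
  config: configs/review.yaml         # reviewer model configuration (assumed path)
  role: mentor
```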
## Current Prompt Variants
- `baseline`
- `strict_grounding`
- `trust_preserving`
- `concise`
These are applied to Didactopus role prompts, not to arbitrary raw prompt strings.
## Outputs
The arena currently writes:
- `arena_results.json`
- `arena_review_queue.json`
- `arena_report.md`
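As a sketch of how the ranked results might be consumed downstream, the snippet below sorts candidates by score. The exact JSON schema is not specified here, so the `name` and `score` keys are assumptions:

```python
import json

def top_candidates(results_json: str, n: int = 3) -> list[dict]:
    """Return the n highest-scoring candidates from an arena results payload.

    Assumes each entry carries "name" and "score" keys; the real
    arena_results.json schema may differ.
    """
    entries = json.loads(results_json)
    return sorted(entries, key=lambda e: e["score"], reverse=True)[:n]

# Illustrative payload only; not real arena output.
payload = '[{"name": "a", "score": 0.71}, {"name": "b", "score": 0.84}]'
print([e["name"] for e in top_candidates(payload)])  # → ['b', 'a']
```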
When a candidate sets a non-English language, the arena now also tracks a heuristic `multilingual_score` alongside the grounded behavior score. This is meant to catch obvious failures where a model ignores the requested output language or drops key grounded terms.
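A minimal sketch of one such heuristic, assuming the score is simply the fraction of required grounded terms that survive in the output (the real scorer is not specified here):

```python
def multilingual_term_score(output: str, required_terms: list[str]) -> float:
    """Fraction of required grounded terms present in the model output.

    A crude heuristic: a score near 0 suggests the model dropped key
    grounded terms, or answered in the wrong language entirely.
    """
    if not required_terms:
        return 1.0
    text = output.lower()
    hits = sum(1 for term in required_terms if term.lower() in text)
    return hits / len(required_terms)

print(multilingual_term_score("Die Photosynthese nutzt Lichtenergie.",
                              ["Photosynthese", "Lichtenergie"]))  # → 1.0
```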
If the pack provides `multilingual_qa.yaml`, the arena also uses that spec to check required terms, required caveats, and forbidden confusions for the target language.
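A hypothetical shape for such a spec, with key names inferred from the three checks just described (the real file format may differ):

```yaml
# Hypothetical multilingual_qa.yaml sketch; keys and values are illustrative.
de:
  required_terms:
    - Photosynthese
  required_caveats:
    - nur bei Lichteinfall     # "only under incident light" (illustrative)
  forbidden_confusions:
    - Zellatmung               # must not be conflated with respiration
```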
For non-English candidates, the arena also records round-trip warnings: it back-translates outputs into English and checks whether required source phrases remain recoverable.
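One way to sketch the recoverability half of that check: given the back-translated English text (the translation step itself is assumed to happen elsewhere, presumably via a translation model), flag required source phrases that no longer appear:

```python
def round_trip_warnings(back_translated: str,
                        required_phrases: list[str]) -> list[str]:
    """Return a warning for each required English source phrase that is no
    longer recoverable from the back-translated output.

    The caller supplies `back_translated` from an external translation step;
    this function only performs the phrase-recovery check.
    """
    text = back_translated.lower()
    return [f"phrase not recoverable after round trip: {p!r}"
            for p in required_phrases
            if p.lower() not in text]

warnings = round_trip_warnings("Photosynthesis converts light energy.",
                               ["light energy", "chlorophyll"])
print(warnings)  # one warning, for 'chlorophyll'
```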
## Human Review Position
The LLM review summary should be treated as initial triage support only.
The intended order of trust, from first automated signal to final authority, is:
- deterministic checks
- arena comparison results
- LLM comparative summary
- human reviewer decision