# Didactopus Arena
The Didactopus arena compares candidate combinations of:
- provider configuration
- model choice
- role prompt variant
- output language
It is not a generic chatbot arena. It is a Didactopus-specific behavior arena for grounded learner tasks.
## What It Does
For each candidate, the arena runs the current graph-grounded learner-task shape for:
- mentor
- practice
- evaluator
It then produces:
- deterministic role scores
- candidate rankings
- a human review queue
- an optional LLM-written review summary to help the human reviewer triage results
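As a minimal sketch of the ranking step only (the score shapes and candidate names below are invented for illustration; the real arena schema may differ), ranking candidates by averaged deterministic role scores could look like:

```python
# Hypothetical sketch: rank arena candidates by their mean deterministic
# role score. The candidate names and score values are illustrative only.

def rank_candidates(results):
    """results maps candidate name -> {role: deterministic score in [0, 1]}."""
    ranked = sorted(
        results.items(),
        key=lambda item: sum(item[1].values()) / len(item[1]),
        reverse=True,
    )
    return [name for name, _ in ranked]

scores = {
    "qwen-baseline": {"mentor": 0.70, "practice": 0.60, "evaluator": 0.65},
    "qwen-strict": {"mentor": 0.80, "practice": 0.75, "evaluator": 0.70},
}
print(rank_candidates(scores))  # highest mean score first
```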
## Why This Exists
Didactopus needs a practical way to improve:
- local model choice
- prompt variants
- trust-preserving behavior
- source-grounded behavior
This is an aid to benchmarking and review, not an automatic certification system.
## How To Run It
Use the example spec:

`python -m didactopus.arena --arena-spec configs/arena.example.yaml`

That writes outputs under `examples/arena/`.
## Spec Shape
The arena spec is a YAML file with two top-level sections:

- `candidates`
- `review`

Example candidate fields:

- `name`
- `config`
- `prompt_variant`
- `language`

Example review fields:

- `enabled`
- `config`
- `role`
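Putting those fields together, a minimal spec might look like the following (all values are placeholders, not the shipped `configs/arena.example.yaml`):

```yaml
# Hypothetical arena spec; paths and names are placeholders.
candidates:
  - name: qwen-baseline
    config: configs/local-qwen.yaml
    prompt_variant: baseline
    language: en
  - name: qwen-strict
    config: configs/local-qwen.yaml
    prompt_variant: strict_grounding
    language: en

review:
  enabled: true
  config: configs/review-model.yaml
  role: mentor
```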
## Current Prompt Variants

- `baseline`
- `strict_grounding`
- `trust_preserving`
- `concise`
These are applied to Didactopus role prompts, not to arbitrary raw prompt strings.
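One way to picture variant application (purely illustrative; the variant texts below are invented, and the real prompt assembly is internal to Didactopus) is a role prompt combined with a variant-specific suffix:

```python
# Illustrative only: how a prompt variant might modify a role prompt.
# These suffix strings are invented, not Didactopus's actual prompts.
VARIANT_SUFFIXES = {
    "baseline": "",
    "strict_grounding": "Cite a graph source for every claim.",
    "trust_preserving": "Never contradict the learner's verified progress.",
    "concise": "Answer in at most three sentences.",
}

def apply_variant(role_prompt: str, variant: str) -> str:
    """Append the variant's extra guidance to a role prompt."""
    suffix = VARIANT_SUFFIXES[variant]
    return f"{role_prompt}\n{suffix}".strip()

print(apply_variant("You are the mentor role.", "concise"))
```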
## Outputs
The arena currently writes:
- `arena_results.json`
- `arena_review_queue.json`
- `arena_report.md`
## Human Review Position
The LLM review summary should be treated as initial triage support only.
The intended order of trust, from earliest signal to final authority, is:
- deterministic checks
- arena comparison results
- LLM comparative summary
- human reviewer decision
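That ordering can be sketched as a simple triage rule (the field names and thresholds here are hypothetical; the real review-queue format may differ):

```python
# Hypothetical triage sketch reflecting the order of trust above:
# deterministic checks gate everything, arena results deprioritize,
# the LLM summary only annotates, and the human makes the decision.

def triage(entry):
    """entry is a hypothetical review-queue item (a dict)."""
    if not entry["deterministic_checks_passed"]:
        return "reject"                      # deterministic checks come first
    if entry["arena_rank"] > 3:
        return "deprioritize"                # then arena comparison results
    note = entry.get("llm_summary", "")      # LLM summary is triage support only
    return f"human-review: {note}"           # human reviewer decides

print(triage({"deterministic_checks_passed": True, "arena_rank": 1,
              "llm_summary": "strong grounding"}))
```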