# Didactopus Arena

The Didactopus arena compares candidate combinations of:

- provider configuration
- model choice
- role prompt variant
- output language

It is not a generic chatbot arena; it is a Didactopus-specific behavior arena for grounded learner tasks.

## What It Does

For each candidate, the arena runs the current graph-grounded learner-task shape for each role:

- mentor
- practice
- evaluator

It then produces:

- deterministic role scores
- candidate rankings
- a human review queue
- an optional LLM-written review summary that helps the human reviewer triage results

## Why This Exists

Didactopus needs a practical way to improve:

- local model choice
- prompt variants
- trust-preserving behavior
- source-grounded behavior

The arena is an aid to benchmarking and review, not an automatic certification system.

## How To Run It

Run the arena against the example spec:

    python -m didactopus.arena --arena-spec configs/arena.example.yaml

Outputs are written under `examples/arena/`.

## Spec Shape

The arena spec is a YAML file with two top-level sections:

- `candidates`
- `review`

Example candidate fields:

- `name`
- `config`
- `prompt_variant`
- `language`

Example review fields:

- `enabled`
- `config`
- `role`
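
A minimal spec using these fields might look like the sketch below. Only the field names come from this document; the values (candidate names, config paths, languages) are illustrative assumptions, not shipped defaults.

```yaml
# Hypothetical arena spec: field names per docs/arena.md, values invented.
candidates:
  - name: baseline-en
    config: configs/local.example.yaml   # assumed provider/model config path
    prompt_variant: baseline
    language: en
  - name: strict-en
    config: configs/local.example.yaml
    prompt_variant: strict_grounding
    language: en

review:
  enabled: true
  config: configs/reviewer.example.yaml  # assumed reviewer model config path
  role: evaluator
```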

## Current Prompt Variants

- `baseline`
- `strict_grounding`
- `trust_preserving`
- `concise`

These variants are applied to Didactopus role prompts, not to arbitrary raw prompt strings.
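
Conceptually, a variant layers an extra instruction onto an existing role prompt rather than replacing it. The sketch below illustrates that idea only; the `apply_variant` helper and the instruction strings are hypothetical, not Didactopus's actual prompt text or API.

```python
# Illustrative sketch: variant names come from this doc, but the helper and
# instruction snippets are hypothetical stand-ins, not real Didactopus prompts.
VARIANT_INSTRUCTIONS = {
    "baseline": "",
    "strict_grounding": "Only make claims supported by the provided sources.",
    "trust_preserving": "Acknowledge uncertainty instead of guessing.",
    "concise": "Keep answers short and direct.",
}

def apply_variant(role_prompt: str, variant: str) -> str:
    """Append the variant's extra instruction to an existing role prompt."""
    extra = VARIANT_INSTRUCTIONS[variant]
    return role_prompt if not extra else f"{role_prompt}\n\n{extra}"

mentor_prompt = "You are the Didactopus mentor."  # placeholder role prompt
print(apply_variant(mentor_prompt, "strict_grounding"))
```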

## Outputs

The arena currently writes:

- `arena_results.json`
- `arena_review_queue.json`
- `arena_report.md`
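
The JSON schemas are not documented here. As a sketch, assuming `arena_results.json` holds a list of candidate records with `name` and `score` fields (a hypothetical shape, not the arena's actual schema), the rankings could be inspected like this:

```python
import json

# Hypothetical records standing in for arena_results.json; the real schema
# is defined by the arena itself, not by this doc.
records = json.loads("""
[
  {"name": "baseline-en", "score": 0.61},
  {"name": "strict-en", "score": 0.74}
]
""")

# Rank candidates from highest to lowest score.
ranked = sorted(records, key=lambda r: r["score"], reverse=True)
for rank, rec in enumerate(ranked, start=1):
    print(f"{rank}. {rec['name']}: {rec['score']:.2f}")
```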

## Human Review Position

The LLM review summary should be treated as initial triage support only.

The intended order of trust is:

  1. deterministic checks
  2. arena comparison results
  3. LLM comparative summary
  4. human reviewer decision