# Didactopus Arena Report
- Candidates: 3
## Rankings
- `stub-baseline` via `stub` / prompt variant `baseline`: borderline (0.733), language `en`
- `stub-strict-grounding` via `stub` / prompt variant `strict_grounding`: inadequate (0.547), language `es`
- `stub-trust-preserving` via `stub` / prompt variant `trust_preserving`: inadequate (0.547), language `fr`
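The ranking lines above follow a regular shape, so they can be read back into structured records. The following is a minimal sketch, not part of Didactopus: the `Ranking` type, the `RANKING_RE` pattern, and `parse_rankings` are hypothetical names, and the pattern is inferred only from the three lines shown in this report.

```python
import re
from typing import NamedTuple


class Ranking(NamedTuple):
    """One parsed ranking line (hypothetical structure, inferred from the report)."""
    candidate: str
    provider: str
    variant: str
    verdict: str
    score: float
    language: str


# Assumption: every ranking line has the same layout as the three shown above.
RANKING_RE = re.compile(
    r"^- `(?P<candidate>[^`]+)` via `(?P<provider>[^`]+)` / "
    r"prompt variant `(?P<variant>[^`]+)`: "
    r"(?P<verdict>\w+) \((?P<score>\d+\.\d+)\), language `(?P<language>[^`]+)`$"
)


def parse_rankings(report_text: str) -> list[Ranking]:
    """Return a Ranking for each line that matches the assumed layout."""
    rows = []
    for line in report_text.splitlines():
        match = RANKING_RE.match(line.strip())
        if match:
            rows.append(
                Ranking(
                    candidate=match["candidate"],
                    provider=match["provider"],
                    variant=match["variant"],
                    verdict=match["verdict"],
                    score=float(match["score"]),
                    language=match["language"],
                )
            )
    return rows


# The three ranking lines copied verbatim from this report.
SAMPLE = """\
- `stub-baseline` via `stub` / prompt variant `baseline`: borderline (0.733), language `en`
- `stub-strict-grounding` via `stub` / prompt variant `strict_grounding`: inadequate (0.547), language `es`
- `stub-trust-preserving` via `stub` / prompt variant `trust_preserving`: inadequate (0.547), language `fr`
"""

rankings = parse_rankings(SAMPLE)
# Pick the highest-scoring candidate; ties keep the first one encountered.
best = max(rankings, key=lambda r: r.score)
print(best.candidate, best.verdict, best.score)
```

Lines that do not match the assumed layout are skipped silently, so a change in the report format would show up as a shorter result list rather than an exception.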
## Human Review Queue
- `stub-baseline`: needs_human_review=True, weak_roles=['mentor', 'evaluator']
- `stub-strict-grounding`: needs_human_review=True, weak_roles=['mentor', 'evaluator']
- `stub-trust-preserving`: needs_human_review=True, weak_roles=['mentor', 'evaluator']
## LLM Review Summary
[stubbed-response] [mentor] Review these Didactopus arena results for a human reviewer. Rank the strongest candidates, identify likely prompt improv