# Local Model Benchmark
Didactopus should not evaluate local models as generic chatbots. It should evaluate them as role-specific components in a graph-grounded learner workflow.

This benchmark uses the MIT OCW Information and Entropy skill bundle and measures whether a local model is adequate for the current Didactopus mentor loop.

## What It Benchmarks

The current harness evaluates three Didactopus roles:

- `mentor`
- `practice`
- `evaluator`

Each role is prompted with graph-grounded context derived from:

- `knowledge_graph.json`
- `source_corpus.json`
- the generated OCW skill bundle
## Why This Matters
Didactopus needs local models that are good enough to support guided learning on constrained hardware. That is a different question from asking which model is globally strongest.

The benchmark is intended to support comparisons across hardware tiers such as:

- Raspberry Pi-class devices
- low-end local desktops
- stronger local workstations
- RoleMesh-routed local model mixes
## How To Run It
Stub or local-demo run:

```bash
python -m didactopus.model_bench \
  --config configs/config.example.yaml \
  --skill-dir skills/ocw-information-entropy-agent \
  --out-dir examples/model-benchmark \
  --hardware-profile pi-minimal \
  --hardware-cpu cortex-a76 \
  --hardware-ram-gb 8
```
RoleMesh-backed run:

```bash
python -m didactopus.model_bench \
  --config configs/config.rolemesh.example.yaml \
  --skill-dir skills/ocw-information-entropy-agent \
  --out-dir examples/model-benchmark-rolemesh \
  --hardware-profile laptop-local \
  --hardware-cpu ryzen-7 \
  --hardware-ram-gb 32
```
## Outputs
The benchmark writes:

- `model_benchmark.json`
- `model_benchmark.md`

These include:

- provider and model information
- hardware profile metadata
- per-role latency
- per-role adequacy score and adequacy rating
- an overall recommendation
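As a quick way to inspect a finished run, a report summary could be sketched roughly as follows. The field names used here (`roles`, `adequacy_score`, `rating`) are assumptions about the JSON shape, not the benchmark's actual schema:

```python
# ASSUMPTION: the field names below ("roles", "adequacy_score",
# "rating") are illustrative guesses at the report layout, not the
# benchmark's actual schema.
def summarize_roles(report: dict) -> dict:
    """Collapse a benchmark report into a {role: rating} mapping."""
    return {role: entry.get("rating") for role, entry in report.get("roles", {}).items()}


# Hand-written report in the assumed shape, standing in for the real
# examples/model-benchmark/model_benchmark.json file:
sample = {
    "provider": "local",
    "model": "example-7b",
    "roles": {
        "mentor": {"adequacy_score": 0.8, "rating": "adequate"},
        "practice": {"adequacy_score": 0.55, "rating": "borderline"},
    },
}
print(summarize_roles(sample))  # {'mentor': 'adequate', 'practice': 'borderline'}
```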
## Current Scoring Shape
The current heuristic scoring asks whether each role does the right kind of work:

- `mentor`
  - stays tied to the grounded concept
  - surfaces structure or prerequisites
  - asks a focused learner question
- `practice`
  - produces a real exercise
  - avoids giving away the full solution
  - stays tied to the grounded topic
- `evaluator`
  - acknowledges learner strengths
  - preserves an existing caveat rather than inventing an omission
  - gives a concrete next step

This is deliberately narrower than a general-purpose benchmark. Didactopus cares about trustworthy learner guidance, not maximal generic fluency.
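Checks of this kind can be sketched as simple string heuristics. The function below is an illustration of the mentor-role criteria under assumed response shapes, not the harness's actual scorer; the trigger phrases are hypothetical:

```python
def mentor_checks(response: str, concept: str) -> dict:
    """Illustrative heuristic checks for the mentor role.

    ASSUMPTION: a sketch of the kind of checks described above,
    not the actual Didactopus scoring code. The structure-related
    trigger phrases are hypothetical.
    """
    text = response.lower()
    return {
        # stays tied to the grounded concept
        "grounded": concept.lower() in text,
        # surfaces structure or prerequisites
        "surfaces_structure": any(
            phrase in text for phrase in ("prerequisite", "builds on", "depends on")
        ),
        # asks a focused learner question
        "asks_question": text.rstrip().endswith("?"),
    }


checks = mentor_checks(
    "Entropy builds on probability as a prerequisite. "
    "What does a fair coin's entropy equal?",
    "entropy",
)
```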
When `--language` is set to a non-English value, the benchmark also applies a heuristic multilingual check:

- does the response appear to actually be in the target language?
- does it still preserve key grounded concept terms and caveats?

If the pack provides `multilingual_qa.yaml`, the benchmark also applies per-pack preservation checks from that spec.

For non-English runs, the benchmark additionally records a round-trip warning layer by back-translating role outputs into English and checking whether required source phrases are still recoverable. This is a warning-oriented signal, not a proof of correctness.
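The preservation side of the multilingual check could look roughly like the sketch below. The function name and inputs are hypothetical; the real check is driven by the pack's `multilingual_qa.yaml` spec:

```python
def terms_preserved(response: str, required_terms: list[str]) -> list[str]:
    """Return the required grounded terms missing from a response.

    ASSUMPTION: a sketch of the preservation check described above,
    not the actual multilingual_qa.yaml-driven implementation.
    """
    lowered = response.lower()
    return [term for term in required_terms if term.lower() not in lowered]


# A grounded response in the target language keeps the key terms:
missing = terms_preserved(
    "La entropía de Shannon mide la incertidumbre.",
    ["entropía", "Shannon"],
)
```

An empty result means all required terms survived; any returned terms would feed the warning layer rather than hard-fail the run.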
## Interpreting Ratings
- `adequate`
  - suitable for local guided-learning experiments
- `borderline`
  - usable only with review and caution
- `inadequate`
  - not recommended for learner-facing use in the current configuration
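The mapping from a per-role adequacy score to a rating label could be as simple as the sketch below. The 0.75 and 0.5 cutoffs are illustrative assumptions; the benchmark's real thresholds may differ:

```python
def rate(score: float) -> str:
    """Map an adequacy score in [0, 1] to a rating label.

    ASSUMPTION: the 0.75 / 0.5 thresholds are illustrative only;
    the benchmark's actual cutoffs may differ.
    """
    if score >= 0.75:
        return "adequate"
    if score >= 0.5:
        return "borderline"
    return "inadequate"
```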
## Recommended Next Step
As the learner session backend grows, the benchmark should expand to include:

- multi-turn sessions
- first-token delay and tokens-per-second capture
- memory and thermal observations on constrained hardware
- accessibility-specific checks for structure and spoken-output quality

For model-and-prompt comparison across multiple candidates, use:

- `docs/arena.md`