# Local LLM Evaluation for GenieHive Agent Roles

Last updated: 2026-04-27

## Purpose

This document describes a framework for evaluating locally hosted LLMs against the
roles that GenieHive needs to fulfill in a multi-agent or tool-use pipeline. The goal
is to determine which models are fit for which roles given the available hardware, and
to produce benchmark data that GenieHive's own routing layer can consume.

---
## Role Taxonomy

GenieHive routes by role. Before evaluating models, define the roles the agent
pipeline is likely to need. The following taxonomy covers the most common cases.

### Tier 1: Core Inference Roles

| Role ID              | Description                                                | Key requirements                                    |
|----------------------|------------------------------------------------------------|-----------------------------------------------------|
| `general_assistant`  | General-purpose instruction following, Q&A, summarization  | Good instruction following, ≥8k context             |
| `reasoning`          | Multi-step problem solving, chain-of-thought tasks         | Extended thinking, ≥16k context, slow OK            |
| `code_assistant`     | Code generation, explanation, debugging                    | Strong code benchmarks, fill-in-middle optional     |
| `structured_output`  | JSON/schema-constrained generation                         | Grammar sampling or reliable JSON mode              |
| `tool_use`           | Tool/function call formatting and parsing                  | Function call format compliance, low hallucination  |

### Tier 2: Supporting Roles

| Role ID              | Description                                                | Key requirements                                    |
|----------------------|------------------------------------------------------------|-----------------------------------------------------|
| `embedder`           | Semantic embedding for RAG, search, clustering             | High MTEB scores, must be loaded (not lazy)         |
| `classifier`         | Short-text classification, intent detection                | Fast TTFT, low token budget, reliable format        |
| `summarizer`         | Condensing long documents                                  | Long context (≥32k), extractive reliability         |
| `critic`             | Reviewing, scoring, or evaluating model outputs            | Self-consistency, instruction precision             |
| `transcriber`        | Audio-to-text (Whisper-family)                             | WER on domain-specific content                      |

### Tier 3: Specialized Roles (project-specific)

These are informed by the current project context (TalkOrigins bibliography pipeline,
Panda's Thumb archive, multi-site search).

| Role ID                 | Description                                                | Key requirements                                  |
|-------------------------|------------------------------------------------------------|---------------------------------------------------|
| `bibliographic_analyst` | Extract, verify, and enrich bibliographic metadata         | Precise instruction following, structured JSON    |
| `science_explainer`     | Explain scientific concepts for a general audience         | Factual accuracy, good prose, ≥8k context         |
| `search_query_writer`   | Generate search queries from topic descriptions            | Concise, varied output; fast                      |
| `html_cleaner`          | Identify and convert markup patterns (MT tags, etc.)       | Reliable format compliance                        |

---
## Hardware Context

Evaluation should be scoped to what is actually available. Document hardware before
running benchmarks.

Recommended inventory fields per host:

```yaml
host_id: atlas-01
gpu:
  - name: NVIDIA Tesla P40
    vram_gb: 24
    cuda: "8.0"
cpu:
  threads: 24
  model: "Intel Xeon"
ram_gb: 128
fast_storage: true  # NVMe vs. spinning rust matters for model load time
```

Models that do not fit in VRAM will run on CPU or split across GPU+CPU. Note
GPU-only, GPU+CPU, and CPU-only fit status explicitly for each candidate.
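
As a coarse first pass before loading anything, the GGUF file size plus a margin for
KV cache and compute buffers can be compared against the card's VRAM. This is only a
heuristic, not GenieHive functionality; the margin factor and the `gguf_path` /
`vram_gb` values below are illustrative assumptions, and the Phase 1 measurements
from `nvidia-smi` remain the authoritative numbers.

```python
from pathlib import Path


def rough_fit(gguf_path: str, vram_gb: float, margin_factor: float = 1.2) -> str:
    """Classify a GGUF file as GPU-only, GPU+CPU split, or CPU-only (heuristic).

    margin_factor pads the raw file size to approximate KV cache and
    compute-buffer overhead at moderate context lengths; tune it per host.
    """
    size_gb = Path(gguf_path).stat().st_size / 1024**3
    needed_gb = size_gb * margin_factor
    if needed_gb <= vram_gb:
        return "GPU-only"
    if size_gb <= vram_gb:  # weights mostly fit, overhead spills to system RAM
        return "GPU+CPU"
    return "CPU-only (or partial offload)"


# Example against a hypothetical 24 GB P40 and a placeholder model path:
print(rough_fit("models/Qwen3-14B-Q4_K_M.gguf", vram_gb=24))
```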
---
## Candidate Model Selection

For each role tier, select 2–4 candidate models. Selection criteria:

1. **Fits hardware** — VRAM budget for the target host
2. **GGUF available** — for llama.cpp / llamafile deployment
3. **License** — permissive enough for intended use
4. **Recency** — prefer models released in the last 12 months, unless an older
   model substantially outperforms the newer options
### Suggested Starting Candidates (as of 2026-04)

**General assistant / reasoning:**
- Qwen3-8B-Q4_K_M (fits P40 at 24 GB, extended thinking available)
- Qwen3-14B-Q4_K_M (fits P40 at ~10 GB VRAM + offload, better reasoning)
- Mistral-7B-Instruct-v0.3 (fast, reliable baseline)

**Code assistant:**
- Qwen2.5-Coder-7B-Instruct
- DeepSeek-Coder-V2-Lite-Instruct (16B MoE, may fit on CPU+GPU split)

**Structured output / tool use:**
- Qwen3-8B (native tool call support)
- functionary-small-v3.2 (purpose-built for tool use)
- Hermes-3-Llama-3.1-8B (strong JSON reliability)

**Embeddings:**
- nomic-embed-text-v1.5 (fast, high MTEB, 137M params)
- mxbai-embed-large-v1 (larger, higher MTEB)
- bge-small-en-v1.5 (smallest, acceptable quality for retrieval)

**Transcription:**
- faster-whisper large-v3 (best WER, GPU accelerated)
- faster-whisper medium.en (faster, smaller, English-only)

---
## Evaluation Protocol

### Phase 1: Deployment fit check

For each candidate:

1. Load the model via llama.cpp or Ollama.
2. Send a minimal completion request to confirm the endpoint is responding.
3. Record:
   - Actual VRAM used (from `nvidia-smi`)
   - Time to first token on a short prompt (~50 tokens)
   - Tokens/sec on a medium completion (~200 tokens)

Pass criterion: TTFT < 5 s, tokens/sec > 5.
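
The TTFT and tokens/sec numbers can be captured with a short script against the
model's OpenAI-compatible chat endpoint. The sketch below streams one completion and
times the first and last chunks; the base URL, model name, and prompt are placeholder
assumptions, and it presumes the server exposes `/v1/chat/completions` with
`stream: true` (llama.cpp's server and recent Ollama versions do).

```python
import json
import time

import requests

BASE_URL = "http://localhost:8080/v1"   # placeholder endpoint
MODEL = "qwen3-8b-q4_k_m"               # placeholder model id


def measure(prompt: str, max_tokens: int = 200) -> dict:
    """Stream one completion and report TTFT (s) and decode tokens/sec."""
    start = time.perf_counter()
    first_token_at = None
    chunks = 0
    resp = requests.post(
        f"{BASE_URL}/chat/completions",
        json={
            "model": MODEL,
            "messages": [{"role": "user", "content": prompt}],
            "max_tokens": max_tokens,
            "stream": True,
        },
        stream=True,
        timeout=300,
    )
    resp.raise_for_status()
    for line in resp.iter_lines():
        if not line or not line.startswith(b"data: "):
            continue
        payload = line[len(b"data: "):]
        if payload == b"[DONE]":
            break
        delta = json.loads(payload)["choices"][0].get("delta", {})
        if delta.get("content"):
            if first_token_at is None:
                first_token_at = time.perf_counter()
            chunks += 1  # roughly one token per streamed chunk
    end = time.perf_counter()
    ttft = (first_token_at or end) - start
    decode_time = end - (first_token_at or end)
    return {"ttft_s": round(ttft, 2),
            "tok_per_s": round(chunks / decode_time, 1) if decode_time > 0 else None}


print(measure("Summarize the theory of evolution in three sentences."))
```

Chunk count is only an approximation of token count; for exact figures, use the
`usage` field of a non-streaming request or the server's timing logs.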
### Phase 2: Role fitness benchmarks

Use GenieHive's built-in benchmark runner for chat roles. Extend it with custom
workloads for each role. Each workload should have 3–5 cases with known expected
outputs or pass criteria.

**Workload design principles:**
- Cases should be representative of the real workload (not toy examples)
- Pass criteria should be checkable without a judge model where possible
  (exact match, JSON parse, regex, non-empty, length bounds)
- Include at least one adversarial case per role (ambiguous prompt, edge input)
- Record `chat_template_kwargs` for models that need them (e.g., Qwen3 thinking)

**Suggested workloads to add to `benchmark_runner.py`:**

```
chat.structured_json — produce valid JSON matching a schema
chat.tool_call_format — emit a well-formed function call
chat.code_python — generate a short working Python function
chat.long_context_recall — answer from a 16k-token context document
chat.concise_classification — classify text into one of N labels
```
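
As an illustration of a judge-free pass criterion, the sketch below shows what a
`chat.structured_json` case and its checker could look like. The case structure is a
standalone assumption, not `benchmark_runner.py`'s actual schema; adapt the shape to
whatever the runner expects.

```python
import json

# Hypothetical case definition: prompt plus the keys the JSON answer must contain.
CASE = {
    "id": "structured_json.book_metadata",
    "prompt": (
        "Return ONLY a JSON object with keys 'title', 'author', and 'year' "
        "for the book 'On the Origin of Species'."
    ),
    "required_keys": {"title", "author", "year"},
}


def check_structured_json(model_output: str, required_keys: set[str]) -> bool:
    """Pass if the output parses as a JSON object containing every required key."""
    text = model_output.strip()
    # Tolerate models that wrap JSON in a markdown code fence.
    if text.startswith("```"):
        text = text.strip("`").removeprefix("json").strip()
    try:
        obj = json.loads(text)
    except json.JSONDecodeError:
        return False
    return isinstance(obj, dict) and required_keys.issubset(obj.keys())


# Example: a well-formed answer passes, a prose answer fails.
print(check_structured_json('{"title": "On the Origin of Species", '
                            '"author": "Charles Darwin", "year": 1859}',
                            CASE["required_keys"]))   # True
print(check_structured_json("It was written by Darwin in 1859.",
                            CASE["required_keys"]))   # False
```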
**Embeddings workloads** (separate evaluation script needed):
- Cosine similarity ranking on semantically close/distant pairs
- Retrieval recall@5 on a small fixed corpus
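
A minimal recall@5 script could look like the sketch below. It assumes an
OpenAI-compatible `/v1/embeddings` endpoint (see the Ollama caveat in the notes at
the end of this document); the corpus, queries, and relevance labels are placeholder
data to be replaced with a small fixed evaluation set.

```python
import math

import requests

BASE_URL = "http://localhost:8080/v1"   # placeholder endpoint
MODEL = "nomic-embed-text-v1.5"         # placeholder embedding model

CORPUS = ["Natural selection acts on heritable variation.",
          "The Cambrian explosion produced many animal body plans.",
          "Plate tectonics explains continental drift."]
# Each query maps to the indices of the corpus documents that count as relevant.
QUERIES = {"How does natural selection work?": {0},
           "What happened during the Cambrian explosion?": {1}}


def embed(texts: list[str]) -> list[list[float]]:
    resp = requests.post(f"{BASE_URL}/embeddings",
                         json={"model": MODEL, "input": texts}, timeout=120)
    resp.raise_for_status()
    return [item["embedding"] for item in resp.json()["data"]]


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


doc_vecs = embed(CORPUS)
hits = 0
for query, relevant in QUERIES.items():
    q_vec = embed([query])[0]
    ranked = sorted(range(len(CORPUS)), key=lambda i: cosine(q_vec, doc_vecs[i]),
                    reverse=True)
    if relevant & set(ranked[:5]):   # recall@5: any relevant doc in the top 5
        hits += 1
print(f"recall@5 = {hits / len(QUERIES):.2f}")
```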
### Phase 3: Comparative scoring

For each role, rank candidates by:

1. Pass rate (primary)
2. Tokens/sec (secondary, for latency-sensitive roles)
3. TTFT (secondary, for interactive roles)
4. VRAM cost (tie-breaker)

Document the winner and runner-up. Load both into GenieHive's benchmark store
so the routing layer can score them in live operation.
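
One way to apply that ordering mechanically is a tuple sort key, as in the sketch
below. The result records are hypothetical (they reuse the illustrative numbers from
the summary template later in this document); in practice they would come from the
Phase 2 benchmark report.

```python
# Hypothetical Phase 2 results for one role: higher pass_rate/tok_s is better,
# lower ttft_ms/vram_gb is better.
results = [
    {"model": "Qwen3-8B-Q4_K_M", "pass_rate": 0.92, "tok_s": 38, "ttft_ms": 420, "vram_gb": 6.1},
    {"model": "Mistral-7B-Instruct-v0.3", "pass_rate": 0.85, "tok_s": 52, "ttft_ms": 310, "vram_gb": 4.9},
]

# Sort: pass rate first, then throughput, then TTFT, then VRAM as tie-breaker.
ranked = sorted(results,
                key=lambda r: (-r["pass_rate"], -r["tok_s"], r["ttft_ms"], r["vram_gb"]))

for place, r in enumerate(ranked, start=1):
    label = "WINNER" if place == 1 else "runner-up" if place == 2 else ""
    print(place, r["model"], label)
```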
---
## Integrating Results into GenieHive

After running benchmarks:

1. Emit a JSON benchmark report (use `run_benchmark_workload.py`).
2. Ingest it into the control plane: `python scripts/ingest_benchmark_report.py`.
3. Define a role in `roles.yaml` with `preferred_families` aligned to the
   winning candidate's model family.
4. Verify routing: `GET /v1/cluster/routes/resolve?model=<role_id>` should
   return the winning service.
5. Run a live request through the role to confirm end-to-end behavior.
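
Steps 4 and 5 can be scripted against the control plane, roughly as below. The base
URL and role ID are placeholders, and the assumption that the gateway accepts the role
ID as the `model` field of a chat request should be confirmed against GenieHive's
routing documentation.

```python
import requests

CONTROL_PLANE = "http://localhost:9000"   # placeholder control-plane URL
ROLE_ID = "general_assistant"

# Step 4: confirm the route resolves to the winning service.
route = requests.get(f"{CONTROL_PLANE}/v1/cluster/routes/resolve",
                     params={"model": ROLE_ID}, timeout=30)
route.raise_for_status()
print("resolved route:", route.json())

# Step 5: run one live request through the role to confirm end-to-end behavior.
chat = requests.post(f"{CONTROL_PLANE}/v1/chat/completions",
                     json={"model": ROLE_ID,
                           "messages": [{"role": "user",
                                         "content": "Reply with the single word: ok"}],
                           "max_tokens": 10},
                     timeout=120)
chat.raise_for_status()
print("reply:", chat.json()["choices"][0]["message"]["content"])
```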
---
## Evaluation Checklist

```
[ ] Hardware inventory documented for each candidate host
[ ] Candidate models selected per role tier
[ ] Each candidate loaded and Phase 1 deployment check passed
[ ] Custom workloads written for at least Tier 1 roles
[ ] Phase 2 benchmarks run and results recorded
[ ] Results ingested into GenieHive benchmark store
[ ] Roles defined in roles.yaml matching Phase 3 winners
[ ] End-to-end routing verified for each role
[ ] Results documented in a summary (see template below)
```

---
## Results Summary Template

```markdown
## Evaluation Results — <date>

### Hardware
- Host: <host_id>
- GPU: <name>, <vram_gb> GB VRAM
- RAM: <ram_gb> GB

### Role: general_assistant
| Model            | Pass rate | tok/s | TTFT ms | VRAM GB | Result    |
|------------------|-----------|-------|---------|---------|-----------|
| Qwen3-8B-Q4_K_M  | 0.92      | 38    | 420     | 6.1     | WINNER    |
| Mistral-7B-v0.3  | 0.85      | 52    | 310     | 4.9     | runner-up |

### Role: embedder
| Model                  | Recall@5 | Latency ms | VRAM GB | Result    |
|------------------------|----------|------------|---------|-----------|
| nomic-embed-text-v1.5  | 0.88     | 12         | 0.3     | WINNER    |
| bge-small-en-v1.5      | 0.79     | 8          | 0.1     | runner-up |

... (repeat for each role)
```

---
## Notes on Model Families and Known Behaviors

**Qwen3 / Qwen3.5:**
GenieHive auto-detects these and sets `enable_thinking: false` unless a role or
asset explicitly overrides it. For the `reasoning` role, set `enable_thinking: true`
in the role's `body_defaults` to engage extended chain-of-thought.
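
One way this shows up at the request level is as a `chat_template_kwargs` entry,
which is the form vLLM accepts for Qwen3 thinking; how GenieHive's `body_defaults`
map onto the outgoing request, and whether your llama.cpp/Ollama build honors the
field, should be confirmed separately. The endpoint and role name below are
placeholders.

```python
import requests

# Placeholder request illustrating the override a reasoning role's body_defaults
# might merge into an outgoing Qwen3 chat request (vLLM-style field).
payload = {
    "model": "reasoning",   # role ID, assuming the gateway routes roles as model names
    "messages": [{"role": "user", "content": "Is 9973 prime? Show your reasoning."}],
    "chat_template_kwargs": {"enable_thinking": True},   # Qwen3 extended thinking
    "max_tokens": 1024,
}
resp = requests.post("http://localhost:9000/v1/chat/completions",
                     json=payload, timeout=300)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```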
**Mistral / Mixtral:**
Standard instruction format. No special handling needed.

**DeepSeek models:**
Some versions use a `<think>` block in their output. GenieHive strips
`reasoning_content` from responses but not inline `<think>` blocks. If the
model emits inline thinking that should be hidden from clients, add a
response-cleaning step or configure the model server to suppress it.
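
A response-cleaning step can be as small as the sketch below, which removes any inline
`<think>…</think>` span before text is returned to clients. Where to hook it in
(gateway middleware, post-processing in the agent, etc.) is left open here.

```python
import re

# Matches a <think>...</think> block, including newlines, plus trailing whitespace.
THINK_BLOCK = re.compile(r"<think>.*?</think>\s*", flags=re.DOTALL)


def strip_inline_thinking(text: str) -> str:
    """Remove inline <think> blocks that some DeepSeek builds emit."""
    return THINK_BLOCK.sub("", text).strip()


raw = "<think>The user wants a one-line answer.</think>Paris is the capital of France."
print(strip_inline_thinking(raw))   # -> "Paris is the capital of France."
```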
**Embedding models via Ollama:**
Ollama's embedding endpoint is `/api/embeddings`, not `/v1/embeddings`. The
current `UpstreamClient` uses the OpenAI-compatible path. When registering
an Ollama embedding service, confirm the node config points to the correct
endpoint or that the Ollama version supports `/v1/embeddings`.
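
A quick compatibility probe, sketched below, can settle this per host before the
service is registered. The host and model tag are placeholders; newer Ollama releases
expose the OpenAI-compatible `/v1/embeddings` route, while older ones only answer on
`/api/embeddings` (which takes a `prompt` field rather than `input`).

```python
import requests

OLLAMA = "http://localhost:11434"   # placeholder Ollama host
MODEL = "nomic-embed-text"          # placeholder embedding model tag


def supports_openai_embeddings() -> bool:
    """Return True if this Ollama build answers on the OpenAI-compatible path."""
    try:
        r = requests.post(f"{OLLAMA}/v1/embeddings",
                          json={"model": MODEL, "input": "probe"}, timeout=30)
        return r.ok and "data" in r.json()
    except requests.RequestException:
        return False


if supports_openai_embeddings():
    print("register with the /v1/embeddings path")
else:
    # Fall back to Ollama's native endpoint; note the different request shape.
    r = requests.post(f"{OLLAMA}/api/embeddings",
                      json={"model": MODEL, "prompt": "probe"}, timeout=30)
    print("native endpoint only; embedding dims:", len(r.json()["embedding"]))
```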
**llamafile:**
Does not support the embeddings endpoint. Only suitable for chat roles.