GenieHive/docs/local_llm_evaluation.md


Local LLM Evaluation for GenieHive Agent Roles

Last updated: 2026-04-27

Purpose

This document describes a framework for evaluating locally hosted LLMs against the roles that GenieHive needs to fulfill in a multi-agent or tool-use pipeline. The goal is to determine which models are fit for which roles given available hardware, and to produce benchmark data that GenieHive's own routing layer can consume.


Role Taxonomy

GenieHive routes by role. Before evaluating models, define the roles the agent pipeline is likely to need. The following taxonomy covers the most common cases.

Tier 1: Core Inference Roles

| Role ID           | Description                                               | Key requirements                                   |
|-------------------|-----------------------------------------------------------|----------------------------------------------------|
| general_assistant | General-purpose instruction following, Q&A, summarization | Good instruction following, ≥8k context            |
| reasoning         | Multi-step problem solving, chain-of-thought tasks        | Extended thinking, ≥16k context, slow OK           |
| code_assistant    | Code generation, explanation, debugging                   | Strong code benchmarks, fill-in-middle optional    |
| structured_output | JSON/schema-constrained generation                        | Grammar sampling or reliable JSON mode             |
| tool_use          | Tool/function call formatting and parsing                 | Function call format compliance, low hallucination |

Tier 2: Supporting Roles

| Role ID     | Description                                    | Key requirements                             |
|-------------|------------------------------------------------|----------------------------------------------|
| embedder    | Semantic embedding for RAG, search, clustering | High MTEB scores, must be loaded (not lazy)  |
| classifier  | Short-text classification, intent detection    | Fast TTFT, low token budget, reliable format |
| summarizer  | Condensing long documents                      | Long context (≥32k), extractive reliability  |
| critic      | Reviewing, scoring, or evaluating model outputs| Self-consistency, instruction precision      |
| transcriber | Audio-to-text (Whisper-family)                 | WER on domain-specific content               |

Tier 3: Specialized Roles (project-specific)

These are informed by the current project context (TalkOrigins bibliography pipeline, Panda's Thumb archive, multi-site search).

| Role ID               | Description                                          | Key requirements                               |
|-----------------------|------------------------------------------------------|------------------------------------------------|
| bibliographic_analyst | Extract, verify, and enrich bibliographic metadata   | Precise instruction following, structured JSON |
| science_explainer     | Explain scientific concepts for a general audience   | Factual accuracy, good prose, ≥8k context      |
| search_query_writer   | Generate search queries from topic descriptions      | Concise, varied output; fast                   |
| html_cleaner          | Identify and convert markup patterns (MT tags, etc.) | Reliable format compliance                     |

Hardware Context

Evaluation should be scoped to what is actually available. Document hardware before running benchmarks.

Recommended inventory fields per host:

host_id: atlas-01
gpu:
  - name: NVIDIA Tesla P40
    vram_gb: 24
    compute_capability: "6.1"   # Tesla P40 is Pascal
cpu:
  threads: 24
  model: "Intel Xeon"
ram_gb: 128
fast_storage: true   # NVMe vs. spinning rust matters for model load time

Models that do not fit in VRAM will run on CPU or split across GPU+CPU. Note GPU-only, GPU+CPU, and CPU-only fit status explicitly for each candidate.
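The fit status can be estimated before loading anything. A minimal sketch, under two stated assumptions: the GGUF file size approximates resident weight memory, and a fixed overhead stands in for KV cache and runtime buffers (tune both per host and context length):

```python
def fit_status(model_file_gb: float, vram_gb: float, overhead_gb: float = 2.0) -> str:
    """Classify how a GGUF model is likely to fit on a host.

    overhead_gb is a rough allowance for KV cache and CUDA buffers
    (an assumption; grows with context length).
    """
    if model_file_gb + overhead_gb <= vram_gb:
        return "gpu_only"
    if model_file_gb <= vram_gb * 2:  # crude cutoff: partial offload still helps
        return "gpu_cpu_split"
    return "cpu_only"

# Example: an ~8 GB Q4_K_M file on a 24 GB P40
print(fit_status(8.0, 24.0))  # gpu_only
```

Treat the result as a pre-screen only; Phase 1 below records the actual VRAM used.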


Candidate Model Selection

For each role tier, select 2–4 candidate models. Selection criteria:

  1. Fits hardware — VRAM budget for the target host
  2. GGUF available — for llama.cpp / llamafile deployment
  3. License — permissive enough for intended use
  4. Recency — prefer models released in the last 12 months unless a classic substantially outperforms

Suggested Starting Candidates (as of 2026-04)

General assistant / reasoning:

  • Qwen3-8B-Q4_K_M (fits comfortably on the 24 GB P40; extended thinking available)
  • Qwen3-14B-Q4_K_M (~10 GB VRAM at Q4_K_M, fits the P40 with room for KV cache; better reasoning)
  • Mistral-7B-Instruct-v0.3 (fast, reliable baseline)

Code assistant:

  • Qwen2.5-Coder-7B-Instruct
  • DeepSeek-Coder-V2-Lite-Instruct (16B MoE, may fit on CPU+GPU split)

Structured output / tool use:

  • Qwen3-8B (native tool call support)
  • functionary-small-v3.2 (purpose-built for tool use)
  • Hermes-3-Llama-3.1-8B (strong JSON reliability)

Embeddings:

  • nomic-embed-text-v1.5 (fast, high MTEB, 137M params)
  • mxbai-embed-large-v1 (larger, higher MTEB)
  • bge-small-en-v1.5 (smallest, acceptable quality for retrieval)

Transcription:

  • faster-whisper large-v3 (best WER, GPU accelerated)
  • faster-whisper medium.en (faster, smaller, English-only)

Evaluation Protocol

Phase 1: Deployment fit check

For each candidate:

  1. Load the model via llama.cpp or Ollama.
  2. Send a minimal completion request to confirm the endpoint is responding.
  3. Record:
    • Actual VRAM used (from nvidia-smi)
    • Time to first token on a short prompt (~50 tokens)
    • Tokens/sec on a medium completion (~200 tokens)

Pass criterion: TTFT < 5 s, tokens/sec > 5.
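The Phase 1 measurements can be scripted against any OpenAI-compatible endpoint. A sketch, assuming the server exposes /v1/completions and returns a usage block; without streaming, total latency on a short prompt stands in as a TTFT proxy:

```python
import json
import time
import urllib.request

def measure_completion(base_url: str, model: str, prompt: str, max_tokens: int = 200):
    """Time one completion; returns (latency_s, tokens_per_s).

    base_url and model are placeholders for your deployment.
    """
    body = json.dumps({"model": model, "prompt": prompt,
                       "max_tokens": max_tokens}).encode()
    req = urllib.request.Request(f"{base_url}/v1/completions", data=body,
                                 headers={"Content-Type": "application/json"})
    start = time.monotonic()
    with urllib.request.urlopen(req) as resp:
        data = json.load(resp)
    elapsed = time.monotonic() - start
    completion_tokens = data.get("usage", {}).get("completion_tokens", max_tokens)
    return elapsed, completion_tokens / elapsed

def phase1_pass(ttft_s: float, tokens_per_s: float) -> bool:
    """Pass criterion from this document: TTFT < 5 s, tokens/sec > 5."""
    return ttft_s < 5.0 and tokens_per_s > 5.0
```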

Phase 2: Role fitness benchmarks

Use GenieHive's built-in benchmark runner for chat roles. Extend it with custom workloads for each role. Each workload should have 3–5 cases with known expected outputs or pass criteria.

Workload design principles:

  • Cases should be representative of real workload (not toy examples)
  • Pass criteria should be checkable without a judge model where possible (exact match, JSON parse, regex, non-empty, length bounds)
  • Include at least one adversarial case per role (ambiguous prompt, edge input)
  • Record chat_template_kwargs for models that need them (e.g., Qwen3 thinking)

Suggested workloads to add to benchmark_runner.py:

chat.structured_json      — produce valid JSON matching a schema
chat.tool_call_format     — emit a well-formed function call
chat.code_python          — generate a short working Python function
chat.long_context_recall  — answer from a 16k-token context document
chat.concise_classification — classify text into one of N labels
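As an illustration, a chat.structured_json case can be written so its pass criterion is a JSON parse plus type checks, with no judge model. The case shape below is hypothetical; benchmark_runner.py's actual case schema may differ:

```python
import json

# Hypothetical case record: a prompt plus a machine-checkable pass criterion.
CASE = {
    "id": "structured_json/contact_card",
    "prompt": ("Return a JSON object with keys 'name' (string) and "
               "'year' (integer) for: Charles Darwin, 1859."),
    "required_keys": {"name": str, "year": int},
}

def check_structured_json(output: str, required_keys: dict) -> bool:
    """Pass if the output parses as JSON and every required key has the right type."""
    try:
        obj = json.loads(output)
    except json.JSONDecodeError:
        return False
    return all(isinstance(obj.get(k), t) for k, t in required_keys.items())
```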

Embeddings workloads (separate evaluation script needed):

  • Cosine similarity ranking on semantically close/distant pairs
  • Retrieval recall@5 on a small fixed corpus
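Both embedding checks reduce to cosine similarity plus a set intersection. A dependency-free sketch, where the corpus is a list of (doc_id, vector) pairs:

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def recall_at_k(query_vec, corpus, relevant_ids, k=5):
    """Fraction of relevant doc_ids appearing in the top-k by cosine similarity."""
    ranked = sorted(corpus, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    top_ids = {doc_id for doc_id, _ in ranked[:k]}
    return len(top_ids & set(relevant_ids)) / len(relevant_ids)
```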

Phase 3: Comparative scoring

For each role, rank candidates by:

  1. Pass rate (primary)
  2. Tokens/sec (secondary, for latency-sensitive roles)
  3. TTFT (secondary, for interactive roles)
  4. VRAM cost (tie-breaker)
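The ranking is a lexicographic sort. A sketch using the numbers from the results template later in this document; the record field names are assumptions:

```python
# Hypothetical per-candidate records collected in Phases 1-2.
candidates = [
    {"model": "Qwen3-8B-Q4_K_M", "pass_rate": 0.92, "tok_s": 38, "ttft_ms": 420, "vram_gb": 6.1},
    {"model": "Mistral-7B-v0.3", "pass_rate": 0.85, "tok_s": 52, "ttft_ms": 310, "vram_gb": 4.9},
]

def rank_key(c):
    # Higher pass rate and tok/s sort first, then lower TTFT, then lower VRAM.
    return (-c["pass_rate"], -c["tok_s"], c["ttft_ms"], c["vram_gb"])

ranked = sorted(candidates, key=rank_key)
winner, runner_up = ranked[0], ranked[1]
```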

Document the winner and runner-up. Load both into GenieHive's benchmark store so the routing layer can score them in live operation.


Integrating Results into GenieHive

After running benchmarks:

  1. Emit a JSON benchmark report (use run_benchmark_workload.py).
  2. Ingest into the control plane: python scripts/ingest_benchmark_report.py.
  3. Define a role in roles.yaml with preferred_families aligned to the winning candidate's model family.
  4. Verify routing: GET /v1/cluster/routes/resolve?model=<role_id> should return the winning service.
  5. Run a live request through the role to confirm end-to-end.
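Step 4 can be scripted as a smoke check. A sketch assuming the control plane speaks plain HTTP; base_url is a placeholder for your deployment:

```python
import json
import urllib.request
from urllib.parse import urlencode

def resolve_url(base_url: str, role_id: str) -> str:
    """Build the route-resolution URL from step 4."""
    return f"{base_url}/v1/cluster/routes/resolve?" + urlencode({"model": role_id})

def resolve_route(base_url: str, role_id: str) -> dict:
    """Ask the control plane which service the role currently resolves to."""
    with urllib.request.urlopen(resolve_url(base_url, role_id)) as resp:
        return json.load(resp)

# route = resolve_route("http://localhost:8080", "general_assistant")
# Confirm it names the Phase 3 winner before sending live traffic.
```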

Evaluation Checklist

[ ] Hardware inventory documented for each candidate host
[ ] Candidate models selected per role tier
[ ] Each candidate loaded and Phase 1 deployment check passed
[ ] Custom workloads written for at least Tier 1 roles
[ ] Phase 2 benchmarks run and results recorded
[ ] Results ingested into GenieHive benchmark store
[ ] Roles defined in roles.yaml matching Phase 3 winners
[ ] End-to-end routing verified for each role
[ ] Results documented in a summary (see template below)

Results Summary Template

## Evaluation Results — <date>

### Hardware
- Host: <host_id>
- GPU: <name>, <vram_gb> GB VRAM
- RAM: <ram_gb> GB

### Role: general_assistant
| Model              | Pass rate | tok/s | TTFT ms | VRAM GB | Result  |
|--------------------|-----------|-------|---------|---------|---------|
| Qwen3-8B-Q4_K_M    | 0.92      | 38    | 420     | 6.1     | WINNER  |
| Mistral-7B-v0.3    | 0.85      | 52    | 310     | 4.9     | runner-up |

### Role: embedder
| Model                  | Recall@5 | Latency ms | VRAM GB | Result  |
|------------------------|----------|------------|---------|---------|
| nomic-embed-text-v1.5  | 0.88     | 12         | 0.3     | WINNER  |
| bge-small-en-v1.5      | 0.79     | 8          | 0.1     | runner-up |

... (repeat for each role)

Notes on Model Families and Known Behaviors

Qwen3 / Qwen3.5: GenieHive auto-detects these and sets enable_thinking: false unless a role or asset explicitly overrides. For the reasoning role, set enable_thinking: true in the role's body_defaults to engage extended chain-of-thought.
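The reasoning-role override might look like the fragment below. This is hypothetical: only preferred_families, body_defaults, and enable_thinking are named in this document, and the exact nesting (e.g., whether enable_thinking sits under chat_template_kwargs) depends on your roles.yaml schema.

```yaml
# Hypothetical roles.yaml fragment; field nesting is an assumption.
roles:
  reasoning:
    preferred_families: ["qwen3"]
    body_defaults:
      chat_template_kwargs:
        enable_thinking: true   # engage extended chain-of-thought
```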

Mistral / Mixtral: Standard instruction format. No special handling needed.

DeepSeek models: Some versions use a <think> block in their output. GenieHive strips reasoning_content from responses but not inline <think> blocks. If the model emits inline thinking that should be hidden from clients, add a response-cleaning step or configure the model server to suppress it.

Embedding models via Ollama: Ollama's embedding endpoint is /api/embeddings, not /v1/embeddings. The current UpstreamClient uses the OpenAI-compatible path. When registering an Ollama embedding service, confirm the node config points to the correct endpoint or that the Ollama version supports /v1/embeddings.
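The mismatch can be caught before registering the service by building the request for each API shape. A sketch; note that Ollama's native endpoint takes a "prompt" field where the OpenAI-compatible one takes "input":

```python
import json

def build_embedding_request(host: str, model: str, text: str, openai_compatible: bool):
    """Return (url, body) for an embedding call against either API shape.

    Ollama's native endpoint is POST /api/embeddings with a "prompt" field;
    the OpenAI-compatible endpoint is POST /v1/embeddings with "input".
    """
    if openai_compatible:
        return f"{host}/v1/embeddings", json.dumps({"model": model, "input": text})
    return f"{host}/api/embeddings", json.dumps({"model": model, "prompt": text})

# Ollama's default port is 11434:
# url, body = build_embedding_request("http://localhost:11434", "nomic-embed-text", "hello", False)
```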

llamafile: Does not support the embeddings endpoint. Only suitable for chat roles.