From a76c7e81f4cf5665ee1092fdb9f48a4a1b67fadf Mon Sep 17 00:00:00 2001 From: welberr Date: Mon, 27 Apr 2026 09:25:51 -0400 Subject: [PATCH] Revise architecture/roadmap docs and add LLM evaluation guide MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit - architecture.md: rewrite to describe the actual running system; remove design-phase repo-naming discussion and initial-implementation-sequence list; add data-flow diagram, scoring weights table, API status table - roadmap.md: replace aspirational list with concrete completed/gap/next structure; document four confirmed implementation gaps (transcription stub, strategy field ignored, fallback_roles unimplemented, benchmark quality score additive overflow); prioritise fixes as P0/P1/P2/P3 - docs/local_llm_evaluation.md: new document; role taxonomy (tier 1–3), hardware inventory template, candidate model suggestions, three-phase evaluation protocol, GenieHive integration steps, results template, notes on Qwen3/Mistral/DeepSeek/Ollama embedding path quirks Co-Authored-By: Claude Sonnet 4.6 --- docs/architecture.md | 306 +++++++++++++++++++---------------- docs/local_llm_evaluation.md | 251 ++++++++++++++++++++++++++++ docs/roadmap.md | 189 +++++++++++++++++++--- 3 files changed, 579 insertions(+), 167 deletions(-) create mode 100644 docs/local_llm_evaluation.md diff --git a/docs/architecture.md b/docs/architecture.md index 6f7f24f..1f1b5ce 100644 --- a/docs/architecture.md +++ b/docs/architecture.md @@ -1,74 +1,46 @@ # GenieHive Architecture -Status: proposed v1 architecture -Drafted: 2026-04-05 - -## Repo Name - -Chosen name: `GenieHive` - -Why this name: - -- suggestive: "genie" implies generative AI services, "hive" implies a cooperating cluster -- accessible: easy to say, remember, and explain -- whimsical enough to feel like a project name rather than a dry infrastructure label - -Tradeoff: - -- `GenieHive` is less search-distinct than `Geniewarren` because `hive` is a common product metaphor +Last updated: 2026-04-27 ## Mission -GenieHive is a local-first control plane for heterogeneous generative AI services running across one or more hosts. +GenieHive is a local-first control plane for heterogeneous generative AI services +running across one or more hosts. It provides: -It should: +- Registration and health tracking for distributed AI services +- A stable, OpenAI-compatible client-facing API +- Role-based routing and scheduling over multiple services +- Integrated benchmarking and performance-informed route scoring -- register hosts and their available services -- expose a stable client-facing API -- track health, capacity, and observed performance -- support direct model addressing and higher-level role addressing -- route requests to healthy loaded services first -- optionally coordinate loading or swapping when policy allows -- remain practical for a small self-hosted deployment with two hosts +It is not a plain OpenAI-compatible gateway. The control plane layer adds topology +awareness, role abstraction, and signal-driven routing that a dumb proxy does not +provide. 
-## Non-Goals For V1 +--- -Out of scope initially: +## Four Layers -- peer-to-peer consensus -- autonomous global model swapping across many nodes -- full WAN zero-trust platform engineering -- image and TTS generation orchestration -- distributed vector database management -- billing or multi-tenant quota accounting +``` +┌─────────────────────────────────────────────┐ +│ Client Facades │ +│ OpenAI-compatible completions + embeddings │ +│ Operator inspection API │ +├─────────────────────────────────────────────┤ +│ Control API │ +│ Registry · Role catalog · Route resolution │ +│ Scheduling · Benchmark store │ +├─────────────────────────────────────────────┤ +│ Node Agent(s) │ +│ Host discovery · Service enumeration │ +│ Telemetry reporting · Heartbeat │ +├─────────────────────────────────────────────┤ +│ Provider Adapters │ +│ OpenAI-compatible chat / embeddings │ +│ Transcription (partial) │ +└─────────────────────────────────────────────┘ +``` -## Architectural Position - -GenieHive is not just an OpenAI-compatible gateway. - -It is a control plane with these layers: - -1. Control API - - authoritative registry - - routing and scheduling - - role catalog - - operator inspection - -2. Node Agent - - host discovery - - service discovery - - telemetry reporting - - optional local process management - -3. Provider Adapters - - OpenAI-compatible chat backends - - OpenAI-compatible embedding backends - - transcription backends - - future adapters for image and speech synthesis - -4. Client Facades - - OpenAI-compatible facade for completions and embeddings - - operator API for topology, health, and inventory +--- ## Core Concepts @@ -78,117 +50,165 @@ A physical or virtual machine participating in the cluster. ### Service -A concrete callable capability on a host. Examples: - -- chat completion endpoint -- embedding endpoint -- transcription endpoint +A concrete callable capability on a host: a chat endpoint, an embeddings endpoint, +or a transcription endpoint. A host typically exposes multiple services. ### Asset -A model weight, model name, application, or runtime target that a service can serve. +A model weight, model name, or runtime target that a service can serve. Assets carry +optional `request_policy` fields that adjust how requests are shaped before forwarding. ### Role -A reusable task profile that describes how requests should be fulfilled. A role is policy, not a concrete model. +A reusable task profile that describes *how* requests should be fulfilled, not *which* +model fills them. A role has a prompt policy (system prompt injection, body defaults) +and a routing policy (preferred model families, minimum context size, loaded-first +preference). The same role can route to different services as cluster state changes. ### Route Resolution -Request handling order: +1. If `model` matches a loaded, healthy asset or service alias → route directly. +2. If `model` matches a known role → score eligible services and route to the best. +3. Otherwise → fail with a clear 404. -1. If the requested `model` matches a currently loaded and healthy concrete asset or service alias, route directly. -2. Otherwise, if the requested `model` matches a known role, resolve the role to the best eligible service. -3. Otherwise, fail clearly. 
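+
+A minimal sketch of this resolution order, assuming a hypothetical registry interface
+(the function, field, and exception names here are illustrative, not the actual
+`registry.py` API):
+
+```python
+class RouteNotFound(Exception):
+    """No asset, alias, or role matched; surfaced to the client as HTTP 404."""
+
+
+def resolve_route(model, kind, registry, score):
+    # 1. Direct match: a loaded, healthy concrete asset or service alias wins outright.
+    service = registry.find_direct(model, kind=kind)
+    if service is not None and service.healthy and service.loaded:
+        return service
+    # 2. Role match: filter eligible, healthy services and pick the highest-scoring one.
+    role = registry.find_role(model)
+    if role is not None:
+        candidates = [s for s in registry.services(kind=kind) if s.healthy]
+        if candidates:
+            return max(candidates, key=lambda s: score(s, role))
+    # 3. Nothing matched: fail clearly rather than guessing.
+    raise RouteNotFound(f"no asset, alias, or role named {model!r}")
+```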
+--- -## V1 Capability Scope +## Data Flow: Chat Completion -V1 supports only: +``` +Client POST /v1/chat/completions + │ + ▼ +resolve_route(model, kind="chat") + ├─ direct: asset_id or service alias match + └─ role: filter by kind/health → score by runtime + benchmark signals + │ + ▼ +apply_request_policy(request, asset, role) + ├─ deep-merge body_defaults + ├─ apply system prompt (prepend / append / replace) + └─ auto-infer Qwen3 template kwargs if needed + │ + ▼ +UpstreamClient.chat_completions(endpoint, modified_request) + │ + ▼ +_strip_reasoning_fields(response) ← removes reasoning_content / reasoning + │ + ▼ +Response to client +``` -- chat completions -- embeddings -- transcription +--- + +## Scoring + +Route scoring combines three signal families: + +| Signal family | Weight (role) | Weight (service) | +|----------------|---------------|-----------------| +| Text overlap | 30% | 20% | +| Runtime | 30% | 45% | +| Benchmark | 25% | 35% | +| Family pref. | 15% | — | + +**Runtime signals** (from last heartbeat): +- Loaded state: +0.35 +- Latency bands: p50 <500 ms +0.30, <1500 ms +0.20, <3000 ms +0.10, else +0.05 +- Throughput: ≥40 tok/s +0.20, ≥20 +0.10 +- Queue depth: penalty −0.20 if ≥5, −0.10 if ≥2 + +**Benchmark signals** (from ingested workload runs): +- Workload overlap score (Jaccard-style token overlap) +- Quality score from results: `0.45 * overlap + 0.55 * quality` + +--- ## Topology -Recommended initial topology: +**Minimum viable (single machine):** +``` +control plane + node agent + model server +all on 127.0.0.1, different ports +``` -- 1 control plane -- 2 node agents -- 1 or more clients -- LAN-first deployment -- API key auth in v1 -- VPN or mTLS in v1.5 +**Recommended (small cluster):** +``` +1 control plane host +2+ node-agent hosts, each with 1+ model servers +1+ clients on LAN +``` -## API Families +**Auth:** +- Client requests: `X-Api-Key` header +- Node registration/heartbeat: `X-GenieHive-Node-Key` header +- Empty key lists disable auth (development only) +- mTLS between control and nodes planned for v1.5 + +--- + +## State Store + +SQLite. Schema: + +| Table | Content | +|---------------------|-------------------------------------------| +| `hosts` | Host registration, resources, labels | +| `services` | Service config, runtime, assets, observed | +| `roles` | Role catalog | +| `benchmark_samples` | Workload results per service | + +Default path: `state/geniehive.sqlite3` + +--- + +## API Reference Summary ### Client API - -- `GET /v1/models` -- `POST /v1/chat/completions` -- `POST /v1/embeddings` -- `POST /v1/audio/transcriptions` - -`GET /v1/models` should expose enough metadata for programmatic clients to make routing decisions about what GenieHive can handle cheaply, especially for lower-complexity offloaded work. That metadata should include direct assets, service-backed aliases, role aliases, operation kind, health, loaded status, and observed performance hints. 
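+
+Because the facade is OpenAI-compatible, any HTTP client can drive it. A minimal
+example of addressing a role alias through the completions endpoint (host, port,
+key value, and role name are deployment-specific placeholders):
+
+```python
+import requests
+
+resp = requests.post(
+    "http://127.0.0.1:8000/v1/chat/completions",   # control-plane address (placeholder)
+    headers={"X-Api-Key": "change-me"},            # client API key (placeholder)
+    json={
+        "model": "general_assistant",              # a role alias; a concrete asset id also works
+        "messages": [{"role": "user", "content": "Say hello."}],
+    },
+    timeout=60,
+)
+resp.raise_for_status()
+print(resp.json()["choices"][0]["message"]["content"])
+```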
+| Endpoint | Status | +|---------------------------------|---------------| +| `GET /v1/models` | Implemented | +| `POST /v1/chat/completions` | Implemented | +| `POST /v1/embeddings` | Implemented | +| `POST /v1/audio/transcriptions` | Stub only | ### Operator API - -- `GET /v1/cluster/hosts` -- `GET /v1/cluster/services` -- `GET /v1/cluster/roles` -- `GET /v1/cluster/health` -- `GET /v1/cluster/routes/resolve?model=...` +| Endpoint | Status | +|------------------------------------|-------------| +| `GET /v1/cluster/hosts` | Implemented | +| `GET /v1/cluster/services` | Implemented | +| `GET /v1/cluster/roles` | Implemented | +| `GET /v1/cluster/benchmarks` | Implemented | +| `GET /v1/cluster/health` | Implemented | +| `GET /v1/cluster/routes/resolve` | Implemented | +| `POST /v1/cluster/routes/match` | Implemented | ### Node API +| Endpoint | Status | +|------------------------------|-------------| +| `POST /v1/nodes/register` | Implemented | +| `POST /v1/nodes/heartbeat` | Implemented | +| `GET /v1/node/inventory` | Implemented | +| `GET /v1/node/registration` | Implemented | -- `POST /v1/nodes/register` -- `POST /v1/nodes/heartbeat` -- `GET /v1/node/inventory` -- `POST /v1/node/services/refresh` +--- -## Data Store +## Supported Upstream Backends -V1 should use SQLite for durable state. +Any OpenAI-compatible HTTP server. Tested configurations: -## Routing Rules +- **Ollama** — chat and embeddings +- **llama.cpp** (server mode) — chat and embeddings +- **llamafile** — chat +- **vLLM** — chat and embeddings -### Direct Model Resolution +--- -If a request names a concrete asset alias or service alias: +## Non-Goals for V1 -- prefer loaded and healthy services -- choose the lowest-cost healthy target if multiple matches exist -- fail clearly if all matches are unhealthy - -### Role Resolution - -If direct resolution fails, treat the requested name as a role. - -Role resolution should filter by: - -- operation kind -- modality -- health -- auth and exposure compatibility -- minimum context or memory requirements -- preferred model families - -Then rank by: - -- already loaded -- recent health -- expected latency -- queue pressure -- operator priority - -## First Implementation Sequence - -1. Create the repo skeleton and docs. -2. Implement SQLite-backed registry models. -3. Implement node registration and heartbeat. -4. Implement operator inspection endpoints. -5. Implement client-facing chat routing. -6. Add embeddings routing. -7. Add transcription routing. -8. Add truthful readiness and health reporting. -9. Add role catalog and role-based resolution. -10. Add optional managed local runtime support. +- Peer-to-peer consensus +- Autonomous global model swapping +- WAN zero-trust networking +- Image and TTS generation +- Distributed vector databases +- Billing or multi-tenant quotas diff --git a/docs/local_llm_evaluation.md b/docs/local_llm_evaluation.md new file mode 100644 index 0000000..3a82c23 --- /dev/null +++ b/docs/local_llm_evaluation.md @@ -0,0 +1,251 @@ +# Local LLM Evaluation for GenieHive Agent Roles + +Last updated: 2026-04-27 + +## Purpose + +This document describes a framework for evaluating locally-hosted LLMs against the +roles that GenieHive needs to fulfill in a multi-agent or tool-use pipeline. The goal +is to determine which models are fit for which roles given available hardware, and +to produce benchmark data that GenieHive's own routing layer can consume. + +--- + +## Role Taxonomy + +GenieHive routes by role. 
Before evaluating models, the roles likely needed in an +agent pipeline must be defined. The following taxonomy covers the most common cases. + +### Tier 1: Core Inference Roles + +| Role ID | Description | Key requirements | +|----------------------|----------------------------------------------------------|---------------------------------------------------| +| `general_assistant` | General-purpose instruction following, Q&A, summarization| Good instruction following, ≥8k context | +| `reasoning` | Multi-step problem solving, chain-of-thought tasks | Extended thinking, ≥16k context, slow OK | +| `code_assistant` | Code generation, explanation, debugging | Strong code benchmarks, fill-in-middle optional | +| `structured_output` | JSON/schema-constrained generation | Grammar sampling or reliable JSON mode | +| `tool_use` | Tool/function call formatting and parsing | Function call format compliance, low hallucination| + +### Tier 2: Supporting Roles + +| Role ID | Description | Key requirements | +|----------------------|----------------------------------------------------------|---------------------------------------------------| +| `embedder` | Semantic embedding for RAG, search, clustering | High MTEB scores, must be loaded (not lazy) | +| `classifier` | Short-text classification, intent detection | Fast TTFT, low token budget, reliable format | +| `summarizer` | Condensing long documents | Long context (≥32k), extractive reliability | +| `critic` | Reviewing, scoring, or evaluating model outputs | Self-consistency, instruction precision | +| `transcriber` | Audio-to-text (Whisper-family) | WER on domain-specific content | + +### Tier 3: Specialized Roles (project-specific) + +These are informed by the current project context (TalkOrigins bibliography pipeline, +Panda's Thumb archive, multi-site search). + +| Role ID | Description | Key requirements | +|------------------------|------------------------------------------------------------|------------------------------------------------| +| `bibliographic_analyst`| Extract, verify, and enrich bibliographic metadata | Precise instruction following, structured JSON | +| `science_explainer` | Explain scientific concepts for a general audience | Factual accuracy, good prose, ≥8k context | +| `search_query_writer` | Generate search queries from topic descriptions | Concise, varied output; fast | +| `html_cleaner` | Identify and convert markup patterns (MT tags, etc.) | Reliable format compliance | + +--- + +## Hardware Context + +Evaluation should be scoped to what is actually available. Document hardware before +running benchmarks. + +Recommended inventory fields per host: + +```yaml +host_id: atlas-01 +gpu: + - name: NVIDIA Tesla P40 + vram_gb: 24 + cuda: "8.0" +cpu: + threads: 24 + model: "Intel Xeon" +ram_gb: 128 +fast_storage: true # NVMe vs. spinning rust matters for model load time +``` + +Models that do not fit in VRAM will run on CPU or split across GPU+CPU. Note +GPU-only, GPU+CPU, and CPU-only fit status explicitly for each candidate. + +--- + +## Candidate Model Selection + +For each role tier, select 2–4 candidate models. Selection criteria: + +1. **Fits hardware** — VRAM budget for the target host +2. **GGUF available** — for llama.cpp / llamafile deployment +3. **License** — permissive enough for intended use +4. 
**Recency** — prefer models released in the last 12 months unless a classic + substantially outperforms + +### Suggested Starting Candidates (as of 2026-04) + +**General assistant / reasoning:** +- Qwen3-8B-Q4_K_M (fits P40 at 24 GB, extended thinking available) +- Qwen3-14B-Q4_K_M (fits P40 at ~10 GB VRAM + offload, better reasoning) +- Mistral-7B-Instruct-v0.3 (fast, reliable baseline) + +**Code assistant:** +- Qwen2.5-Coder-7B-Instruct +- DeepSeek-Coder-V2-Lite-Instruct (16B MoE, may fit on CPU+GPU split) + +**Structured output / tool use:** +- Qwen3-8B (native tool call support) +- functionary-small-v3.2 (purpose-built for tool use) +- Hermes-3-Llama-3.1-8B (strong JSON reliability) + +**Embeddings:** +- nomic-embed-text-v1.5 (fast, high MTEB, 137M params) +- mxbai-embed-large-v1 (larger, higher MTEB) +- bge-small-en-v1.5 (smallest, acceptable quality for retrieval) + +**Transcription:** +- faster-whisper large-v3 (best WER, GPU accelerated) +- faster-whisper medium.en (faster, smaller, English-only) + +--- + +## Evaluation Protocol + +### Phase 1: Deployment fit check + +For each candidate: + +1. Load the model via llama.cpp or Ollama. +2. Send a minimal completion request to confirm the endpoint is responding. +3. Record: + - Actual VRAM used (from `nvidia-smi`) + - Time to first token on a short prompt (~50 tokens) + - Tokens/sec on a medium completion (~200 tokens) + +Pass criterion: TTFT < 5 s, tokens/sec > 5. + +### Phase 2: Role fitness benchmarks + +Use GenieHive's built-in benchmark runner for chat roles. Extend with custom +workloads for each role. Each workload should have 3–5 cases with known expected +outputs or pass criteria. + +**Workload design principles:** +- Cases should be representative of real workload (not toy examples) +- Pass criteria should be checkable without a judge model where possible + (exact match, JSON parse, regex, non-empty, length bounds) +- Include at least one adversarial case per role (ambiguous prompt, edge input) +- Record `chat_template_kwargs` for models that need them (e.g., Qwen3 thinking) + +**Suggested workloads to add to `benchmark_runner.py`:** + +``` +chat.structured_json — produce valid JSON matching a schema +chat.tool_call_format — emit a well-formed function call +chat.code_python — generate a short working Python function +chat.long_context_recall — answer from a 16k-token context document +chat.concise_classification — classify text into one of N labels +``` + +**Embeddings workloads** (separate evaluation script needed): +- Cosine similarity ranking on semantically close/distant pairs +- Retrieval recall@5 on a small fixed corpus + +### Phase 3: Comparative scoring + +For each role, rank candidates by: + +1. Pass rate (primary) +2. Tokens/sec (secondary, for latency-sensitive roles) +3. TTFT (secondary, for interactive roles) +4. VRAM cost (tie-breaker) + +Document the winner and runner-up. Load both into GenieHive's benchmark store +so the routing layer can score them in live operation. + +--- + +## Integrating Results into GenieHive + +After running benchmarks: + +1. Emit a JSON benchmark report (use `run_benchmark_workload.py`). +2. Ingest into the control plane: `python scripts/ingest_benchmark_report.py`. +3. Define a role in `roles.yaml` with `preferred_families` aligned to the + winning candidate's model family. +4. Verify routing: `GET /v1/cluster/routes/resolve?model=` should + return the winning service. +5. Run a live request through the role to confirm end-to-end. 
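+
+Steps 4 and 5 above can be scripted; a minimal sketch, assuming a control plane on
+`127.0.0.1:8000`, a client API key, and the role name defined in `roles.yaml`
+(all three are placeholders):
+
+```python
+import requests
+
+CONTROL = "http://127.0.0.1:8000"      # control-plane address (placeholder)
+HEADERS = {"X-Api-Key": "change-me"}   # client API key (placeholder)
+ROLE = "general_assistant"             # role defined for the winning candidate
+
+# Step 4: confirm the control plane resolves the role to the expected service.
+resolved = requests.get(
+    f"{CONTROL}/v1/cluster/routes/resolve",
+    params={"model": ROLE},
+    headers=HEADERS,
+    timeout=10,
+)
+resolved.raise_for_status()
+print("resolve:", resolved.json())
+
+# Step 5: run one live request through the role end-to-end.
+chat = requests.post(
+    f"{CONTROL}/v1/chat/completions",
+    headers=HEADERS,
+    json={"model": ROLE, "messages": [{"role": "user", "content": "Reply with OK."}]},
+    timeout=120,
+)
+chat.raise_for_status()
+print("chat:", chat.json()["choices"][0]["message"]["content"])
+```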
+
+---
+
+## Evaluation Checklist
+
+```
+[ ] Hardware inventory documented for each candidate host
+[ ] Candidate models selected per role tier
+[ ] Each candidate loaded and Phase 1 deployment check passed
+[ ] Custom workloads written for at least Tier 1 roles
+[ ] Phase 2 benchmarks run and results recorded
+[ ] Results ingested into GenieHive benchmark store
+[ ] Roles defined in roles.yaml matching Phase 3 winners
+[ ] End-to-end routing verified for each role
+[ ] Results documented in a summary (see template below)
+```
+
+---
+
+## Results Summary Template
+
+```markdown
+## Evaluation Results — <date>
+
+### Hardware
+- Host: <host_id>
+- GPU: <model>, <N> GB VRAM
+- RAM: <N> GB
+
+### Role: general_assistant
+| Model            | Pass rate | tok/s | TTFT ms | VRAM GB | Result    |
+|------------------|-----------|-------|---------|---------|-----------|
+| Qwen3-8B-Q4_K_M  | 0.92      | 38    | 420     | 6.1     | WINNER    |
+| Mistral-7B-v0.3  | 0.85      | 52    | 310     | 4.9     | runner-up |
+
+### Role: embedder
+| Model                  | Recall@5 | Latency ms | VRAM GB | Result    |
+|------------------------|----------|------------|---------|-----------|
+| nomic-embed-text-v1.5  | 0.88     | 12         | 0.3     | WINNER    |
+| bge-small-en-v1.5      | 0.79     | 8          | 0.1     | runner-up |
+
+... (repeat for each role)
+```
+
+---
+
+## Notes on Model Families and Known Behaviors
+
+**Qwen3 / Qwen3.5:**
+GenieHive auto-detects these and sets `enable_thinking: false` unless a role or
+asset explicitly overrides. For the `reasoning` role, set `enable_thinking: true`
+in the role's `body_defaults` to engage extended chain-of-thought.
+
+**Mistral / Mixtral:**
+Standard instruction format. No special handling needed.
+
+**DeepSeek models:**
+Some versions use a `<think>` block in their output. GenieHive strips
+`reasoning_content` from responses but not inline `<think>` blocks. If the
+model emits inline thinking that should be hidden from clients, add a
+response-cleaning step or configure the model server to suppress it.
+
+**Embedding models via Ollama:**
+Ollama's embedding endpoint is `/api/embeddings`, not `/v1/embeddings`. The
+current `UpstreamClient` uses the OpenAI-compatible path. When registering
+an Ollama embedding service, confirm the node config points to the correct
+endpoint or that the Ollama version supports `/v1/embeddings`.
+
+**llamafile:**
+Does not support the embeddings endpoint. Only suitable for chat roles.
diff --git a/docs/roadmap.md b/docs/roadmap.md
index e02339f..1f463a8 100644
--- a/docs/roadmap.md
+++ b/docs/roadmap.md
@@ -1,34 +1,175 @@
 # GenieHive Roadmap

-## Completed Foundations
+Last updated: 2026-04-27

-- control-plane registry with SQLite persistence
-- node registration and heartbeat
-- role catalog and route resolution
-- client-facing `GET /v1/models`
-- client-facing `POST /v1/chat/completions`
-- client-facing `POST /v1/embeddings`
-- first control-plus-node demo flow
+## What Is Complete

-## Immediate Next Milestones
+The v1 core is implemented and tested.

-1. Run and document the first live LLM demo against real upstream servers.
-2. Validate the `GET /v1/models` metadata as a Codex-friendly offload catalog for lower-complexity tasks.
-3. Add `POST /v1/audio/transcriptions`.
-4. Add a richer node metrics model for queue depth, current load, and observed performance over time.
-5. Add a stronger operator/client distinction in the public metadata and auth surfaces.
+**Registry and cluster control:** +- SQLite-backed registry with hosts, services, roles, and benchmark samples +- Node registration and heartbeat protocol with auto-re-registration on 404 +- Role catalog loading from YAML +- Route resolution: direct asset/service match → role resolution → clear failure -## LLM Demo Note +**Client-facing API:** +- `GET /v1/models` — OpenAI-compatible model list with rich metadata (loaded state, + latency hints, offload classification, role aliases) +- `POST /v1/chat/completions` — proxies to upstream with request policy application +- `POST /v1/embeddings` — proxies to upstream -The project is now ready for a first live LLM demo using GenieHive as: +**Request policy system:** +- Body defaults and overrides via deep merge +- System prompt injection (prepend / append / replace) +- Per-asset and per-role policies, merged with role winning on prompts +- Qwen3 / Qwen3.5 auto-detection with `enable_thinking: false` applied automatically -- master: control plane -- peer: one or more node agents with pre-existing local LLM servers -- client: a small demo agent or Codex configured against GenieHive +**Route matching and scoring:** +- `POST /v1/cluster/routes/match` — scored candidate list for role and service targets +- Signals: text overlap, preferred family, runtime (loaded state, latency, throughput, + queue depth), benchmark (workload overlap, quality score) +- `GET /v1/cluster/routes/resolve` — quick single-model resolution -The current live-demo priority is chat-first. Embeddings are also wired in GenieHive, but upstream compatibility differs across local servers, so the safest first demo matrix is: +**Benchmark infrastructure:** +- Built-in workloads: `chat.short_reasoning`, `chat.concise_support` +- `run_benchmark_workload.py` executes workloads and emits a JSON report +- `ingest_benchmark_report.py` posts results to the control plane +- Benchmark samples feed the route scoring pipeline -- Ollama for chat and embeddings -- vLLM for chat and embeddings -- llama.cpp for chat -- llamafile for chat +**Operator inspection:** +- `GET /v1/cluster/hosts`, `/services`, `/roles`, `/benchmarks`, `/health` + +**Auth:** +- Client API key (`X-Api-Key`) and node registration key (`X-GenieHive-Node-Key`) +- Empty key lists disable auth for development + +**Tests:** +- Registry, chat proxy, node inventory, benchmark runner, full demo flow +- All passing + +--- + +## Known Gaps and Issues + +These are confirmed gaps in the current implementation, not aspirational items. + +### 1. Transcription endpoint not implemented + +`POST /v1/audio/transcriptions` is listed in the architecture and wired into +`main.py`, but there is no upstream proxy handler for it. `upstream.py` has no +`transcriptions()` method. The endpoint currently returns nothing useful. + +### 2. Routing strategy field is ignored + +`RoutingConfig.default_strategy` exists in `config.py` (default: `"loaded_first"`), +but `resolve_route()` in `registry.py` does not read it. There is effectively only +one strategy. The field is misleading. + +### 3. Role fallback chain is not implemented + +`RoutingPolicy.fallback_roles` is defined in `models.py` and appears in the schema +docs, but `resolve_route()` never consults it. A role that fails to match any service +fails outright rather than trying its fallbacks. + +### 4. `_benchmark_quality_score` can exceed 1.0 before clamping + +`pass_rate` and `quality_score` are taken as `max()`, then `tokens_per_sec` and +`ttft_ms` are *added* on top. 
A service with `pass_rate=1.0`, fast tokens, and low +TTFT accumulates a score of up to 1.6 before the final `min(1.0, quality)` clamp. +This means the additive bonuses have no effect once pass_rate or quality_score is +already high, which is probably not the intended behavior. + +### 5. Health is self-reported only + +Service health (`healthy` / `unhealthy`) comes entirely from node-reported state. +The control plane does not probe upstream endpoints. A service can appear healthy +while its endpoint is unreachable. + +### 6. No active model discovery from upstream services + +The node agent scans for `.gguf` files on disk and reads static service config. +It does not query running Ollama or vLLM instances for their loaded model list. +A freshly-pulled Ollama model will not appear until the node config is updated +and the agent restarted. + +### 7. `docs/architecture.md` duplicates `GENIEWARREN_SPEC.md` + +`architecture.md` contains the repo-naming rationale, name alternatives, and +implementation sequence list that are only meaningful in a design/proposal context. +These are noise in a reference architecture document. + +--- + +## Immediate Next Work (Priority Order) + +### P0 — Fix confirmed bugs + +1. **Remove the misleading `default_strategy` field** or implement a dispatch table + so the config field actually selects behavior. Simplest fix: delete the field and + the dead config surface until a second strategy is implemented. + +2. **Fix `_benchmark_quality_score`** so additive bonuses apply only when no + `pass_rate` / `quality_score` is available, or restructure as a weighted average + so the components don't stack additively. + +### P1 — Complete stated v1 scope + +3. **Implement transcription proxy** — add `upstream.transcriptions()` and wire + the handler in `chat.py` and `main.py`. + +4. **Implement role fallback chain** — when `resolve_route()` finds no matching + service for a role, walk `fallback_roles` in order before failing. + +### P2 — Close the most important self-reported-only gaps + +5. **Add active health probing** — the control plane should periodically probe + registered service endpoints (a lightweight `GET /health` or `GET /v1/models` + is sufficient) and update health state independently of node heartbeats. + +6. **Add upstream model discovery for Ollama** — query `GET /api/tags` (Ollama) + or `GET /v1/models` (OpenAI-compatible) from the node agent and merge loaded + model names into the service's asset list. This enables dynamic model tracking + without config restarts. + +### P3 — Documentation cleanup + +7. **Revise `architecture.md`** — remove the design-phase repo-naming rationale + and first-implementation-sequence list; replace with a description of the actual + running system (the four layers as implemented, data flow diagram if possible). + +8. **Update `roadmap.md`** — this file (done). 
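+
+One possible shape for P0 fix 2 above, restructured as a weighted average so the
+speed terms cannot push the result past 1.0 (field names, weights, and
+normalisation thresholds are illustrative, not the current `registry.py`
+signature; the thresholds mirror the runtime-signal bands in architecture.md):
+
+```python
+def _benchmark_quality_score(sample):
+    """Sketch: combine quality and speed as a weighted average instead of stacking bonuses."""
+    quality = max(sample.pass_rate or 0.0, sample.quality_score or 0.0)
+    # Normalise the speed signals into [0, 1] before weighting.
+    speed = min((sample.tokens_per_sec or 0.0) / 40.0, 1.0)     # 40 tok/s treated as "fast"
+    ttft = 1.0 - min((sample.ttft_ms or 3000.0) / 3000.0, 1.0)  # 0 ms -> 1.0, >= 3 s -> 0.0
+    return 0.7 * quality + 0.2 * speed + 0.1 * ttft
+```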
+ +--- + +## Near-Term Milestones (After P0–P3) + +- **Live LLM demo** — run control + node against a real upstream (Ollama or + llama.cpp) and document the end-to-end flow, including chat via role and + direct asset addressing +- **Validate Codex-friendly `/v1/models` offload** — test `GET /v1/models` as + a programmatic service catalog for a Claude Code or Codex client selecting + a GenieHive-hosted model for lower-complexity subtasks +- **Richer node metrics** — queue depth, in-flight count, and rolling performance + averages reported from node to control on every heartbeat +- **Second routing strategy** — implement `round_robin` or `least_loaded` as a + second selectable strategy, then make `default_strategy` actually dispatch + +--- + +## V1.5 Scope (Not Yet Started) + +- mTLS between control plane and node agents +- Scoped client tokens (read-only vs. operator vs. admin) +- Active load-aware model swapping (trigger unload/load on a node based on demand) +- Image and TTS generation adapter stubs +- Streaming response passthrough for chat completions + +--- + +## Non-Goals (Unchanged from Original Spec) + +- Peer-to-peer consensus +- Autonomous global model swapping across many nodes +- Full WAN zero-trust platform +- Distributed vector database management +- Billing or multi-tenant quota accounting