Revise architecture/roadmap docs and add LLM evaluation guide
- architecture.md: rewrite to describe the actual running system; remove the
  design-phase repo-naming discussion and initial-implementation-sequence list;
  add a data-flow diagram, scoring weights table, and API status tables
- roadmap.md: replace the aspirational list with a concrete completed/gap/next
  structure; document four confirmed implementation gaps (transcription stub,
  strategy field ignored, fallback_roles unimplemented, benchmark quality score
  additive overflow); prioritise fixes as P0/P1/P2/P3
- docs/local_llm_evaluation.md: new document; role taxonomy (tiers 1–3),
  hardware inventory template, candidate model suggestions, three-phase
  evaluation protocol, GenieHive integration steps, results template, and notes
  on Qwen3/Mistral/DeepSeek/Ollama embedding-path quirks

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Parent: e36650a017
Commit: a76c7e81f4
docs/architecture.md

@@ -1,74 +1,46 @@
 # GenieHive Architecture
 
-Status: proposed v1 architecture
-Drafted: 2026-04-05
+Last updated: 2026-04-27
 
-## Repo Name
-
-Chosen name: `GenieHive`
-
-Why this name:
-
-- suggestive: "genie" implies generative AI services, "hive" implies a cooperating cluster
-- accessible: easy to say, remember, and explain
-- whimsical enough to feel like a project name rather than a dry infrastructure label
-
-Tradeoff:
-
-- `GenieHive` is less search-distinct than `Geniewarren` because `hive` is a common product metaphor
-
 ## Mission
 
-GenieHive is a local-first control plane for heterogeneous generative AI services running across one or more hosts.
-
-It should:
-
-- register hosts and their available services
-- expose a stable client-facing API
-- track health, capacity, and observed performance
-- support direct model addressing and higher-level role addressing
-- route requests to healthy loaded services first
-- optionally coordinate loading or swapping when policy allows
-- remain practical for a small self-hosted deployment with two hosts
+GenieHive is a local-first control plane for heterogeneous generative AI services
+running across one or more hosts. It provides:
+
+- Registration and health tracking for distributed AI services
+- A stable, OpenAI-compatible client-facing API
+- Role-based routing and scheduling over multiple services
+- Integrated benchmarking and performance-informed route scoring
+
+It is not a plain OpenAI-compatible gateway. The control plane layer adds topology
+awareness, role abstraction, and signal-driven routing that a dumb proxy does not
+provide.
 
-## Non-Goals For V1
-
-Out of scope initially:
-
-- peer-to-peer consensus
-- autonomous global model swapping across many nodes
-- full WAN zero-trust platform engineering
-- image and TTS generation orchestration
-- distributed vector database management
-- billing or multi-tenant quota accounting
+---
+
+## Four Layers
+
+```
+┌─────────────────────────────────────────────┐
+│ Client Facades                              │
+│ OpenAI-compatible completions + embeddings  │
+│ Operator inspection API                     │
+├─────────────────────────────────────────────┤
+│ Control API                                 │
+│ Registry · Role catalog · Route resolution  │
+│ Scheduling · Benchmark store                │
+├─────────────────────────────────────────────┤
+│ Node Agent(s)                               │
+│ Host discovery · Service enumeration        │
+│ Telemetry reporting · Heartbeat             │
+├─────────────────────────────────────────────┤
+│ Provider Adapters                           │
+│ OpenAI-compatible chat / embeddings         │
+│ Transcription (partial)                     │
+└─────────────────────────────────────────────┘
+```
 
-## Architectural Position
-
-GenieHive is not just an OpenAI-compatible gateway.
-
-It is a control plane with these layers:
-
-1. Control API
-   - authoritative registry
-   - routing and scheduling
-   - role catalog
-   - operator inspection
-
-2. Node Agent
-   - host discovery
-   - service discovery
-   - telemetry reporting
-   - optional local process management
-
-3. Provider Adapters
-   - OpenAI-compatible chat backends
-   - OpenAI-compatible embedding backends
-   - transcription backends
-   - future adapters for image and speech synthesis
-
-4. Client Facades
-   - OpenAI-compatible facade for completions and embeddings
-   - operator API for topology, health, and inventory
+---
 
 ## Core Concepts
@@ -78,117 +50,165 @@ A physical or virtual machine participating in the cluster.
 ### Service
 
-A concrete callable capability on a host. Examples:
-
-- chat completion endpoint
-- embedding endpoint
-- transcription endpoint
+A concrete callable capability on a host: a chat endpoint, an embeddings endpoint,
+or a transcription endpoint. A host typically exposes multiple services.
 
 ### Asset
 
-A model weight, model name, application, or runtime target that a service can serve.
+A model weight, model name, or runtime target that a service can serve. Assets carry
+optional `request_policy` fields that adjust how requests are shaped before forwarding.
 
 ### Role
 
-A reusable task profile that describes how requests should be fulfilled. A role is policy, not a concrete model.
+A reusable task profile that describes *how* requests should be fulfilled, not *which*
+model fills them. A role has a prompt policy (system prompt injection, body defaults)
+and a routing policy (preferred model families, minimum context size, loaded-first
+preference). The same role can route to different services as cluster state changes.
 
 ### Route Resolution
 
-Request handling order:
-
-1. If the requested `model` matches a currently loaded and healthy concrete asset or service alias, route directly.
-2. Otherwise, if the requested `model` matches a known role, resolve the role to the best eligible service.
-3. Otherwise, fail clearly.
+1. If `model` matches a loaded, healthy asset or service alias → route directly.
+2. If `model` matches a known role → score eligible services and route to the best.
+3. Otherwise → fail with a clear 404.
+
+---
 
-## V1 Capability Scope
+## Data Flow: Chat Completion
 
-V1 supports only:
+```
+Client POST /v1/chat/completions
+        │
+        ▼
+resolve_route(model, kind="chat")
+        ├─ direct: asset_id or service alias match
+        └─ role: filter by kind/health → score by runtime + benchmark signals
+        │
+        ▼
+apply_request_policy(request, asset, role)
+        ├─ deep-merge body_defaults
+        ├─ apply system prompt (prepend / append / replace)
+        └─ auto-infer Qwen3 template kwargs if needed
+        │
+        ▼
+UpstreamClient.chat_completions(endpoint, modified_request)
+        │
+        ▼
+_strip_reasoning_fields(response)  ← removes reasoning_content / reasoning
+        │
+        ▼
+Response to client
+```
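The resolution order in the data flow above can be sketched in a few lines. This is an illustrative reduction, not GenieHive's actual `resolve_route()`: the dict shapes (`aliases`, `kind`, `healthy`, `loaded`) and the injectable `score` function are assumptions for the sketch.

```python
def resolve_route(model, kind, services, roles, score=lambda s: 1.0):
    """Return ("direct" | "role", service) or raise LookupError (maps to HTTP 404)."""
    # 1. Direct match: a loaded, healthy asset or service alias wins outright.
    for svc in services:
        if model in svc["aliases"] and svc["healthy"] and svc["loaded"]:
            return ("direct", svc)
    # 2. Role match: filter by operation kind and health, pick the best-scoring service.
    if model in roles:
        eligible = [s for s in services if s["kind"] == kind and s["healthy"]]
        if eligible:
            return ("role", max(eligible, key=score))
    # 3. No match: fail clearly rather than guessing.
    raise LookupError(f"unknown model or role: {model}")
```

The key design point the sketch preserves: direct addressing never consults scoring, and an unknown name fails loudly instead of falling through to an arbitrary backend.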
-- chat completions
-- embeddings
-- transcription
+---
+
+## Scoring
+
+Route scoring combines three signal families:
+
+| Signal family | Weight (role) | Weight (service) |
+|---------------|---------------|------------------|
+| Text overlap  | 30%           | 20%              |
+| Runtime       | 30%           | 45%              |
+| Benchmark     | 25%           | 35%              |
+| Family pref.  | 15%           | —                |
+
+**Runtime signals** (from the last heartbeat):
+
+- Loaded state: +0.35
+- Latency bands: p50 <500 ms +0.30, <1500 ms +0.20, <3000 ms +0.10, else +0.05
+- Throughput: ≥40 tok/s +0.20, ≥20 +0.10
+- Queue depth: penalty −0.20 if ≥5, −0.10 if ≥2
+
+**Benchmark signals** (from ingested workload runs):
+
+- Workload overlap score (Jaccard-style token overlap)
+- Quality score from results, combined as `0.45 * overlap + 0.55 * quality`
+
+---
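The weight table and runtime bands described in the scoring section translate directly into arithmetic. A minimal sketch, using the documented weights and bands; the function names and signal-dict shape are assumptions, not the project's actual scoring code:

```python
# Weights per the scoring table: role targets vs. direct service targets.
ROLE_WEIGHTS = {"text_overlap": 0.30, "runtime": 0.30, "benchmark": 0.25, "family": 0.15}
SERVICE_WEIGHTS = {"text_overlap": 0.20, "runtime": 0.45, "benchmark": 0.35}

def combine(signals, weights):
    """Weighted sum of available signals (each in [0, 1]); missing signals count 0."""
    return sum(weights[name] * signals.get(name, 0.0) for name in weights)

def runtime_score(loaded, p50_ms, tok_s, queue_depth):
    """Runtime signal following the documented bands."""
    score = 0.35 if loaded else 0.0
    score += 0.30 if p50_ms < 500 else 0.20 if p50_ms < 1500 else 0.10 if p50_ms < 3000 else 0.05
    score += 0.20 if tok_s >= 40 else 0.10 if tok_s >= 20 else 0.0
    score -= 0.20 if queue_depth >= 5 else 0.10 if queue_depth >= 2 else 0.0
    return score
```

For example, a loaded service with a 420 ms p50, 38 tok/s, and an empty queue scores 0.35 + 0.30 + 0.10 = 0.75 on the runtime signal.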
 ## Topology
 
-Recommended initial topology:
-
-- 1 control plane
-- 2 node agents
-- 1 or more clients
-- LAN-first deployment
-- API key auth in v1
-- VPN or mTLS in v1.5
+**Minimum viable (single machine):**
+
+```
+control plane + node agent + model server
+all on 127.0.0.1, different ports
+```
+
+**Recommended (small cluster):**
+
+```
+1 control plane host
+2+ node-agent hosts, each with 1+ model servers
+1+ clients on LAN
+```
 
-## API Families
+**Auth:**
+
+- Client requests: `X-Api-Key` header
+- Node registration/heartbeat: `X-GenieHive-Node-Key` header
+- Empty key lists disable auth (development only)
+- mTLS between control and nodes planned for v1.5
+
+---
+
+## State Store
+
+SQLite. Schema:
+
+| Table               | Content                                   |
+|---------------------|-------------------------------------------|
+| `hosts`             | Host registration, resources, labels      |
+| `services`          | Service config, runtime, assets, observed |
+| `roles`             | Role catalog                              |
+| `benchmark_samples` | Workload results per service              |
+
+Default path: `state/geniehive.sqlite3`
+
+---
 ## API Reference Summary
 
 ### Client API
 
-- `GET /v1/models`
-- `POST /v1/chat/completions`
-- `POST /v1/embeddings`
-- `POST /v1/audio/transcriptions`
+| Endpoint                        | Status      |
+|---------------------------------|-------------|
+| `GET /v1/models`                | Implemented |
+| `POST /v1/chat/completions`     | Implemented |
+| `POST /v1/embeddings`           | Implemented |
+| `POST /v1/audio/transcriptions` | Stub only   |
 
-`GET /v1/models` should expose enough metadata for programmatic clients to make routing decisions about what GenieHive can handle cheaply, especially for lower-complexity offloaded work. That metadata should include direct assets, service-backed aliases, role aliases, operation kind, health, loaded status, and observed performance hints.
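The diff above notes that `GET /v1/models` carries metadata (operation kind, health, loaded status, performance hints) intended for programmatic routing decisions. A hypothetical client-side filter over such a listing might look like the following; the field names (`kind`, `healthy`, `loaded`, `p50_ms`) are assumptions for illustration, and the actual payload schema is defined by the endpoint itself:

```python
def cheap_chat_targets(models, max_p50_ms=1500):
    """Pick healthy, already-loaded chat entries under a latency budget."""
    return [
        m["id"] for m in models
        if m.get("kind") == "chat"
        and m.get("healthy") and m.get("loaded")
        and m.get("p50_ms", float("inf")) <= max_p50_ms
    ]
```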
 ### Operator API
 
-- `GET /v1/cluster/hosts`
-- `GET /v1/cluster/services`
-- `GET /v1/cluster/roles`
-- `GET /v1/cluster/health`
-- `GET /v1/cluster/routes/resolve?model=...`
+| Endpoint                         | Status      |
+|----------------------------------|-------------|
+| `GET /v1/cluster/hosts`          | Implemented |
+| `GET /v1/cluster/services`       | Implemented |
+| `GET /v1/cluster/roles`          | Implemented |
+| `GET /v1/cluster/benchmarks`     | Implemented |
+| `GET /v1/cluster/health`         | Implemented |
+| `GET /v1/cluster/routes/resolve` | Implemented |
+| `POST /v1/cluster/routes/match`  | Implemented |
 
 ### Node API
 
-- `POST /v1/nodes/register`
-- `POST /v1/nodes/heartbeat`
-- `GET /v1/node/inventory`
-- `POST /v1/node/services/refresh`
+| Endpoint                    | Status      |
+|-----------------------------|-------------|
+| `POST /v1/nodes/register`   | Implemented |
+| `POST /v1/nodes/heartbeat`  | Implemented |
+| `GET /v1/node/inventory`    | Implemented |
+| `GET /v1/node/registration` | Implemented |
+
+---
-## Data Store
+## Supported Upstream Backends
 
-V1 should use SQLite for durable state.
+Any OpenAI-compatible HTTP server. Tested configurations:
 
-## Routing Rules
+- **Ollama** — chat and embeddings
+- **llama.cpp** (server mode) — chat and embeddings
+- **llamafile** — chat
+- **vLLM** — chat and embeddings
 
-### Direct Model Resolution
+---
 
-If a request names a concrete asset alias or service alias:
+## Non-Goals for V1
 
-- prefer loaded and healthy services
-- choose the lowest-cost healthy target if multiple matches exist
-- fail clearly if all matches are unhealthy
+- Peer-to-peer consensus
+- Autonomous global model swapping
+- WAN zero-trust networking
+- Image and TTS generation
+- Distributed vector databases
+- Billing or multi-tenant quotas
 
-### Role Resolution
-
-If direct resolution fails, treat the requested name as a role.
-
-Role resolution should filter by:
-
-- operation kind
-- modality
-- health
-- auth and exposure compatibility
-- minimum context or memory requirements
-- preferred model families
-
-Then rank by:
-
-- already loaded
-- recent health
-- expected latency
-- queue pressure
-- operator priority
-
-## First Implementation Sequence
-
-1. Create the repo skeleton and docs.
-2. Implement SQLite-backed registry models.
-3. Implement node registration and heartbeat.
-4. Implement operator inspection endpoints.
-5. Implement client-facing chat routing.
-6. Add embeddings routing.
-7. Add transcription routing.
-8. Add truthful readiness and health reporting.
-9. Add role catalog and role-based resolution.
-10. Add optional managed local runtime support.
docs/local_llm_evaluation.md (new file, 251 lines)

# Local LLM Evaluation for GenieHive Agent Roles

Last updated: 2026-04-27

## Purpose

This document describes a framework for evaluating locally hosted LLMs against the
roles that GenieHive needs to fulfill in a multi-agent or tool-use pipeline. The goal
is to determine which models are fit for which roles given available hardware, and
to produce benchmark data that GenieHive's own routing layer can consume.

---

## Role Taxonomy

GenieHive routes by role. Before evaluating models, the roles likely needed in an
agent pipeline must be defined. The following taxonomy covers the most common cases.

### Tier 1: Core Inference Roles

| Role ID             | Description                                               | Key requirements                                   |
|---------------------|-----------------------------------------------------------|----------------------------------------------------|
| `general_assistant` | General-purpose instruction following, Q&A, summarization | Good instruction following, ≥8k context            |
| `reasoning`         | Multi-step problem solving, chain-of-thought tasks        | Extended thinking, ≥16k context, slow OK           |
| `code_assistant`    | Code generation, explanation, debugging                   | Strong code benchmarks, fill-in-middle optional    |
| `structured_output` | JSON/schema-constrained generation                        | Grammar sampling or reliable JSON mode             |
| `tool_use`          | Tool/function call formatting and parsing                 | Function-call format compliance, low hallucination |

### Tier 2: Supporting Roles

| Role ID       | Description                                     | Key requirements                             |
|---------------|-------------------------------------------------|----------------------------------------------|
| `embedder`    | Semantic embedding for RAG, search, clustering  | High MTEB scores, must be loaded (not lazy)  |
| `classifier`  | Short-text classification, intent detection     | Fast TTFT, low token budget, reliable format |
| `summarizer`  | Condensing long documents                       | Long context (≥32k), extractive reliability  |
| `critic`      | Reviewing, scoring, or evaluating model outputs | Self-consistency, instruction precision      |
| `transcriber` | Audio-to-text (Whisper-family)                  | WER on domain-specific content               |

### Tier 3: Specialized Roles (project-specific)

These are informed by the current project context (TalkOrigins bibliography pipeline,
Panda's Thumb archive, multi-site search).

| Role ID                 | Description                                          | Key requirements                               |
|-------------------------|------------------------------------------------------|------------------------------------------------|
| `bibliographic_analyst` | Extract, verify, and enrich bibliographic metadata   | Precise instruction following, structured JSON |
| `science_explainer`     | Explain scientific concepts for a general audience   | Factual accuracy, good prose, ≥8k context      |
| `search_query_writer`   | Generate search queries from topic descriptions      | Concise, varied output; fast                   |
| `html_cleaner`          | Identify and convert markup patterns (MT tags, etc.) | Reliable format compliance                     |

---

## Hardware Context

Evaluation should be scoped to what is actually available. Document hardware before
running benchmarks.

Recommended inventory fields per host:

```yaml
host_id: atlas-01
gpu:
  - name: NVIDIA Tesla P40
    vram_gb: 24
    cuda: "8.0"
cpu:
  threads: 24
  model: "Intel Xeon"
ram_gb: 128
fast_storage: true  # NVMe vs. spinning rust matters for model load time
```

Models that do not fit in VRAM will run on CPU or split across GPU+CPU. Note
GPU-only, GPU+CPU, and CPU-only fit status explicitly for each candidate.
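The fit-status note above can be reduced to a rough classification helper. A minimal sketch, assuming a flat 10% headroom over the quantized model size; real fit depends on context length and KV-cache settings, so treat the output as a starting label, not a guarantee:

```python
def fit_status(model_size_gb, vram_gb, ram_gb, headroom=1.10):
    """Classify where a model of the given on-disk size is likely to run."""
    need = model_size_gb * headroom  # headroom is an assumed fudge factor
    if need <= vram_gb:
        return "gpu_only"
    if vram_gb > 0 and need <= vram_gb + ram_gb:
        return "gpu_cpu_split"
    if need <= ram_gb:
        return "cpu_only"
    return "does_not_fit"
```

For example, a ~6 GB Q4 quant on a 24 GB P40 classifies as `gpu_only`, while a 60 GB model on the same host falls back to a GPU+CPU split.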
---

## Candidate Model Selection

For each role tier, select 2–4 candidate models. Selection criteria:

1. **Fits hardware** — VRAM budget for the target host
2. **GGUF available** — for llama.cpp / llamafile deployment
3. **License** — permissive enough for the intended use
4. **Recency** — prefer models released in the last 12 months unless a classic
   substantially outperforms

### Suggested Starting Candidates (as of 2026-04)

**General assistant / reasoning:**
- Qwen3-8B-Q4_K_M (fits P40 at 24 GB, extended thinking available)
- Qwen3-14B-Q4_K_M (fits P40 at ~10 GB VRAM + offload, better reasoning)
- Mistral-7B-Instruct-v0.3 (fast, reliable baseline)

**Code assistant:**
- Qwen2.5-Coder-7B-Instruct
- DeepSeek-Coder-V2-Lite-Instruct (16B MoE, may fit on CPU+GPU split)

**Structured output / tool use:**
- Qwen3-8B (native tool call support)
- functionary-small-v3.2 (purpose-built for tool use)
- Hermes-3-Llama-3.1-8B (strong JSON reliability)

**Embeddings:**
- nomic-embed-text-v1.5 (fast, high MTEB, 137M params)
- mxbai-embed-large-v1 (larger, higher MTEB)
- bge-small-en-v1.5 (smallest, acceptable quality for retrieval)

**Transcription:**
- faster-whisper large-v3 (best WER, GPU accelerated)
- faster-whisper medium.en (faster, smaller, English-only)

---

## Evaluation Protocol

### Phase 1: Deployment fit check

For each candidate:

1. Load the model via llama.cpp or Ollama.
2. Send a minimal completion request to confirm the endpoint is responding.
3. Record:
   - Actual VRAM used (from `nvidia-smi`)
   - Time to first token on a short prompt (~50 tokens)
   - Tokens/sec on a medium completion (~200 tokens)

Pass criterion: TTFT < 5 s, tokens/sec > 5.
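The Phase 1 numbers fall out of three timestamps. A sketch of the arithmetic; in practice the timestamps would come from a streaming completion request, but they are plain floats here so the calculation and the pass criterion are explicit:

```python
def phase1_metrics(t_request, t_first_token, t_done, completion_tokens):
    """Derive TTFT and tokens/sec from request, first-token, and done timestamps."""
    ttft_s = t_first_token - t_request
    gen_s = t_done - t_first_token
    tok_per_s = completion_tokens / gen_s if gen_s > 0 else 0.0
    return {
        "ttft_s": ttft_s,
        "tok_per_s": tok_per_s,
        # Pass criterion per the protocol: TTFT < 5 s, tokens/sec > 5.
        "passed": ttft_s < 5.0 and tok_per_s > 5.0,
    }
```

Note that tokens/sec is measured over the generation window only; including TTFT in the denominator would penalize models with slow prompt processing twice.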
### Phase 2: Role fitness benchmarks

Use GenieHive's built-in benchmark runner for chat roles. Extend it with custom
workloads for each role. Each workload should have 3–5 cases with known expected
outputs or pass criteria.

**Workload design principles:**
- Cases should be representative of the real workload (not toy examples)
- Pass criteria should be checkable without a judge model where possible
  (exact match, JSON parse, regex, non-empty, length bounds)
- Include at least one adversarial case per role (ambiguous prompt, edge input)
- Record `chat_template_kwargs` for models that need them (e.g., Qwen3 thinking)

**Suggested workloads to add to `benchmark_runner.py`:**

```
chat.structured_json        — produce valid JSON matching a schema
chat.tool_call_format       — emit a well-formed function call
chat.code_python            — generate a short working Python function
chat.long_context_recall    — answer from a 16k-token context document
chat.concise_classification — classify text into one of N labels
```

**Embeddings workloads** (separate evaluation script needed):
- Cosine similarity ranking on semantically close/distant pairs
- Retrieval recall@5 on a small fixed corpus
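The recall@5 workload above reduces to a small metric. A sketch assuming per-query document rankings (closest first, e.g. by cosine similarity of embeddings) and known relevant-document sets; the data shapes are illustrative, not part of any GenieHive script:

```python
def recall_at_k(ranked, relevant, k=5):
    """Fraction of relevant docs that appear in the top-k ranked results."""
    hits = sum(1 for doc in ranked[:k] if doc in relevant)
    return hits / len(relevant) if relevant else 0.0

def mean_recall_at_k(queries, k=5):
    """queries: list of (ranked_doc_ids, relevant_doc_ids) pairs."""
    return sum(recall_at_k(r, rel, k) for r, rel in queries) / len(queries)
```

Averaging over a fixed query set keeps candidate embedders comparable run to run, as long as the corpus and queries never change between candidates.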
### Phase 3: Comparative scoring

For each role, rank candidates by:

1. Pass rate (primary)
2. Tokens/sec (secondary, for latency-sensitive roles)
3. TTFT (secondary, for interactive roles)
4. VRAM cost (tie-breaker)

Document the winner and runner-up. Load both into GenieHive's benchmark store
so the routing layer can score them in live operation.
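The Phase 3 ordering maps cleanly onto a lexicographic sort key: higher pass rate and tokens/sec first, then lower TTFT and VRAM. A sketch; the result-dict keys are illustrative:

```python
def rank_candidates(results):
    """Sort candidate result dicts best-first per the Phase 3 criteria."""
    return sorted(
        results,
        # Negate the "higher is better" fields so one ascending sort works.
        key=lambda r: (-r["pass_rate"], -r["tok_s"], r["ttft_ms"], r["vram_gb"]),
    )
```

With the numbers from the results template below, a 0.92 pass rate beats a 0.85 even though the latter is faster, which is the intended primacy of pass rate.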
---

## Integrating Results into GenieHive

After running benchmarks:

1. Emit a JSON benchmark report (use `run_benchmark_workload.py`).
2. Ingest it into the control plane: `python scripts/ingest_benchmark_report.py`.
3. Define a role in `roles.yaml` with `preferred_families` aligned to the
   winning candidate's model family.
4. Verify routing: `GET /v1/cluster/routes/resolve?model=<role_id>` should
   return the winning service.
5. Run a live request through the role to confirm end-to-end behavior.

---

## Evaluation Checklist

```
[ ] Hardware inventory documented for each candidate host
[ ] Candidate models selected per role tier
[ ] Each candidate loaded and Phase 1 deployment check passed
[ ] Custom workloads written for at least Tier 1 roles
[ ] Phase 2 benchmarks run and results recorded
[ ] Results ingested into the GenieHive benchmark store
[ ] Roles defined in roles.yaml matching Phase 3 winners
[ ] End-to-end routing verified for each role
[ ] Results documented in a summary (see template below)
```

---

## Results Summary Template

```markdown
## Evaluation Results — <date>

### Hardware
- Host: <host_id>
- GPU: <name>, <vram_gb> GB VRAM
- RAM: <ram_gb> GB

### Role: general_assistant
| Model           | Pass rate | tok/s | TTFT ms | VRAM GB | Result    |
|-----------------|-----------|-------|---------|---------|-----------|
| Qwen3-8B-Q4_K_M | 0.92      | 38    | 420     | 6.1     | WINNER    |
| Mistral-7B-v0.3 | 0.85      | 52    | 310     | 4.9     | runner-up |

### Role: embedder
| Model                 | Recall@5 | Latency ms | VRAM GB | Result    |
|-----------------------|----------|------------|---------|-----------|
| nomic-embed-text-v1.5 | 0.88     | 12         | 0.3     | WINNER    |
| bge-small-en-v1.5     | 0.79     | 8          | 0.1     | runner-up |

... (repeat for each role)
```

---

## Notes on Model Families and Known Behaviors

**Qwen3 / Qwen3.5:**
GenieHive auto-detects these and sets `enable_thinking: false` unless a role or
asset explicitly overrides it. For the `reasoning` role, set `enable_thinking: true`
in the role's `body_defaults` to engage extended chain-of-thought.

**Mistral / Mixtral:**
Standard instruction format. No special handling needed.

**DeepSeek models:**
Some versions emit a `<think>` block in their output. GenieHive strips
`reasoning_content` from responses but not inline `<think>` blocks. If the
model emits inline thinking that should be hidden from clients, add a
response-cleaning step or configure the model server to suppress it.

**Embedding models via Ollama:**
Ollama's embedding endpoint is `/api/embeddings`, not `/v1/embeddings`. The
current `UpstreamClient` uses the OpenAI-compatible path. When registering
an Ollama embedding service, confirm that the node config points to the correct
endpoint or that the Ollama version supports `/v1/embeddings`.

**llamafile:**
Does not support the embeddings endpoint. Only suitable for chat roles.
docs/roadmap.md (189 changed lines)

@@ -1,34 +1,175 @@
# GenieHive Roadmap
|
# GenieHive Roadmap
|
||||||
|
|
||||||
## Completed Foundations
|
Last updated: 2026-04-27
|
||||||
|
|
||||||
- control-plane registry with SQLite persistence
|
## What Is Complete
|
||||||
- node registration and heartbeat
|
|
||||||
- role catalog and route resolution
|
|
||||||
- client-facing `GET /v1/models`
|
|
||||||
- client-facing `POST /v1/chat/completions`
|
|
||||||
- client-facing `POST /v1/embeddings`
|
|
||||||
- first control-plus-node demo flow
|
|
||||||
|
|
||||||
## Immediate Next Milestones
|
The v1 core is implemented and tested.
|
||||||
|
|
||||||
1. Run and document the first live LLM demo against real upstream servers.
|
**Registry and cluster control:**
|
||||||
2. Validate the `GET /v1/models` metadata as a Codex-friendly offload catalog for lower-complexity tasks.
|
- SQLite-backed registry with hosts, services, roles, and benchmark samples
|
||||||
3. Add `POST /v1/audio/transcriptions`.
|
- Node registration and heartbeat protocol with auto-re-registration on 404
|
||||||
4. Add a richer node metrics model for queue depth, current load, and observed performance over time.
|
- Role catalog loading from YAML
|
||||||
5. Add a stronger operator/client distinction in the public metadata and auth surfaces.
|
- Route resolution: direct asset/service match → role resolution → clear failure
|
||||||
|
|
||||||
## LLM Demo Note
|
**Client-facing API:**
|
||||||
|
- `GET /v1/models` — OpenAI-compatible model list with rich metadata (loaded state,
|
||||||
|
latency hints, offload classification, role aliases)
|
||||||
|
- `POST /v1/chat/completions` — proxies to upstream with request policy application
|
||||||
|
- `POST /v1/embeddings` — proxies to upstream
|
||||||
|
|
||||||
The project is now ready for a first live LLM demo using GenieHive as:
|
**Request policy system:**
|
||||||
|
- Body defaults and overrides via deep merge
|
||||||
|
- System prompt injection (prepend / append / replace)
|
||||||
|
- Per-asset and per-role policies, merged with role winning on prompts
|
||||||
|
- Qwen3 / Qwen3.5 auto-detection with `enable_thinking: false` applied automatically
|
||||||
|
|
||||||
- master: control plane
|
**Route matching and scoring:**
|
||||||
- peer: one or more node agents with pre-existing local LLM servers
|
- `POST /v1/cluster/routes/match` — scored candidate list for role and service targets
|
||||||
- client: a small demo agent or Codex configured against GenieHive
|
- Signals: text overlap, preferred family, runtime (loaded state, latency, throughput,
|
||||||
|
queue depth), benchmark (workload overlap, quality score)
|
||||||
|
- `GET /v1/cluster/routes/resolve` — quick single-model resolution
|
||||||
|
|
||||||
The current live-demo priority is chat-first. Embeddings are also wired in GenieHive, but upstream compatibility differs across local servers, so the safest first demo matrix is:
|
**Benchmark infrastructure:**
|
||||||
|
- Built-in workloads: `chat.short_reasoning`, `chat.concise_support`
|
||||||
|
- `run_benchmark_workload.py` executes workloads and emits a JSON report
|
||||||
|
- `ingest_benchmark_report.py` posts results to the control plane
|
||||||
|
- Benchmark samples feed the route scoring pipeline
|
||||||
|
|
||||||
- Ollama for chat and embeddings
|
**Operator inspection:**
|
||||||
- vLLM for chat and embeddings
|
- `GET /v1/cluster/hosts`, `/services`, `/roles`, `/benchmarks`, `/health`
|
||||||
- llama.cpp for chat
|
|
||||||
- llamafile for chat
|
**Auth:**
|
||||||
|
- Client API key (`X-Api-Key`) and node registration key (`X-GenieHive-Node-Key`)
|
||||||
|
- Empty key lists disable auth for development
|
||||||
|
|
||||||
|
**Tests:**
|
||||||
|
- Registry, chat proxy, node inventory, benchmark runner, full demo flow
|
||||||
|
- All passing
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Known Gaps and Issues
|
||||||
|
|
||||||
|
These are confirmed gaps in the current implementation, not aspirational items.
|
||||||
|
|
||||||
|
### 1. Transcription endpoint not implemented
|
||||||
|
|
||||||
|
`POST /v1/audio/transcriptions` is listed in the architecture and wired into
|
||||||
|
`main.py`, but there is no upstream proxy handler for it. `upstream.py` has no
|
||||||
|
`transcriptions()` method. The endpoint currently returns nothing useful.
|
||||||
|
|
||||||
|
### 2. Routing strategy field is ignored
|
||||||
|
|
||||||
|
`RoutingConfig.default_strategy` exists in `config.py` (default: `"loaded_first"`),
|
||||||
|
but `resolve_route()` in `registry.py` does not read it. There is effectively only
|
||||||
|
one strategy. The field is misleading.
|
||||||
|
|
||||||
|
### 3. Role fallback chain is not implemented
|
||||||
|
|
||||||
|
`RoutingPolicy.fallback_roles` is defined in `models.py` and appears in the schema
|
||||||
|
docs, but `resolve_route()` never consults it. A role that fails to match any service
|
||||||
|
fails outright rather than trying its fallbacks.

### 4. `_benchmark_quality_score` can exceed 1.0 before clamping

`pass_rate` and `quality_score` are taken as `max()`, then `tokens_per_sec` and
`ttft_ms` are *added* on top. A service with `pass_rate=1.0`, fast tokens, and low
TTFT accumulates a score of up to 1.6 before the final `min(1.0, quality)` clamp.
This means the additive bonuses have no effect once `pass_rate` or `quality_score` is
already high, which is probably not the intended behavior.
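A minimal reproduction of the overflow, with assumed bonus weights chosen to match the stated 1.6 ceiling (the real constants live in the scoring code, and the function name here is a stand-in for the private `_benchmark_quality_score`):

```python
# Illustrative sketch only: weights are assumptions matching the 1.6 ceiling.
def quality_score_current(pass_rate: float, quality_score: float,
                          tokens_bonus: float, ttft_bonus: float) -> float:
    quality = max(pass_rate, quality_score)  # base signal in [0, 1]
    quality += tokens_bonus + ttft_bonus     # bonuses stack additively on top
    return min(1.0, quality)                 # clamp silently swallows the excess
```

With `pass_rate=1.0` and both bonuses at an assumed 0.3 each, the pre-clamp value is 1.6 but the returned score is still 1.0, identical to a service with no bonuses at all.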

### 5. Health is self-reported only

Service health (`healthy` / `unhealthy`) comes entirely from node-reported state.
The control plane does not probe upstream endpoints. A service can appear healthy
while its endpoint is unreachable.

### 6. No active model discovery from upstream services

The node agent scans for `.gguf` files on disk and reads static service config.
It does not query running Ollama or vLLM instances for their loaded model list.
A freshly pulled Ollama model will not appear until the node config is updated
and the agent restarted.

### 7. `docs/architecture.md` duplicates `GENIEWARREN_SPEC.md`

`architecture.md` contains the repo-naming rationale, name alternatives, and
implementation sequence list, all of which are only meaningful in a design/proposal
context. These are noise in a reference architecture document.

---

## Immediate Next Work (Priority Order)

### P0 — Fix confirmed bugs

1. **Remove the misleading `default_strategy` field** or implement a dispatch table
   so the config field actually selects behavior. Simplest fix: delete the field and
   the dead config surface until a second strategy is implemented.
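If the field is kept, the dispatch-table option could look like the sketch below. The strategy functions, the `Service` shape, and `select()` are illustrative assumptions, not the real `registry.py` API:

```python
from typing import Callable

Service = dict  # stand-in for the real service record type

def loaded_first(candidates: list[Service]) -> Service:
    """Current behavior: prefer services that already have the model loaded."""
    loaded = [s for s in candidates if s.get("loaded")]
    return (loaded or candidates)[0]

def least_loaded(candidates: list[Service]) -> Service:
    """A possible second strategy: fewest in-flight requests wins."""
    return min(candidates, key=lambda s: s.get("in_flight", 0))

STRATEGIES: dict[str, Callable[[list[Service]], Service]] = {
    "loaded_first": loaded_first,
    "least_loaded": least_loaded,
}

def select(candidates: list[Service], strategy: str = "loaded_first") -> Service:
    """Resolve the configured strategy name to actual routing behavior."""
    try:
        pick = STRATEGIES[strategy]
    except KeyError:
        raise ValueError(f"unknown routing strategy: {strategy}")
    return pick(candidates)
```

An unknown strategy name then fails loudly at resolve time instead of being silently ignored.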

2. **Fix `_benchmark_quality_score`** so additive bonuses apply only when no
   `pass_rate` / `quality_score` is available, or restructure as a weighted average
   so the components don't stack additively.
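A weighted-average restructure could look like this sketch. The weights and the 0-to-1 normalizations of `tokens_per_sec` and `ttft_ms` are illustrative assumptions, not the existing code's constants:

```python
def benchmark_quality_score(pass_rate=None, quality_score=None,
                            tokens_per_sec=None, ttft_ms=None) -> float:
    """Weighted average of whichever signals exist; result is always in [0, 1]."""
    parts: list[tuple[float, float]] = []  # (weight, value in [0, 1])
    if pass_rate is not None or quality_score is not None:
        base = max(pass_rate or 0.0, quality_score or 0.0)
        parts.append((0.7, base))
    if tokens_per_sec is not None:
        parts.append((0.2, min(tokens_per_sec / 100.0, 1.0)))  # assumed 100 t/s cap
    if ttft_ms is not None:
        parts.append((0.1, max(0.0, 1.0 - ttft_ms / 2000.0)))  # assumed 2 s floor
    if not parts:
        return 0.0
    total_weight = sum(w for w, _ in parts)
    return sum(w * v for w, v in parts) / total_weight
```

Because the weights are renormalized over the present signals, no clamp is needed and each component always influences the score.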

### P1 — Complete stated v1 scope

3. **Implement transcription proxy** — add `upstream.transcriptions()` and wire
   the handler in `chat.py` and `main.py`.
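A stdlib-only sketch of the missing proxy, assuming the upstream speaks the OpenAI-compatible `POST /v1/audio/transcriptions` multipart API; the real `upstream.py` would presumably reuse whatever HTTP client its other proxy methods use:

```python
import json
import urllib.request
import uuid

def build_multipart(model: str, filename: str, audio: bytes) -> tuple[bytes, str]:
    """Encode the model field and the audio file as multipart/form-data."""
    boundary = uuid.uuid4().hex
    parts = [
        (f"--{boundary}\r\nContent-Disposition: form-data; "
         f"name=\"model\"\r\n\r\n{model}\r\n").encode(),
        (f"--{boundary}\r\nContent-Disposition: form-data; name=\"file\"; "
         f"filename=\"{filename}\"\r\n"
         f"Content-Type: application/octet-stream\r\n\r\n").encode()
        + audio + b"\r\n",
        f"--{boundary}--\r\n".encode(),
    ]
    return b"".join(parts), f"multipart/form-data; boundary={boundary}"

def transcriptions(base_url: str, model: str, filename: str, audio: bytes) -> dict:
    """Forward an audio file to an upstream transcription endpoint."""
    body, content_type = build_multipart(model, filename, audio)
    req = urllib.request.Request(
        base_url + "/v1/audio/transcriptions",
        data=body,
        headers={"Content-Type": content_type},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=300) as resp:
        return json.load(resp)
```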

4. **Implement role fallback chain** — when `resolve_route()` finds no matching
   service for a role, walk `fallback_roles` in order before failing.
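The fallback walk itself is small. In this sketch, `find_service` stands in for the real per-role lookup inside `resolve_route()`, and the fallback table mirrors the `fallback_roles` field:

```python
from typing import Callable, Optional

def resolve_with_fallback(role: str,
                          fallback_roles: dict[str, list[str]],
                          find_service: Callable[[str], Optional[str]]) -> str:
    """Try the requested role, then each configured fallback, in order."""
    for candidate in [role, *fallback_roles.get(role, [])]:
        service = find_service(candidate)
        if service is not None:
            return service
    raise LookupError(f"no service for role {role!r} or its fallbacks")
```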

### P2 — Close the most important self-reported-only gaps

5. **Add active health probing** — the control plane should periodically probe
   registered service endpoints (a lightweight `GET /health` or `GET /v1/models`
   is sufficient) and update health state independently of node heartbeats.
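A stdlib-only sketch of the probe loop; the real control plane would likely run this on a timer with its own async HTTP client, and the service-table shape here is an assumption:

```python
import urllib.error
import urllib.request

def probe_service(base_url: str, timeout: float = 5.0) -> bool:
    """A service counts as healthy if /health or /v1/models answers 2xx."""
    for path in ("/health", "/v1/models"):
        try:
            with urllib.request.urlopen(base_url + path, timeout=timeout) as resp:
                if 200 <= resp.status < 300:
                    return True
        except (urllib.error.URLError, OSError):
            continue  # try the next probe path
    return False

def probe_all(services: dict[str, str]) -> dict[str, bool]:
    """Probe every registered endpoint; health no longer depends on heartbeats."""
    return {name: probe_service(url) for name, url in services.items()}
```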

6. **Add upstream model discovery for Ollama** — query `GET /api/tags` (Ollama)
   or `GET /v1/models` (OpenAI-compatible) from the node agent and merge loaded
   model names into the service's asset list. This enables dynamic model tracking
   without config restarts.
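Ollama's `GET /api/tags` returns a JSON object with a `models` array of `{"name": ...}` entries. A sketch of the discovery and merge, with the asset-list merge semantics being an assumption:

```python
import json
import urllib.request

def discover_ollama_models(base_url: str, timeout: float = 5.0) -> list[str]:
    """Return model names currently known to an Ollama instance."""
    with urllib.request.urlopen(base_url + "/api/tags", timeout=timeout) as resp:
        payload = json.load(resp)
    return [m["name"] for m in payload.get("models", [])]

def merge_assets(static_assets: list[str], discovered: list[str]) -> list[str]:
    """Union static config assets with live-discovered models, order preserved."""
    seen: set[str] = set()
    merged: list[str] = []
    for name in [*static_assets, *discovered]:
        if name not in seen:
            seen.add(name)
            merged.append(name)
    return merged
```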

### P3 — Documentation cleanup

7. **Revise `architecture.md`** — remove the design-phase repo-naming rationale
   and first-implementation-sequence list; replace with a description of the actual
   running system (the four layers as implemented, data flow diagram if possible).

8. **Update `roadmap.md`** — this file (done).

---

## Near-Term Milestones (After P0–P3)

- **Live LLM demo** — run control + node against a real upstream (Ollama or
  llama.cpp) and document the end-to-end flow, including chat via role and
  direct asset addressing
- **Validate Codex-friendly `/v1/models` offload** — test `GET /v1/models` as
  a programmatic service catalog for a Claude Code or Codex client selecting
  a GenieHive-hosted model for lower-complexity subtasks
- **Richer node metrics** — queue depth, in-flight count, and rolling performance
  averages reported from node to control on every heartbeat
- **Second routing strategy** — implement `round_robin` or `least_loaded` as a
  second selectable strategy, then make `default_strategy` actually dispatch

---

## V1.5 Scope (Not Yet Started)

- mTLS between control plane and node agents
- Scoped client tokens (read-only vs. operator vs. admin)
- Active load-aware model swapping (trigger unload/load on a node based on demand)
- Image and TTS generation adapter stubs
- Streaming response passthrough for chat completions

---

## Non-Goals (Unchanged from Original Spec)

- Peer-to-peer consensus
- Autonomous global model swapping across many nodes
- Full WAN zero-trust platform
- Distributed vector database management
- Billing or multi-tenant quota accounting