From a76c7e81f4cf5665ee1092fdb9f48a4a1b67fadf Mon Sep 17 00:00:00 2001 From: welberr Date: Mon, 27 Apr 2026 09:25:51 -0400 Subject: [PATCH] Revise architecture/roadmap docs and add LLM evaluation guide MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit - architecture.md: rewrite to describe the actual running system; remove design-phase repo-naming discussion and initial-implementation-sequence list; add data-flow diagram, scoring weights table, API status table - roadmap.md: replace aspirational list with concrete completed/gap/next structure; document four confirmed implementation gaps (transcription stub, strategy field ignored, fallback_roles unimplemented, benchmark quality score additive overflow); prioritise fixes as P0/P1/P2/P3 - docs/local_llm_evaluation.md: new document; role taxonomy (tier 1–3), hardware inventory template, candidate model suggestions, three-phase evaluation protocol, GenieHive integration steps, results template, notes on Qwen3/Mistral/DeepSeek/Ollama embedding path quirks Co-Authored-By: Claude Sonnet 4.6 --- docs/architecture.md | 306 +++++++++++++++++++---------------- docs/local_llm_evaluation.md | 251 ++++++++++++++++++++++++++++ docs/roadmap.md | 189 +++++++++++++++++++--- 3 files changed, 579 insertions(+), 167 deletions(-) create mode 100644 docs/local_llm_evaluation.md diff --git a/docs/architecture.md b/docs/architecture.md index 6f7f24f..1f1b5ce 100644 --- a/docs/architecture.md +++ b/docs/architecture.md @@ -1,74 +1,46 @@ # GenieHive Architecture -Status: proposed v1 architecture -Drafted: 2026-04-05 - -## Repo Name - -Chosen name: `GenieHive` - -Why this name: - -- suggestive: "genie" implies generative AI services, "hive" implies a cooperating cluster -- accessible: easy to say, remember, and explain -- whimsical enough to feel like a project name rather than a dry infrastructure label - -Tradeoff: - -- `GenieHive` is less search-distinct than `Geniewarren` because `hive` is a common product metaphor +Last updated: 2026-04-27 ## Mission -GenieHive is a local-first control plane for heterogeneous generative AI services running across one or more hosts. +GenieHive is a local-first control plane for heterogeneous generative AI services +running across one or more hosts. It provides: -It should: +- Registration and health tracking for distributed AI services +- A stable, OpenAI-compatible client-facing API +- Role-based routing and scheduling over multiple services +- Integrated benchmarking and performance-informed route scoring -- register hosts and their available services -- expose a stable client-facing API -- track health, capacity, and observed performance -- support direct model addressing and higher-level role addressing -- route requests to healthy loaded services first -- optionally coordinate loading or swapping when policy allows -- remain practical for a small self-hosted deployment with two hosts +It is not a plain OpenAI-compatible gateway. The control plane layer adds topology +awareness, role abstraction, and signal-driven routing that a dumb proxy does not +provide. 
-## Non-Goals For V1 +--- -Out of scope initially: +## Four Layers -- peer-to-peer consensus -- autonomous global model swapping across many nodes -- full WAN zero-trust platform engineering -- image and TTS generation orchestration -- distributed vector database management -- billing or multi-tenant quota accounting +``` +┌─────────────────────────────────────────────┐ +│ Client Facades │ +│ OpenAI-compatible completions + embeddings │ +│ Operator inspection API │ +├─────────────────────────────────────────────┤ +│ Control API │ +│ Registry · Role catalog · Route resolution │ +│ Scheduling · Benchmark store │ +├─────────────────────────────────────────────┤ +│ Node Agent(s) │ +│ Host discovery · Service enumeration │ +│ Telemetry reporting · Heartbeat │ +├─────────────────────────────────────────────┤ +│ Provider Adapters │ +│ OpenAI-compatible chat / embeddings │ +│ Transcription (partial) │ +└─────────────────────────────────────────────┘ +``` -## Architectural Position - -GenieHive is not just an OpenAI-compatible gateway. - -It is a control plane with these layers: - -1. Control API - - authoritative registry - - routing and scheduling - - role catalog - - operator inspection - -2. Node Agent - - host discovery - - service discovery - - telemetry reporting - - optional local process management - -3. Provider Adapters - - OpenAI-compatible chat backends - - OpenAI-compatible embedding backends - - transcription backends - - future adapters for image and speech synthesis - -4. Client Facades - - OpenAI-compatible facade for completions and embeddings - - operator API for topology, health, and inventory +--- ## Core Concepts @@ -78,117 +50,165 @@ A physical or virtual machine participating in the cluster. ### Service -A concrete callable capability on a host. Examples: - -- chat completion endpoint -- embedding endpoint -- transcription endpoint +A concrete callable capability on a host: a chat endpoint, an embeddings endpoint, +or a transcription endpoint. A host typically exposes multiple services. ### Asset -A model weight, model name, application, or runtime target that a service can serve. +A model weight, model name, or runtime target that a service can serve. Assets carry +optional `request_policy` fields that adjust how requests are shaped before forwarding. ### Role -A reusable task profile that describes how requests should be fulfilled. A role is policy, not a concrete model. +A reusable task profile that describes *how* requests should be fulfilled, not *which* +model fills them. A role has a prompt policy (system prompt injection, body defaults) +and a routing policy (preferred model families, minimum context size, loaded-first +preference). The same role can route to different services as cluster state changes. ### Route Resolution -Request handling order: +1. If `model` matches a loaded, healthy asset or service alias → route directly. +2. If `model` matches a known role → score eligible services and route to the best. +3. Otherwise → fail with a clear 404. -1. If the requested `model` matches a currently loaded and healthy concrete asset or service alias, route directly. -2. Otherwise, if the requested `model` matches a known role, resolve the role to the best eligible service. -3. Otherwise, fail clearly. 
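+
+A minimal sketch of this resolution order, assuming a hypothetical registry interface
+(the function, field, and exception names here are illustrative, not the actual
+`registry.py` API):
+
+```python
+class RouteNotFound(Exception):
+    """No asset, alias, or role matched; surfaced to the client as HTTP 404."""
+
+
+def resolve_route(model, kind, registry, score):
+    # 1. Direct match: a loaded, healthy concrete asset or service alias wins outright.
+    service = registry.find_direct(model, kind=kind)
+    if service is not None and service.healthy and service.loaded:
+        return service
+    # 2. Role match: filter eligible, healthy services and pick the highest-scoring one.
+    role = registry.find_role(model)
+    if role is not None:
+        candidates = [s for s in registry.services(kind=kind) if s.healthy]
+        if candidates:
+            return max(candidates, key=lambda s: score(s, role))
+    # 3. Nothing matched: fail clearly rather than guessing.
+    raise RouteNotFound(f"no asset, alias, or role named {model!r}")
+```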
+--- -## V1 Capability Scope +## Data Flow: Chat Completion -V1 supports only: +``` +Client POST /v1/chat/completions + │ + ▼ +resolve_route(model, kind="chat") + ├─ direct: asset_id or service alias match + └─ role: filter by kind/health → score by runtime + benchmark signals + │ + ▼ +apply_request_policy(request, asset, role) + ├─ deep-merge body_defaults + ├─ apply system prompt (prepend / append / replace) + └─ auto-infer Qwen3 template kwargs if needed + │ + ▼ +UpstreamClient.chat_completions(endpoint, modified_request) + │ + ▼ +_strip_reasoning_fields(response) ← removes reasoning_content / reasoning + │ + ▼ +Response to client +``` -- chat completions -- embeddings -- transcription +--- + +## Scoring + +Route scoring combines three signal families: + +| Signal family | Weight (role) | Weight (service) | +|----------------|---------------|-----------------| +| Text overlap | 30% | 20% | +| Runtime | 30% | 45% | +| Benchmark | 25% | 35% | +| Family pref. | 15% | — | + +**Runtime signals** (from last heartbeat): +- Loaded state: +0.35 +- Latency bands: p50 <500 ms +0.30, <1500 ms +0.20, <3000 ms +0.10, else +0.05 +- Throughput: ≥40 tok/s +0.20, ≥20 +0.10 +- Queue depth: penalty −0.20 if ≥5, −0.10 if ≥2 + +**Benchmark signals** (from ingested workload runs): +- Workload overlap score (Jaccard-style token overlap) +- Quality score from results: `0.45 * overlap + 0.55 * quality` + +--- ## Topology -Recommended initial topology: +**Minimum viable (single machine):** +``` +control plane + node agent + model server +all on 127.0.0.1, different ports +``` -- 1 control plane -- 2 node agents -- 1 or more clients -- LAN-first deployment -- API key auth in v1 -- VPN or mTLS in v1.5 +**Recommended (small cluster):** +``` +1 control plane host +2+ node-agent hosts, each with 1+ model servers +1+ clients on LAN +``` -## API Families +**Auth:** +- Client requests: `X-Api-Key` header +- Node registration/heartbeat: `X-GenieHive-Node-Key` header +- Empty key lists disable auth (development only) +- mTLS between control and nodes planned for v1.5 + +--- + +## State Store + +SQLite. Schema: + +| Table | Content | +|---------------------|-------------------------------------------| +| `hosts` | Host registration, resources, labels | +| `services` | Service config, runtime, assets, observed | +| `roles` | Role catalog | +| `benchmark_samples` | Workload results per service | + +Default path: `state/geniehive.sqlite3` + +--- + +## API Reference Summary ### Client API - -- `GET /v1/models` -- `POST /v1/chat/completions` -- `POST /v1/embeddings` -- `POST /v1/audio/transcriptions` - -`GET /v1/models` should expose enough metadata for programmatic clients to make routing decisions about what GenieHive can handle cheaply, especially for lower-complexity offloaded work. That metadata should include direct assets, service-backed aliases, role aliases, operation kind, health, loaded status, and observed performance hints. 
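+
+Because the facade is OpenAI-compatible, any HTTP client can drive it. A minimal
+example of addressing a role alias through the completions endpoint (host, port,
+key value, and role name are deployment-specific placeholders):
+
+```python
+import requests
+
+resp = requests.post(
+    "http://127.0.0.1:8000/v1/chat/completions",   # control-plane address (placeholder)
+    headers={"X-Api-Key": "change-me"},            # client API key (placeholder)
+    json={
+        "model": "general_assistant",              # a role alias; a concrete asset id also works
+        "messages": [{"role": "user", "content": "Say hello."}],
+    },
+    timeout=60,
+)
+resp.raise_for_status()
+print(resp.json()["choices"][0]["message"]["content"])
+```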
+| Endpoint | Status | +|---------------------------------|---------------| +| `GET /v1/models` | Implemented | +| `POST /v1/chat/completions` | Implemented | +| `POST /v1/embeddings` | Implemented | +| `POST /v1/audio/transcriptions` | Stub only | ### Operator API - -- `GET /v1/cluster/hosts` -- `GET /v1/cluster/services` -- `GET /v1/cluster/roles` -- `GET /v1/cluster/health` -- `GET /v1/cluster/routes/resolve?model=...` +| Endpoint | Status | +|------------------------------------|-------------| +| `GET /v1/cluster/hosts` | Implemented | +| `GET /v1/cluster/services` | Implemented | +| `GET /v1/cluster/roles` | Implemented | +| `GET /v1/cluster/benchmarks` | Implemented | +| `GET /v1/cluster/health` | Implemented | +| `GET /v1/cluster/routes/resolve` | Implemented | +| `POST /v1/cluster/routes/match` | Implemented | ### Node API +| Endpoint | Status | +|------------------------------|-------------| +| `POST /v1/nodes/register` | Implemented | +| `POST /v1/nodes/heartbeat` | Implemented | +| `GET /v1/node/inventory` | Implemented | +| `GET /v1/node/registration` | Implemented | -- `POST /v1/nodes/register` -- `POST /v1/nodes/heartbeat` -- `GET /v1/node/inventory` -- `POST /v1/node/services/refresh` +--- -## Data Store +## Supported Upstream Backends -V1 should use SQLite for durable state. +Any OpenAI-compatible HTTP server. Tested configurations: -## Routing Rules +- **Ollama** — chat and embeddings +- **llama.cpp** (server mode) — chat and embeddings +- **llamafile** — chat +- **vLLM** — chat and embeddings -### Direct Model Resolution +--- -If a request names a concrete asset alias or service alias: +## Non-Goals for V1 -- prefer loaded and healthy services -- choose the lowest-cost healthy target if multiple matches exist -- fail clearly if all matches are unhealthy - -### Role Resolution - -If direct resolution fails, treat the requested name as a role. - -Role resolution should filter by: - -- operation kind -- modality -- health -- auth and exposure compatibility -- minimum context or memory requirements -- preferred model families - -Then rank by: - -- already loaded -- recent health -- expected latency -- queue pressure -- operator priority - -## First Implementation Sequence - -1. Create the repo skeleton and docs. -2. Implement SQLite-backed registry models. -3. Implement node registration and heartbeat. -4. Implement operator inspection endpoints. -5. Implement client-facing chat routing. -6. Add embeddings routing. -7. Add transcription routing. -8. Add truthful readiness and health reporting. -9. Add role catalog and role-based resolution. -10. Add optional managed local runtime support. +- Peer-to-peer consensus +- Autonomous global model swapping +- WAN zero-trust networking +- Image and TTS generation +- Distributed vector databases +- Billing or multi-tenant quotas diff --git a/docs/local_llm_evaluation.md b/docs/local_llm_evaluation.md new file mode 100644 index 0000000..3a82c23 --- /dev/null +++ b/docs/local_llm_evaluation.md @@ -0,0 +1,251 @@ +# Local LLM Evaluation for GenieHive Agent Roles + +Last updated: 2026-04-27 + +## Purpose + +This document describes a framework for evaluating locally-hosted LLMs against the +roles that GenieHive needs to fulfill in a multi-agent or tool-use pipeline. The goal +is to determine which models are fit for which roles given available hardware, and +to produce benchmark data that GenieHive's own routing layer can consume. + +--- + +## Role Taxonomy + +GenieHive routes by role. 
Before evaluating models, the roles likely needed in an +agent pipeline must be defined. The following taxonomy covers the most common cases. + +### Tier 1: Core Inference Roles + +| Role ID | Description | Key requirements | +|----------------------|----------------------------------------------------------|---------------------------------------------------| +| `general_assistant` | General-purpose instruction following, Q&A, summarization| Good instruction following, ≥8k context | +| `reasoning` | Multi-step problem solving, chain-of-thought tasks | Extended thinking, ≥16k context, slow OK | +| `code_assistant` | Code generation, explanation, debugging | Strong code benchmarks, fill-in-middle optional | +| `structured_output` | JSON/schema-constrained generation | Grammar sampling or reliable JSON mode | +| `tool_use` | Tool/function call formatting and parsing | Function call format compliance, low hallucination| + +### Tier 2: Supporting Roles + +| Role ID | Description | Key requirements | +|----------------------|----------------------------------------------------------|---------------------------------------------------| +| `embedder` | Semantic embedding for RAG, search, clustering | High MTEB scores, must be loaded (not lazy) | +| `classifier` | Short-text classification, intent detection | Fast TTFT, low token budget, reliable format | +| `summarizer` | Condensing long documents | Long context (≥32k), extractive reliability | +| `critic` | Reviewing, scoring, or evaluating model outputs | Self-consistency, instruction precision | +| `transcriber` | Audio-to-text (Whisper-family) | WER on domain-specific content | + +### Tier 3: Specialized Roles (project-specific) + +These are informed by the current project context (TalkOrigins bibliography pipeline, +Panda's Thumb archive, multi-site search). + +| Role ID | Description | Key requirements | +|------------------------|------------------------------------------------------------|------------------------------------------------| +| `bibliographic_analyst`| Extract, verify, and enrich bibliographic metadata | Precise instruction following, structured JSON | +| `science_explainer` | Explain scientific concepts for a general audience | Factual accuracy, good prose, ≥8k context | +| `search_query_writer` | Generate search queries from topic descriptions | Concise, varied output; fast | +| `html_cleaner` | Identify and convert markup patterns (MT tags, etc.) | Reliable format compliance | + +--- + +## Hardware Context + +Evaluation should be scoped to what is actually available. Document hardware before +running benchmarks. + +Recommended inventory fields per host: + +```yaml +host_id: atlas-01 +gpu: + - name: NVIDIA Tesla P40 + vram_gb: 24 + cuda: "8.0" +cpu: + threads: 24 + model: "Intel Xeon" +ram_gb: 128 +fast_storage: true # NVMe vs. spinning rust matters for model load time +``` + +Models that do not fit in VRAM will run on CPU or split across GPU+CPU. Note +GPU-only, GPU+CPU, and CPU-only fit status explicitly for each candidate. + +--- + +## Candidate Model Selection + +For each role tier, select 2–4 candidate models. Selection criteria: + +1. **Fits hardware** — VRAM budget for the target host +2. **GGUF available** — for llama.cpp / llamafile deployment +3. **License** — permissive enough for intended use +4. 
**Recency** — prefer models released in the last 12 months unless a classic + substantially outperforms + +### Suggested Starting Candidates (as of 2026-04) + +**General assistant / reasoning:** +- Qwen3-8B-Q4_K_M (fits P40 at 24 GB, extended thinking available) +- Qwen3-14B-Q4_K_M (fits P40 at ~10 GB VRAM + offload, better reasoning) +- Mistral-7B-Instruct-v0.3 (fast, reliable baseline) + +**Code assistant:** +- Qwen2.5-Coder-7B-Instruct +- DeepSeek-Coder-V2-Lite-Instruct (16B MoE, may fit on CPU+GPU split) + +**Structured output / tool use:** +- Qwen3-8B (native tool call support) +- functionary-small-v3.2 (purpose-built for tool use) +- Hermes-3-Llama-3.1-8B (strong JSON reliability) + +**Embeddings:** +- nomic-embed-text-v1.5 (fast, high MTEB, 137M params) +- mxbai-embed-large-v1 (larger, higher MTEB) +- bge-small-en-v1.5 (smallest, acceptable quality for retrieval) + +**Transcription:** +- faster-whisper large-v3 (best WER, GPU accelerated) +- faster-whisper medium.en (faster, smaller, English-only) + +--- + +## Evaluation Protocol + +### Phase 1: Deployment fit check + +For each candidate: + +1. Load the model via llama.cpp or Ollama. +2. Send a minimal completion request to confirm the endpoint is responding. +3. Record: + - Actual VRAM used (from `nvidia-smi`) + - Time to first token on a short prompt (~50 tokens) + - Tokens/sec on a medium completion (~200 tokens) + +Pass criterion: TTFT < 5 s, tokens/sec > 5. + +### Phase 2: Role fitness benchmarks + +Use GenieHive's built-in benchmark runner for chat roles. Extend with custom +workloads for each role. Each workload should have 3–5 cases with known expected +outputs or pass criteria. + +**Workload design principles:** +- Cases should be representative of real workload (not toy examples) +- Pass criteria should be checkable without a judge model where possible + (exact match, JSON parse, regex, non-empty, length bounds) +- Include at least one adversarial case per role (ambiguous prompt, edge input) +- Record `chat_template_kwargs` for models that need them (e.g., Qwen3 thinking) + +**Suggested workloads to add to `benchmark_runner.py`:** + +``` +chat.structured_json — produce valid JSON matching a schema +chat.tool_call_format — emit a well-formed function call +chat.code_python — generate a short working Python function +chat.long_context_recall — answer from a 16k-token context document +chat.concise_classification — classify text into one of N labels +``` + +**Embeddings workloads** (separate evaluation script needed): +- Cosine similarity ranking on semantically close/distant pairs +- Retrieval recall@5 on a small fixed corpus + +### Phase 3: Comparative scoring + +For each role, rank candidates by: + +1. Pass rate (primary) +2. Tokens/sec (secondary, for latency-sensitive roles) +3. TTFT (secondary, for interactive roles) +4. VRAM cost (tie-breaker) + +Document the winner and runner-up. Load both into GenieHive's benchmark store +so the routing layer can score them in live operation. + +--- + +## Integrating Results into GenieHive + +After running benchmarks: + +1. Emit a JSON benchmark report (use `run_benchmark_workload.py`). +2. Ingest into the control plane: `python scripts/ingest_benchmark_report.py`. +3. Define a role in `roles.yaml` with `preferred_families` aligned to the + winning candidate's model family. +4. Verify routing: `GET /v1/cluster/routes/resolve?model=` should + return the winning service. +5. Run a live request through the role to confirm end-to-end. 
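+
+Steps 4 and 5 above can be scripted; a minimal sketch, assuming a control plane on
+`127.0.0.1:8000`, a client API key, and the role name defined in `roles.yaml`
+(all three are placeholders):
+
+```python
+import requests
+
+CONTROL = "http://127.0.0.1:8000"      # control-plane address (placeholder)
+HEADERS = {"X-Api-Key": "change-me"}   # client API key (placeholder)
+ROLE = "general_assistant"             # role defined for the winning candidate
+
+# Step 4: confirm the control plane resolves the role to the expected service.
+resolved = requests.get(
+    f"{CONTROL}/v1/cluster/routes/resolve",
+    params={"model": ROLE},
+    headers=HEADERS,
+    timeout=10,
+)
+resolved.raise_for_status()
+print("resolve:", resolved.json())
+
+# Step 5: run one live request through the role end-to-end.
+chat = requests.post(
+    f"{CONTROL}/v1/chat/completions",
+    headers=HEADERS,
+    json={"model": ROLE, "messages": [{"role": "user", "content": "Reply with OK."}]},
+    timeout=120,
+)
+chat.raise_for_status()
+print("chat:", chat.json()["choices"][0]["message"]["content"])
+```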
+
+---
+
+## Evaluation Checklist
+
+```
+[ ] Hardware inventory documented for each candidate host
+[ ] Candidate models selected per role tier
+[ ] Each candidate loaded and Phase 1 deployment check passed
+[ ] Custom workloads written for at least Tier 1 roles
+[ ] Phase 2 benchmarks run and results recorded
+[ ] Results ingested into GenieHive benchmark store
+[ ] Roles defined in roles.yaml matching Phase 3 winners
+[ ] End-to-end routing verified for each role
+[ ] Results documented in a summary (see template below)
+```
+
+---
+
+## Results Summary Template
+
+```markdown
+## Evaluation Results — <date>
+
+### Hardware
+- Host: <host_id>
+- GPU: <model>, <N> GB VRAM
+- RAM: <N> GB
+
+### Role: general_assistant
+| Model            | Pass rate | tok/s | TTFT ms | VRAM GB | Result    |
+|------------------|-----------|-------|---------|---------|-----------|
+| Qwen3-8B-Q4_K_M  | 0.92      | 38    | 420     | 6.1     | WINNER    |
+| Mistral-7B-v0.3  | 0.85      | 52    | 310     | 4.9     | runner-up |
+
+### Role: embedder
+| Model                  | Recall@5 | Latency ms | VRAM GB | Result    |
+|------------------------|----------|------------|---------|-----------|
+| nomic-embed-text-v1.5  | 0.88     | 12         | 0.3     | WINNER    |
+| bge-small-en-v1.5      | 0.79     | 8          | 0.1     | runner-up |
+
+... (repeat for each role)
+```
+
+---
+
+## Notes on Model Families and Known Behaviors
+
+**Qwen3 / Qwen3.5:**
+GenieHive auto-detects these and sets `enable_thinking: false` unless a role or
+asset explicitly overrides. For the `reasoning` role, set `enable_thinking: true`
+in the role's `body_defaults` to engage extended chain-of-thought.
+
+**Mistral / Mixtral:**
+Standard instruction format. No special handling needed.
+
+**DeepSeek models:**
+Some versions use a `<think>` block in their output. GenieHive strips
+`reasoning_content` from responses but not inline `<think>` blocks. If the
+model emits inline thinking that should be hidden from clients, add a
+response-cleaning step or configure the model server to suppress it.
+
+**Embedding models via Ollama:**
+Ollama's embedding endpoint is `/api/embeddings`, not `/v1/embeddings`. The
+current `UpstreamClient` uses the OpenAI-compatible path. When registering
+an Ollama embedding service, confirm the node config points to the correct
+endpoint or that the Ollama version supports `/v1/embeddings`.
+
+**llamafile:**
+Does not support the embeddings endpoint. Only suitable for chat roles.
diff --git a/docs/roadmap.md b/docs/roadmap.md
index e02339f..1f463a8 100644
--- a/docs/roadmap.md
+++ b/docs/roadmap.md
@@ -1,34 +1,175 @@
 # GenieHive Roadmap

-## Completed Foundations
+Last updated: 2026-04-27

-- control-plane registry with SQLite persistence
-- node registration and heartbeat
-- role catalog and route resolution
-- client-facing `GET /v1/models`
-- client-facing `POST /v1/chat/completions`
-- client-facing `POST /v1/embeddings`
-- first control-plus-node demo flow
+## What Is Complete

-## Immediate Next Milestones
+The v1 core is implemented and tested.

-1. Run and document the first live LLM demo against real upstream servers.
-2. Validate the `GET /v1/models` metadata as a Codex-friendly offload catalog for lower-complexity tasks.
-3. Add `POST /v1/audio/transcriptions`.
-4. Add a richer node metrics model for queue depth, current load, and observed performance over time.
-5. Add a stronger operator/client distinction in the public metadata and auth surfaces.
+**Registry and cluster control:** +- SQLite-backed registry with hosts, services, roles, and benchmark samples +- Node registration and heartbeat protocol with auto-re-registration on 404 +- Role catalog loading from YAML +- Route resolution: direct asset/service match → role resolution → clear failure -## LLM Demo Note +**Client-facing API:** +- `GET /v1/models` — OpenAI-compatible model list with rich metadata (loaded state, + latency hints, offload classification, role aliases) +- `POST /v1/chat/completions` — proxies to upstream with request policy application +- `POST /v1/embeddings` — proxies to upstream -The project is now ready for a first live LLM demo using GenieHive as: +**Request policy system:** +- Body defaults and overrides via deep merge +- System prompt injection (prepend / append / replace) +- Per-asset and per-role policies, merged with role winning on prompts +- Qwen3 / Qwen3.5 auto-detection with `enable_thinking: false` applied automatically -- master: control plane -- peer: one or more node agents with pre-existing local LLM servers -- client: a small demo agent or Codex configured against GenieHive +**Route matching and scoring:** +- `POST /v1/cluster/routes/match` — scored candidate list for role and service targets +- Signals: text overlap, preferred family, runtime (loaded state, latency, throughput, + queue depth), benchmark (workload overlap, quality score) +- `GET /v1/cluster/routes/resolve` — quick single-model resolution -The current live-demo priority is chat-first. Embeddings are also wired in GenieHive, but upstream compatibility differs across local servers, so the safest first demo matrix is: +**Benchmark infrastructure:** +- Built-in workloads: `chat.short_reasoning`, `chat.concise_support` +- `run_benchmark_workload.py` executes workloads and emits a JSON report +- `ingest_benchmark_report.py` posts results to the control plane +- Benchmark samples feed the route scoring pipeline -- Ollama for chat and embeddings -- vLLM for chat and embeddings -- llama.cpp for chat -- llamafile for chat +**Operator inspection:** +- `GET /v1/cluster/hosts`, `/services`, `/roles`, `/benchmarks`, `/health` + +**Auth:** +- Client API key (`X-Api-Key`) and node registration key (`X-GenieHive-Node-Key`) +- Empty key lists disable auth for development + +**Tests:** +- Registry, chat proxy, node inventory, benchmark runner, full demo flow +- All passing + +--- + +## Known Gaps and Issues + +These are confirmed gaps in the current implementation, not aspirational items. + +### 1. Transcription endpoint not implemented + +`POST /v1/audio/transcriptions` is listed in the architecture and wired into +`main.py`, but there is no upstream proxy handler for it. `upstream.py` has no +`transcriptions()` method. The endpoint currently returns nothing useful. + +### 2. Routing strategy field is ignored + +`RoutingConfig.default_strategy` exists in `config.py` (default: `"loaded_first"`), +but `resolve_route()` in `registry.py` does not read it. There is effectively only +one strategy. The field is misleading. + +### 3. Role fallback chain is not implemented + +`RoutingPolicy.fallback_roles` is defined in `models.py` and appears in the schema +docs, but `resolve_route()` never consults it. A role that fails to match any service +fails outright rather than trying its fallbacks. + +### 4. `_benchmark_quality_score` can exceed 1.0 before clamping + +`pass_rate` and `quality_score` are taken as `max()`, then `tokens_per_sec` and +`ttft_ms` are *added* on top. 
A service with `pass_rate=1.0`, fast tokens, and low +TTFT accumulates a score of up to 1.6 before the final `min(1.0, quality)` clamp. +This means the additive bonuses have no effect once pass_rate or quality_score is +already high, which is probably not the intended behavior. + +### 5. Health is self-reported only + +Service health (`healthy` / `unhealthy`) comes entirely from node-reported state. +The control plane does not probe upstream endpoints. A service can appear healthy +while its endpoint is unreachable. + +### 6. No active model discovery from upstream services + +The node agent scans for `.gguf` files on disk and reads static service config. +It does not query running Ollama or vLLM instances for their loaded model list. +A freshly-pulled Ollama model will not appear until the node config is updated +and the agent restarted. + +### 7. `docs/architecture.md` duplicates `GENIEWARREN_SPEC.md` + +`architecture.md` contains the repo-naming rationale, name alternatives, and +implementation sequence list that are only meaningful in a design/proposal context. +These are noise in a reference architecture document. + +--- + +## Immediate Next Work (Priority Order) + +### P0 — Fix confirmed bugs + +1. **Remove the misleading `default_strategy` field** or implement a dispatch table + so the config field actually selects behavior. Simplest fix: delete the field and + the dead config surface until a second strategy is implemented. + +2. **Fix `_benchmark_quality_score`** so additive bonuses apply only when no + `pass_rate` / `quality_score` is available, or restructure as a weighted average + so the components don't stack additively. + +### P1 — Complete stated v1 scope + +3. **Implement transcription proxy** — add `upstream.transcriptions()` and wire + the handler in `chat.py` and `main.py`. + +4. **Implement role fallback chain** — when `resolve_route()` finds no matching + service for a role, walk `fallback_roles` in order before failing. + +### P2 — Close the most important self-reported-only gaps + +5. **Add active health probing** — the control plane should periodically probe + registered service endpoints (a lightweight `GET /health` or `GET /v1/models` + is sufficient) and update health state independently of node heartbeats. + +6. **Add upstream model discovery for Ollama** — query `GET /api/tags` (Ollama) + or `GET /v1/models` (OpenAI-compatible) from the node agent and merge loaded + model names into the service's asset list. This enables dynamic model tracking + without config restarts. + +### P3 — Documentation cleanup + +7. **Revise `architecture.md`** — remove the design-phase repo-naming rationale + and first-implementation-sequence list; replace with a description of the actual + running system (the four layers as implemented, data flow diagram if possible). + +8. **Update `roadmap.md`** — this file (done). 
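+
+One possible shape for P0 fix 2 above, restructured as a weighted average so the
+speed terms cannot push the result past 1.0 (field names, weights, and
+normalisation thresholds are illustrative, not the current `registry.py`
+signature; the thresholds mirror the runtime-signal bands in architecture.md):
+
+```python
+def _benchmark_quality_score(sample):
+    """Sketch: combine quality and speed as a weighted average instead of stacking bonuses."""
+    quality = max(sample.pass_rate or 0.0, sample.quality_score or 0.0)
+    # Normalise the speed signals into [0, 1] before weighting.
+    speed = min((sample.tokens_per_sec or 0.0) / 40.0, 1.0)     # 40 tok/s treated as "fast"
+    ttft = 1.0 - min((sample.ttft_ms or 3000.0) / 3000.0, 1.0)  # 0 ms -> 1.0, >= 3 s -> 0.0
+    return 0.7 * quality + 0.2 * speed + 0.1 * ttft
+```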
+ +--- + +## Near-Term Milestones (After P0–P3) + +- **Live LLM demo** — run control + node against a real upstream (Ollama or + llama.cpp) and document the end-to-end flow, including chat via role and + direct asset addressing +- **Validate Codex-friendly `/v1/models` offload** — test `GET /v1/models` as + a programmatic service catalog for a Claude Code or Codex client selecting + a GenieHive-hosted model for lower-complexity subtasks +- **Richer node metrics** — queue depth, in-flight count, and rolling performance + averages reported from node to control on every heartbeat +- **Second routing strategy** — implement `round_robin` or `least_loaded` as a + second selectable strategy, then make `default_strategy` actually dispatch + +--- + +## V1.5 Scope (Not Yet Started) + +- mTLS between control plane and node agents +- Scoped client tokens (read-only vs. operator vs. admin) +- Active load-aware model swapping (trigger unload/load on a node based on demand) +- Image and TTS generation adapter stubs +- Streaming response passthrough for chat completions + +--- + +## Non-Goals (Unchanged from Original Spec) + +- Peer-to-peer consensus +- Autonomous global model swapping across many nodes +- Full WAN zero-trust platform +- Distributed vector database management +- Billing or multi-tenant quota accounting