# GenieHive Roadmap

Last updated: 2026-04-27

## What Is Complete

The v1 core is implemented and tested.

**Registry and cluster control:**

- SQLite-backed registry with hosts, services, roles, and benchmark samples
- Node registration and heartbeat protocol with auto-re-registration on 404
- Role catalog loading from YAML
- Route resolution: direct asset/service match → role resolution → clear failure

**Client-facing API:**

- `GET /v1/models` — OpenAI-compatible model list with rich metadata (loaded state, latency hints, offload classification, role aliases)
- `POST /v1/chat/completions` — proxies to upstream with request policy application
- `POST /v1/embeddings` — proxies to upstream

**Request policy system:**

- Body defaults and overrides via deep merge
- System prompt injection (prepend / append / replace)
- Per-asset and per-role policies, merged with role winning on prompts
- Qwen3 / Qwen3.5 auto-detection with `enable_thinking: false` applied automatically

**Route matching and scoring:**

- `POST /v1/cluster/routes/match` — scored candidate list for role and service targets
- Signals: text overlap, preferred family, runtime (loaded state, latency, throughput, queue depth), benchmark (workload overlap, quality score)
- `GET /v1/cluster/routes/resolve` — quick single-model resolution

**Benchmark infrastructure:**

- Built-in workloads: `chat.short_reasoning`, `chat.concise_support`
- `run_benchmark_workload.py` executes workloads and emits a JSON report
- `ingest_benchmark_report.py` posts results to the control plane
- Benchmark samples feed the route scoring pipeline

**Operator inspection:**

- `GET /v1/cluster/hosts`, `/services`, `/roles`, `/benchmarks`, `/health`

**Auth:**

- Client API key (`X-Api-Key`) and node registration key (`X-GenieHive-Node-Key`)
- Empty key lists disable auth for development

**Tests:**

- Registry, chat proxy, node inventory, benchmark runner, full demo flow
- All passing

---

## Known Gaps and Issues

These are confirmed gaps in the
current implementation, not aspirational items.

### 1. Transcription endpoint not implemented

`POST /v1/audio/transcriptions` is listed in the architecture and wired into `main.py`, but there is no upstream proxy handler for it. `upstream.py` has no `transcriptions()` method. The endpoint currently returns nothing useful.

### 2. Routing strategy field is ignored

`RoutingConfig.default_strategy` exists in `config.py` (default: `"loaded_first"`), but `resolve_route()` in `registry.py` does not read it. There is effectively only one strategy, and the field is misleading.

### 3. Role fallback chain is not implemented

`RoutingPolicy.fallback_roles` is defined in `models.py` and appears in the schema docs, but `resolve_route()` never consults it. A role that fails to match any service fails outright rather than trying its fallbacks.

### 4. `_benchmark_quality_score` can exceed 1.0 before clamping

`pass_rate` and `quality_score` are taken as `max()`, then `tokens_per_sec` and `ttft_ms` bonuses are *added* on top. A service with `pass_rate=1.0`, fast tokens, and low TTFT accumulates a score of up to 1.6 before the final `min(1.0, quality)` clamp. This means the additive bonuses have no effect once `pass_rate` or `quality_score` is already high, which is probably not the intended behavior.

### 5. Health is self-reported only

Service health (`healthy` / `unhealthy`) comes entirely from node-reported state. The control plane does not probe upstream endpoints, so a service can appear healthy while its endpoint is unreachable.

### 6. No active model discovery from upstream services

The node agent scans for `.gguf` files on disk and reads static service config. It does not query running Ollama or vLLM instances for their loaded model list. A freshly pulled Ollama model will not appear until the node config is updated and the agent restarted.

### 7. `docs/architecture.md` duplicates `GENIEWARREN_SPEC.md`

`architecture.md` contains the repo-naming rationale, name alternatives, and implementation sequence list that are only meaningful in a design/proposal context. These are noise in a reference architecture document.

---

## Immediate Next Work (Priority Order)

### P0 — Fix confirmed bugs

1. **Remove the misleading `default_strategy` field** or implement a dispatch table so the config field actually selects behavior. Simplest fix: delete the field and the dead config surface until a second strategy is implemented.
2. **Fix `_benchmark_quality_score`** so additive bonuses apply only when no `pass_rate` / `quality_score` is available, or restructure the score as a weighted average so the components don't stack additively.

### P1 — Complete stated v1 scope

3. **Implement transcription proxy** — add `upstream.transcriptions()` and wire the handler in `chat.py` and `main.py`.
4. **Implement role fallback chain** — when `resolve_route()` finds no matching service for a role, walk `fallback_roles` in order before failing.

### P2 — Close the most important self-reported-only gaps

5. **Add active health probing** — the control plane should periodically probe registered service endpoints (a lightweight `GET /health` or `GET /v1/models` is sufficient) and update health state independently of node heartbeats.
6. **Add upstream model discovery for Ollama** — query `GET /api/tags` (Ollama) or `GET /v1/models` (OpenAI-compatible) from the node agent and merge loaded model names into the service's asset list. This enables dynamic model tracking without config restarts.

### P3 — Documentation cleanup

7. **Revise `architecture.md`** — remove the design-phase repo-naming rationale and first-implementation-sequence list; replace them with a description of the actual running system (the four layers as implemented, plus a data flow diagram if possible).
8. **Update `roadmap.md`** — this file (done).
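The P0 fix for `_benchmark_quality_score` could be restructured as a weighted average along these lines. This is a minimal sketch, not the project's implementation: the sample field names (`pass_rate`, `quality_score`, `tokens_per_sec`, `ttft_ms`) follow the gap description above, but the weights and normalization constants are illustrative assumptions.

```python
def benchmark_quality_score(sample: dict) -> float:
    """Weighted-average rework of _benchmark_quality_score (sketch).

    Components are each normalized into 0..1 and then averaged with
    fixed weights, so they cannot stack past 1.0 and no final clamp
    is needed. Weights and normalization caps are assumed values.
    """
    # Quality component: best of pass_rate / quality_score (both 0..1).
    quality = max(sample.get("pass_rate") or 0.0,
                  sample.get("quality_score") or 0.0)

    # Throughput, normalized: 100 tok/s (assumed cap) scores 1.0.
    tps = min((sample.get("tokens_per_sec") or 0.0) / 100.0, 1.0)

    # Time-to-first-token, normalized: 0 ms scores 1.0, 2 s or worse scores 0.
    ttft = sample.get("ttft_ms")
    ttft_norm = max(0.0, 1.0 - ttft / 2000.0) if ttft is not None else 0.0

    # Speed still influences the score even when pass_rate is already 1.0,
    # which the additive-then-clamp version could not guarantee.
    return 0.7 * quality + 0.2 * tps + 0.1 * ttft_norm
```

With this shape, a perfect-pass-rate service that is slow still scores below a perfect-pass-rate service that is fast, which addresses the "bonuses have no effect" symptom described in gap 4.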
---

## Near-Term Milestones (After P0–P3)

- **Live LLM demo** — run control + node against a real upstream (Ollama or llama.cpp) and document the end-to-end flow, including chat via role and direct asset addressing
- **Validate Codex-friendly `/v1/models` offload** — test `GET /v1/models` as a programmatic service catalog for a Claude Code or Codex client selecting a GenieHive-hosted model for lower-complexity subtasks
- **Richer node metrics** — queue depth, in-flight count, and rolling performance averages reported from node to control on every heartbeat
- **Second routing strategy** — implement `round_robin` or `least_loaded` as a second selectable strategy, then make `default_strategy` actually dispatch

---

## V1.5 Scope (Not Yet Started)

- mTLS between control plane and node agents
- Scoped client tokens (read-only vs. operator vs. admin)
- Active load-aware model swapping (trigger unload/load on a node based on demand)
- Image and TTS generation adapter stubs
- Streaming response passthrough for chat completions

---

## Non-Goals (Unchanged from Original Spec)

- Peer-to-peer consensus
- Autonomous global model swapping across many nodes
- Full WAN zero-trust platform
- Distributed vector database management
- Billing or multi-tenant quota accounting
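The `default_strategy` dispatch flagged in P0 and in the second-routing-strategy milestone could take the shape of a table keyed by strategy name. The sketch below is hypothetical: the strategy names (`loaded_first`, `round_robin`, `least_loaded`) come from the roadmap text, but the candidate dict shape and selection logic are assumptions, not the project's actual code.

```python
import itertools
from typing import Callable

# Shared counter for round-robin rotation (illustrative; a real
# implementation would likely keep per-role state in the registry).
_rr_counter = itertools.count()

def loaded_first(candidates: list[dict]) -> dict:
    # Prefer services that already have the model in memory.
    return sorted(candidates, key=lambda s: not s["loaded"])[0]

def round_robin(candidates: list[dict]) -> dict:
    return candidates[next(_rr_counter) % len(candidates)]

def least_loaded(candidates: list[dict]) -> dict:
    return min(candidates, key=lambda s: s["queue_depth"])

STRATEGIES: dict[str, Callable[[list[dict]], dict]] = {
    "loaded_first": loaded_first,
    "round_robin": round_robin,
    "least_loaded": least_loaded,
}

def pick(candidates: list[dict], strategy: str = "loaded_first") -> dict:
    """Dispatch on the configured strategy name; unknown names fail loudly."""
    try:
        return STRATEGIES[strategy](candidates)
    except KeyError:
        raise ValueError(f"unknown routing strategy: {strategy}") from None
```

A table like this also makes the P0 decision concrete: until a second entry exists, the field really does select nothing, which supports the "delete it for now" option.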