GenieHive/docs/roadmap.md

7.0 KiB
Raw Blame History

GenieHive Roadmap

Last updated: 2026-04-27

What Is Complete

The v1 core is implemented and tested.

Registry and cluster control:

  • SQLite-backed registry with hosts, services, roles, and benchmark samples
  • Node registration and heartbeat protocol with auto-re-registration on 404
  • Role catalog loading from YAML
  • Route resolution: direct asset/service match → role resolution → clear failure

Client-facing API:

  • GET /v1/models — OpenAI-compatible model list with rich metadata (loaded state, latency hints, offload classification, role aliases)
  • POST /v1/chat/completions — proxies to upstream with request policy application
  • POST /v1/embeddings — proxies to upstream

Request policy system:

  • Body defaults and overrides via deep merge
  • System prompt injection (prepend / append / replace)
  • Per-asset and per-role policies, merged with role winning on prompts
  • Qwen3 / Qwen3.5 auto-detection with enable_thinking: false applied automatically

Route matching and scoring:

  • POST /v1/cluster/routes/match — scored candidate list for role and service targets
  • Signals: text overlap, preferred family, runtime (loaded state, latency, throughput, queue depth), benchmark (workload overlap, quality score)
  • GET /v1/cluster/routes/resolve — quick single-model resolution

Benchmark infrastructure:

  • Built-in workloads: chat.short_reasoning, chat.concise_support
  • run_benchmark_workload.py executes workloads and emits a JSON report
  • ingest_benchmark_report.py posts results to the control plane
  • Benchmark samples feed the route scoring pipeline

Operator inspection:

  • GET /v1/cluster/hosts, /services, /roles, /benchmarks, /health

Auth:

  • Client API key (X-Api-Key) and node registration key (X-GenieHive-Node-Key)
  • Empty key lists disable auth for development

Tests:

  • Registry, chat proxy, node inventory, benchmark runner, full demo flow
  • All passing

Known Gaps and Issues

These are confirmed gaps in the current implementation, not aspirational items.

1. Transcription endpoint not implemented

POST /v1/audio/transcriptions is listed in the architecture and wired into main.py, but there is no upstream proxy handler for it. upstream.py has no transcriptions() method. The endpoint currently returns nothing useful.

2. Routing strategy field is ignored

RoutingConfig.default_strategy exists in config.py (default: "loaded_first"), but resolve_route() in registry.py does not read it. There is effectively only one strategy. The field is misleading.

3. Role fallback chain is not implemented

RoutingPolicy.fallback_roles is defined in models.py and appears in the schema docs, but resolve_route() never consults it. A role that fails to match any service fails outright rather than trying its fallbacks.

4. _benchmark_quality_score can exceed 1.0 before clamping

pass_rate and quality_score are taken as max(), then tokens_per_sec and ttft_ms are added on top. A service with pass_rate=1.0, fast tokens, and low TTFT accumulates a score of up to 1.6 before the final min(1.0, quality) clamp. This means the additive bonuses have no effect once pass_rate or quality_score is already high, which is probably not the intended behavior.

5. Health is self-reported only

Service health (healthy / unhealthy) comes entirely from node-reported state. The control plane does not probe upstream endpoints. A service can appear healthy while its endpoint is unreachable.

6. No active model discovery from upstream services

The node agent scans for .gguf files on disk and reads static service config. It does not query running Ollama or vLLM instances for their loaded model list. A freshly-pulled Ollama model will not appear until the node config is updated and the agent restarted.

7. docs/architecture.md duplicates GENIEWARREN_SPEC.md

architecture.md contains the repo-naming rationale, name alternatives, and implementation sequence list that are only meaningful in a design/proposal context. These are noise in a reference architecture document.


Immediate Next Work (Priority Order)

P0 — Fix confirmed bugs

  1. Remove the misleading default_strategy field or implement a dispatch table so the config field actually selects behavior. Simplest fix: delete the field and the dead config surface until a second strategy is implemented.

  2. Fix _benchmark_quality_score so additive bonuses apply only when no pass_rate / quality_score is available, or restructure as a weighted average so the components don't stack additively.

P1 — Complete stated v1 scope

  1. Implement transcription proxy — add upstream.transcriptions() and wire the handler in chat.py and main.py.

  2. Implement role fallback chain — when resolve_route() finds no matching service for a role, walk fallback_roles in order before failing.

P2 — Close the most important self-reported-only gaps

  1. Add active health probing — the control plane should periodically probe registered service endpoints (a lightweight GET /health or GET /v1/models is sufficient) and update health state independently of node heartbeats.

  2. Add upstream model discovery for Ollama — query GET /api/tags (Ollama) or GET /v1/models (OpenAI-compatible) from the node agent and merge loaded model names into the service's asset list. This enables dynamic model tracking without config restarts.

P3 — Documentation cleanup

  1. Revise architecture.md — remove the design-phase repo-naming rationale and first-implementation-sequence list; replace with a description of the actual running system (the four layers as implemented, data flow diagram if possible).

  2. Update roadmap.md — this file (done).


Near-Term Milestones (After P0P3)

  • Live LLM demo — run control + node against a real upstream (Ollama or llama.cpp) and document the end-to-end flow, including chat via role and direct asset addressing
  • Validate Codex-friendly /v1/models offload — test GET /v1/models as a programmatic service catalog for a Claude Code or Codex client selecting a GenieHive-hosted model for lower-complexity subtasks
  • Richer node metrics — queue depth, in-flight count, and rolling performance averages reported from node to control on every heartbeat
  • Second routing strategy — implement round_robin or least_loaded as a second selectable strategy, then make default_strategy actually dispatch

V1.5 Scope (Not Yet Started)

  • mTLS between control plane and node agents
  • Scoped client tokens (read-only vs. operator vs. admin)
  • Active load-aware model swapping (trigger unload/load on a node based on demand)
  • Image and TTS generation adapter stubs
  • Streaming response passthrough for chat completions

Non-Goals (Unchanged from Original Spec)

  • Peer-to-peer consensus
  • Autonomous global model swapping across many nodes
  • Full WAN zero-trust platform
  • Distributed vector database management
  • Billing or multi-tenant quota accounting