# GenieHive Roadmap
Last updated: 2026-04-27 (P0–P2 complete + routing strategies + streaming + Ollama load state + observed metrics)
## What Is Complete
The v1 core is implemented and tested.
Registry and cluster control:
- SQLite-backed registry with hosts, services, roles, and benchmark samples
- Node registration and heartbeat protocol with auto-re-registration on 404
- Role catalog loading from YAML
- Route resolution: direct asset/service match → role resolution → clear failure
Client-facing API:
- `GET /v1/models` — OpenAI-compatible model list with rich metadata (loaded state, latency hints, offload classification, role aliases)
- `POST /v1/chat/completions` — proxies to upstream with request policy application
- `POST /v1/embeddings` — proxies to upstream
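For orientation, a minimal client call in the OpenAI wire format might look like the following; the base URL, API key, and role name are placeholders, not values from this repo:

```python
# Hypothetical smoke test against a local control plane.
import httpx

resp = httpx.post(
    "http://localhost:8000/v1/chat/completions",
    headers={"X-Api-Key": "dev-key"},           # client API key header
    json={
        "model": "chat.general",                 # a role alias or a direct asset name
        "messages": [{"role": "user", "content": "Hello"}],
    },
    timeout=60.0,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```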
Request policy system:
- Body defaults and overrides via deep merge
- System prompt injection (prepend / append / replace)
- Per-asset and per-role policies, merged with role winning on prompts
- Qwen3 / Qwen3.5 auto-detection with `enable_thinking: false` applied automatically
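A condensed sketch of the deep-merge semantics described above; `deep_merge` is an illustrative helper name, not the project's actual function:

```python
# Illustrative deep merge: values in `overlay` win over `base`, recursing
# into nested dicts. For body defaults, the client request sits in the
# overlay position (request wins); for overrides, the policy does.
def deep_merge(base: dict, overlay: dict) -> dict:
    merged = dict(base)
    for key, value in overlay.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged

# Defaults: the request wins over the policy's default body.
body = deep_merge({"temperature": 0.2, "options": {"num_ctx": 8192}},
                  {"options": {"num_ctx": 4096}})
assert body == {"temperature": 0.2, "options": {"num_ctx": 4096}}
```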
Client-facing proxy:
- `POST /v1/audio/transcriptions` — proxies multipart audio to upstream; uses a real httpx client for multipart form-data (not the injectable `AsyncPosterProtocol`)
Route matching and scoring:
- `POST /v1/cluster/routes/match` — scored candidate list for role and service targets
- Signals: text overlap, preferred family, runtime (loaded state, latency, throughput, queue depth), benchmark (workload overlap, quality score)
- `GET /v1/cluster/routes/resolve` — quick single-model resolution
- `fallback_roles` chain in `resolve_route()` — walks role fallbacks with cycle protection; each fallback resolves using its own operation (not the primary's kind); see the sketch after this list
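The fallback walk can be pictured roughly like this; `resolve_one` and the role dict shape are assumptions for illustration, not the real internals:

```python
# Sketch of a fallback_roles walk with cycle protection. Each role is
# tried with its own operation kind; visited names are never retried.
def resolve_with_fallbacks(role_name: str, roles: dict, resolve_one):
    seen: set[str] = set()
    queue = [role_name]
    while queue:
        name = queue.pop(0)
        if name in seen or name not in roles:
            continue
        seen.add(name)
        route = resolve_one(roles[name])  # resolves using this role's own kind
        if route is not None:
            return route
        queue.extend(roles[name].get("fallback_roles", []))
    return None  # clear failure: nothing in the chain resolved
```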
Benchmark infrastructure:
- Built-in workloads: `chat.short_reasoning`, `chat.concise_support`
- `run_benchmark_workload.py` executes workloads and emits a JSON report
- `ingest_benchmark_report.py` posts results to the control plane
- Benchmark samples feed the route scoring pipeline
Operator inspection:
- `GET /v1/cluster/hosts`, `/services`, `/roles`, `/benchmarks`, `/health`
Auth:
- Client API key (`X-Api-Key`) and node registration key (`X-GenieHive-Node-Key`)
- Empty key lists disable auth for development
Active health probing (control plane):
- `ServiceProber` in `probe.py` probes each service's `GET /health` endpoint
- Health divergences update the registry's `state_json` without touching other fields
- Background `probe_loop` task launched at app startup when `routing.probe_interval_s > 0` (default 0 = disabled, relies on node heartbeats); see the sketch after this list
- Configurable via `routing.probe_interval_s` and `routing.probe_timeout_s`
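The background task has roughly this shape; only the names `probe_once` and `update_service_health` come from the test suite, and the call signatures here are assumptions:

```python
# Approximate shape of the startup probe task. interval_s <= 0 disables
# active probing and leaves health entirely to node heartbeats.
import asyncio

async def probe_loop(prober, registry, interval_s: float, timeout_s: float) -> None:
    if interval_s <= 0:
        return
    while True:
        for service in registry.list_services():
            healthy = await prober.probe_once(service, timeout=timeout_s)  # GET /health
            registry.update_service_health(service, healthy)  # touches state_json only
        await asyncio.sleep(interval_s)
```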
Routing strategies — all three implemented:
- `routing.default_strategy` in config; `Registry(routing_strategy=...)` dispatches
- `scored` (default): picks the best-scoring service per role
- `round_robin`: cycles through healthy candidates; in-memory counter, resets on restart
- `least_loaded`: picks the service with the lowest `queue_depth + in_flight` from observed metrics; falls back to latency as a secondary signal when load metrics are equal (see the sketch after this list)
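As a sketch of the `least_loaded` tie-break (field names are assumptions based on the signals listed above):

```python
# Pick the service with the lowest queue_depth + in_flight; when load
# ties, fall back to observed latency as the secondary signal.
def pick_least_loaded(candidates: list[dict]) -> dict:
    def key(svc: dict):
        obs = svc.get("observed", {})
        load = (obs.get("queue_depth") or 0) + (obs.get("in_flight") or 0)
        latency = obs.get("latency_ms") or float("inf")
        return (load, latency)
    return min(candidates, key=key)
```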
Streaming chat completions:
- `UpstreamClient.chat_completions_stream()` — async generator, yields raw SSE bytes using `httpx.AsyncClient.stream()`; raises `UpstreamError` before the first yield on non-2xx status
- `_prepare_chat_upstream()` extracted from `proxy_chat_completion` — synchronous routing/policy step so `ProxyError` can be caught before the `StreamingResponse` is created
- `stream_chat_completion()` — async generator wrapping `chat_completions_stream`, applies `_strip_reasoning_from_sse_chunk()` to each SSE data line
- Route handler detects `body.get("stream")`, resolves the route eagerly, and returns a `StreamingResponse` with `Cache-Control: no-cache, X-Accel-Buffering: no`
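A minimal sketch of the SSE pass-through shape described above, assuming an httpx-based upstream client; the error type and URL layout are placeholders:

```python
# Async generator that re-yields upstream SSE bytes. Raising before the
# first yield lets the route handler return a proper error status instead
# of a stream that breaks mid-flight.
import httpx

async def chat_completions_stream(base_url: str, body: dict):
    async with httpx.AsyncClient(timeout=None) as client:
        async with client.stream(
            "POST", f"{base_url}/v1/chat/completions", json=body
        ) as resp:
            if resp.status_code >= 300:
                raise RuntimeError(f"upstream status {resp.status_code}")
            async for chunk in resp.aiter_bytes():
                yield chunk
```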
Upstream model discovery (node agent):
- `discover_ollama_assets()` — queries `/api/tags`; marks all as `loaded: False` (available, not necessarily in VRAM)
- `_get_ollama_ps_models()` — internal helper; queries `/api/ps`; returns the raw model list (with `size_in_vram` etc.) for reuse without extra HTTP requests
- `query_ollama_ps()` — public wrapper; returns a frozenset of VRAM-loaded model names
- `discover_openai_models()` — queries `/v1/models`; marks all as `loaded: True`
- `enrich_service_assets(service, *, protocol)` — for `"ollama"`: two-phase query (tags + ps); updates the `loaded` state of existing static assets as well as adding new ones; a stale `loaded: True` in config is corrected to `False` if the model isn't in `/api/ps`; populates `observed.loaded_model_count` and `observed.vram_used_bytes` from the `/api/ps` response (see the sketch after this list)
- Per-service `discover_protocol: "ollama" | "openai" | null` config field
- Heartbeat zips service dicts with config objects to pass the protocol correctly
- Separate httpx discovery client allocated only when any service opts in
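Condensed, the two-phase Ollama query looks roughly like this (a sketch, not the agent's actual code):

```python
# /api/tags lists everything the daemon can serve; /api/ps lists only
# what is resident in VRAM. Crossing the two yields a correct `loaded`
# flag per asset.
import httpx

async def discover_ollama(base_url: str) -> list[dict]:
    async with httpx.AsyncClient() as client:
        tags = (await client.get(f"{base_url}/api/tags")).json()
        ps = (await client.get(f"{base_url}/api/ps")).json()
    in_vram = frozenset(m["name"] for m in ps.get("models", []))
    return [
        {"name": m["name"], "loaded": m["name"] in in_vram}
        for m in tags.get("models", [])
    ]
```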
ServiceObserved extended:
- `loaded_model_count: int | None` — number of models currently in VRAM (from Ollama `/api/ps`)
- `vram_used_bytes: int | None` — total VRAM used across loaded models
- Both exposed in the `_runtime_signals` signals dict for route scoring visibility
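For reference, the two new fields as a dataclass fragment; the real `ServiceObserved` model carries more fields than shown:

```python
from dataclasses import dataclass

@dataclass
class ServiceObserved:
    loaded_model_count: int | None = None  # models resident in VRAM (/api/ps)
    vram_used_bytes: int | None = None     # total VRAM across loaded models
```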
Tests:
- Registry, chat proxy, node inventory, benchmark runner, full demo flow
- `ServiceProber` `probe_once`, `update_service_health`, `discover_ollama_assets`, `enrich_service_assets`, observed metrics population — all passing (47 total)
## Known Gaps and Issues
No confirmed gaps remain in the current implementation. Improvement areas:
1. Discovery covers Ollama and OpenAI-compatible backends; faster-whisper is not covered.
   Transcription services (faster-whisper, WhisperX) don't expose `/api/tags` or `/v1/models`. A `discover_protocol: "whisper"` variant could query `GET /inference/v1/models` or read a static manifest; a speculative sketch follows this list.
2. `architecture.md` could be tightened further.
   Minor: some sections inherited from earlier drafts could be simplified now that the implementation is stable.
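One possible shape for such a variant; the endpoint and response fields below are assumptions about a faster-whisper-style server, not an existing API:

```python
# Speculative "whisper" discovery: query a hypothetical model-listing
# endpoint and mark everything as loaded, since faster-whisper keeps
# its model resident once the server is up.
import httpx

async def discover_whisper_assets(base_url: str) -> list[dict]:
    async with httpx.AsyncClient() as client:
        resp = await client.get(f"{base_url}/inference/v1/models")
    resp.raise_for_status()
    return [
        {"name": m["id"], "loaded": True, "kind": "transcription"}
        for m in resp.json().get("data", [])
    ]
```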
## Next Work
- Live end-to-end demo — run control + node against a real upstream (Ollama or llama.cpp) and validate: chat via role, direct asset addressing, Ollama dynamic discovery with correct load state, `least_loaded` routing with real VRAM metrics, and streaming.
- Validate Codex-friendly `/v1/models` offload — test `GET /v1/models` as a programmatic service catalog for a Claude Code or Codex client selecting a GenieHive-hosted model for lower-complexity subtasks.
- `queue_depth` / `in_flight` from Ollama — populate from the `/api/ps` model count or from a sidecar queue tracker; currently only set from static config (a speculative sketch follows this list).
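Until a sidecar tracker exists, a crude interim signal could be derived from `/api/ps`; this is purely speculative, since `/api/ps` carries no per-request queue data:

```python
# Coarse load hint: treat the number of VRAM-resident models as a proxy
# for demand on the node. A real queue tracker would replace this.
def load_hint_from_ps(ps_response: dict) -> int:
    return len(ps_response.get("models", []))
```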
## V1.5 Scope (Not Yet Started)
- mTLS between control plane and node agents
- Scoped client tokens (read-only vs. operator vs. admin)
- Active load-aware model swapping (trigger unload/load on a node based on demand)
- Image and TTS generation adapter stubs
## Non-Goals (Unchanged from Original Spec)
- Peer-to-peer consensus
- Autonomous global model swapping across many nodes
- Full WAN zero-trust platform
- Distributed vector database management
- Billing or multi-tenant quota accounting