
GenieHive Roadmap

Last updated: 2026-04-27 (P0-P2 complete + routing strategies + streaming + Ollama load state + observed metrics)

What Is Complete

The v1 core is implemented and tested.

Registry and cluster control:

  • SQLite-backed registry with hosts, services, roles, and benchmark samples
  • Node registration and heartbeat protocol with auto-re-registration on 404
  • Role catalog loading from YAML
  • Route resolution: direct asset/service match → role resolution → clear failure
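
A minimal sketch of that resolution order, using plain dicts in place of the SQLite-backed registry (names here are illustrative, not the real API):

```python
class RouteError(Exception):
    """Raised when neither a direct match nor a role resolves."""

def resolve_target(target: str, assets: dict[str, str], roles: dict[str, str]) -> str:
    # 1. Direct asset/service match wins outright.
    if target in assets:
        return assets[target]
    # 2. Otherwise try to resolve the target as a role alias.
    if target in roles and roles[target] in assets:
        return assets[roles[target]]
    # 3. No silent fallback: surface a clear routing failure.
    raise RouteError(f"no asset, service, or role matches {target!r}")
```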

Client-facing API:

  • GET /v1/models — OpenAI-compatible model list with rich metadata (loaded state, latency hints, offload classification, role aliases)
  • POST /v1/chat/completions — proxies to upstream with request policy application
  • POST /v1/embeddings — proxies to upstream
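
A usage example against this surface (the endpoint paths and X-Api-Key header come from this document; the base URL, key value, and role alias are placeholders):

```python
import httpx

BASE_URL = "http://control-plane.local:8000"   # placeholder address
HEADERS = {"X-Api-Key": "dev-key"}             # client API key (see Auth below)

with httpx.Client(base_url=BASE_URL, headers=HEADERS, timeout=60.0) as client:
    # List the models and role aliases the cluster exposes.
    models = client.get("/v1/models").json()

    # Chat against a role alias; the control plane resolves it to a service.
    reply = client.post("/v1/chat/completions", json={
        "model": "chat.general",               # placeholder role alias
        "messages": [{"role": "user", "content": "Hello, GenieHive"}],
    }).json()
```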

Request policy system:

  • Body defaults and overrides via deep merge
  • System prompt injection (prepend / append / replace)
  • Per-asset and per-role policies, merged with the role policy winning on prompts
  • Qwen3 / Qwen3.5 auto-detection with enable_thinking: false applied automatically
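
A sketch of how the deep merge and prompt injection could compose; the function names and mode handling here are illustrative, not the actual implementation:

```python
def deep_merge(base: dict, override: dict) -> dict:
    # Recursive merge with override values winning on conflicts.
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged

def inject_system_prompt(messages: list[dict], text: str, mode: str = "prepend") -> list[dict]:
    # Apply the policy's prompt in prepend / append / replace mode.
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    if mode == "replace" or not system:
        content = text
    elif mode == "prepend":
        content = text + "\n\n" + system[0]["content"]
    else:  # append
        content = system[0]["content"] + "\n\n" + text
    return [{"role": "system", "content": content}] + rest
```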

Client-facing proxy:

  • POST /v1/audio/transcriptions — proxies multipart audio to upstream; uses a real httpx client for multipart form-data (not the injectable AsyncPoster Protocol)
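
A sketch of the multipart hand-off (the timeout, field names, and function signature are assumptions; the real handler lives behind the route above). httpx builds the multipart body from files=/data= directly:

```python
import httpx

async def forward_transcription(upstream_url: str, filename: str, audio: bytes, model: str) -> dict:
    # Multipart form-data upload handled by a real httpx client.
    async with httpx.AsyncClient(timeout=120.0) as client:
        response = await client.post(
            f"{upstream_url}/v1/audio/transcriptions",
            files={"file": (filename, audio, "audio/wav")},
            data={"model": model},
        )
        response.raise_for_status()
        return response.json()
```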

Route matching and scoring:

  • POST /v1/cluster/routes/match — scored candidate list for role and service targets
  • Signals: text overlap, preferred family, runtime (loaded state, latency, throughput, queue depth), benchmark (workload overlap, quality score)
  • GET /v1/cluster/routes/resolve — quick single-model resolution
  • fallback_roles chain in resolve_route() — walks role fallbacks with cycle protection; each fallback resolves using its own operation (not the primary's kind)
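
A sketch of the fallback walk with cycle protection described for resolve_route(); the role data shape and the resolve_one callback are illustrative:

```python
class RoleResolutionError(Exception):
    pass

def resolve_with_fallbacks(role: str, roles: dict[str, dict], resolve_one) -> str:
    # Walk the primary role, then its fallback_roles chain, refusing to revisit a
    # role so misconfigured cycles terminate with a clear error instead of looping.
    seen: set[str] = set()
    queue = [role]
    while queue:
        current = queue.pop(0)
        if current in seen or current not in roles:
            continue
        seen.add(current)
        spec = roles[current]
        # Each fallback resolves using its own operation, not the primary's kind.
        result = resolve_one(current, spec.get("operation"))
        if result is not None:
            return result
        queue.extend(spec.get("fallback_roles", []))
    raise RoleResolutionError(f"role {role!r} and its fallbacks did not resolve")
```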

Benchmark infrastructure:

  • Built-in workloads: chat.short_reasoning, chat.concise_support
  • run_benchmark_workload.py executes workloads and emits a JSON report
  • ingest_benchmark_report.py posts results to the control plane
  • Benchmark samples feed the route scoring pipeline
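
A sketch of the hand-off between the two scripts: the runner writes a JSON report, and the ingester posts it to the control plane. The ingest path and auth header here are assumptions; the real values live in ingest_benchmark_report.py:

```python
import json
import httpx

def ingest_report(report_path: str, control_url: str, api_key: str) -> None:
    # Load the JSON report emitted by run_benchmark_workload.py ...
    with open(report_path) as fh:
        report = json.load(fh)
    # ... and post it to the control plane (path and header are assumptions).
    response = httpx.post(
        f"{control_url}/v1/cluster/benchmarks",
        json=report,
        headers={"X-Api-Key": api_key},
        timeout=30.0,
    )
    response.raise_for_status()
```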

Operator inspection:

  • GET /v1/cluster/hosts, /services, /roles, /benchmarks, /health

Auth:

  • Client API key (X-Api-Key) and node registration key (X-GenieHive-Node-Key)
  • Empty key lists disable auth for development

Active health probing (control plane):

  • ServiceProber in probe.py probes each service's GET /health endpoint
  • Health divergences update the registry's state_json without touching other fields
  • Background probe_loop task launched at app startup when routing.probe_interval_s > 0 (default 0 = disabled, relies on node heartbeats)
  • Configurable via routing.probe_interval_s and routing.probe_timeout_s
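
A sketch of the probe loop shape implied by those settings; `services` and `update_service_health` stand in for the real registry hooks:

```python
import asyncio
import httpx

async def probe_loop(services, update_service_health, interval_s: float, timeout_s: float):
    # interval_s <= 0 disables probing entirely; the cluster then relies on heartbeats.
    if interval_s <= 0:
        return
    async with httpx.AsyncClient(timeout=timeout_s) as client:
        while True:
            for service in services():
                try:
                    response = await client.get(f"{service['base_url']}/health")
                    healthy = response.status_code == 200
                except httpx.HTTPError:
                    healthy = False
                # Only the health portion of state_json is updated; other fields stay put.
                update_service_health(service["id"], healthy)
            await asyncio.sleep(interval_s)
```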

Routing strategies — all three implemented:

  • routing.default_strategy in config; Registry(routing_strategy=...) dispatches
  • scored (default): picks best-scoring service per role
  • round_robin: cycles through healthy candidates; in-memory counter, resets on restart
  • least_loaded: picks service with lowest queue_depth + in_flight from observed metrics; falls back to latency as a secondary signal when load metrics are equal
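
A sketch of the least_loaded pick with the latency tiebreak described above; the observed field names mirror the wording here, but the exact shape is illustrative:

```python
def pick_least_loaded(candidates: list[dict]) -> dict:
    # Primary signal: queue_depth + in_flight from observed metrics.
    # Secondary signal: latency, used only when load is equal.
    def load_key(service: dict) -> tuple[float, float]:
        observed = service.get("observed", {})
        load = (observed.get("queue_depth") or 0) + (observed.get("in_flight") or 0)
        latency = observed.get("latency_ms") or float("inf")
        return (load, latency)
    return min(candidates, key=load_key)
```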

Streaming chat completions:

  • UpstreamClient.chat_completions_stream() — async generator, yields raw SSE bytes using httpx.AsyncClient.stream(); raises UpstreamError before first yield on non-2xx status
  • _prepare_chat_upstream() extracted from proxy_chat_completion — synchronous routing/policy step so ProxyError can be caught before StreamingResponse is created
  • stream_chat_completion() — async generator wrapping chat_completions_stream, applies _strip_reasoning_from_sse_chunk() to each SSE data line
  • Route handler detects body.get("stream"), resolves route eagerly, returns StreamingResponse with Cache-Control: no-cache, X-Accel-Buffering: no
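
A sketch of that streaming path, assuming a FastAPI/Starlette app; the reasoning-stripping step and the real UpstreamError type are omitted:

```python
import httpx
from fastapi.responses import StreamingResponse

async def chat_completions_stream(upstream_url: str, body: dict):
    # Yield raw SSE bytes; fail before the first yield on a non-2xx status so the
    # route handler can still return an ordinary error response.
    async with httpx.AsyncClient(timeout=None) as client:
        async with client.stream("POST", f"{upstream_url}/v1/chat/completions", json=body) as response:
            if response.status_code >= 300:
                raise RuntimeError(f"upstream returned {response.status_code}")
            async for chunk in response.aiter_bytes():
                yield chunk

def streaming_route_response(upstream_url: str, body: dict) -> StreamingResponse:
    # Returned when the client sends "stream": true in the request body.
    return StreamingResponse(
        chat_completions_stream(upstream_url, body),
        media_type="text/event-stream",
        headers={"Cache-Control": "no-cache", "X-Accel-Buffering": "no"},
    )
```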

Upstream model discovery (node agent):

  • discover_ollama_assets() — queries /api/tags; marks all as loaded: False (available, not necessarily in VRAM)
  • _get_ollama_ps_models() — internal helper; queries /api/ps; returns raw model list (with size_in_vram etc.) for reuse without extra HTTP requests
  • query_ollama_ps() — public wrapper; returns frozenset of VRAM-loaded model names
  • discover_openai_models() — queries /v1/models; marks all as loaded: True
  • enrich_service_assets(service, *, protocol) — for "ollama": two-phase query (tags + ps); updates loaded state of existing static assets as well as adding new ones; stale loaded: True in config gets corrected to False if the model isn't in /api/ps; populates observed.loaded_model_count and observed.vram_used_bytes from /api/ps response
  • Per-service discover_protocol: "ollama" | "openai" | null config field
  • Heartbeat zips service dicts with config objects to pass protocol correctly
  • Separate httpx discovery client allocated only when any service opts in
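
A sketch of the two-phase Ollama query: /api/tags for the available set, /api/ps for what is currently in VRAM. The endpoints and their "models" arrays are real Ollama API surface; the returned asset shape is illustrative:

```python
import httpx

async def discover_ollama(base_url: str, client: httpx.AsyncClient) -> list[dict]:
    # Phase 1: everything Ollama knows about (on disk, not necessarily in VRAM).
    tags = (await client.get(f"{base_url}/api/tags")).json().get("models", [])
    # Phase 2: the subset currently loaded into VRAM.
    ps = (await client.get(f"{base_url}/api/ps")).json().get("models", [])
    loaded = {m["name"] for m in ps}
    return [{"name": m["name"], "loaded": m["name"] in loaded} for m in tags]
```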

ServiceObserved extended:

  • loaded_model_count: int | None — number of models currently in VRAM (from Ollama /api/ps)
  • vram_used_bytes: int | None — total VRAM used across loaded models
  • Both exposed in _runtime_signals signals dict for route scoring visibility
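
A sketch of the extended shape; only the two new fields come from this section, the other fields echo signals named elsewhere in this document, and the real model may be defined differently (e.g. as a Pydantic model):

```python
from dataclasses import dataclass

@dataclass
class ServiceObserved:
    queue_depth: int | None = None
    in_flight: int | None = None
    latency_ms: float | None = None
    # New: populated from the Ollama /api/ps response during discovery.
    loaded_model_count: int | None = None
    vram_used_bytes: int | None = None
```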

Tests:

  • Registry, chat proxy, node inventory, benchmark runner, full demo flow
  • ServiceProber probe_once, update_service_health, discover_ollama_assets, enrich_service_assets, observed metrics population — all passing (47 total)

Known Gaps and Issues

No confirmed gaps remain in the current implementation. Improvement areas:

1. Discovery covers Ollama and OpenAI-compatible endpoints; faster-whisper is not covered

Transcription services (faster-whisper, WhisperX) don't expose /api/tags or /v1/models. A discover_protocol: "whisper" variant could query GET /inference/v1/models or read a static manifest.
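
A sketch of what that variant could look like, assuming a server that exposes the suggested GET /inference/v1/models; entirely hypothetical, including the response shape:

```python
import httpx

async def discover_whisper_assets(base_url: str, client: httpx.AsyncClient) -> list[dict]:
    # Hypothetical "whisper" discover_protocol: query the suggested endpoint and
    # treat every advertised model as a transcription asset.
    response = await client.get(f"{base_url}/inference/v1/models")
    response.raise_for_status()
    return [
        {"name": model["id"], "kind": "transcription", "loaded": True}
        for model in response.json().get("data", [])
    ]
```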

2. architecture.md could be tightened further

Minor: some sections inherited from earlier drafts could be simplified now that the implementation is stable.


Next Work

  1. Live end-to-end demo — run control + node against a real upstream (Ollama or llama.cpp) and validate: chat via role, direct asset addressing, Ollama dynamic discovery with correct load state, least_loaded routing with real VRAM metrics, and streaming.

  2. Validate Codex-friendly /v1/models offload — test GET /v1/models as a programmatic service catalog for a Claude Code or Codex client selecting a GenieHive-hosted model for lower-complexity subtasks.

  3. queue_depth / in_flight from Ollama — populate from /api/ps model count or from a sidecar queue tracker; currently only set from static config.
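
One possible shape for the sidecar tracker option, purely illustrative: an async context manager the node agent could wrap around each proxied request and report as in_flight in its heartbeat.

```python
class InFlightTracker:
    """Minimal sidecar-style counter a node agent could report as in_flight."""

    def __init__(self) -> None:
        self.in_flight = 0

    async def __aenter__(self) -> "InFlightTracker":
        self.in_flight += 1
        return self

    async def __aexit__(self, *exc) -> None:
        self.in_flight -= 1
```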


V1.5 Scope (Not Yet Started)

  • mTLS between control plane and node agents
  • Scoped client tokens (read-only vs. operator vs. admin)
  • Active load-aware model swapping (trigger unload/load on a node based on demand)
  • Image and TTS generation adapter stubs
  • Streaming response passthrough for chat completions

Non-Goals (Unchanged from Original Spec)

  • Peer-to-peer consensus
  • Autonomous global model swapping across many nodes
  • Full WAN zero-trust platform
  • Distributed vector database management
  • Billing or multi-tenant quota accounting