
GenieHive Roadmap

Last updated: 2026-04-27 (P0-P2 complete + routing strategies + streaming + Ollama load state + observed metrics)

What Is Complete

The v1 core is implemented and tested.

Registry and cluster control:

  • SQLite-backed registry with hosts, services, roles, and benchmark samples
  • Node registration and heartbeat protocol with auto-re-registration on 404
  • Role catalog loading from YAML
  • Route resolution: direct asset/service match → role resolution → clear failure
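
A minimal sketch of that resolution order, using plain dicts in place of the SQLite-backed registry (names here are illustrative, not the real API):

```python
class RouteError(Exception):
    """Raised when neither a direct match nor a role resolves."""

def resolve_target(target: str, assets: dict[str, str], roles: dict[str, str]) -> str:
    # 1. Direct asset/service match wins outright.
    if target in assets:
        return assets[target]
    # 2. Otherwise try to resolve the target as a role alias.
    if target in roles and roles[target] in assets:
        return assets[roles[target]]
    # 3. No silent fallback: surface a clear routing failure.
    raise RouteError(f"no asset, service, or role matches {target!r}")
```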

Client-facing API:

  • GET /v1/models — OpenAI-compatible model list with rich metadata (loaded state, latency hints, offload classification, role aliases)
  • POST /v1/chat/completions — proxies to upstream with request policy application
  • POST /v1/embeddings — proxies to upstream
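
A usage example against this surface (the endpoint paths and X-Api-Key header come from this document; the base URL, key value, and role alias are placeholders):

```python
import httpx

BASE_URL = "http://control-plane.local:8000"   # placeholder address
HEADERS = {"X-Api-Key": "dev-key"}             # client API key (see Auth below)

with httpx.Client(base_url=BASE_URL, headers=HEADERS, timeout=60.0) as client:
    # List the models and role aliases the cluster exposes.
    models = client.get("/v1/models").json()

    # Chat against a role alias; the control plane resolves it to a service.
    reply = client.post("/v1/chat/completions", json={
        "model": "chat.general",               # placeholder role alias
        "messages": [{"role": "user", "content": "Hello, GenieHive"}],
    }).json()
```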

Request policy system:

  • Body defaults and overrides via deep merge
  • System prompt injection (prepend / append / replace)
  • Per-asset and per-role policies, merged with the role policy winning on prompts
  • Qwen3 / Qwen3.5 auto-detection with enable_thinking: false applied automatically
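
A sketch of how the deep merge and prompt injection could compose; the function names and mode handling here are illustrative, not the actual implementation:

```python
def deep_merge(base: dict, override: dict) -> dict:
    # Recursive merge with override values winning on conflicts.
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged

def inject_system_prompt(messages: list[dict], text: str, mode: str = "prepend") -> list[dict]:
    # Apply the policy's prompt in prepend / append / replace mode.
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    if mode == "replace" or not system:
        content = text
    elif mode == "prepend":
        content = text + "\n\n" + system[0]["content"]
    else:  # append
        content = system[0]["content"] + "\n\n" + text
    return [{"role": "system", "content": content}] + rest
```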

Client-facing proxy:

  • POST /v1/audio/transcriptions — proxies multipart audio to upstream; uses a real httpx client for multipart form-data (not the injectable AsyncPoster Protocol)
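
A sketch of the multipart hand-off (the timeout, field names, and function signature are assumptions; the real handler lives behind the route above). httpx builds the multipart body from files=/data= directly:

```python
import httpx

async def forward_transcription(upstream_url: str, filename: str, audio: bytes, model: str) -> dict:
    # Multipart form-data upload handled by a real httpx client.
    async with httpx.AsyncClient(timeout=120.0) as client:
        response = await client.post(
            f"{upstream_url}/v1/audio/transcriptions",
            files={"file": (filename, audio, "audio/wav")},
            data={"model": model},
        )
        response.raise_for_status()
        return response.json()
```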

Route matching and scoring:

  • POST /v1/cluster/routes/match — scored candidate list for role and service targets
  • Signals: text overlap, preferred family, runtime (loaded state, latency, throughput, queue depth), benchmark (workload overlap, quality score)
  • GET /v1/cluster/routes/resolve — quick single-model resolution
  • fallback_roles chain in resolve_route() — walks role fallbacks with cycle protection; each fallback resolves using its own operation (not the primary's kind)
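
A sketch of the fallback walk with cycle protection described for resolve_route(); the role data shape and the resolve_one callback are illustrative:

```python
class RoleResolutionError(Exception):
    pass

def resolve_with_fallbacks(role: str, roles: dict[str, dict], resolve_one) -> str:
    # Walk the primary role, then its fallback_roles chain, refusing to revisit a
    # role so misconfigured cycles terminate with a clear error instead of looping.
    seen: set[str] = set()
    queue = [role]
    while queue:
        current = queue.pop(0)
        if current in seen or current not in roles:
            continue
        seen.add(current)
        spec = roles[current]
        # Each fallback resolves using its own operation, not the primary's kind.
        result = resolve_one(current, spec.get("operation"))
        if result is not None:
            return result
        queue.extend(spec.get("fallback_roles", []))
    raise RoleResolutionError(f"role {role!r} and its fallbacks did not resolve")
```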

Benchmark infrastructure:

  • Built-in workloads: chat.short_reasoning, chat.concise_support
  • run_benchmark_workload.py executes workloads and emits a JSON report
  • ingest_benchmark_report.py posts results to the control plane
  • Benchmark samples feed the route scoring pipeline
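
A sketch of the hand-off between the two scripts: the runner writes a JSON report, and the ingester posts it to the control plane. The ingest path and auth header here are assumptions; the real values live in ingest_benchmark_report.py:

```python
import json
import httpx

def ingest_report(report_path: str, control_url: str, api_key: str) -> None:
    # Load the JSON report emitted by run_benchmark_workload.py ...
    with open(report_path) as fh:
        report = json.load(fh)
    # ... and post it to the control plane (path and header are assumptions).
    response = httpx.post(
        f"{control_url}/v1/cluster/benchmarks",
        json=report,
        headers={"X-Api-Key": api_key},
        timeout=30.0,
    )
    response.raise_for_status()
```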

Operator inspection:

  • GET /v1/cluster/hosts, /services, /roles, /benchmarks, /health

Auth:

  • Client API key (X-Api-Key) and node registration key (X-GenieHive-Node-Key)
  • Empty key lists disable auth for development

Active health probing (control plane):

  • ServiceProber in probe.py probes each service's GET /health endpoint
  • Health divergences update the registry's state_json without touching other fields
  • Background probe_loop task launched at app startup when routing.probe_interval_s > 0 (default 0 = disabled, relies on node heartbeats)
  • Configurable via routing.probe_interval_s and routing.probe_timeout_s
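
A sketch of the probe loop shape implied by those settings; `services` and `update_service_health` stand in for the real registry hooks:

```python
import asyncio
import httpx

async def probe_loop(services, update_service_health, interval_s: float, timeout_s: float):
    # interval_s <= 0 disables probing entirely; the cluster then relies on heartbeats.
    if interval_s <= 0:
        return
    async with httpx.AsyncClient(timeout=timeout_s) as client:
        while True:
            for service in services():
                try:
                    response = await client.get(f"{service['base_url']}/health")
                    healthy = response.status_code == 200
                except httpx.HTTPError:
                    healthy = False
                # Only the health portion of state_json is updated; other fields stay put.
                update_service_health(service["id"], healthy)
            await asyncio.sleep(interval_s)
```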

Routing strategies — all three implemented:

  • routing.default_strategy in config; Registry(routing_strategy=...) dispatches
  • scored (default): picks best-scoring service per role
  • round_robin: cycles through healthy candidates; in-memory counter, resets on restart
  • least_loaded: picks service with lowest queue_depth + in_flight from observed metrics; falls back to latency as a secondary signal when load metrics are equal
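
A sketch of the least_loaded pick with the latency tiebreak described above; the observed field names mirror the wording here, but the exact shape is illustrative:

```python
def pick_least_loaded(candidates: list[dict]) -> dict:
    # Primary signal: queue_depth + in_flight from observed metrics.
    # Secondary signal: latency, used only when load is equal.
    def load_key(service: dict) -> tuple[float, float]:
        observed = service.get("observed", {})
        load = (observed.get("queue_depth") or 0) + (observed.get("in_flight") or 0)
        latency = observed.get("latency_ms") or float("inf")
        return (load, latency)
    return min(candidates, key=load_key)
```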

Streaming chat completions:

  • UpstreamClient.chat_completions_stream() — async generator, yields raw SSE bytes using httpx.AsyncClient.stream(); raises UpstreamError before first yield on non-2xx status
  • _prepare_chat_upstream() extracted from proxy_chat_completion — synchronous routing/policy step so ProxyError can be caught before StreamingResponse is created
  • stream_chat_completion() — async generator wrapping chat_completions_stream, applies _strip_reasoning_from_sse_chunk() to each SSE data line
  • Route handler detects body.get("stream"), resolves route eagerly, returns StreamingResponse with Cache-Control: no-cache, X-Accel-Buffering: no
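
A sketch of that streaming path, assuming a FastAPI/Starlette app; the reasoning-stripping step and the real UpstreamError type are omitted:

```python
import httpx
from fastapi.responses import StreamingResponse

async def chat_completions_stream(upstream_url: str, body: dict):
    # Yield raw SSE bytes; fail before the first yield on a non-2xx status so the
    # route handler can still return an ordinary error response.
    async with httpx.AsyncClient(timeout=None) as client:
        async with client.stream("POST", f"{upstream_url}/v1/chat/completions", json=body) as response:
            if response.status_code >= 300:
                raise RuntimeError(f"upstream returned {response.status_code}")
            async for chunk in response.aiter_bytes():
                yield chunk

def streaming_route_response(upstream_url: str, body: dict) -> StreamingResponse:
    # Returned when the client sends "stream": true in the request body.
    return StreamingResponse(
        chat_completions_stream(upstream_url, body),
        media_type="text/event-stream",
        headers={"Cache-Control": "no-cache", "X-Accel-Buffering": "no"},
    )
```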

Upstream model discovery (node agent):

  • discover_ollama_assets() — queries /api/tags; marks all as loaded: False (available, not necessarily in VRAM)
  • _get_ollama_ps_models() — internal helper; queries /api/ps; returns raw model list (with size_in_vram etc.) for reuse without extra HTTP requests
  • query_ollama_ps() — public wrapper; returns frozenset of VRAM-loaded model names
  • discover_openai_models() — queries /v1/models; marks all as loaded: True
  • enrich_service_assets(service, *, protocol) — for "ollama": two-phase query (tags + ps); updates loaded state of existing static assets as well as adding new ones; stale loaded: True in config gets corrected to False if the model isn't in /api/ps; populates observed.loaded_model_count and observed.vram_used_bytes from /api/ps response
  • Per-service discover_protocol: "ollama" | "openai" | null config field
  • Heartbeat zips service dicts with config objects to pass protocol correctly
  • Separate httpx discovery client allocated only when any service opts in
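
A sketch of the two-phase Ollama query: /api/tags for the available set, /api/ps for what is currently in VRAM. The endpoints and their "models" arrays are real Ollama API surface; the returned asset shape is illustrative:

```python
import httpx

async def discover_ollama(base_url: str, client: httpx.AsyncClient) -> list[dict]:
    # Phase 1: everything Ollama knows about (on disk, not necessarily in VRAM).
    tags = (await client.get(f"{base_url}/api/tags")).json().get("models", [])
    # Phase 2: the subset currently loaded into VRAM.
    ps = (await client.get(f"{base_url}/api/ps")).json().get("models", [])
    loaded = {m["name"] for m in ps}
    return [{"name": m["name"], "loaded": m["name"] in loaded} for m in tags]
```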

ServiceObserved extended:

  • loaded_model_count: int | None — number of models currently in VRAM (from Ollama /api/ps)
  • vram_used_bytes: int | None — total VRAM used across loaded models
  • Both exposed in _runtime_signals signals dict for route scoring visibility
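
A sketch of the extended shape; only the two new fields come from this section, the other fields echo signals named elsewhere in this document, and the real model may be defined differently (e.g. as a Pydantic model):

```python
from dataclasses import dataclass

@dataclass
class ServiceObserved:
    queue_depth: int | None = None
    in_flight: int | None = None
    latency_ms: float | None = None
    # New: populated from the Ollama /api/ps response during discovery.
    loaded_model_count: int | None = None
    vram_used_bytes: int | None = None
```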

Tests:

  • Registry, chat proxy, node inventory, benchmark runner, full demo flow
  • ServiceProber probe_once, update_service_health, discover_ollama_assets, enrich_service_assets, observed metrics population — all passing (47 total)

Known Gaps and Issues

No confirmed gaps remain in the current implementation. Improvement areas:

1. Discovery covers Ollama and OpenAI-compatible endpoints; faster-whisper is not covered

Transcription services (faster-whisper, WhisperX) don't expose /api/tags or /v1/models. A discover_protocol: "whisper" variant could query GET /inference/v1/models or read a static manifest.
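
A sketch of what that variant could look like, assuming a server that exposes the suggested GET /inference/v1/models; entirely hypothetical, including the response shape:

```python
import httpx

async def discover_whisper_assets(base_url: str, client: httpx.AsyncClient) -> list[dict]:
    # Hypothetical "whisper" discover_protocol: query the suggested endpoint and
    # treat every advertised model as a transcription asset.
    response = await client.get(f"{base_url}/inference/v1/models")
    response.raise_for_status()
    return [
        {"name": model["id"], "kind": "transcription", "loaded": True}
        for model in response.json().get("data", [])
    ]
```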

2. architecture.md could be tightened further

Minor: some sections inherited from earlier drafts could be simplified now that the implementation is stable.


Next Work

  1. Live end-to-end demo — run control + node against a real upstream (Ollama or llama.cpp) and validate: chat via role, direct asset addressing, Ollama dynamic discovery with correct load state, least_loaded routing with real VRAM metrics, and streaming.

  2. Validate Codex-friendly /v1/models offload — test GET /v1/models as a programmatic service catalog for a Claude Code or Codex client selecting a GenieHive-hosted model for lower-complexity subtasks.

  3. queue_depth / in_flight from Ollama — populate from /api/ps model count or from a sidecar queue tracker; currently only set from static config.
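
One possible shape for the sidecar tracker option, purely illustrative: an async context manager the node agent could wrap around each proxied request and report as in_flight in its heartbeat.

```python
class InFlightTracker:
    """Minimal sidecar-style counter a node agent could report as in_flight."""

    def __init__(self) -> None:
        self.in_flight = 0

    async def __aenter__(self) -> "InFlightTracker":
        self.in_flight += 1
        return self

    async def __aexit__(self, *exc) -> None:
        self.in_flight -= 1
```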


V1.5 Scope (Not Yet Started)

  • mTLS between control plane and node agents
  • Scoped client tokens (read-only vs. operator vs. admin)
  • Active load-aware model swapping (trigger unload/load on a node based on demand)
  • Image and TTS generation adapter stubs
  • Streaming response passthrough for chat completions

Non-Goals (Unchanged from Original Spec)

  • Peer-to-peer consensus
  • Autonomous global model swapping across many nodes
  • Full WAN zero-trust platform
  • Distributed vector database management
  • Billing or multi-tenant quota accounting