# GenieHive Roadmap

Last updated: 2026-04-27 (P0–P2 complete + routing strategies + streaming + Ollama load state + observed metrics)

## What Is Complete

The v1 core is implemented and tested.

**Registry and cluster control:**

- SQLite-backed registry with hosts, services, roles, and benchmark samples
- Node registration and heartbeat protocol with auto-re-registration on 404
- Role catalog loading from YAML
- Route resolution: direct asset/service match → role resolution → clear failure (see the sketch below)
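
A minimal sketch of that resolution order, using illustrative dict lookups rather than the project's actual `Registry` API:

```python
# Resolution order sketch: direct asset/service match -> role -> clear failure.
# ASSETS and ROLES are illustrative stand-ins for the SQLite-backed registry.
ASSETS = {"qwen3:8b": "svc-ollama-1"}        # asset name -> owning service
ROLES = {"chat.default": ["svc-ollama-1"]}   # role name -> candidate services

def resolve_target(target: str) -> str:
    # 1. A direct asset/service match wins immediately.
    if target in ASSETS:
        return ASSETS[target]
    # 2. Otherwise try to resolve the target as a role.
    candidates = ROLES.get(target)
    if candidates:
        return candidates[0]
    # 3. No silent fallback: fail with a clear error.
    raise LookupError(f"no asset, service, or role matches {target!r}")
```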

**Client-facing API:**

- `GET /v1/models` — OpenAI-compatible model list with rich metadata (loaded state, latency hints, offload classification, role aliases)
- `POST /v1/chat/completions` — proxies to upstream with request policy application (client example below)
- `POST /v1/embeddings` — proxies to upstream
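
For example, a client can address either a role alias or a concrete asset through the chat endpoint; the base URL and key below are placeholders:

```python
import httpx

resp = httpx.post(
    "http://control-plane:8000/v1/chat/completions",  # placeholder address
    headers={"X-Api-Key": "dev-key"},
    json={
        "model": "chat.default",  # role alias or direct asset name
        "messages": [{"role": "user", "content": "Say hello."}],
    },
    timeout=60.0,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```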

**Request policy system:**

- Body defaults and overrides via deep merge (sketched below)
- System prompt injection (prepend / append / replace)
- Per-asset and per-role policies, merged with role winning on prompts
- Qwen3 / Qwen3.5 auto-detection with `enable_thinking: false` applied automatically
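
A minimal sketch of the two mechanisms, assuming plain-dict request bodies; the real policy schema may differ:

```python
from copy import deepcopy

def deep_merge(base: dict, override: dict) -> dict:
    """Recursively merge override into base; override wins on conflicts."""
    merged = deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged

def inject_system_prompt(body: dict, text: str, mode: str = "prepend") -> dict:
    """Apply prepend/append/replace semantics to the system message."""
    messages = [dict(m) for m in body.get("messages", [])]
    system = [m for m in messages if m.get("role") == "system"]
    rest = [m for m in messages if m.get("role") != "system"]
    if mode == "replace" or not system:
        system = [{"role": "system", "content": text}]
    elif mode == "prepend":
        system[0]["content"] = text + "\n" + system[0]["content"]
    else:  # append
        system[-1]["content"] = system[-1]["content"] + "\n" + text
    return {**body, "messages": system + rest}
```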

**Client-facing proxy:**

- `POST /v1/audio/transcriptions` — proxies multipart audio to upstream; uses a real httpx client for multipart form-data (not the injectable `AsyncPoster` Protocol); see the forwarding sketch below
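
A sketch of that forwarding step; the field names follow the OpenAI audio API shape, and the upstream address is a placeholder:

```python
import httpx

async def forward_transcription(upstream_base: str, filename: str,
                                audio: bytes, model: str) -> dict:
    # A real httpx client is required here: multipart form-data encoding
    # is not covered by the simple JSON-posting AsyncPoster Protocol.
    async with httpx.AsyncClient(base_url=upstream_base, timeout=120.0) as client:
        resp = await client.post(
            "/v1/audio/transcriptions",
            files={"file": (filename, audio, "audio/wav")},
            data={"model": model},
        )
        resp.raise_for_status()
        return resp.json()
```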

**Route matching and scoring:**

- `POST /v1/cluster/routes/match` — scored candidate list for role and service targets
- Signals: text overlap, preferred family, runtime (loaded state, latency, throughput, queue depth), benchmark (workload overlap, quality score)
- `GET /v1/cluster/routes/resolve` — quick single-model resolution
- `fallback_roles` chain in `resolve_route()` — walks role fallbacks with cycle protection; each fallback resolves using its own operation (not the primary's kind); see the cycle-protection sketch below
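
The cycle protection can be pictured as a visited-set walk; the fallback table below is illustrative, and the real `resolve_route()` also carries the operation kind:

```python
ROLE_FALLBACKS = {
    "chat.default": ["chat.small"],
    "chat.small": ["chat.default"],  # deliberate cycle for the example
}

def walk_fallbacks(role: str) -> list[str]:
    """Return the role plus its fallback chain, stopping on cycles."""
    chain: list[str] = []
    seen: set[str] = set()
    current: str | None = role
    while current is not None and current not in seen:
        seen.add(current)
        chain.append(current)
        fallbacks = ROLE_FALLBACKS.get(current, [])
        current = fallbacks[0] if fallbacks else None
    return chain

print(walk_fallbacks("chat.default"))  # ['chat.default', 'chat.small']
```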

**Benchmark infrastructure:**

- Built-in workloads: `chat.short_reasoning`, `chat.concise_support`
- `run_benchmark_workload.py` executes workloads and emits a JSON report
- `ingest_benchmark_report.py` posts results to the control plane (hedged example below)
- Benchmark samples feed the route scoring pipeline
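
A hedged example of the ingest step; the report shape and the POST endpoint below are assumptions, not the actual schema defined by the two scripts:

```python
import httpx

report = {  # hypothetical report shape
    "workload": "chat.short_reasoning",
    "asset": "qwen3:8b",
    "samples": [{"latency_ms": 840, "quality_score": 0.82}],
}

resp = httpx.post(
    "http://control-plane:8000/v1/cluster/benchmarks",  # assumed endpoint
    headers={"X-Api-Key": "dev-key"},
    json=report,
)
resp.raise_for_status()
```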

**Operator inspection:**

- `GET /v1/cluster/hosts`, `/services`, `/roles`, `/benchmarks`, `/health`

**Auth:**

- Client API key (`X-Api-Key`) and node registration key (`X-GenieHive-Node-Key`)
- Empty key lists disable auth for development (dependency sketch below)
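
A minimal sketch of the client-key check as a FastAPI dependency; GenieHive's actual wiring may differ, but the empty-list-disables-auth behavior matches the bullet above:

```python
from fastapi import HTTPException, Security
from fastapi.security import APIKeyHeader

CLIENT_KEYS: set[str] = {"dev-key"}  # an empty set disables the check

api_key_header = APIKeyHeader(name="X-Api-Key", auto_error=False)

async def require_client_key(key: str | None = Security(api_key_header)) -> None:
    if not CLIENT_KEYS:  # empty key list: auth disabled for development
        return
    if key not in CLIENT_KEYS:
        raise HTTPException(status_code=401, detail="invalid or missing API key")
```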

**Active health probing (control plane):**

- `ServiceProber` in `probe.py` probes each service's `GET /health` endpoint
- Health divergences update the registry's `state_json` without touching other fields
- Background `probe_loop` task launched at app startup when `routing.probe_interval_s > 0` (default 0 = disabled, relying on node heartbeats instead)
- Configurable via `routing.probe_interval_s` and `routing.probe_timeout_s`; see the loop sketch below
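
A sketch of such a loop; the registry update is stubbed with a print, and the real `ServiceProber` tracks per-service state:

```python
import asyncio
import httpx

async def probe_loop(service_urls: list[str], interval_s: float, timeout_s: float):
    if interval_s <= 0:
        return  # disabled: rely on node heartbeats instead
    async with httpx.AsyncClient(timeout=timeout_s) as client:
        while True:
            for url in service_urls:
                try:
                    resp = await client.get(f"{url}/health")
                    healthy = resp.status_code == 200
                except httpx.HTTPError:
                    healthy = False
                # The real code writes only the registry's state_json here.
                print(url, "healthy" if healthy else "unhealthy")
            await asyncio.sleep(interval_s)
```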

**Routing strategies — all three implemented:**

- `routing.default_strategy` in config; `Registry(routing_strategy=...)` dispatches
- `scored` (default): picks best-scoring service per role
- `round_robin`: cycles through healthy candidates; in-memory counter, resets on restart
- `least_loaded`: picks service with lowest `queue_depth + in_flight` from observed metrics; falls back to latency as a secondary signal when load metrics are equal (selection sketch below)
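
A sketch of the `least_loaded` selection; the metric field names mirror the observed signals described in this document, while the candidate dicts are illustrative:

```python
def pick_least_loaded(candidates: list[dict]) -> dict:
    """Lowest queue_depth + in_flight wins; latency breaks ties."""
    def load_key(svc: dict) -> tuple:
        load = (svc.get("queue_depth") or 0) + (svc.get("in_flight") or 0)
        return (load, svc.get("latency_ms") or float("inf"))
    return min(candidates, key=load_key)

services = [
    {"id": "a", "queue_depth": 2, "in_flight": 1, "latency_ms": 120},
    {"id": "b", "queue_depth": 0, "in_flight": 1, "latency_ms": 300},
]
print(pick_least_loaded(services)["id"])  # "b": lowest combined load
```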

**Streaming chat completions:**

- `UpstreamClient.chat_completions_stream()` — async generator, yields raw SSE bytes using `httpx.AsyncClient.stream()`; raises `UpstreamError` before the first yield on a non-2xx status (passthrough sketch below)
- `_prepare_chat_upstream()` extracted from `proxy_chat_completion` — synchronous routing/policy step so `ProxyError` can be caught before the `StreamingResponse` is created
- `stream_chat_completion()` — async generator wrapping `chat_completions_stream`, applies `_strip_reasoning_from_sse_chunk()` to each SSE data line
- Route handler detects `body.get("stream")`, resolves the route eagerly, and returns a `StreamingResponse` with `Cache-Control: no-cache` and `X-Accel-Buffering: no` headers
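
A sketch of the passthrough generator, assuming an `httpx` upstream and a Starlette/FastAPI `StreamingResponse` on the outside; the real code raises `UpstreamError` and filters each data line through `_strip_reasoning_from_sse_chunk()`:

```python
import httpx

async def chat_completions_stream(base_url: str, body: dict):
    async with httpx.AsyncClient(base_url=base_url, timeout=None) as client:
        async with client.stream("POST", "/v1/chat/completions", json=body) as resp:
            if resp.status_code >= 300:
                # Read and fail before the first yield so the route handler
                # can still return a normal error response.
                await resp.aread()
                raise RuntimeError(f"upstream returned {resp.status_code}")
            async for chunk in resp.aiter_bytes():
                yield chunk  # raw SSE bytes, passed through untouched
```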

**Upstream model discovery (node agent):**

- `discover_ollama_assets()` — queries `/api/tags`; marks all as `loaded: False` (available, not necessarily in VRAM)
- `_get_ollama_ps_models()` — internal helper; queries `/api/ps`; returns the raw model list (with `size_in_vram` etc.) for reuse without extra HTTP requests
- `query_ollama_ps()` — public wrapper; returns a frozenset of VRAM-loaded model names
- `discover_openai_models()` — queries `/v1/models`; marks all as `loaded: True`
- `enrich_service_assets(service, *, protocol)` — for `"ollama"`: two-phase query (tags + ps); updates the `loaded` state of existing static assets as well as adding new ones; stale `loaded: True` in config is corrected to `False` if the model isn't in `/api/ps`; populates `observed.loaded_model_count` and `observed.vram_used_bytes` from the `/api/ps` response (two-phase sketch below)
- Per-service `discover_protocol: "ollama" | "openai" | null` config field
- Heartbeat zips service dicts with config objects to pass the protocol correctly
- Separate httpx discovery client allocated only when a service opts in
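
A sketch of the two-phase Ollama query; `/api/tags` and `/api/ps` are real Ollama endpoints, while the asset shaping and the `size_vram` field name are assumptions based on the current Ollama API docs:

```python
import httpx

async def discover_ollama(base_url: str) -> tuple[list[dict], dict]:
    async with httpx.AsyncClient(base_url=base_url, timeout=10.0) as client:
        tags = (await client.get("/api/tags")).json().get("models", [])
        ps = (await client.get("/api/ps")).json().get("models", [])
    loaded = frozenset(m["name"] for m in ps)  # models currently in VRAM
    assets = [{"name": m["name"], "loaded": m["name"] in loaded} for m in tags]
    observed = {
        "loaded_model_count": len(ps),
        "vram_used_bytes": sum(m.get("size_vram", 0) for m in ps),
    }
    return assets, observed
```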

**`ServiceObserved` extended:**

- `loaded_model_count: int | None` — number of models currently in VRAM (from Ollama `/api/ps`)
- `vram_used_bytes: int | None` — total VRAM used across loaded models
- Both exposed in the `_runtime_signals` signals dict for route scoring visibility (shape sketch below)
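
A shape sketch of the extended record; the real model may be a Pydantic model rather than a dataclass, and the first three fields are inferred from the runtime signals listed earlier:

```python
from dataclasses import dataclass

@dataclass
class ServiceObserved:
    latency_ms: float | None = None
    queue_depth: int | None = None
    in_flight: int | None = None
    loaded_model_count: int | None = None  # models currently in VRAM (/api/ps)
    vram_used_bytes: int | None = None     # total VRAM across loaded models
```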

**Tests:**

- Registry, chat proxy, node inventory, benchmark runner, full demo flow
- ServiceProber probe_once, update_service_health, discover_ollama_assets, enrich_service_assets, observed metrics population — all passing (47 total)

---

## Known Gaps and Issues

No confirmed defects remain in the current implementation; the items below are improvement areas rather than blockers:

### 1. Discovery covers Ollama and OpenAI-compatible; faster-whisper not covered

Transcription services (faster-whisper, WhisperX) don't expose `/api/tags` or `/v1/models`. A `discover_protocol: "whisper"` variant could query `GET /inference/v1/models` or read a static manifest.
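
A hedged sketch of such a variant; `GET /inference/v1/models` is the endpoint floated above and may not exist on a given server, hence the static-manifest fallback:

```python
import httpx

async def discover_whisper_assets(base_url: str, manifest: list[str]) -> list[dict]:
    try:
        async with httpx.AsyncClient(base_url=base_url, timeout=5.0) as client:
            resp = await client.get("/inference/v1/models")
            resp.raise_for_status()
            names = [m["id"] for m in resp.json().get("data", [])]
    except httpx.HTTPError:
        names = manifest  # fall back to a static manifest from config
    return [{"name": n, "loaded": True} for n in names]  # assume resident
```
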
### 2. `architecture.md` could be tightened further

Minor: some sections inherited from earlier drafts could be simplified now that the implementation is stable.

---

## Next Work

1. **Live end-to-end demo** — run control + node against a real upstream (Ollama or llama.cpp) and validate: chat via role, direct asset addressing, Ollama dynamic discovery with correct load state, `least_loaded` routing with real VRAM metrics, and streaming.

2. **Validate Codex-friendly `/v1/models` offload** — test `GET /v1/models` as a programmatic service catalog for a Claude Code or Codex client selecting a GenieHive-hosted model for lower-complexity subtasks.

3. **`queue_depth` / `in_flight` from Ollama** — populate from `/api/ps` model count or from a sidecar queue tracker; currently only set from static config.

---

## V1.5 Scope (Not Yet Started)

- mTLS between control plane and node agents
- Scoped client tokens (read-only vs. operator vs. admin)
- Active load-aware model swapping (trigger unload/load on a node based on demand)
- Image and TTS generation adapter stubs
- Streaming response passthrough for chat completions (already shipped in v1; see **Streaming chat completions** above)

---

## Non-Goals (Unchanged from Original Spec)

- Peer-to-peer consensus
- Autonomous global model swapping across many nodes
- Full WAN zero-trust platform
- Distributed vector database management
- Billing or multi-tenant quota accounting