# GenieHive Roadmap
Last updated: 2026-04-27 (P0–P2 complete + routing strategies + streaming + Ollama load state + observed metrics)
## What Is Complete
The v1 core is implemented and tested.
**Registry and cluster control:**
- SQLite-backed registry with hosts, services, roles, and benchmark samples
- Node registration and heartbeat protocol with auto-re-registration on 404
- Role catalog loading from YAML
- Route resolution: direct asset/service match → role resolution → clear failure (sketched below)
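
The resolution priority in the last bullet can be pictured as a small dispatch function. A minimal sketch, assuming hypothetical registry helpers (`find_service_for_asset`, `get_role`, `resolve_role`); the real logic lives in the registry module:

```python
class RouteNotFound(LookupError):
    """Neither a direct asset/service match nor a role resolved."""

def resolve_route_sketch(registry, target: str, operation: str):
    # 1. A direct asset or service match wins outright.
    service = registry.find_service_for_asset(target, operation)
    if service is not None:
        return service
    # 2. Otherwise try role resolution (which may walk fallback_roles).
    role = registry.get_role(target)
    if role is not None:
        return registry.resolve_role(role, operation)
    # 3. Clear failure rather than a silent default.
    raise RouteNotFound(f"no asset, service, or role matches {target!r}")
```
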
**Client-facing API:**
- `GET /v1/models` — OpenAI-compatible model list with rich metadata (loaded state,
latency hints, offload classification, role aliases)
- `POST /v1/chat/completions` — proxies to upstream with request policy application
- `POST /v1/embeddings` — proxies to upstream
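
A minimal client sketch of how these endpoints are meant to be consumed; the gateway address, API key, and role alias below are placeholders, only the endpoint paths and the `X-Api-Key` header come from this roadmap:

```python
import httpx

GATEWAY = "http://localhost:8000"    # hypothetical control-plane address
HEADERS = {"X-Api-Key": "dev-key"}   # empty key lists disable auth in development

def list_models() -> list[dict]:
    resp = httpx.get(f"{GATEWAY}/v1/models", headers=HEADERS, timeout=10.0)
    resp.raise_for_status()
    return resp.json()["data"]

def chat(model_or_role: str, prompt: str) -> str:
    body = {
        "model": model_or_role,  # a role alias or a direct asset name
        "messages": [{"role": "user", "content": prompt}],
    }
    resp = httpx.post(f"{GATEWAY}/v1/chat/completions", json=body,
                      headers=HEADERS, timeout=60.0)
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]

if __name__ == "__main__":
    print([m["id"] for m in list_models()])
    print(chat("chat.general", "One-sentence summary of GenieHive routing, please."))
```
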
**Request policy system:**
- Body defaults and overrides via deep merge
- System prompt injection (prepend / append / replace)
- Per-asset and per-role policies, merged with role winning on prompts
- Qwen3 / Qwen3.5 auto-detection with `enable_thinking: false` applied automatically
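
The merge and injection behaviour can be illustrated as follows; function and field names here are assumptions, not the actual policy module API:

```python
from copy import deepcopy
from typing import Any

def deep_merge(base: dict[str, Any], override: dict[str, Any]) -> dict[str, Any]:
    """Recursively merge `override` into a copy of `base`: nested dicts merge
    key by key, everything else in `override` wins."""
    merged = deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged

def apply_system_prompt(messages: list[dict], text: str, mode: str) -> list[dict]:
    """Prepend to, append to, or replace the system message of an
    OpenAI-style message list."""
    rest = [m for m in messages if m["role"] != "system"]
    existing = next((m["content"] for m in messages if m["role"] == "system"), "")
    if mode == "prepend":
        content = f"{text}\n{existing}".strip()
    elif mode == "append":
        content = f"{existing}\n{text}".strip()
    else:  # "replace"
        content = text
    return [{"role": "system", "content": content}, *rest]
```

In this picture the per-asset policy is applied first and the per-role policy merged on top, which is how the role's prompt settings end up winning.
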
**Client-facing proxy:**
- `POST /v1/audio/transcriptions` — proxies multipart audio to upstream; uses a
real httpx client for multipart form-data (not the injectable `AsyncPoster` Protocol)
**Route matching and scoring:**
- `POST /v1/cluster/routes/match` — scored candidate list for role and service targets
- Signals: text overlap, preferred family, runtime (loaded state, latency, throughput,
queue depth), benchmark (workload overlap, quality score)
- `GET /v1/cluster/routes/resolve` — quick single-model resolution
- `fallback_roles` chain in `resolve_route()` — walks role fallbacks with cycle
protection; each fallback resolves using its own operation (not the primary's kind); sketched below
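
A sketch of the fallback walk, assuming a plain `roles` dict and a caller-supplied `resolve_one` callback; the real `resolve_route()` works against the registry and carries its own operation handling:

```python
from collections import deque
from typing import Callable, Optional

def resolve_with_fallbacks(
    roles: dict[str, dict],
    role_name: str,
    resolve_one: Callable[[dict], Optional[str]],
) -> Optional[str]:
    """Walk a role's fallback_roles chain with cycle protection."""
    seen: set[str] = set()
    queue = deque([role_name])
    while queue:
        current = queue.popleft()
        if current in seen:          # cycle protection
            continue
        seen.add(current)
        role = roles.get(current)
        if role is None:
            continue
        # Each fallback resolves with its own operation kind, so resolve_one
        # is expected to read the operation off the role it is handed.
        target = resolve_one(role)
        if target is not None:
            return target
        queue.extend(role.get("fallback_roles", []))
    return None                      # clear failure: nothing in the chain resolved
```
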
**Benchmark infrastructure:**
- Built-in workloads: `chat.short_reasoning`, `chat.concise_support`
- `run_benchmark_workload.py` executes workloads and emits a JSON report
- `ingest_benchmark_report.py` posts results to the control plane
- Benchmark samples feed the route scoring pipeline
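
Rough shape of the two scripts; the workload content, report fields, ingest path, and auth header below are placeholders rather than the real schema:

```python
import time

import httpx

def run_workload(base_url: str, model: str, prompts: list[str]) -> dict:
    """Execute a workload against an OpenAI-compatible endpoint and build a report."""
    samples = []
    for prompt in prompts:
        start = time.perf_counter()
        resp = httpx.post(
            f"{base_url}/v1/chat/completions",
            json={"model": model, "messages": [{"role": "user", "content": prompt}]},
            timeout=120.0,
        )
        resp.raise_for_status()
        samples.append({
            "latency_s": time.perf_counter() - start,
            "usage": resp.json().get("usage", {}),
        })
    return {"workload": "chat.short_reasoning", "model": model, "samples": samples}

def ingest_report(control_url: str, report: dict, api_key: str) -> None:
    """Post a report to the control plane so it can feed route scoring."""
    httpx.post(f"{control_url}/v1/cluster/benchmarks", json=report,
               headers={"X-Api-Key": api_key}, timeout=10.0).raise_for_status()
```
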
**Operator inspection:**
- `GET /v1/cluster/hosts`, `/services`, `/roles`, `/benchmarks`, `/health`
**Auth:**
- Client API key (`X-Api-Key`) and node registration key (`X-GenieHive-Node-Key`)
- Empty key lists disable auth for development
**Active health probing (control plane):**
- `ServiceProber` in `probe.py` probes each service's `GET /health` endpoint
- Health divergences update the registry's `state_json` without touching other fields
- Background `probe_loop` task launched at app startup when
`routing.probe_interval_s > 0` (default 0 = disabled, relies on node heartbeats)
- Configurable via `routing.probe_interval_s` and `routing.probe_timeout_s`
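
The startup wiring is roughly the following; `prober.probe_once()` stands in for the real `ServiceProber` API:

```python
import asyncio
import contextlib

async def probe_loop(prober, interval_s: float) -> None:
    """Probe every registered service's GET /health on a fixed interval."""
    while True:
        with contextlib.suppress(Exception):   # one failed sweep must not kill the loop
            await prober.probe_once()
        await asyncio.sleep(interval_s)

def maybe_start_probe_loop(prober, interval_s: float) -> asyncio.Task | None:
    """Launched at app startup; disabled (None) unless routing.probe_interval_s > 0."""
    if interval_s > 0:
        return asyncio.create_task(probe_loop(prober, interval_s))
    return None   # default: rely on node heartbeats only
```
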
**Routing strategies — all three implemented:**
- `routing.default_strategy` in config; `Registry(routing_strategy=...)` dispatches
- `scored` (default): picks best-scoring service per role
- `round_robin`: cycles through healthy candidates; in-memory counter, resets on restart
- `least_loaded`: picks service with lowest `queue_depth + in_flight` from observed
metrics; falls back to latency as a secondary signal when load metrics are equal
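
The two non-default strategies boil down to a couple of lines each; `Candidate` and its fields are illustrative stand-ins for the registry's service records:

```python
import itertools
from dataclasses import dataclass

@dataclass
class Candidate:
    name: str
    queue_depth: int = 0
    in_flight: int = 0
    latency_ms: float = float("inf")

_rr_counters: dict[str, itertools.count] = {}

def pick_round_robin(role: str, healthy: list[Candidate]) -> Candidate:
    """In-memory counter per role; position is lost when the process restarts."""
    counter = _rr_counters.setdefault(role, itertools.count())
    return healthy[next(counter) % len(healthy)]

def pick_least_loaded(healthy: list[Candidate]) -> Candidate:
    """Lowest queue_depth + in_flight wins; latency breaks ties."""
    return min(healthy, key=lambda c: (c.queue_depth + c.in_flight, c.latency_ms))
```
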
**Streaming chat completions:**
- `UpstreamClient.chat_completions_stream()` — async generator, yields raw SSE bytes
using `httpx.AsyncClient.stream()`; raises `UpstreamError` before first yield on
non-2xx status
- `_prepare_chat_upstream()` extracted from `proxy_chat_completion` — synchronous
routing/policy step so `ProxyError` can be caught before `StreamingResponse` is created
- `stream_chat_completion()` — async generator wrapping `chat_completions_stream`,
applies `_strip_reasoning_from_sse_chunk()` to each SSE data line
- Route handler detects `body.get("stream")`, resolves route eagerly, returns
`StreamingResponse` with `Cache-Control: no-cache, X-Accel-Buffering: no`
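
Condensed into a single generator for illustration (the real code splits this across `chat_completions_stream()`, `stream_chat_completion()`, and the route handler, and raises `UpstreamError` rather than `RuntimeError`):

```python
from collections.abc import AsyncIterator
from typing import Callable

import httpx
from fastapi.responses import StreamingResponse

async def sse_passthrough(upstream_url: str, body: dict,
                          strip_reasoning: Callable[[str], str]) -> AsyncIterator[bytes]:
    async with httpx.AsyncClient(timeout=None) as client:
        async with client.stream("POST", upstream_url, json=body) as resp:
            if resp.status_code >= 300:
                # Raise before the first yield so the caller can still return a
                # normal JSON error instead of a half-open stream.
                raise RuntimeError(f"upstream returned {resp.status_code}")
            async for line in resp.aiter_lines():
                yield strip_reasoning(line).encode() + b"\n"

def make_streaming_response(gen: AsyncIterator[bytes]) -> StreamingResponse:
    return StreamingResponse(
        gen,
        media_type="text/event-stream",
        headers={"Cache-Control": "no-cache", "X-Accel-Buffering": "no"},
    )
```
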
**Upstream model discovery (node agent):**
- `discover_ollama_assets()` — queries `/api/tags`; marks all as `loaded: False`
(available, not necessarily in VRAM)
- `_get_ollama_ps_models()` — internal helper; queries `/api/ps`; returns raw model
list (with `size_in_vram` etc.) for reuse without extra HTTP requests
- `query_ollama_ps()` — public wrapper; returns frozenset of VRAM-loaded model names
- `discover_openai_models()` — queries `/v1/models`; marks all as `loaded: True`
- `enrich_service_assets(service, *, protocol)` — for `"ollama"`: two-phase query
(tags + ps); updates `loaded` state of existing static assets as well as adding
new ones; stale `loaded: True` in config gets corrected to `False` if the model
isn't in `/api/ps`; populates `observed.loaded_model_count` and
`observed.vram_used_bytes` from `/api/ps` response
- Per-service `discover_protocol: "ollama" | "openai" | null` config field
- Heartbeat zips service dicts with config objects to pass protocol correctly
- Separate httpx discovery client allocated only when any service opts in
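
The two-phase Ollama query, compressed into one sketch; helper names and the exact `/api/ps` field used for VRAM size are illustrative and may differ from the node agent:

```python
import httpx

async def discover_ollama(base_url: str, assets: list[dict]) -> tuple[list[dict], dict]:
    """Merge /api/tags (availability) with /api/ps (VRAM residency)."""
    async with httpx.AsyncClient(base_url=base_url, timeout=5.0) as client:
        tags = (await client.get("/api/tags")).json().get("models", [])
        ps = (await client.get("/api/ps")).json().get("models", [])

    loaded = {m["name"] for m in ps}
    by_name = {a["name"]: a for a in assets}

    for model in tags:
        asset = by_name.setdefault(model["name"], {"name": model["name"]})
        # /api/tags only proves the model is on disk; a stale `loaded: True`
        # from static config is corrected here if /api/ps disagrees.
        asset["loaded"] = model["name"] in loaded

    observed = {
        "loaded_model_count": len(ps),
        "vram_used_bytes": sum(m.get("size_vram", 0) for m in ps),
    }
    return list(by_name.values()), observed
```
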
**`ServiceObserved` extended:**
- `loaded_model_count: int | None` — number of models currently in VRAM (from Ollama `/api/ps`)
- `vram_used_bytes: int | None` — total VRAM used across loaded models
- Both exposed in `_runtime_signals` signals dict for route scoring visibility
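
Field-level view only; the surrounding fields are the runtime signals already named in this roadmap, and the real model presumably lives with the rest of the registry schema:

```python
from dataclasses import dataclass

@dataclass
class ServiceObserved:
    queue_depth: int | None = None
    in_flight: int | None = None
    latency_ms: float | None = None
    loaded_model_count: int | None = None   # models currently in VRAM (Ollama /api/ps)
    vram_used_bytes: int | None = None      # total VRAM used across loaded models
```
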
**Tests:**
- Registry, chat proxy, node inventory, benchmark runner, full demo flow
- ServiceProber probe_once, update_service_health, discover_ollama_assets,
enrich_service_assets, observed metrics population — all passing (47 total)
---
## Known Gaps and Issues
No confirmed gaps remain in the current implementation. Improvement areas:
### 1. Discovery covers Ollama and OpenAI-compatible; faster-whisper not covered
Transcription services (faster-whisper, WhisperX) don't expose `/api/tags` or
`/v1/models`. A `discover_protocol: "whisper"` variant could query
`GET /inference/v1/models` or read a static manifest.
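
Nothing is implemented here yet; one possible shape, with every name hypothetical:

```python
import json
from pathlib import Path

import httpx

async def discover_whisper_assets(base_url: str, manifest: Path | None = None) -> list[dict]:
    """Try a models endpoint first, then fall back to a static manifest."""
    try:
        async with httpx.AsyncClient(base_url=base_url, timeout=5.0) as client:
            resp = await client.get("/inference/v1/models")
            resp.raise_for_status()
            return [{"name": m["id"], "loaded": True}
                    for m in resp.json().get("data", [])]
    except httpx.HTTPError:
        if manifest and manifest.exists():
            return json.loads(manifest.read_text())
        return []
```
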
### 2. `architecture.md` could be tightened further
Minor: some sections inherited from earlier drafts could be simplified now that
the implementation is stable.
---
## Next Work
1. **Live end-to-end demo** — run control + node against a real upstream (Ollama
or llama.cpp) and validate: chat via role, direct asset addressing, Ollama
dynamic discovery with correct load state, `least_loaded` routing with real
VRAM metrics, and streaming.
2. **Validate Codex-friendly `/v1/models` offload** — test `GET /v1/models` as
a programmatic service catalog for a Claude Code or Codex client selecting
a GenieHive-hosted model for lower-complexity subtasks.
3. **`queue_depth` / `in_flight` from Ollama** — populate from `/api/ps` model
count or from a sidecar queue tracker; currently only set from static config.
---
## V1.5 Scope (Not Yet Started)
- mTLS between control plane and node agents
- Scoped client tokens (read-only vs. operator vs. admin)
- Active load-aware model swapping (trigger unload/load on a node based on demand)
- Image and TTS generation adapter stubs
---
## Non-Goals (Unchanged from Original Spec)
- Peer-to-peer consensus
- Autonomous global model swapping across many nodes
- Full WAN zero-trust platform
- Distributed vector database management
- Billing or multi-tenant quota accounting