
# GenieHive Roadmap
Last updated: 2026-04-27
## What Is Complete
The v1 core is implemented and tested.
**Registry and cluster control:**
- SQLite-backed registry with hosts, services, roles, and benchmark samples
- Node registration and heartbeat protocol with auto-re-registration on 404
- Role catalog loading from YAML
- Route resolution: direct asset/service match → role resolution → clear failure
**Client-facing API:**
- `GET /v1/models` — OpenAI-compatible model list with rich metadata (loaded state,
latency hints, offload classification, role aliases)
- `POST /v1/chat/completions` — proxies to upstream with request policy application
- `POST /v1/embeddings` — proxies to upstream
**Request policy system:**
- Body defaults and overrides via deep merge
- System prompt injection (prepend / append / replace)
- Per-asset and per-role policies, merged with role winning on prompts
- Qwen3 / Qwen3.5 auto-detection with `enable_thinking: false` applied automatically
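The merge-and-inject behavior described above can be sketched roughly as follows. This is an illustrative version, not the actual code in the repo; function and parameter names are hypothetical:

```python
from copy import deepcopy

def deep_merge(base: dict, override: dict) -> dict:
    """Recursively merge `override` into `base`; override wins on conflicts."""
    merged = deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged

def apply_system_prompt(messages: list[dict], text: str, mode: str) -> list[dict]:
    """Prepend, append, or replace the system message per the policy mode.
    For simplicity this sketch only considers the first system message."""
    existing = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    if mode == "replace" or not existing:
        return [{"role": "system", "content": text}] + rest
    current = existing[0]["content"]
    content = f"{text}\n{current}" if mode == "prepend" else f"{current}\n{text}"
    return [{"role": "system", "content": content}] + rest
```

The per-asset/per-role precedence ("role wins on prompts") would then be a matter of which policy's prompt text is passed in last.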
**Route matching and scoring:**
- `POST /v1/cluster/routes/match` — scored candidate list for role and service targets
- Signals: text overlap, preferred family, runtime (loaded state, latency, throughput,
queue depth), benchmark (workload overlap, quality score)
- `GET /v1/cluster/routes/resolve` — quick single-model resolution
**Benchmark infrastructure:**
- Built-in workloads: `chat.short_reasoning`, `chat.concise_support`
- `run_benchmark_workload.py` executes workloads and emits a JSON report
- `ingest_benchmark_report.py` posts results to the control plane
- Benchmark samples feed the route scoring pipeline
**Operator inspection:**
- `GET /v1/cluster/hosts`, `/services`, `/roles`, `/benchmarks`, `/health`
**Auth:**
- Client API key (`X-Api-Key`) and node registration key (`X-GenieHive-Node-Key`)
- Empty key lists disable auth for development
**Tests:**
- Registry, chat proxy, node inventory, benchmark runner, full demo flow
- All passing
---
## Known Gaps and Issues
These are confirmed gaps in the current implementation, not aspirational items.
### 1. Transcription endpoint not implemented
`POST /v1/audio/transcriptions` is listed in the architecture and wired into
`main.py`, but there is no upstream proxy handler for it. `upstream.py` has no
`transcriptions()` method. The endpoint currently returns nothing useful.
### 2. Routing strategy field is ignored
`RoutingConfig.default_strategy` exists in `config.py` (default: `"loaded_first"`),
but `resolve_route()` in `registry.py` does not read it. There is effectively only
one strategy. The field is misleading.
### 3. Role fallback chain is not implemented
`RoutingPolicy.fallback_roles` is defined in `models.py` and appears in the schema
docs, but `resolve_route()` never consults it. A role that fails to match any service
fails outright rather than trying its fallbacks.
### 4. `_benchmark_quality_score` can exceed 1.0 before clamping
`pass_rate` and `quality_score` are taken as `max()`, then `tokens_per_sec` and
`ttft_ms` are *added* on top. A service with `pass_rate=1.0`, fast tokens, and low
TTFT accumulates a score of up to 1.6 before the final `min(1.0, quality)` clamp.
This means the additive bonuses have no effect once `pass_rate` or `quality_score` is
already high, which is probably not the intended behavior.
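The shape of the problem can be reproduced with a small model of the described behavior (parameter names here are illustrative; this mirrors the description above, not the exact code):

```python
def benchmark_quality_score_current(pass_rate, quality_score,
                                    tps_bonus, ttft_bonus):
    """Model of the described behavior: max() of the two rates,
    then throughput/latency bonuses added on top, clamped at the end."""
    quality = max(pass_rate or 0.0, quality_score or 0.0)
    quality += tps_bonus + ttft_bonus  # raw total can reach ~1.6
    return min(1.0, quality)
```

With `pass_rate=1.0`, a service scores 1.0 whether its performance bonuses are 0.0 or 0.6, so throughput and TTFT stop differentiating candidates exactly when accuracy is high.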
### 5. Health is self-reported only
Service health (`healthy` / `unhealthy`) comes entirely from node-reported state.
The control plane does not probe upstream endpoints. A service can appear healthy
while its endpoint is unreachable.
### 6. No active model discovery from upstream services
The node agent scans for `.gguf` files on disk and reads static service config.
It does not query running Ollama or vLLM instances for their loaded model list.
A freshly pulled Ollama model will not appear until the node config is updated
and the agent restarted.
### 7. `docs/architecture.md` duplicates `GENIEWARREN_SPEC.md`
`architecture.md` contains the repo-naming rationale, name alternatives, and
implementation sequence list that are only meaningful in a design/proposal context.
These are noise in a reference architecture document.
---
## Immediate Next Work (Priority Order)
### P0 — Fix confirmed bugs
1. **Remove the misleading `default_strategy` field** or implement a dispatch table
so the config field actually selects behavior. Simplest fix: delete the field and
the dead config surface until a second strategy is implemented.
2. **Fix `_benchmark_quality_score`** so additive bonuses apply only when no
`pass_rate` / `quality_score` is available, or restructure as a weighted average
so the components don't stack additively.
### P1 — Complete stated v1 scope
3. **Implement transcription proxy** — add `upstream.transcriptions()` and wire
the handler in `chat.py` and `main.py`.
4. **Implement role fallback chain** — when `resolve_route()` finds no matching
service for a role, walk `fallback_roles` in order before failing.
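The fallback walk is straightforward; a sketch under the assumption that single-role matching is factored out as a callable (names are illustrative, not the actual `registry.py` internals):

```python
from typing import Callable, Optional

def resolve_with_fallbacks(role: str,
                           fallback_roles: dict[str, list[str]],
                           match_role: Callable[[str], Optional[str]]) -> Optional[str]:
    """Try the requested role, then each fallback in order; None if nothing matches.
    `match_role` stands in for the existing single-role resolution step."""
    tried = set()
    for candidate in [role, *fallback_roles.get(role, [])]:
        if candidate in tried:   # guard against duplicate entries in the chain
            continue
        tried.add(candidate)
        service = match_role(candidate)
        if service is not None:
            return service
    return None
```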
### P2 — Close the most important self-reported-only gaps
5. **Add active health probing** — the control plane should periodically probe
registered service endpoints (a lightweight `GET /health` or `GET /v1/models`
is sufficient) and update health state independently of node heartbeats.
6. **Add upstream model discovery for Ollama** — query `GET /api/tags` (Ollama)
or `GET /v1/models` (OpenAI-compatible) from the node agent and merge loaded
model names into the service's asset list. This enables dynamic model tracking
without config restarts.
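Both P2 items can be sketched together. The probe falls back from `/health` to `/v1/models`; the discovery half targets Ollama's `GET /api/tags` endpoint, whose response carries a `models` array with `name` fields. Function names and the stdlib-only transport are illustrative choices, not what the node agent necessarily uses:

```python
import json
import urllib.error
import urllib.request

def probe_service(base_url: str, timeout: float = 2.0) -> bool:
    """Active health probe: True if the upstream answers a lightweight GET."""
    for path in ("/health", "/v1/models"):
        try:
            with urllib.request.urlopen(base_url.rstrip("/") + path,
                                        timeout=timeout) as resp:
                if 200 <= resp.status < 300:
                    return True
        except (urllib.error.URLError, OSError):
            continue  # try the next path before declaring the service down
    return False

def parse_tags_payload(payload: dict) -> list[str]:
    """Extract model names from Ollama's /api/tags response shape."""
    return [m["name"] for m in payload.get("models", [])]

def discover_ollama_models(base_url: str, timeout: float = 5.0) -> list[str]:
    """Query a running Ollama instance for the models it has pulled."""
    url = base_url.rstrip("/") + "/api/tags"
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return parse_tags_payload(json.load(resp))
```

Running the probe on a scheduler in the control plane, independent of heartbeats, would let health state flip to unhealthy even when the node agent itself is still reporting in.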
### P3 — Documentation cleanup
7. **Revise `architecture.md`** — remove the design-phase repo-naming rationale
and first-implementation-sequence list; replace with a description of the actual
running system (the four layers as implemented, data flow diagram if possible).
8. **Update `roadmap.md`** — this file (done).
---
## Near-Term Milestones (After P0–P3)
- **Live LLM demo** — run control + node against a real upstream (Ollama or
llama.cpp) and document the end-to-end flow, including chat via role and
direct asset addressing
- **Validate Codex-friendly `/v1/models` offload** — test `GET /v1/models` as
a programmatic service catalog for a Claude Code or Codex client selecting
a GenieHive-hosted model for lower-complexity subtasks
- **Richer node metrics** — queue depth, in-flight count, and rolling performance
averages reported from node to control on every heartbeat
- **Second routing strategy** — implement `round_robin` or `least_loaded` as a
second selectable strategy, then make `default_strategy` actually dispatch
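A dispatch table for that milestone might be sketched like this (strategy names come from the roadmap; everything else is a hypothetical illustration):

```python
import itertools
from typing import Callable

# per-role rotation state for round-robin selection
_rr_counters: dict[str, "itertools.count"] = {}

def pick_round_robin(role: str, services: list[str]) -> str:
    """Rotate through the matching services for a role."""
    counter = _rr_counters.setdefault(role, itertools.count())
    return services[next(counter) % len(services)]

STRATEGIES: dict[str, Callable[[str, list[str]], str]] = {
    "loaded_first": lambda role, services: services[0],  # stand-in for current behavior
    "round_robin": pick_round_robin,
}

def resolve(strategy: str, role: str, services: list[str]) -> str:
    """Make `default_strategy` actually select behavior."""
    return STRATEGIES[strategy](role, services)
```

With two entries in the table, `RoutingConfig.default_strategy` stops being dead config surface and becomes the dispatch key.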
---
## V1.5 Scope (Not Yet Started)
- mTLS between control plane and node agents
- Scoped client tokens (read-only vs. operator vs. admin)
- Active load-aware model swapping (trigger unload/load on a node based on demand)
- Image and TTS generation adapter stubs
- Streaming response passthrough for chat completions
---
## Non-Goals (Unchanged from Original Spec)
- Peer-to-peer consensus
- Autonomous global model swapping across many nodes
- Full WAN zero-trust platform
- Distributed vector database management
- Billing or multi-tenant quota accounting