GenieHive Architecture
Last updated: 2026-04-27
Mission
GenieHive is a local-first control plane for heterogeneous generative AI services running across one or more hosts. It provides:
- Registration and health tracking for distributed AI services
- A stable, OpenAI-compatible client-facing API
- Role-based routing and scheduling over multiple services
- Integrated benchmarking and performance-informed route scoring
It is not just another OpenAI-compatible gateway. The control-plane layer adds topology awareness, role abstraction, and signal-driven routing that a simple proxy does not provide.
Four Layers
┌─────────────────────────────────────────────┐
│ Client Facades │
│ OpenAI-compatible completions + embeddings │
│ Operator inspection API │
├─────────────────────────────────────────────┤
│ Control API │
│ Registry · Role catalog · Route resolution │
│ Scheduling · Benchmark store │
├─────────────────────────────────────────────┤
│ Node Agent(s) │
│ Host discovery · Service enumeration │
│ Telemetry reporting · Heartbeat │
├─────────────────────────────────────────────┤
│ Provider Adapters │
│ OpenAI-compatible chat / embeddings │
│ Transcription (partial) │
└─────────────────────────────────────────────┘
Core Concepts
Host
A physical or virtual machine participating in the cluster.
Service
A concrete callable capability on a host: a chat endpoint, an embeddings endpoint, or a transcription endpoint. A host typically exposes multiple services.
Asset
A model weight, model name, or runtime target that a service can serve. Assets carry
optional request_policy fields that adjust how requests are shaped before forwarding.
Role
A reusable task profile that describes how requests should be fulfilled, not which model fills them. A role has a prompt policy (system prompt injection, body defaults) and a routing policy (preferred model families, minimum context size, loaded-first preference). The same role can route to different services as cluster state changes.
Route Resolution
- If `model` matches a loaded, healthy asset or service alias → route directly.
- If `model` matches a known role → score eligible services and route to the best.
- Otherwise → fail with a clear 404.
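The resolution order above can be sketched as follows. The data shapes and names (`assets`, `roles`, the per-service `score`) are stand-ins for illustration, not GenieHive's real API.

```python
def resolve_route(model: str, assets: dict, roles: dict, kind: str = "chat"):
    """Illustrative three-step resolution: direct match, role match, 404."""
    # 1. Direct match: a loaded, healthy asset or service alias.
    asset = assets.get(model)
    if asset and asset["healthy"] and asset["loaded"]:
        return ("direct", asset["service"])
    # 2. Role match: filter eligible services by kind/health, pick the best score.
    role = roles.get(model)
    if role:
        eligible = [s for s in role["services"]
                    if s["healthy"] and s["kind"] == kind]
        if eligible:
            best = max(eligible, key=lambda s: s["score"])
            return ("role", best["name"])
    # 3. No match: the API layer translates this into a 404.
    raise LookupError(f"no asset, alias, or role named {model!r}")
```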
Data Flow: Chat Completion
Client POST /v1/chat/completions
│
▼
resolve_route(model, kind="chat")
├─ direct: asset_id or service alias match
└─ role: filter by kind/health → score by runtime + benchmark signals
│
▼
apply_request_policy(request, asset, role)
├─ deep-merge body_defaults
├─ apply system prompt (prepend / append / replace)
└─ auto-infer Qwen3 template kwargs if needed
│
▼
UpstreamClient.chat_completions(endpoint, modified_request)
│
▼
_strip_reasoning_fields(response) ← removes reasoning_content / reasoning
│
▼
Response to client
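The request-shaping and response-cleanup steps in the flow above can be sketched like this. The helper names are assumptions modelled on the diagram, not the project's actual functions.

```python
def deep_merge(defaults: dict, request: dict) -> dict:
    """Recursively merge dicts; request values win over body_defaults."""
    out = dict(defaults)
    for key, val in request.items():
        if isinstance(val, dict) and isinstance(out.get(key), dict):
            out[key] = deep_merge(out[key], val)
        else:
            out[key] = val
    return out

def apply_system_prompt(messages: list, prompt: str, mode: str = "prepend") -> list:
    """Inject a policy system prompt in prepend / append / replace mode."""
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    if mode == "replace" or not system:
        merged = prompt
    elif mode == "prepend":
        merged = prompt + "\n" + system[0]["content"]
    else:  # append
        merged = system[0]["content"] + "\n" + prompt
    return [{"role": "system", "content": merged}] + rest

def strip_reasoning_fields(response: dict) -> dict:
    """Drop reasoning_content / reasoning before returning to the client."""
    for choice in response.get("choices", []):
        msg = choice.get("message", {})
        msg.pop("reasoning_content", None)
        msg.pop("reasoning", None)
    return response
```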
Scoring
Route scoring combines three signal families:
| Signal family | Weight (role) | Weight (service) |
|---|---|---|
| Text overlap | 30% | 20% |
| Runtime | 30% | 45% |
| Benchmark | 25% | 35% |
| Family pref. | 15% | — |
Runtime signals (from last heartbeat):
- Loaded state: +0.35
- Latency bands: p50 <500 ms +0.30, <1500 ms +0.20, <3000 ms +0.10, else +0.05
- Throughput: ≥40 tok/s +0.20, ≥20 +0.10
- Queue depth: penalty −0.20 if ≥5, −0.10 if ≥2
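The runtime bands above fold into a single score; a minimal sketch, assuming the heartbeat exposes these four fields under these (hypothetical) names:

```python
def runtime_score(loaded: bool, p50_ms: float, tok_per_s: float,
                  queue_depth: int) -> float:
    """Combine heartbeat signals using the bands listed above."""
    score = 0.35 if loaded else 0.0
    # Latency bands on p50.
    if p50_ms < 500:
        score += 0.30
    elif p50_ms < 1500:
        score += 0.20
    elif p50_ms < 3000:
        score += 0.10
    else:
        score += 0.05
    # Throughput bonus.
    if tok_per_s >= 40:
        score += 0.20
    elif tok_per_s >= 20:
        score += 0.10
    # Queue-depth penalty.
    if queue_depth >= 5:
        score -= 0.20
    elif queue_depth >= 2:
        score -= 0.10
    return score
```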
Benchmark signals (from ingested workload runs):
- Workload overlap score (Jaccard-style token overlap)
- Quality score from results
Combined benchmark score: 0.45 * overlap + 0.55 * quality
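As a sketch of how the two benchmark signals combine, assuming a simple whitespace tokenizer for the Jaccard-style overlap (the real metric may tokenize differently):

```python
def workload_overlap(a: str, b: str) -> float:
    """Jaccard similarity over whitespace-split, lowercased tokens."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)

def benchmark_score(overlap: float, quality: float) -> float:
    """Weighted combination from the section above."""
    return 0.45 * overlap + 0.55 * quality
```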
Topology
Minimum viable (single machine):
control plane + node agent + model server
all on 127.0.0.1, different ports
Recommended (small cluster):
1 control plane host
2+ node-agent hosts, each with 1+ model servers
1+ clients on LAN
Auth:
- Client requests: `X-Api-Key` header
- Node registration/heartbeat: `X-GenieHive-Node-Key` header
- Empty key lists disable auth (development only)
- mTLS between control and nodes planned for v1.5
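A client call with the API-key header could be built like this. The host, port, and key are placeholders; only the `X-Api-Key` header name comes from the section above.

```python
import json
import urllib.request

def chat_request(base_url: str, api_key: str, body: dict) -> urllib.request.Request:
    """Build (but do not send) an OpenAI-compatible chat request."""
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(body).encode(),
        headers={"X-Api-Key": api_key, "Content-Type": "application/json"},
        method="POST",
    )

# Placeholder endpoint and key for a local development setup.
req = chat_request(
    "http://127.0.0.1:8080",
    "dev-key",
    {"model": "summarize", "messages": [{"role": "user", "content": "hi"}]},
)
```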
State Store
SQLite. Schema:
| Table | Content |
|---|---|
| `hosts` | Host registration, resources, labels |
| `services` | Service config, runtime, assets, observed |
| `roles` | Role catalog |
| `benchmark_samples` | Workload results per service |
Default path: state/geniehive.sqlite3
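A hypothetical minimal schema matching the four tables above; only the table names come from this document, and every column is an assumption (the real store likely differs).

```python
import sqlite3

# In-memory stand-in; the real store lives at state/geniehive.sqlite3.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE hosts (
    id TEXT PRIMARY KEY, resources TEXT, labels TEXT);
CREATE TABLE services (
    id TEXT PRIMARY KEY, host_id TEXT REFERENCES hosts(id),
    config TEXT, runtime TEXT, assets TEXT, observed TEXT);
CREATE TABLE roles (
    name TEXT PRIMARY KEY, definition TEXT);
CREATE TABLE benchmark_samples (
    id INTEGER PRIMARY KEY, service_id TEXT REFERENCES services(id),
    workload TEXT, results TEXT);
""")
```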
API Reference Summary
Client API
| Endpoint | Status |
|---|---|
| `GET /v1/models` | Implemented |
| `POST /v1/chat/completions` | Implemented |
| `POST /v1/embeddings` | Implemented |
| `POST /v1/audio/transcriptions` | Stub only |
Operator API
| Endpoint | Status |
|---|---|
| `GET /v1/cluster/hosts` | Implemented |
| `GET /v1/cluster/services` | Implemented |
| `GET /v1/cluster/roles` | Implemented |
| `GET /v1/cluster/benchmarks` | Implemented |
| `GET /v1/cluster/health` | Implemented |
| `GET /v1/cluster/routes/resolve` | Implemented |
| `POST /v1/cluster/routes/match` | Implemented |
Node API
| Endpoint | Status |
|---|---|
| `POST /v1/nodes/register` | Implemented |
| `POST /v1/nodes/heartbeat` | Implemented |
| `GET /v1/node/inventory` | Implemented |
| `GET /v1/node/registration` | Implemented |
Supported Upstream Backends
Any OpenAI-compatible HTTP server. Tested configurations:
- Ollama — chat and embeddings
- llama.cpp (server mode) — chat and embeddings
- llamafile — chat
- vLLM — chat and embeddings
Non-Goals for V1
- Peer-to-peer consensus
- Autonomous global model swapping
- WAN zero-trust networking
- Image and TTS generation
- Distributed vector databases
- Billing or multi-tenant quotas