# GenieHive Architecture

Last updated: 2026-04-27

## Mission

GenieHive is a local-first control plane for heterogeneous generative AI services running across one or more hosts. It provides:

- Registration and health tracking for distributed AI services
- A stable, OpenAI-compatible client-facing API
- Role-based routing and scheduling over multiple services
- Integrated benchmarking and performance-informed route scoring

It is not a plain OpenAI-compatible gateway. The control plane layer adds topology awareness, role abstraction, and signal-driven routing that a dumb proxy does not provide.

---

## Four Layers

```
┌─────────────────────────────────────────────┐
│ Client Facades                              │
│ OpenAI-compatible completions + embeddings  │
│ Operator inspection API                     │
├─────────────────────────────────────────────┤
│ Control API                                 │
│ Registry · Role catalog · Route resolution  │
│ Scheduling · Benchmark store                │
├─────────────────────────────────────────────┤
│ Node Agent(s)                               │
│ Host discovery · Service enumeration        │
│ Telemetry reporting · Heartbeat             │
├─────────────────────────────────────────────┤
│ Provider Adapters                           │
│ OpenAI-compatible chat / embeddings         │
│ Transcription (partial)                     │
└─────────────────────────────────────────────┘
```

---

## Core Concepts

### Host

A physical or virtual machine participating in the cluster.

### Service

A concrete callable capability on a host: a chat endpoint, an embeddings endpoint, or a transcription endpoint. A host typically exposes multiple services.

### Asset

A model weight, model name, or runtime target that a service can serve. Assets carry optional `request_policy` fields that adjust how requests are shaped before forwarding.

### Role

A reusable task profile that describes *how* requests should be fulfilled, not *which* model fulfills them. A role has a prompt policy (system prompt injection, body defaults) and a routing policy (preferred model families, minimum context size, loaded-first preference).
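As an illustration only, a role combining these two policies might look like the following sketch. The field names here are assumptions chosen to mirror the concepts above, not GenieHive's actual schema:

```python
# Hypothetical role record; field names are illustrative, not the real schema.
summarizer_role = {
    "name": "summarizer",
    "prompt_policy": {
        "system_prompt": "You are a concise summarizer.",
        "system_prompt_mode": "prepend",        # prepend / append / replace
        "body_defaults": {"temperature": 0.2},  # merged into the request body
    },
    "routing_policy": {
        "preferred_families": ["qwen3", "llama"],  # preferred model families
        "min_context": 8192,                       # minimum context size
        "prefer_loaded": True,                     # loaded-first preference
    },
}
```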
The same role can route to different services as cluster state changes.

### Route Resolution

1. If `model` matches a loaded, healthy asset or service alias → route directly.
2. If `model` matches a known role → score eligible services and route to the best.
3. Otherwise → fail with a clear 404.

---

## Data Flow: Chat Completion

```
Client POST /v1/chat/completions
        │
        ▼
resolve_route(model, kind="chat")
  ├─ direct: asset_id or service alias match
  └─ role: filter by kind/health → score by runtime + benchmark signals
        │
        ▼
apply_request_policy(request, asset, role)
  ├─ deep-merge body_defaults
  ├─ apply system prompt (prepend / append / replace)
  └─ auto-infer Qwen3 template kwargs if needed
        │
        ▼
UpstreamClient.chat_completions(endpoint, modified_request)
        │
        ▼
_strip_reasoning_fields(response)   ← removes reasoning_content / reasoning
        │
        ▼
Response to client
```

---

## Scoring

Route scoring combines up to four signal families (family preference applies only to role-based routes):

| Signal family | Weight (role) | Weight (service) |
|---------------|---------------|------------------|
| Text overlap  | 30%           | 20%              |
| Runtime       | 30%           | 45%              |
| Benchmark     | 25%           | 35%              |
| Family pref.  | 15%           | —                |

**Runtime signals** (from the last heartbeat):

- Loaded state: +0.35
- Latency bands: p50 <500 ms +0.30, <1500 ms +0.20, <3000 ms +0.10, else +0.05
- Throughput: ≥40 tok/s +0.20, ≥20 tok/s +0.10
- Queue depth: penalty −0.20 if ≥5, −0.10 if ≥2

**Benchmark signals** (from ingested workload runs):

- Workload overlap score (Jaccard-style token overlap)
- Quality score from results, combined as `0.45 * overlap + 0.55 * quality`

---

## Topology

**Minimum viable (single machine):**

```
control plane + node agent + model server
all on 127.0.0.1, different ports
```

**Recommended (small cluster):**

```
1 control plane host
2+ node-agent hosts, each with 1+ model servers
1+ clients on LAN
```

**Auth:**

- Client requests: `X-Api-Key` header
- Node registration/heartbeat: `X-GenieHive-Node-Key` header
- Empty key lists disable auth (development only)
- mTLS between control plane and nodes planned for v1.5

---

## State Store

SQLite. Schema:

| Table               | Content                                    |
|---------------------|--------------------------------------------|
| `hosts`             | Host registration, resources, labels       |
| `services`          | Service config, runtime, assets, observed  |
| `roles`             | Role catalog                               |
| `benchmark_samples` | Workload results per service               |

Default path: `state/geniehive.sqlite3`

---

## API Reference Summary

### Client API

| Endpoint                        | Status      |
|---------------------------------|-------------|
| `GET /v1/models`                | Implemented |
| `POST /v1/chat/completions`     | Implemented |
| `POST /v1/embeddings`           | Implemented |
| `POST /v1/audio/transcriptions` | Stub only   |

### Operator API

| Endpoint                         | Status      |
|----------------------------------|-------------|
| `GET /v1/cluster/hosts`          | Implemented |
| `GET /v1/cluster/services`       | Implemented |
| `GET /v1/cluster/roles`          | Implemented |
| `GET /v1/cluster/benchmarks`     | Implemented |
| `GET /v1/cluster/health`         | Implemented |
| `GET /v1/cluster/routes/resolve` | Implemented |
| `POST /v1/cluster/routes/match`  | Implemented |

### Node API

| Endpoint                    | Status      |
|-----------------------------|-------------|
| `POST /v1/nodes/register`   | Implemented |
| `POST /v1/nodes/heartbeat`  | Implemented |
| `GET /v1/node/inventory`    | Implemented |
| `GET /v1/node/registration` | Implemented |

---

## Supported Upstream Backends

Any OpenAI-compatible HTTP server. Tested configurations:

- **Ollama** — chat and embeddings
- **llama.cpp** (server mode) — chat and embeddings
- **llamafile** — chat
- **vLLM** — chat and embeddings

---

## Non-Goals for V1

- Peer-to-peer consensus
- Autonomous global model swapping
- WAN zero-trust networking
- Image and TTS generation
- Distributed vector databases
- Billing or multi-tenant quotas
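
---

## Appendix: Scoring Sketch

The runtime and benchmark scoring bands listed under Scoring can be sketched as follows. This is a minimal illustration of the published bands only; the function names and signatures are assumptions, not GenieHive's actual internals:

```python
def runtime_score(loaded: bool, p50_ms: float, tok_per_s: float,
                  queue_depth: int) -> float:
    """Runtime-signal score following the bands in the Scoring section."""
    score = 0.0
    if loaded:
        score += 0.35            # loaded state
    if p50_ms < 500:             # latency bands (p50)
        score += 0.30
    elif p50_ms < 1500:
        score += 0.20
    elif p50_ms < 3000:
        score += 0.10
    else:
        score += 0.05
    if tok_per_s >= 40:          # throughput bands
        score += 0.20
    elif tok_per_s >= 20:
        score += 0.10
    if queue_depth >= 5:         # queue-depth penalty
        score -= 0.20
    elif queue_depth >= 2:
        score -= 0.10
    return score


def benchmark_score(overlap: float, quality: float) -> float:
    """Combine workload overlap and quality as documented."""
    return 0.45 * overlap + 0.55 * quality
```

For example, a loaded service with p50 of 400 ms, 45 tok/s, and an empty queue would score 0.35 + 0.30 + 0.20 = 0.85 on runtime signals before the family weights are applied.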