# GenieHive Architecture

Last updated: 2026-04-27

## Mission

GenieHive is a local-first control plane for heterogeneous generative AI services
running across one or more hosts. It provides:

- Registration and health tracking for distributed AI services
- A stable, OpenAI-compatible client-facing API
- Role-based routing and scheduling over multiple services
- Integrated benchmarking and performance-informed route scoring

It is not a plain OpenAI-compatible gateway: the control-plane layer adds topology
awareness, role abstraction, and signal-driven routing that a simple proxy cannot
provide.

---

## Four Layers

```
┌─────────────────────────────────────────────┐
│  Client Facades                             │
│  OpenAI-compatible completions + embeddings │
│  Operator inspection API                    │
├─────────────────────────────────────────────┤
│  Control API                                │
│  Registry · Role catalog · Route resolution │
│  Scheduling · Benchmark store               │
├─────────────────────────────────────────────┤
│  Node Agent(s)                              │
│  Host discovery · Service enumeration       │
│  Telemetry reporting · Heartbeat            │
├─────────────────────────────────────────────┤
│  Provider Adapters                          │
│  OpenAI-compatible chat / embeddings        │
│  Transcription (partial)                    │
└─────────────────────────────────────────────┘
```

---

## Core Concepts

### Host

A physical or virtual machine participating in the cluster.

### Service

A concrete, callable capability on a host: a chat endpoint, an embeddings endpoint,
or a transcription endpoint. A host typically exposes multiple services.

### Asset

A model weight, model name, or runtime target that a service can serve. Assets carry
optional `request_policy` fields that adjust how requests are shaped before forwarding.

### Role

A reusable task profile that describes *how* requests should be fulfilled, not *which*
model fulfills them. A role has a prompt policy (system prompt injection, body defaults)
and a routing policy (preferred model families, minimum context size, loaded-first
preference). The same role can route to different services as cluster state changes.

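To make the prompt-policy / routing-policy split concrete, a role record might look
like the sketch below. Every field name and value here is illustrative, not the
actual catalog schema.

```json
{
  "name": "summarize",
  "prompt_policy": {
    "system_prompt": "You are a concise summarizer.",
    "system_prompt_mode": "prepend",
    "body_defaults": { "temperature": 0.2 }
  },
  "routing_policy": {
    "preferred_families": ["qwen3", "llama"],
    "min_context": 8192,
    "prefer_loaded": true
  }
}
```
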
### Route Resolution

1. If `model` matches a loaded, healthy asset or a service alias → route directly.
2. If `model` matches a known role → score eligible services and route to the best.
3. Otherwise → fail with a clear 404.

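The three-step precedence can be sketched as follows. The data structures and
function signature are assumptions for illustration, not GenieHive's actual types.

```python
# Minimal sketch of the route-resolution precedence. Assets and roles are
# modeled as plain dicts here; the real registry is richer.

def resolve_route(model: str, assets: dict, roles: dict):
    """Return ("direct", target) or ("role", service) or raise a 404-style error."""
    # 1. Direct match: a loaded, healthy asset or service alias.
    target = assets.get(model)
    if target and target["loaded"] and target["healthy"]:
        return ("direct", model)
    # 2. Role match: score eligible services and pick the best.
    role = roles.get(model)
    if role:
        eligible = [s for s in role["services"] if s["healthy"]]
        if eligible:
            best = max(eligible, key=lambda s: s["score"])
            return ("role", best["name"])
    # 3. No match: fail loudly, mirroring the HTTP 404 behavior.
    raise LookupError(f"no route for model {model!r}")
```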
---

## Data Flow: Chat Completion

```
Client POST /v1/chat/completions
        │
        ▼
resolve_route(model, kind="chat")
    ├─ direct: asset_id or service alias match
    └─ role: filter by kind/health → score by runtime + benchmark signals
        │
        ▼
apply_request_policy(request, asset, role)
    ├─ deep-merge body_defaults
    ├─ apply system prompt (prepend / append / replace)
    └─ auto-infer Qwen3 template kwargs if needed
        │
        ▼
UpstreamClient.chat_completions(endpoint, modified_request)
        │
        ▼
_strip_reasoning_fields(response)   ← removes reasoning_content / reasoning
        │
        ▼
Response to client
```

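The request-policy step can be sketched as below: deep-merge the body defaults,
then apply the role's system prompt in one of the three modes. The helper names
and message shape are assumptions modeled on the flow above.

```python
# Sketch of the apply_request_policy step, not the actual implementation.

def deep_merge(defaults: dict, override: dict) -> dict:
    """Recursively merge dicts; explicit request values win over defaults."""
    out = dict(defaults)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(out.get(key), dict):
            out[key] = deep_merge(out[key], value)
        else:
            out[key] = value
    return out

def apply_system_prompt(messages: list, prompt: str, mode: str) -> list:
    """Prepend, append, or replace the system message, per the prompt policy."""
    rest = [m for m in messages if m["role"] != "system"]
    existing = [m["content"] for m in messages if m["role"] == "system"]
    if mode == "replace" or not existing:
        content = prompt
    elif mode == "prepend":
        content = prompt + "\n" + existing[0]
    else:  # "append"
        content = existing[0] + "\n" + prompt
    return [{"role": "system", "content": content}] + rest
```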
---

## Scoring

Route scoring combines four signal families (direct service scoring omits
family preference):

| Signal family  | Weight (role) | Weight (service) |
|----------------|---------------|------------------|
| Text overlap   | 30%           | 20%              |
| Runtime        | 30%           | 45%              |
| Benchmark      | 25%           | 35%              |
| Family pref.   | 15%           | —                |

**Runtime signals** (from the last heartbeat):

- Loaded state: +0.35
- Latency bands: p50 <500 ms +0.30, <1500 ms +0.20, <3000 ms +0.10, else +0.05
- Throughput: ≥40 tok/s +0.20, ≥20 tok/s +0.10
- Queue depth: penalty −0.20 if ≥5, −0.10 if ≥2

**Benchmark signals** (from ingested workload runs):

- Workload overlap score (Jaccard-style token overlap)
- Quality score from results, combined as `0.45 * overlap + 0.55 * quality`

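The band thresholds above translate directly into code. The function shapes are
illustrative; only the thresholds and the benchmark formula come from the tables.

```python
# Runtime-signal bands, straight from the scoring table above.

def runtime_score(loaded: bool, p50_ms: float, tok_per_s: float, queue: int) -> float:
    score = 0.35 if loaded else 0.0
    # Latency bands on p50.
    if p50_ms < 500:
        score += 0.30
    elif p50_ms < 1500:
        score += 0.20
    elif p50_ms < 3000:
        score += 0.10
    else:
        score += 0.05
    # Throughput bands.
    if tok_per_s >= 40:
        score += 0.20
    elif tok_per_s >= 20:
        score += 0.10
    # Queue-depth penalty.
    if queue >= 5:
        score -= 0.20
    elif queue >= 2:
        score -= 0.10
    return score

def benchmark_score(overlap: float, quality: float) -> float:
    # Combined benchmark signal, per the formula above.
    return 0.45 * overlap + 0.55 * quality
```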
---

## Topology

**Minimum viable (single machine):**
```
control plane + node agent + model server
all on 127.0.0.1, different ports
```

**Recommended (small cluster):**
```
1 control plane host
2+ node-agent hosts, each with 1+ model servers
1+ clients on LAN
```

**Auth:**
- Client requests: `X-Api-Key` header
- Node registration/heartbeat: `X-GenieHive-Node-Key` header
- Empty key lists disable auth (development only)
- mTLS between control plane and nodes planned for v1.5

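A sketch of the two auth headers in use. The host, port, paths, and key values
here are illustrative, not defaults shipped with GenieHive.

```python
# Hypothetical client- and node-side requests carrying the auth headers.
import urllib.request

client_req = urllib.request.Request(
    "http://127.0.0.1:8080/v1/chat/completions",
    headers={"X-Api-Key": "client-key"},
    method="POST",
)
node_req = urllib.request.Request(
    "http://127.0.0.1:8080/v1/nodes/heartbeat",
    headers={"X-GenieHive-Node-Key": "node-key"},
    method="POST",
)
```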
---

## State Store

SQLite. Schema:

| Table               | Content                                   |
|---------------------|-------------------------------------------|
| `hosts`             | Host registration, resources, labels      |
| `services`          | Service config, runtime, assets, observed |
| `roles`             | Role catalog                              |
| `benchmark_samples` | Workload results per service              |

Default path: `state/geniehive.sqlite3`

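For a quick look at the store, the table names above can be inspected with the
standard `sqlite3` module. The columns in this sketch are assumptions; only the
table names and the default path come from this document.

```python
# Illustrative peek at the state store; real DDL will differ.
import sqlite3

conn = sqlite3.connect(":memory:")  # the real store lives at state/geniehive.sqlite3
conn.executescript("""
CREATE TABLE hosts             (id TEXT PRIMARY KEY, resources TEXT, labels TEXT);
CREATE TABLE services          (id TEXT PRIMARY KEY, host_id TEXT, config TEXT);
CREATE TABLE roles             (name TEXT PRIMARY KEY, policy TEXT);
CREATE TABLE benchmark_samples (service_id TEXT, workload TEXT, result TEXT);
""")
tables = sorted(
    row[0] for row in
    conn.execute("SELECT name FROM sqlite_master WHERE type = 'table'")
)
print(tables)
```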
---

## API Reference Summary

### Client API

| Endpoint                        | Status      |
|---------------------------------|-------------|
| `GET /v1/models`                | Implemented |
| `POST /v1/chat/completions`     | Implemented |
| `POST /v1/embeddings`           | Implemented |
| `POST /v1/audio/transcriptions` | Stub only   |

### Operator API

| Endpoint                         | Status      |
|----------------------------------|-------------|
| `GET /v1/cluster/hosts`          | Implemented |
| `GET /v1/cluster/services`       | Implemented |
| `GET /v1/cluster/roles`          | Implemented |
| `GET /v1/cluster/benchmarks`     | Implemented |
| `GET /v1/cluster/health`         | Implemented |
| `GET /v1/cluster/routes/resolve` | Implemented |
| `POST /v1/cluster/routes/match`  | Implemented |

### Node API

| Endpoint                    | Status      |
|-----------------------------|-------------|
| `POST /v1/nodes/register`   | Implemented |
| `POST /v1/nodes/heartbeat`  | Implemented |
| `GET /v1/node/inventory`    | Implemented |
| `GET /v1/node/registration` | Implemented |

---

## Supported Upstream Backends

Any OpenAI-compatible HTTP server. Tested configurations:

- **Ollama** — chat and embeddings
- **llama.cpp** (server mode) — chat and embeddings
- **llamafile** — chat
- **vLLM** — chat and embeddings

---

## Non-Goals for V1

- Peer-to-peer consensus
- Autonomous global model swapping
- WAN zero-trust networking
- Image and TTS generation
- Distributed vector databases
- Billing or multi-tenant quotas