# GenieHive Architecture
Last updated: 2026-04-27
## Mission
GenieHive is a local-first control plane for heterogeneous generative AI services
running across one or more hosts. It provides:
- Registration and health tracking for distributed AI services
- A stable, OpenAI-compatible client-facing API
- Role-based routing and scheduling over multiple services
- Integrated benchmarking and performance-informed route scoring
It is not a plain OpenAI-compatible gateway. The control plane layer adds topology
awareness, role abstraction, and signal-driven routing that a dumb proxy does not
provide.
---
## Four Layers
```
┌─────────────────────────────────────────────┐
│ Client Facades │
│ OpenAI-compatible completions + embeddings │
│ Operator inspection API │
├─────────────────────────────────────────────┤
│ Control API │
│ Registry · Role catalog · Route resolution │
│ Scheduling · Benchmark store │
├─────────────────────────────────────────────┤
│ Node Agent(s) │
│ Host discovery · Service enumeration │
│ Telemetry reporting · Heartbeat │
├─────────────────────────────────────────────┤
│ Provider Adapters │
│ OpenAI-compatible chat / embeddings │
│ Transcription (partial) │
└─────────────────────────────────────────────┘
```
---
## Core Concepts
### Host
A physical or virtual machine participating in the cluster.
### Service
A concrete callable capability on a host: a chat endpoint, an embeddings endpoint,
or a transcription endpoint. A host typically exposes multiple services.
### Asset
A model weight, model name, or runtime target that a service can serve. Assets carry
optional `request_policy` fields that adjust how requests are shaped before forwarding.
### Role
A reusable task profile that describes *how* requests should be fulfilled, not *which*
model fills them. A role has a prompt policy (system prompt injection, body defaults)
and a routing policy (preferred model families, minimum context size, loaded-first
preference). The same role can route to different services as cluster state changes.
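To make the role concept concrete, here is a hypothetical role record as a plain Python dict. The field names (`prompt_policy`, `routing_policy`, etc.) are illustrative assumptions, not the actual schema:

```python
# Hypothetical role record; field names are illustrative, not the real schema.
role = {
    "name": "fast-summarizer",
    "prompt_policy": {
        "system_prompt": "Summarize concisely.",
        "system_prompt_mode": "prepend",       # prepend / append / replace
        "body_defaults": {"temperature": 0.2},
    },
    "routing_policy": {
        "preferred_families": ["qwen3", "llama"],
        "min_context": 8192,
        "prefer_loaded": True,
    },
}
```

Note that nothing in the record names a specific service: the routing policy only constrains *which* services are eligible, so the same role can land on different services as cluster state changes.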
### Route Resolution
1. If `model` matches a loaded, healthy asset or service alias → route directly.
2. If `model` matches a known role → score eligible services and route to the best.
3. Otherwise → fail with a clear 404.
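The three-step resolution order can be sketched as follows. This is a minimal sketch, not the real implementation; the data shapes and helper names are assumptions:

```python
# Minimal sketch of the three-step route resolution; data shapes are hypothetical.
def resolve_route(model: str, kind: str, assets: dict, roles: dict) -> dict:
    # 1. Direct match: a loaded, healthy asset or service alias.
    asset = assets.get(model)
    if asset and asset["loaded"] and asset["healthy"]:
        return {"type": "direct", "service": asset["service"]}
    # 2. Role match: filter eligible services by kind and health, pick the best score.
    role = roles.get(model)
    if role:
        eligible = [s for s in role["services"]
                    if s["kind"] == kind and s["healthy"]]
        if eligible:
            best = max(eligible, key=lambda s: s["score"])
            return {"type": "role", "service": best["name"]}
    # 3. Otherwise: fail with a clear 404.
    raise LookupError(f"404: no asset, alias, or role named {model!r}")
```

The ordering matters: a direct asset match always wins over a role of the same name, so operators can pin a request to a concrete service by using its exact identifier.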
---
## Data Flow: Chat Completion
```
Client POST /v1/chat/completions
resolve_route(model, kind="chat")
├─ direct: asset_id or service alias match
└─ role: filter by kind/health → score by runtime + benchmark signals
apply_request_policy(request, asset, role)
├─ deep-merge body_defaults
├─ apply system prompt (prepend / append / replace)
└─ auto-infer Qwen3 template kwargs if needed
UpstreamClient.chat_completions(endpoint, modified_request)
_strip_reasoning_fields(response) ← removes reasoning_content / reasoning
Response to client
```
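The `apply_request_policy` step above amounts to a deep merge of `body_defaults` under the client request plus system-prompt injection. A sketch of those two pieces, assuming dict-shaped request bodies (not the actual implementation):

```python
# Sketch of the request-policy step: deep-merge body defaults, then inject
# the role's system prompt. Hypothetical helpers, not the real code.
def deep_merge(defaults: dict, override: dict) -> dict:
    """Client-supplied values win; nested dicts are merged recursively."""
    merged = dict(defaults)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged

def apply_system_prompt(messages: list, prompt: str, mode: str) -> list:
    """Inject the role's system prompt in prepend / append / replace mode."""
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    if mode == "replace" or not system:
        return [{"role": "system", "content": prompt}] + rest
    existing = system[0]["content"]
    content = f"{prompt}\n{existing}" if mode == "prepend" else f"{existing}\n{prompt}"
    return [{"role": "system", "content": content}] + rest
```

The merge direction is the important design point: role and asset defaults never override anything the client sent explicitly.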
---
## Scoring
Route scoring combines three signal families:
| Signal family | Weight (role) | Weight (service) |
|----------------|---------------|-----------------|
| Text overlap | 30% | 20% |
| Runtime | 30% | 45% |
| Benchmark | 25% | 35% |
| Family pref. | 15% | — |
**Runtime signals** (from last heartbeat):
- Loaded state: +0.35
- Latency bands: p50 <500 ms +0.30, <1500 ms +0.20, <3000 ms +0.10, else +0.05
- Throughput: ≥40 tok/s +0.20, ≥20 tok/s +0.10
- Queue depth: penalty 0.20 if ≥5, 0.10 if ≥2
**Benchmark signals** (from ingested workload runs):
- Workload overlap score (Jaccard-style token overlap)
- Quality score from recorded results
- Combined benchmark score: `0.45 * overlap + 0.55 * quality`
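The runtime bands and the benchmark blend above translate directly into code. This is a sketch of the per-signal arithmetic only (the real scorer also applies the family-level weights from the table):

```python
# Sketch of the runtime and benchmark signal arithmetic described above.
def runtime_score(loaded: bool, p50_ms: float, tok_per_s: float, queue_depth: int) -> float:
    score = 0.35 if loaded else 0.0
    # Latency bands on p50.
    if p50_ms < 500:
        score += 0.30
    elif p50_ms < 1500:
        score += 0.20
    elif p50_ms < 3000:
        score += 0.10
    else:
        score += 0.05
    # Throughput bonus.
    if tok_per_s >= 40:
        score += 0.20
    elif tok_per_s >= 20:
        score += 0.10
    # Queue-depth penalty.
    if queue_depth >= 5:
        score -= 0.20
    elif queue_depth >= 2:
        score -= 0.10
    return score

def benchmark_score(overlap: float, quality: float) -> float:
    return 0.45 * overlap + 0.55 * quality
```

A loaded service with p50 under 500 ms, ≥40 tok/s, and an empty queue scores the maximum 0.85 on the runtime axis.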
---
## Topology
**Minimum viable (single machine):**
```
control plane + node agent + model server
all on 127.0.0.1, different ports
```
**Recommended (small cluster):**
```
1 control plane host
2+ node-agent hosts, each with 1+ model servers
1+ clients on LAN
```
**Auth:**
- Client requests: `X-Api-Key` header
- Node registration/heartbeat: `X-GenieHive-Node-Key` header
- Empty key lists disable auth (development only)
- mTLS between control and nodes planned for v1.5
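A minimal client request carrying the `X-Api-Key` header might be built like this, using only the standard library. The URL, port, and key are placeholders, and the request is constructed but not sent:

```python
import json
import urllib.request

# Build (but do not send) a chat request with the client auth header.
# Host, port, and key below are placeholders, not defaults of GenieHive.
body = json.dumps({
    "model": "fast-summarizer",
    "messages": [{"role": "user", "content": "hello"}],
}).encode()

req = urllib.request.Request(
    "http://127.0.0.1:8000/v1/chat/completions",
    data=body,
    headers={"X-Api-Key": "dev-key", "Content-Type": "application/json"},
    method="POST",
)
```

Node agents authenticate the same way but with the `X-GenieHive-Node-Key` header against the `/v1/nodes/*` endpoints.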
---
## State Store
SQLite. Schema:
| Table | Content |
|---------------------|-------------------------------------------|
| `hosts` | Host registration, resources, labels |
| `services` | Service config, runtime, assets, observed |
| `roles` | Role catalog |
| `benchmark_samples` | Workload results per service |
Default path: `state/geniehive.sqlite3`
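As an illustration only, the four tables could be sketched with `sqlite3` like this; the column names are invented for the example and are not the actual schema:

```python
import sqlite3

# Illustrative schema sketch; the real column definitions live in the control
# plane. An in-memory DB is used here; the default on-disk path is
# state/geniehive.sqlite3.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE hosts (
    id        TEXT PRIMARY KEY,
    resources TEXT,            -- JSON blob: CPU, RAM, GPUs
    labels    TEXT             -- JSON blob
);
CREATE TABLE services (
    id       TEXT PRIMARY KEY,
    host_id  TEXT REFERENCES hosts(id),
    config   TEXT,             -- JSON: endpoint, kind, assets
    observed TEXT              -- JSON: last heartbeat snapshot
);
CREATE TABLE roles (
    name       TEXT PRIMARY KEY,
    definition TEXT            -- JSON: prompt policy + routing policy
);
CREATE TABLE benchmark_samples (
    service_id TEXT REFERENCES services(id),
    workload   TEXT,
    result     TEXT            -- JSON: overlap + quality scores
);
""")
```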
---
## API Reference Summary
### Client API
| Endpoint | Status |
|---------------------------------|---------------|
| `GET /v1/models` | Implemented |
| `POST /v1/chat/completions` | Implemented |
| `POST /v1/embeddings` | Implemented |
| `POST /v1/audio/transcriptions` | Stub only |
### Operator API
| Endpoint | Status |
|------------------------------------|-------------|
| `GET /v1/cluster/hosts` | Implemented |
| `GET /v1/cluster/services` | Implemented |
| `GET /v1/cluster/roles` | Implemented |
| `GET /v1/cluster/benchmarks` | Implemented |
| `GET /v1/cluster/health` | Implemented |
| `GET /v1/cluster/routes/resolve` | Implemented |
| `POST /v1/cluster/routes/match` | Implemented |
### Node API
| Endpoint | Status |
|------------------------------|-------------|
| `POST /v1/nodes/register` | Implemented |
| `POST /v1/nodes/heartbeat` | Implemented |
| `GET /v1/node/inventory` | Implemented |
| `GET /v1/node/registration` | Implemented |
---
## Supported Upstream Backends
Any OpenAI-compatible HTTP server. Tested configurations:
- **Ollama** chat and embeddings
- **llama.cpp** (server mode) chat and embeddings
- **llamafile** chat
- **vLLM** chat and embeddings
---
## Non-Goals for V1
- Peer-to-peer consensus
- Autonomous global model swapping
- WAN zero-trust networking
- Image and TTS generation
- Distributed vector databases
- Billing or multi-tenant quotas