GenieHive Architecture

Last updated: 2026-04-27

Mission

GenieHive is a local-first control plane for heterogeneous generative AI services running across one or more hosts. It provides:

  • Registration and health tracking for distributed AI services
  • A stable, OpenAI-compatible client-facing API
  • Role-based routing and scheduling over multiple services
  • Integrated benchmarking and performance-informed route scoring

It is not a plain OpenAI-compatible gateway. The control plane layer adds topology awareness, role abstraction, and signal-driven routing that a dumb proxy does not provide.


Four Layers

┌─────────────────────────────────────────────┐
│  Client Facades                              │
│  OpenAI-compatible completions + embeddings  │
│  Operator inspection API                     │
├─────────────────────────────────────────────┤
│  Control API                                 │
│  Registry · Role catalog · Route resolution  │
│  Scheduling · Benchmark store                │
├─────────────────────────────────────────────┤
│  Node Agent(s)                               │
│  Host discovery · Service enumeration        │
│  Telemetry reporting · Heartbeat             │
├─────────────────────────────────────────────┤
│  Provider Adapters                           │
│  OpenAI-compatible chat / embeddings         │
│  Transcription (partial)                     │
└─────────────────────────────────────────────┘

Core Concepts

Host

A physical or virtual machine participating in the cluster.

Service

A concrete callable capability on a host: a chat endpoint, an embeddings endpoint, or a transcription endpoint. A host typically exposes multiple services.

Asset

A model weight, model name, or runtime target that a service can serve. Assets carry optional request_policy fields that adjust how requests are shaped before forwarding.

Role

A reusable task profile that describes how requests should be fulfilled, not which model fills them. A role has a prompt policy (system prompt injection, body defaults) and a routing policy (preferred model families, minimum context size, loaded-first preference). The same role can route to different services as cluster state changes.
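
For illustration, a role with a prompt policy and a routing policy might look like the sketch below; the field names are hypothetical and do not reflect the actual schema.

```python
# Hypothetical role record; field names are illustrative, not the real schema.
summarizer_role = {
    "name": "summarizer",
    "prompt_policy": {
        "system_prompt": "Summarize the user's input in three sentences.",
        "system_prompt_mode": "prepend",            # prepend / append / replace
        "body_defaults": {"temperature": 0.2},      # deep-merged into the request body
    },
    "routing_policy": {
        "preferred_families": ["qwen3", "llama3"],  # preferred model families
        "min_context": 8192,                        # minimum context size
        "prefer_loaded": True,                      # loaded-first preference
    },
}
```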

Route Resolution

  1. If model matches a loaded, healthy asset or service alias → route directly.
  2. If model matches a known role → score eligible services and route to the best.
  3. Otherwise → fail with a clear 404.
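
A minimal sketch of that order, with the registry lookups and the scorer passed in as callables because this document does not describe their real signatures:

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class Target:
    service_id: str
    healthy: bool = True

class RouteNotFound(LookupError):
    """Raised when neither a direct match nor a role match exists (maps to HTTP 404)."""

def resolve_route(
    model: str,
    kind: str,
    find_direct: Callable[[str, str], Optional[Target]],
    find_role: Callable[[str], Optional[dict]],
    eligible_services: Callable[[dict, str], list],
    score_service: Callable[[Target, dict], float],
) -> Target:
    # 1. Direct match: a loaded, healthy asset or service alias.
    target = find_direct(model, kind)
    if target is not None and target.healthy:
        return target
    # 2. Role match: score the eligible, healthy services and take the best.
    role = find_role(model)
    if role is not None:
        candidates = [s for s in eligible_services(role, kind) if s.healthy]
        if candidates:
            return max(candidates, key=lambda s: score_service(s, role))
    # 3. Otherwise fail with a clear 404.
    raise RouteNotFound(f"no asset, alias, or role matches {model!r}")
```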

Data Flow: Chat Completion

Client POST /v1/chat/completions
  │
  ▼
resolve_route(model, kind="chat")
  ├─ direct: asset_id or service alias match
  └─ role: filter by kind/health → score by runtime + benchmark signals
  │
  ▼
apply_request_policy(request, asset, role)
  ├─ deep-merge body_defaults
  ├─ apply system prompt (prepend / append / replace)
  └─ auto-infer Qwen3 template kwargs if needed
  │
  ▼
UpstreamClient.chat_completions(endpoint, modified_request)
  │
  ▼
_strip_reasoning_fields(response)  ← removes reasoning_content / reasoning
  │
  ▼
Response to client
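
The request-shaping step can be pictured as follows. This is a sketch assuming the client's explicit values win over body_defaults and that the system-prompt mode is one of prepend, append, or replace; it deliberately omits the Qwen3 template-kwargs inference.

```python
def deep_merge(defaults: dict, request: dict) -> dict:
    """Recursively overlay the client's request on top of the role/asset body_defaults."""
    merged = dict(defaults)
    for key, value in request.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged

def apply_system_prompt(messages: list, prompt: str, mode: str = "prepend") -> list:
    """Inject the role's system prompt: prepend to, append to, or replace any existing one."""
    existing = next((m for m in messages if m.get("role") == "system"), None)
    if existing is None:
        return [{"role": "system", "content": prompt}, *messages]
    if mode == "prepend":
        existing["content"] = f"{prompt}\n{existing['content']}"
    elif mode == "append":
        existing["content"] = f"{existing['content']}\n{prompt}"
    else:  # replace
        existing["content"] = prompt
    return messages
```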

Scoring

Route scoring combines up to four signal families, weighted differently for role-based routes and direct service matches:

| Signal family     | Weight (role) | Weight (service) |
|-------------------|---------------|------------------|
| Text overlap      | 30%           | 20%              |
| Runtime           | 30%           | 45%              |
| Benchmark         | 25%           | 35%              |
| Family preference | 15%           | —                |
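
Assuming each family's signal is normalized to roughly [0, 1], the final score is a weighted sum using the table above; a sketch (names are illustrative):

```python
# Signal-family weights from the table above.
ROLE_WEIGHTS = {"text_overlap": 0.30, "runtime": 0.30, "benchmark": 0.25, "family_pref": 0.15}
SERVICE_WEIGHTS = {"text_overlap": 0.20, "runtime": 0.45, "benchmark": 0.35}

def combined_score(signals: dict, weights: dict) -> float:
    """Weighted sum of per-family scores; families without a score count as 0."""
    return sum(weight * signals.get(family, 0.0) for family, weight in weights.items())
```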

Runtime signals (from last heartbeat):

  • Loaded state: +0.35
  • Latency bands: p50 <500 ms +0.30, <1500 ms +0.20, <3000 ms +0.10, else +0.05
  • Throughput: ≥40 tok/s +0.20, ≥20 +0.10
  • Queue depth: penalty 0.20 if ≥5, 0.10 if ≥2
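
A sketch of the runtime signal computed from those heartbeat fields (the parameter names are assumptions; the clamp at zero is illustrative):

```python
def runtime_score(loaded: bool, p50_ms: float, tok_per_s: float, queue_depth: int) -> float:
    """Runtime signal from the last heartbeat, following the bands listed above."""
    score = 0.35 if loaded else 0.0
    # Latency bands (p50).
    if p50_ms < 500:
        score += 0.30
    elif p50_ms < 1500:
        score += 0.20
    elif p50_ms < 3000:
        score += 0.10
    else:
        score += 0.05
    # Throughput bands.
    if tok_per_s >= 40:
        score += 0.20
    elif tok_per_s >= 20:
        score += 0.10
    # Queue-depth penalty.
    if queue_depth >= 5:
        score -= 0.20
    elif queue_depth >= 2:
        score -= 0.10
    return max(score, 0.0)
```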

Benchmark signals (from ingested workload runs):

  • Workload overlap score (Jaccard-style token overlap)
  • Quality score from run results
  • Combined benchmark signal: 0.45 * overlap + 0.55 * quality
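
And the benchmark signal, directly from the formula above:

```python
def benchmark_score(overlap: float, quality: float) -> float:
    """Combine workload overlap (Jaccard-style) with measured quality per the weights above."""
    return 0.45 * overlap + 0.55 * quality
```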

Topology

Minimum viable (single machine):

control plane + node agent + model server
all on 127.0.0.1, different ports

Recommended (small cluster):

1 control plane host
2+ node-agent hosts, each with 1+ model servers
1+ clients on LAN

Auth:

  • Client requests: X-Api-Key header
  • Node registration/heartbeat: X-GenieHive-Node-Key header
  • Empty key lists disable auth (development only)
  • mTLS between control and nodes planned for v1.5
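
For example, a node heartbeat could be sent as below; the URL, port, and payload fields are placeholders, and only the header name comes from the rules above.

```python
import requests

requests.post(
    "http://controlplane.local:8080/v1/nodes/heartbeat",  # placeholder host/port
    headers={"X-GenieHive-Node-Key": "dev-node-key"},      # node auth header
    json={"host_id": "node-01", "services": []},           # hypothetical payload
    timeout=5,
)
```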

State Store

SQLite. Schema:

| Table             | Content                                   |
|-------------------|-------------------------------------------|
| hosts             | Host registration, resources, labels      |
| services          | Service config, runtime, assets, observed |
| roles             | Role catalog                              |
| benchmark_samples | Workload results per service              |

Default path: state/geniehive.sqlite3


API Reference Summary

Client API

| Endpoint                      | Status      |
|-------------------------------|-------------|
| GET /v1/models                | Implemented |
| POST /v1/chat/completions     | Implemented |
| POST /v1/embeddings           | Implemented |
| POST /v1/audio/transcriptions | Stub only   |
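
Because the facade is OpenAI-compatible, a chat request differs from a stock OpenAI call only in the base URL and the X-Api-Key header. A sketch with placeholder host, port, and key:

```python
import requests

resp = requests.post(
    "http://controlplane.local:8080/v1/chat/completions",  # placeholder host/port
    headers={"X-Api-Key": "dev-client-key"},
    json={
        # `model` may be an asset/alias or a role name; route resolution decides
        # which service actually serves the request.
        "model": "summarizer",
        "messages": [{"role": "user", "content": "Summarize this document."}],
    },
    timeout=60,
)
print(resp.json()["choices"][0]["message"]["content"])
```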

Operator API

| Endpoint                       | Status      |
|--------------------------------|-------------|
| GET /v1/cluster/hosts          | Implemented |
| GET /v1/cluster/services       | Implemented |
| GET /v1/cluster/roles          | Implemented |
| GET /v1/cluster/benchmarks     | Implemented |
| GET /v1/cluster/health         | Implemented |
| GET /v1/cluster/routes/resolve | Implemented |
| POST /v1/cluster/routes/match  | Implemented |

Node API

| Endpoint                  | Status      |
|---------------------------|-------------|
| POST /v1/nodes/register   | Implemented |
| POST /v1/nodes/heartbeat  | Implemented |
| GET /v1/node/inventory    | Implemented |
| GET /v1/node/registration | Implemented |

Supported Upstream Backends

Any OpenAI-compatible HTTP server. Tested configurations:

  • Ollama — chat and embeddings
  • llama.cpp (server mode) — chat and embeddings
  • llamafile — chat
  • vLLM — chat and embeddings

Non-Goals for V1

  • Peer-to-peer consensus
  • Autonomous global model swapping
  • WAN zero-trust networking
  • Image and TTS generation
  • Distributed vector databases
  • Billing or multi-tenant quotas