Revise architecture/roadmap docs and add LLM evaluation guide
- architecture.md: rewrite to describe the actual running system; remove the
  design-phase repo-naming discussion and initial-implementation-sequence list;
  add a data-flow diagram, scoring weights table, and API status tables
- roadmap.md: replace the aspirational list with a concrete completed/gap/next
  structure; document four confirmed implementation gaps (transcription stub,
  strategy field ignored, fallback_roles unimplemented, benchmark quality score
  additive overflow); prioritise fixes as P0/P1/P2/P3
- docs/local_llm_evaluation.md: new document; role taxonomy (tiers 1–3),
  hardware inventory template, candidate model suggestions, three-phase
  evaluation protocol, GenieHive integration steps, results template, and notes
  on Qwen3/Mistral/DeepSeek/Ollama embedding-path quirks

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Parent: e36650a017
Commit: a76c7e81f4
docs/architecture.md

@@ -1,74 +1,46 @@
 # GenieHive Architecture
 
-Status: proposed v1 architecture
-Drafted: 2026-04-05
+Last updated: 2026-04-27
 
-## Repo Name
-
-Chosen name: `GenieHive`
-
-Why this name:
-
-- suggestive: "genie" implies generative AI services, "hive" implies a cooperating cluster
-- accessible: easy to say, remember, and explain
-- whimsical enough to feel like a project name rather than a dry infrastructure label
-
-Tradeoff:
-
-- `GenieHive` is less search-distinct than `Geniewarren` because `hive` is a common product metaphor
-
 ## Mission
 
-GenieHive is a local-first control plane for heterogeneous generative AI services running across one or more hosts.
-
-It should:
-
-- register hosts and their available services
-- expose a stable client-facing API
-- track health, capacity, and observed performance
-- support direct model addressing and higher-level role addressing
-- route requests to healthy loaded services first
-- optionally coordinate loading or swapping when policy allows
-- remain practical for a small self-hosted deployment with two hosts
+GenieHive is a local-first control plane for heterogeneous generative AI services
+running across one or more hosts. It provides:
+
+- Registration and health tracking for distributed AI services
+- A stable, OpenAI-compatible client-facing API
+- Role-based routing and scheduling over multiple services
+- Integrated benchmarking and performance-informed route scoring
+
+It is not a plain OpenAI-compatible gateway. The control plane layer adds topology
+awareness, role abstraction, and signal-driven routing that a dumb proxy does not
+provide.
 
-## Non-Goals For V1
-
-Out of scope initially:
-
-- peer-to-peer consensus
-- autonomous global model swapping across many nodes
-- full WAN zero-trust platform engineering
-- image and TTS generation orchestration
-- distributed vector database management
-- billing or multi-tenant quota accounting
+---
+
+## Four Layers
+
+```
+┌─────────────────────────────────────────────┐
+│ Client Facades                              │
+│ OpenAI-compatible completions + embeddings  │
+│ Operator inspection API                     │
+├─────────────────────────────────────────────┤
+│ Control API                                 │
+│ Registry · Role catalog · Route resolution  │
+│ Scheduling · Benchmark store                │
+├─────────────────────────────────────────────┤
+│ Node Agent(s)                               │
+│ Host discovery · Service enumeration        │
+│ Telemetry reporting · Heartbeat             │
+├─────────────────────────────────────────────┤
+│ Provider Adapters                           │
+│ OpenAI-compatible chat / embeddings         │
+│ Transcription (partial)                     │
+└─────────────────────────────────────────────┘
+```
 
-## Architectural Position
-
-GenieHive is not just an OpenAI-compatible gateway.
-
-It is a control plane with these layers:
-
-1. Control API
-   - authoritative registry
-   - routing and scheduling
-   - role catalog
-   - operator inspection
-
-2. Node Agent
-   - host discovery
-   - service discovery
-   - telemetry reporting
-   - optional local process management
-
-3. Provider Adapters
-   - OpenAI-compatible chat backends
-   - OpenAI-compatible embedding backends
-   - transcription backends
-   - future adapters for image and speech synthesis
-
-4. Client Facades
-   - OpenAI-compatible facade for completions and embeddings
-   - operator API for topology, health, and inventory
+---
 
 ## Core Concepts
@@ -78,117 +50,165 @@ A physical or virtual machine participating in the cluster.
 ### Service
 
-A concrete callable capability on a host. Examples:
-
-- chat completion endpoint
-- embedding endpoint
-- transcription endpoint
+A concrete callable capability on a host: a chat endpoint, an embeddings endpoint,
+or a transcription endpoint. A host typically exposes multiple services.
 
 ### Asset
 
-A model weight, model name, application, or runtime target that a service can serve.
+A model weight, model name, or runtime target that a service can serve. Assets carry
+optional `request_policy` fields that adjust how requests are shaped before forwarding.
 
 ### Role
 
-A reusable task profile that describes how requests should be fulfilled. A role is policy, not a concrete model.
+A reusable task profile that describes *how* requests should be fulfilled, not *which*
+model fills them. A role has a prompt policy (system prompt injection, body defaults)
+and a routing policy (preferred model families, minimum context size, loaded-first
+preference). The same role can route to different services as cluster state changes.
 
 ### Route Resolution
 
-Request handling order:
-
-1. If the requested `model` matches a currently loaded and healthy concrete asset or service alias, route directly.
-2. Otherwise, if the requested `model` matches a known role, resolve the role to the best eligible service.
-3. Otherwise, fail clearly.
+1. If `model` matches a loaded, healthy asset or service alias → route directly.
+2. If `model` matches a known role → score eligible services and route to the best.
+3. Otherwise → fail with a clear 404.
+
+---
 
-## V1 Capability Scope
+## Data Flow: Chat Completion
 
-V1 supports only:
+```
+Client POST /v1/chat/completions
+        │
+        ▼
+resolve_route(model, kind="chat")
+        ├─ direct: asset_id or service alias match
+        └─ role: filter by kind/health → score by runtime + benchmark signals
+        │
+        ▼
+apply_request_policy(request, asset, role)
+        ├─ deep-merge body_defaults
+        ├─ apply system prompt (prepend / append / replace)
+        └─ auto-infer Qwen3 template kwargs if needed
+        │
+        ▼
+UpstreamClient.chat_completions(endpoint, modified_request)
+        │
+        ▼
+_strip_reasoning_fields(response)  ← removes reasoning_content / reasoning
+        │
+        ▼
+Response to client
+```
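The resolution order in the data flow above can be sketched in a few lines. This is an illustrative reduction, not GenieHive's actual `resolve_route()`: the dict shapes (`aliases`, `kind`, `healthy`, `loaded`) and the injectable `score` function are assumptions for the sketch.

```python
def resolve_route(model, kind, services, roles, score=lambda s: 1.0):
    """Return ("direct" | "role", service) or raise LookupError (maps to HTTP 404)."""
    # 1. Direct match: a loaded, healthy asset or service alias wins outright.
    for svc in services:
        if model in svc["aliases"] and svc["healthy"] and svc["loaded"]:
            return ("direct", svc)
    # 2. Role match: filter by operation kind and health, pick the best-scoring service.
    if model in roles:
        eligible = [s for s in services if s["kind"] == kind and s["healthy"]]
        if eligible:
            return ("role", max(eligible, key=score))
    # 3. No match: fail clearly rather than guessing.
    raise LookupError(f"unknown model or role: {model}")
```

The key design point the sketch preserves: direct addressing never consults scoring, and an unknown name fails loudly instead of falling through to an arbitrary backend.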
-- chat completions
-- embeddings
-- transcription
+---
+
+## Scoring
+
+Route scoring combines three signal families:
+
+| Signal family | Weight (role) | Weight (service) |
+|---------------|---------------|------------------|
+| Text overlap  | 30%           | 20%              |
+| Runtime       | 30%           | 45%              |
+| Benchmark     | 25%           | 35%              |
+| Family pref.  | 15%           | —                |
+
+**Runtime signals** (from the last heartbeat):
+
+- Loaded state: +0.35
+- Latency bands: p50 <500 ms +0.30, <1500 ms +0.20, <3000 ms +0.10, else +0.05
+- Throughput: ≥40 tok/s +0.20, ≥20 +0.10
+- Queue depth: penalty −0.20 if ≥5, −0.10 if ≥2
+
+**Benchmark signals** (from ingested workload runs):
+
+- Workload overlap score (Jaccard-style token overlap)
+- Quality score from results, combined as `0.45 * overlap + 0.55 * quality`
+
+---
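The weight table and runtime bands described in the scoring section translate directly into arithmetic. A minimal sketch, using the documented weights and bands; the function names and signal-dict shape are assumptions, not the project's actual scoring code:

```python
# Weights per the scoring table: role targets vs. direct service targets.
ROLE_WEIGHTS = {"text_overlap": 0.30, "runtime": 0.30, "benchmark": 0.25, "family": 0.15}
SERVICE_WEIGHTS = {"text_overlap": 0.20, "runtime": 0.45, "benchmark": 0.35}

def combine(signals, weights):
    """Weighted sum of available signals (each in [0, 1]); missing signals count 0."""
    return sum(weights[name] * signals.get(name, 0.0) for name in weights)

def runtime_score(loaded, p50_ms, tok_s, queue_depth):
    """Runtime signal following the documented bands."""
    score = 0.35 if loaded else 0.0
    score += 0.30 if p50_ms < 500 else 0.20 if p50_ms < 1500 else 0.10 if p50_ms < 3000 else 0.05
    score += 0.20 if tok_s >= 40 else 0.10 if tok_s >= 20 else 0.0
    score -= 0.20 if queue_depth >= 5 else 0.10 if queue_depth >= 2 else 0.0
    return score
```

For example, a loaded service with a 420 ms p50, 38 tok/s, and an empty queue scores 0.35 + 0.30 + 0.10 = 0.75 on the runtime signal.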
 ## Topology
 
-Recommended initial topology:
-
-- 1 control plane
-- 2 node agents
-- 1 or more clients
-- LAN-first deployment
-- API key auth in v1
-- VPN or mTLS in v1.5
+**Minimum viable (single machine):**
+
+```
+control plane + node agent + model server
+all on 127.0.0.1, different ports
+```
+
+**Recommended (small cluster):**
+
+```
+1 control plane host
+2+ node-agent hosts, each with 1+ model servers
+1+ clients on LAN
+```
 
-## API Families
+**Auth:**
+
+- Client requests: `X-Api-Key` header
+- Node registration/heartbeat: `X-GenieHive-Node-Key` header
+- Empty key lists disable auth (development only)
+- mTLS between control and nodes planned for v1.5
+
+---
+
+## State Store
+
+SQLite. Schema:
+
+| Table               | Content                                   |
+|---------------------|-------------------------------------------|
+| `hosts`             | Host registration, resources, labels      |
+| `services`          | Service config, runtime, assets, observed |
+| `roles`             | Role catalog                              |
+| `benchmark_samples` | Workload results per service              |
+
+Default path: `state/geniehive.sqlite3`
+
+---
 ## API Reference Summary
 
 ### Client API
 
-- `GET /v1/models`
-- `POST /v1/chat/completions`
-- `POST /v1/embeddings`
-- `POST /v1/audio/transcriptions`
+| Endpoint                        | Status      |
+|---------------------------------|-------------|
+| `GET /v1/models`                | Implemented |
+| `POST /v1/chat/completions`     | Implemented |
+| `POST /v1/embeddings`           | Implemented |
+| `POST /v1/audio/transcriptions` | Stub only   |
 
-`GET /v1/models` should expose enough metadata for programmatic clients to make routing decisions about what GenieHive can handle cheaply, especially for lower-complexity offloaded work. That metadata should include direct assets, service-backed aliases, role aliases, operation kind, health, loaded status, and observed performance hints.
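The diff above notes that `GET /v1/models` carries metadata (operation kind, health, loaded status, performance hints) intended for programmatic routing decisions. A hypothetical client-side filter over such a listing might look like the following; the field names (`kind`, `healthy`, `loaded`, `p50_ms`) are assumptions for illustration, and the actual payload schema is defined by the endpoint itself:

```python
def cheap_chat_targets(models, max_p50_ms=1500):
    """Pick healthy, already-loaded chat entries under a latency budget."""
    return [
        m["id"] for m in models
        if m.get("kind") == "chat"
        and m.get("healthy") and m.get("loaded")
        and m.get("p50_ms", float("inf")) <= max_p50_ms
    ]
```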
 ### Operator API
 
-- `GET /v1/cluster/hosts`
-- `GET /v1/cluster/services`
-- `GET /v1/cluster/roles`
-- `GET /v1/cluster/health`
-- `GET /v1/cluster/routes/resolve?model=...`
+| Endpoint                         | Status      |
+|----------------------------------|-------------|
+| `GET /v1/cluster/hosts`          | Implemented |
+| `GET /v1/cluster/services`       | Implemented |
+| `GET /v1/cluster/roles`          | Implemented |
+| `GET /v1/cluster/benchmarks`     | Implemented |
+| `GET /v1/cluster/health`         | Implemented |
+| `GET /v1/cluster/routes/resolve` | Implemented |
+| `POST /v1/cluster/routes/match`  | Implemented |
 
 ### Node API
 
-- `POST /v1/nodes/register`
-- `POST /v1/nodes/heartbeat`
-- `GET /v1/node/inventory`
-- `POST /v1/node/services/refresh`
+| Endpoint                    | Status      |
+|-----------------------------|-------------|
+| `POST /v1/nodes/register`   | Implemented |
+| `POST /v1/nodes/heartbeat`  | Implemented |
+| `GET /v1/node/inventory`    | Implemented |
+| `GET /v1/node/registration` | Implemented |
+
+---
-## Data Store
+## Supported Upstream Backends
 
-V1 should use SQLite for durable state.
+Any OpenAI-compatible HTTP server. Tested configurations:
 
-## Routing Rules
+- **Ollama** — chat and embeddings
+- **llama.cpp** (server mode) — chat and embeddings
+- **llamafile** — chat
+- **vLLM** — chat and embeddings
 
-### Direct Model Resolution
+---
 
-If a request names a concrete asset alias or service alias:
+## Non-Goals for V1
 
-- prefer loaded and healthy services
-- choose the lowest-cost healthy target if multiple matches exist
-- fail clearly if all matches are unhealthy
+- Peer-to-peer consensus
+- Autonomous global model swapping
+- WAN zero-trust networking
+- Image and TTS generation
+- Distributed vector databases
+- Billing or multi-tenant quotas
 
-### Role Resolution
-
-If direct resolution fails, treat the requested name as a role.
-
-Role resolution should filter by:
-
-- operation kind
-- modality
-- health
-- auth and exposure compatibility
-- minimum context or memory requirements
-- preferred model families
-
-Then rank by:
-
-- already loaded
-- recent health
-- expected latency
-- queue pressure
-- operator priority
-
-## First Implementation Sequence
-
-1. Create the repo skeleton and docs.
-2. Implement SQLite-backed registry models.
-3. Implement node registration and heartbeat.
-4. Implement operator inspection endpoints.
-5. Implement client-facing chat routing.
-6. Add embeddings routing.
-7. Add transcription routing.
-8. Add truthful readiness and health reporting.
-9. Add role catalog and role-based resolution.
-10. Add optional managed local runtime support.
docs/local_llm_evaluation.md (new file, 251 lines)

# Local LLM Evaluation for GenieHive Agent Roles

Last updated: 2026-04-27

## Purpose

This document describes a framework for evaluating locally hosted LLMs against the
roles that GenieHive needs to fulfill in a multi-agent or tool-use pipeline. The goal
is to determine which models are fit for which roles given available hardware, and
to produce benchmark data that GenieHive's own routing layer can consume.

---

## Role Taxonomy

GenieHive routes by role. Before evaluating models, the roles likely needed in an
agent pipeline must be defined. The following taxonomy covers the most common cases.

### Tier 1: Core Inference Roles

| Role ID             | Description                                               | Key requirements                                   |
|---------------------|-----------------------------------------------------------|----------------------------------------------------|
| `general_assistant` | General-purpose instruction following, Q&A, summarization | Good instruction following, ≥8k context            |
| `reasoning`         | Multi-step problem solving, chain-of-thought tasks        | Extended thinking, ≥16k context, slow OK           |
| `code_assistant`    | Code generation, explanation, debugging                   | Strong code benchmarks, fill-in-middle optional    |
| `structured_output` | JSON/schema-constrained generation                        | Grammar sampling or reliable JSON mode             |
| `tool_use`          | Tool/function call formatting and parsing                 | Function-call format compliance, low hallucination |

### Tier 2: Supporting Roles

| Role ID       | Description                                     | Key requirements                             |
|---------------|-------------------------------------------------|----------------------------------------------|
| `embedder`    | Semantic embedding for RAG, search, clustering  | High MTEB scores, must be loaded (not lazy)  |
| `classifier`  | Short-text classification, intent detection     | Fast TTFT, low token budget, reliable format |
| `summarizer`  | Condensing long documents                       | Long context (≥32k), extractive reliability  |
| `critic`      | Reviewing, scoring, or evaluating model outputs | Self-consistency, instruction precision      |
| `transcriber` | Audio-to-text (Whisper-family)                  | WER on domain-specific content               |

### Tier 3: Specialized Roles (project-specific)

These are informed by the current project context (TalkOrigins bibliography pipeline,
Panda's Thumb archive, multi-site search).

| Role ID                 | Description                                          | Key requirements                               |
|-------------------------|------------------------------------------------------|------------------------------------------------|
| `bibliographic_analyst` | Extract, verify, and enrich bibliographic metadata   | Precise instruction following, structured JSON |
| `science_explainer`     | Explain scientific concepts for a general audience   | Factual accuracy, good prose, ≥8k context      |
| `search_query_writer`   | Generate search queries from topic descriptions      | Concise, varied output; fast                   |
| `html_cleaner`          | Identify and convert markup patterns (MT tags, etc.) | Reliable format compliance                     |

---

## Hardware Context

Evaluation should be scoped to what is actually available. Document hardware before
running benchmarks.

Recommended inventory fields per host:

```yaml
host_id: atlas-01
gpu:
  - name: NVIDIA Tesla P40
    vram_gb: 24
    cuda: "8.0"
cpu:
  threads: 24
  model: "Intel Xeon"
ram_gb: 128
fast_storage: true  # NVMe vs. spinning rust matters for model load time
```

Models that do not fit in VRAM will run on CPU or split across GPU+CPU. Note
GPU-only, GPU+CPU, and CPU-only fit status explicitly for each candidate.
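The fit-status note above can be reduced to a rough classification helper. A minimal sketch, assuming a flat 10% headroom over the quantized model size; real fit depends on context length and KV-cache settings, so treat the output as a starting label, not a guarantee:

```python
def fit_status(model_size_gb, vram_gb, ram_gb, headroom=1.10):
    """Classify where a model of the given on-disk size is likely to run."""
    need = model_size_gb * headroom  # headroom is an assumed fudge factor
    if need <= vram_gb:
        return "gpu_only"
    if vram_gb > 0 and need <= vram_gb + ram_gb:
        return "gpu_cpu_split"
    if need <= ram_gb:
        return "cpu_only"
    return "does_not_fit"
```

For example, a ~6 GB Q4 quant on a 24 GB P40 classifies as `gpu_only`, while a 60 GB model on the same host falls back to a GPU+CPU split.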
---

## Candidate Model Selection

For each role tier, select 2–4 candidate models. Selection criteria:

1. **Fits hardware** — VRAM budget for the target host
2. **GGUF available** — for llama.cpp / llamafile deployment
3. **License** — permissive enough for the intended use
4. **Recency** — prefer models released in the last 12 months unless a classic
   substantially outperforms

### Suggested Starting Candidates (as of 2026-04)

**General assistant / reasoning:**
- Qwen3-8B-Q4_K_M (fits P40 at 24 GB, extended thinking available)
- Qwen3-14B-Q4_K_M (fits P40 at ~10 GB VRAM + offload, better reasoning)
- Mistral-7B-Instruct-v0.3 (fast, reliable baseline)

**Code assistant:**
- Qwen2.5-Coder-7B-Instruct
- DeepSeek-Coder-V2-Lite-Instruct (16B MoE, may fit on CPU+GPU split)

**Structured output / tool use:**
- Qwen3-8B (native tool call support)
- functionary-small-v3.2 (purpose-built for tool use)
- Hermes-3-Llama-3.1-8B (strong JSON reliability)

**Embeddings:**
- nomic-embed-text-v1.5 (fast, high MTEB, 137M params)
- mxbai-embed-large-v1 (larger, higher MTEB)
- bge-small-en-v1.5 (smallest, acceptable quality for retrieval)

**Transcription:**
- faster-whisper large-v3 (best WER, GPU accelerated)
- faster-whisper medium.en (faster, smaller, English-only)

---

## Evaluation Protocol

### Phase 1: Deployment fit check

For each candidate:

1. Load the model via llama.cpp or Ollama.
2. Send a minimal completion request to confirm the endpoint is responding.
3. Record:
   - Actual VRAM used (from `nvidia-smi`)
   - Time to first token on a short prompt (~50 tokens)
   - Tokens/sec on a medium completion (~200 tokens)

Pass criterion: TTFT < 5 s, tokens/sec > 5.
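The Phase 1 numbers fall out of three timestamps. A sketch of the arithmetic; in practice the timestamps would come from a streaming completion request, but they are plain floats here so the calculation and the pass criterion are explicit:

```python
def phase1_metrics(t_request, t_first_token, t_done, completion_tokens):
    """Derive TTFT and tokens/sec from request, first-token, and done timestamps."""
    ttft_s = t_first_token - t_request
    gen_s = t_done - t_first_token
    tok_per_s = completion_tokens / gen_s if gen_s > 0 else 0.0
    return {
        "ttft_s": ttft_s,
        "tok_per_s": tok_per_s,
        # Pass criterion per the protocol: TTFT < 5 s, tokens/sec > 5.
        "passed": ttft_s < 5.0 and tok_per_s > 5.0,
    }
```

Note that tokens/sec is measured over the generation window only; including TTFT in the denominator would penalize models with slow prompt processing twice.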
### Phase 2: Role fitness benchmarks

Use GenieHive's built-in benchmark runner for chat roles. Extend it with custom
workloads for each role. Each workload should have 3–5 cases with known expected
outputs or pass criteria.

**Workload design principles:**
- Cases should be representative of the real workload (not toy examples)
- Pass criteria should be checkable without a judge model where possible
  (exact match, JSON parse, regex, non-empty, length bounds)
- Include at least one adversarial case per role (ambiguous prompt, edge input)
- Record `chat_template_kwargs` for models that need them (e.g., Qwen3 thinking)

**Suggested workloads to add to `benchmark_runner.py`:**

```
chat.structured_json        — produce valid JSON matching a schema
chat.tool_call_format       — emit a well-formed function call
chat.code_python            — generate a short working Python function
chat.long_context_recall    — answer from a 16k-token context document
chat.concise_classification — classify text into one of N labels
```

**Embeddings workloads** (separate evaluation script needed):
- Cosine similarity ranking on semantically close/distant pairs
- Retrieval recall@5 on a small fixed corpus
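The recall@5 workload above reduces to a small metric. A sketch assuming per-query document rankings (closest first, e.g. by cosine similarity of embeddings) and known relevant-document sets; the data shapes are illustrative, not part of any GenieHive script:

```python
def recall_at_k(ranked, relevant, k=5):
    """Fraction of relevant docs that appear in the top-k ranked results."""
    hits = sum(1 for doc in ranked[:k] if doc in relevant)
    return hits / len(relevant) if relevant else 0.0

def mean_recall_at_k(queries, k=5):
    """queries: list of (ranked_doc_ids, relevant_doc_ids) pairs."""
    return sum(recall_at_k(r, rel, k) for r, rel in queries) / len(queries)
```

Averaging over a fixed query set keeps candidate embedders comparable run to run, as long as the corpus and queries never change between candidates.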
### Phase 3: Comparative scoring

For each role, rank candidates by:

1. Pass rate (primary)
2. Tokens/sec (secondary, for latency-sensitive roles)
3. TTFT (secondary, for interactive roles)
4. VRAM cost (tie-breaker)

Document the winner and runner-up. Load both into GenieHive's benchmark store
so the routing layer can score them in live operation.
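The Phase 3 ordering maps cleanly onto a lexicographic sort key: higher pass rate and tokens/sec first, then lower TTFT and VRAM. A sketch; the result-dict keys are illustrative:

```python
def rank_candidates(results):
    """Sort candidate result dicts best-first per the Phase 3 criteria."""
    return sorted(
        results,
        # Negate the "higher is better" fields so one ascending sort works.
        key=lambda r: (-r["pass_rate"], -r["tok_s"], r["ttft_ms"], r["vram_gb"]),
    )
```

With the numbers from the results template below, a 0.92 pass rate beats a 0.85 even though the latter is faster, which is the intended primacy of pass rate.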
---

## Integrating Results into GenieHive

After running benchmarks:

1. Emit a JSON benchmark report (use `run_benchmark_workload.py`).
2. Ingest it into the control plane: `python scripts/ingest_benchmark_report.py`.
3. Define a role in `roles.yaml` with `preferred_families` aligned to the
   winning candidate's model family.
4. Verify routing: `GET /v1/cluster/routes/resolve?model=<role_id>` should
   return the winning service.
5. Run a live request through the role to confirm end-to-end behavior.

---

## Evaluation Checklist

```
[ ] Hardware inventory documented for each candidate host
[ ] Candidate models selected per role tier
[ ] Each candidate loaded and Phase 1 deployment check passed
[ ] Custom workloads written for at least Tier 1 roles
[ ] Phase 2 benchmarks run and results recorded
[ ] Results ingested into the GenieHive benchmark store
[ ] Roles defined in roles.yaml matching Phase 3 winners
[ ] End-to-end routing verified for each role
[ ] Results documented in a summary (see template below)
```

---

## Results Summary Template

```markdown
## Evaluation Results — <date>

### Hardware
- Host: <host_id>
- GPU: <name>, <vram_gb> GB VRAM
- RAM: <ram_gb> GB

### Role: general_assistant
| Model           | Pass rate | tok/s | TTFT ms | VRAM GB | Result    |
|-----------------|-----------|-------|---------|---------|-----------|
| Qwen3-8B-Q4_K_M | 0.92      | 38    | 420     | 6.1     | WINNER    |
| Mistral-7B-v0.3 | 0.85      | 52    | 310     | 4.9     | runner-up |

### Role: embedder
| Model                 | Recall@5 | Latency ms | VRAM GB | Result    |
|-----------------------|----------|------------|---------|-----------|
| nomic-embed-text-v1.5 | 0.88     | 12         | 0.3     | WINNER    |
| bge-small-en-v1.5     | 0.79     | 8          | 0.1     | runner-up |

... (repeat for each role)
```

---

## Notes on Model Families and Known Behaviors

**Qwen3 / Qwen3.5:**
GenieHive auto-detects these and sets `enable_thinking: false` unless a role or
asset explicitly overrides it. For the `reasoning` role, set `enable_thinking: true`
in the role's `body_defaults` to engage extended chain-of-thought.

**Mistral / Mixtral:**
Standard instruction format. No special handling needed.

**DeepSeek models:**
Some versions emit a `<think>` block in their output. GenieHive strips
`reasoning_content` from responses but not inline `<think>` blocks. If the
model emits inline thinking that should be hidden from clients, add a
response-cleaning step or configure the model server to suppress it.

**Embedding models via Ollama:**
Ollama's embedding endpoint is `/api/embeddings`, not `/v1/embeddings`. The
current `UpstreamClient` uses the OpenAI-compatible path. When registering
an Ollama embedding service, confirm that the node config points to the correct
endpoint or that the Ollama version supports `/v1/embeddings`.

**llamafile:**
Does not support the embeddings endpoint. Only suitable for chat roles.
docs/roadmap.md (189 changed lines)

@@ -1,34 +1,175 @@
# GenieHive Roadmap
|
# GenieHive Roadmap
|
||||||
|
|
||||||
## Completed Foundations
|
Last updated: 2026-04-27
|
||||||
|
|
||||||
- control-plane registry with SQLite persistence
|
## What Is Complete
|
||||||
- node registration and heartbeat
|
|
||||||
- role catalog and route resolution
|
|
||||||
- client-facing `GET /v1/models`
|
|
||||||
- client-facing `POST /v1/chat/completions`
|
|
||||||
- client-facing `POST /v1/embeddings`
|
|
||||||
- first control-plus-node demo flow
|
|
||||||
|
|
||||||
## Immediate Next Milestones
|
The v1 core is implemented and tested.
|
||||||
|
|
||||||
1. Run and document the first live LLM demo against real upstream servers.
|
**Registry and cluster control:**
|
||||||
2. Validate the `GET /v1/models` metadata as a Codex-friendly offload catalog for lower-complexity tasks.
|
- SQLite-backed registry with hosts, services, roles, and benchmark samples
|
||||||
3. Add `POST /v1/audio/transcriptions`.
|
- Node registration and heartbeat protocol with auto-re-registration on 404
|
||||||
4. Add a richer node metrics model for queue depth, current load, and observed performance over time.
|
- Role catalog loading from YAML
|
||||||
5. Add a stronger operator/client distinction in the public metadata and auth surfaces.
|
- Route resolution: direct asset/service match → role resolution → clear failure
|
||||||
|
|
||||||
## LLM Demo Note
|
**Client-facing API:**
|
||||||
|
- `GET /v1/models` — OpenAI-compatible model list with rich metadata (loaded state,
|
||||||
|
latency hints, offload classification, role aliases)
|
||||||
|
- `POST /v1/chat/completions` — proxies to upstream with request policy application
|
||||||
|
- `POST /v1/embeddings` — proxies to upstream
|
||||||
|
|
||||||
The project is now ready for a first live LLM demo using GenieHive as:
|
**Request policy system:**
|
||||||
|
- Body defaults and overrides via deep merge
|
||||||
|
- System prompt injection (prepend / append / replace)
|
||||||
|
- Per-asset and per-role policies, merged with role winning on prompts
|
||||||
|
- Qwen3 / Qwen3.5 auto-detection with `enable_thinking: false` applied automatically
|
||||||
|
|
||||||
- master: control plane
|
**Route matching and scoring:**
|
||||||
- peer: one or more node agents with pre-existing local LLM servers
|
- `POST /v1/cluster/routes/match` — scored candidate list for role and service targets
|
||||||
- client: a small demo agent or Codex configured against GenieHive
|
- Signals: text overlap, preferred family, runtime (loaded state, latency, throughput,
|
||||||
|
queue depth), benchmark (workload overlap, quality score)
|
||||||
|
- `GET /v1/cluster/routes/resolve` — quick single-model resolution
|
||||||
|
|
||||||
The current live-demo priority is chat-first. Embeddings are also wired in GenieHive, but upstream compatibility differs across local servers, so the safest first demo matrix is:
|
**Benchmark infrastructure:**
|
||||||
|
- Built-in workloads: `chat.short_reasoning`, `chat.concise_support`
|
||||||
|
- `run_benchmark_workload.py` executes workloads and emits a JSON report
|
||||||
|
- `ingest_benchmark_report.py` posts results to the control plane
|
||||||
|
- Benchmark samples feed the route scoring pipeline
|
||||||
|
|
||||||
- Ollama for chat and embeddings
|
**Operator inspection:**
|
||||||
- vLLM for chat and embeddings
|
- `GET /v1/cluster/hosts`, `/services`, `/roles`, `/benchmarks`, `/health`
|
||||||
- llama.cpp for chat
|
|
||||||
- llamafile for chat
|
**Auth:**
|
||||||
|
- Client API key (`X-Api-Key`) and node registration key (`X-GenieHive-Node-Key`)
|
||||||
|
- Empty key lists disable auth for development
|
||||||
|
|
||||||
|
**Tests:**
|
||||||
|
- Registry, chat proxy, node inventory, benchmark runner, full demo flow
|
||||||
|
- All passing
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Known Gaps and Issues
|
||||||
|
|
||||||
|
These are confirmed gaps in the current implementation, not aspirational items.
|
||||||
|
|
||||||
|
### 1. Transcription endpoint not implemented
|
||||||
|
|
||||||
|
`POST /v1/audio/transcriptions` is listed in the architecture and wired into
|
||||||
|
`main.py`, but there is no upstream proxy handler for it. `upstream.py` has no
|
||||||
|
`transcriptions()` method. The endpoint currently returns nothing useful.
|
||||||
|
|
||||||
|
### 2. Routing strategy field is ignored
|
||||||
|
|
||||||
|
`RoutingConfig.default_strategy` exists in `config.py` (default: `"loaded_first"`),
|
||||||
|
but `resolve_route()` in `registry.py` does not read it. There is effectively only
|
||||||
|
one strategy. The field is misleading.
|
||||||
|
|
||||||
|
### 3. Role fallback chain is not implemented
|
||||||
|
|
||||||
|
`RoutingPolicy.fallback_roles` is defined in `models.py` and appears in the schema
|
||||||
|
docs, but `resolve_route()` never consults it. A role that fails to match any service
|
||||||
|
fails outright rather than trying its fallbacks.

### 4. `_benchmark_quality_score` can exceed 1.0 before clamping

`pass_rate` and `quality_score` are taken as `max()`, then `tokens_per_sec` and
`ttft_ms` are *added* on top. A service with `pass_rate=1.0`, fast tokens, and low
TTFT accumulates a score of up to 1.6 before the final `min(1.0, quality)` clamp.
This means the additive bonuses have no effect once `pass_rate` or `quality_score` is
already high, which is probably not the intended behavior.
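A minimal reproduction of the overflow, with assumed bonus weights chosen to match the stated 1.6 ceiling (the real constants live in the scoring code, and the function name here is a stand-in for the private `_benchmark_quality_score`):

```python
# Illustrative sketch only: weights are assumptions matching the 1.6 ceiling.
def quality_score_current(pass_rate: float, quality_score: float,
                          tokens_bonus: float, ttft_bonus: float) -> float:
    quality = max(pass_rate, quality_score)  # base signal in [0, 1]
    quality += tokens_bonus + ttft_bonus     # bonuses stack additively on top
    return min(1.0, quality)                 # clamp silently swallows the excess
```

With `pass_rate=1.0` and both bonuses at an assumed 0.3 each, the pre-clamp value is 1.6 but the returned score is still 1.0, identical to a service with no bonuses at all.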

### 5. Health is self-reported only

Service health (`healthy` / `unhealthy`) comes entirely from node-reported state.
The control plane does not probe upstream endpoints. A service can appear healthy
while its endpoint is unreachable.

### 6. No active model discovery from upstream services

The node agent scans for `.gguf` files on disk and reads static service config.
It does not query running Ollama or vLLM instances for their loaded model list.
A freshly pulled Ollama model will not appear until the node config is updated
and the agent restarted.

### 7. `docs/architecture.md` duplicates `GENIEWARREN_SPEC.md`

`architecture.md` contains the repo-naming rationale, name alternatives, and
implementation sequence list, all of which are only meaningful in a design/proposal
context. These are noise in a reference architecture document.

---

## Immediate Next Work (Priority Order)

### P0 — Fix confirmed bugs

1. **Remove the misleading `default_strategy` field** or implement a dispatch table
   so the config field actually selects behavior. Simplest fix: delete the field and
   the dead config surface until a second strategy is implemented.
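If the field is kept, the dispatch-table option could look like the sketch below. The strategy functions, the `Service` shape, and `select()` are illustrative assumptions, not the real `registry.py` API:

```python
from typing import Callable

Service = dict  # stand-in for the real service record type

def loaded_first(candidates: list[Service]) -> Service:
    """Current behavior: prefer services that already have the model loaded."""
    loaded = [s for s in candidates if s.get("loaded")]
    return (loaded or candidates)[0]

def least_loaded(candidates: list[Service]) -> Service:
    """A possible second strategy: fewest in-flight requests wins."""
    return min(candidates, key=lambda s: s.get("in_flight", 0))

STRATEGIES: dict[str, Callable[[list[Service]], Service]] = {
    "loaded_first": loaded_first,
    "least_loaded": least_loaded,
}

def select(candidates: list[Service], strategy: str = "loaded_first") -> Service:
    """Resolve the configured strategy name to actual routing behavior."""
    try:
        pick = STRATEGIES[strategy]
    except KeyError:
        raise ValueError(f"unknown routing strategy: {strategy}")
    return pick(candidates)
```

An unknown strategy name then fails loudly at resolve time instead of being silently ignored.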

2. **Fix `_benchmark_quality_score`** so additive bonuses apply only when no
   `pass_rate` / `quality_score` is available, or restructure as a weighted average
   so the components don't stack additively.
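A weighted-average restructure could look like this sketch. The weights and the 0-to-1 normalizations of `tokens_per_sec` and `ttft_ms` are illustrative assumptions, not the existing code's constants:

```python
def benchmark_quality_score(pass_rate=None, quality_score=None,
                            tokens_per_sec=None, ttft_ms=None) -> float:
    """Weighted average of whichever signals exist; result is always in [0, 1]."""
    parts: list[tuple[float, float]] = []  # (weight, value in [0, 1])
    if pass_rate is not None or quality_score is not None:
        base = max(pass_rate or 0.0, quality_score or 0.0)
        parts.append((0.7, base))
    if tokens_per_sec is not None:
        parts.append((0.2, min(tokens_per_sec / 100.0, 1.0)))  # assumed 100 t/s cap
    if ttft_ms is not None:
        parts.append((0.1, max(0.0, 1.0 - ttft_ms / 2000.0)))  # assumed 2 s floor
    if not parts:
        return 0.0
    total_weight = sum(w for w, _ in parts)
    return sum(w * v for w, v in parts) / total_weight
```

Because the weights are renormalized over the present signals, no clamp is needed and each component always influences the score.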

### P1 — Complete stated v1 scope

3. **Implement transcription proxy** — add `upstream.transcriptions()` and wire
   the handler in `chat.py` and `main.py`.
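A stdlib-only sketch of the missing proxy, assuming the upstream speaks the OpenAI-compatible `POST /v1/audio/transcriptions` multipart API; the real `upstream.py` would presumably reuse whatever HTTP client its other proxy methods use:

```python
import json
import urllib.request
import uuid

def build_multipart(model: str, filename: str, audio: bytes) -> tuple[bytes, str]:
    """Encode the model field and the audio file as multipart/form-data."""
    boundary = uuid.uuid4().hex
    parts = [
        (f"--{boundary}\r\nContent-Disposition: form-data; "
         f"name=\"model\"\r\n\r\n{model}\r\n").encode(),
        (f"--{boundary}\r\nContent-Disposition: form-data; name=\"file\"; "
         f"filename=\"{filename}\"\r\n"
         f"Content-Type: application/octet-stream\r\n\r\n").encode()
        + audio + b"\r\n",
        f"--{boundary}--\r\n".encode(),
    ]
    return b"".join(parts), f"multipart/form-data; boundary={boundary}"

def transcriptions(base_url: str, model: str, filename: str, audio: bytes) -> dict:
    """Forward an audio file to an upstream transcription endpoint."""
    body, content_type = build_multipart(model, filename, audio)
    req = urllib.request.Request(
        base_url + "/v1/audio/transcriptions",
        data=body,
        headers={"Content-Type": content_type},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=300) as resp:
        return json.load(resp)
```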

4. **Implement role fallback chain** — when `resolve_route()` finds no matching
   service for a role, walk `fallback_roles` in order before failing.
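The fallback walk itself is small. In this sketch, `find_service` stands in for the real per-role lookup inside `resolve_route()`, and the fallback table mirrors the `fallback_roles` field:

```python
from typing import Callable, Optional

def resolve_with_fallback(role: str,
                          fallback_roles: dict[str, list[str]],
                          find_service: Callable[[str], Optional[str]]) -> str:
    """Try the requested role, then each configured fallback, in order."""
    for candidate in [role, *fallback_roles.get(role, [])]:
        service = find_service(candidate)
        if service is not None:
            return service
    raise LookupError(f"no service for role {role!r} or its fallbacks")
```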

### P2 — Close the most important self-reported-only gaps

5. **Add active health probing** — the control plane should periodically probe
   registered service endpoints (a lightweight `GET /health` or `GET /v1/models`
   is sufficient) and update health state independently of node heartbeats.
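A stdlib-only sketch of the probe loop; the real control plane would likely run this on a timer with its own async HTTP client, and the service-table shape here is an assumption:

```python
import urllib.error
import urllib.request

def probe_service(base_url: str, timeout: float = 5.0) -> bool:
    """A service counts as healthy if /health or /v1/models answers 2xx."""
    for path in ("/health", "/v1/models"):
        try:
            with urllib.request.urlopen(base_url + path, timeout=timeout) as resp:
                if 200 <= resp.status < 300:
                    return True
        except (urllib.error.URLError, OSError):
            continue  # try the next probe path
    return False

def probe_all(services: dict[str, str]) -> dict[str, bool]:
    """Probe every registered endpoint; health no longer depends on heartbeats."""
    return {name: probe_service(url) for name, url in services.items()}
```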

6. **Add upstream model discovery for Ollama** — query `GET /api/tags` (Ollama)
   or `GET /v1/models` (OpenAI-compatible) from the node agent and merge loaded
   model names into the service's asset list. This enables dynamic model tracking
   without config restarts.
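Ollama's `GET /api/tags` returns a JSON object with a `models` array of `{"name": ...}` entries. A sketch of the discovery and merge, with the asset-list merge semantics being an assumption:

```python
import json
import urllib.request

def discover_ollama_models(base_url: str, timeout: float = 5.0) -> list[str]:
    """Return model names currently known to an Ollama instance."""
    with urllib.request.urlopen(base_url + "/api/tags", timeout=timeout) as resp:
        payload = json.load(resp)
    return [m["name"] for m in payload.get("models", [])]

def merge_assets(static_assets: list[str], discovered: list[str]) -> list[str]:
    """Union static config assets with live-discovered models, order preserved."""
    seen: set[str] = set()
    merged: list[str] = []
    for name in [*static_assets, *discovered]:
        if name not in seen:
            seen.add(name)
            merged.append(name)
    return merged
```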

### P3 — Documentation cleanup

7. **Revise `architecture.md`** — remove the design-phase repo-naming rationale
   and first-implementation-sequence list; replace with a description of the actual
   running system (the four layers as implemented, data flow diagram if possible).

8. **Update `roadmap.md`** — this file (done).

---

## Near-Term Milestones (After P0–P3)

- **Live LLM demo** — run control + node against a real upstream (Ollama or
  llama.cpp) and document the end-to-end flow, including chat via role and
  direct asset addressing
- **Validate Codex-friendly `/v1/models` offload** — test `GET /v1/models` as
  a programmatic service catalog for a Claude Code or Codex client selecting
  a GenieHive-hosted model for lower-complexity subtasks
- **Richer node metrics** — queue depth, in-flight count, and rolling performance
  averages reported from node to control on every heartbeat
- **Second routing strategy** — implement `round_robin` or `least_loaded` as a
  second selectable strategy, then make `default_strategy` actually dispatch

---

## V1.5 Scope (Not Yet Started)

- mTLS between control plane and node agents
- Scoped client tokens (read-only vs. operator vs. admin)
- Active load-aware model swapping (trigger unload/load on a node based on demand)
- Image and TTS generation adapter stubs
- Streaming response passthrough for chat completions

---

## Non-Goals (Unchanged from Original Spec)

- Peer-to-peer consensus
- Autonomous global model swapping across many nodes
- Full WAN zero-trust platform
- Distributed vector database management
- Billing or multi-tenant quota accounting