Revise architecture/roadmap docs and add LLM evaluation guide

- architecture.md: rewrite to describe the actual running system; remove
  design-phase repo-naming discussion and initial-implementation-sequence
  list; add data-flow diagram, scoring weights table, API status table
- roadmap.md: replace aspirational list with concrete completed/gap/next
  structure; document four confirmed implementation gaps (transcription
  stub, strategy field ignored, fallback_roles unimplemented, benchmark
  quality score additive overflow); prioritise fixes as P0/P1/P2/P3
- docs/local_llm_evaluation.md: new document; role taxonomy (tier 1–3),
  hardware inventory template, candidate model suggestions, three-phase
  evaluation protocol, GenieHive integration steps, results template,
  notes on Qwen3/Mistral/DeepSeek/Ollama embedding path quirks

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
welberr 2026-04-27 09:25:51 -04:00
parent e36650a017
commit a76c7e81f4
3 changed files with 579 additions and 167 deletions

docs/architecture.md

@@ -1,74 +1,46 @@
# GenieHive Architecture

Last updated: 2026-04-27

## Mission

GenieHive is a local-first control plane for heterogeneous generative AI services
running across one or more hosts. It provides:

- Registration and health tracking for distributed AI services
- A stable, OpenAI-compatible client-facing API
- Role-based routing and scheduling over multiple services
- Integrated benchmarking and performance-informed route scoring

It is not a plain OpenAI-compatible gateway. The control plane layer adds topology
awareness, role abstraction, and signal-driven routing that a dumb proxy does not
provide.

---

## Four Layers

```
┌─────────────────────────────────────────────┐
│ Client Facades                              │
│ OpenAI-compatible completions + embeddings  │
│ Operator inspection API                     │
├─────────────────────────────────────────────┤
│ Control API                                 │
│ Registry · Role catalog · Route resolution  │
│ Scheduling · Benchmark store                │
├─────────────────────────────────────────────┤
│ Node Agent(s)                               │
│ Host discovery · Service enumeration        │
│ Telemetry reporting · Heartbeat             │
├─────────────────────────────────────────────┤
│ Provider Adapters                           │
│ OpenAI-compatible chat / embeddings         │
│ Transcription (partial)                     │
└─────────────────────────────────────────────┘
```

---

## Core Concepts
@@ -78,117 +50,165 @@ A physical or virtual machine participating in the cluster.
### Service

A concrete callable capability on a host: a chat endpoint, an embeddings endpoint,
or a transcription endpoint. A host typically exposes multiple services.

### Asset

A model weight, model name, or runtime target that a service can serve. Assets carry
optional `request_policy` fields that adjust how requests are shaped before forwarding.

### Role

A reusable task profile that describes *how* requests should be fulfilled, not *which*
model fills them. A role has a prompt policy (system prompt injection, body defaults)
and a routing policy (preferred model families, minimum context size, loaded-first
preference). The same role can route to different services as cluster state changes.

### Route Resolution

1. If `model` matches a loaded, healthy asset or service alias → route directly.
2. If `model` matches a known role → score eligible services and route to the best.
3. Otherwise → fail with a clear 404.
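A minimal sketch of this order in Python. The types, field names, and helper shapes here are illustrative, not GenieHive's actual `registry.py` internals:

```python
from dataclasses import dataclass

# Illustrative only: Service fields and the roles mapping are assumptions.
@dataclass
class Service:
    alias: str
    kind: str        # "chat", "embeddings", ...
    healthy: bool
    loaded: bool
    score: float     # combined runtime + benchmark score

def resolve_route(model: str, kind: str, services: list[Service],
                  roles: dict[str, list[str]]) -> Service:
    """Apply the three-step order: direct match, role match, clear failure."""
    # 1. Direct: the requested name is a loaded, healthy service alias.
    for svc in services:
        if svc.alias == model and svc.kind == kind and svc.healthy and svc.loaded:
            return svc
    # 2. Role: score eligible (healthy, kind-matching) services, pick the best.
    if model in roles:
        eligible = [s for s in services
                    if s.kind == kind and s.healthy and s.alias in roles[model]]
        if eligible:
            return max(eligible, key=lambda s: s.score)
    # 3. Neither a model nor a role: fail clearly (mapped to HTTP 404 upstream).
    raise LookupError(f"unknown model or role: {model}")
```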
---
## Data Flow: Chat Completion

```
Client POST /v1/chat/completions
resolve_route(model, kind="chat")
  ├─ direct: asset_id or service alias match
  └─ role: filter by kind/health → score by runtime + benchmark signals
apply_request_policy(request, asset, role)
  ├─ deep-merge body_defaults
  ├─ apply system prompt (prepend / append / replace)
  └─ auto-infer Qwen3 template kwargs if needed
UpstreamClient.chat_completions(endpoint, modified_request)
_strip_reasoning_fields(response)   ← removes reasoning_content / reasoning
Response to client
```
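The `apply_request_policy` step can be pictured as below. A sketch assuming plain-dict request bodies; the function names are illustrative rather than taken from the codebase:

```python
import copy

def deep_merge(defaults: dict, request: dict) -> dict:
    """Recursively lay defaults under the request; request values win."""
    merged = copy.deepcopy(defaults)
    for key, value in request.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged

def apply_system_prompt(body: dict, prompt: str, mode: str) -> dict:
    """Inject a role's system prompt: prepend, append, or replace."""
    messages = list(body.get("messages", []))
    existing = [m for m in messages if m.get("role") == "system"]
    rest = [m for m in messages if m.get("role") != "system"]
    if mode == "replace" or not existing:
        system_text = prompt
    elif mode == "prepend":
        system_text = prompt + "\n" + existing[0]["content"]
    else:  # append
        system_text = existing[0]["content"] + "\n" + prompt
    body["messages"] = [{"role": "system", "content": system_text}] + rest
    return body
```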
---

## Scoring
Route scoring blends weighted signal families; the weights differ between role-based
and direct-service scoring:
| Signal family | Weight (role) | Weight (service) |
|----------------|---------------|-----------------|
| Text overlap | 30% | 20% |
| Runtime | 30% | 45% |
| Benchmark | 25% | 35% |
| Family pref. | 15% | — |
**Runtime signals** (from last heartbeat):
- Loaded state: +0.35
- Latency bands: p50 <500 ms +0.30, <1500 ms +0.20, <3000 ms +0.10, else +0.05
- Throughput: ≥40 tok/s +0.20, ≥20 +0.10
- Queue depth: penalty 0.20 if ≥5, 0.10 if ≥2
**Benchmark signals** (from ingested workload runs):
- Workload overlap score (Jaccard-style token overlap)
- Quality score from results: `0.45 * overlap + 0.55 * quality`
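The runtime bullets above translate directly into a small scoring function. This is a readable restatement of the listed numbers, not the control plane's actual code:

```python
def runtime_signal_score(loaded: bool, p50_latency_ms: float,
                         tokens_per_sec: float, queue_depth: int) -> float:
    score = 0.0
    if loaded:
        score += 0.35
    # Latency bands from the list above.
    if p50_latency_ms < 500:
        score += 0.30
    elif p50_latency_ms < 1500:
        score += 0.20
    elif p50_latency_ms < 3000:
        score += 0.10
    else:
        score += 0.05
    # Throughput bonuses.
    if tokens_per_sec >= 40:
        score += 0.20
    elif tokens_per_sec >= 20:
        score += 0.10
    # Queue-depth penalties.
    if queue_depth >= 5:
        score -= 0.20
    elif queue_depth >= 2:
        score -= 0.10
    return max(0.0, score)
```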
---
## Topology

**Minimum viable (single machine):**
```
control plane + node agent + model server
all on 127.0.0.1, different ports
```
**Recommended (small cluster):**

```
1 control plane host
2+ node-agent hosts, each with 1+ model servers
1+ clients on LAN
```

**Auth:**
- Client requests: `X-Api-Key` header
- Node registration/heartbeat: `X-GenieHive-Node-Key` header
- Empty key lists disable auth (development only)
- mTLS between control and nodes planned for v1.5
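For example, the two auth surfaces from the caller's side; host, port, key values, and the heartbeat payload shape are all placeholders for your deployment:

```python
import requests

CONTROL = "http://127.0.0.1:8000"  # placeholder control plane address

# Client-facing call:
models = requests.get(f"{CONTROL}/v1/models",
                      headers={"X-Api-Key": "your-client-key"})

# Node-facing call (registration/heartbeat):
beat = requests.post(f"{CONTROL}/v1/nodes/heartbeat",
                     headers={"X-GenieHive-Node-Key": "your-node-key"},
                     json={"host_id": "atlas-01"})  # payload shape assumed
```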
---
## State Store
SQLite. Schema:
| Table | Content |
|---------------------|-------------------------------------------|
| `hosts` | Host registration, resources, labels |
| `services` | Service config, runtime, assets, observed |
| `roles` | Role catalog |
| `benchmark_samples` | Workload results per service |
Default path: `state/geniehive.sqlite3`
---
## API Reference Summary
### Client API

| Endpoint                        | Status      |
|---------------------------------|-------------|
| `GET /v1/models`                | Implemented |
| `POST /v1/chat/completions`     | Implemented |
| `POST /v1/embeddings`           | Implemented |
| `POST /v1/audio/transcriptions` | Stub only   |
### Operator API

| Endpoint                         | Status      |
|----------------------------------|-------------|
| `GET /v1/cluster/hosts`          | Implemented |
| `GET /v1/cluster/services`       | Implemented |
| `GET /v1/cluster/roles`          | Implemented |
| `GET /v1/cluster/benchmarks`     | Implemented |
| `GET /v1/cluster/health`         | Implemented |
| `GET /v1/cluster/routes/resolve` | Implemented |
| `POST /v1/cluster/routes/match`  | Implemented |
### Node API

| Endpoint | Status |
|------------------------------|-------------|
| `POST /v1/nodes/register` | Implemented |
| `POST /v1/nodes/heartbeat` | Implemented |
| `GET /v1/node/inventory` | Implemented |
| `GET /v1/node/registration` | Implemented |
---
## Supported Upstream Backends

Any OpenAI-compatible HTTP server. Tested configurations:

- **Ollama** — chat and embeddings
- **llama.cpp** (server mode) — chat and embeddings
- **llamafile** — chat
- **vLLM** — chat and embeddings

---

## Non-Goals for V1

- Peer-to-peer consensus
- Autonomous global model swapping
- WAN zero-trust networking
- Image and TTS generation
- Distributed vector databases
- Billing or multi-tenant quotas

docs/local_llm_evaluation.md

@@ -0,0 +1,251 @@
# Local LLM Evaluation for GenieHive Agent Roles
Last updated: 2026-04-27
## Purpose
This document describes a framework for evaluating locally-hosted LLMs against the
roles that GenieHive needs to fulfill in a multi-agent or tool-use pipeline. The goal
is to determine which models are fit for which roles given available hardware, and
to produce benchmark data that GenieHive's own routing layer can consume.
---
## Role Taxonomy
GenieHive routes by role. Before evaluating models, the roles likely needed in an
agent pipeline must be defined. The following taxonomy covers the most common cases.
### Tier 1: Core Inference Roles
| Role ID | Description | Key requirements |
|----------------------|----------------------------------------------------------|---------------------------------------------------|
| `general_assistant`  | General-purpose instruction following, Q&A, summarization | Good instruction following, ≥8k context |
| `reasoning` | Multi-step problem solving, chain-of-thought tasks | Extended thinking, ≥16k context, slow OK |
| `code_assistant` | Code generation, explanation, debugging | Strong code benchmarks, fill-in-middle optional |
| `structured_output` | JSON/schema-constrained generation | Grammar sampling or reliable JSON mode |
| `tool_use` | Tool/function call formatting and parsing | Function call format compliance, low hallucination|
### Tier 2: Supporting Roles
| Role ID | Description | Key requirements |
|----------------------|----------------------------------------------------------|---------------------------------------------------|
| `embedder` | Semantic embedding for RAG, search, clustering | High MTEB scores, must be loaded (not lazy) |
| `classifier` | Short-text classification, intent detection | Fast TTFT, low token budget, reliable format |
| `summarizer` | Condensing long documents | Long context (≥32k), extractive reliability |
| `critic` | Reviewing, scoring, or evaluating model outputs | Self-consistency, instruction precision |
| `transcriber` | Audio-to-text (Whisper-family) | WER on domain-specific content |
### Tier 3: Specialized Roles (project-specific)
These are informed by the current project context (TalkOrigins bibliography pipeline,
Panda's Thumb archive, multi-site search).
| Role ID | Description | Key requirements |
|------------------------|------------------------------------------------------------|------------------------------------------------|
| `bibliographic_analyst`| Extract, verify, and enrich bibliographic metadata | Precise instruction following, structured JSON |
| `science_explainer` | Explain scientific concepts for a general audience | Factual accuracy, good prose, ≥8k context |
| `search_query_writer` | Generate search queries from topic descriptions | Concise, varied output; fast |
| `html_cleaner` | Identify and convert markup patterns (MT tags, etc.) | Reliable format compliance |
---
## Hardware Context
Evaluation should be scoped to what is actually available. Document hardware before
running benchmarks.
Recommended inventory fields per host:
```yaml
host_id: atlas-01
gpu:
- name: NVIDIA Tesla P40
vram_gb: 24
cuda: "8.0"
cpu:
threads: 24
model: "Intel Xeon"
ram_gb: 128
fast_storage: true # NVMe vs. spinning rust matters for model load time
```
Models that do not fit in VRAM will run on CPU or split across GPU+CPU. Note
GPU-only, GPU+CPU, and CPU-only fit status explicitly for each candidate.
---
## Candidate Model Selection
For each role tier, select 2–4 candidate models. Selection criteria:
1. **Fits hardware** — VRAM budget for the target host
2. **GGUF available** — for llama.cpp / llamafile deployment
3. **License** — permissive enough for intended use
4. **Recency** — prefer models released in the last 12 months unless a classic
substantially outperforms
### Suggested Starting Candidates (as of 2026-04)
**General assistant / reasoning:**
- Qwen3-8B-Q4_K_M (fits P40 at 24 GB, extended thinking available)
- Qwen3-14B-Q4_K_M (fits P40 at ~10 GB VRAM + offload, better reasoning)
- Mistral-7B-Instruct-v0.3 (fast, reliable baseline)
**Code assistant:**
- Qwen2.5-Coder-7B-Instruct
- DeepSeek-Coder-V2-Lite-Instruct (16B MoE, may fit on CPU+GPU split)
**Structured output / tool use:**
- Qwen3-8B (native tool call support)
- functionary-small-v3.2 (purpose-built for tool use)
- Hermes-3-Llama-3.1-8B (strong JSON reliability)
**Embeddings:**
- nomic-embed-text-v1.5 (fast, high MTEB, 137M params)
- mxbai-embed-large-v1 (larger, higher MTEB)
- bge-small-en-v1.5 (smallest, acceptable quality for retrieval)
**Transcription:**
- faster-whisper large-v3 (best WER, GPU accelerated)
- faster-whisper medium.en (faster, smaller, English-only)
---
## Evaluation Protocol
### Phase 1: Deployment fit check
For each candidate:
1. Load the model via llama.cpp or Ollama.
2. Send a minimal completion request to confirm the endpoint is responding.
3. Record:
- Actual VRAM used (from `nvidia-smi`)
- Time to first token on a short prompt (~50 tokens)
- Tokens/sec on a medium completion (~200 tokens)
Pass criteria: TTFT < 5 s, tokens/sec > 5.
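A rough probe script for these measurements against any OpenAI-compatible endpoint. The base URL and model tag are placeholders (the port shown is Ollama's default), and the streamed chunk count is used as a token-count approximation:

```python
import json, time
import requests

def probe(base_url: str, model: str, prompt: str, max_tokens: int = 200):
    t0 = time.monotonic()
    ttft = None
    n_tokens = 0
    resp = requests.post(
        f"{base_url}/v1/chat/completions",
        json={"model": model, "stream": True, "max_tokens": max_tokens,
              "messages": [{"role": "user", "content": prompt}]},
        stream=True, timeout=120,
    )
    for line in resp.iter_lines():
        if not line or not line.startswith(b"data: "):
            continue
        payload = line[len(b"data: "):]
        if payload == b"[DONE]":
            break
        chunk = json.loads(payload)
        if chunk["choices"][0]["delta"].get("content"):
            if ttft is None:
                ttft = time.monotonic() - t0   # time to first token
            n_tokens += 1                      # one streamed chunk ≈ one token
    elapsed = time.monotonic() - t0
    return ttft, n_tokens / max(elapsed - (ttft or 0.0), 1e-6)

ttft, tps = probe("http://127.0.0.1:11434", "qwen3:8b", "Name three primes.")
print(f"TTFT {ttft:.2f}s, ~{tps:.1f} tok/s")
```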
### Phase 2: Role fitness benchmarks
Use GenieHive's built-in benchmark runner for chat roles. Extend with custom
workloads for each role. Each workload should have 3–5 cases with known expected
outputs or pass criteria.
**Workload design principles:**
- Cases should be representative of real workload (not toy examples)
- Pass criteria should be checkable without a judge model where possible
(exact match, JSON parse, regex, non-empty, length bounds)
- Include at least one adversarial case per role (ambiguous prompt, edge input)
- Record `chat_template_kwargs` for models that need them (e.g., Qwen3 thinking)
**Suggested workloads to add to `benchmark_runner.py`:**
```
chat.structured_json — produce valid JSON matching a schema
chat.tool_call_format — emit a well-formed function call
chat.code_python — generate a short working Python function
chat.long_context_recall — answer from a 16k-token context document
chat.concise_classification — classify text into one of N labels
```
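As an example of a judge-free pass check, here is one hypothetical shape for a `chat.structured_json` case. The actual case schema in `benchmark_runner.py` may differ; the point is that the check is pure code:

```python
import json

# Hypothetical case definition; field names are illustrative.
case = {
    "id": "structured_json/contact_card",
    "prompt": ("Return a JSON object with keys 'name' and 'email' for: "
               "Jane Doe, jane@example.org. Output JSON only."),
    "chat_template_kwargs": {"enable_thinking": False},  # e.g. for Qwen3
}

def passes(output: str) -> bool:
    """Pass check: parseable JSON with the two expected values."""
    try:
        obj = json.loads(output)
    except json.JSONDecodeError:
        return False
    return obj.get("name") == "Jane Doe" and obj.get("email") == "jane@example.org"
```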
**Embeddings workloads** (separate evaluation script needed):
- Cosine similarity ranking on semantically close/distant pairs
- Retrieval recall@5 on a small fixed corpus
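A minimal recall@5 helper for that script, assuming query and corpus vectors have already been produced by the candidate embedder and `relevant[i]` is the index of the one correct document for query `i`:

```python
import numpy as np

def recall_at_5(queries: np.ndarray, corpus: np.ndarray,
                relevant: list[int]) -> float:
    q = queries / np.linalg.norm(queries, axis=1, keepdims=True)
    c = corpus / np.linalg.norm(corpus, axis=1, keepdims=True)
    sims = q @ c.T                           # cosine similarity matrix
    top5 = np.argsort(-sims, axis=1)[:, :5]  # best 5 docs per query
    hits = sum(rel in row for row, rel in zip(top5, relevant))
    return hits / len(relevant)
```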
### Phase 3: Comparative scoring
For each role, rank candidates by:
1. Pass rate (primary)
2. Tokens/sec (secondary, for latency-sensitive roles)
3. TTFT (secondary, for interactive roles)
4. VRAM cost (tie-breaker)
Document the winner and runner-up. Load both into GenieHive's benchmark store
so the routing layer can score them in live operation.
---
## Integrating Results into GenieHive
After running benchmarks:
1. Emit a JSON benchmark report (use `run_benchmark_workload.py`).
2. Ingest into the control plane: `python scripts/ingest_benchmark_report.py`.
3. Define a role in `roles.yaml` with `preferred_families` aligned to the
winning candidate's model family.
4. Verify routing: `GET /v1/cluster/routes/resolve?model=<role_id>` should
return the winning service.
5. Run a live request through the role to confirm end-to-end.
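Steps 4 and 5 can be scripted; the host, port, key, and role name below are placeholders:

```python
import requests

base = "http://127.0.0.1:8000"
headers = {"X-Api-Key": "your-client-key"}

# Step 4: the resolver should name the winning service for the role.
r = requests.get(f"{base}/v1/cluster/routes/resolve",
                 params={"model": "general_assistant"}, headers=headers)
r.raise_for_status()
print(r.json())

# Step 5: a live request through the role, end to end.
r = requests.post(f"{base}/v1/chat/completions", headers=headers,
                  json={"model": "general_assistant",
                        "messages": [{"role": "user", "content": "ping"}]})
r.raise_for_status()
print(r.json()["choices"][0]["message"]["content"])
```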
---
## Evaluation Checklist
```
[ ] Hardware inventory documented for each candidate host
[ ] Candidate models selected per role tier
[ ] Each candidate loaded and Phase 1 deployment check passed
[ ] Custom workloads written for at least Tier 1 roles
[ ] Phase 2 benchmarks run and results recorded
[ ] Results ingested into GenieHive benchmark store
[ ] Roles defined in roles.yaml matching Phase 3 winners
[ ] End-to-end routing verified for each role
[ ] Results documented in a summary (see template below)
```
---
## Results Summary Template
```markdown
## Evaluation Results — <date>
### Hardware
- Host: <host_id>
- GPU: <name>, <vram_gb> GB VRAM
- RAM: <ram_gb> GB
### Role: general_assistant
| Model | Pass rate | tok/s | TTFT ms | VRAM GB | Result |
|--------------------|-----------|-------|---------|---------|---------|
| Qwen3-8B-Q4_K_M | 0.92 | 38 | 420 | 6.1 | WINNER |
| Mistral-7B-v0.3 | 0.85 | 52 | 310 | 4.9 | runner-up |
### Role: embedder
| Model | Recall@5 | Latency ms | VRAM GB | Result |
|------------------------|----------|------------|---------|---------|
| nomic-embed-text-v1.5 | 0.88 | 12 | 0.3 | WINNER |
| bge-small-en-v1.5 | 0.79 | 8 | 0.1 | runner-up |
... (repeat for each role)
```
---
## Notes on Model Families and Known Behaviors
**Qwen3 / Qwen3.5:**
GenieHive auto-detects these and sets `enable_thinking: false` unless a role or
asset explicitly overrides. For the `reasoning` role, set `enable_thinking: true`
in the role's `body_defaults` to engage extended chain-of-thought.
**Mistral / Mixtral:**
Standard instruction format. No special handling needed.
**DeepSeek models:**
Some versions use a `<think>` block in their output. GenieHive strips
`reasoning_content` from responses but not inline `<think>` blocks. If the
model emits inline thinking that should be hidden from clients, add a
response-cleaning step or configure the model server to suppress it.
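A possible response-cleaning step (illustrative; not something GenieHive does today):

```python
import re

def strip_think_blocks(text: str) -> str:
    """Remove inline <think>…</think> reasoning and trailing whitespace."""
    return re.sub(r"<think>.*?</think>\s*", "", text, flags=re.DOTALL)
```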
**Embedding models via Ollama:**
Ollama's embedding endpoint is `/api/embeddings`, not `/v1/embeddings`. The
current `UpstreamClient` uses the OpenAI-compatible path. When registering
an Ollama embedding service, confirm the node config points to the correct
endpoint or that the Ollama version supports `/v1/embeddings`.
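A quick probe to check which path a given Ollama install supports; the base URL and model tag are placeholders, and note the native endpoint takes `prompt` where the OpenAI-compatible one takes `input`:

```python
import requests

base = "http://127.0.0.1:11434"  # Ollama's default port
try:
    r = requests.post(f"{base}/v1/embeddings",
                      json={"model": "nomic-embed-text", "input": "hello"})
    r.raise_for_status()
    print("OpenAI-compatible /v1/embeddings works")
except requests.RequestException:
    # Older Ollama: fall back to the native endpoint.
    r = requests.post(f"{base}/api/embeddings",
                      json={"model": "nomic-embed-text", "prompt": "hello"})
    r.raise_for_status()
    print("native /api/embeddings only")
```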
**llamafile:**
Does not support the embeddings endpoint. Only suitable for chat roles.

roadmap.md

@@ -1,34 +1,175 @@
# GenieHive Roadmap

Last updated: 2026-04-27

## What Is Complete

The v1 core is implemented and tested.

**Registry and cluster control:**
- SQLite-backed registry with hosts, services, roles, and benchmark samples
- Node registration and heartbeat protocol with auto-re-registration on 404
- Role catalog loading from YAML
- Route resolution: direct asset/service match → role resolution → clear failure

**Client-facing API:**
- `GET /v1/models` — OpenAI-compatible model list with rich metadata (loaded state,
  latency hints, offload classification, role aliases)
- `POST /v1/chat/completions` — proxies to upstream with request policy application
- `POST /v1/embeddings` — proxies to upstream

**Request policy system:**
- Body defaults and overrides via deep merge
- System prompt injection (prepend / append / replace)
- Per-asset and per-role policies, merged with role winning on prompts
- Qwen3 / Qwen3.5 auto-detection with `enable_thinking: false` applied automatically

**Route matching and scoring:**
- `POST /v1/cluster/routes/match` — scored candidate list for role and service targets
- Signals: text overlap, preferred family, runtime (loaded state, latency, throughput,
  queue depth), benchmark (workload overlap, quality score)
- `GET /v1/cluster/routes/resolve` — quick single-model resolution

**Benchmark infrastructure:**
- Built-in workloads: `chat.short_reasoning`, `chat.concise_support`
- `run_benchmark_workload.py` executes workloads and emits a JSON report
- `ingest_benchmark_report.py` posts results to the control plane
- Benchmark samples feed the route scoring pipeline

**Operator inspection:**
- `GET /v1/cluster/hosts`, `/services`, `/roles`, `/benchmarks`, `/health`

**Auth:**
- Client API key (`X-Api-Key`) and node registration key (`X-GenieHive-Node-Key`)
- Empty key lists disable auth for development

**Tests:**
- Registry, chat proxy, node inventory, benchmark runner, full demo flow
- All passing
---
## Known Gaps and Issues
These are confirmed gaps in the current implementation, not aspirational items.
### 1. Transcription endpoint not implemented
`POST /v1/audio/transcriptions` is listed in the architecture and wired into
`main.py`, but there is no upstream proxy handler for it. `upstream.py` has no
`transcriptions()` method. The endpoint currently returns nothing useful.
### 2. Routing strategy field is ignored
`RoutingConfig.default_strategy` exists in `config.py` (default: `"loaded_first"`),
but `resolve_route()` in `registry.py` does not read it. There is effectively only
one strategy. The field is misleading.
### 3. Role fallback chain is not implemented
`RoutingPolicy.fallback_roles` is defined in `models.py` and appears in the schema
docs, but `resolve_route()` never consults it. A role that fails to match any service
fails outright rather than trying its fallbacks.
### 4. `_benchmark_quality_score` can exceed 1.0 before clamping
`pass_rate` and `quality_score` are taken as `max()`, then `tokens_per_sec` and
`ttft_ms` are *added* on top. A service with `pass_rate=1.0`, fast tokens, and low
TTFT accumulates a score of up to 1.6 before the final `min(1.0, quality)` clamp.
This means the additive bonuses have no effect once pass_rate or quality_score is
already high, which is probably not the intended behavior.
### 5. Health is self-reported only
Service health (`healthy` / `unhealthy`) comes entirely from node-reported state.
The control plane does not probe upstream endpoints. A service can appear healthy
while its endpoint is unreachable.
### 6. No active model discovery from upstream services
The node agent scans for `.gguf` files on disk and reads static service config.
It does not query running Ollama or vLLM instances for their loaded model list.
A freshly-pulled Ollama model will not appear until the node config is updated
and the agent restarted.
### 7. `docs/architecture.md` duplicates `GENIEWARREN_SPEC.md`
`architecture.md` contains the repo-naming rationale, name alternatives, and
implementation sequence list that are only meaningful in a design/proposal context.
These are noise in a reference architecture document.
---
## Immediate Next Work (Priority Order)
### P0 — Fix confirmed bugs
1. **Remove the misleading `default_strategy` field** or implement a dispatch table
so the config field actually selects behavior. Simplest fix: delete the field and
the dead config surface until a second strategy is implemented.
2. **Fix `_benchmark_quality_score`** so additive bonuses apply only when no
`pass_rate` / `quality_score` is available, or restructure as a weighted average
so the components don't stack additively.
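One possible weighted-average restructuring, as a sketch only; the weights and normalization constants below are illustrative (the 40 tok/s and 3000 ms anchors echo the runtime bands in `architecture.md`), not a decided design:

```python
def benchmark_quality_score(pass_rate, quality_score, tokens_per_sec, ttft_ms):
    # Quality dominates; speed and snappiness contribute without stacking.
    quality = max(pass_rate or 0.0, quality_score or 0.0)
    speed = min(tokens_per_sec / 40.0, 1.0) if tokens_per_sec else 0.0
    snappiness = 1.0 - min(ttft_ms / 3000.0, 1.0) if ttft_ms else 0.0
    return 0.7 * quality + 0.2 * speed + 0.1 * snappiness
```

This keeps the result in [0, 1] by construction, so the final clamp becomes redundant rather than load-bearing.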
### P1 — Complete stated v1 scope
3. **Implement transcription proxy** — add `upstream.transcriptions()` and wire
the handler in `chat.py` and `main.py`.
4. **Implement role fallback chain** — when `resolve_route()` finds no matching
service for a role, walk `fallback_roles` in order before failing.
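A sketch of the fallback walk; `role_catalog` and `match_services` stand in for whatever the real lookup and matcher in `registry.py` are called, and a seen-set guards against fallback cycles:

```python
def resolve_role_with_fallbacks(role_id: str, seen: set[str] | None = None):
    seen = seen or set()
    if role_id in seen:                 # guard against fallback cycles
        return None
    seen.add(role_id)
    role = role_catalog.get(role_id)    # assumed lookup
    if role is None:
        return None
    match = match_services(role)        # assumed existing matcher
    if match:
        return match
    for fallback in role.routing_policy.fallback_roles:
        result = resolve_role_with_fallbacks(fallback, seen)
        if result:
            return result
    return None
```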
### P2 — Close the most important self-reported-only gaps
5. **Add active health probing** — the control plane should periodically probe
registered service endpoints (a lightweight `GET /health` or `GET /v1/models`
is sufficient) and update health state independently of node heartbeats.
6. **Add upstream model discovery for Ollama** — query `GET /api/tags` (Ollama)
or `GET /v1/models` (OpenAI-compatible) from the node agent and merge loaded
model names into the service's asset list. This enables dynamic model tracking
without config restarts.
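A sketch of the Ollama side of this: `GET /api/tags` is Ollama's real model-list endpoint, while merging the names into the service's asset list is left to the agent's own storage:

```python
import requests

def discover_ollama_models(base_url: str = "http://127.0.0.1:11434") -> list[str]:
    """Return the model tags a local Ollama instance currently has available."""
    r = requests.get(f"{base_url}/api/tags", timeout=5)
    r.raise_for_status()
    return [m["name"] for m in r.json().get("models", [])]
```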
### P3 — Documentation cleanup
7. **Revise `architecture.md`** — remove the design-phase repo-naming rationale
and first-implementation-sequence list; replace with a description of the actual
running system (the four layers as implemented, data flow diagram if possible).
8. **Update `roadmap.md`** — this file (done).
---
## Near-Term Milestones (After P0–P3)
- **Live LLM demo** — run control + node against a real upstream (Ollama or
llama.cpp) and document the end-to-end flow, including chat via role and
direct asset addressing
- **Validate Codex-friendly `/v1/models` offload** — test `GET /v1/models` as
a programmatic service catalog for a Claude Code or Codex client selecting
a GenieHive-hosted model for lower-complexity subtasks
- **Richer node metrics** — queue depth, in-flight count, and rolling performance
averages reported from node to control on every heartbeat
- **Second routing strategy** — implement `round_robin` or `least_loaded` as a
second selectable strategy, then make `default_strategy` actually dispatch
---
## V1.5 Scope (Not Yet Started)
- mTLS between control plane and node agents
- Scoped client tokens (read-only vs. operator vs. admin)
- Active load-aware model swapping (trigger unload/load on a node based on demand)
- Image and TTS generation adapter stubs
- Streaming response passthrough for chat completions
---
## Non-Goals (Unchanged from Original Spec)
- Peer-to-peer consensus
- Autonomous global model swapping across many nodes
- Full WAN zero-trust platform
- Distributed vector database management
- Billing or multi-tenant quota accounting