# GenieHive Schemas

These are canonical logical schemas for v1. They are documentation first, not final implementation code.

## Host

```yaml
host:
  host_id: "atlas-01"
  display_name: "Atlas GPU Box"
  address: "192.168.1.101"
  labels:
    site: "home-lab"
    class: "gpu"
  capabilities:
    cuda: true
    rocm: false
    metal: false
  resources:
    cpu_threads: 24
    ram_gb: 128
    gpus:
      - gpu_id: "cuda:0"
        name: "RTX 4090"
        vram_gb: 24
  auth:
    node_key_id: "nk_atlas_01"
  status:
    state: "online"
    last_seen: "2026-04-05T15:30:00Z"
```

## Service

```yaml
service:
  service_id: "atlas-01/chat/qwen3-8b"
  host_id: "atlas-01"
  kind: "chat"
  protocol: "openai"
  endpoint: "http://192.168.1.101:18091"
  runtime:
    engine: "llama.cpp"
    launcher: "managed"
  assets:
    - asset_id: "qwen3-8b-q4km"
      loaded: true
      request_policy:
        body_defaults:
          chat_template_kwargs:
            enable_thinking: false
  state:
    health: "healthy"
    load_state: "loaded"
    accept_requests: true
  observed:
    p50_latency_ms: 920
    p95_latency_ms: 1900
    tokens_per_sec: 42
```

## Asset

```yaml
asset:
  asset_id: "qwen3-8b-q4km"
  family: "Qwen3-8B"
  modality: "text"
  operation: "chat"
  format: "gguf"
  locator:
    kind: "path"
    value: "/models/qwen3-8b/qwen3-8b-q4_k_m.gguf"
  metadata:
    quant: "Q4_K_M"
    ctx_train: 32768
```

## Role Profile

```yaml
role:
  role_id: "mentor"
  display_name: "Mentor"
  description: "Guidance-oriented instructional reasoning"
  modality: "text"
  operation: "chat"
  prompt_policy:
    system_prompt: "You guide without doing the user's work for them."
    user_template: "{{ user_input }}"
    request_policy:
      body_defaults:
        temperature: 0.2
  routing_policy:
    preferred_families: ["Qwen3", "Mistral"]
    preferred_labels: ["instruction", "stable"]
    min_context: 8192
    require_loaded: false
    fallback_roles: ["general_assistant"]
```

## Request Shape Policy

This is a general representation for model- or route-specific request shaping.

```yaml
request_shape_policy:
  body_defaults:
    chat_template_kwargs:
      enable_thinking: false
    temperature: 0.2
  system_prompt: "Return only visible final answer text."
  system_prompt_position: "prepend"
```

Use it for:

- model-specific request flags such as `chat_template_kwargs.enable_thinking`
- default OpenAI-compatible body fields that should be applied unless the caller has already set them
- model-specific prompt instructions that should be prepended, appended, or used to replace an existing system message

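As a minimal sketch of how such a policy could be applied to an OpenAI-style chat request body (the function names and merge details here are illustrative assumptions, not GenieHive's actual implementation):

```python
# Hypothetical sketch: merge body_defaults into a request without overriding
# caller-set keys, then place the policy's system prompt per
# system_prompt_position. Not GenieHive's actual code.
from copy import deepcopy


def merge_defaults(body: dict, defaults: dict) -> dict:
    """Recursively fill in defaults without overriding caller-set keys."""
    merged = deepcopy(body)
    for key, value in defaults.items():
        if key not in merged:
            merged[key] = deepcopy(value)
        elif isinstance(merged[key], dict) and isinstance(value, dict):
            merged[key] = merge_defaults(merged[key], value)
    return merged


def apply_system_prompt(messages: list, prompt: str, position: str) -> list:
    """Prepend, append, or replace the system message per the policy."""
    system_msgs = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    if position == "replace" or not system_msgs:
        return [{"role": "system", "content": prompt}] + rest
    existing = system_msgs[0]["content"]
    if position == "prepend":
        content = prompt + "\n" + existing
    else:  # "append"
        content = existing + "\n" + prompt
    return [{"role": "system", "content": content}] + rest


policy = {
    "body_defaults": {
        "chat_template_kwargs": {"enable_thinking": False},
        "temperature": 0.2,
    },
    "system_prompt": "Return only visible final answer text.",
    "system_prompt_position": "prepend",
}

body = {"temperature": 0.7, "messages": [{"role": "user", "content": "Hi"}]}
shaped = merge_defaults(body, policy["body_defaults"])
shaped["messages"] = apply_system_prompt(
    shaped["messages"], policy["system_prompt"], policy["system_prompt_position"]
)
print(shaped["temperature"])  # 0.7 — the caller-set value wins over the default
```

The key property is that defaults only fill gaps: a caller-set `temperature` survives, while missing fields such as `chat_template_kwargs` are supplied by the policy.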
GenieHive currently supports this policy on:

- `service.assets[].request_policy`
- `role.prompt_policy.request_policy`

The control plane may also infer built-in request policies from model family metadata. For example, Qwen3/Qwen3.5 chat routes default to `chat_template_kwargs.enable_thinking: false` unless the caller explicitly sets a different value.

`GET /v1/models` exposes the merged result as `geniehive.effective_request_policy` on service, asset, and role-backed model entries so clients can discover what GenieHive will apply by default.

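A client can discover these defaults from the model listing. This sketch assumes the standard OpenAI-style `{"data": [...]}` list shape around the documented `geniehive.effective_request_policy` field; error handling is omitted:

```python
# Sketch of a client reading geniehive.effective_request_policy from
# GET /v1/models. The field name comes from the docs above; the surrounding
# response shape is assumed to be the usual OpenAI-style model list.
import json
from urllib.request import urlopen


def extract_policies(listing: dict) -> dict:
    """Map model id -> effective_request_policy (None if not exposed)."""
    return {
        entry["id"]: entry.get("geniehive", {}).get("effective_request_policy")
        for entry in listing.get("data", [])
    }


def fetch_policies(base_url: str) -> dict:
    """Fetch the model list and extract per-model effective policies."""
    with urlopen(f"{base_url}/v1/models") as resp:
        return extract_policies(json.load(resp))
```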
## Health Sample

```yaml
health_sample:
  sample_id: "hs_01"
  target_type: "service"
  target_id: "atlas-01/chat/qwen3-8b"
  observed_at: "2026-04-05T15:30:00Z"
  status: "healthy"
  checks:
    http_ok: true
    models_ok: true
    auth_ok: true
  metrics:
    queue_depth: 1
    in_flight: 1
    mem_used_gb: 18.4
```

## Benchmark Sample

```yaml
benchmark_sample:
  benchmark_id: "bench_01"
  service_id: "atlas-01/chat/qwen3-8b"
  asset_id: "qwen3-8b-q4km"
  observed_at: "2026-04-05T15:25:00Z"
  workload: "chat.short_reasoning"
  results:
    prompt_tokens: 512
    completion_tokens: 256
    ttft_ms: 780
    tokens_per_sec: 44
```

## Route Match Request

```yaml
route_match_request:
  task: "fast technical reasoning for an interactive assistant"
  tasks:
    - "interactive debugging help"
    - "concise technical explanations"
  workload: "chat.short_reasoning"
  workloads:
    - "chat.short_reasoning"
    - "chat.concise_support"
  kind: "chat"
  modality: "text"
  include_direct_services: true
  limit: 5
```

This request is meant to answer:

- which role-backed route is the best current fit for this task or task suite
- which direct services also look suitable right now

V1 matching is metadata- and runtime-driven. It uses:

- role text and routing policy overlap
- service asset and runtime metadata overlap
- loaded state
- observed latency
- observed throughput
- current queue depth when available
- recent benchmark sample workload overlap and empirical quality/performance hints

If benchmark samples exist for a candidate service, workload hints such as `chat.short_reasoning` can boost routes with recent empirical fit.

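The signals above can be sketched as a weighted score. The weights and normalization here are hypothetical, chosen only to illustrate the shape of the combination — V1's actual scoring is internal to the control plane:

```python
# Illustrative weighted combination of route match signals. Weights are
# made-up for this sketch; the signal names mirror the candidate schema below.

def score_candidate(signals: dict) -> float:
    score = 0.0
    score += 0.30 * signals.get("task_overlap", 0.0)
    score += 0.15 * signals.get("preferred_family_match", 0.0)
    score += 0.15 * (1.0 if signals.get("loaded") else 0.0)
    # Lower observed latency scores higher (normalized against a 5s ceiling).
    p50 = signals.get("p50_latency_ms", 5000)
    score += 0.15 * max(0.0, 1.0 - p50 / 5000)
    # Higher throughput scores higher (capped at 50 tok/s for this sketch).
    tps = signals.get("tokens_per_sec", 0)
    score += 0.10 * min(1.0, tps / 50)
    # Shallower queues score higher.
    depth = signals.get("queue_depth", 0)
    score += 0.05 * max(0.0, 1.0 - depth / 4)
    # Recent benchmark evidence boosts routes with empirical workload fit.
    score += 0.10 * (
        signals.get("best_workload_overlap", 0.0)
        * signals.get("benchmark_quality_score", 0.0)
    )
    return round(score, 2)
```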
## Route Match Candidate

```yaml
route_match_candidate:
  candidate_type: "role"
  candidate_id: "general_assistant"
  operation: "chat"
  score: 0.86
  reasons:
    - "task text overlaps role description or policy"
    - "resolved service matches role preferred model family"
    - "service already has a loaded asset"
    - "low observed latency"
    - "good observed throughput"
  signals:
    task_overlap: 0.33
    preferred_family_match: 1.0
    loaded: true
    p50_latency_ms: 1100
    tokens_per_sec: 28
    queue_depth: 0
    benchmark_match_count: 2
    best_workload_overlap: 1.0
    benchmark_quality_score: 0.9
  role:
    role_id: "general_assistant"
  service:
    service_id: "p40-box/chat/gpu1-secondary"
```

## Benchmark Ingest Request

```yaml
benchmark_ingest_request:
  samples:
    - benchmark_id: "bench-qwen-1"
      service_id: "p40-box/chat/gpu1-secondary"
      asset_id: "Qwen3.5-9B-Q5_K_M"
      workload: "chat.short_reasoning"
      observed_at: 1775582000.0
      results:
        ttft_ms: 900
        tokens_per_sec: 30
        quality_score: 0.9
```

## Benchmark Report File

This is a file-oriented format meant for repeatable benchmark runs before ingestion into GenieHive.

```yaml
benchmark_report:
  report_id: "p40-short-reasoning"
  observed_at: 1775583000.0
  source: "local-smoke"
  samples:
    - service_id: "p40-box/chat/gpu1-secondary"
      asset_id: "Qwen3.5-9B-Q5_K_M"
      workload: "chat.short_reasoning"
      results:
        ttft_ms: 900
        tokens_per_sec: 30
        quality_score: 0.9
```

Notes:

- `observed_at` may be set once at the report level or per sample
- `benchmark_id` is optional in the file format; GenieHive tooling can generate a stable ID during conversion
- the helper script `scripts/ingest_benchmark_report.py` loads this format and posts the expanded samples to `POST /v1/cluster/benchmarks`
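The expansion step can be sketched as follows. This approximates what the helper script does per the notes above; the function name and ID-derivation scheme are illustrative assumptions, not the script's actual code:

```python
# Hypothetical sketch of expanding a benchmark_report file into the flat
# sample list accepted by POST /v1/cluster/benchmarks.

def expand_report(report: dict) -> list:
    samples = []
    for i, sample in enumerate(report.get("samples", [])):
        expanded = dict(sample)
        # The report-level observed_at applies unless a sample sets its own.
        expanded.setdefault("observed_at", report.get("observed_at"))
        # benchmark_id is optional in the file format; derive a stable one
        # from the report_id and sample position (illustrative scheme).
        expanded.setdefault("benchmark_id", f"{report['report_id']}-{i}")
        samples.append(expanded)
    return samples
```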