# GenieHive Schemas

These are canonical logical schemas for v1. They are documentation first, not final implementation code.

## Host

```yaml
host:
  host_id: "atlas-01"
  display_name: "Atlas GPU Box"
  address: "192.168.1.101"
  labels:
    site: "home-lab"
    class: "gpu"
  capabilities:
    cuda: true
    rocm: false
    metal: false
  resources:
    cpu_threads: 24
    ram_gb: 128
    gpus:
      - gpu_id: "cuda:0"
        name: "RTX 4090"
        vram_gb: 24
  auth:
    node_key_id: "nk_atlas_01"
  status:
    state: "online"
    last_seen: "2026-04-05T15:30:00Z"
```

## Service

```yaml
service:
  service_id: "atlas-01/chat/qwen3-8b"
  host_id: "atlas-01"
  kind: "chat"
  protocol: "openai"
  endpoint: "http://192.168.1.101:18091"
  runtime:
    engine: "llama.cpp"
    launcher: "managed"
  assets:
    - asset_id: "qwen3-8b-q4km"
      loaded: true
      request_policy:
        body_defaults:
          chat_template_kwargs:
            enable_thinking: false
  state:
    health: "healthy"
    load_state: "loaded"
    accept_requests: true
  observed:
    p50_latency_ms: 920
    p95_latency_ms: 1900
    tokens_per_sec: 42
```

## Asset

```yaml
asset:
  asset_id: "qwen3-8b-q4km"
  family: "Qwen3-8B"
  modality: "text"
  operation: "chat"
  format: "gguf"
  locator:
    kind: "path"
    value: "/models/qwen3-8b/qwen3-8b-q4_k_m.gguf"
  metadata:
    quant: "Q4_K_M"
    ctx_train: 32768
```

## Role Profile

```yaml
role:
  role_id: "mentor"
  display_name: "Mentor"
  description: "Guidance-oriented instructional reasoning"
  modality: "text"
  operation: "chat"
  prompt_policy:
    system_prompt: "You guide without doing the user's work for them."
    user_template: "{{ user_input }}"
    request_policy:
      body_defaults:
        temperature: 0.2
  routing_policy:
    preferred_families: ["Qwen3", "Mistral"]
    preferred_labels: ["instruction", "stable"]
    min_context: 8192
    require_loaded: false
    fallback_roles: ["general_assistant"]
```

## Request Shape Policy

This is a general representation for model- or route-specific request shaping.

```yaml
request_shape_policy:
  body_defaults:
    chat_template_kwargs:
      enable_thinking: false
    temperature: 0.2
  system_prompt: "Return only visible final answer text."
  system_prompt_position: "prepend"
```

Use it for:

- model-specific request flags such as `chat_template_kwargs.enable_thinking`
- default OpenAI-compatible body fields that should be applied unless the caller has already set them
- model-specific prompt instructions that should be prepended, appended, or used to replace an existing system message

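As a minimal sketch of how such a policy could be applied to an OpenAI-style chat request body (the function names and merge details here are illustrative assumptions, not GenieHive's actual implementation):

```python
# Hypothetical sketch: merge body_defaults into a request without overriding
# caller-set keys, then place the policy's system prompt per
# system_prompt_position. Not GenieHive's actual code.
from copy import deepcopy


def merge_defaults(body: dict, defaults: dict) -> dict:
    """Recursively fill in defaults without overriding caller-set keys."""
    merged = deepcopy(body)
    for key, value in defaults.items():
        if key not in merged:
            merged[key] = deepcopy(value)
        elif isinstance(merged[key], dict) and isinstance(value, dict):
            merged[key] = merge_defaults(merged[key], value)
    return merged


def apply_system_prompt(messages: list, prompt: str, position: str) -> list:
    """Prepend, append, or replace the system message per the policy."""
    system_msgs = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    if position == "replace" or not system_msgs:
        return [{"role": "system", "content": prompt}] + rest
    existing = system_msgs[0]["content"]
    if position == "prepend":
        content = prompt + "\n" + existing
    else:  # "append"
        content = existing + "\n" + prompt
    return [{"role": "system", "content": content}] + rest


policy = {
    "body_defaults": {
        "chat_template_kwargs": {"enable_thinking": False},
        "temperature": 0.2,
    },
    "system_prompt": "Return only visible final answer text.",
    "system_prompt_position": "prepend",
}

body = {"temperature": 0.7, "messages": [{"role": "user", "content": "Hi"}]}
shaped = merge_defaults(body, policy["body_defaults"])
shaped["messages"] = apply_system_prompt(
    shaped["messages"], policy["system_prompt"], policy["system_prompt_position"]
)
print(shaped["temperature"])  # 0.7 — the caller-set value wins over the default
```

The key property is that defaults only fill gaps: a caller-set `temperature` survives, while missing fields such as `chat_template_kwargs` are supplied by the policy.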
GenieHive currently supports this policy on:

- `service.assets[].request_policy`
- `role.prompt_policy.request_policy`

The control plane may also infer built-in request policies from model family metadata. For example, Qwen3/Qwen3.5 chat routes default to `chat_template_kwargs.enable_thinking: false` unless the caller explicitly sets a different value.

`GET /v1/models` exposes the merged result as `geniehive.effective_request_policy` on service, asset, and role-backed model entries so clients can discover what GenieHive will apply by default.

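A client can discover these defaults from the model listing. This sketch assumes the standard OpenAI-style `{"data": [...]}` list shape around the documented `geniehive.effective_request_policy` field; error handling is omitted:

```python
# Sketch of a client reading geniehive.effective_request_policy from
# GET /v1/models. The field name comes from the docs above; the surrounding
# response shape is assumed to be the usual OpenAI-style model list.
import json
from urllib.request import urlopen


def extract_policies(listing: dict) -> dict:
    """Map model id -> effective_request_policy (None if not exposed)."""
    return {
        entry["id"]: entry.get("geniehive", {}).get("effective_request_policy")
        for entry in listing.get("data", [])
    }


def fetch_policies(base_url: str) -> dict:
    """Fetch the model list and extract per-model effective policies."""
    with urlopen(f"{base_url}/v1/models") as resp:
        return extract_policies(json.load(resp))
```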
## Health Sample

```yaml
health_sample:
  sample_id: "hs_01"
  target_type: "service"
  target_id: "atlas-01/chat/qwen3-8b"
  observed_at: "2026-04-05T15:30:00Z"
  status: "healthy"
  checks:
    http_ok: true
    models_ok: true
    auth_ok: true
  metrics:
    queue_depth: 1
    in_flight: 1
    mem_used_gb: 18.4
```

## Benchmark Sample

```yaml
benchmark_sample:
  benchmark_id: "bench_01"
  service_id: "atlas-01/chat/qwen3-8b"
  asset_id: "qwen3-8b-q4km"
  observed_at: "2026-04-05T15:25:00Z"
  workload: "chat.short_reasoning"
  results:
    prompt_tokens: 512
    completion_tokens: 256
    ttft_ms: 780
    tokens_per_sec: 44
```

## Route Match Request

```yaml
route_match_request:
  task: "fast technical reasoning for an interactive assistant"
  tasks:
    - "interactive debugging help"
    - "concise technical explanations"
  workload: "chat.short_reasoning"
  workloads:
    - "chat.short_reasoning"
    - "chat.concise_support"
  kind: "chat"
  modality: "text"
  include_direct_services: true
  limit: 5
```

This request is meant to answer:

- which role-backed route is the best current fit for this task or task suite
- which direct services also look suitable right now

V1 matching is metadata- and runtime-driven. It uses:

- role text and routing policy overlap
- service asset and runtime metadata overlap
- loaded state
- observed latency
- observed throughput
- current queue depth when available
- recent benchmark sample workload overlap and empirical quality/performance hints

If benchmark samples exist for a candidate service, workload hints such as `chat.short_reasoning` can boost routes with recent empirical fit.

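The signals above can be sketched as a weighted score. The weights and normalization here are hypothetical, chosen only to illustrate the shape of the combination — V1's actual scoring is internal to the control plane:

```python
# Illustrative weighted combination of route match signals. Weights are
# made-up for this sketch; the signal names mirror the candidate schema below.

def score_candidate(signals: dict) -> float:
    score = 0.0
    score += 0.30 * signals.get("task_overlap", 0.0)
    score += 0.15 * signals.get("preferred_family_match", 0.0)
    score += 0.15 * (1.0 if signals.get("loaded") else 0.0)
    # Lower observed latency scores higher (normalized against a 5s ceiling).
    p50 = signals.get("p50_latency_ms", 5000)
    score += 0.15 * max(0.0, 1.0 - p50 / 5000)
    # Higher throughput scores higher (capped at 50 tok/s for this sketch).
    tps = signals.get("tokens_per_sec", 0)
    score += 0.10 * min(1.0, tps / 50)
    # Shallower queues score higher.
    depth = signals.get("queue_depth", 0)
    score += 0.05 * max(0.0, 1.0 - depth / 4)
    # Recent benchmark evidence boosts routes with empirical workload fit.
    score += 0.10 * (
        signals.get("best_workload_overlap", 0.0)
        * signals.get("benchmark_quality_score", 0.0)
    )
    return round(score, 2)
```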
## Route Match Candidate

```yaml
route_match_candidate:
  candidate_type: "role"
  candidate_id: "general_assistant"
  operation: "chat"
  score: 0.86
  reasons:
    - "task text overlaps role description or policy"
    - "resolved service matches role preferred model family"
    - "service already has a loaded asset"
    - "low observed latency"
    - "good observed throughput"
  signals:
    task_overlap: 0.33
    preferred_family_match: 1.0
    loaded: true
    p50_latency_ms: 1100
    tokens_per_sec: 28
    queue_depth: 0
    benchmark_match_count: 2
    best_workload_overlap: 1.0
    benchmark_quality_score: 0.9
  role:
    role_id: "general_assistant"
  service:
    service_id: "p40-box/chat/gpu1-secondary"
```

## Benchmark Ingest Request

```yaml
benchmark_ingest_request:
  samples:
    - benchmark_id: "bench-qwen-1"
      service_id: "p40-box/chat/gpu1-secondary"
      asset_id: "Qwen3.5-9B-Q5_K_M"
      workload: "chat.short_reasoning"
      observed_at: 1775582000.0
      results:
        ttft_ms: 900
        tokens_per_sec: 30
        quality_score: 0.9
```

## Benchmark Report File

This is a file-oriented format meant for repeatable benchmark runs before ingestion into GenieHive.

```yaml
benchmark_report:
  report_id: "p40-short-reasoning"
  observed_at: 1775583000.0
  source: "local-smoke"
  samples:
    - service_id: "p40-box/chat/gpu1-secondary"
      asset_id: "Qwen3.5-9B-Q5_K_M"
      workload: "chat.short_reasoning"
      results:
        ttft_ms: 900
        tokens_per_sec: 30
        quality_score: 0.9
```

Notes:

- `observed_at` may be set once at the report level or per sample
- `benchmark_id` is optional in the file format; GenieHive tooling can generate a stable ID during conversion
- the helper script `scripts/ingest_benchmark_report.py` loads this format and posts the expanded samples to `POST /v1/cluster/benchmarks`
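The expansion step can be sketched as follows. This approximates what the helper script does per the notes above; the function name and ID-derivation scheme are illustrative assumptions, not the script's actual code:

```python
# Hypothetical sketch of expanding a benchmark_report file into the flat
# sample list accepted by POST /v1/cluster/benchmarks.

def expand_report(report: dict) -> list:
    samples = []
    for i, sample in enumerate(report.get("samples", [])):
        expanded = dict(sample)
        # The report-level observed_at applies unless a sample sets its own.
        expanded.setdefault("observed_at", report.get("observed_at"))
        # benchmark_id is optional in the file format; derive a stable one
        # from the report_id and sample position (illustrative scheme).
        expanded.setdefault("benchmark_id", f"{report['report_id']}-{i}")
        samples.append(expanded)
    return samples
```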