# GenieHive LLM Demo
This runbook covers the first practical GenieHive LLM demo with three roles:
- master: the GenieHive control plane
- peer: a GenieHive node agent attached to one or more local LLM servers
- client: a demo client agent or Codex using GenieHive as the API front door
## Current Readiness
GenieHive v1 is fully implemented and ready for a live demo.
What works:
- Node registration and heartbeat with auto-re-registration on 404
- Role-aware route resolution with `fallback_roles` chain and cycle protection
- Three routing strategies: `scored` (default), `round_robin`, `least_loaded`
- `GET /v1/models` — OpenAI-compatible catalog with rich GenieHive metadata
- `POST /v1/chat/completions` — non-streaming and streaming (`stream: true`)
- `POST /v1/embeddings`
- `POST /v1/audio/transcriptions` — multipart audio proxy
- Active health probing (`routing.probe_interval_s` in control config)
- Ollama dynamic model discovery: `discover_protocol: "ollama"` in node config
queries `/api/tags` and `/api/ps` each heartbeat; corrects loaded state and
populates `observed.loaded_model_count` and `observed.vram_used_bytes`
- OpenAI-compatible discovery: `discover_protocol: "openai"` queries `/v1/models`
- Reasoning-field stripping (`reasoning_content`, `reasoning`) from both
non-streaming and streaming responses
- Request policy: body defaults, overrides, system prompt injection per asset or role
- Qwen3/Qwen3.5 auto-detection with `enable_thinking: false` applied automatically
GenieHive does not launch upstream LLM servers for you. Treat it as a
metadata-rich router over already-running local servers.
## Smoke Test
After bringing up control + node + upstream, run:
```bash
python scripts/smoke_test.py \
  --base-url http://127.0.0.1:8800 \
  --api-key change-me-client-key
```
This validates in sequence: health, cluster state, model catalog, route
resolution, non-streaming chat (role and direct asset), streaming chat,
embeddings, Ollama discovery metrics, and reasoning-field stripping. Each
check reports PASS / FAIL / SKIP with a short explanation.
Optional flags:
```bash
python scripts/smoke_test.py \
  --base-url http://127.0.0.1:8800 \
  --api-key change-me-client-key \
  --chat-role mentor \
  --chat-asset qwen3 \
  --embed-asset nomic-embed-text
```
## New Capabilities Since Initial Demo
### Streaming chat
Add `"stream": true` to any chat request. GenieHive returns a standard
`text/event-stream` response with `Cache-Control: no-cache` and
`X-Accel-Buffering: no` headers set for nginx/proxy compatibility:
```bash
curl -sS http://127.0.0.1:8800/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -H 'X-Api-Key: change-me-client-key' \
  -d '{
    "model": "mentor",
    "messages": [{"role":"user","content":"Count to five."}],
    "stream": true
  }'
```
Reasoning fields (`reasoning_content`, `reasoning`) are stripped from every
SSE chunk before forwarding, just as they are for non-streaming responses.
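To watch the stream from a shell, a minimal consumer is `curl -N` plus a line filter. This is a sketch that assumes OpenAI-style SSE chunks ending with a `data: [DONE]` marker, and that GNU `sed` and `jq` are installed:
```bash
# Stream a chat completion and print only the content deltas as they arrive.
curl -sSN http://127.0.0.1:8800/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -H 'X-Api-Key: change-me-client-key' \
  -d '{"model":"mentor","messages":[{"role":"user","content":"Count to five."}],"stream":true}' \
  | sed -nu 's/^data: //p' \
  | while read -r chunk; do
      [ "$chunk" = "[DONE]" ] && break
      printf '%s' "$(printf '%s' "$chunk" | jq -r '.choices[0].delta.content // empty')"
    done
echo
```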
### Routing strategy
Set `routing.default_strategy` in your control config:
```yaml
routing:
  default_strategy: "least_loaded" # scored | round_robin | least_loaded
```
`least_loaded` picks the service with the lowest `queue_depth + in_flight`.
When Ollama discovery is enabled, `loaded_model_count` and `vram_used_bytes`
are available in `observed` and visible via `GET /v1/cluster/services`.
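To eyeball those counters, dump the cluster view and look for the `observed` block on each service. The response shape beyond the fields named above is an assumption; start broad and narrow the `jq` filter once you can see the layout:
```bash
# Dump per-service state, including observed.loaded_model_count and
# observed.vram_used_bytes once Ollama discovery has reported in.
curl -sS http://127.0.0.1:8800/v1/cluster/services \
  -H 'X-Api-Key: change-me-client-key' | jq .
```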
### Ollama dynamic model discovery
Add `discover_protocol: "ollama"` to any Ollama-backed service in your node
config. On each heartbeat the node queries `/api/tags` (available models) and
`/api/ps` (VRAM-loaded models) and merges the results into the service's asset
list. Stale `loaded: true` entries in static config are corrected automatically.
```yaml
services:
  - service_id: "singlebox/chat/qwen3"
    kind: "chat"
    endpoint: "http://127.0.0.1:11434"
    discover_protocol: "ollama"
    assets:
      - asset_id: "qwen3" # static baseline; enriched each heartbeat
        loaded: true
```
After the first enriched heartbeat, `GET /v1/cluster/services` will show
`observed.loaded_model_count` and `observed.vram_used_bytes` for that service.
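To cross-check what discovery sees, query Ollama directly; `/api/tags` and `/api/ps` are standard Ollama endpoints:
```bash
# What Ollama reports as available vs. currently loaded into VRAM.
curl -sS http://127.0.0.1:11434/api/tags | jq -r '.models[].name'
curl -sS http://127.0.0.1:11434/api/ps   | jq -r '.models[].name'
```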
### Role catalogs
Five role catalog files are now included under `configs/`:
| File | Framework | Roles |
|---|---|---|
| `roles.surgical-team.example.yaml` | Brooks/Mills surgical team | 9 (`surg_` prefix) |
| `roles.belbin.example.yaml` | Belbin team roles | 9 (`belbin_` prefix) |
| `roles.sixhats.example.yaml` | De Bono Six Thinking Hats | 6 (`sixhats_` prefix) |
| `roles.disney.example.yaml` | Disney creative strategy | 3 (`disney_` prefix) |
| `roles.xp.example.yaml` | XP team roles | 5 (`xp_` prefix) |
Point `roles_path` in your control config at any of these files to load that
catalog. Multiple catalogs can be merged by listing them in sequence, or by
concatenating the `roles:` blocks manually into a single file.
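After pointing `roles_path` at a catalog and restarting control, you can confirm the roles landed by listing catalog IDs; `/v1/models` follows the OpenAI convention of a `data` array of objects with `id` fields:
```bash
# List everything the catalog exposes; role aliases appear alongside assets.
curl -sS http://127.0.0.1:8800/v1/models \
  -H 'X-Api-Key: change-me-client-key' \
  | jq -r '.data[].id'
```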
## Topologies
### Smallest Demo
Run everything on one host:
- control plane on `127.0.0.1:8800`
- node agent on `127.0.0.1:8891`
- one or more upstream model servers on local ports
This is also the recommended setup for users who do not have a cluster. GenieHive still provides value as:
- a local router
- a metadata-rich local model catalog
- a role-to-model indirection layer
- a common front door for client tools
### Two-Host Demo
- master host runs GenieHive control plane
- peer host runs GenieHive node agent and one or more local LLM servers
- client runs anywhere that can reach the master
## Master Instructions
On the control-plane host:
1. Create a repo-local Python environment if you want isolation.
2. Start GenieHive control:
```bash
cd /home/netuser/bin/geniehive
bash scripts/run_control.sh
```
3. Confirm health:
```bash
curl -sS http://127.0.0.1:8800/health
```
Expected result:
- JSON containing `{"status":"ok"}`
4. Keep note of the example client and node keys from `configs/control.example.yaml`.
### Single-Box Shortcut
If you are running control and node on the same machine, use:
```bash
cd /home/netuser/bin/geniehive
bash scripts/run_control_singlebox.sh
```
For your P40 host, the repo now provides external-bind helpers:
LAN:
```bash
bash scripts/run_control_p40_lan.sh
```
ZeroTier:
```bash
bash scripts/run_control_p40_zerotier.sh
```
Both use the P40-specific control config and only change the bind interface.
## Peer Instructions
On each peer host you need:
- one or more local LLM servers already running
- one GenieHive node config that points at those servers
- the control-plane base URL and node API key
For a single-machine setup, the peer is simply another process on the same host.
The node agent should advertise upstream server roots, not endpoint suffixes. For example:
- good: `http://127.0.0.1:11434`
- good: `http://127.0.0.1:18091`
- not good: `http://127.0.0.1:11434/v1/chat/completions`
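Before starting the node, it is worth confirming each advertised root actually answers. A generic reachability sketch (expected status codes vary by server, so treat anything other than a connection failure as reachable):
```bash
# Quick reachability probe for each upstream root the node will advertise.
for root in http://127.0.0.1:11434 http://127.0.0.1:18091; do
  code=$(curl -sS -o /dev/null -w '%{http_code}' --max-time 3 "$root" || echo "unreachable")
  echo "$root -> $code"
done
```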
### Option A: Ollama
Use this when you want the lowest-friction chat and embeddings demo.
1. Start Ollama if it is not already running:
```bash
ollama serve
```
2. Pull the model or models you want:
```bash
ollama pull qwen3
ollama pull nomic-embed-text
```
3. Example peer service config:
```yaml
services:
  - service_id: "peer1/chat/qwen3"
    kind: "chat"
    endpoint: "http://127.0.0.1:11434"
    runtime:
      engine: "ollama"
      launcher: "external"
    assets:
      - asset_id: "qwen3"
        loaded: true
    state:
      health: "healthy"
      load_state: "loaded"
      accept_requests: true
  - service_id: "peer1/embeddings/nomic-embed-text"
    kind: "embeddings"
    endpoint: "http://127.0.0.1:11434"
    runtime:
      engine: "ollama"
      launcher: "external"
    assets:
      - asset_id: "nomic-embed-text"
        loaded: true
    state:
      health: "healthy"
      load_state: "loaded"
      accept_requests: true
```
4. Start the node:
```bash
cd /home/netuser/bin/geniehive
bash scripts/run_node_singlebox.sh configs/node.singlebox.ollama.example.yaml
```
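Once the node is up and heartbeating, the registered assets should appear in the catalog:
```bash
# The Ollama-backed assets should now show up in the model catalog.
curl -sS http://127.0.0.1:8800/v1/models \
  -H 'X-Api-Key: change-me-client-key' \
  | jq -r '.data[].id' | grep -E 'qwen3|nomic-embed-text'
```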
### Option B: llama.cpp
Use this when you want direct GGUF serving with `llama-server`.
1. Start a chat server:
```bash
llama-server -m /path/to/model.gguf --host 127.0.0.1 --port 18091
```
2. Example peer service config:
```yaml
services:
  - service_id: "peer1/chat/qwen3-8b"
    kind: "chat"
    endpoint: "http://127.0.0.1:18091"
    runtime:
      engine: "llama.cpp"
      launcher: "external"
    assets:
      - asset_id: "qwen3-8b-q4_k_m"
        loaded: true
    state:
      health: "healthy"
      load_state: "loaded"
      accept_requests: true
```
Then start the node:
```bash
cd /home/netuser/bin/geniehive
bash scripts/run_node_singlebox.sh configs/node.singlebox.llamacpp.example.yaml
```
Note:
- The official `llama.cpp` docs show OpenAI-compatible chat serving via `llama-server`.
- For embeddings, some `llama.cpp` builds document non-OpenAI embedding endpoints such as `/embedding`, so GenieHive's current `POST /v1/embeddings` path is safest with Ollama or vLLM unless you have verified your specific build.
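To verify that your particular build speaks the OpenAI chat shape before registering it, probe the server directly. `llama-server` typically serves whichever model it loaded regardless of the `model` field, so the name below is mostly cosmetic:
```bash
# Direct probe of the llama.cpp server, bypassing GenieHive.
curl -sS http://127.0.0.1:18091/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{"model":"qwen3-8b-q4_k_m","messages":[{"role":"user","content":"Say OK."}]}' \
  | jq -r '.choices[0].message.content'
```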
### Option C: llamafile
Use this when you want a single-file local server built around llama.cpp.
1. Start a chat server:
```bash
./your-model.llamafile --server --host 127.0.0.1 --port 18091 --nobrowser
```
2. Example peer service config:
```yaml
services:
  - service_id: "peer1/chat/llamafile-qwen3"
    kind: "chat"
    endpoint: "http://127.0.0.1:18091"
    runtime:
      engine: "llamafile"
      launcher: "external"
    assets:
      - asset_id: "qwen3-8b-q4_k_m"
        loaded: true
    state:
      health: "healthy"
      load_state: "loaded"
      accept_requests: true
```
Then start the node:
```bash
cd /home/netuser/bin/geniehive
bash scripts/run_node_singlebox.sh configs/node.singlebox.llamafile.example.yaml
```
### Option D: vLLM
Use this when you want a more server-oriented OpenAI-compatible stack and you have the hardware budget for it.
1. Start the server:
```bash
vllm serve NousResearch/Meta-Llama-3-8B-Instruct --dtype auto --api-key token-abc123
```
2. Example peer service config:
```yaml
services:
  - service_id: "peer1/chat/llama3-8b"
    kind: "chat"
    endpoint: "http://127.0.0.1:8000"
    runtime:
      engine: "vllm"
      launcher: "external"
    assets:
      - asset_id: "NousResearch/Meta-Llama-3-8B-Instruct"
        loaded: true
    state:
      health: "healthy"
      load_state: "loaded"
      accept_requests: true
  - service_id: "peer1/embeddings/bge-base"
    kind: "embeddings"
    endpoint: "http://127.0.0.1:8001"
    runtime:
      engine: "vllm"
      launcher: "external"
    assets:
      - asset_id: "BAAI/bge-base-en-v1.5"
        loaded: true
    state:
      health: "healthy"
      load_state: "loaded"
      accept_requests: true
```
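Note that `vllm serve` above sets `--api-key token-abc123`, which makes vLLM reject requests that lack a matching `Authorization: Bearer` header. Whether the GenieHive node forwards upstream credentials is not covered in this runbook, so either verify the proxied path end to end or start vLLM without `--api-key` for the demo. To probe the upstream directly:
```bash
# Direct probe of vLLM; the Bearer token is mandatory once --api-key is set.
curl -sS http://127.0.0.1:8000/v1/models \
  -H 'Authorization: Bearer token-abc123' | jq -r '.data[].id'
```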
## Minimal Node Config Pattern
For a real peer host, the fields you most likely need to edit in `configs/node.example.yaml` are:
- `node.host_id`
- `node.display_name`
- `node.address`
- `control_plane.base_url`
- `control_plane.node_api_key`
- `inventory.capabilities`
- `services`
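To see where those fields live before editing:
```bash
# Locate the fields you need to edit in the example node config.
grep -nE 'host_id|display_name|address|base_url|node_api_key|capabilities' \
  configs/node.example.yaml
```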
## Client Instructions
You now have two simple ways to exercise GenieHive as a client.
### Option 1: Inspect and call it manually
List models:
```bash
curl -sS http://127.0.0.1:8800/v1/models \
  -H 'X-Api-Key: change-me-client-key'
```
Chat using a role:
```bash
curl -sS http://127.0.0.1:8800/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -H 'X-Api-Key: change-me-client-key' \
  -d '{
    "model": "mentor",
    "messages": [{"role":"user","content":"Give me a 2-sentence summary of why SQLite is useful here."}]
  }'
```
Embeddings using a direct embedding asset:
```bash
curl -sS http://127.0.0.1:8800/v1/embeddings \
  -H 'Content-Type: application/json' \
  -H 'X-Api-Key: change-me-client-key' \
  -d '{
    "model": "nomic-embed-text",
    "input": "GenieHive is a local-first control plane."
  }'
```
### Option 2: Use the demo client agent
Run:
```bash
cd /home/netuser/bin/geniehive
python scripts/demo_client_agent.py \
  --base-url http://127.0.0.1:8800 \
  --api-key change-me-client-key \
  --task "Summarize the current GenieHive demo in three bullets."
```
That script will:
- read `GET /v1/models`
- choose a chat-capable model automatically if you do not specify one
- prefer entries GenieHive marks as suitable for lower-complexity offload
- submit a chat request and print the answer
If you want to force a specific route:
```bash
python scripts/demo_client_agent.py \
  --base-url http://127.0.0.1:8800 \
  --api-key change-me-client-key \
  --model mentor \
  --task "State what host and route type you would expect for this demo."
```
## Codex-As-Client
For Codex or another agentic client, the intended pattern is:
1. Read `GET /v1/models`.
2. Filter for `geniehive.operation == "chat"`.
3. Prefer:
- `geniehive.offload_hint.suitability == "good_for_low_complexity"`
- `geniehive.loaded_target_count > 0` for role entries
- lower `best_p50_latency_ms`
4. Send lower-complexity requests to GenieHive.
5. Keep higher-complexity, high-context, or high-risk tasks local unless the catalog indicates a better remote fit. A `jq` sketch of steps 2-4 follows this list.
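A minimal `jq` sketch of that selection, assuming the `geniehive` metadata object sits on each catalog entry exactly as the field names above suggest; adjust the paths if your build nests them differently:
```bash
# Pick a chat-capable model, preferring low-complexity offload hints and low p50 latency.
curl -sS http://127.0.0.1:8800/v1/models \
  -H 'X-Api-Key: change-me-client-key' \
  | jq -r '
      [ .data[]
        | select(.geniehive.operation == "chat")
        | select((.geniehive.loaded_target_count // 1) > 0) ]
      | sort_by([
          (if .geniehive.offload_hint.suitability == "good_for_low_complexity" then 0 else 1 end),
          (.geniehive.best_p50_latency_ms // 1e9)
        ])
      | .[0].id'
```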
## Good First Live Demo
If you want the safest first success path:
- control plane on one host
- node agent on the same host
- Ollama upstream with one chat model
- role alias `mentor`
- demo client agent calling `mentor`
That avoids GGUF-specific launch tuning while still exercising the full GenieHive master/peer/client path.
## Single-Machine End-to-End Example
### Ollama-backed single box
1. Start Ollama:
```bash
ollama serve
```
2. Pull models:
```bash
ollama pull qwen3
ollama pull nomic-embed-text
```
3. Start GenieHive control:
```bash
cd /home/netuser/bin/geniehive
bash scripts/run_control_singlebox.sh
```
4. Start GenieHive node:
```bash
cd /home/netuser/bin/geniehive
bash scripts/run_node_singlebox.sh configs/node.singlebox.ollama.example.yaml
```
5. Inspect:
```bash
bash scripts/demo_inspect.sh
```
6. Run the client agent:
```bash
python scripts/demo_client_agent.py \
  --base-url http://127.0.0.1:8800 \
  --api-key change-me-client-key \
  --task "Explain in three bullets what GenieHive is doing in this single-machine demo."
```
### llama.cpp-backed single box
1. Start the local server:
```bash
llama-server -m /path/to/model.gguf --host 127.0.0.1 --port 18091
```
2. Start GenieHive control:
```bash
cd /home/netuser/bin/geniehive
bash scripts/run_control_singlebox.sh
```
3. Start GenieHive node:
```bash
cd /home/netuser/bin/geniehive
bash scripts/run_node_singlebox.sh configs/node.singlebox.llamacpp.example.yaml
```
4. Run the client agent:
```bash
python scripts/demo_client_agent.py \
  --base-url http://127.0.0.1:8800 \
  --api-key change-me-client-key \
  --task "Summarize why a single-machine GenieHive setup can still be useful."
```
## Host-Specific Note: Dual Tesla P40 + 128 GB RAM
For a machine with:
- `2 x Nvidia Tesla P40`
- `AMD Ryzen 5600G`
- `128 GB RAM`
the most practical first GenieHive layout is:
- one chat model on `GPU0`
- one chat or utility model on `GPU1`
- one slower fallback chat model on CPU
This is now sketched in:
- `configs/node.singlebox.p40-triple.example.yaml`
- `configs/control.singlebox.p40.example.yaml`
- `configs/roles.singlebox.p40.example.yaml`
- `scripts/start_p40_triple_llamacpp.sh`
- `scripts/launch_p40_triple.sh`
- `scripts/p40_triple_gpu0.sh`
- `scripts/p40_triple_gpu1.sh`
- `scripts/p40_triple_cpu.sh`
The current concrete defaults use models already present under `/home/netuser/bin/models/llm`:
- `GPU0`: `Qwen2.5-14B-Instruct-1M-Q5_K_M.gguf`
- `GPU1`: `Qwen3.5-9B-Q5_K_M.gguf`
- `CPU`: `rocket-3b.Q5_K_M.gguf`
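Before launching, confirm both P40s are visible (each should report 24 GB of total memory):
```bash
# Both Tesla P40s should appear as index 0 and 1 with 24 GB each.
nvidia-smi --query-gpu=index,name,memory.total,memory.used --format=csv
```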
### Why this layout works
- each P40 has 24 GB of VRAM, enough to serve a quantized 7B to 14B model comfortably
- 128 GB RAM is enough to hold a separate CPU-served fallback model without much trouble
- the CPU route will be much slower, but it is still useful for low-priority offload or fallback handling
### Suggested role usage
- `mentor` or primary chat role -> `GPU0`
- `general_assistant` or alternate chat role -> `GPU1`
- `fallback_writer` or `background_summarizer` -> CPU route
The repo now includes a host-specific role catalog with exactly that intent.
### Launch pattern
1. Check the model paths in `scripts/start_p40_triple_llamacpp.sh`, then run it:
```bash
cd /home/netuser/bin/geniehive
bash scripts/start_p40_triple_llamacpp.sh
```
If the defaults look good, you do not need to edit them before trying the first run.
If `tmux` is available, you can also launch the three processes detached:
```bash
cd /home/netuser/bin/geniehive
bash scripts/launch_p40_triple.sh
```
Then inspect pane state without binding your current terminal to the session:
```bash
bash scripts/tmux_session_status.sh
```
That status helper checks whether the session exists and whether each pane's launcher process is still running or has already exited. If `tmux` is not installed, the combined launcher prints the three helper commands instead.
2. Make sure all three `llama-server` processes are running, via the launchers above or in separate shells.
3. Start GenieHive control:
```bash
bash scripts/run_control_singlebox.sh configs/control.singlebox.p40.example.yaml
```
4. Start GenieHive node with the host-specific config:
```bash
bash scripts/run_node_singlebox.sh configs/node.singlebox.p40-triple.example.yaml
```
5. Inspect the catalog:
```bash
bash scripts/demo_inspect.sh
```
If something is not coming up cleanly, run:
```bash
bash scripts/check_singlebox_health.sh
```
That checks:
- `GPU0` upstream health
- `GPU1` upstream health
- CPU fallback upstream health
- GenieHive control health
- GenieHive node health
- authenticated cluster and model-catalog endpoints
6. Exercise the chat path:
```bash
python scripts/demo_client_agent.py \
  --base-url http://127.0.0.1:8800 \
  --api-key change-me-client-key \
  --model mentor \
  --task "State which route should be preferred for low-latency chat and which should be the slow fallback."
```
### Practical expectations
- `GPU0` and `GPU1` should be the preferred targets for normal chat work
- the CPU route should mostly be treated as fallback or low-priority background work
- GenieHive metadata should make that visible to clients through latency and offload hints
### Containerized Qwen3.5 probe
If the host-installed `llama-server` is too old for `Qwen3.5`, but the NVIDIA Container Toolkit is installed, you can test a newer CUDA-enabled `llama.cpp` without changing the host CUDA stack:
```bash
cd /home/netuser/bin/geniehive
bash scripts/test_qwen35_server_cuda_container.sh
```
Useful overrides:
```bash
GPU_INDEX=1 PORT=19092 bash scripts/test_qwen35_server_cuda_container.sh
MODEL_PATH=/home/netuser/bin/models/llm/Qwen3.5-9B-Q5_K_M.gguf bash scripts/test_qwen35_server_cuda_container.sh
```
That probe uses the official `ghcr.io/ggml-org/llama.cpp:server-cuda` image. If it loads the model and starts serving, then the remaining blocker is your host `llama.cpp` install, not GPU compatibility.
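For reference, the probe is roughly equivalent to the following `docker run`. This is a sketch, not the script's exact invocation; it assumes the image's entrypoint is the server binary, so arguments after the image name are ordinary `llama-server` flags:
```bash
# Rough equivalent of the probe: newer CUDA llama.cpp build pinned to GPU 1.
docker run --rm --gpus '"device=1"' \
  -v /home/netuser/bin/models/llm:/models:ro \
  -p 19092:19092 \
  ghcr.io/ggml-org/llama.cpp:server-cuda \
  -m /models/Qwen3.5-9B-Q5_K_M.gguf \
  --host 0.0.0.0 --port 19092 --n-gpu-layers 999
```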
## External Client Access
For your current host addresses:
- LAN: `192.168.40.207`
- ZeroTier: `172.24.50.65`
The cleanest rule is:
- keep upstream model servers on `127.0.0.1`
- keep the GenieHive node on `127.0.0.1` unless you specifically need remote node access
- expose only the GenieHive control plane to LAN or ZeroTier clients
That gives remote clients a single stable endpoint without exposing the underlying model servers directly.
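If the host runs a firewall, scope the exposure to the control-plane port only. A `ufw` sketch, assuming a /24 LAN and a /16 ZeroTier network; adapt the CIDRs to your networks:
```bash
# Allow only the GenieHive control port; upstream model ports stay loopback-only.
sudo ufw allow from 192.168.40.0/24 to any port 8800 proto tcp   # LAN
sudo ufw allow from 172.24.0.0/16 to any port 8800 proto tcp     # ZeroTier
```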
### LAN bind
```bash
cd /home/netuser/bin/geniehive
bash scripts/run_control_p40_lan.sh
```
Remote client example:
```bash
python scripts/demo_client_agent.py \
  --base-url http://192.168.40.207:8800 \
  --api-key change-me-client-key \
  --model mentor \
  --task "Briefly describe the preferred and fallback routes on this host."
```
### ZeroTier bind
```bash
cd /home/netuser/bin/geniehive
bash scripts/run_control_p40_zerotier.sh
```
Remote client example:
```bash
python scripts/demo_client_agent.py \
  --base-url http://172.24.50.65:8800 \
  --api-key change-me-client-key \
  --model mentor \
  --task "Briefly describe the preferred and fallback routes on this host."
```
### Security note
Prefer ZeroTier over general LAN exposure when possible. In both cases:
- do not expose the upstream `llama-server` ports
- keep the client API key enabled
- if you later open this beyond trusted networks, add a reverse proxy or VPN-only boundary rather than binding GenieHive broadly
### Role meanings for this host
- `mentor` should bias toward the `GPU0` Qwen2.5 14B route
- `general_assistant` should bias toward the `GPU1` Qwen3.5 9B route
- `background_summarizer` should bias toward the CPU Rocket 3B fallback route
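As a final check that all three routes answer, loop one short request over the host's role aliases, using the same request shape as the earlier chat examples:
```bash
# One short request per role alias; each should land on its intended route.
for role in mentor general_assistant background_summarizer; do
  echo "== $role =="
  curl -sS http://127.0.0.1:8800/v1/chat/completions \
    -H 'Content-Type: application/json' \
    -H 'X-Api-Key: change-me-client-key' \
    -d "{\"model\":\"$role\",\"messages\":[{\"role\":\"user\",\"content\":\"Reply with one word: ready.\"}]}" \
    | jq -r '.choices[0].message.content'
done
```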