GenieHive LLM Demo
This runbook covers the first practical GenieHive LLM demo with three roles:
- master: the GenieHive control plane
- peer: a GenieHive node agent attached to one or more local LLM servers
- client: a demo client agent or Codex using GenieHive as the API front door
Current Readiness
GenieHive v1 is fully implemented and ready for live demo.
What works:
- Node registration and heartbeat with auto-re-registration on 404
- Role-aware route resolution with fallback_roles chain and cycle protection
- Three routing strategies: scored (default), round_robin, least_loaded
- GET /v1/models — OpenAI-compatible catalog with rich GenieHive metadata
- POST /v1/chat/completions — non-streaming and streaming (stream: true)
- POST /v1/embeddings
- POST /v1/audio/transcriptions — multipart audio proxy
- Active health probing (routing.probe_interval_s in control config)
- Ollama dynamic model discovery: discover_protocol: "ollama" in node config queries /api/tags and /api/ps each heartbeat; corrects loaded state and populates observed.loaded_model_count and observed.vram_used_bytes
- OpenAI-compatible discovery: discover_protocol: "openai" queries /v1/models
- Reasoning-field stripping (reasoning_content, reasoning) from both non-streaming and streaming responses
- Request policy: body defaults, overrides, system prompt injection per asset or role
- Qwen3/Qwen3.5 auto-detection with enable_thinking: false applied automatically
GenieHive does not launch upstream LLM servers for you. Treat it as a metadata-rich router over already-running local servers.
Smoke Test
After bringing up control + node + upstream, run:
python scripts/smoke_test.py \
--base-url http://127.0.0.1:8800 \
--api-key change-me-client-key
This validates in sequence: health, cluster state, model catalog, route resolution, non-streaming chat (role and direct asset), streaming chat, embeddings, Ollama discovery metrics, and reasoning-field stripping. Each check reports PASS / FAIL / SKIP with a short explanation.
Optional flags:
python scripts/smoke_test.py \
--base-url http://127.0.0.1:8800 \
--api-key change-me-client-key \
--chat-role mentor \
--chat-asset qwen3 \
--embed-asset nomic-embed-text
New Capabilities Since Initial Demo
Streaming chat
Add "stream": true to any chat request. GenieHive returns a standard
text/event-stream response with Cache-Control: no-cache and
X-Accel-Buffering: no headers set for nginx/proxy compatibility:
curl -sS http://127.0.0.1:8800/v1/chat/completions \
-H 'Content-Type: application/json' \
-H 'X-Api-Key: change-me-client-key' \
-d '{
"model": "mentor",
"messages": [{"role":"user","content":"Count to five."}],
"stream": true
}'
Reasoning fields (reasoning_content, reasoning) are stripped from every
SSE chunk before forwarding, just as they are for non-streaming responses.
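For intuition, a minimal sketch of that per-chunk behavior is shown below. This is an illustration of the documented contract, not GenieHive's actual implementation; it assumes each SSE data line carries an OpenAI-style chat-completion chunk:
# Illustration only: mirrors the documented stripping behavior for a single SSE line.
import json

REASONING_FIELDS = ("reasoning_content", "reasoning")

def strip_reasoning(sse_line: str) -> str:
    # Pass through non-data lines and the terminal [DONE] marker unchanged.
    if not sse_line.startswith("data: ") or sse_line.strip() == "data: [DONE]":
        return sse_line
    chunk = json.loads(sse_line[len("data: "):])
    for choice in chunk.get("choices", []):
        delta = choice.get("delta", {})
        for field in REASONING_FIELDS:
            delta.pop(field, None)
    return "data: " + json.dumps(chunk)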
Routing strategy
Set routing.default_strategy in your control config:
routing:
  default_strategy: "least_loaded"   # scored | round_robin | least_loaded
least_loaded picks the service with the lowest queue_depth + in_flight.
When Ollama discovery is enabled, loaded_model_count and vram_used_bytes
are available in observed and visible via GET /v1/cluster/services.
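As a sketch of that rule (field names follow this runbook; the real control-plane scorer may read them from a different nesting or apply extra tie-breaking):
# Hedged sketch of the documented least_loaded rule: lowest queue_depth + in_flight wins.
def pick_least_loaded(candidates: list[dict]) -> dict:
    def load(svc: dict) -> int:
        observed = svc.get("observed", {})
        return observed.get("queue_depth", 0) + observed.get("in_flight", 0)
    return min(candidates, key=load)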
Ollama dynamic model discovery
Add discover_protocol: "ollama" to any Ollama-backed service in your node
config. On each heartbeat the node queries /api/tags (available models) and
/api/ps (VRAM-loaded models) and merges the results into the service's asset
list. Stale loaded: true entries in static config are corrected automatically.
services:
  - service_id: "singlebox/chat/qwen3"
    kind: "chat"
    endpoint: "http://127.0.0.1:11434"
    discover_protocol: "ollama"
    assets:
      - asset_id: "qwen3"   # static baseline; enriched each heartbeat
        loaded: true
After the first enriched heartbeat, GET /v1/cluster/services will show
observed.loaded_model_count and observed.vram_used_bytes for that service.
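To spot-check the enrichment, a query along these lines works; the jq path assumes the response carries a top-level services list, so adjust it to the actual shape if needed:
curl -sS http://127.0.0.1:8800/v1/cluster/services \
  -H 'X-Api-Key: change-me-client-key' \
  | jq '.services[] | {service_id, loaded: .observed.loaded_model_count, vram: .observed.vram_used_bytes}'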
Role catalogs
Five role catalog files are now included under configs/:
| File | Framework | Roles |
|---|---|---|
| roles.surgical-team.example.yaml | Brooks/Mills surgical team | 9 (surg_ prefix) |
| roles.belbin.example.yaml | Belbin team roles | 9 (belbin_ prefix) |
| roles.sixhats.example.yaml | De Bono Six Thinking Hats | 6 (sixhats_ prefix) |
| roles.disney.example.yaml | Disney creative strategy | 3 (disney_ prefix) |
| roles.xp.example.yaml | XP team roles | 5 (xp_ prefix) |
Point roles_path in your control config at any of these files to load that
catalog. Multiple catalogs can be merged by listing them in sequence — or
concatenate the roles: blocks manually into a single file.
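For example, pointing the control plane at the Six Hats catalog would look roughly like this; the exact placement of roles_path inside the control config may differ, so treat it as a sketch and check configs/control.example.yaml:
# control config excerpt (sketch; verify the key location in configs/control.example.yaml)
roles_path: "configs/roles.sixhats.example.yaml"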
Topologies
Smallest Demo
Run everything on one host:
- control plane on 127.0.0.1:8800
- node agent on 127.0.0.1:8891
- one or more upstream model servers on local ports
This is also the recommended setup for users who do not have a cluster. GenieHive still provides value as:
- a local router
- a metadata-rich local model catalog
- a role-to-model indirection layer
- a common front door for client tools
Two-Host Demo
- master host runs GenieHive control plane
- peer host runs GenieHive node agent and one or more local LLM servers
- client runs anywhere that can reach the master
Master Instructions
On the control-plane host:
- Create a repo-local Python environment if you want isolation.
- Start GenieHive control:
cd /home/netuser/bin/geniehive
bash scripts/run_control.sh
- Confirm health:
curl -sS http://127.0.0.1:8800/health
Expected result:
- JSON containing {"status":"ok"}
- Keep note of the example client and node keys from configs/control.example.yaml.
Single-Box Shortcut
If you are running control and node on the same machine, use:
cd /home/netuser/bin/geniehive
bash scripts/run_control_singlebox.sh
For your P40 host, repo-provided external bind helpers now exist:
LAN:
bash scripts/run_control_p40_lan.sh
ZeroTier:
bash scripts/run_control_p40_zerotier.sh
Both use the P40-specific control config and only change the bind interface.
Peer Instructions
On each peer host you need:
- one or more local LLM servers already running
- one GenieHive node config that points at those servers
- the control-plane base URL and node API key
For a single-machine setup, the peer is simply another process on the same host.
The node agent should advertise upstream server roots, not endpoint suffixes. For example:
- good: http://127.0.0.1:11434
- good: http://127.0.0.1:18091
- not good: http://127.0.0.1:11434/v1/chat/completions
Option A: Ollama
Use this when you want the lowest-friction chat and embeddings demo.
- Start Ollama if it is not already running:
ollama serve
- Pull the model or models you want:
ollama pull qwen3
ollama pull nomic-embed-text
- Example peer service config:
services:
  - service_id: "peer1/chat/qwen3"
    kind: "chat"
    endpoint: "http://127.0.0.1:11434"
    runtime:
      engine: "ollama"
      launcher: "external"
    assets:
      - asset_id: "qwen3"
        loaded: true
    state:
      health: "healthy"
      load_state: "loaded"
      accept_requests: true
  - service_id: "peer1/embeddings/nomic-embed-text"
    kind: "embeddings"
    endpoint: "http://127.0.0.1:11434"
    runtime:
      engine: "ollama"
      launcher: "external"
    assets:
      - asset_id: "nomic-embed-text"
        loaded: true
    state:
      health: "healthy"
      load_state: "loaded"
      accept_requests: true
- Start the node:
cd /home/netuser/bin/geniehive
bash scripts/run_node_singlebox.sh configs/node.singlebox.ollama.example.yaml
Option B: llama.cpp
Use this when you want direct GGUF serving with llama-server.
- Start a chat server:
llama-server -m /path/to/model.gguf --host 127.0.0.1 --port 18091
- Example peer service config:
services:
  - service_id: "peer1/chat/qwen3-8b"
    kind: "chat"
    endpoint: "http://127.0.0.1:18091"
    runtime:
      engine: "llama.cpp"
      launcher: "external"
    assets:
      - asset_id: "qwen3-8b-q4_k_m"
        loaded: true
    state:
      health: "healthy"
      load_state: "loaded"
      accept_requests: true
Then start the node:
cd /home/netuser/bin/geniehive
bash scripts/run_node_singlebox.sh configs/node.singlebox.llamacpp.example.yaml
Note:
- The official llama.cpp docs clearly show OpenAI-compatible chat serving.
- For embeddings, some llama.cpp builds document non-OpenAI embedding endpoints such as /embedding, so GenieHive's current POST /v1/embeddings path is safest with Ollama or vLLM unless you have verified your specific build (a quick probe is shown below).
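A direct probe against your llama-server port settles the question quickly. This assumes the server was started with embeddings enabled; flag names and endpoint support vary by build:
curl -sS http://127.0.0.1:18091/v1/embeddings \
  -H 'Content-Type: application/json' \
  -d '{"model":"qwen3-8b-q4_k_m","input":"probe"}'
A response containing an embedding array means that build can sit behind GenieHive's POST /v1/embeddings path; an error means prefer Ollama or vLLM for embeddings.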
Option C: llamafile
Use this when you want a single-file local server built around llama.cpp.
- Start a chat server:
./your-model.llamafile --server --host 127.0.0.1 --port 18091 --nobrowser
- Example peer service config:
services:
  - service_id: "peer1/chat/llamafile-qwen3"
    kind: "chat"
    endpoint: "http://127.0.0.1:18091"
    runtime:
      engine: "llamafile"
      launcher: "external"
    assets:
      - asset_id: "qwen3-8b-q4_k_m"
        loaded: true
    state:
      health: "healthy"
      load_state: "loaded"
      accept_requests: true
Then start the node:
cd /home/netuser/bin/geniehive
bash scripts/run_node_singlebox.sh configs/node.singlebox.llamafile.example.yaml
Option D: vLLM
Use this when you want a more server-oriented OpenAI-compatible stack and you have the hardware budget for it.
- Start the server:
vllm serve NousResearch/Meta-Llama-3-8B-Instruct --dtype auto --api-key token-abc123
- Example peer service config:
services:
  - service_id: "peer1/chat/llama3-8b"
    kind: "chat"
    endpoint: "http://127.0.0.1:8000"
    runtime:
      engine: "vllm"
      launcher: "external"
    assets:
      - asset_id: "NousResearch/Meta-Llama-3-8B-Instruct"
        loaded: true
    state:
      health: "healthy"
      load_state: "loaded"
      accept_requests: true
  - service_id: "peer1/embeddings/bge-base"
    kind: "embeddings"
    endpoint: "http://127.0.0.1:8001"
    runtime:
      engine: "vllm"
      launcher: "external"
    assets:
      - asset_id: "BAAI/bge-base-en-v1.5"
        loaded: true
    state:
      health: "healthy"
      load_state: "loaded"
      accept_requests: true
Minimal Node Config Pattern
For a real peer host, the fields you most likely need to edit in configs/node.example.yaml are:
- node.host_id
- node.display_name
- node.address
- control_plane.base_url
- control_plane.node_api_key
- inventory.capabilities
- services
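A rough sketch of how those fields fit together is below. Values are placeholders and the capabilities format is assumed, so copy configs/node.example.yaml and edit it rather than starting from this fragment:
node:
  host_id: "peer1"
  display_name: "Peer 1 (lab box)"
  address: "127.0.0.1:8891"
control_plane:
  base_url: "http://127.0.0.1:8800"
  node_api_key: "change-me-node-key"   # placeholder; use the key from your control config
inventory:
  capabilities: ["chat", "embeddings"]  # assumed format; check node.example.yaml
services:
  - service_id: "peer1/chat/qwen3"
    kind: "chat"
    endpoint: "http://127.0.0.1:11434"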
Client Instructions
You now have two simple ways to exercise GenieHive as a client.
Option 1: Inspect and call it manually
List models:
curl -sS http://127.0.0.1:8800/v1/models \
-H 'X-Api-Key: change-me-client-key'
Chat using a role:
curl -sS http://127.0.0.1:8800/v1/chat/completions \
-H 'Content-Type: application/json' \
-H 'X-Api-Key: change-me-client-key' \
-d '{
"model": "mentor",
"messages": [{"role":"user","content":"Give me a 2-sentence summary of why SQLite is useful here."}]
}'
Embeddings using a direct embedding asset:
curl -sS http://127.0.0.1:8800/v1/embeddings \
-H 'Content-Type: application/json' \
-H 'X-Api-Key: change-me-client-key' \
-d '{
"model": "nomic-embed-text",
"input": "GenieHive is a local-first control plane."
}'
Option 2: Use the demo client agent
Run:
cd /home/netuser/bin/geniehive
python scripts/demo_client_agent.py \
--base-url http://127.0.0.1:8800 \
--api-key change-me-client-key \
--task "Summarize the current GenieHive demo in three bullets."
That script will:
- read GET /v1/models
- choose a chat-capable model automatically if you do not specify one
- prefer entries GenieHive marks as suitable for lower-complexity offload
- submit a chat request and print the answer
If you want to force a specific route:
python scripts/demo_client_agent.py \
--base-url http://127.0.0.1:8800 \
--api-key change-me-client-key \
--model mentor \
--task "State what host and route type you would expect for this demo."
Codex-As-Client
For Codex or another agentic client, the intended pattern is:
- Read GET /v1/models.
- Filter for geniehive.operation == "chat".
- Prefer:
  - geniehive.offload_hint.suitability == "good_for_low_complexity"
  - geniehive.loaded_target_count > 0 for role entries
  - lower best_p50_latency_ms
- Send lower-complexity requests to GenieHive.
- Keep higher-complexity, high-context, or high-risk tasks local unless the catalog indicates a better remote fit.
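A compact sketch of that selection logic follows. The geniehive.* field names come from the catalog description above; missing fields are handled defensively, and the endpoint and key are the demo defaults:
# Sketch of the Codex-as-client selection pattern; not a drop-in GenieHive client.
import requests

BASE_URL = "http://127.0.0.1:8800"
HEADERS = {"X-Api-Key": "change-me-client-key"}

def pick_offload_model() -> str | None:
    models = requests.get(f"{BASE_URL}/v1/models", headers=HEADERS, timeout=10).json()["data"]
    chat = [m for m in models if m.get("geniehive", {}).get("operation") == "chat"]

    def rank(entry: dict) -> tuple:
        g = entry.get("geniehive", {})
        good = (g.get("offload_hint") or {}).get("suitability") == "good_for_low_complexity"
        loaded = g.get("loaded_target_count", 0) > 0
        latency = g.get("best_p50_latency_ms") or float("inf")
        # False sorts before True, so preferred entries come first; then lowest latency.
        return (not good, not loaded, latency)

    return min(chat, key=rank)["id"] if chat else None

if __name__ == "__main__":
    print(pick_offload_model() or "no chat-capable model advertised")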
Good First Live Demo
If you want the safest first success path:
- control plane on one host
- node agent on the same host
- Ollama upstream with one chat model
- role alias mentor
- demo client agent calling mentor
That avoids GGUF-specific launch tuning while still exercising the full GenieHive master/peer/client path.
Single-Machine End-to-End Example
Ollama-backed single box
- Start Ollama:
ollama serve
- Pull models:
ollama pull qwen3
ollama pull nomic-embed-text
- Start GenieHive control:
cd /home/netuser/bin/geniehive
bash scripts/run_control_singlebox.sh
- Start GenieHive node:
cd /home/netuser/bin/geniehive
bash scripts/run_node_singlebox.sh configs/node.singlebox.ollama.example.yaml
- Inspect:
bash scripts/demo_inspect.sh
- Run the client agent:
python scripts/demo_client_agent.py \
--base-url http://127.0.0.1:8800 \
--api-key change-me-client-key \
--task "Explain in three bullets what GenieHive is doing in this single-machine demo."
llama.cpp-backed single box
- Start the local server:
llama-server -m /path/to/model.gguf --host 127.0.0.1 --port 18091
- Start GenieHive control:
cd /home/netuser/bin/geniehive
bash scripts/run_control_singlebox.sh
- Start GenieHive node:
cd /home/netuser/bin/geniehive
bash scripts/run_node_singlebox.sh configs/node.singlebox.llamacpp.example.yaml
- Run the client agent:
python scripts/demo_client_agent.py \
--base-url http://127.0.0.1:8800 \
--api-key change-me-client-key \
--task "Summarize why a single-machine GenieHive setup can still be useful."
Host-Specific Note: Dual Tesla P40 + 128 GB RAM
For a machine with:
- 2 x Nvidia Tesla P40
- AMD Ryzen 5600G
- 128 GB RAM
the most practical first GenieHive layout is:
- one chat model on GPU0
- one chat or utility model on GPU1
- one slower fallback chat model on CPU
This is now sketched in:
- configs/node.singlebox.p40-triple.example.yaml
- configs/control.singlebox.p40.example.yaml
- configs/roles.singlebox.p40.example.yaml
- scripts/start_p40_triple_llamacpp.sh
- scripts/launch_p40_triple.sh
- scripts/p40_triple_gpu0.sh
- scripts/p40_triple_gpu1.sh
- scripts/p40_triple_cpu.sh
The current concrete defaults use models already present under /home/netuser/bin/models/llm:
- GPU0: Qwen2.5-14B-Instruct-1M-Q5_K_M.gguf
- GPU1: Qwen3.5-9B-Q5_K_M.gguf
- CPU: rocket-3b.Q5_K_M.gguf
Why this layout works
- each P40 has enough VRAM for a quantized 7B to 14B model comfortably
- 128 GB RAM is enough to hold a separate CPU-served fallback model without much trouble
- the CPU route will be much slower, but it is still useful for low-priority offload or fallback handling
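The back-of-envelope arithmetic behind the first point, assuming 24 GB per P40 and roughly 5.5 bits per weight for Q5_K_M quantization:
# Rough VRAM estimate only; real usage also depends on context length and KV cache settings.
params_billion = 14
bytes_per_weight = 5.5 / 8                      # Q5_K_M is roughly 5.5 bits per weight
weights_gb = params_billion * bytes_per_weight  # ~9.6 GB of weights
print(f"~{weights_gb:.1f} GB weights, ~{24 - weights_gb:.1f} GB left for KV cache and overhead")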
Suggested role usage
- mentor or primary chat role -> GPU0
- general_assistant or alternate chat role -> GPU1
- fallback_writer or background_summarizer -> CPU route
The repo now includes a host-specific role catalog with exactly that intent.
Launch pattern
- Edit your model paths if they differ from the defaults, then run:
cd /home/netuser/bin/geniehive
bash scripts/start_p40_triple_llamacpp.sh
If the defaults look good, you do not need to edit them before trying the first run.
If tmux is available, you can also launch the three processes detached:
cd /home/netuser/bin/geniehive
bash scripts/launch_p40_triple.sh
Then inspect pane state without binding your current terminal to the session:
bash scripts/tmux_session_status.sh
That status helper checks whether the session exists and whether each pane's launcher process is still running or has already exited. If tmux is not installed, the combined launcher prints the three helper commands instead.
- Start the three llama-server processes in separate shells.
- Start GenieHive control:
bash scripts/run_control_singlebox.sh configs/control.singlebox.p40.example.yaml
- Start GenieHive node with the host-specific config:
bash scripts/run_node_singlebox.sh configs/node.singlebox.p40-triple.example.yaml
- Inspect the catalog:
bash scripts/demo_inspect.sh
If something is not coming up cleanly, run:
bash scripts/check_singlebox_health.sh
That checks:
- GPU0 upstream health
- GPU1 upstream health
- CPU fallback upstream health
- GenieHive control health
- GenieHive node health
- authenticated cluster and model-catalog endpoints
- Exercise the chat path:
python scripts/demo_client_agent.py \
--base-url http://127.0.0.1:8800 \
--api-key change-me-client-key \
--model mentor \
--task "State which route should be preferred for low-latency chat and which should be the slow fallback."
Practical expectations
- GPU0 and GPU1 should be the preferred targets for normal chat work
- the CPU route should mostly be treated as fallback or low-priority background work
- GenieHive metadata should make that visible to clients through latency and offload hints (see the catalog query below)
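One way to check that from a client is to pull the hints straight out of the catalog; the geniehive field paths follow the Codex-as-client section above, so adjust if your catalog nests them differently:
curl -sS http://127.0.0.1:8800/v1/models \
  -H 'X-Api-Key: change-me-client-key' \
  | jq '.data[] | {id, suitability: .geniehive.offload_hint.suitability, p50_ms: .geniehive.best_p50_latency_ms}'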
Containerized Qwen3.5 probe
If the host-installed llama-server is too old for Qwen3.5, but the NVIDIA Container Toolkit is installed, you can test a newer CUDA-enabled llama.cpp without changing the host CUDA stack:
cd /home/netuser/bin/geniehive
bash scripts/test_qwen35_server_cuda_container.sh
Useful overrides:
GPU_INDEX=1 PORT=19092 bash scripts/test_qwen35_server_cuda_container.sh
MODEL_PATH=/home/netuser/bin/models/llm/Qwen3.5-9B-Q5_K_M.gguf bash scripts/test_qwen35_server_cuda_container.sh
That probe uses the official ghcr.io/ggml-org/llama.cpp:server-cuda image. If it loads the model and starts serving, then the remaining blocker is your host llama.cpp install, not GPU compatibility.
External Client Access
For your current host addresses:
- LAN: 192.168.40.207
- ZeroTier: 172.24.50.65
The cleanest rule is:
- keep upstream model servers on 127.0.0.1
- keep the GenieHive node on 127.0.0.1 unless you specifically need remote node access
- expose only the GenieHive control plane to LAN or ZeroTier clients
That gives remote clients a single stable endpoint without exposing the underlying model servers directly.
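To confirm that rule after startup, a listening-socket check is enough; the ports below are the defaults used in this runbook:
# Expect 8800 on the LAN/ZeroTier address and everything else on 127.0.0.1 only.
ss -tlnp | grep -E ':(8800|8891|11434|18091)\b'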
LAN bind
cd /home/netuser/bin/geniehive
bash scripts/run_control_p40_lan.sh
Remote client example:
python scripts/demo_client_agent.py \
--base-url http://192.168.40.207:8800 \
--api-key change-me-client-key \
--model mentor \
--task "Briefly describe the preferred and fallback routes on this host."
ZeroTier bind
cd /home/netuser/bin/geniehive
bash scripts/run_control_p40_zerotier.sh
Remote client example:
python scripts/demo_client_agent.py \
--base-url http://172.24.50.65:8800 \
--api-key change-me-client-key \
--model mentor \
--task "Briefly describe the preferred and fallback routes on this host."
Security note
Prefer ZeroTier over general LAN exposure when possible. In both cases:
- do not expose the upstream llama-server ports
- keep the client API key enabled
- if you later open this beyond trusted networks, add a reverse proxy or VPN-only boundary rather than binding GenieHive broadly
Role meanings for this host
- mentor should bias toward the GPU0 Qwen2.5 14B route
- general_assistant should bias toward the GPU1 Qwen3.5 9B route
- background_summarizer should bias toward the CPU Rocket 3B fallback route