GenieHive LLM Demo
This runbook covers the first practical GenieHive LLM demo with three roles:
- master: the GenieHive control plane
- peer: a GenieHive node agent attached to one or more local LLM servers
- client: a demo client agent or Codex using GenieHive as the API front door
Current Readiness
GenieHive v1 is fully implemented and ready for live demo.
What works:
- Node registration and heartbeat with auto-re-registration on 404
- Role-aware route resolution with fallback_roles chain and cycle protection
- Three routing strategies: scored (default), round_robin, least_loaded
- GET /v1/models — OpenAI-compatible catalog with rich GenieHive metadata
- POST /v1/chat/completions — non-streaming and streaming (stream: true)
- POST /v1/embeddings
- POST /v1/audio/transcriptions — multipart audio proxy
- Active health probing (routing.probe_interval_s in control config)
- Ollama dynamic model discovery: discover_protocol: "ollama" in node config queries /api/tags and /api/ps each heartbeat; corrects loaded state and populates observed.loaded_model_count and observed.vram_used_bytes
- OpenAI-compatible discovery: discover_protocol: "openai" queries /v1/models
- Reasoning-field stripping (reasoning_content, reasoning) from both non-streaming and streaming responses
- Request policy: body defaults, overrides, system prompt injection per asset or role
- Qwen3/Qwen3.5 auto-detection with enable_thinking: false applied automatically
GenieHive does not launch upstream LLM servers for you. Treat it as a metadata-rich router over already-running local servers.
Smoke Test
After bringing up control + node + upstream, run:
python scripts/smoke_test.py \
--base-url http://127.0.0.1:8800 \
--api-key change-me-client-key
This validates in sequence: health, cluster state, model catalog, route resolution, non-streaming chat (role and direct asset), streaming chat, embeddings, Ollama discovery metrics, and reasoning-field stripping. Each check reports PASS / FAIL / SKIP with a short explanation.
Optional flags:
python scripts/smoke_test.py \
--base-url http://127.0.0.1:8800 \
--api-key change-me-client-key \
--chat-role mentor \
--chat-asset qwen3 \
--embed-asset nomic-embed-text
New Capabilities Since Initial Demo
Streaming chat
Add "stream": true to any chat request. GenieHive returns a standard
text/event-stream response with Cache-Control: no-cache and
X-Accel-Buffering: no headers set for nginx/proxy compatibility:
curl -sS http://127.0.0.1:8800/v1/chat/completions \
-H 'Content-Type: application/json' \
-H 'X-Api-Key: change-me-client-key' \
-d '{
"model": "mentor",
"messages": [{"role":"user","content":"Count to five."}],
"stream": true
}'
Reasoning fields (reasoning_content, reasoning) are stripped from every
SSE chunk before forwarding, just as they are for non-streaming responses.
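For intuition, a minimal sketch of that per-chunk behavior is shown below. This is an illustration of the documented contract, not GenieHive's actual implementation; it assumes each SSE data line carries an OpenAI-style chat-completion chunk:
# Illustration only: mirrors the documented stripping behavior for a single SSE line.
import json

REASONING_FIELDS = ("reasoning_content", "reasoning")

def strip_reasoning(sse_line: str) -> str:
    # Pass through non-data lines and the terminal [DONE] marker unchanged.
    if not sse_line.startswith("data: ") or sse_line.strip() == "data: [DONE]":
        return sse_line
    chunk = json.loads(sse_line[len("data: "):])
    for choice in chunk.get("choices", []):
        delta = choice.get("delta", {})
        for field in REASONING_FIELDS:
            delta.pop(field, None)
    return "data: " + json.dumps(chunk)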
Routing strategy
Set routing.default_strategy in your control config:
routing:
  default_strategy: "least_loaded"   # scored | round_robin | least_loaded
least_loaded picks the service with the lowest queue_depth + in_flight.
When Ollama discovery is enabled, loaded_model_count and vram_used_bytes
are available in observed and visible via GET /v1/cluster/services.
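As a sketch of that rule (field names follow this runbook; the real control-plane scorer may read them from a different nesting or apply extra tie-breaking):
# Hedged sketch of the documented least_loaded rule: lowest queue_depth + in_flight wins.
def pick_least_loaded(candidates: list[dict]) -> dict:
    def load(svc: dict) -> int:
        observed = svc.get("observed", {})
        return observed.get("queue_depth", 0) + observed.get("in_flight", 0)
    return min(candidates, key=load)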
Ollama dynamic model discovery
Add discover_protocol: "ollama" to any Ollama-backed service in your node
config. On each heartbeat the node queries /api/tags (available models) and
/api/ps (VRAM-loaded models) and merges the results into the service's asset
list. Stale loaded: true entries in static config are corrected automatically.
services:
  - service_id: "singlebox/chat/qwen3"
    kind: "chat"
    endpoint: "http://127.0.0.1:11434"
    discover_protocol: "ollama"
    assets:
      - asset_id: "qwen3"   # static baseline; enriched each heartbeat
        loaded: true
After the first enriched heartbeat, GET /v1/cluster/services will show
observed.loaded_model_count and observed.vram_used_bytes for that service.
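To spot-check the enrichment, a query along these lines works; the jq path assumes the response carries a top-level services list, so adjust it to the actual shape if needed:
curl -sS http://127.0.0.1:8800/v1/cluster/services \
  -H 'X-Api-Key: change-me-client-key' \
  | jq '.services[] | {service_id, loaded: .observed.loaded_model_count, vram: .observed.vram_used_bytes}'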
Role catalogs
Five role catalog files are now included under configs/:
| File | Framework | Roles |
|---|---|---|
| roles.surgical-team.example.yaml | Brooks/Mills surgical team | 9 (surg_ prefix) |
| roles.belbin.example.yaml | Belbin team roles | 9 (belbin_ prefix) |
| roles.sixhats.example.yaml | De Bono Six Thinking Hats | 6 (sixhats_ prefix) |
| roles.disney.example.yaml | Disney creative strategy | 3 (disney_ prefix) |
| roles.xp.example.yaml | XP team roles | 5 (xp_ prefix) |
Point roles_path in your control config at any of these files to load that
catalog. Multiple catalogs can be merged by listing them in sequence — or
concatenate the roles: blocks manually into a single file.
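For example, pointing the control plane at the Six Hats catalog would look roughly like this; the exact placement of roles_path inside the control config may differ, so treat it as a sketch and check configs/control.example.yaml:
# control config excerpt (sketch; verify the key location in configs/control.example.yaml)
roles_path: "configs/roles.sixhats.example.yaml"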
Topologies
Smallest Demo
Run everything on one host:
- control plane on 127.0.0.1:8800
- node agent on 127.0.0.1:8891
- one or more upstream model servers on local ports
This is also the recommended setup for users who do not have a cluster. GenieHive still provides value as:
- a local router
- a metadata-rich local model catalog
- a role-to-model indirection layer
- a common front door for client tools
Two-Host Demo
- master host runs GenieHive control plane
- peer host runs GenieHive node agent and one or more local LLM servers
- client runs anywhere that can reach the master
Master Instructions
On the control-plane host:
- Create a repo-local Python environment if you want isolation.
- Start GenieHive control:
cd /home/netuser/bin/geniehive
bash scripts/run_control.sh
- Confirm health:
curl -sS http://127.0.0.1:8800/health
Expected result:
- JSON containing {"status":"ok"}
- Keep note of the example client and node keys from configs/control.example.yaml.
Single-Box Shortcut
If you are running control and node on the same machine, use:
cd /home/netuser/bin/geniehive
bash scripts/run_control_singlebox.sh
For your P40 host, repo-provided external bind helpers now exist:
LAN:
bash scripts/run_control_p40_lan.sh
ZeroTier:
bash scripts/run_control_p40_zerotier.sh
Both use the P40-specific control config and only change the bind interface.
Peer Instructions
On each peer host you need:
- one or more local LLM servers already running
- one GenieHive node config that points at those servers
- the control-plane base URL and node API key
For a single-machine setup, the peer is simply another process on the same host.
The node agent should advertise upstream server roots, not endpoint suffixes. For example:
- good: http://127.0.0.1:11434
- good: http://127.0.0.1:18091
- not good: http://127.0.0.1:11434/v1/chat/completions
Option A: Ollama
Use this when you want the lowest-friction chat and embeddings demo.
- Start Ollama if it is not already running:
ollama serve
- Pull the model or models you want:
ollama pull qwen3
ollama pull nomic-embed-text
- Example peer service config:
services:
  - service_id: "peer1/chat/qwen3"
    kind: "chat"
    endpoint: "http://127.0.0.1:11434"
    runtime:
      engine: "ollama"
      launcher: "external"
    assets:
      - asset_id: "qwen3"
        loaded: true
    state:
      health: "healthy"
      load_state: "loaded"
      accept_requests: true
  - service_id: "peer1/embeddings/nomic-embed-text"
    kind: "embeddings"
    endpoint: "http://127.0.0.1:11434"
    runtime:
      engine: "ollama"
      launcher: "external"
    assets:
      - asset_id: "nomic-embed-text"
        loaded: true
    state:
      health: "healthy"
      load_state: "loaded"
      accept_requests: true
- Start the node:
cd /home/netuser/bin/geniehive
bash scripts/run_node_singlebox.sh configs/node.singlebox.ollama.example.yaml
Option B: llama.cpp
Use this when you want direct GGUF serving with llama-server.
- Start a chat server:
llama-server -m /path/to/model.gguf --host 127.0.0.1 --port 18091
- Example peer service config:
services:
  - service_id: "peer1/chat/qwen3-8b"
    kind: "chat"
    endpoint: "http://127.0.0.1:18091"
    runtime:
      engine: "llama.cpp"
      launcher: "external"
    assets:
      - asset_id: "qwen3-8b-q4_k_m"
        loaded: true
    state:
      health: "healthy"
      load_state: "loaded"
      accept_requests: true
Then start the node:
cd /home/netuser/bin/geniehive
bash scripts/run_node_singlebox.sh configs/node.singlebox.llamacpp.example.yaml
Note:
- The official llama.cpp docs clearly show OpenAI-compatible chat serving.
- For embeddings, some llama.cpp builds document non-OpenAI embedding endpoints such as /embedding, so GenieHive's current POST /v1/embeddings path is safest with Ollama or vLLM unless you have verified your specific build (a quick probe is shown below).
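A direct probe against your llama-server port settles the question quickly. This assumes the server was started with embeddings enabled; flag names and endpoint support vary by build:
curl -sS http://127.0.0.1:18091/v1/embeddings \
  -H 'Content-Type: application/json' \
  -d '{"model":"qwen3-8b-q4_k_m","input":"probe"}'
A response containing an embedding array means that build can sit behind GenieHive's POST /v1/embeddings path; an error means prefer Ollama or vLLM for embeddings.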
Option C: llamafile
Use this when you want a single-file local server built around llama.cpp.
- Start a chat server:
./your-model.llamafile --server --host 127.0.0.1 --port 18091 --nobrowser
- Example peer service config:
services:
  - service_id: "peer1/chat/llamafile-qwen3"
    kind: "chat"
    endpoint: "http://127.0.0.1:18091"
    runtime:
      engine: "llamafile"
      launcher: "external"
    assets:
      - asset_id: "qwen3-8b-q4_k_m"
        loaded: true
    state:
      health: "healthy"
      load_state: "loaded"
      accept_requests: true
Then start the node:
cd /home/netuser/bin/geniehive
bash scripts/run_node_singlebox.sh configs/node.singlebox.llamafile.example.yaml
Option D: vLLM
Use this when you want a more server-oriented OpenAI-compatible stack and you have the hardware budget for it.
- Start the server:
vllm serve NousResearch/Meta-Llama-3-8B-Instruct --dtype auto --api-key token-abc123
- Example peer service config:
services:
  - service_id: "peer1/chat/llama3-8b"
    kind: "chat"
    endpoint: "http://127.0.0.1:8000"
    runtime:
      engine: "vllm"
      launcher: "external"
    assets:
      - asset_id: "NousResearch/Meta-Llama-3-8B-Instruct"
        loaded: true
    state:
      health: "healthy"
      load_state: "loaded"
      accept_requests: true
  - service_id: "peer1/embeddings/bge-base"
    kind: "embeddings"
    endpoint: "http://127.0.0.1:8001"
    runtime:
      engine: "vllm"
      launcher: "external"
    assets:
      - asset_id: "BAAI/bge-base-en-v1.5"
        loaded: true
    state:
      health: "healthy"
      load_state: "loaded"
      accept_requests: true
Minimal Node Config Pattern
For a real peer host, the fields you most likely need to edit in configs/node.example.yaml are:
- node.host_id
- node.display_name
- node.address
- control_plane.base_url
- control_plane.node_api_key
- inventory.capabilities
- services
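A rough sketch of how those fields fit together is below. Values are placeholders and the capabilities format is assumed, so copy configs/node.example.yaml and edit it rather than starting from this fragment:
node:
  host_id: "peer1"
  display_name: "Peer 1 (lab box)"
  address: "127.0.0.1:8891"
control_plane:
  base_url: "http://127.0.0.1:8800"
  node_api_key: "change-me-node-key"   # placeholder; use the key from your control config
inventory:
  capabilities: ["chat", "embeddings"]  # assumed format; check node.example.yaml
services:
  - service_id: "peer1/chat/qwen3"
    kind: "chat"
    endpoint: "http://127.0.0.1:11434"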
Client Instructions
You now have two simple ways to exercise GenieHive as a client.
Option 1: Inspect and call it manually
List models:
curl -sS http://127.0.0.1:8800/v1/models \
-H 'X-Api-Key: change-me-client-key'
Chat using a role:
curl -sS http://127.0.0.1:8800/v1/chat/completions \
-H 'Content-Type: application/json' \
-H 'X-Api-Key: change-me-client-key' \
-d '{
"model": "mentor",
"messages": [{"role":"user","content":"Give me a 2-sentence summary of why SQLite is useful here."}]
}'
Embeddings using a direct embedding asset:
curl -sS http://127.0.0.1:8800/v1/embeddings \
-H 'Content-Type: application/json' \
-H 'X-Api-Key: change-me-client-key' \
-d '{
"model": "nomic-embed-text",
"input": "GenieHive is a local-first control plane."
}'
Option 2: Use the demo client agent
Run:
cd /home/netuser/bin/geniehive
python scripts/demo_client_agent.py \
--base-url http://127.0.0.1:8800 \
--api-key change-me-client-key \
--task "Summarize the current GenieHive demo in three bullets."
That script will:
- read GET /v1/models
- choose a chat-capable model automatically if you do not specify one
- prefer entries GenieHive marks as suitable for lower-complexity offload
- submit a chat request and print the answer
If you want to force a specific route:
python scripts/demo_client_agent.py \
--base-url http://127.0.0.1:8800 \
--api-key change-me-client-key \
--model mentor \
--task "State what host and route type you would expect for this demo."
Codex-As-Client
For Codex or another agentic client, the intended pattern is:
- Read GET /v1/models.
- Filter for geniehive.operation == "chat".
- Prefer:
  - geniehive.offload_hint.suitability == "good_for_low_complexity"
  - geniehive.loaded_target_count > 0 for role entries
  - lower best_p50_latency_ms
- Send lower-complexity requests to GenieHive.
- Keep higher-complexity, high-context, or high-risk tasks local unless the catalog indicates a better remote fit.
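A compact sketch of that selection logic follows. The geniehive.* field names come from the catalog description above; missing fields are handled defensively, and the endpoint and key are the demo defaults:
# Sketch of the Codex-as-client selection pattern; not a drop-in GenieHive client.
import requests

BASE_URL = "http://127.0.0.1:8800"
HEADERS = {"X-Api-Key": "change-me-client-key"}

def pick_offload_model() -> str | None:
    models = requests.get(f"{BASE_URL}/v1/models", headers=HEADERS, timeout=10).json()["data"]
    chat = [m for m in models if m.get("geniehive", {}).get("operation") == "chat"]

    def rank(entry: dict) -> tuple:
        g = entry.get("geniehive", {})
        good = (g.get("offload_hint") or {}).get("suitability") == "good_for_low_complexity"
        loaded = g.get("loaded_target_count", 0) > 0
        latency = g.get("best_p50_latency_ms") or float("inf")
        # False sorts before True, so preferred entries come first; then lowest latency.
        return (not good, not loaded, latency)

    return min(chat, key=rank)["id"] if chat else None

if __name__ == "__main__":
    print(pick_offload_model() or "no chat-capable model advertised")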
Good First Live Demo
If you want the safest first success path:
- control plane on one host
- node agent on the same host
- Ollama upstream with one chat model
- role alias mentor
- demo client agent calling mentor
That avoids GGUF-specific launch tuning while still exercising the full GenieHive master/peer/client path.
Single-Machine End-to-End Example
Ollama-backed single box
- Start Ollama:
ollama serve
- Pull models:
ollama pull qwen3
ollama pull nomic-embed-text
- Start GenieHive control:
cd /home/netuser/bin/geniehive
bash scripts/run_control_singlebox.sh
- Start GenieHive node:
cd /home/netuser/bin/geniehive
bash scripts/run_node_singlebox.sh configs/node.singlebox.ollama.example.yaml
- Inspect:
bash scripts/demo_inspect.sh
- Run the client agent:
python scripts/demo_client_agent.py \
--base-url http://127.0.0.1:8800 \
--api-key change-me-client-key \
--task "Explain in three bullets what GenieHive is doing in this single-machine demo."
llama.cpp-backed single box
- Start the local server:
llama-server -m /path/to/model.gguf --host 127.0.0.1 --port 18091
- Start GenieHive control:
cd /home/netuser/bin/geniehive
bash scripts/run_control_singlebox.sh
- Start GenieHive node:
cd /home/netuser/bin/geniehive
bash scripts/run_node_singlebox.sh configs/node.singlebox.llamacpp.example.yaml
- Run the client agent:
python scripts/demo_client_agent.py \
--base-url http://127.0.0.1:8800 \
--api-key change-me-client-key \
--task "Summarize why a single-machine GenieHive setup can still be useful."
Host-Specific Note: Dual Tesla P40 + 128 GB RAM
For a machine with:
- 2 x Nvidia Tesla P40
- AMD Ryzen 5600G
- 128 GB RAM
the most practical first GenieHive layout is:
- one chat model on GPU0
- one chat or utility model on GPU1
- one slower fallback chat model on CPU
This is now sketched in:
- configs/node.singlebox.p40-triple.example.yaml
- configs/control.singlebox.p40.example.yaml
- configs/roles.singlebox.p40.example.yaml
- scripts/start_p40_triple_llamacpp.sh
- scripts/launch_p40_triple.sh
- scripts/p40_triple_gpu0.sh
- scripts/p40_triple_gpu1.sh
- scripts/p40_triple_cpu.sh
The current concrete defaults use models already present under /home/netuser/bin/models/llm:
- GPU0: Qwen2.5-14B-Instruct-1M-Q5_K_M.gguf
- GPU1: Qwen3.5-9B-Q5_K_M.gguf
- CPU: rocket-3b.Q5_K_M.gguf
Why this layout works
- each P40 has enough VRAM for a quantized 7B to 14B model comfortably
- 128 GB RAM is enough to hold a separate CPU-served fallback model without much trouble
- the CPU route will be much slower, but it is still useful for low-priority offload or fallback handling
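The back-of-envelope arithmetic behind the first point, assuming 24 GB per P40 and roughly 5.5 bits per weight for Q5_K_M quantization:
# Rough VRAM estimate only; real usage also depends on context length and KV cache settings.
params_billion = 14
bytes_per_weight = 5.5 / 8                      # Q5_K_M is roughly 5.5 bits per weight
weights_gb = params_billion * bytes_per_weight  # ~9.6 GB of weights
print(f"~{weights_gb:.1f} GB weights, ~{24 - weights_gb:.1f} GB left for KV cache and overhead")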
Suggested role usage
- mentor or primary chat role -> GPU0
- general_assistant or alternate chat role -> GPU1
- fallback_writer or background_summarizer -> CPU route
The repo now includes a host-specific role catalog with exactly that intent.
Launch pattern
- Edit your model paths if they differ from the defaults, then run:
cd /home/netuser/bin/geniehive
bash scripts/start_p40_triple_llamacpp.sh
If the defaults look good, you do not need to edit them before trying the first run.
If tmux is available, you can also launch the three processes detached:
cd /home/netuser/bin/geniehive
bash scripts/launch_p40_triple.sh
Then inspect pane state without binding your current terminal to the session:
bash scripts/tmux_session_status.sh
That status helper checks whether the session exists and whether each pane's launcher process is still running or has already exited. If tmux is not installed, the combined launcher prints the three helper commands instead.
- Start the three llama-server processes in separate shells.
- Start GenieHive control:
bash scripts/run_control_singlebox.sh configs/control.singlebox.p40.example.yaml
- Start GenieHive node with the host-specific config:
bash scripts/run_node_singlebox.sh configs/node.singlebox.p40-triple.example.yaml
- Inspect the catalog:
bash scripts/demo_inspect.sh
If something is not coming up cleanly, run:
bash scripts/check_singlebox_health.sh
That checks:
- GPU0 upstream health
- GPU1 upstream health
- CPU fallback upstream health
- GenieHive control health
- GenieHive node health
- authenticated cluster and model-catalog endpoints
- Exercise the chat path:
python scripts/demo_client_agent.py \
--base-url http://127.0.0.1:8800 \
--api-key change-me-client-key \
--model mentor \
--task "State which route should be preferred for low-latency chat and which should be the slow fallback."
Practical expectations
- GPU0 and GPU1 should be the preferred targets for normal chat work
- the CPU route should mostly be treated as fallback or low-priority background work
- GenieHive metadata should make that visible to clients through latency and offload hints (see the catalog query below)
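One way to check that from a client is to pull the hints straight out of the catalog; the geniehive field paths follow the Codex-as-client section above, so adjust if your catalog nests them differently:
curl -sS http://127.0.0.1:8800/v1/models \
  -H 'X-Api-Key: change-me-client-key' \
  | jq '.data[] | {id, suitability: .geniehive.offload_hint.suitability, p50_ms: .geniehive.best_p50_latency_ms}'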
Containerized Qwen3.5 probe
If the host-installed llama-server is too old for Qwen3.5, but the NVIDIA Container Toolkit is installed, you can test a newer CUDA-enabled llama.cpp without changing the host CUDA stack:
cd /home/netuser/bin/geniehive
bash scripts/test_qwen35_server_cuda_container.sh
Useful overrides:
GPU_INDEX=1 PORT=19092 bash scripts/test_qwen35_server_cuda_container.sh
MODEL_PATH=/home/netuser/bin/models/llm/Qwen3.5-9B-Q5_K_M.gguf bash scripts/test_qwen35_server_cuda_container.sh
That probe uses the official ghcr.io/ggml-org/llama.cpp:server-cuda image. If it loads the model and starts serving, then the remaining blocker is your host llama.cpp install, not GPU compatibility.
External Client Access
For your current host addresses:
- LAN: 192.168.40.207
- ZeroTier: 172.24.50.65
The cleanest rule is:
- keep upstream model servers on 127.0.0.1
- keep the GenieHive node on 127.0.0.1 unless you specifically need remote node access
- expose only the GenieHive control plane to LAN or ZeroTier clients
That gives remote clients a single stable endpoint without exposing the underlying model servers directly.
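To confirm that rule after startup, a listening-socket check is enough; the ports below are the defaults used in this runbook:
# Expect 8800 on the LAN/ZeroTier address and everything else on 127.0.0.1 only.
ss -tlnp | grep -E ':(8800|8891|11434|18091)\b'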
LAN bind
cd /home/netuser/bin/geniehive
bash scripts/run_control_p40_lan.sh
Remote client example:
python scripts/demo_client_agent.py \
--base-url http://192.168.40.207:8800 \
--api-key change-me-client-key \
--model mentor \
--task "Briefly describe the preferred and fallback routes on this host."
ZeroTier bind
cd /home/netuser/bin/geniehive
bash scripts/run_control_p40_zerotier.sh
Remote client example:
python scripts/demo_client_agent.py \
--base-url http://172.24.50.65:8800 \
--api-key change-me-client-key \
--model mentor \
--task "Briefly describe the preferred and fallback routes on this host."
Security note
Prefer ZeroTier over general LAN exposure when possible. In both cases:
- do not expose the upstream llama-server ports
- keep the client API key enabled
- if you later open this beyond trusted networks, add a reverse proxy or VPN-only boundary rather than binding GenieHive broadly
Role meanings for this host
- mentor should bias toward the GPU0 Qwen2.5 14B route
- general_assistant should bias toward the GPU1 Qwen3.5 9B route
- background_summarizer should bias toward the CPU Rocket 3B fallback route