GenieHive LLM Demo

This runbook covers the first practical GenieHive LLM demo with three roles:

  • master: the GenieHive control plane
  • peer: a GenieHive node agent attached to one or more local LLM servers
  • client: a demo client agent or Codex using GenieHive as the API front door

Current Readiness

GenieHive is ready for a first live chat demo now.

What works in GenieHive already:

  • node registration
  • heartbeat
  • role-aware route resolution
  • GET /v1/models
  • POST /v1/chat/completions
  • POST /v1/embeddings

What GenieHive does not do yet:

  • launch upstream LLM servers for you automatically
  • provide POST /v1/audio/transcriptions
  • maintain advanced benchmark history or queue-aware scheduling

For the first demo, treat GenieHive as a metadata-rich router over already-running local servers.

Topologies

Smallest Demo

Run everything on one host:

  • control plane on 127.0.0.1:8800
  • node agent on 127.0.0.1:8891
  • one or more upstream model servers on local ports

This is also the recommended setup for users who do not have a cluster. GenieHive still provides value as:

  • a local router
  • a metadata-rich local model catalog
  • a role-to-model indirection layer
  • a common front door for client tools

Two-Host Demo

  • master host runs GenieHive control plane
  • peer host runs GenieHive node agent and one or more local LLM servers
  • client runs anywhere that can reach the master

Master Instructions

On the control-plane host:

  1. Create a repo-local Python environment if you want isolation.
  2. Start GenieHive control:
cd /home/netuser/bin/geniehive
bash scripts/run_control.sh
  3. Confirm health:
curl -sS http://127.0.0.1:8800/health

Expected result:

  • JSON containing {"status":"ok"}
  4. Make a note of the example client and node API keys from configs/control.example.yaml.

Single-Box Shortcut

If you are running control and node on the same machine, use:

cd /home/netuser/bin/geniehive
bash scripts/run_control_singlebox.sh

For your P40 host, repo-provided external bind helpers now exist:

LAN:

bash scripts/run_control_p40_lan.sh

ZeroTier:

bash scripts/run_control_p40_zerotier.sh

Both use the P40-specific control config and only change the bind interface.

Peer Instructions

On each peer host you need:

  • one or more local LLM servers already running
  • one GenieHive node config that points at those servers
  • the control-plane base URL and node API key

For a single-machine setup, the peer is simply another process on the same host.

The node agent should advertise upstream server roots, not endpoint suffixes. For example:

  • good: http://127.0.0.1:11434
  • good: http://127.0.0.1:18091
  • not good: http://127.0.0.1:11434/v1/chat/completions
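
A quick way to confirm you are pointing at a server root rather than a full endpoint is to probe the root directly. A minimal sketch, assuming an Ollama-style upstream on port 11434; adjust host and port for your server:

# The bare root should respond; Ollama answers with a short "Ollama is running" line,
# and most OpenAI-compatible servers expose /v1/models under the same root.
curl -sS http://127.0.0.1:11434
curl -sS http://127.0.0.1:11434/v1/models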

Option A: Ollama

Use this when you want the lowest-friction chat and embeddings demo.

  1. Start Ollama if it is not already running:
ollama serve
  2. Pull the model or models you want:
ollama pull qwen3
ollama pull nomic-embed-text
  3. Example peer service config:
services:
  - service_id: "peer1/chat/qwen3"
    kind: "chat"
    endpoint: "http://127.0.0.1:11434"
    runtime:
      engine: "ollama"
      launcher: "external"
    assets:
      - asset_id: "qwen3"
        loaded: true
    state:
      health: "healthy"
      load_state: "loaded"
      accept_requests: true

  - service_id: "peer1/embeddings/nomic-embed-text"
    kind: "embeddings"
    endpoint: "http://127.0.0.1:11434"
    runtime:
      engine: "ollama"
      launcher: "external"
    assets:
      - asset_id: "nomic-embed-text"
        loaded: true
    state:
      health: "healthy"
      load_state: "loaded"
      accept_requests: true
  4. Start the node:
cd /home/netuser/bin/geniehive
bash scripts/run_node_singlebox.sh configs/node.singlebox.ollama.example.yaml
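
If chat through GenieHive misbehaves later, it helps to know the upstream works on its own. A minimal direct check against Ollama's OpenAI-compatible chat endpoint, assuming the qwen3 tag pulled above:

curl -sS http://127.0.0.1:11434/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{"model":"qwen3","messages":[{"role":"user","content":"Say hello in one short sentence."}]}'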

Option B: llama.cpp

Use this when you want direct GGUF serving with llama-server.

  1. Start a chat server:
llama-server -m /path/to/model.gguf --host 127.0.0.1 --port 18091
  2. Example peer service config:
services:
  - service_id: "peer1/chat/qwen3-8b"
    kind: "chat"
    endpoint: "http://127.0.0.1:18091"
    runtime:
      engine: "llama.cpp"
      launcher: "external"
    assets:
      - asset_id: "qwen3-8b-q4_k_m"
        loaded: true
    state:
      health: "healthy"
      load_state: "loaded"
      accept_requests: true

Then start the node:

cd /home/netuser/bin/geniehive
bash scripts/run_node_singlebox.sh configs/node.singlebox.llamacpp.example.yaml

Note:

  • The official llama.cpp docs describe OpenAI-compatible chat serving, so the chat path above follows documented behavior.
  • For embeddings, some llama.cpp builds document non-OpenAI endpoints such as /embedding, so GenieHive's current POST /v1/embeddings path is safest with Ollama or vLLM unless you have verified your specific build.
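
If you want to verify embedding support on your particular llama-server build before wiring it into GenieHive, probe both paths directly. A hedged sketch, assuming a server on port 18091; whether either path works, and whether an embeddings launch flag is required, depends on your build:

# OpenAI-style path (only on builds and launch modes that expose it)
curl -sS http://127.0.0.1:18091/v1/embeddings \
  -H 'Content-Type: application/json' \
  -d '{"input":"probe"}'

# native llama.cpp path documented by some builds
curl -sS http://127.0.0.1:18091/embedding \
  -H 'Content-Type: application/json' \
  -d '{"content":"probe"}'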

Option C: llamafile

Use this when you want a single-file local server built around llama.cpp.

  1. Start a chat server:
./your-model.llamafile --server --host 127.0.0.1 --port 18091 --nobrowser
  2. Example peer service config:
services:
  - service_id: "peer1/chat/llamafile-qwen3"
    kind: "chat"
    endpoint: "http://127.0.0.1:18091"
    runtime:
      engine: "llamafile"
      launcher: "external"
    assets:
      - asset_id: "qwen3-8b-q4_k_m"
        loaded: true
    state:
      health: "healthy"
      load_state: "loaded"
      accept_requests: true

Then start the node:

cd /home/netuser/bin/geniehive
bash scripts/run_node_singlebox.sh configs/node.singlebox.llamafile.example.yaml

Option D: vLLM

Use this when you want a more server-oriented OpenAI-compatible stack and you have the hardware budget for it.

  1. Start the server:
vllm serve NousResearch/Meta-Llama-3-8B-Instruct --dtype auto --api-key token-abc123
  2. Example peer service config:
services:
  - service_id: "peer1/chat/llama3-8b"
    kind: "chat"
    endpoint: "http://127.0.0.1:8000"
    runtime:
      engine: "vllm"
      launcher: "external"
    assets:
      - asset_id: "NousResearch/Meta-Llama-3-8B-Instruct"
        loaded: true
    state:
      health: "healthy"
      load_state: "loaded"
      accept_requests: true

  - service_id: "peer1/embeddings/bge-base"
    kind: "embeddings"
    endpoint: "http://127.0.0.1:8001"
    runtime:
      engine: "vllm"
      launcher: "external"
    assets:
      - asset_id: "BAAI/bge-base-en-v1.5"
        loaded: true
    state:
      health: "healthy"
      load_state: "loaded"
      accept_requests: true
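
The example above assumes a second vLLM instance serving the embedding model on port 8001. A rough sketch of how that might be launched; the task and port flags are assumptions to verify against your vLLM version, not repo-provided commands:

vllm serve BAAI/bge-base-en-v1.5 --task embed --port 8001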

Minimal Node Config Pattern

For a real peer host, the fields you most likely need to edit in configs/node.example.yaml are:

  • node.host_id
  • node.display_name
  • node.address
  • control_plane.base_url
  • control_plane.node_api_key
  • inventory.capabilities
  • services
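
A common workflow is to copy the example and edit only those fields. A minimal sketch; the copied file name is illustrative:

cd /home/netuser/bin/geniehive
cp configs/node.example.yaml configs/node.peer1.yaml
# edit node.host_id, node.display_name, node.address,
# control_plane.base_url, control_plane.node_api_key,
# inventory.capabilities, and the services list
"${EDITOR:-vi}" configs/node.peer1.yaml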

Client Instructions

You now have two simple ways to exercise GenieHive as a client.

Option 1: Inspect and call it manually

List models:

curl -sS http://127.0.0.1:8800/v1/models \
  -H 'X-Api-Key: change-me-client-key'

Chat using a role:

curl -sS http://127.0.0.1:8800/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -H 'X-Api-Key: change-me-client-key' \
  -d '{
    "model": "mentor",
    "messages": [{"role":"user","content":"Give me a 2-sentence summary of why SQLite is useful here."}]
  }'

Embeddings using a direct embedding asset:

curl -sS http://127.0.0.1:8800/v1/embeddings \
  -H 'Content-Type: application/json' \
  -H 'X-Api-Key: change-me-client-key' \
  -d '{
    "model": "nomic-embed-text",
    "input": "GenieHive is a local-first control plane."
  }'

Option 2: Use the demo client agent

Run:

cd /home/netuser/bin/geniehive
python scripts/demo_client_agent.py \
  --base-url http://127.0.0.1:8800 \
  --api-key change-me-client-key \
  --task "Summarize the current GenieHive demo in three bullets."

That script will:

  • read GET /v1/models
  • choose a chat-capable model automatically if you do not specify one
  • prefer entries GenieHive marks as suitable for lower-complexity offload
  • submit a chat request and print the answer

If you want to force a specific route:

python scripts/demo_client_agent.py \
  --base-url http://127.0.0.1:8800 \
  --api-key change-me-client-key \
  --model mentor \
  --task "State what host and route type you would expect for this demo."

Codex-As-Client

For Codex or another agentic client, the intended pattern is:

  1. Read GET /v1/models.
  2. Filter for geniehive.operation == "chat".
  3. Prefer:
    • geniehive.offload_hint.suitability == "good_for_low_complexity"
    • geniehive.loaded_target_count > 0 for role entries
    • lower best_p50_latency_ms
  4. Send lower-complexity requests to GenieHive.
  5. Keep higher-complexity, high-context, or high-risk tasks local unless the catalog indicates a better remote fit.
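
A sketch of that selection logic with curl and jq, assuming the catalog follows the usual OpenAI /v1/models shape (a top-level data array) and carries the geniehive metadata fields named above:

curl -sS http://127.0.0.1:8800/v1/models \
  -H 'X-Api-Key: change-me-client-key' \
| jq -r '
    [ .data[]
      | select(.geniehive.operation == "chat")
      | select(.geniehive.offload_hint.suitability == "good_for_low_complexity")
      | select((.geniehive.loaded_target_count // 1) > 0) ]
    | sort_by(.geniehive.best_p50_latency_ms // 999999)
    | (.[0].id // "no suitable remote model")'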

Good First Live Demo

If you want the safest first success path:

  • control plane on one host
  • node agent on the same host
  • Ollama upstream with one chat model
  • role alias mentor
  • demo client agent calling mentor

That avoids GGUF-specific launch tuning while still exercising the full GenieHive master/peer/client path.

Single-Machine End-to-End Example

Ollama-backed single box

  1. Start Ollama:
ollama serve
  2. Pull models:
ollama pull qwen3
ollama pull nomic-embed-text
  3. Start GenieHive control:
cd /home/netuser/bin/geniehive
bash scripts/run_control_singlebox.sh
  4. Start GenieHive node:
cd /home/netuser/bin/geniehive
bash scripts/run_node_singlebox.sh configs/node.singlebox.ollama.example.yaml
  5. Inspect:
bash scripts/demo_inspect.sh
  6. Run the client agent:
python scripts/demo_client_agent.py \
  --base-url http://127.0.0.1:8800 \
  --api-key change-me-client-key \
  --task "Explain in three bullets what GenieHive is doing in this single-machine demo."

llama.cpp-backed single box

  1. Start the local server:
llama-server -m /path/to/model.gguf --host 127.0.0.1 --port 18091
  2. Start GenieHive control:
cd /home/netuser/bin/geniehive
bash scripts/run_control_singlebox.sh
  3. Start GenieHive node:
cd /home/netuser/bin/geniehive
bash scripts/run_node_singlebox.sh configs/node.singlebox.llamacpp.example.yaml
  4. Run the client agent:
python scripts/demo_client_agent.py \
  --base-url http://127.0.0.1:8800 \
  --api-key change-me-client-key \
  --task "Summarize why a single-machine GenieHive setup can still be useful."

Host-Specific Note: Dual Tesla P40 + 128 GB RAM

For a machine with:

  • 2 x Nvidia Tesla P40
  • AMD Ryzen 5600G
  • 128 GB RAM

the most practical first GenieHive layout is:

  • one chat model on GPU0
  • one chat or utility model on GPU1
  • one slower fallback chat model on CPU

This is now sketched in:

  • configs/node.singlebox.p40-triple.example.yaml
  • configs/control.singlebox.p40.example.yaml
  • configs/roles.singlebox.p40.example.yaml
  • scripts/start_p40_triple_llamacpp.sh
  • scripts/launch_p40_triple.sh
  • scripts/p40_triple_gpu0.sh
  • scripts/p40_triple_gpu1.sh
  • scripts/p40_triple_cpu.sh

The current concrete defaults use models already present under /home/netuser/bin/models/llm:

  • GPU0: Qwen2.5-14B-Instruct-1M-Q5_K_M.gguf
  • GPU1: Qwen3.5-9B-Q5_K_M.gguf
  • CPU: rocket-3b.Q5_K_M.gguf
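
The repo scripts above contain the real launch commands. As a rough sketch of the kind of per-device pinning they perform (the ports, layer counts, and flags here are illustrative assumptions, not the script contents):

# GPU0: primary chat model, fully offloaded
CUDA_VISIBLE_DEVICES=0 llama-server -m /home/netuser/bin/models/llm/Qwen2.5-14B-Instruct-1M-Q5_K_M.gguf \
  --host 127.0.0.1 --port 18091 -ngl 99

# GPU1: alternate chat or utility model
CUDA_VISIBLE_DEVICES=1 llama-server -m /home/netuser/bin/models/llm/Qwen3.5-9B-Q5_K_M.gguf \
  --host 127.0.0.1 --port 18092 -ngl 99

# CPU: slow fallback, no GPU offload
llama-server -m /home/netuser/bin/models/llm/rocket-3b.Q5_K_M.gguf \
  --host 127.0.0.1 --port 18093 -ngl 0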

Why this layout works

  • each P40 has enough VRAM for a quantized 7B to 14B model comfortably
  • 128 GB RAM is enough to hold a separate CPU-served fallback model without much trouble
  • the CPU route will be much slower, but it is still useful for low-priority offload or fallback handling

Suggested role usage

  • mentor or primary chat role -> GPU0
  • general_assistant or alternate chat role -> GPU1
  • fallback_writer or background_summarizer -> CPU route

The repo now includes a host-specific role catalog with exactly that intent.

Launch pattern

  1. Edit your model paths if the defaults do not match your host, then run the helper:
cd /home/netuser/bin/geniehive
bash scripts/start_p40_triple_llamacpp.sh

If the defaults look good, you do not need to edit them before trying the first run.

If tmux is available, you can also launch the three processes detached:

cd /home/netuser/bin/geniehive
bash scripts/launch_p40_triple.sh

Then inspect pane state without binding your current terminal to the session:

bash scripts/tmux_session_status.sh

That status helper checks whether the session exists and whether each pane's launcher process is still running or has already exited. If tmux is not installed, the combined launcher prints the three helper commands instead.
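
If you prefer raw tmux over the helper, the standard commands work too; the session name is whatever launch_p40_triple.sh creates, shown here as a placeholder:

tmux list-sessions
tmux attach-session -t <p40-session-name>   # detach again with Ctrl-b d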

  2. Start the three llama-server processes in separate shells.

  3. Start GenieHive control:

bash scripts/run_control_singlebox.sh configs/control.singlebox.p40.example.yaml
  4. Start GenieHive node with the host-specific config:
bash scripts/run_node_singlebox.sh configs/node.singlebox.p40-triple.example.yaml
  5. Inspect the catalog:
bash scripts/demo_inspect.sh

If something is not coming up cleanly, run:

bash scripts/check_singlebox_health.sh

That checks:

  • GPU0 upstream health
  • GPU1 upstream health
  • CPU fallback upstream health
  • GenieHive control health
  • GenieHive node health
  • authenticated cluster and model-catalog endpoints
  6. Exercise the chat path:
python scripts/demo_client_agent.py \
  --base-url http://127.0.0.1:8800 \
  --api-key change-me-client-key \
  --model mentor \
  --task "State which route should be preferred for low-latency chat and which should be the slow fallback."

Practical expectations

  • GPU0 and GPU1 should be the preferred targets for normal chat work
  • the CPU route should mostly be treated as fallback or low-priority background work
  • GenieHive metadata should make that visible to clients through latency and offload hints

Containerized Qwen3.5 probe

If the host-installed llama-server is too old for Qwen3.5, but the NVIDIA Container Toolkit is installed, you can test a newer CUDA-enabled llama.cpp without changing the host CUDA stack:

cd /home/netuser/bin/geniehive
bash scripts/test_qwen35_server_cuda_container.sh

Useful overrides:

GPU_INDEX=1 PORT=19092 bash scripts/test_qwen35_server_cuda_container.sh
MODEL_PATH=/home/netuser/bin/models/llm/Qwen3.5-9B-Q5_K_M.gguf bash scripts/test_qwen35_server_cuda_container.sh

That probe uses the official ghcr.io/ggml-org/llama.cpp:server-cuda image. If it loads the model and starts serving, then the remaining blocker is your host llama.cpp install, not GPU compatibility.
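
As a rough idea of what the probe does under the hood, assuming Docker plus the NVIDIA Container Toolkit; the exact flags live in the script, so treat this as an illustrative sketch rather than its contents:

docker run --rm --gpus "device=1" \
  -v /home/netuser/bin/models/llm:/models \
  -p 19092:19092 \
  ghcr.io/ggml-org/llama.cpp:server-cuda \
  -m /models/Qwen3.5-9B-Q5_K_M.gguf --host 0.0.0.0 --port 19092 -ngl 99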

External Client Access

For your current host addresses:

  • LAN: 192.168.40.207
  • ZeroTier: 172.24.50.65

The cleanest rule is:

  • keep upstream model servers on 127.0.0.1
  • keep the GenieHive node on 127.0.0.1 unless you specifically need remote node access
  • expose only the GenieHive control plane to LAN or ZeroTier clients

That gives remote clients a single stable endpoint without exposing the underlying model servers directly.
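
Before pointing a remote client at it, a quick reachability check against the health endpoint shown earlier is worthwhile. Run this from the client machine, substituting the ZeroTier address if that is the bind you chose:

curl -sS http://192.168.40.207:8800/health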

LAN bind

cd /home/netuser/bin/geniehive
bash scripts/run_control_p40_lan.sh

Remote client example:

python scripts/demo_client_agent.py \
  --base-url http://192.168.40.207:8800 \
  --api-key change-me-client-key \
  --model mentor \
  --task "Briefly describe the preferred and fallback routes on this host."

ZeroTier bind

cd /home/netuser/bin/geniehive
bash scripts/run_control_p40_zerotier.sh

Remote client example:

python scripts/demo_client_agent.py \
  --base-url http://172.24.50.65:8800 \
  --api-key change-me-client-key \
  --model mentor \
  --task "Briefly describe the preferred and fallback routes on this host."

Security note

Prefer ZeroTier over general LAN exposure when possible. In both cases:

  • do not expose the upstream llama-server ports
  • keep the client API key enabled
  • if you later open this beyond trusted networks, add a reverse proxy or VPN-only boundary rather than binding GenieHive broadly
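
If the host runs a firewall such as ufw, one way to express that boundary is to allow only the control-plane port from the trusted network. A hedged sketch; the subnet is an assumption based on the ZeroTier address above, and the upstream ports already bind to 127.0.0.1, so the deny rule is defense in depth:

sudo ufw allow from 172.24.0.0/16 to any port 8800 proto tcp
sudo ufw deny 18091/tcp   # example upstream llama-server port, kept closed externally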

Role meanings for this host

  • mentor should bias toward the GPU0 Qwen2.5 14B route
  • general_assistant should bias toward the GPU1 Qwen3.5 9B route
  • background_summarizer should bias toward the CPU Rocket 3B fallback route