GenieHive LLM Demo

This runbook covers the first practical GenieHive LLM demo with three roles:

  • master: the GenieHive control plane
  • peer: a GenieHive node agent attached to one or more local LLM servers
  • client: a demo client agent or Codex using GenieHive as the API front door

Current Readiness

GenieHive is ready for a first live chat demo now.

What works in GenieHive already:

  • node registration
  • heartbeat
  • role-aware route resolution
  • GET /v1/models
  • POST /v1/chat/completions
  • POST /v1/embeddings

What GenieHive does not do yet:

  • launch upstream LLM servers for you automatically
  • provide POST /v1/audio/transcriptions
  • maintain advanced benchmark history or queue-aware scheduling

For the first demo, treat GenieHive as a metadata-rich router over already-running local servers.

Topologies

Smallest Demo

Run everything on one host:

  • control plane on 127.0.0.1:8800
  • node agent on 127.0.0.1:8891
  • one or more upstream model servers on local ports

This is also the recommended setup for users who do not have a cluster. GenieHive still provides value as:

  • a local router
  • a metadata-rich local model catalog
  • a role-to-model indirection layer
  • a common front door for client tools

Two-Host Demo

  • master host runs GenieHive control plane
  • peer host runs GenieHive node agent and one or more local LLM servers
  • client runs anywhere that can reach the master

Master Instructions

On the control-plane host:

  1. Create a repo-local Python environment if you want isolation.
  2. Start GenieHive control:
cd /home/netuser/bin/geniehive
bash scripts/run_control.sh
  3. Confirm health:
curl -sS http://127.0.0.1:8800/health

Expected result:

  • JSON containing {"status":"ok"}
  4. Make a note of the example client and node API keys from configs/control.example.yaml.

Single-Box Shortcut

If you are running control and node on the same machine, use:

cd /home/netuser/bin/geniehive
bash scripts/run_control_singlebox.sh

For your P40 host, repo-provided external bind helpers now exist:

LAN:

bash scripts/run_control_p40_lan.sh

ZeroTier:

bash scripts/run_control_p40_zerotier.sh

Both use the P40-specific control config and only change the bind interface.

Peer Instructions

On each peer host you need:

  • one or more local LLM servers already running
  • one GenieHive node config that points at those servers
  • the control-plane base URL and node API key

For a single-machine setup, the peer is simply another process on the same host.

The node agent should advertise upstream server roots, not endpoint suffixes. For example:

  • good: http://127.0.0.1:11434
  • good: http://127.0.0.1:18091
  • not good: http://127.0.0.1:11434/v1/chat/completions
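
A quick way to confirm you are pointing at a server root rather than a full endpoint is to probe the root directly. A minimal sketch, assuming an Ollama-style upstream on port 11434; adjust host and port for your server:

# The bare root should respond; Ollama answers with a short "Ollama is running" line,
# and most OpenAI-compatible servers expose /v1/models under the same root.
curl -sS http://127.0.0.1:11434
curl -sS http://127.0.0.1:11434/v1/models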

Option A: Ollama

Use this when you want the lowest-friction chat and embeddings demo.

  1. Start Ollama if it is not already running:
ollama serve
  2. Pull the model or models you want:
ollama pull qwen3
ollama pull nomic-embed-text
  3. Example peer service config:
services:
  - service_id: "peer1/chat/qwen3"
    kind: "chat"
    endpoint: "http://127.0.0.1:11434"
    runtime:
      engine: "ollama"
      launcher: "external"
    assets:
      - asset_id: "qwen3"
        loaded: true
    state:
      health: "healthy"
      load_state: "loaded"
      accept_requests: true

  - service_id: "peer1/embeddings/nomic-embed-text"
    kind: "embeddings"
    endpoint: "http://127.0.0.1:11434"
    runtime:
      engine: "ollama"
      launcher: "external"
    assets:
      - asset_id: "nomic-embed-text"
        loaded: true
    state:
      health: "healthy"
      load_state: "loaded"
      accept_requests: true
  4. Start the node:
cd /home/netuser/bin/geniehive
bash scripts/run_node_singlebox.sh configs/node.singlebox.ollama.example.yaml
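
If chat through GenieHive misbehaves later, it helps to know the upstream works on its own. A minimal direct check against Ollama's OpenAI-compatible chat endpoint, assuming the qwen3 tag pulled above:

curl -sS http://127.0.0.1:11434/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{"model":"qwen3","messages":[{"role":"user","content":"Say hello in one short sentence."}]}'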

Option B: llama.cpp

Use this when you want direct GGUF serving with llama-server.

  1. Start a chat server:
llama-server -m /path/to/model.gguf --host 127.0.0.1 --port 18091
  2. Example peer service config:
services:
  - service_id: "peer1/chat/qwen3-8b"
    kind: "chat"
    endpoint: "http://127.0.0.1:18091"
    runtime:
      engine: "llama.cpp"
      launcher: "external"
    assets:
      - asset_id: "qwen3-8b-q4_k_m"
        loaded: true
    state:
      health: "healthy"
      load_state: "loaded"
      accept_requests: true

Then start the node:

cd /home/netuser/bin/geniehive
bash scripts/run_node_singlebox.sh configs/node.singlebox.llamacpp.example.yaml

Note:

  • The official llama.cpp docs describe OpenAI-compatible chat serving, so the chat path above follows documented behavior.
  • For embeddings, some llama.cpp builds document non-OpenAI endpoints such as /embedding, so GenieHive's current POST /v1/embeddings path is safest with Ollama or vLLM unless you have verified your specific build.
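
If you want to verify embedding support on your particular llama-server build before wiring it into GenieHive, probe both paths directly. A hedged sketch, assuming a server on port 18091; whether either path works, and whether an embeddings launch flag is required, depends on your build:

# OpenAI-style path (only on builds and launch modes that expose it)
curl -sS http://127.0.0.1:18091/v1/embeddings \
  -H 'Content-Type: application/json' \
  -d '{"input":"probe"}'

# native llama.cpp path documented by some builds
curl -sS http://127.0.0.1:18091/embedding \
  -H 'Content-Type: application/json' \
  -d '{"content":"probe"}'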

Option C: llamafile

Use this when you want a single-file local server built around llama.cpp.

  1. Start a chat server:
./your-model.llamafile --server --host 127.0.0.1 --port 18091 --nobrowser
  2. Example peer service config:
services:
  - service_id: "peer1/chat/llamafile-qwen3"
    kind: "chat"
    endpoint: "http://127.0.0.1:18091"
    runtime:
      engine: "llamafile"
      launcher: "external"
    assets:
      - asset_id: "qwen3-8b-q4_k_m"
        loaded: true
    state:
      health: "healthy"
      load_state: "loaded"
      accept_requests: true

Then start the node:

cd /home/netuser/bin/geniehive
bash scripts/run_node_singlebox.sh configs/node.singlebox.llamafile.example.yaml

Option D: vLLM

Use this when you want a more server-oriented OpenAI-compatible stack and you have the hardware budget for it.

  1. Start the server:
vllm serve NousResearch/Meta-Llama-3-8B-Instruct --dtype auto --api-key token-abc123
  2. Example peer service config:
services:
  - service_id: "peer1/chat/llama3-8b"
    kind: "chat"
    endpoint: "http://127.0.0.1:8000"
    runtime:
      engine: "vllm"
      launcher: "external"
    assets:
      - asset_id: "NousResearch/Meta-Llama-3-8B-Instruct"
        loaded: true
    state:
      health: "healthy"
      load_state: "loaded"
      accept_requests: true

  - service_id: "peer1/embeddings/bge-base"
    kind: "embeddings"
    endpoint: "http://127.0.0.1:8001"
    runtime:
      engine: "vllm"
      launcher: "external"
    assets:
      - asset_id: "BAAI/bge-base-en-v1.5"
        loaded: true
    state:
      health: "healthy"
      load_state: "loaded"
      accept_requests: true
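
The example above assumes a second vLLM instance serving the embedding model on port 8001. A rough sketch of how that might be launched; the task and port flags are assumptions to verify against your vLLM version, not repo-provided commands:

vllm serve BAAI/bge-base-en-v1.5 --task embed --port 8001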

Minimal Node Config Pattern

For a real peer host, the fields you most likely need to edit in configs/node.example.yaml are:

  • node.host_id
  • node.display_name
  • node.address
  • control_plane.base_url
  • control_plane.node_api_key
  • inventory.capabilities
  • services
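
A common workflow is to copy the example and edit only those fields. A minimal sketch; the copied file name is illustrative:

cd /home/netuser/bin/geniehive
cp configs/node.example.yaml configs/node.peer1.yaml
# edit node.host_id, node.display_name, node.address,
# control_plane.base_url, control_plane.node_api_key,
# inventory.capabilities, and the services list
"${EDITOR:-vi}" configs/node.peer1.yaml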

Client Instructions

You now have two simple ways to exercise GenieHive as a client.

Option 1: Inspect and call it manually

List models:

curl -sS http://127.0.0.1:8800/v1/models \
  -H 'X-Api-Key: change-me-client-key'

Chat using a role:

curl -sS http://127.0.0.1:8800/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -H 'X-Api-Key: change-me-client-key' \
  -d '{
    "model": "mentor",
    "messages": [{"role":"user","content":"Give me a 2-sentence summary of why SQLite is useful here."}]
  }'

Embeddings using a direct embedding asset:

curl -sS http://127.0.0.1:8800/v1/embeddings \
  -H 'Content-Type: application/json' \
  -H 'X-Api-Key: change-me-client-key' \
  -d '{
    "model": "nomic-embed-text",
    "input": "GenieHive is a local-first control plane."
  }'

Option 2: Use the demo client agent

Run:

cd /home/netuser/bin/geniehive
python scripts/demo_client_agent.py \
  --base-url http://127.0.0.1:8800 \
  --api-key change-me-client-key \
  --task "Summarize the current GenieHive demo in three bullets."

That script will:

  • read GET /v1/models
  • choose a chat-capable model automatically if you do not specify one
  • prefer entries GenieHive marks as suitable for lower-complexity offload
  • submit a chat request and print the answer

If you want to force a specific route:

python scripts/demo_client_agent.py \
  --base-url http://127.0.0.1:8800 \
  --api-key change-me-client-key \
  --model mentor \
  --task "State what host and route type you would expect for this demo."

Codex-As-Client

For Codex or another agentic client, the intended pattern is:

  1. Read GET /v1/models.
  2. Filter for geniehive.operation == "chat".
  3. Prefer:
    • geniehive.offload_hint.suitability == "good_for_low_complexity"
    • geniehive.loaded_target_count > 0 for role entries
    • lower best_p50_latency_ms
  4. Send lower-complexity requests to GenieHive.
  5. Keep higher-complexity, high-context, or high-risk tasks local unless the catalog indicates a better remote fit.
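
A sketch of that selection logic with curl and jq, assuming the catalog follows the usual OpenAI /v1/models shape (a top-level data array) and carries the geniehive metadata fields named above:

curl -sS http://127.0.0.1:8800/v1/models \
  -H 'X-Api-Key: change-me-client-key' \
| jq -r '
    [ .data[]
      | select(.geniehive.operation == "chat")
      | select(.geniehive.offload_hint.suitability == "good_for_low_complexity")
      | select((.geniehive.loaded_target_count // 1) > 0) ]
    | sort_by(.geniehive.best_p50_latency_ms // 999999)
    | (.[0].id // "no suitable remote model")'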

Good First Live Demo

If you want the safest first success path:

  • control plane on one host
  • node agent on the same host
  • Ollama upstream with one chat model
  • role alias mentor
  • demo client agent calling mentor

That avoids GGUF-specific launch tuning while still exercising the full GenieHive master/peer/client path.

Single-Machine End-to-End Example

Ollama-backed single box

  1. Start Ollama:
ollama serve
  2. Pull models:
ollama pull qwen3
ollama pull nomic-embed-text
  3. Start GenieHive control:
cd /home/netuser/bin/geniehive
bash scripts/run_control_singlebox.sh
  4. Start GenieHive node:
cd /home/netuser/bin/geniehive
bash scripts/run_node_singlebox.sh configs/node.singlebox.ollama.example.yaml
  5. Inspect:
bash scripts/demo_inspect.sh
  6. Run the client agent:
python scripts/demo_client_agent.py \
  --base-url http://127.0.0.1:8800 \
  --api-key change-me-client-key \
  --task "Explain in three bullets what GenieHive is doing in this single-machine demo."

llama.cpp-backed single box

  1. Start the local server:
llama-server -m /path/to/model.gguf --host 127.0.0.1 --port 18091
  2. Start GenieHive control:
cd /home/netuser/bin/geniehive
bash scripts/run_control_singlebox.sh
  3. Start GenieHive node:
cd /home/netuser/bin/geniehive
bash scripts/run_node_singlebox.sh configs/node.singlebox.llamacpp.example.yaml
  4. Run the client agent:
python scripts/demo_client_agent.py \
  --base-url http://127.0.0.1:8800 \
  --api-key change-me-client-key \
  --task "Summarize why a single-machine GenieHive setup can still be useful."

Host-Specific Note: Dual Tesla P40 + 128 GB RAM

For a machine with:

  • 2 x Nvidia Tesla P40
  • AMD Ryzen 5600G
  • 128 GB RAM

the most practical first GenieHive layout is:

  • one chat model on GPU0
  • one chat or utility model on GPU1
  • one slower fallback chat model on CPU

This is now sketched in:

  • configs/node.singlebox.p40-triple.example.yaml
  • configs/control.singlebox.p40.example.yaml
  • configs/roles.singlebox.p40.example.yaml
  • scripts/start_p40_triple_llamacpp.sh
  • scripts/launch_p40_triple.sh
  • scripts/p40_triple_gpu0.sh
  • scripts/p40_triple_gpu1.sh
  • scripts/p40_triple_cpu.sh

The current concrete defaults use models already present under /home/netuser/bin/models/llm:

  • GPU0: Qwen2.5-14B-Instruct-1M-Q5_K_M.gguf
  • GPU1: Qwen3.5-9B-Q5_K_M.gguf
  • CPU: rocket-3b.Q5_K_M.gguf
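
The repo scripts above contain the real launch commands. As a rough sketch of the kind of per-device pinning they perform (the ports, layer counts, and flags here are illustrative assumptions, not the script contents):

# GPU0: primary chat model, fully offloaded
CUDA_VISIBLE_DEVICES=0 llama-server -m /home/netuser/bin/models/llm/Qwen2.5-14B-Instruct-1M-Q5_K_M.gguf \
  --host 127.0.0.1 --port 18091 -ngl 99

# GPU1: alternate chat or utility model
CUDA_VISIBLE_DEVICES=1 llama-server -m /home/netuser/bin/models/llm/Qwen3.5-9B-Q5_K_M.gguf \
  --host 127.0.0.1 --port 18092 -ngl 99

# CPU: slow fallback, no GPU offload
llama-server -m /home/netuser/bin/models/llm/rocket-3b.Q5_K_M.gguf \
  --host 127.0.0.1 --port 18093 -ngl 0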

Why this layout works

  • each P40 has enough VRAM for a quantized 7B to 14B model comfortably
  • 128 GB RAM is enough to hold a separate CPU-served fallback model without much trouble
  • the CPU route will be much slower, but it is still useful for low-priority offload or fallback handling

Suggested role usage

  • mentor or primary chat role -> GPU0
  • general_assistant or alternate chat role -> GPU1
  • fallback_writer or background_summarizer -> CPU route

The repo now includes a host-specific role catalog with exactly that intent.

Launch pattern

  1. Edit your model paths if the defaults do not match your host, then run the helper:
cd /home/netuser/bin/geniehive
bash scripts/start_p40_triple_llamacpp.sh

If the defaults look good, you do not need to edit them before trying the first run.

If tmux is available, you can also launch the three processes detached:

cd /home/netuser/bin/geniehive
bash scripts/launch_p40_triple.sh

Then inspect pane state without binding your current terminal to the session:

bash scripts/tmux_session_status.sh

That status helper checks whether the session exists and whether each pane's launcher process is still running or has already exited. If tmux is not installed, the combined launcher prints the three helper commands instead.
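
If you prefer raw tmux over the helper, the standard commands work too; the session name is whatever launch_p40_triple.sh creates, shown here as a placeholder:

tmux list-sessions
tmux attach-session -t <p40-session-name>   # detach again with Ctrl-b d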

  2. Start the three llama-server processes in separate shells.

  3. Start GenieHive control:

bash scripts/run_control_singlebox.sh configs/control.singlebox.p40.example.yaml
  4. Start GenieHive node with the host-specific config:
bash scripts/run_node_singlebox.sh configs/node.singlebox.p40-triple.example.yaml
  5. Inspect the catalog:
bash scripts/demo_inspect.sh

If something is not coming up cleanly, run:

bash scripts/check_singlebox_health.sh

That checks:

  • GPU0 upstream health
  • GPU1 upstream health
  • CPU fallback upstream health
  • GenieHive control health
  • GenieHive node health
  • authenticated cluster and model-catalog endpoints
  6. Exercise the chat path:
python scripts/demo_client_agent.py \
  --base-url http://127.0.0.1:8800 \
  --api-key change-me-client-key \
  --model mentor \
  --task "State which route should be preferred for low-latency chat and which should be the slow fallback."

Practical expectations

  • GPU0 and GPU1 should be the preferred targets for normal chat work
  • the CPU route should mostly be treated as fallback or low-priority background work
  • GenieHive metadata should make that visible to clients through latency and offload hints

Containerized Qwen3.5 probe

If the host-installed llama-server is too old for Qwen3.5, but the NVIDIA Container Toolkit is installed, you can test a newer CUDA-enabled llama.cpp without changing the host CUDA stack:

cd /home/netuser/bin/geniehive
bash scripts/test_qwen35_server_cuda_container.sh

Useful overrides:

GPU_INDEX=1 PORT=19092 bash scripts/test_qwen35_server_cuda_container.sh
MODEL_PATH=/home/netuser/bin/models/llm/Qwen3.5-9B-Q5_K_M.gguf bash scripts/test_qwen35_server_cuda_container.sh

That probe uses the official ghcr.io/ggml-org/llama.cpp:server-cuda image. If it loads the model and starts serving, then the remaining blocker is your host llama.cpp install, not GPU compatibility.
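
As a rough idea of what the probe does under the hood, assuming Docker plus the NVIDIA Container Toolkit; the exact flags live in the script, so treat this as an illustrative sketch rather than its contents:

docker run --rm --gpus "device=1" \
  -v /home/netuser/bin/models/llm:/models \
  -p 19092:19092 \
  ghcr.io/ggml-org/llama.cpp:server-cuda \
  -m /models/Qwen3.5-9B-Q5_K_M.gguf --host 0.0.0.0 --port 19092 -ngl 99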

External Client Access

For your current host addresses:

  • LAN: 192.168.40.207
  • ZeroTier: 172.24.50.65

The cleanest rule is:

  • keep upstream model servers on 127.0.0.1
  • keep the GenieHive node on 127.0.0.1 unless you specifically need remote node access
  • expose only the GenieHive control plane to LAN or ZeroTier clients

That gives remote clients a single stable endpoint without exposing the underlying model servers directly.
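
Before pointing a remote client at it, a quick reachability check against the health endpoint shown earlier is worthwhile. Run this from the client machine, substituting the ZeroTier address if that is the bind you chose:

curl -sS http://192.168.40.207:8800/health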

LAN bind

cd /home/netuser/bin/geniehive
bash scripts/run_control_p40_lan.sh

Remote client example:

python scripts/demo_client_agent.py \
  --base-url http://192.168.40.207:8800 \
  --api-key change-me-client-key \
  --model mentor \
  --task "Briefly describe the preferred and fallback routes on this host."

ZeroTier bind

cd /home/netuser/bin/geniehive
bash scripts/run_control_p40_zerotier.sh

Remote client example:

python scripts/demo_client_agent.py \
  --base-url http://172.24.50.65:8800 \
  --api-key change-me-client-key \
  --model mentor \
  --task "Briefly describe the preferred and fallback routes on this host."

Security note

Prefer ZeroTier over general LAN exposure when possible. In both cases:

  • do not expose the upstream llama-server ports
  • keep the client API key enabled
  • if you later open this beyond trusted networks, add a reverse proxy or VPN-only boundary rather than binding GenieHive broadly
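
If the host runs a firewall such as ufw, one way to express that boundary is to allow only the control-plane port from the trusted network. A hedged sketch; the subnet is an assumption based on the ZeroTier address above, and the upstream ports already bind to 127.0.0.1, so the deny rule is defense in depth:

sudo ufw allow from 172.24.0.0/16 to any port 8800 proto tcp
sudo ufw deny 18091/tcp   # example upstream llama-server port, kept closed externally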

Role meanings for this host

  • mentor should bias toward the GPU0 Qwen2.5 14B route
  • general_assistant should bias toward the GPU1 Qwen3.5 9B route
  • background_summarizer should bias toward the CPU Rocket 3B fallback route