Example: Two GPUs on One Host, One Remote Host

This example shows a concrete RoleMesh layout with:

  • a gateway on 192.168.1.100
  • two node-agent processes on 192.168.1.101
  • one node-agent process on 192.168.1.102
  • three project-defined roles: planner, writer, and critic

The intended topology is:

  • planner on 192.168.1.101:8091
  • writer on 192.168.1.101:8092
  • critic on 192.168.1.102:8091

This is a good pattern when:

  • one host has multiple GPUs and you want separate node identities or role-specific configs
  • another host contributes an additional model for a different role
  • the gateway should route by role without the client needing to know which machine serves which model

Before you start

Assumptions:

  • gateway host: 192.168.1.100
  • dual-GPU host: 192.168.1.101
  • second model host: 192.168.1.102
  • all machines can reach each other over the LAN (clients reach the gateway, and the gateway reaches each node agent)
  • each node host has a working llama-server binary
  • each node host has readable GGUF model files on disk

Important current limitation:

  • the node agent does not yet expose a strict per-process GPU allowlist
  • separate node-agent processes on 192.168.1.101 are still useful for separate ports, node IDs, and model catalogs, but the current scheduler in each process discovers all visible local CUDA GPUs and chooses among them heuristically

So this example is a valid deployment shape, but hard process-to-GPU partitioning still needs a follow-up code change. An environment-level workaround is sketched below.
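
If you want soft partitioning in the meantime, one common environment-level approach is to restrict which GPUs each process can see with CUDA_VISIBLE_DEVICES when you start the agents in step 5. This is a sketch, not a supported RoleMesh feature: it only helps if the node agent and the llama-server processes it spawns inherit the environment and enumerate devices through CUDA (tools that query NVML directly ignore this variable).

# Hypothetical soft partitioning on 192.168.1.101: give each node agent
# (and any llama-server it spawns) visibility of a single GPU.
CUDA_VISIBLE_DEVICES=0 PYTHONPATH=src python -m rolemesh_node_agent.cli --config planner-node.yaml
CUDA_VISIBLE_DEVICES=1 PYTHONPATH=src python -m rolemesh_node_agent.cli --config writer-node.yaml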

1. Gateway config on 192.168.1.100

Save as configs/models.yaml on the gateway host:

version: 1
default_model: planner

auth:
  client_api_keys:
    - "change-me-client-key"
  node_api_keys:
    - "change-me-node-key"

models:
  planner:
    type: discovered
    role: planner
    strategy: round_robin

  writer:
    type: discovered
    role: writer
    strategy: round_robin

  critic:
    type: discovered
    role: critic
    strategy: round_robin

Run the gateway:

ROLE_MESH_CONFIG=configs/models.yaml \
uvicorn rolemesh_gateway.main:app --host 192.168.1.100 --port 8080
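
A quick smoke test before any nodes register. Since all three aliases are discovered, the model list will likely be empty at this point, and GET /ready will stay non-200 until the planner default becomes routable (see Operational notes):

curl -sS http://192.168.1.100:8080/v1/models \
  -H 'X-Api-Key: change-me-client-key'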

2. Node-agent config for planner on 192.168.1.101

Save as planner-node.yaml:

node_id: "gpu101-planner"
listen_host: "192.168.1.101"
listen_port: 8091

dispatcher_base_url: "http://192.168.1.100:8080"
dispatcher_node_key: "change-me-node-key"
dispatcher_roles: ["planner"]
heartbeat_interval_sec: 5

llama_server_bin: "/path/to/llama-server"
llama_server_startup_timeout_s: 45
llama_server_probe_interval_s: 0.5

model_roots:
  - "/models"

models:
  - model_id: "planner-main"
    path: "/models/planner-model.Q5_K_M.gguf"
    roles: ["planner"]
    ctx_size: 8192
    gpu_layers: 60
    threads: 8
    batch_size: 1024
    flash_attn: true
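
For intuition, the per-model fields map roughly onto llama-server flags. The actual argument list is assembled by the node agent, and flag spellings vary across llama.cpp versions, so treat this as a conceptual sketch only:

# Roughly the llama-server invocation implied by planner-main
# (conceptual; the node agent builds the real command line):
/path/to/llama-server \
  --model /models/planner-model.Q5_K_M.gguf \
  --ctx-size 8192 \
  --n-gpu-layers 60 \
  --threads 8 \
  --batch-size 1024 \
  --flash-attn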

3. Node-agent config for writer on 192.168.1.101

Save as writer-node.yaml:

node_id: "gpu101-writer"
listen_host: "192.168.1.101"
listen_port: 8092

dispatcher_base_url: "http://192.168.1.100:8080"
dispatcher_node_key: "change-me-node-key"
dispatcher_roles: ["writer"]
heartbeat_interval_sec: 5

llama_server_bin: "/path/to/llama-server"
llama_server_startup_timeout_s: 45
llama_server_probe_interval_s: 0.5

model_roots:
  - "/models"

models:
  - model_id: "writer-main"
    path: "/models/writer-model.Q5_K_M.gguf"
    roles: ["writer"]
    ctx_size: 8192
    gpu_layers: 60
    threads: 8
    batch_size: 1024
    flash_attn: true

4. Node-agent config for critic on 192.168.1.102

Save as critic-node.yaml:

node_id: "gpu102-critic"
listen_host: "192.168.1.102"
listen_port: 8091

dispatcher_base_url: "http://192.168.1.100:8080"
dispatcher_node_key: "change-me-node-key"
dispatcher_roles: ["critic"]
heartbeat_interval_sec: 5

llama_server_bin: "/path/to/llama-server"
llama_server_startup_timeout_s: 45
llama_server_probe_interval_s: 0.5

model_roots:
  - "/models"

models:
  - model_id: "critic-main"
    path: "/models/critic-model.Q5_K_M.gguf"
    roles: ["critic"]
    ctx_size: 8192
    gpu_layers: 60
    threads: 8
    batch_size: 1024
    flash_attn: true

The path field points at the actual GGUF weight file on that machine; it is the concrete model-weight binding in node-agent mode.
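
Before starting the agents, it is worth confirming that each configured path exists and is readable on its host; a wrong path is an easy failure to rule out first. For example, on 192.168.1.102:

ls -lh /models/critic-model.Q5_K_M.gguf
# A valid GGUF file starts with the ASCII magic "GGUF":
head -c 4 /models/critic-model.Q5_K_M.gguf; echo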

5. Start the three node agents

On 192.168.1.101:

PYTHONPATH=src python -m rolemesh_node_agent.cli --config planner-node.yaml

In a second shell on 192.168.1.101:

PYTHONPATH=src python -m rolemesh_node_agent.cli --config writer-node.yaml

On 192.168.1.102:

PYTHONPATH=src python -m rolemesh_node_agent.cli --config critic-node.yaml
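
These foreground commands are fine for a first test. For anything longer-lived you will want the agents to survive the shell session; a process manager such as systemd is the durable option, but a minimal sketch (same pattern for the two agents on 192.168.1.101) is:

# Run the critic agent in the background with a log file.
PYTHONPATH=src nohup python -m rolemesh_node_agent.cli \
  --config critic-node.yaml > critic-node.log 2>&1 &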

6. Registration behavior

When dispatcher_base_url is set, the node agent registers itself with the gateway automatically on startup, and keeps sending heartbeats at the configured heartbeat_interval_sec afterwards.

For inspection, each node exposes:

curl -sS http://192.168.1.101:8091/v1/node/registration

That endpoint returns the exact served_models payload the node agent will send to the gateway.

Manual registration via POST /v1/nodes/register is still supported, but it is mainly useful for custom tooling or debugging, as sketched below.
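
One way to drive that manually is to replay the node's own payload. This assumes the gateway accepts the node key via the same X-Api-Key header used for client calls; if registration returns 401, check how node authentication is actually configured:

# Fetch the node's served_models payload and POST it to the gateway.
curl -sS http://192.168.1.101:8091/v1/node/registration \
  | curl -sS -X POST http://192.168.1.100:8080/v1/nodes/register \
      -H 'Content-Type: application/json' \
      -H 'X-Api-Key: change-me-node-key' \
      --data-binary @-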

7. Verify the topology

List the currently healthy role aliases:

curl -sS http://192.168.1.100:8080/v1/models \
  -H 'X-Api-Key: change-me-client-key'

Expected result:

  • planner, writer, and critic should appear once each
  • gateway metadata should show the registered nodes and how fresh their heartbeats are
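
If the response follows the OpenAI-style list shape (an assumption worth checking against your gateway version), you can reduce it to just the alias names:

curl -sS http://192.168.1.100:8080/v1/models \
  -H 'X-Api-Key: change-me-client-key' | jq -r '.data[].id'
# expected, in some order: planner, writer, critic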

Check one node directly:

curl -sS http://192.168.1.101:8091/v1/node/inventory

That endpoint shows:

  • discovered devices
  • current model inventory
  • device metrics
  • queue depth and in-flight work

You can also inspect the registration payload directly:

curl -sS http://192.168.1.101:8091/v1/node/registration

8. Send requests by role

Planner request through the gateway:

curl -sS -X POST http://192.168.1.100:8080/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -H 'X-Api-Key: change-me-client-key' \
  -d '{
    "model": "planner",
    "messages": [{"role":"user","content":"Outline a release plan in 3 bullets."}]
  }'

Writer request through the gateway:

curl -sS -X POST http://192.168.1.100:8080/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -H 'X-Api-Key: change-me-client-key' \
  -d '{
    "model": "writer",
    "messages": [{"role":"user","content":"Rewrite this as a concise status update."}]
  }'

Critic request through the gateway:

curl -sS -X POST http://192.168.1.100:8080/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -H 'X-Api-Key: change-me-client-key' \
  -d '{
    "model": "critic",
    "messages": [{"role":"user","content":"List the top two flaws in this plan."}]
  }'
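
For quick checks, assuming the gateway returns the standard OpenAI chat-completions response shape (non-streaming), you can extract just the assistant text with jq:

curl -sS -X POST http://192.168.1.100:8080/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -H 'X-Api-Key: change-me-client-key' \
  -d '{"model":"critic","messages":[{"role":"user","content":"List the top two flaws in this plan."}]}' \
  | jq -r '.choices[0].message.content'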

Operational notes

  • If a node stops heartbeating, the gateway marks it stale and removes it from discovered routing after the configured timeout.
  • GET /ready on the gateway only returns 200 when the configured default_model is currently routable.
  • The first request for a cold model may take longer because the node agent has to start or switch llama-server.
  • The node agent now waits for local llama-server readiness before forwarding the first request, so clients should not see transient upstream "Loading model" errors from a normal cold start.
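
The /ready semantics make a simple wait loop possible when scripting a cold start, polling until the default_model (planner) becomes routable:

# Block until the gateway reports ready.
until curl -fsS http://192.168.1.100:8080/ready > /dev/null; do
  echo "waiting for gateway readiness..."
  sleep 2
done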

When to use proxy mode instead

If you do not need node-level inventory, heartbeats, or on-demand llama-server management, proxy mode is simpler:

  • run one inference server per role yourself
  • point the gateway at each server with type: proxy
  • let the gateway route aliases directly by proxy_url (see the sketch below)
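
A minimal sketch of such an entry, reusing the models map from section 1. The field names type and proxy_url come from this page; the URL is hypothetical, and any other proxy-mode options are not shown here:

models:
  writer:
    type: proxy
    # Hypothetical URL: point at whatever server you run for the writer
    # role, in whatever base-URL form the gateway expects.
    proxy_url: "http://192.168.1.101:9001"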

Use node-agent mode when you want RoleMesh to manage local llama-server processes and expose node inventory to the gateway.