Example: Two GPUs on One Host, One Remote Host

This example shows a concrete RoleMesh layout with:

  • a gateway on 192.168.1.100
  • two node-agent processes on 192.168.1.101
  • one node-agent process on 192.168.1.102
  • three project-defined roles: planner, writer, and critic

The intended topology is:

  • planner on 192.168.1.101:8091
  • writer on 192.168.1.101:8092
  • critic on 192.168.1.102:8091

This is a good pattern when:

  • one host has multiple GPUs and you want separate node identities or role-specific configs
  • another host contributes an additional model for a different role
  • the gateway should route by role without the client needing to know which machine serves which model

Before you start

Assumptions:

  • gateway host: 192.168.1.100
  • dual-GPU host: 192.168.1.101
  • second model host: 192.168.1.102
  • all machines can reach each other over the LAN (clients reach the gateway, and the gateway reaches each node agent)
  • each node host has a working llama-server binary
  • each node host has readable GGUF model files on disk

Important current limitation:

  • the node agent does not yet expose a strict per-process GPU allowlist
  • separate node-agent processes on 192.168.1.101 are still useful for separate ports, node IDs, and model catalogs, but the current scheduler in each process discovers all visible local CUDA GPUs and chooses among them heuristically

So this example is a valid deployment shape, but hard process-to-GPU partitioning still needs a follow-up code change. An environment-level workaround is sketched below.
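
If you want soft partitioning in the meantime, one common environment-level approach is to restrict which GPUs each process can see with CUDA_VISIBLE_DEVICES when you start the agents in step 5. This is a sketch, not a supported RoleMesh feature: it only helps if the node agent and the llama-server processes it spawns inherit the environment and enumerate devices through CUDA (tools that query NVML directly ignore this variable).

# Hypothetical soft partitioning on 192.168.1.101: give each node agent
# (and any llama-server it spawns) visibility of a single GPU.
CUDA_VISIBLE_DEVICES=0 PYTHONPATH=src python -m rolemesh_node_agent.cli --config planner-node.yaml
CUDA_VISIBLE_DEVICES=1 PYTHONPATH=src python -m rolemesh_node_agent.cli --config writer-node.yaml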

1. Gateway config on 192.168.1.100

Save as configs/models.yaml on the gateway host:

version: 1
default_model: planner

auth:
  client_api_keys:
    - "change-me-client-key"
  node_api_keys:
    - "change-me-node-key"

models:
  planner:
    type: discovered
    role: planner
    strategy: round_robin

  writer:
    type: discovered
    role: writer
    strategy: round_robin

  critic:
    type: discovered
    role: critic
    strategy: round_robin

Run the gateway:

ROLE_MESH_CONFIG=configs/models.yaml \
uvicorn rolemesh_gateway.main:app --host 192.168.1.100 --port 8080
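
A quick smoke test before any nodes register. Since all three aliases are discovered, the model list will likely be empty at this point, and GET /ready will stay non-200 until the planner default becomes routable (see Operational notes):

curl -sS http://192.168.1.100:8080/v1/models \
  -H 'X-Api-Key: change-me-client-key'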

2. Node-agent config for planner on 192.168.1.101

Save as planner-node.yaml:

node_id: "gpu101-planner"
listen_host: "192.168.1.101"
listen_port: 8091

dispatcher_base_url: "http://192.168.1.100:8080"
dispatcher_node_key: "change-me-node-key"
dispatcher_roles: ["planner"]
heartbeat_interval_sec: 5

llama_server_bin: "/path/to/llama-server"
llama_server_startup_timeout_s: 45
llama_server_probe_interval_s: 0.5

model_roots:
  - "/models"

models:
  - model_id: "planner-main"
    path: "/models/planner-model.Q5_K_M.gguf"
    roles: ["planner"]
    ctx_size: 8192
    gpu_layers: 60
    threads: 8
    batch_size: 1024
    flash_attn: true
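
For intuition, the per-model fields map roughly onto llama-server flags. The actual argument list is assembled by the node agent, and flag spellings vary across llama.cpp versions, so treat this as a conceptual sketch only:

# Roughly the llama-server invocation implied by planner-main
# (conceptual; the node agent builds the real command line):
/path/to/llama-server \
  --model /models/planner-model.Q5_K_M.gguf \
  --ctx-size 8192 \
  --n-gpu-layers 60 \
  --threads 8 \
  --batch-size 1024 \
  --flash-attn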

3. Node-agent config for writer on 192.168.1.101

Save as writer-node.yaml:

node_id: "gpu101-writer"
listen_host: "192.168.1.101"
listen_port: 8092

dispatcher_base_url: "http://192.168.1.100:8080"
dispatcher_node_key: "change-me-node-key"
dispatcher_roles: ["writer"]
heartbeat_interval_sec: 5

llama_server_bin: "/path/to/llama-server"
llama_server_startup_timeout_s: 45
llama_server_probe_interval_s: 0.5

model_roots:
  - "/models"

models:
  - model_id: "writer-main"
    path: "/models/writer-model.Q5_K_M.gguf"
    roles: ["writer"]
    ctx_size: 8192
    gpu_layers: 60
    threads: 8
    batch_size: 1024
    flash_attn: true

4. Node-agent config for critic on 192.168.1.102

Save as critic-node.yaml:

node_id: "gpu102-critic"
listen_host: "192.168.1.102"
listen_port: 8091

dispatcher_base_url: "http://192.168.1.100:8080"
dispatcher_node_key: "change-me-node-key"
dispatcher_roles: ["critic"]
heartbeat_interval_sec: 5

llama_server_bin: "/path/to/llama-server"
llama_server_startup_timeout_s: 45
llama_server_probe_interval_s: 0.5

model_roots:
  - "/models"

models:
  - model_id: "critic-main"
    path: "/models/critic-model.Q5_K_M.gguf"
    roles: ["critic"]
    ctx_size: 8192
    gpu_layers: 60
    threads: 8
    batch_size: 1024
    flash_attn: true

The path field points at the actual GGUF weight file on that machine; it is the concrete model-weight binding in node-agent mode.
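
Before starting the agents, it is worth confirming that each configured path exists and is readable on its host; a wrong path is an easy failure to rule out first. For example, on 192.168.1.102:

ls -lh /models/critic-model.Q5_K_M.gguf
# A valid GGUF file starts with the ASCII magic "GGUF":
head -c 4 /models/critic-model.Q5_K_M.gguf; echo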

5. Start the three node agents

On 192.168.1.101:

PYTHONPATH=src python -m rolemesh_node_agent.cli --config planner-node.yaml

In a second shell on 192.168.1.101:

PYTHONPATH=src python -m rolemesh_node_agent.cli --config writer-node.yaml

On 192.168.1.102:

PYTHONPATH=src python -m rolemesh_node_agent.cli --config critic-node.yaml
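
These foreground commands are fine for a first test. For anything longer-lived you will want the agents to survive the shell session; a process manager such as systemd is the durable option, but a minimal sketch (same pattern for the two agents on 192.168.1.101) is:

# Run the critic agent in the background with a log file.
PYTHONPATH=src nohup python -m rolemesh_node_agent.cli \
  --config critic-node.yaml > critic-node.log 2>&1 &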

6. Registration behavior

When dispatcher_base_url is set, the node agent registers itself with the gateway automatically on startup, and keeps sending heartbeats at the configured heartbeat_interval_sec afterwards.

For inspection, each node exposes:

curl -sS http://192.168.1.101:8091/v1/node/registration

That endpoint returns the exact served_models payload the node agent will send to the gateway.

Manual registration via POST /v1/nodes/register is still supported, but it is mainly useful for custom tooling or debugging, as sketched below.
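
One way to drive that manually is to replay the node's own payload. This assumes the gateway accepts the node key via the same X-Api-Key header used for client calls; if registration returns 401, check how node authentication is actually configured:

# Fetch the node's served_models payload and POST it to the gateway.
curl -sS http://192.168.1.101:8091/v1/node/registration \
  | curl -sS -X POST http://192.168.1.100:8080/v1/nodes/register \
      -H 'Content-Type: application/json' \
      -H 'X-Api-Key: change-me-node-key' \
      --data-binary @-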

7. Verify the topology

List the currently healthy role aliases:

curl -sS http://192.168.1.100:8080/v1/models \
  -H 'X-Api-Key: change-me-client-key'

Expected result:

  • planner, writer, and critic should appear once each
  • gateway metadata should show the registered nodes and how fresh their heartbeats are
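
If the response follows the OpenAI-style list shape (an assumption worth checking against your gateway version), you can reduce it to just the alias names:

curl -sS http://192.168.1.100:8080/v1/models \
  -H 'X-Api-Key: change-me-client-key' | jq -r '.data[].id'
# expected, in some order: planner, writer, critic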

Check one node directly:

curl -sS http://192.168.1.101:8091/v1/node/inventory

That endpoint shows:

  • discovered devices
  • current model inventory
  • device metrics
  • queue depth and in-flight work

You can also inspect the registration payload directly:

curl -sS http://192.168.1.101:8091/v1/node/registration

8. Send requests by role

Planner request through the gateway:

curl -sS -X POST http://192.168.1.100:8080/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -H 'X-Api-Key: change-me-client-key' \
  -d '{
    "model": "planner",
    "messages": [{"role":"user","content":"Outline a release plan in 3 bullets."}]
  }'

Writer request through the gateway:

curl -sS -X POST http://192.168.1.100:8080/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -H 'X-Api-Key: change-me-client-key' \
  -d '{
    "model": "writer",
    "messages": [{"role":"user","content":"Rewrite this as a concise status update."}]
  }'

Critic request through the gateway:

curl -sS -X POST http://192.168.1.100:8080/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -H 'X-Api-Key: change-me-client-key' \
  -d '{
    "model": "critic",
    "messages": [{"role":"user","content":"List the top two flaws in this plan."}]
  }'
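
For quick checks, assuming the gateway returns the standard OpenAI chat-completions response shape (non-streaming), you can extract just the assistant text with jq:

curl -sS -X POST http://192.168.1.100:8080/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -H 'X-Api-Key: change-me-client-key' \
  -d '{"model":"critic","messages":[{"role":"user","content":"List the top two flaws in this plan."}]}' \
  | jq -r '.choices[0].message.content'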

Operational notes

  • If a node stops heartbeating, the gateway marks it stale and removes it from discovered routing after the configured timeout.
  • GET /ready on the gateway only returns 200 when the configured default_model is currently routable.
  • The first request for a cold model may take longer because the node agent has to start or switch llama-server.
  • The node agent now waits for local llama-server readiness before forwarding the first request, so clients should not see transient upstream "Loading model" errors from a normal cold start.
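
The /ready semantics make a simple wait loop possible when scripting a cold start, polling until the default_model (planner) becomes routable:

# Block until the gateway reports ready.
until curl -fsS http://192.168.1.100:8080/ready > /dev/null; do
  echo "waiting for gateway readiness..."
  sleep 2
done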

When to use proxy mode instead

If you do not need node-level inventory, heartbeats, or on-demand llama-server management, proxy mode is simpler:

  • run one inference server per role yourself
  • point the gateway at each server with type: proxy
  • let the gateway route aliases directly by proxy_url (see the sketch below)
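
A minimal sketch of such an entry, reusing the models map from section 1. The field names type and proxy_url come from this page; the URL is hypothetical, and any other proxy-mode options are not shown here:

models:
  writer:
    type: proxy
    # Hypothetical URL: point at whatever server you run for the writer
    # role, in whatever base-URL form the gateway expects.
    proxy_url: "http://192.168.1.101:9001"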

Use node-agent mode when you want RoleMesh to manage local llama-server processes and expose node inventory to the gateway.