Example: Two GPUs on One Host, One Remote Host
This example shows a concrete RoleMesh layout with:
- a gateway on 192.168.1.100
- two node-agent processes on 192.168.1.101
- one node-agent process on 192.168.1.102
- three project-defined roles: planner, writer, and critic
The intended topology is:
- planner on 192.168.1.101:8091
- writer on 192.168.1.101:8092
- critic on 192.168.1.102:8091
This is a good pattern when:
- one host has multiple GPUs and you want separate node identities or role-specific configs
- another host contributes an additional model for a different role
- the gateway should route by role without the client needing to know which machine serves which model
Before you start
Assumptions:
- gateway host: 192.168.1.100
- dual-GPU host: 192.168.1.101
- second model host: 192.168.1.102
- all machines can reach the gateway over the LAN
- each node host has a working llama-server binary
- each node host has readable GGUF model files on disk
Important current limitation:
- the node agent does not yet expose a strict per-process GPU allowlist
- separate node-agent processes on 192.168.1.101 are still useful for separate ports, node IDs, and model catalogs
- but the current scheduler still discovers all visible local CUDA GPUs and chooses among them heuristically
So this example is a valid deployment shape, but if you need hard process-to-GPU partitioning, that still needs a follow-up code change.
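Until that change exists, one interim workaround (an assumption about your environment, not a RoleMesh feature) is to restrict what each agent process can see with CUDA_VISIBLE_DEVICES when you launch it in step 5, relying on the spawned llama-server inheriting that environment:
# Assumed workaround: pin each node-agent process (and the llama-server it spawns)
# to one GPU by restricting CUDA device visibility at launch time.
CUDA_VISIBLE_DEVICES=0 PYTHONPATH=src python -m rolemesh_node_agent.cli --config planner-node.yaml
CUDA_VISIBLE_DEVICES=1 PYTHONPATH=src python -m rolemesh_node_agent.cli --config writer-node.yaml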
1. Gateway config on 192.168.1.100
Save as configs/models.yaml on the gateway host:
version: 1
default_model: planner
auth:
  client_api_keys:
    - "change-me-client-key"
  node_api_keys:
    - "change-me-node-key"
models:
  planner:
    type: discovered
    role: planner
    strategy: round_robin
  writer:
    type: discovered
    role: writer
    strategy: round_robin
  critic:
    type: discovered
    role: critic
    strategy: round_robin
Run the gateway:
ROLE_MESH_CONFIG=configs/models.yaml \
uvicorn rolemesh_gateway.main:app --host 192.168.1.100 --port 8080
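Before configuring any nodes, it is worth checking that the gateway answers at all. At this point the alias list will most likely be empty, since nothing has registered yet:
curl -sS http://192.168.1.100:8080/v1/models \
  -H 'X-Api-Key: change-me-client-key'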
2. Node-agent config for planner on 192.168.1.101
Save as planner-node.yaml:
node_id: "gpu101-planner"
listen_host: "192.168.1.101"
listen_port: 8091
dispatcher_base_url: "http://192.168.1.100:8080"
dispatcher_node_key: "change-me-node-key"
dispatcher_roles: ["planner"]
heartbeat_interval_sec: 5
llama_server_bin: "/path/to/llama-server"
llama_server_startup_timeout_s: 45
llama_server_probe_interval_s: 0.5
model_roots:
  - "/models"
models:
  - model_id: "planner-main"
    path: "/models/planner-model.Q5_K_M.gguf"
    roles: ["planner"]
    ctx_size: 8192
    gpu_layers: 60
    threads: 8
    batch_size: 1024
    flash_attn: true
3. Node-agent config for writer on 192.168.1.101
Save as writer-node.yaml:
node_id: "gpu101-writer"
listen_host: "192.168.1.101"
listen_port: 8092
dispatcher_base_url: "http://192.168.1.100:8080"
dispatcher_node_key: "change-me-node-key"
dispatcher_roles: ["writer"]
heartbeat_interval_sec: 5
llama_server_bin: "/path/to/llama-server"
llama_server_startup_timeout_s: 45
llama_server_probe_interval_s: 0.5
model_roots:
  - "/models"
models:
  - model_id: "writer-main"
    path: "/models/writer-model.Q5_K_M.gguf"
    roles: ["writer"]
    ctx_size: 8192
    gpu_layers: 60
    threads: 8
    batch_size: 1024
    flash_attn: true
4. Node-agent config for critic on 192.168.1.102
Save as critic-node.yaml:
node_id: "gpu102-critic"
listen_host: "192.168.1.102"
listen_port: 8091
dispatcher_base_url: "http://192.168.1.100:8080"
dispatcher_node_key: "change-me-node-key"
dispatcher_roles: ["critic"]
heartbeat_interval_sec: 5
llama_server_bin: "/path/to/llama-server"
llama_server_startup_timeout_s: 45
llama_server_probe_interval_s: 0.5
model_roots:
  - "/models"
models:
  - model_id: "critic-main"
    path: "/models/critic-model.Q5_K_M.gguf"
    roles: ["critic"]
    ctx_size: 8192
    gpu_layers: 60
    threads: 8
    batch_size: 1024
    flash_attn: true
The path field is where you point to the actual GGUF weight file on that machine.
That is the concrete model-weight binding for node-agent mode.
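Before starting the agents, a quick pre-flight check on each node host catches the most common startup failure, a mistyped weight path. Plain shell, using the paths from the configs above:
# On 192.168.1.101: confirm both GGUF files exist and are readable
for f in /models/planner-model.Q5_K_M.gguf /models/writer-model.Q5_K_M.gguf; do
  [ -r "$f" ] && echo "ok: $f" || echo "missing or unreadable: $f"
done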
5. Start the three node agents
On 192.168.1.101:
PYTHONPATH=src python -m rolemesh_node_agent.cli --config planner-node.yaml
In a second shell on 192.168.1.101:
PYTHONPATH=src python -m rolemesh_node_agent.cli --config writer-node.yaml
On 192.168.1.102:
PYTHONPATH=src python -m rolemesh_node_agent.cli --config critic-node.yaml
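These commands run in the foreground, which is fine for a first bring-up. If the agents should survive a closed shell, one simple approach (an assumption about your setup, not a RoleMesh requirement) is nohup with a log file per agent; a proper process manager such as systemd works just as well but is not shown here:
# On 192.168.1.101: run both agents in the background with separate logs
nohup env PYTHONPATH=src python -m rolemesh_node_agent.cli --config planner-node.yaml > planner-agent.log 2>&1 &
nohup env PYTHONPATH=src python -m rolemesh_node_agent.cli --config writer-node.yaml > writer-agent.log 2>&1 &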
6. Register each node once
The current node agent sends heartbeats automatically, but registration is still a one-time explicit step.
Register planner:
curl -sS -X POST http://192.168.1.100:8080/v1/nodes/register \
-H 'Content-Type: application/json' \
-H 'X-RoleMesh-Node-Key: change-me-node-key' \
-d '{
"node_id": "gpu101-planner",
"base_url": "http://192.168.1.101:8091",
"roles": ["planner"]
}'
Register writer:
curl -sS -X POST http://192.168.1.100:8080/v1/nodes/register \
-H 'Content-Type: application/json' \
-H 'X-RoleMesh-Node-Key: change-me-node-key' \
-d '{
"node_id": "gpu101-writer",
"base_url": "http://192.168.1.101:8092",
"roles": ["writer"]
}'
Register critic:
curl -sS -X POST http://192.168.1.100:8080/v1/nodes/register \
-H 'Content-Type: application/json' \
-H 'X-RoleMesh-Node-Key: change-me-node-key' \
-d '{
"node_id": "gpu102-critic",
"base_url": "http://192.168.1.102:8091",
"roles": ["critic"]
}'
After that, the heartbeat loop on each node agent keeps the registry entry fresh.
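If you script the rollout, the three registration calls can be driven from one small loop; this is only a convenience wrapper around the same /v1/nodes/register endpoint, with the node IDs, URLs, and roles taken from the configs above:
GATEWAY=http://192.168.1.100:8080
NODE_KEY=change-me-node-key
# node_id|base_url|role triples matching the three node-agent configs
for entry in \
  "gpu101-planner|http://192.168.1.101:8091|planner" \
  "gpu101-writer|http://192.168.1.101:8092|writer" \
  "gpu102-critic|http://192.168.1.102:8091|critic"; do
  IFS='|' read -r id url role <<< "$entry"
  curl -sS -X POST "$GATEWAY/v1/nodes/register" \
    -H 'Content-Type: application/json' \
    -H "X-RoleMesh-Node-Key: $NODE_KEY" \
    -d "{\"node_id\": \"$id\", \"base_url\": \"$url\", \"roles\": [\"$role\"]}"
  echo
done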
7. Verify the topology
List the currently healthy role aliases:
curl -sS http://192.168.1.100:8080/v1/models \
-H 'X-Api-Key: change-me-client-key'
Expected result:
- planner, writer, and critic should appear once each
- gateway metadata should show the registered nodes and their freshness
Check one node directly:
curl -sS http://192.168.1.101:8091/v1/node/inventory
That endpoint shows:
- discovered devices
- current model inventory
- device metrics
- queue depth and in-flight work
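To eyeball all three agents at once, loop over their inventory endpoints; jq is optional here and only pretty-prints the JSON:
for node in 192.168.1.101:8091 192.168.1.101:8092 192.168.1.102:8091; do
  echo "== $node =="
  curl -sS "http://$node/v1/node/inventory" | jq .
done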
8. Send requests by role
Planner request through the gateway:
curl -sS -X POST http://192.168.1.100:8080/v1/chat/completions \
-H 'Content-Type: application/json' \
-H 'X-Api-Key: change-me-client-key' \
-d '{
"model": "planner",
"messages": [{"role":"user","content":"Outline a release plan in 3 bullets."}]
}'
Writer request through the gateway:
curl -sS -X POST http://192.168.1.100:8080/v1/chat/completions \
-H 'Content-Type: application/json' \
-H 'X-Api-Key: change-me-client-key' \
-d '{
"model": "writer",
"messages": [{"role":"user","content":"Rewrite this as a concise status update."}]
}'
Critic request through the gateway:
curl -sS -X POST http://192.168.1.100:8080/v1/chat/completions \
-H 'Content-Type: application/json' \
-H 'X-Api-Key: change-me-client-key' \
-d '{
"model": "critic",
"messages": [{"role":"user","content":"List the top two flaws in this plan."}]
}'
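Because every role sits behind the same gateway endpoint, the three calls above compose naturally into a small pipeline. The sketch below assumes the response body follows the usual OpenAI-style choices[0].message.content shape and uses jq both to build and to parse JSON; adjust the filter if the gateway returns a different envelope:
GATEWAY=http://192.168.1.100:8080
KEY=change-me-client-key
# ask <role> <prompt>: send one chat request to a role alias and print the reply text
ask() {
  curl -sS -X POST "$GATEWAY/v1/chat/completions" \
    -H 'Content-Type: application/json' \
    -H "X-Api-Key: $KEY" \
    -d "$(jq -n --arg m "$1" --arg p "$2" '{model: $m, messages: [{role: "user", content: $p}]}')" \
    | jq -r '.choices[0].message.content'
}
plan=$(ask planner "Outline a release plan in 3 bullets.")
echo "$plan"
ask writer "Rewrite this plan as a concise status update: $plan"
ask critic "List the top two flaws in this plan: $plan"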
Operational notes
- If a node stops heartbeating, the gateway marks it stale and removes it from discovered routing after the configured timeout.
- GET /ready on the gateway only returns 200 when the configured default_model is currently routable (a quick probe is shown after these notes).
- The first request for a cold model may take longer because the node agent has to start or switch llama-server.
- The node agent now waits for local llama-server readiness before forwarding the first request, so clients should not see transient upstream "Loading model" errors from a normal cold start.
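A quick way to poll that readiness signal from a script, printing only the HTTP status code:
# Expect 200 once the default_model (planner) is routable; add
# -H 'X-Api-Key: ...' if your gateway requires auth on this endpoint.
curl -sS -o /dev/null -w '%{http_code}\n' http://192.168.1.100:8080/ready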
When to use proxy mode instead
If you do not need node-level inventory, heartbeats, or on-demand llama-server management, proxy mode is simpler:
- run one inference server per role yourself
- point the gateway at each server with type: proxy (sketched below)
- let the gateway route aliases directly by proxy_url
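For contrast, a proxy-mode model entry might look like the following sketch. The type and proxy_url fields come from this section; the URL, port, and any additional fields are assumptions about your own server layout rather than a verified schema:
models:
  writer:
    type: proxy
    proxy_url: "http://192.168.1.101:9092"  # an inference server you run and manage yourself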
Use node-agent mode when you want RoleMesh to manage local llama-server processes and expose node inventory to the gateway.