Adding multi-host node agent deployment example.
parent 11cab90efa
commit 10926b5558
README.md
@@ -169,6 +169,16 @@ curl -sS -X POST http://127.0.0.1:8000/v1/chat/completions \

If you prefer the provided example file, copy `configs/models.example.yaml` and adjust the `proxy_url` values.
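For example (assuming you run the gateway from the repository root):

```bash
# Start from the shipped example, then point the proxy_url values at your own backends.
cp configs/models.example.yaml configs/models.yaml
$EDITOR configs/models.yaml   # adjust proxy_url for each model entry
```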

## Worked Deployment Example

For a concrete multi-machine example, including:

- two node-agent processes on a dual-GPU host
- another node-agent on a second host
- project-defined roles `planner`, `writer`, and `critic`

see [docs/EXAMPLE_MULTI_NODE.md](docs/EXAMPLE_MULTI_NODE.md).

## Known Good Inference Backends

The gateway is designed to work with any backend that exposes OpenAI-compatible `GET /v1/models` and

@@ -69,6 +69,17 @@ Registered nodes age out of discovered-role routing after a heartbeat timeout.

- configure with `ROLE_MESH_NODE_STALE_AFTER_S`
- stale nodes remain visible to operators in the gateway metadata, but they no longer receive traffic
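For example, to age stale nodes out after 30 seconds (assuming, per the `_S` suffix, that the value is in seconds and read from the gateway's environment):

```bash
# Assumption: ROLE_MESH_NODE_STALE_AFTER_S is read at gateway startup, in seconds.
ROLE_MESH_NODE_STALE_AFTER_S=30 \
ROLE_MESH_CONFIG=configs/models.yaml \
uvicorn rolemesh_gateway.main:app --host 127.0.0.1 --port 8080
```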

## Worked example: multi-node role routing

For a concrete topology with:

- gateway on one host
- two node-agent processes on a dual-GPU machine
- one additional node-agent on a second machine
- roles `planner`, `writer`, and `critic`

see [EXAMPLE_MULTI_NODE.md](EXAMPLE_MULTI_NODE.md).

## Network binding and exposure (Step 2 hardening)

**Safe by default:** the gateway and node-agent CLIs bind to `127.0.0.1` (localhost) unless configured otherwise.
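To expose the gateway on a LAN interface deliberately, pass the host explicitly, as the worked example does:

```bash
# Binds to a routable LAN address instead of the localhost default.
uvicorn rolemesh_gateway.main:app --host 192.168.1.100 --port 8080
```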

@@ -0,0 +1,323 @@

# Example: Two GPUs on One Host, One Remote Host

This example shows a concrete RoleMesh layout with:

- a gateway on `192.168.1.100`
- two node-agent processes on `192.168.1.101`
- one node-agent process on `192.168.1.102`
- three project-defined roles: `planner`, `writer`, and `critic`

The intended topology is:

- `planner` on `192.168.1.101:8091`
- `writer` on `192.168.1.101:8092`
- `critic` on `192.168.1.102:8091`

This is a good pattern when:

- one host has multiple GPUs and you want separate node identities or role-specific configs
- another host contributes an additional model for a different role
- the gateway should route by role without the client needing to know which machine serves which model

## Before you start

Assumptions:

- gateway host: `192.168.1.100`
- dual-GPU host: `192.168.1.101`
- second model host: `192.168.1.102`
- all machines can reach the gateway over the LAN
- each node host has a working `llama-server` binary
- each node host has readable GGUF model files on disk

Important current limitation:

- the node agent does not yet expose a strict per-process GPU allowlist
- separate node-agent processes on `192.168.1.101` are still useful for separate ports, node IDs, and model catalogs
- but the current scheduler still discovers all visible local CUDA GPUs and chooses among them heuristically

So this example is a valid deployment shape, but if you need hard process-to-GPU partitioning, that still needs a follow-up code change.
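One possible interim workaround, not a substitute for that change: restrict each agent's GPU visibility with `CUDA_VISIBLE_DEVICES`. This only helps if the node agent and the `llama-server` processes it spawns inherit the environment, which is an assumption about the current process model:

```bash
# Hypothetical partitioning on the dual-GPU host; effective only if the
# spawned llama-server inherits CUDA_VISIBLE_DEVICES from its node agent.
CUDA_VISIBLE_DEVICES=0 PYTHONPATH=src python -m rolemesh_node_agent.cli --config planner-node.yaml &
CUDA_VISIBLE_DEVICES=1 PYTHONPATH=src python -m rolemesh_node_agent.cli --config writer-node.yaml &
```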

## 1. Gateway config on 192.168.1.100

Save as `configs/models.yaml` on the gateway host:

```yaml
version: 1
default_model: planner

auth:
  client_api_keys:
    - "change-me-client-key"
  node_api_keys:
    - "change-me-node-key"

models:
  planner:
    type: discovered
    role: planner
    strategy: round_robin

  writer:
    type: discovered
    role: writer
    strategy: round_robin

  critic:
    type: discovered
    role: critic
    strategy: round_robin
```

Run the gateway:

```bash
ROLE_MESH_CONFIG=configs/models.yaml \
uvicorn rolemesh_gateway.main:app --host 192.168.1.100 --port 8080
```

## 2. Node-agent config for planner on 192.168.1.101

Save as `planner-node.yaml`:

```yaml
node_id: "gpu101-planner"
listen_host: "192.168.1.101"
listen_port: 8091

dispatcher_base_url: "http://192.168.1.100:8080"
dispatcher_node_key: "change-me-node-key"
dispatcher_roles: ["planner"]
heartbeat_interval_sec: 5

llama_server_bin: "/path/to/llama-server"
llama_server_startup_timeout_s: 45
llama_server_probe_interval_s: 0.5

model_roots:
  - "/models"

models:
  - model_id: "planner-main"
    path: "/models/planner-model.Q5_K_M.gguf"
    roles: ["planner"]
    ctx_size: 8192
    gpu_layers: 60
    threads: 8
    batch_size: 1024
    flash_attn: true
```

## 3. Node-agent config for writer on 192.168.1.101

Save as `writer-node.yaml`:

```yaml
node_id: "gpu101-writer"
listen_host: "192.168.1.101"
listen_port: 8092

dispatcher_base_url: "http://192.168.1.100:8080"
dispatcher_node_key: "change-me-node-key"
dispatcher_roles: ["writer"]
heartbeat_interval_sec: 5

llama_server_bin: "/path/to/llama-server"
llama_server_startup_timeout_s: 45
llama_server_probe_interval_s: 0.5

model_roots:
  - "/models"

models:
  - model_id: "writer-main"
    path: "/models/writer-model.Q5_K_M.gguf"
    roles: ["writer"]
    ctx_size: 8192
    gpu_layers: 60
    threads: 8
    batch_size: 1024
    flash_attn: true
```

## 4. Node-agent config for critic on 192.168.1.102

Save as `critic-node.yaml`:

```yaml
node_id: "gpu102-critic"
listen_host: "192.168.1.102"
listen_port: 8091

dispatcher_base_url: "http://192.168.1.100:8080"
dispatcher_node_key: "change-me-node-key"
dispatcher_roles: ["critic"]
heartbeat_interval_sec: 5

llama_server_bin: "/path/to/llama-server"
llama_server_startup_timeout_s: 45
llama_server_probe_interval_s: 0.5

model_roots:
  - "/models"

models:
  - model_id: "critic-main"
    path: "/models/critic-model.Q5_K_M.gguf"
    roles: ["critic"]
    ctx_size: 8192
    gpu_layers: 60
    threads: 8
    batch_size: 1024
    flash_attn: true
```

The `path` field is where you point to the actual GGUF weight file on that machine. That is the concrete model-weight binding for node-agent mode.
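Before starting the agents, a quick sanity check that each host can actually read its configured weight files avoids a confusing startup failure:

```bash
# On 192.168.1.101; repeat with the critic path on 192.168.1.102.
test -r /models/planner-model.Q5_K_M.gguf && test -r /models/writer-model.Q5_K_M.gguf \
  && echo "model files readable" || echo "missing or unreadable model file"
```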

## 5. Start the three node agents

On `192.168.1.101`:

```bash
PYTHONPATH=src python -m rolemesh_node_agent.cli --config planner-node.yaml
```

In a second shell on `192.168.1.101`:

```bash
PYTHONPATH=src python -m rolemesh_node_agent.cli --config writer-node.yaml
```

On `192.168.1.102`:

```bash
PYTHONPATH=src python -m rolemesh_node_agent.cli --config critic-node.yaml
```

## 6. Register each node once

The current node agent sends heartbeats automatically, but registration is still a one-time explicit step.

Register planner:

```bash
curl -sS -X POST http://192.168.1.100:8080/v1/nodes/register \
  -H 'Content-Type: application/json' \
  -H 'X-RoleMesh-Node-Key: change-me-node-key' \
  -d '{
    "node_id": "gpu101-planner",
    "base_url": "http://192.168.1.101:8091",
    "roles": ["planner"]
  }'
```

Register writer:

```bash
curl -sS -X POST http://192.168.1.100:8080/v1/nodes/register \
  -H 'Content-Type: application/json' \
  -H 'X-RoleMesh-Node-Key: change-me-node-key' \
  -d '{
    "node_id": "gpu101-writer",
    "base_url": "http://192.168.1.101:8092",
    "roles": ["writer"]
  }'
```

Register critic:

```bash
curl -sS -X POST http://192.168.1.100:8080/v1/nodes/register \
  -H 'Content-Type: application/json' \
  -H 'X-RoleMesh-Node-Key: change-me-node-key' \
  -d '{
    "node_id": "gpu102-critic",
    "base_url": "http://192.168.1.102:8091",
    "roles": ["critic"]
  }'
```

After that, the heartbeat loop on each node agent keeps the registry entry fresh.

## 7. Verify the topology

List the currently healthy role aliases:

```bash
curl -sS http://192.168.1.100:8080/v1/models \
  -H 'X-Api-Key: change-me-client-key'
```

Expected result:

- `planner`, `writer`, and `critic` should appear once each
- gateway metadata should show the registered nodes and their freshness
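To check the alias list from the shell, assuming the standard OpenAI-style response shape with a `data` array:

```bash
curl -sS http://192.168.1.100:8080/v1/models \
  -H 'X-Api-Key: change-me-client-key' | jq -r '.data[].id'
# expected (order may vary):
# planner
# writer
# critic
```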

Check one node directly:

```bash
curl -sS http://192.168.1.101:8091/v1/node/inventory
```

That endpoint shows:

- discovered devices
- current model inventory
- device metrics
- queue depth and in-flight work

## 8. Send requests by role

Planner request through the gateway:

```bash
curl -sS -X POST http://192.168.1.100:8080/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -H 'X-Api-Key: change-me-client-key' \
  -d '{
    "model": "planner",
    "messages": [{"role":"user","content":"Outline a release plan in 3 bullets."}]
  }'
```

Writer request through the gateway:

```bash
curl -sS -X POST http://192.168.1.100:8080/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -H 'X-Api-Key: change-me-client-key' \
  -d '{
    "model": "writer",
    "messages": [{"role":"user","content":"Rewrite this as a concise status update."}]
  }'
```

Critic request through the gateway:

```bash
curl -sS -X POST http://192.168.1.100:8080/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -H 'X-Api-Key: change-me-client-key' \
  -d '{
    "model": "critic",
    "messages": [{"role":"user","content":"List the top two flaws in this plan."}]
  }'
```

## Operational notes

- If a node stops heartbeating, the gateway marks it stale and removes it from discovered routing after the configured timeout.
- `GET /ready` on the gateway only returns `200` when the configured `default_model` is currently routable.
- The first request for a cold model may take longer because the node agent has to start or switch `llama-server`.
- The node agent now waits for local `llama-server` readiness before forwarding the first request, so clients should not see transient upstream "Loading model" errors from a normal cold start.
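A minimal readiness probe against that `/ready` endpoint, assuming it is served unauthenticated like most health endpoints:

```bash
# Prints 200 once the default_model (planner) is routable.
curl -sS -o /dev/null -w '%{http_code}\n' http://192.168.1.100:8080/ready
```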

## When to use proxy mode instead

If you do not need node-level inventory, heartbeats, or on-demand `llama-server` management, proxy mode is simpler:

- run one inference server per role yourself
- point the gateway at each server with `type: proxy` (sketched below)
- let the gateway route aliases directly by `proxy_url`
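A minimal sketch of such an entry, assuming proxy entries live in the same `models` map as the discovered ones (check `configs/models.example.yaml` for the authoritative schema):

```yaml
models:
  writer:
    type: proxy
    # Hypothetical direct binding to the writer backend from this example.
    proxy_url: "http://192.168.1.101:8092"
```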

Use node-agent mode when you want RoleMesh to manage local `llama-server` processes and expose node inventory to the gateway.