Adding multi-host node agent deployment example.

This commit is contained in:
welsberr 2026-03-16 23:12:51 -04:00
parent 11cab90efa
commit 10926b5558
3 changed files with 344 additions and 0 deletions


@@ -169,6 +169,16 @@ curl -sS -X POST http://127.0.0.1:8000/v1/chat/completions \
If you prefer the provided example file, copy `configs/models.example.yaml` and adjust the `proxy_url` values.
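If you go that route, a minimal sketch (the destination filename and editor choice here are just assumptions):
```bash
# Start from the shipped example, then adjust the proxy_url values for your hosts.
cp configs/models.example.yaml configs/models.yaml
"${EDITOR:-vi}" configs/models.yaml
```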
## Worked Deployment Example
For a concrete multi-machine example, including:
- two node-agent processes on a dual-GPU host
- another node-agent on a second host
- project-defined roles `planner`, `writer`, and `critic`
see [docs/EXAMPLE_MULTI_NODE.md](docs/EXAMPLE_MULTI_NODE.md).
## Known Good Inference Backends
The gateway is designed to work with any backend that exposes OpenAI-compatible `GET /v1/models` and


@@ -69,6 +69,17 @@ Registered nodes age out of discovered-role routing after a heartbeat timeout.
- configure with `ROLE_MESH_NODE_STALE_AFTER_S` (see the sketch after this list)
- stale nodes remain visible for operators in the gateway metadata, but they no longer receive traffic
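As a minimal sketch of tuning that timeout, assuming the gateway reads `ROLE_MESH_NODE_STALE_AFTER_S` from its environment as a number of seconds (the value 120 is illustrative, not a recommendation):
```bash
# Mark registered nodes stale after 120 s without a heartbeat (illustrative value).
ROLE_MESH_NODE_STALE_AFTER_S=120 \
ROLE_MESH_CONFIG=configs/models.yaml \
uvicorn rolemesh_gateway.main:app --host 127.0.0.1 --port 8080
```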
## Worked example: multi-node role routing
For a concrete topology with:
- gateway on one host
- two node-agent processes on a dual-GPU machine
- one additional node-agent on a second machine
- roles `planner`, `writer`, and `critic`
see [EXAMPLE_MULTI_NODE.md](EXAMPLE_MULTI_NODE.md).
## Network binding and exposure (Step 2 hardening)
**Defaults are safe-by-default:** the gateway and node-agent CLIs default to binding on `127.0.0.1` (localhost).

docs/EXAMPLE_MULTI_NODE.md (new file, 323 lines)

@@ -0,0 +1,323 @@
# Example: Two GPUs on One Host, One Remote Host
This example shows a concrete RoleMesh layout with:
- a gateway on `192.168.1.100`
- two node-agent processes on `192.168.1.101`
- one node-agent process on `192.168.1.102`
- three project-defined roles: `planner`, `writer`, and `critic`
The intended topology is:
- `planner` on `192.168.1.101:8091`
- `writer` on `192.168.1.101:8092`
- `critic` on `192.168.1.102:8091`
This is a good pattern when:
- one host has multiple GPUs and you want separate node identities or role-specific configs
- another host contributes an additional model for a different role
- the gateway should route by role without the client needing to know which machine serves which model
## Before you start
Assumptions:
- gateway host: `192.168.1.100`
- dual-GPU host: `192.168.1.101`
- second model host: `192.168.1.102`
- all machines can reach the gateway over the LAN
- each node host has a working `llama-server` binary
- each node host has readable GGUF model files on disk
Important current limitation:
- the node agent does not yet expose a strict per-process GPU allowlist
- separate node-agent processes on `192.168.1.101` are still useful for separate ports, node IDs, and model catalogs
- but the current scheduler still discovers all visible local CUDA GPUs and chooses among them heuristically
So this example is a valid deployment shape, but if you need hard process-to-GPU partitioning, that still needs a follow-up code change.
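Until that lands, one common workaround is to restrict which GPUs each node-agent process can see via `CUDA_VISIBLE_DEVICES`. This is only a sketch of an assumption, not a tested feature: it relies on the CUDA runtime hiding the other device and on the node agent enumerating only the visible GPUs.
```bash
# On 192.168.1.101: pin each node agent to a single GPU at the CUDA-runtime level.
# Assumption: the agent's GPU discovery honors CUDA_VISIBLE_DEVICES.
CUDA_VISIBLE_DEVICES=0 PYTHONPATH=src \
  python -m rolemesh_node_agent.cli --config planner-node.yaml &
CUDA_VISIBLE_DEVICES=1 PYTHONPATH=src \
  python -m rolemesh_node_agent.cli --config writer-node.yaml &
```
The config files referenced here are defined in the sections below.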
## 1. Gateway config on 192.168.1.100
Save as `configs/models.yaml` on the gateway host:
```yaml
version: 1
default_model: planner
auth:
  client_api_keys:
    - "change-me-client-key"
  node_api_keys:
    - "change-me-node-key"
models:
  planner:
    type: discovered
    role: planner
    strategy: round_robin
  writer:
    type: discovered
    role: writer
    strategy: round_robin
  critic:
    type: discovered
    role: critic
    strategy: round_robin
```
Run the gateway:
```bash
ROLE_MESH_CONFIG=configs/models.yaml \
uvicorn rolemesh_gateway.main:app --host 192.168.1.100 --port 8080
```
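At this point you can poke the gateway directly. Note that `GET /ready` only returns `200` once the configured `default_model` (here `planner`) is routable, so it will report not-ready until the planner node registers later in this guide:
```bash
# Readiness: expect a non-200 status until the planner node is registered.
curl -i http://192.168.1.100:8080/ready

# Role aliases: discovered roles only appear once their nodes are registered and healthy.
curl -sS http://192.168.1.100:8080/v1/models \
  -H 'X-Api-Key: change-me-client-key'
```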
## 2. Node-agent config for planner on 192.168.1.101
Save as `planner-node.yaml`:
```yaml
node_id: "gpu101-planner"
listen_host: "192.168.1.101"
listen_port: 8091
dispatcher_base_url: "http://192.168.1.100:8080"
dispatcher_node_key: "change-me-node-key"
dispatcher_roles: ["planner"]
heartbeat_interval_sec: 5
llama_server_bin: "/path/to/llama-server"
llama_server_startup_timeout_s: 45
llama_server_probe_interval_s: 0.5
model_roots:
  - "/models"
models:
  - model_id: "planner-main"
    path: "/models/planner-model.Q5_K_M.gguf"
    roles: ["planner"]
    ctx_size: 8192
    gpu_layers: 60
    threads: 8
    batch_size: 1024
    flash_attn: true
```
## 3. Node-agent config for writer on 192.168.1.101
Save as `writer-node.yaml`:
```yaml
node_id: "gpu101-writer"
listen_host: "192.168.1.101"
listen_port: 8092
dispatcher_base_url: "http://192.168.1.100:8080"
dispatcher_node_key: "change-me-node-key"
dispatcher_roles: ["writer"]
heartbeat_interval_sec: 5
llama_server_bin: "/path/to/llama-server"
llama_server_startup_timeout_s: 45
llama_server_probe_interval_s: 0.5
model_roots:
  - "/models"
models:
  - model_id: "writer-main"
    path: "/models/writer-model.Q5_K_M.gguf"
    roles: ["writer"]
    ctx_size: 8192
    gpu_layers: 60
    threads: 8
    batch_size: 1024
    flash_attn: true
```
## 4. Node-agent config for critic on 192.168.1.102
Save as `critic-node.yaml`:
```yaml
node_id: "gpu102-critic"
listen_host: "192.168.1.102"
listen_port: 8091
dispatcher_base_url: "http://192.168.1.100:8080"
dispatcher_node_key: "change-me-node-key"
dispatcher_roles: ["critic"]
heartbeat_interval_sec: 5
llama_server_bin: "/path/to/llama-server"
llama_server_startup_timeout_s: 45
llama_server_probe_interval_s: 0.5
model_roots:
  - "/models"
models:
  - model_id: "critic-main"
    path: "/models/critic-model.Q5_K_M.gguf"
    roles: ["critic"]
    ctx_size: 8192
    gpu_layers: 60
    threads: 8
    batch_size: 1024
    flash_attn: true
```
The `path` field points at the actual GGUF weight file on that machine; it is the concrete model-weight binding in node-agent mode.
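A quick sanity check before starting each agent is to confirm that the file referenced by `path` exists and is readable on that host (paths below are the placeholders used in the configs above):
```bash
# On 192.168.1.101
ls -lh /models/planner-model.Q5_K_M.gguf /models/writer-model.Q5_K_M.gguf
# On 192.168.1.102
ls -lh /models/critic-model.Q5_K_M.gguf
```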
## 5. Start the three node agents
On `192.168.1.101`:
```bash
PYTHONPATH=src python -m rolemesh_node_agent.cli --config planner-node.yaml
```
In a second shell on `192.168.1.101`:
```bash
PYTHONPATH=src python -m rolemesh_node_agent.cli --config writer-node.yaml
```
On `192.168.1.102`:
```bash
PYTHONPATH=src python -m rolemesh_node_agent.cli --config critic-node.yaml
```
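These foreground commands are easiest for a first bring-up. If the agents should outlive your SSH session, one minimal sketch is `nohup` with a log file per agent (proper process supervision such as systemd is out of scope for this example):
```bash
# On 192.168.1.101: run both agents in the background with separate logs.
nohup env PYTHONPATH=src python -m rolemesh_node_agent.cli \
  --config planner-node.yaml > planner-node.log 2>&1 &
nohup env PYTHONPATH=src python -m rolemesh_node_agent.cli \
  --config writer-node.yaml > writer-node.log 2>&1 &
```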
## 6. Register each node once
The current node agent sends heartbeats automatically, but registration is still a one-time explicit step.
Register planner:
```bash
curl -sS -X POST http://192.168.1.100:8080/v1/nodes/register \
  -H 'Content-Type: application/json' \
  -H 'X-RoleMesh-Node-Key: change-me-node-key' \
  -d '{
    "node_id": "gpu101-planner",
    "base_url": "http://192.168.1.101:8091",
    "roles": ["planner"]
  }'
```
Register writer:
```bash
curl -sS -X POST http://192.168.1.100:8080/v1/nodes/register \
  -H 'Content-Type: application/json' \
  -H 'X-RoleMesh-Node-Key: change-me-node-key' \
  -d '{
    "node_id": "gpu101-writer",
    "base_url": "http://192.168.1.101:8092",
    "roles": ["writer"]
  }'
```
Register critic:
```bash
curl -sS -X POST http://192.168.1.100:8080/v1/nodes/register \
  -H 'Content-Type: application/json' \
  -H 'X-RoleMesh-Node-Key: change-me-node-key' \
  -d '{
    "node_id": "gpu102-critic",
    "base_url": "http://192.168.1.102:8091",
    "roles": ["critic"]
  }'
```
After that, the heartbeat loop on each node agent keeps the registry entry fresh.
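If you prefer to script the three registrations, a small loop over `node_id|base_url|role` tuples performs the same calls as above:
```bash
GATEWAY=http://192.168.1.100:8080
NODE_KEY=change-me-node-key

for entry in \
  "gpu101-planner|http://192.168.1.101:8091|planner" \
  "gpu101-writer|http://192.168.1.101:8092|writer" \
  "gpu102-critic|http://192.168.1.102:8091|critic"; do
  IFS='|' read -r node_id base_url role <<< "$entry"
  curl -sS -X POST "$GATEWAY/v1/nodes/register" \
    -H 'Content-Type: application/json' \
    -H "X-RoleMesh-Node-Key: $NODE_KEY" \
    -d "{\"node_id\": \"$node_id\", \"base_url\": \"$base_url\", \"roles\": [\"$role\"]}"
  echo
done
```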
## 7. Verify the topology
List the currently healthy role aliases:
```bash
curl -sS http://192.168.1.100:8080/v1/models \
  -H 'X-Api-Key: change-me-client-key'
```
Expected result:
- `planner`, `writer`, and `critic` should appear once each
- gateway metadata should show the registered nodes and their freshness
Check one node directly:
```bash
curl -sS http://192.168.1.101:8091/v1/node/inventory
```
That endpoint shows:
- discovered devices
- current model inventory
- device metrics
- queue depth and in-flight work
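To eyeball all three agents at once, loop over their addresses and pretty-print each inventory payload (this only formats the JSON; it does not assume any particular field names):
```bash
for node in 192.168.1.101:8091 192.168.1.101:8092 192.168.1.102:8091; do
  echo "== $node =="
  curl -sS "http://$node/v1/node/inventory" | python -m json.tool
done
```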
## 8. Send requests by role
Planner request through the gateway:
```bash
curl -sS -X POST http://192.168.1.100:8080/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -H 'X-Api-Key: change-me-client-key' \
  -d '{
    "model": "planner",
    "messages": [{"role":"user","content":"Outline a release plan in 3 bullets."}]
  }'
```
Writer request through the gateway:
```bash
curl -sS -X POST http://192.168.1.100:8080/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -H 'X-Api-Key: change-me-client-key' \
  -d '{
    "model": "writer",
    "messages": [{"role":"user","content":"Rewrite this as a concise status update."}]
  }'
```
Critic request through the gateway:
```bash
curl -sS -X POST http://192.168.1.100:8080/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -H 'X-Api-Key: change-me-client-key' \
  -d '{
    "model": "critic",
    "messages": [{"role":"user","content":"List the top two flaws in this plan."}]
  }'
```
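Because all three roles sit behind one OpenAI-compatible endpoint, chaining them from a script is straightforward. A minimal sketch, assuming `jq` is installed and the gateway returns standard OpenAI-style responses with `choices[0].message.content`:
```bash
GATEWAY=http://192.168.1.100:8080
API_KEY=change-me-client-key

# ask <role-alias> <prompt>: send one chat completion and print the reply text.
ask() {
  curl -sS -X POST "$GATEWAY/v1/chat/completions" \
    -H 'Content-Type: application/json' \
    -H "X-Api-Key: $API_KEY" \
    -d "$(jq -n --arg m "$1" --arg p "$2" \
          '{model: $m, messages: [{role: "user", content: $p}]}')" |
    jq -r '.choices[0].message.content'
}

plan=$(ask planner "Outline a release plan in 3 bullets.")
printf 'PLAN:\n%s\n\n' "$plan"

update=$(ask writer "Rewrite this plan as a concise status update: $plan")
printf 'UPDATE:\n%s\n\n' "$update"

printf 'CRITIQUE:\n'
ask critic "List the top two flaws in this plan: $plan"
```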
## Operational notes
- If a node stops heartbeating, the gateway marks it stale and removes it from discovered routing after the configured timeout.
- `GET /ready` on the gateway only returns `200` when the configured `default_model` is currently routable.
- The first request for a cold model may take longer because the node agent has to start or switch `llama-server`.
- The node agent now waits for local `llama-server` readiness before forwarding the first request, so clients should not see transient upstream "Loading model" errors from a normal cold start.
## When to use proxy mode instead
If you do not need node-level inventory, heartbeats, or on-demand `llama-server` management, proxy mode is simpler:
- run one inference server per role yourself
- point the gateway at each server with `type: proxy`
- let the gateway route aliases directly by `proxy_url`
Use node-agent mode when you want RoleMesh to manage local `llama-server` processes and expose node inventory to the gateway.
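For orientation only, a rough sketch of what proxy-mode aliases could look like in `models.yaml` (the exact schema is in `configs/models.example.yaml`; the backend URLs below are illustrative servers you would run and manage yourself):
```yaml
models:
  planner:
    type: proxy
    proxy_url: "http://192.168.1.101:9001"
  writer:
    type: proxy
    proxy_url: "http://192.168.1.101:9002"
  critic:
    type: proxy
    proxy_url: "http://192.168.1.102:9001"
```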