From 10926b55588935bf5deb55329cd6717298a613f7 Mon Sep 17 00:00:00 2001
From: welsberr
Date: Mon, 16 Mar 2026 23:12:51 -0400
Subject: [PATCH] Adding multi-host node agent deployment example.

---
 README.md                  |  10 ++
 docs/DEPLOYMENT.md         |  11 ++
 docs/EXAMPLE_MULTI_NODE.md | 323 +++++++++++++++++++++++++++++++++++++
 3 files changed, 344 insertions(+)
 create mode 100644 docs/EXAMPLE_MULTI_NODE.md

diff --git a/README.md b/README.md
index 3a7f3f8..273f345 100644
--- a/README.md
+++ b/README.md
@@ -169,6 +169,16 @@ curl -sS -X POST http://127.0.0.1:8000/v1/chat/completions \
 
 If you prefer the provided example file, copy `configs/models.example.yaml` and adjust the `proxy_url` values.
 
+## Worked Deployment Example
+
+For a concrete multi-machine example, including:
+
+- two node-agent processes on a dual-GPU host
+- another node-agent on a second host
+- project-defined roles `planner`, `writer`, and `critic`
+
+see [docs/EXAMPLE_MULTI_NODE.md](docs/EXAMPLE_MULTI_NODE.md).
+
 ## Known Good Inference Backends
 
 The gateway is designed to work with any backend that exposes OpenAI-compatible `GET /v1/models` and
diff --git a/docs/DEPLOYMENT.md b/docs/DEPLOYMENT.md
index 02c9aee..1b0ba29 100644
--- a/docs/DEPLOYMENT.md
+++ b/docs/DEPLOYMENT.md
@@ -69,6 +69,17 @@ Registered nodes age out of discovered-role routing after a heartbeat timeout.
 - configure with `ROLE_MESH_NODE_STALE_AFTER_S`
 - stale nodes remain visible for operators in the gateway metadata, but they no longer receive traffic
 
+## Worked example: multi-node role routing
+
+For a concrete topology with:
+
+- gateway on one host
+- two node-agent processes on a dual-GPU machine
+- one additional node-agent on a second machine
+- roles `planner`, `writer`, and `critic`
+
+see [EXAMPLE_MULTI_NODE.md](EXAMPLE_MULTI_NODE.md).
+
 ## Network binding and exposure (Step 2 hardening)
 
 **Defaults are safe-by-default:** the gateway and node-agent CLIs default to binding on `127.0.0.1` (localhost).
diff --git a/docs/EXAMPLE_MULTI_NODE.md b/docs/EXAMPLE_MULTI_NODE.md
new file mode 100644
index 0000000..c8443ac
--- /dev/null
+++ b/docs/EXAMPLE_MULTI_NODE.md
@@ -0,0 +1,323 @@
+# Example: Two GPUs on One Host, One Remote Host
+
+This example shows a concrete RoleMesh layout with:
+
+- a gateway on `192.168.1.100`
+- two node-agent processes on `192.168.1.101`
+- one node-agent process on `192.168.1.102`
+- three project-defined roles: `planner`, `writer`, and `critic`
+
+The intended topology is:
+
+- `planner` on `192.168.1.101:8091`
+- `writer` on `192.168.1.101:8092`
+- `critic` on `192.168.1.102:8091`
+
+This is a good pattern when:
+
+- one host has multiple GPUs and you want separate node identities or role-specific configs
+- another host contributes an additional model for a different role
+- the gateway should route by role without the client needing to know which machine serves which model
+
+## Before you start
+
+Assumptions:
+
+- gateway host: `192.168.1.100`
+- dual-GPU host: `192.168.1.101`
+- second model host: `192.168.1.102`
+- all machines can reach the gateway over the LAN
+- each node host has a working `llama-server` binary
+- each node host has readable GGUF model files on disk
+
+Important current limitation:
+
+- the node agent does not yet expose a strict per-process GPU allowlist
+- separate node-agent processes on `192.168.1.101` are still useful for separate ports, node IDs, and model catalogs
+- but the current scheduler still discovers all visible local CUDA GPUs and chooses among them heuristically
+
+So this example is a valid deployment shape, but if you need hard process-to-GPU partitioning, that still needs a follow-up code change.
+
+## 1. Gateway config on 192.168.1.100
+
+Save as `configs/models.yaml` on the gateway host:
+
+```yaml
+version: 1
+default_model: planner
+
+auth:
+  client_api_keys:
+    - "change-me-client-key"
+  node_api_keys:
+    - "change-me-node-key"
+
+models:
+  planner:
+    type: discovered
+    role: planner
+    strategy: round_robin
+
+  writer:
+    type: discovered
+    role: writer
+    strategy: round_robin
+
+  critic:
+    type: discovered
+    role: critic
+    strategy: round_robin
+```
+
+Run the gateway:
+
+```bash
+ROLE_MESH_CONFIG=configs/models.yaml \
+uvicorn rolemesh_gateway.main:app --host 192.168.1.100 --port 8080
+```
+
+## 2. Node-agent config for planner on 192.168.1.101
+
+Save as `planner-node.yaml`:
+
+```yaml
+node_id: "gpu101-planner"
+listen_host: "192.168.1.101"
+listen_port: 8091
+
+dispatcher_base_url: "http://192.168.1.100:8080"
+dispatcher_node_key: "change-me-node-key"
+dispatcher_roles: ["planner"]
+heartbeat_interval_sec: 5
+
+llama_server_bin: "/path/to/llama-server"
+llama_server_startup_timeout_s: 45
+llama_server_probe_interval_s: 0.5
+
+model_roots:
+  - "/models"
+
+models:
+  - model_id: "planner-main"
+    path: "/models/planner-model.Q5_K_M.gguf"
+    roles: ["planner"]
+    ctx_size: 8192
+    gpu_layers: 60
+    threads: 8
+    batch_size: 1024
+    flash_attn: true
+```
+
+## 3. Node-agent config for writer on 192.168.1.101
+
+Save as `writer-node.yaml`:
+
+```yaml
+node_id: "gpu101-writer"
+listen_host: "192.168.1.101"
+listen_port: 8092
+
+dispatcher_base_url: "http://192.168.1.100:8080"
+dispatcher_node_key: "change-me-node-key"
+dispatcher_roles: ["writer"]
+heartbeat_interval_sec: 5
+
+llama_server_bin: "/path/to/llama-server"
+llama_server_startup_timeout_s: 45
+llama_server_probe_interval_s: 0.5
+
+model_roots:
+  - "/models"
+
+models:
+  - model_id: "writer-main"
+    path: "/models/writer-model.Q5_K_M.gguf"
+    roles: ["writer"]
+    ctx_size: 8192
+    gpu_layers: 60
+    threads: 8
+    batch_size: 1024
+    flash_attn: true
+```
+
+## 4. Node-agent config for critic on 192.168.1.102
+
+Save as `critic-node.yaml`:
+
+```yaml
+node_id: "gpu102-critic"
+listen_host: "192.168.1.102"
+listen_port: 8091
+
+dispatcher_base_url: "http://192.168.1.100:8080"
+dispatcher_node_key: "change-me-node-key"
+dispatcher_roles: ["critic"]
+heartbeat_interval_sec: 5
+
+llama_server_bin: "/path/to/llama-server"
+llama_server_startup_timeout_s: 45
+llama_server_probe_interval_s: 0.5
+
+model_roots:
+  - "/models"
+
+models:
+  - model_id: "critic-main"
+    path: "/models/critic-model.Q5_K_M.gguf"
+    roles: ["critic"]
+    ctx_size: 8192
+    gpu_layers: 60
+    threads: 8
+    batch_size: 1024
+    flash_attn: true
+```
+
+The `path` field is where you point to the actual GGUF weight file on that machine.
+That is the concrete model-weight binding for node-agent mode.
+
+## 5. Start the three node agents
+
+On `192.168.1.101`:
+
+```bash
+PYTHONPATH=src python -m rolemesh_node_agent.cli --config planner-node.yaml
+```
+
+In a second shell on `192.168.1.101`:
+
+```bash
+PYTHONPATH=src python -m rolemesh_node_agent.cli --config writer-node.yaml
+```
+
+On `192.168.1.102`:
+
+```bash
+PYTHONPATH=src python -m rolemesh_node_agent.cli --config critic-node.yaml
+```
+
+## 6. Register each node once
+
+The current node agent sends heartbeats automatically, but registration is still a one-time explicit step.
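The three per-node `curl` registrations below can also be scripted in one loop. This is a sketch, not project code: the `registration_request` helper is hypothetical, and the gateway address, node key, and `/v1/nodes/register` payload shape are taken from this document's examples.

```python
"""Sketch: build the three node registrations in one loop."""
import json
import urllib.request

GATEWAY = "http://192.168.1.100:8080"   # gateway host from this example
NODE_KEY = "change-me-node-key"         # must match the gateway's node_api_keys

# (node_id, base_url, roles) for each node agent in this topology.
NODES = [
    ("gpu101-planner", "http://192.168.1.101:8091", ["planner"]),
    ("gpu101-writer", "http://192.168.1.101:8092", ["writer"]),
    ("gpu102-critic", "http://192.168.1.102:8091", ["critic"]),
]


def registration_request(node_id, base_url, roles):
    """Build the POST /v1/nodes/register request for one node."""
    body = json.dumps(
        {"node_id": node_id, "base_url": base_url, "roles": roles}
    ).encode()
    return urllib.request.Request(
        f"{GATEWAY}/v1/nodes/register",
        data=body,
        headers={
            "Content-Type": "application/json",
            "X-RoleMesh-Node-Key": NODE_KEY,
        },
        method="POST",
    )


for node in NODES:
    req = registration_request(*node)
    # Send with: urllib.request.urlopen(req)  (requires the gateway to be up)
    print(req.get_method(), req.full_url, json.loads(req.data)["node_id"])
```

The equivalent one-at-a-time `curl` commands follow.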
+
+Register planner:
+
+```bash
+curl -sS -X POST http://192.168.1.100:8080/v1/nodes/register \
+  -H 'Content-Type: application/json' \
+  -H 'X-RoleMesh-Node-Key: change-me-node-key' \
+  -d '{
+    "node_id": "gpu101-planner",
+    "base_url": "http://192.168.1.101:8091",
+    "roles": ["planner"]
+  }'
+```
+
+Register writer:
+
+```bash
+curl -sS -X POST http://192.168.1.100:8080/v1/nodes/register \
+  -H 'Content-Type: application/json' \
+  -H 'X-RoleMesh-Node-Key: change-me-node-key' \
+  -d '{
+    "node_id": "gpu101-writer",
+    "base_url": "http://192.168.1.101:8092",
+    "roles": ["writer"]
+  }'
+```
+
+Register critic:
+
+```bash
+curl -sS -X POST http://192.168.1.100:8080/v1/nodes/register \
+  -H 'Content-Type: application/json' \
+  -H 'X-RoleMesh-Node-Key: change-me-node-key' \
+  -d '{
+    "node_id": "gpu102-critic",
+    "base_url": "http://192.168.1.102:8091",
+    "roles": ["critic"]
+  }'
+```
+
+After that, the heartbeat loop on each node agent keeps the registry entry fresh.
+
+## 7. Verify the topology
+
+List the currently healthy role aliases:
+
+```bash
+curl -sS http://192.168.1.100:8080/v1/models \
+  -H 'X-Api-Key: change-me-client-key'
+```
+
+Expected result:
+
+- `planner`, `writer`, and `critic` should appear once each
+- gateway metadata should show the registered nodes and their freshness
+
+Check one node directly:
+
+```bash
+curl -sS http://192.168.1.101:8091/v1/node/inventory
+```
+
+That endpoint shows:
+
+- discovered devices
+- current model inventory
+- device metrics
+- queue depth and in-flight work
+
+## 8. Send requests by role
+
+Planner request through the gateway:
+
+```bash
+curl -sS -X POST http://192.168.1.100:8080/v1/chat/completions \
+  -H 'Content-Type: application/json' \
+  -H 'X-Api-Key: change-me-client-key' \
+  -d '{
+    "model": "planner",
+    "messages": [{"role":"user","content":"Outline a release plan in 3 bullets."}]
+  }'
+```
+
+Writer request through the gateway:
+
+```bash
+curl -sS -X POST http://192.168.1.100:8080/v1/chat/completions \
+  -H 'Content-Type: application/json' \
+  -H 'X-Api-Key: change-me-client-key' \
+  -d '{
+    "model": "writer",
+    "messages": [{"role":"user","content":"Rewrite this as a concise status update."}]
+  }'
+```
+
+Critic request through the gateway:
+
+```bash
+curl -sS -X POST http://192.168.1.100:8080/v1/chat/completions \
+  -H 'Content-Type: application/json' \
+  -H 'X-Api-Key: change-me-client-key' \
+  -d '{
+    "model": "critic",
+    "messages": [{"role":"user","content":"List the top two flaws in this plan."}]
+  }'
+```
+
+## Operational notes
+
+- If a node stops heartbeating, the gateway marks it stale and removes it from discovered routing after the configured timeout.
+- `GET /ready` on the gateway only returns `200` when the configured `default_model` is currently routable.
+- The first request for a cold model may take longer because the node agent has to start or switch `llama-server`.
+- The node agent now waits for local `llama-server` readiness before forwarding the first request, so clients should not see transient upstream "Loading model" errors from a normal cold start.
+
+## When to use proxy mode instead
+
+If you do not need node-level inventory, heartbeats, or on-demand `llama-server` management, proxy mode is simpler:
+
+- run one inference server per role yourself
+- point the gateway at each server with `type: proxy`
+- let the gateway route aliases directly by `proxy_url`
+
+Use node-agent mode when you want RoleMesh to manage local `llama-server` processes and expose node inventory to the gateway.
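The verify-and-route steps above can be smoke-tested from a client machine with a short script. This is a minimal sketch, assuming only the gateway endpoints and headers shown in this document; `missing_roles` and `chat_request` are hypothetical helper names, not part of RoleMesh.

```python
"""Sketch: client-side smoke test for the three-role topology."""
import json
import urllib.request

GATEWAY = "http://192.168.1.100:8080"    # gateway host from this example
CLIENT_KEY = "change-me-client-key"      # must match the gateway's client_api_keys
EXPECTED_ROLES = {"planner", "writer", "critic"}


def missing_roles(models_payload):
    """Given a decoded GET /v1/models response, return the expected
    role aliases that are absent from the listing."""
    listed = {m["id"] for m in models_payload.get("data", [])}
    return EXPECTED_ROLES - listed


def chat_request(role, prompt):
    """Build a minimal chat completion request routed by role alias."""
    body = json.dumps({
        "model": role,
        "messages": [{"role": "user", "content": prompt}],
    }).encode()
    return urllib.request.Request(
        f"{GATEWAY}/v1/chat/completions",
        data=body,
        headers={
            "Content-Type": "application/json",
            "X-Api-Key": CLIENT_KEY,
        },
        method="POST",
    )


# A healthy topology lists every role alias exactly once, so nothing is missing.
payload = {"data": [{"id": "planner"}, {"id": "writer"}, {"id": "critic"}]}
print(missing_roles(payload))  # set()
```

Fetching the live `/v1/models` payload with `urllib.request.urlopen` and feeding it to `missing_roles` turns this into a one-command health check for the whole mesh.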