Adding multi-host node agent deployment example.
parent 11cab90efa
commit 10926b5558
README.md
@@ -169,6 +169,16 @@ curl -sS -X POST http://127.0.0.1:8000/v1/chat/completions \

If you prefer the provided example file, copy `configs/models.example.yaml` and adjust the `proxy_url` values.
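For example (assuming you run the gateway from the repository root):

```bash
# Start from the shipped example, then point the proxy_url values at your own backends.
cp configs/models.example.yaml configs/models.yaml
$EDITOR configs/models.yaml   # adjust proxy_url for each model entry
```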

## Worked Deployment Example

For a concrete multi-machine example, including:

- two node-agent processes on a dual-GPU host
- another node-agent on a second host
- project-defined roles `planner`, `writer`, and `critic`

see [docs/EXAMPLE_MULTI_NODE.md](docs/EXAMPLE_MULTI_NODE.md).

## Known Good Inference Backends

The gateway is designed to work with any backend that exposes OpenAI-compatible `GET /v1/models` and

@@ -69,6 +69,17 @@ Registered nodes age out of discovered-role routing after a heartbeat timeout.

- configure with `ROLE_MESH_NODE_STALE_AFTER_S`
- stale nodes remain visible to operators in the gateway metadata, but they no longer receive traffic
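For example, to age stale nodes out after 30 seconds (assuming, per the `_S` suffix, that the value is in seconds and read from the gateway's environment):

```bash
# Assumption: ROLE_MESH_NODE_STALE_AFTER_S is read at gateway startup, in seconds.
ROLE_MESH_NODE_STALE_AFTER_S=30 \
ROLE_MESH_CONFIG=configs/models.yaml \
uvicorn rolemesh_gateway.main:app --host 127.0.0.1 --port 8080
```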

## Worked example: multi-node role routing

For a concrete topology with:

- gateway on one host
- two node-agent processes on a dual-GPU machine
- one additional node-agent on a second machine
- roles `planner`, `writer`, and `critic`

see [EXAMPLE_MULTI_NODE.md](EXAMPLE_MULTI_NODE.md).

## Network binding and exposure (Step 2 hardening)

**Safe by default:** the gateway and node-agent CLIs bind to `127.0.0.1` (localhost) unless configured otherwise.
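To expose the gateway on a LAN interface deliberately, pass the host explicitly, as the worked example does:

```bash
# Binds to a routable LAN address instead of the localhost default.
uvicorn rolemesh_gateway.main:app --host 192.168.1.100 --port 8080
```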

@@ -0,0 +1,323 @@

# Example: Two GPUs on One Host, One Remote Host

This example shows a concrete RoleMesh layout with:

- a gateway on `192.168.1.100`
- two node-agent processes on `192.168.1.101`
- one node-agent process on `192.168.1.102`
- three project-defined roles: `planner`, `writer`, and `critic`

The intended topology is:

- `planner` on `192.168.1.101:8091`
- `writer` on `192.168.1.101:8092`
- `critic` on `192.168.1.102:8091`

This is a good pattern when:

- one host has multiple GPUs and you want separate node identities or role-specific configs
- another host contributes an additional model for a different role
- the gateway should route by role without the client needing to know which machine serves which model

## Before you start

Assumptions:

- gateway host: `192.168.1.100`
- dual-GPU host: `192.168.1.101`
- second model host: `192.168.1.102`
- all machines can reach the gateway over the LAN
- each node host has a working `llama-server` binary
- each node host has readable GGUF model files on disk

Important current limitation:

- the node agent does not yet expose a strict per-process GPU allowlist
- separate node-agent processes on `192.168.1.101` are still useful for separate ports, node IDs, and model catalogs
- but the current scheduler still discovers all visible local CUDA GPUs and chooses among them heuristically

So this example is a valid deployment shape, but if you need hard process-to-GPU partitioning, that still needs a follow-up code change.
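One possible interim workaround, not a substitute for that change: restrict each agent's GPU visibility with `CUDA_VISIBLE_DEVICES`. This only helps if the node agent and the `llama-server` processes it spawns inherit the environment, which is an assumption about the current process model:

```bash
# Hypothetical partitioning on the dual-GPU host; effective only if the
# spawned llama-server inherits CUDA_VISIBLE_DEVICES from its node agent.
CUDA_VISIBLE_DEVICES=0 PYTHONPATH=src python -m rolemesh_node_agent.cli --config planner-node.yaml &
CUDA_VISIBLE_DEVICES=1 PYTHONPATH=src python -m rolemesh_node_agent.cli --config writer-node.yaml &
```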

## 1. Gateway config on 192.168.1.100

Save as `configs/models.yaml` on the gateway host:

```yaml
version: 1
default_model: planner

auth:
  client_api_keys:
    - "change-me-client-key"
  node_api_keys:
    - "change-me-node-key"

models:
  planner:
    type: discovered
    role: planner
    strategy: round_robin

  writer:
    type: discovered
    role: writer
    strategy: round_robin

  critic:
    type: discovered
    role: critic
    strategy: round_robin
```

Run the gateway:

```bash
ROLE_MESH_CONFIG=configs/models.yaml \
uvicorn rolemesh_gateway.main:app --host 192.168.1.100 --port 8080
```

## 2. Node-agent config for planner on 192.168.1.101

Save as `planner-node.yaml`:

```yaml
node_id: "gpu101-planner"
listen_host: "192.168.1.101"
listen_port: 8091

dispatcher_base_url: "http://192.168.1.100:8080"
dispatcher_node_key: "change-me-node-key"
dispatcher_roles: ["planner"]
heartbeat_interval_sec: 5

llama_server_bin: "/path/to/llama-server"
llama_server_startup_timeout_s: 45
llama_server_probe_interval_s: 0.5

model_roots:
  - "/models"

models:
  - model_id: "planner-main"
    path: "/models/planner-model.Q5_K_M.gguf"
    roles: ["planner"]
    ctx_size: 8192
    gpu_layers: 60
    threads: 8
    batch_size: 1024
    flash_attn: true
```

## 3. Node-agent config for writer on 192.168.1.101

Save as `writer-node.yaml`:

```yaml
node_id: "gpu101-writer"
listen_host: "192.168.1.101"
listen_port: 8092

dispatcher_base_url: "http://192.168.1.100:8080"
dispatcher_node_key: "change-me-node-key"
dispatcher_roles: ["writer"]
heartbeat_interval_sec: 5

llama_server_bin: "/path/to/llama-server"
llama_server_startup_timeout_s: 45
llama_server_probe_interval_s: 0.5

model_roots:
  - "/models"

models:
  - model_id: "writer-main"
    path: "/models/writer-model.Q5_K_M.gguf"
    roles: ["writer"]
    ctx_size: 8192
    gpu_layers: 60
    threads: 8
    batch_size: 1024
    flash_attn: true
```

## 4. Node-agent config for critic on 192.168.1.102

Save as `critic-node.yaml`:

```yaml
node_id: "gpu102-critic"
listen_host: "192.168.1.102"
listen_port: 8091

dispatcher_base_url: "http://192.168.1.100:8080"
dispatcher_node_key: "change-me-node-key"
dispatcher_roles: ["critic"]
heartbeat_interval_sec: 5

llama_server_bin: "/path/to/llama-server"
llama_server_startup_timeout_s: 45
llama_server_probe_interval_s: 0.5

model_roots:
  - "/models"

models:
  - model_id: "critic-main"
    path: "/models/critic-model.Q5_K_M.gguf"
    roles: ["critic"]
    ctx_size: 8192
    gpu_layers: 60
    threads: 8
    batch_size: 1024
    flash_attn: true
```

The `path` field is where you point to the actual GGUF weight file on that machine. That is the concrete model-weight binding for node-agent mode.
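Before starting the agents, a quick sanity check that each host can actually read its configured weight files avoids a confusing startup failure:

```bash
# On 192.168.1.101; repeat with the critic path on 192.168.1.102.
test -r /models/planner-model.Q5_K_M.gguf && test -r /models/writer-model.Q5_K_M.gguf \
  && echo "model files readable" || echo "missing or unreadable model file"
```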

## 5. Start the three node agents

On `192.168.1.101`:

```bash
PYTHONPATH=src python -m rolemesh_node_agent.cli --config planner-node.yaml
```

In a second shell on `192.168.1.101`:

```bash
PYTHONPATH=src python -m rolemesh_node_agent.cli --config writer-node.yaml
```

On `192.168.1.102`:

```bash
PYTHONPATH=src python -m rolemesh_node_agent.cli --config critic-node.yaml
```

## 6. Register each node once

The current node agent sends heartbeats automatically, but registration is still a one-time explicit step.

Register planner:

```bash
curl -sS -X POST http://192.168.1.100:8080/v1/nodes/register \
  -H 'Content-Type: application/json' \
  -H 'X-RoleMesh-Node-Key: change-me-node-key' \
  -d '{
    "node_id": "gpu101-planner",
    "base_url": "http://192.168.1.101:8091",
    "roles": ["planner"]
  }'
```

Register writer:

```bash
curl -sS -X POST http://192.168.1.100:8080/v1/nodes/register \
  -H 'Content-Type: application/json' \
  -H 'X-RoleMesh-Node-Key: change-me-node-key' \
  -d '{
    "node_id": "gpu101-writer",
    "base_url": "http://192.168.1.101:8092",
    "roles": ["writer"]
  }'
```

Register critic:

```bash
curl -sS -X POST http://192.168.1.100:8080/v1/nodes/register \
  -H 'Content-Type: application/json' \
  -H 'X-RoleMesh-Node-Key: change-me-node-key' \
  -d '{
    "node_id": "gpu102-critic",
    "base_url": "http://192.168.1.102:8091",
    "roles": ["critic"]
  }'
```

After that, the heartbeat loop on each node agent keeps the registry entry fresh.

## 7. Verify the topology

List the currently healthy role aliases:

```bash
curl -sS http://192.168.1.100:8080/v1/models \
  -H 'X-Api-Key: change-me-client-key'
```

Expected result:

- `planner`, `writer`, and `critic` should appear once each
- gateway metadata should show the registered nodes and their freshness
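To check the alias list from the shell, assuming the standard OpenAI-style response shape with a `data` array:

```bash
curl -sS http://192.168.1.100:8080/v1/models \
  -H 'X-Api-Key: change-me-client-key' | jq -r '.data[].id'
# expected (order may vary):
# planner
# writer
# critic
```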

Check one node directly:

```bash
curl -sS http://192.168.1.101:8091/v1/node/inventory
```

That endpoint shows:

- discovered devices
- current model inventory
- device metrics
- queue depth and in-flight work

## 8. Send requests by role

Planner request through the gateway:

```bash
curl -sS -X POST http://192.168.1.100:8080/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -H 'X-Api-Key: change-me-client-key' \
  -d '{
    "model": "planner",
    "messages": [{"role":"user","content":"Outline a release plan in 3 bullets."}]
  }'
```

Writer request through the gateway:

```bash
curl -sS -X POST http://192.168.1.100:8080/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -H 'X-Api-Key: change-me-client-key' \
  -d '{
    "model": "writer",
    "messages": [{"role":"user","content":"Rewrite this as a concise status update."}]
  }'
```

Critic request through the gateway:

```bash
curl -sS -X POST http://192.168.1.100:8080/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -H 'X-Api-Key: change-me-client-key' \
  -d '{
    "model": "critic",
    "messages": [{"role":"user","content":"List the top two flaws in this plan."}]
  }'
```

## Operational notes

- If a node stops heartbeating, the gateway marks it stale and removes it from discovered routing after the configured timeout.
- `GET /ready` on the gateway only returns `200` when the configured `default_model` is currently routable.
- The first request for a cold model may take longer because the node agent has to start or switch `llama-server`.
- The node agent now waits for local `llama-server` readiness before forwarding the first request, so clients should not see transient upstream "Loading model" errors from a normal cold start.
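A minimal readiness probe against that `/ready` endpoint, assuming it is served unauthenticated like most health endpoints:

```bash
# Prints 200 once the default_model (planner) is routable.
curl -sS -o /dev/null -w '%{http_code}\n' http://192.168.1.100:8080/ready
```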

## When to use proxy mode instead

If you do not need node-level inventory, heartbeats, or on-demand `llama-server` management, proxy mode is simpler:

- run one inference server per role yourself
- point the gateway at each server with `type: proxy` (sketched below)
- let the gateway route aliases directly by `proxy_url`
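A minimal sketch of such an entry, assuming proxy entries live in the same `models` map as the discovered ones (check `configs/models.example.yaml` for the authoritative schema):

```yaml
models:
  writer:
    type: proxy
    # Hypothetical direct binding to the writer backend from this example.
    proxy_url: "http://192.168.1.101:8092"
```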

Use node-agent mode when you want RoleMesh to manage local `llama-server` processes and expose node inventory to the gateway.