# Example: Two GPUs on One Host, One Remote Host

This example shows a concrete RoleMesh layout with:

- a gateway on `192.168.1.100`
- two node-agent processes on `192.168.1.101`
- one node-agent process on `192.168.1.102`
- three project-defined roles: `planner`, `writer`, and `critic`

The intended topology is:

- `planner` on `192.168.1.101:8091`
- `writer` on `192.168.1.101:8092`
- `critic` on `192.168.1.102:8091`

This is a good pattern when:

- one host has multiple GPUs and you want separate node identities or role-specific configs
- another host contributes an additional model for a different role
- the gateway should route by role without the client needing to know which machine serves which model

## Before you start

Assumptions:

- gateway host: `192.168.1.100`
- dual-GPU host: `192.168.1.101`
- second model host: `192.168.1.102`
- all machines can reach the gateway over the LAN
- each node host has a working `llama-server` binary
- each node host has readable GGUF model files on disk

Important current limitation:

- the node agent does not yet expose a strict per-process GPU allowlist
- separate node-agent processes on `192.168.1.101` are still useful for separate ports, node IDs, and model catalogs
- but the current scheduler still discovers all visible local CUDA GPUs and chooses among them heuristically (see the environment-level workaround sketched after section 3)

So this example is a valid deployment shape, but if you need hard process-to-GPU partitioning, that still needs a follow-up code change.

## 1. Gateway config on 192.168.1.100

Save as `configs/models.yaml` on the gateway host:

```yaml
version: 1
default_model: planner

auth:
  client_api_keys:
    - "change-me-client-key"
  node_api_keys:
    - "change-me-node-key"

models:
  planner:
    type: discovered
    role: planner
    strategy: round_robin
  writer:
    type: discovered
    role: writer
    strategy: round_robin
  critic:
    type: discovered
    role: critic
    strategy: round_robin
```

Run the gateway:

```bash
ROLE_MESH_CONFIG=configs/models.yaml \
  uvicorn rolemesh_gateway.main:app --host 192.168.1.100 --port 8080
```

## 2. Node-agent config for planner on 192.168.1.101

Save as `planner-node.yaml`:

```yaml
node_id: "gpu101-planner"
listen_host: "192.168.1.101"
listen_port: 8091

dispatcher_base_url: "http://192.168.1.100:8080"
dispatcher_node_key: "change-me-node-key"
dispatcher_roles: ["planner"]
heartbeat_interval_sec: 5

llama_server_bin: "/path/to/llama-server"
llama_server_startup_timeout_s: 45
llama_server_probe_interval_s: 0.5

model_roots:
  - "/models"

models:
  - model_id: "planner-main"
    path: "/models/planner-model.Q5_K_M.gguf"
    roles: ["planner"]
    ctx_size: 8192
    gpu_layers: 60
    threads: 8
    batch_size: 1024
    flash_attn: true
```

## 3. Node-agent config for writer on 192.168.1.101

Save as `writer-node.yaml`:

```yaml
node_id: "gpu101-writer"
listen_host: "192.168.1.101"
listen_port: 8092

dispatcher_base_url: "http://192.168.1.100:8080"
dispatcher_node_key: "change-me-node-key"
dispatcher_roles: ["writer"]
heartbeat_interval_sec: 5

llama_server_bin: "/path/to/llama-server"
llama_server_startup_timeout_s: 45
llama_server_probe_interval_s: 0.5

model_roots:
  - "/models"

models:
  - model_id: "writer-main"
    path: "/models/writer-model.Q5_K_M.gguf"
    roles: ["writer"]
    ctx_size: 8192
    gpu_layers: 60
    threads: 8
    batch_size: 1024
    flash_attn: true
```
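As noted under "Before you start", the node agent does not yet enforce a per-process GPU allowlist. If you want a soft partition between the two agents on `192.168.1.101`, one option is to restrict CUDA visibility per process when you start them (the plain start commands are in section 5). This is only a sketch under the assumption that the `llama-server` processes spawned by each agent inherit and honor `CUDA_VISIBLE_DEVICES`; the node agent's own device discovery may still enumerate every physical GPU (NVML-based discovery typically ignores this variable), so treat it as a hint, not hard isolation.

```bash
# On 192.168.1.101: soft GPU partitioning via CUDA visibility.
# Assumption: the llama-server processes launched by each agent honor
# CUDA_VISIBLE_DEVICES; the agent's discovery/scheduling may still see all GPUs.
CUDA_VISIBLE_DEVICES=0 PYTHONPATH=src \
  python -m rolemesh_node_agent.cli --config planner-node.yaml

# In a second shell, pin the writer agent to the second GPU:
CUDA_VISIBLE_DEVICES=1 PYTHONPATH=src \
  python -m rolemesh_node_agent.cli --config writer-node.yaml
```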
## 4. Node-agent config for critic on 192.168.1.102

Save as `critic-node.yaml`:

```yaml
node_id: "gpu102-critic"
listen_host: "192.168.1.102"
listen_port: 8091

dispatcher_base_url: "http://192.168.1.100:8080"
dispatcher_node_key: "change-me-node-key"
dispatcher_roles: ["critic"]
heartbeat_interval_sec: 5

llama_server_bin: "/path/to/llama-server"
llama_server_startup_timeout_s: 45
llama_server_probe_interval_s: 0.5

model_roots:
  - "/models"

models:
  - model_id: "critic-main"
    path: "/models/critic-model.Q5_K_M.gguf"
    roles: ["critic"]
    ctx_size: 8192
    gpu_layers: 60
    threads: 8
    batch_size: 1024
    flash_attn: true
```

The `path` field is where you point to the actual GGUF weight file on that machine. That is the concrete model-weight binding for node-agent mode.

## 5. Start the three node agents

On `192.168.1.101`:

```bash
PYTHONPATH=src python -m rolemesh_node_agent.cli --config planner-node.yaml
```

In a second shell on `192.168.1.101`:

```bash
PYTHONPATH=src python -m rolemesh_node_agent.cli --config writer-node.yaml
```

On `192.168.1.102`:

```bash
PYTHONPATH=src python -m rolemesh_node_agent.cli --config critic-node.yaml
```

## 6. Register each node once

The current node agent sends heartbeats automatically, but registration is still a one-time explicit step.

Register planner:

```bash
curl -sS -X POST http://192.168.1.100:8080/v1/nodes/register \
  -H 'Content-Type: application/json' \
  -H 'X-RoleMesh-Node-Key: change-me-node-key' \
  -d '{
    "node_id": "gpu101-planner",
    "base_url": "http://192.168.1.101:8091",
    "roles": ["planner"]
  }'
```

Register writer:

```bash
curl -sS -X POST http://192.168.1.100:8080/v1/nodes/register \
  -H 'Content-Type: application/json' \
  -H 'X-RoleMesh-Node-Key: change-me-node-key' \
  -d '{
    "node_id": "gpu101-writer",
    "base_url": "http://192.168.1.101:8092",
    "roles": ["writer"]
  }'
```

Register critic:

```bash
curl -sS -X POST http://192.168.1.100:8080/v1/nodes/register \
  -H 'Content-Type: application/json' \
  -H 'X-RoleMesh-Node-Key: change-me-node-key' \
  -d '{
    "node_id": "gpu102-critic",
    "base_url": "http://192.168.1.102:8091",
    "roles": ["critic"]
  }'
```

After that, the heartbeat loop on each node agent keeps the registry entry fresh.

## 7. Verify the topology

List the currently healthy role aliases:

```bash
curl -sS http://192.168.1.100:8080/v1/models \
  -H 'X-Api-Key: change-me-client-key'
```

Expected result:

- `planner`, `writer`, and `critic` should appear once each
- gateway metadata should show the registered nodes and their freshness

Check one node directly:

```bash
curl -sS http://192.168.1.101:8091/v1/node/inventory
```

That endpoint shows:

- discovered devices
- current model inventory
- device metrics
- queue depth and in-flight work
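To check all three node agents at once, a small loop over the same inventory endpoint works. This assumes the node hosts are reachable from wherever you run it; the host/port pairs are the ones configured above.

```bash
# Hit the /v1/node/inventory endpoint on each configured node agent.
for node in 192.168.1.101:8091 192.168.1.101:8092 192.168.1.102:8091; do
  echo "== $node =="
  curl -sS "http://$node/v1/node/inventory"
  echo
done
```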
## 8. Send requests by role

Planner request through the gateway:

```bash
curl -sS -X POST http://192.168.1.100:8080/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -H 'X-Api-Key: change-me-client-key' \
  -d '{
    "model": "planner",
    "messages": [{"role":"user","content":"Outline a release plan in 3 bullets."}]
  }'
```

Writer request through the gateway:

```bash
curl -sS -X POST http://192.168.1.100:8080/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -H 'X-Api-Key: change-me-client-key' \
  -d '{
    "model": "writer",
    "messages": [{"role":"user","content":"Rewrite this as a concise status update."}]
  }'
```

Critic request through the gateway:

```bash
curl -sS -X POST http://192.168.1.100:8080/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -H 'X-Api-Key: change-me-client-key' \
  -d '{
    "model": "critic",
    "messages": [{"role":"user","content":"List the top two flaws in this plan."}]
  }'
```

## Operational notes

- If a node stops heartbeating, the gateway marks it stale and removes it from discovered routing after the configured timeout.
- `GET /ready` on the gateway only returns `200` when the configured `default_model` is currently routable.
- The first request for a cold model may take longer because the node agent has to start or switch `llama-server`.
- The node agent now waits for local `llama-server` readiness before forwarding the first request, so clients should not see transient upstream "Loading model" errors from a normal cold start.

## When to use proxy mode instead

If you do not need node-level inventory, heartbeats, or on-demand `llama-server` management, proxy mode is simpler:

- run one inference server per role yourself
- point the gateway at each server with `type: proxy`
- let the gateway route aliases directly by `proxy_url`

Use node-agent mode when you want RoleMesh to manage local `llama-server` processes and expose node inventory to the gateway.
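For comparison, a proxy-mode version of the gateway config might look roughly like the sketch below. Only `type: proxy` and `proxy_url` come from the description above; the surrounding layout mirrors the discovered-mode example in section 1, and the ports (`9001`, `9002`) are placeholders for wherever you run your own inference servers, so check this against the actual config schema before relying on it.

```yaml
# Sketch only: a proxy-mode configs/models.yaml. Layout mirrors section 1;
# the proxy_url ports below are hypothetical placeholders for your own servers.
version: 1
default_model: planner

auth:
  client_api_keys:
    - "change-me-client-key"

models:
  planner:
    type: proxy
    proxy_url: "http://192.168.1.101:9001"
  writer:
    type: proxy
    proxy_url: "http://192.168.1.101:9002"
  critic:
    type: proxy
    proxy_url: "http://192.168.1.102:9001"
```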