# Example: Two GPUs on One Host, One Remote Host

This example shows a concrete RoleMesh layout with:

- a gateway on `192.168.1.100`
- two node-agent processes on `192.168.1.101`
- one node-agent process on `192.168.1.102`
- three project-defined roles: `planner`, `writer`, and `critic`

The intended topology is:

- `planner` on `192.168.1.101:8091`
- `writer` on `192.168.1.101:8092`
- `critic` on `192.168.1.102:8091`

This is a good pattern when:

- one host has multiple GPUs and you want separate node identities or role-specific configs
- another host contributes an additional model for a different role
- the gateway should route by role without the client needing to know which machine serves which model

## Before you start

Assumptions:

- gateway host: `192.168.1.100`
- dual-GPU host: `192.168.1.101`
- second model host: `192.168.1.102`
- all machines can reach the gateway over the LAN
- each node host has a working `llama-server` binary
- each node host has readable GGUF model files on disk

Important current limitation:

- the node agent does not yet expose a strict per-process GPU allowlist
- separate node-agent processes on `192.168.1.101` are still useful for separate ports, node IDs, and model catalogs
- but the current scheduler still discovers all visible local CUDA GPUs and chooses among them heuristically (see the environment-level workaround sketched after section 3)

So this example is a valid deployment shape, but if you need hard process-to-GPU partitioning, that still needs a follow-up code change.

## 1. Gateway config on 192.168.1.100

Save as `configs/models.yaml` on the gateway host:

```yaml
version: 1
default_model: planner

auth:
  client_api_keys:
    - "change-me-client-key"
  node_api_keys:
    - "change-me-node-key"

models:
  planner:
    type: discovered
    role: planner
    strategy: round_robin
  writer:
    type: discovered
    role: writer
    strategy: round_robin
  critic:
    type: discovered
    role: critic
    strategy: round_robin
```

Run the gateway:

```bash
ROLE_MESH_CONFIG=configs/models.yaml \
  uvicorn rolemesh_gateway.main:app --host 192.168.1.100 --port 8080
```

## 2. Node-agent config for planner on 192.168.1.101

Save as `planner-node.yaml`:

```yaml
node_id: "gpu101-planner"
listen_host: "192.168.1.101"
listen_port: 8091

dispatcher_base_url: "http://192.168.1.100:8080"
dispatcher_node_key: "change-me-node-key"
dispatcher_roles: ["planner"]
heartbeat_interval_sec: 5

llama_server_bin: "/path/to/llama-server"
llama_server_startup_timeout_s: 45
llama_server_probe_interval_s: 0.5

model_roots:
  - "/models"

models:
  - model_id: "planner-main"
    path: "/models/planner-model.Q5_K_M.gguf"
    roles: ["planner"]
    ctx_size: 8192
    gpu_layers: 60
    threads: 8
    batch_size: 1024
    flash_attn: true
```

## 3. Node-agent config for writer on 192.168.1.101

Save as `writer-node.yaml`:

```yaml
node_id: "gpu101-writer"
listen_host: "192.168.1.101"
listen_port: 8092

dispatcher_base_url: "http://192.168.1.100:8080"
dispatcher_node_key: "change-me-node-key"
dispatcher_roles: ["writer"]
heartbeat_interval_sec: 5

llama_server_bin: "/path/to/llama-server"
llama_server_startup_timeout_s: 45
llama_server_probe_interval_s: 0.5

model_roots:
  - "/models"

models:
  - model_id: "writer-main"
    path: "/models/writer-model.Q5_K_M.gguf"
    roles: ["writer"]
    ctx_size: 8192
    gpu_layers: 60
    threads: 8
    batch_size: 1024
    flash_attn: true
```
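As noted under "Before you start", the node agent does not yet enforce a per-process GPU allowlist. If you want a soft partition between the two agents on `192.168.1.101`, one option is to restrict CUDA visibility per process when you start them (the plain start commands are in section 5). This is only a sketch under the assumption that the `llama-server` processes spawned by each agent inherit and honor `CUDA_VISIBLE_DEVICES`; the node agent's own device discovery may still enumerate every physical GPU (NVML-based discovery typically ignores this variable), so treat it as a hint, not hard isolation.

```bash
# On 192.168.1.101: soft GPU partitioning via CUDA visibility.
# Assumption: the llama-server processes launched by each agent honor
# CUDA_VISIBLE_DEVICES; the agent's discovery/scheduling may still see all GPUs.
CUDA_VISIBLE_DEVICES=0 PYTHONPATH=src \
  python -m rolemesh_node_agent.cli --config planner-node.yaml

# In a second shell, pin the writer agent to the second GPU:
CUDA_VISIBLE_DEVICES=1 PYTHONPATH=src \
  python -m rolemesh_node_agent.cli --config writer-node.yaml
```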
## 4. Node-agent config for critic on 192.168.1.102

Save as `critic-node.yaml`:

```yaml
node_id: "gpu102-critic"
listen_host: "192.168.1.102"
listen_port: 8091

dispatcher_base_url: "http://192.168.1.100:8080"
dispatcher_node_key: "change-me-node-key"
dispatcher_roles: ["critic"]
heartbeat_interval_sec: 5

llama_server_bin: "/path/to/llama-server"
llama_server_startup_timeout_s: 45
llama_server_probe_interval_s: 0.5

model_roots:
  - "/models"

models:
  - model_id: "critic-main"
    path: "/models/critic-model.Q5_K_M.gguf"
    roles: ["critic"]
    ctx_size: 8192
    gpu_layers: 60
    threads: 8
    batch_size: 1024
    flash_attn: true
```

The `path` field is where you point to the actual GGUF weight file on that machine. That is the concrete model-weight binding for node-agent mode.

## 5. Start the three node agents

On `192.168.1.101`:

```bash
PYTHONPATH=src python -m rolemesh_node_agent.cli --config planner-node.yaml
```

In a second shell on `192.168.1.101`:

```bash
PYTHONPATH=src python -m rolemesh_node_agent.cli --config writer-node.yaml
```

On `192.168.1.102`:

```bash
PYTHONPATH=src python -m rolemesh_node_agent.cli --config critic-node.yaml
```

## 6. Register each node once

The current node agent sends heartbeats automatically, but registration is still a one-time explicit step.

Register planner:

```bash
curl -sS -X POST http://192.168.1.100:8080/v1/nodes/register \
  -H 'Content-Type: application/json' \
  -H 'X-RoleMesh-Node-Key: change-me-node-key' \
  -d '{
    "node_id": "gpu101-planner",
    "base_url": "http://192.168.1.101:8091",
    "roles": ["planner"]
  }'
```

Register writer:

```bash
curl -sS -X POST http://192.168.1.100:8080/v1/nodes/register \
  -H 'Content-Type: application/json' \
  -H 'X-RoleMesh-Node-Key: change-me-node-key' \
  -d '{
    "node_id": "gpu101-writer",
    "base_url": "http://192.168.1.101:8092",
    "roles": ["writer"]
  }'
```

Register critic:

```bash
curl -sS -X POST http://192.168.1.100:8080/v1/nodes/register \
  -H 'Content-Type: application/json' \
  -H 'X-RoleMesh-Node-Key: change-me-node-key' \
  -d '{
    "node_id": "gpu102-critic",
    "base_url": "http://192.168.1.102:8091",
    "roles": ["critic"]
  }'
```

After that, the heartbeat loop on each node agent keeps the registry entry fresh.

## 7. Verify the topology

List the currently healthy role aliases:

```bash
curl -sS http://192.168.1.100:8080/v1/models \
  -H 'X-Api-Key: change-me-client-key'
```

Expected result:

- `planner`, `writer`, and `critic` should appear once each
- gateway metadata should show the registered nodes and their freshness

Check one node directly:

```bash
curl -sS http://192.168.1.101:8091/v1/node/inventory
```

That endpoint shows:

- discovered devices
- current model inventory
- device metrics
- queue depth and in-flight work
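To check all three node agents at once, a small loop over the same inventory endpoint works. This assumes the node hosts are reachable from wherever you run it; the host/port pairs are the ones configured above.

```bash
# Hit the /v1/node/inventory endpoint on each configured node agent.
for node in 192.168.1.101:8091 192.168.1.101:8092 192.168.1.102:8091; do
  echo "== $node =="
  curl -sS "http://$node/v1/node/inventory"
  echo
done
```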
## 8. Send requests by role

Planner request through the gateway:

```bash
curl -sS -X POST http://192.168.1.100:8080/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -H 'X-Api-Key: change-me-client-key' \
  -d '{
    "model": "planner",
    "messages": [{"role":"user","content":"Outline a release plan in 3 bullets."}]
  }'
```

Writer request through the gateway:

```bash
curl -sS -X POST http://192.168.1.100:8080/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -H 'X-Api-Key: change-me-client-key' \
  -d '{
    "model": "writer",
    "messages": [{"role":"user","content":"Rewrite this as a concise status update."}]
  }'
```

Critic request through the gateway:

```bash
curl -sS -X POST http://192.168.1.100:8080/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -H 'X-Api-Key: change-me-client-key' \
  -d '{
    "model": "critic",
    "messages": [{"role":"user","content":"List the top two flaws in this plan."}]
  }'
```

## Operational notes

- If a node stops heartbeating, the gateway marks it stale and removes it from discovered routing after the configured timeout.
- `GET /ready` on the gateway only returns `200` when the configured `default_model` is currently routable.
- The first request for a cold model may take longer because the node agent has to start or switch `llama-server`.
- The node agent now waits for local `llama-server` readiness before forwarding the first request, so clients should not see transient upstream "Loading model" errors from a normal cold start.

## When to use proxy mode instead

If you do not need node-level inventory, heartbeats, or on-demand `llama-server` management, proxy mode is simpler:

- run one inference server per role yourself
- point the gateway at each server with `type: proxy`
- let the gateway route aliases directly by `proxy_url`

Use node-agent mode when you want RoleMesh to manage local `llama-server` processes and expose node inventory to the gateway.
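For comparison, a proxy-mode version of the gateway config might look roughly like the sketch below. Only `type: proxy` and `proxy_url` come from the description above; the surrounding layout mirrors the discovered-mode example in section 1, and the ports (`9001`, `9002`) are placeholders for wherever you run your own inference servers, so check this against the actual config schema before relying on it.

```yaml
# Sketch only: a proxy-mode configs/models.yaml. Layout mirrors section 1;
# the proxy_url ports below are hypothetical placeholders for your own servers.
version: 1
default_model: planner

auth:
  client_api_keys:
    - "change-me-client-key"

models:
  planner:
    type: proxy
    proxy_url: "http://192.168.1.101:9001"
  writer:
    type: proxy
    proxy_url: "http://192.168.1.101:9002"
  critic:
    type: proxy
    proxy_url: "http://192.168.1.102:9001"
```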