# Node Agent

The **RoleMesh Node Agent** runs on each compute host and manages **persistent** `llama.cpp` servers
(one per device, e.g. one per GPU). It can:

- expose OpenAI-compatible endpoints locally (`/v1/models`, `/v1/chat/completions`)
- register + heartbeat to the Dispatcher/Gateway (`/v1/nodes/register`, `/v1/nodes/heartbeat`)
- report inventory + utilization (`/v1/node/inventory`)

## Persistent server model

For each GPU device, the node agent starts a dedicated `llama-server` process, pinned via
environment variables (e.g. `CUDA_VISIBLE_DEVICES=0` for `gpu:0`) and bound to `127.0.0.1:<port>`.
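The pinning described above can be sketched as follows. This is a minimal illustration, not the agent's actual launcher: `build_launch_spec`, the port, and the model path are hypothetical, and only the `CUDA_VISIBLE_DEVICES` variable and the `127.0.0.1` bind come from this document. The `--model`, `--host`, and `--port` flags are standard `llama-server` options.

```python
import os
from typing import Dict, List, Tuple


def build_launch_spec(gpu_index: int, port: int, model_path: str) -> Tuple[List[str], Dict[str, str]]:
    """Return (command, environment) for one llama-server pinned to one GPU.

    Hypothetical helper: the real agent may assemble its command differently.
    """
    env = dict(os.environ)
    # Expose only the chosen GPU to the child process.
    env["CUDA_VISIBLE_DEVICES"] = str(gpu_index)
    cmd = [
        "llama-server",
        "--model", model_path,
        "--host", "127.0.0.1",  # keep the managed server local-only
        "--port", str(port),
    ]
    return cmd, env


# Example values are placeholders.
cmd, env = build_launch_spec(gpu_index=0, port=8081, model_path="/models/example.gguf")
# subprocess.Popen(cmd, env=env) would start the process.
```

Building the command and environment separately from the `subprocess.Popen` call keeps the pinning logic easy to inspect and test without a GPU present.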

In the scaffold, model switching is handled by **restarting** the device's `llama-server`.
The agent waits for the replacement `llama-server` to report readiness before proxying the first request.
If startup or switching exceeds `llama_server_startup_timeout_s`, the request fails with a `503` instead of passing
through a transient upstream "Loading model" error.

Device selection is still simple, but it is not hard-coded to the first GPU:

- first preference: a device that already has the requested model loaded
- otherwise: the device with the most free VRAM and least queue pressure

Two per-device limits apply regardless of which device is chosen:

- requests are serialized per device
- each device has a bounded pending-request limit for backpressure (`max_pending_requests_per_device`)
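The selection rules and the pending-request limit above might look roughly like this. It is a sketch under stated assumptions: `DeviceState` and `pick_device` are hypothetical names, and the way free VRAM and queue pressure are combined (most free VRAM first, fewer pending requests as a tie-breaker) is one plausible reading, not necessarily what the agent does.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class DeviceState:
    """Hypothetical per-device snapshot used only for this illustration."""
    device_id: str
    loaded_model: Optional[str]
    free_vram_mb: int
    pending: int  # in-flight plus queued requests


def pick_device(devices: List[DeviceState], model: str, max_pending: int = 2) -> Optional[DeviceState]:
    """Return the device to use, or None if every device is at its pending limit."""
    # Backpressure: devices at the limit accept no new work.
    candidates = [d for d in devices if d.pending < max_pending]
    if not candidates:
        return None
    # First preference: a device that already has the model loaded.
    warm = [d for d in candidates if d.loaded_model == model]
    pool = warm or candidates
    # Assumption: rank by free VRAM, then by fewer pending requests.
    return max(pool, key=lambda d: (d.free_vram_mb, -d.pending))


devices = [
    DeviceState("gpu:0", loaded_model="model-a", free_vram_mb=1000, pending=1),
    DeviceState("gpu:1", loaded_model=None, free_vram_mb=8000, pending=0),
]
warm_choice = pick_device(devices, "model-a")  # gpu:0 already has model-a loaded
cold_choice = pick_device(devices, "model-b")  # nothing warm, so most free VRAM wins
```

When `pick_device` returns `None`, the caller would reject the request rather than queue it indefinitely.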

## Backends

Adapters are implemented as runtime backends:

- `cuda`: scaffold implementation (NVIDIA via `nvidia-smi`)
- `metal`, `rocm`, `sycl`, `vulkan`: stubs with placeholders for device discovery and metrics

The framework keeps scheduling decisions backend-agnostic by standardizing on:
`DeviceRef` + `DeviceMetrics` + `ensure_server(...)`.
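To illustrate how such a contract keeps scheduling code independent of the backend, here is a small sketch. Only the names `DeviceRef`, `DeviceMetrics`, and `ensure_server` come from this document. Every field, the `list_devices` and `read_metrics` methods, the `ensure_server` signature and return value, and `FakeBackend` are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import List, Protocol


@dataclass(frozen=True)
class DeviceRef:
    backend: str    # e.g. "cuda" (hypothetical field)
    device_id: str  # e.g. "gpu:0" (hypothetical field)


@dataclass
class DeviceMetrics:
    free_vram_mb: int       # hypothetical field
    utilization_pct: float  # hypothetical field


class Backend(Protocol):
    """Hypothetical interface; the real signatures may differ."""

    def list_devices(self) -> List[DeviceRef]: ...
    def read_metrics(self, device: DeviceRef) -> DeviceMetrics: ...
    def ensure_server(self, device: DeviceRef, model: str) -> str: ...


class FakeBackend:
    """In-memory stand-in so the example runs without any GPU."""

    def list_devices(self) -> List[DeviceRef]:
        return [DeviceRef("fake", "gpu:0"), DeviceRef("fake", "gpu:1")]

    def read_metrics(self, device: DeviceRef) -> DeviceMetrics:
        free = {"gpu:0": 2048, "gpu:1": 6144}[device.device_id]
        return DeviceMetrics(free_vram_mb=free, utilization_pct=0.0)

    def ensure_server(self, device: DeviceRef, model: str) -> str:
        # Assumed to return the local base URL of the managed server.
        port = 8081 + int(device.device_id.split(":")[1])
        return f"http://127.0.0.1:{port}"


def route(backend: Backend, model: str) -> str:
    """Scheduling code touches only the shared types, never backend details."""
    device = max(backend.list_devices(), key=lambda d: backend.read_metrics(d).free_vram_mb)
    return backend.ensure_server(device, model)


url = route(FakeBackend(), "example-model")
```

Because `route` depends only on the protocol, swapping `FakeBackend` for a CUDA or Metal adapter would not require changes to the scheduling logic.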

## Running

```bash
pip install -e .
rolemesh-node-agent --config configs/node_agent.example.yaml
```

### Startup timing guards

Two config knobs control how long the node agent waits for a managed `llama-server` to become ready, and a third bounds the per-device request queue:

```yaml
llama_server_startup_timeout_s: 30.0
llama_server_probe_interval_s: 0.5
max_pending_requests_per_device: 2
```

- `llama_server_startup_timeout_s`: maximum time to wait for a newly started or switched model
- `llama_server_probe_interval_s`: polling interval for readiness checks
- `max_pending_requests_per_device`: maximum in-flight plus queued requests allowed per device before new requests are rejected

The readiness probe checks the managed server's local `GET /health` and `GET /v1/models` endpoints.
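A polling loop of this kind can be sketched with the standard library alone. The function names `http_ok` and `wait_until_ready` are hypothetical, and the real probe may treat response bodies or status codes differently. The defaults mirror the config values shown above.

```python
import http.client
import time
import urllib.request
from typing import Callable


def http_ok(url: str, timeout_s: float = 1.0) -> bool:
    """Return True only if a GET on url answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            return resp.status == 200
    except (OSError, http.client.HTTPException):
        # Connection refused, timeouts, and non-2xx replies all count as "not ready".
        return False


def wait_until_ready(
    base_url: str,
    timeout_s: float = 30.0,
    interval_s: float = 0.5,
    probe: Callable[[str], bool] = http_ok,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll /health and /v1/models until both succeed or the timeout elapses."""
    deadline = time.monotonic() + timeout_s
    while True:
        if probe(f"{base_url}/health") and probe(f"{base_url}/v1/models"):
            return True
        if time.monotonic() >= deadline:
            return False
        sleep(interval_s)
```

A caller would typically map a `False` result to the `503` response described earlier, for example `if not wait_until_ready("http://127.0.0.1:8081"): ...`. The `probe` and `sleep` parameters exist so the loop can be exercised in tests without a running server.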

## Registering

If `dispatcher_base_url` is set in the node-agent config, the node agent periodically calls:

- `POST <dispatcher>/v1/nodes/heartbeat` with the latest device metrics.

The node agent does not yet call `/v1/nodes/register` on its own: registration is currently manual from the node side (or can be added as a startup step).
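As a rough illustration of the heartbeat call, the request could be assembled like this. Only the endpoint path comes from this document. The payload fields (`node_id`, `devices`, `free_vram_mb`), the helper name, and the example dispatcher URL are assumptions, and the dispatcher's actual schema may differ.

```python
import json
import urllib.request
from typing import Any, Dict, List


def build_heartbeat_request(
    dispatcher_base_url: str,
    node_id: str,
    devices: List[Dict[str, Any]],
) -> urllib.request.Request:
    """Build (but do not send) a POST to <dispatcher>/v1/nodes/heartbeat."""
    url = dispatcher_base_url.rstrip("/") + "/v1/nodes/heartbeat"
    # Hypothetical payload shape, for illustration only.
    body = json.dumps({"node_id": node_id, "devices": devices}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_heartbeat_request(
    "http://dispatcher.example:8000/",
    node_id="node-a",
    devices=[{"device": "gpu:0", "free_vram_mb": 8192}],
)
# urllib.request.urlopen(req, timeout=5) would send it.
```

A periodic loop would rebuild the request with fresh metrics on each tick and tolerate failed sends, so a temporarily unreachable dispatcher does not stop the agent.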

### Binding

By default the node agent listens on `127.0.0.1`. If the dispatcher is on another machine:

- Set `listen_host` to a LAN/private IP (preferred), or to `0.0.0.0` only when combined with strict firewalling.
- Keep llama.cpp servers local-only (this is enforced by the CUDA adapter).