# Node Agent

The **RoleMesh Node Agent** runs on each compute host and manages **persistent** `llama.cpp` servers (one per device, e.g. one per GPU).

It can:

- expose OpenAI-compatible endpoints locally (`/v1/models`, `/v1/chat/completions`)
- register + heartbeat to the Dispatcher/Gateway (`/v1/nodes/register`, `/v1/nodes/heartbeat`)
- report inventory + utilization (`/v1/node/inventory`)

## Where the weight file is configured

For the node agent, the actual model weights are specified directly in the node-agent config under `models[].path`:

```yaml
models:
  - model_id: "planner-gguf"
    path: "/models/SomePlannerModel.Q5_K_M.gguf"
    roles: ["planner"]
```

- `model_id`: name exposed by the node agent API
- `path`: exact GGUF file to load
- `roles`: role labels this model can satisfy when the node registers with a gateway

## Persistent server model

For each GPU device, the node agent starts a dedicated `llama-server` process, pinned via environment variables (e.g. `CUDA_VISIBLE_DEVICES=0` for `gpu:0`) and bound to `127.0.0.1:`.

Model switching is handled by **restart** in the scaffold. The agent now waits for the replacement `llama-server` to report readiness before proxying the first request. If startup or switching takes too long, the request fails with a `503` instead of passing through a transient upstream "Loading model" error.
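The wait-then-fail behaviour can be sketched as a small polling loop. This is a minimal illustration, not the agent's actual code: `wait_until_ready`, its parameters, and the injected `probe`, `clock`, and `sleep` callables are all hypothetical names chosen for this example.

```python
import time
from typing import Callable


def wait_until_ready(
    probe: Callable[[], bool],
    timeout_s: float = 30.0,      # illustrative default, not the agent's
    interval_s: float = 0.5,      # illustrative default, not the agent's
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll `probe` until it returns True or `timeout_s` elapses.

    `probe` is a hypothetical callable that reports whether the managed
    server answers its readiness check. Returns False on timeout.
    """
    deadline = clock() + timeout_s
    while True:
        try:
            if probe():
                return True
        except OSError:
            # Connection refused/reset while the server is still starting;
            # treat it as "not ready yet" and keep polling.
            pass
        remaining = deadline - clock()
        if remaining <= 0:
            return False
        # Never sleep past the deadline.
        sleep(min(interval_s, remaining))
```

A caller in this sketch would proxy the request when `wait_until_ready(...)` returns `True` and answer with `503` when it returns `False`. Injecting `clock` and `sleep` keeps the loop testable without real delays.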
Device selection is still simple, but it is no longer hard-coded to the first GPU:

- first preference: a device that already has the requested model loaded
- otherwise: the device with the most free VRAM and least queue pressure
- requests are serialized per device
- each device has a bounded pending-request limit for backpressure

## Backends

Adapters are implemented as runtime backends:

- `cuda`: scaffold implementation (NVIDIA via `nvidia-smi`)
- `metal`, `rocm`, `sycl`, `vulkan`: stubs with placeholders for device discovery and metrics

The framework keeps scheduling decisions backend-agnostic by standardizing on: `DeviceRef` + `DeviceMetrics` + `ensure_server(...)`.

## Running

```bash
pip install -e .
rolemesh-node-agent --config configs/node_agent.example.yaml
```

### Startup timing guards

Two config knobs control how long the node agent waits for a managed `llama-server` to become ready, and a third caps per-device queueing:

```yaml
llama_server_startup_timeout_s: 30.0
llama_server_probe_interval_s: 0.5
max_pending_requests_per_device: 2
```

- `llama_server_startup_timeout_s`: maximum time to wait for a newly started or switched model
- `llama_server_probe_interval_s`: polling interval for readiness checks
- `max_pending_requests_per_device`: maximum in-flight plus queued requests allowed per device before new requests are rejected

The readiness probe checks the managed server's local `GET /health` and `GET /v1/models` endpoints.

## Registering

If `dispatcher_base_url` is set in the node-agent config, the node agent will periodically call:

- `POST /v1/nodes/heartbeat` with latest device metrics.

Registration is currently manual from the node side (or can be added as a startup step).

### Binding

By default the node agent listens on `127.0.0.1`. If the dispatcher is on another machine, set:

- `listen_host` to a LAN/private IP (preferred), or `0.0.0.0` only when combined with strict firewalling.
- Keep llama.cpp servers local-only (this is enforced by the CUDA adapter).
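To sanity-check a `listen_host` value against the binding guidance above, a small helper can classify an IP literal using Python's standard `ipaddress` module. `classify_listen_host` is a hypothetical name for this sketch and is not part of the node agent; it assumes `listen_host` is given as an IP address rather than a hostname.

```python
import ipaddress


def classify_listen_host(host: str) -> str:
    """Return 'wildcard', 'loopback', 'private', or 'public' for an IP literal.

    Hypothetical helper for illustration only. Raises ValueError if `host`
    is not a valid IPv4/IPv6 address.
    """
    addr = ipaddress.ip_address(host)
    # Order matters: 0.0.0.0 and 127.0.0.1 also count as "private" in
    # the ipaddress module, so test the more specific cases first.
    if addr.is_unspecified:
        return "wildcard"
    if addr.is_loopback:
        return "loopback"
    if addr.is_private:
        return "private"
    return "public"
```

In terms of the guidance above, `"loopback"` corresponds to the default, `"private"` to the preferred setting when the dispatcher is remote, and `"wildcard"` to the case that needs strict firewalling.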