# Node Agent

The **RoleMesh Node Agent** runs on each compute host and manages **persistent** `llama.cpp` servers (one per device, e.g. one per GPU). It can:

- expose OpenAI-compatible endpoints locally (`/v1/models`, `/v1/chat/completions`)
- register + heartbeat to the Dispatcher/Gateway (`/v1/nodes/register`, `/v1/nodes/heartbeat`)
- report inventory + utilization (`/v1/node/inventory`)

## Where the weight file is configured

For the node agent, the actual model weights are specified directly in the node-agent config under `models[].path`:

```yaml
models:
  - model_id: "planner-gguf"
    path: "/models/SomePlannerModel.Q5_K_M.gguf"
    roles: ["planner"]
```

- `model_id`: name exposed by the node agent API
- `path`: exact GGUF file to load
- `roles`: role labels this model can satisfy when the node registers with a gateway

## Common llama-server options as structured config

For common runtime tuning, the node agent supports structured model fields instead of requiring everything to go through raw `server_args`.

Supported structured fields:

- `ctx_size`
- `batch_size`
- `ubatch_size`
- `threads`
- `threads_batch`
- `gpu_layers`
- `main_gpu`
- `tensor_split`
- `flash_attn`
- `alias`

Example:

```yaml
models:
  - model_id: "planner-gguf"
    path: "/models/SomePlannerModel.Q5_K_M.gguf"
    roles: ["planner"]
    ctx_size: 8192
    gpu_layers: 60
    threads: 8
    batch_size: 1024
    flash_attn: true
    server_args:
      parallel: 1
```

`server_args` is still supported as an escape hatch for less common `llama-server` flags.

## Persistent server model

For each GPU device, the node agent starts a dedicated `llama-server` process, pinned via environment variables (e.g. `CUDA_VISIBLE_DEVICES=0` for `gpu:0`) and bound to `127.0.0.1:<port>`.

Model switching is handled by **restart** in the scaffold. The agent now waits for the replacement `llama-server` to report readiness before proxying the first request. If startup or switching takes too long, the request fails with a `503` instead of passing through a transient upstream "Loading model" error.

Device selection is still simple, but it is no longer hard-coded to the first GPU:

- first preference: a device that already has the requested model loaded
- otherwise: the device with the most free VRAM and least queue pressure
- requests are serialized per device
- each device has a bounded pending-request limit for backpressure

## Backends

Adapters are implemented as runtime backends:

- `cuda`: scaffold implementation (NVIDIA via `nvidia-smi`)
- `metal`, `rocm`, `sycl`, `vulkan`: stubs with placeholders for device discovery and metrics

The framework keeps scheduling decisions backend-agnostic by standardizing on `DeviceRef`, `DeviceMetrics`, and `ensure_server(...)`.

## Running

```bash
pip install -e .
rolemesh-node-agent --config configs/node_agent.example.yaml
```

To keep local paths and host-specific values out of the tracked config, use an override file:

```bash
rolemesh-node-agent \
  --config configs/node_agent.example.yaml \
  --config-override configs/node_agent.local.yaml
```

The tracked base config should contain placeholders such as:

- `/path/to/model-weights`
- `/path/to/llama-server`

Your local override should contain the real machine-specific values, as in the sketch below.
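A minimal sketch of what such an override might look like, assuming the override file uses the same keys as the base config shown above and that values set in it take precedence (the exact merge behavior is not documented here); the path is a placeholder:

```yaml
# configs/node_agent.local.yaml -- untracked, machine-specific values.
# Keys mirror the base config examples above.
models:
  - model_id: "planner-gguf"
    path: "/models/SomePlannerModel.Q5_K_M.gguf"   # real local weight file
    roles: ["planner"]
```

Whatever key the base config uses for the `llama-server` binary location would be overridden in the same way.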
### Startup timing guards

Two config knobs control how long the node agent waits for a managed `llama-server` to become ready; a third bounds per-device backpressure:

```yaml
llama_server_startup_timeout_s: 30.0
llama_server_probe_interval_s: 0.5
max_pending_requests_per_device: 2
```

- `llama_server_startup_timeout_s`: maximum time to wait for a newly started or switched model
- `llama_server_probe_interval_s`: polling interval for readiness checks
- `max_pending_requests_per_device`: maximum in-flight plus queued requests allowed per device before new requests are rejected

The readiness probe checks the managed server's local `GET /health` and `GET /v1/models` endpoints.

## Registering

If `dispatcher_base_url` is set in the node-agent config, the node agent will:

- register itself on startup via `POST /v1/nodes/register`
- send `POST /v1/nodes/heartbeat` with the latest device metrics

The registration payload is derived from local `models[]` and advertises `served_models`, where each local `model_id` lists the roles it can satisfy.

For operator inspection, the node agent also exposes:

- `GET /v1/node/registration`

That endpoint returns the exact registration payload the node would send to the dispatcher.

### Binding

By default the node agent listens on `127.0.0.1`. If the dispatcher is on another machine:

- set `listen_host` to a LAN/private IP (preferred), or `0.0.0.0` only when combined with strict firewalling
- keep `llama.cpp` servers local-only (this is enforced by the CUDA adapter)

A sketch of such a cross-machine setup follows.
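The config fragment below is illustrative only; the IP addresses and the dispatcher port are placeholders, and only `listen_host` and `dispatcher_base_url` are keys taken from this document:

```yaml
# Node-agent config fragment for a node that a remote dispatcher must reach.
listen_host: "192.168.1.50"                       # LAN/private IP of this host (placeholder)
dispatcher_base_url: "http://192.168.1.10:8080"   # dispatcher address (placeholder)
# The managed llama-server processes stay bound to 127.0.0.1 regardless of listen_host.
```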