4.7 KiB
Node Agent
The RoleMesh Node Agent runs on each compute host and manages persistent llama.cpp servers
(one per device, e.g. one per GPU). It can:
- expose OpenAI-compatible endpoints locally (
/v1/models,/v1/chat/completions) - register + heartbeat to the Dispatcher/Gateway (
/v1/nodes/register,/v1/nodes/heartbeat) - report inventory + utilization (
/v1/node/inventory)
Where the weight file is configured
For the node agent, the actual model weights are specified directly in the node-agent config under models[].path:
models:
- model_id: "planner-gguf"
path: "/models/SomePlannerModel.Q5_K_M.gguf"
roles: ["planner"]
model_id: name exposed by the node agent APIpath: exact GGUF file to loadroles: role labels this model can satisfy when the node registers with a gateway
Common llama-server options as structured config
For common runtime tuning, the node agent supports structured model fields instead of requiring everything to go
through raw server_args.
Supported structured fields:
ctx_sizebatch_sizeubatch_sizethreadsthreads_batchgpu_layersmain_gputensor_splitflash_attnalias
Example:
models:
- model_id: "planner-gguf"
path: "/models/SomePlannerModel.Q5_K_M.gguf"
roles: ["planner"]
ctx_size: 8192
gpu_layers: 60
threads: 8
batch_size: 1024
flash_attn: true
server_args:
parallel: 1
server_args is still supported as an escape hatch for less common llama-server flags.
Persistent server model
For each GPU device, the node agent starts a dedicated llama-server process, pinned via
environment variables (e.g. CUDA_VISIBLE_DEVICES=0 for gpu:0) and bound to 127.0.0.1:<port>.
Model switching is handled by restart in the scaffold.
The agent now waits for the replacement llama-server to report readiness before proxying the first request.
If startup or switching takes too long, the request fails with a 503 instead of passing through a transient upstream
"Loading model" error.
Device selection is still simple, but it is no longer hard-coded to the first GPU:
- first preference: a device that already has the requested model loaded
- otherwise: the device with the most free VRAM and least queue pressure
- requests are serialized per device
- each device has a bounded pending-request limit for backpressure
Backends
Adapters are implemented as runtime backends:
cuda: scaffold implementation (NVIDIA vianvidia-smi)metal,rocm,sycl,vulkan: stubs with placeholders for device discovery and metrics
The framework keeps scheduling decisions backend-agnostic by standardizing on:
DeviceRef + DeviceMetrics + ensure_server(...).
Running
pip install -e .
rolemesh-node-agent --config configs/node_agent.example.yaml
To keep local paths and host-specific values out of the tracked config, use an override file:
rolemesh-node-agent \
--config configs/node_agent.example.yaml \
--config-override configs/node_agent.local.yaml
Tracked base config should contain placeholders such as:
/path/to/model-weights/path/to/llama-server
Your local override should contain the real machine-specific values.
Startup timing guards
Two config knobs control how long the node agent waits for a managed llama-server to become ready:
llama_server_startup_timeout_s: 30.0
llama_server_probe_interval_s: 0.5
max_pending_requests_per_device: 2
llama_server_startup_timeout_s: maximum time to wait for a newly started or switched modelllama_server_probe_interval_s: polling interval for readiness checksmax_pending_requests_per_device: maximum in-flight plus queued requests allowed per device before new requests are rejected
The readiness probe checks the managed server's local GET /health and GET /v1/models endpoints.
Registering
If dispatcher_base_url is set in the node-agent config, the node agent will:
- register itself on startup via
POST <dispatcher>/v1/nodes/register POST <dispatcher>/v1/nodes/heartbeatwith latest device metrics.
The registration payload is derived from local models[] and advertises served_models, where each local
model_id lists the roles it can satisfy.
For operator inspection, the node agent also exposes:
GET /v1/node/registration
That endpoint returns the exact registration payload the node would send to the dispatcher.
Binding
By default the node agent listens on 127.0.0.1. If the dispatcher is on another machine, set:
listen_hostto a LAN/private IP (preferred), or0.0.0.0only when combined with strict firewalling.- Keep llama.cpp servers local-only (this is enforced by the CUDA adapter).