Node Agent
The RoleMesh Node Agent runs on each compute host and manages persistent llama.cpp servers
(one per device, e.g. one per GPU). It can:
- expose OpenAI-compatible endpoints locally (/v1/models, /v1/chat/completions)
- register + heartbeat to the Dispatcher/Gateway (/v1/nodes/register, /v1/nodes/heartbeat)
- report inventory + utilization (/v1/node/inventory)
Where the weight file is configured
The actual model weights are specified directly in the node-agent config under models[].path:
models:
  - model_id: "planner-gguf"
    path: "/models/SomePlannerModel.Q5_K_M.gguf"
    roles: ["planner"]
- model_id: the name exposed by the node agent API
- path: the exact GGUF file to load
- roles: role labels this model can satisfy when the node registers with a gateway
Common llama-server options as structured config
For common runtime tuning, the node agent supports structured model fields instead of requiring everything to go
through raw server_args.
Supported structured fields:
ctx_size, batch_size, ubatch_size, threads, threads_batch, gpu_layers, main_gpu, tensor_split, flash_attn, alias
Example:
models:
  - model_id: "planner-gguf"
    path: "/models/SomePlannerModel.Q5_K_M.gguf"
    roles: ["planner"]
    ctx_size: 8192
    gpu_layers: 60
    threads: 8
    batch_size: 1024
    flash_attn: true
    server_args:
      parallel: 1
server_args is still supported as an escape hatch for less common llama-server flags.
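As a rough illustration, the translation from structured fields to llama-server flags might look like the sketch below. The helper name and flag map are assumptions, not the scaffold's actual code; the flag spellings follow common llama.cpp llama-server conventions (e.g. --ctx-size, --n-gpu-layers), which may differ across llama.cpp versions.

```python
# Hypothetical helper: turn a node-agent model entry into a llama-server
# command line. server_args entries pass through as raw --key value flags.
FLAG_MAP = {
    "ctx_size": "--ctx-size",
    "batch_size": "--batch-size",
    "ubatch_size": "--ubatch-size",
    "threads": "--threads",
    "threads_batch": "--threads-batch",
    "gpu_layers": "--n-gpu-layers",
    "main_gpu": "--main-gpu",
    "tensor_split": "--tensor-split",
    "flash_attn": "--flash-attn",
    "alias": "--alias",
}

def build_server_command(model: dict, host: str = "127.0.0.1", port: int = 8081) -> list:
    cmd = ["llama-server", "-m", model["path"], "--host", host, "--port", str(port)]
    for field, flag in FLAG_MAP.items():
        value = model.get(field)
        if value is None:
            continue
        if isinstance(value, bool):
            if value:  # boolean flags are emitted without an argument
                cmd.append(flag)
        else:
            cmd += [flag, str(value)]
    # Escape hatch: raw server_args become --key value pairs verbatim.
    for key, value in model.get("server_args", {}).items():
        cmd += [f"--{key.replace('_', '-')}", str(value)]
    return cmd
```

With the example config above, this would yield a command containing -m /models/SomePlannerModel.Q5_K_M.gguf, --ctx-size 8192, --n-gpu-layers 60, --flash-attn, and --parallel 1.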
Persistent server model
For each GPU device, the node agent starts a dedicated llama-server process, pinned via
environment variables (e.g. CUDA_VISIBLE_DEVICES=0 for gpu:0) and bound to 127.0.0.1:<port>.
In the scaffold, model switching is handled by restarting the llama-server process.
The agent now waits for the replacement llama-server to report readiness before proxying the first request.
If startup or switching takes too long, the request fails with a 503 instead of passing through a transient upstream
"Loading model" error.
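The readiness wait reduces to a simple poll-until-deadline loop. The sketch below is illustrative, not the scaffold's actual code: the function and exception names are hypothetical, and the probe callable stands in for the local GET /health check; the timeout and interval correspond to the startup-timing knobs described later.

```python
import time

class ServiceUnavailable(Exception):
    """Illustrative stand-in for returning HTTP 503 to the caller."""
    status_code = 503

def wait_until_ready(probe, timeout_s: float = 30.0, interval_s: float = 0.5,
                     clock=time.monotonic, sleep=time.sleep) -> None:
    """Poll probe() (e.g. GET /health on the managed llama-server) until it
    returns True; fail with a 503-style error once the timeout elapses."""
    deadline = clock() + timeout_s
    while True:
        if probe():
            return
        if clock() >= deadline:
            raise ServiceUnavailable("llama-server not ready before timeout")
        sleep(interval_s)
```

Injecting clock and sleep keeps the loop testable without real waiting; the agent would call this before proxying the first request after a start or switch.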
Device selection is still simple, but it is no longer hard-coded to the first GPU:
- first preference: a device that already has the requested model loaded
- otherwise: the device with the most free VRAM and least queue pressure
- requests are serialized per device
- each device has a bounded pending-request limit for backpressure
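The selection policy above can be sketched as a small scoring function. The DeviceState shape and function name are illustrative assumptions, not the scaffold's real types:

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class DeviceState:
    """Illustrative per-device view used by the selection sketch."""
    device_id: str
    free_vram_mb: int
    pending_requests: int
    loaded_model: Optional[str] = None

def pick_device(devices: List[DeviceState], model_id: str,
                max_pending: int = 2) -> Optional[DeviceState]:
    # Backpressure: devices at their pending-request limit are not eligible.
    eligible = [d for d in devices if d.pending_requests < max_pending]
    if not eligible:
        return None  # caller would reject the request instead of queueing
    # First preference: a device that already has the requested model loaded.
    warm = [d for d in eligible if d.loaded_model == model_id]
    pool = warm or eligible
    # Otherwise: most free VRAM, breaking ties by least queue pressure.
    return max(pool, key=lambda d: (d.free_vram_mb, -d.pending_requests))
```

Per-device serialization is not shown here; in practice each selected device would own a queue that admits one request at a time.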
Backends
Adapters are implemented as runtime backends:
- cuda: scaffold implementation (NVIDIA via nvidia-smi)
- metal, rocm, sycl, vulkan: stubs with placeholders for device discovery and metrics
The framework keeps scheduling decisions backend-agnostic by standardizing on:
DeviceRef + DeviceMetrics + ensure_server(...).
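A minimal sketch of what that backend-agnostic surface could look like; the field names and the Protocol shape are assumptions for illustration, not the scaffold's actual definitions:

```python
from dataclasses import dataclass
from typing import List, Protocol

@dataclass(frozen=True)
class DeviceRef:
    """Backend-agnostic device handle, e.g. backend='cuda', index=0 -> gpu:0."""
    backend: str  # cuda | metal | rocm | sycl | vulkan
    index: int

@dataclass
class DeviceMetrics:
    """Utilization snapshot the scheduler consumes, regardless of backend."""
    free_vram_mb: int
    total_vram_mb: int
    utilization_pct: float

class RuntimeBackend(Protocol):
    """Interface each backend adapter would implement."""
    def discover_devices(self) -> List[DeviceRef]: ...
    def metrics(self, device: DeviceRef) -> DeviceMetrics: ...
    def ensure_server(self, device: DeviceRef, model_id: str) -> str:
        """Return the local base URL of a ready llama-server for model_id."""
        ...
```

Because scheduling only sees DeviceRef and DeviceMetrics, adding a new backend means implementing the adapter, not touching the scheduler.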
Running
pip install -e .
rolemesh-node-agent --config configs/node_agent.example.yaml
Startup timing guards
Two config knobs control how long the node agent waits for a managed llama-server to become ready; a third caps per-device queueing:
llama_server_startup_timeout_s: 30.0
llama_server_probe_interval_s: 0.5
max_pending_requests_per_device: 2
- llama_server_startup_timeout_s: maximum time to wait for a newly started or switched model
- llama_server_probe_interval_s: polling interval for readiness checks
- max_pending_requests_per_device: maximum in-flight plus queued requests allowed per device before new requests are rejected
The readiness probe checks the managed server's local GET /health and GET /v1/models endpoints.
Registering
If dispatcher_base_url is set in the node-agent config, the node agent will periodically call:
POST <dispatcher>/v1/nodes/heartbeat with the latest device metrics.
Registration is currently manual from the node side (or can be added as a startup step).
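The heartbeat loop can be sketched as below. The function name, payload schema, and injected post callable are illustrative assumptions; only the /v1/nodes/heartbeat path comes from this document.

```python
import json
import time
from typing import Optional

def heartbeat_loop(node_id: str, collect_metrics, post,
                   interval_s: float = 10.0,
                   iterations: Optional[int] = None,
                   sleep=time.sleep) -> int:
    """Periodically send device metrics to the dispatcher.

    `collect_metrics` returns the current per-device metrics dict;
    `post` is an injected callable (e.g. wrapping an HTTP POST to
    <dispatcher>/v1/nodes/heartbeat). Returns the number of beats sent.
    """
    sent = 0
    while iterations is None or sent < iterations:
        body = json.dumps({"node_id": node_id, "devices": collect_metrics()})
        post("/v1/nodes/heartbeat", body)
        sent += 1
        sleep(interval_s)
    return sent
```

Injecting post keeps transport and auth concerns out of the loop, and an explicit iteration bound makes it easy to exercise in tests.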
Binding
By default the node agent listens on 127.0.0.1. If the dispatcher is on another machine, set:
- listen_host to a LAN/private IP (preferred), or
- 0.0.0.0 only when combined with strict firewalling.

Keep llama.cpp servers local-only (this is enforced by the CUDA adapter).