Node Agent

The RoleMesh Node Agent runs on each compute host and manages persistent llama.cpp servers (one per device, e.g. one per GPU). It can:

  • expose OpenAI-compatible endpoints locally (/v1/models, /v1/chat/completions)
  • register + heartbeat to the Dispatcher/Gateway (/v1/nodes/register, /v1/nodes/heartbeat)
  • report inventory + utilization (/v1/node/inventory)
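
As a quick smoke test, the local endpoints can be exercised directly. A minimal sketch in Python, assuming the agent is bound to 127.0.0.1:8080 (substitute whatever host/port your node-agent config actually uses) and that a model_id of planner-gguf is configured:

import json
import urllib.request

BASE = "http://127.0.0.1:8080"  # assumed bind address; match your node-agent config

# List the models this agent currently exposes.
with urllib.request.urlopen(BASE + "/v1/models") as resp:
    print(json.load(resp))

# Send a chat completion to one of the configured model_ids.
req = urllib.request.Request(
    BASE + "/v1/chat/completions",
    data=json.dumps({
        "model": "planner-gguf",
        "messages": [{"role": "user", "content": "Say hello."}],
    }).encode(),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.load(resp))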

Where the weight file is configured

The actual model weights are specified directly in the node-agent config under models[].path:

models:
  - model_id: "planner-gguf"
    path: "/models/SomePlannerModel.Q5_K_M.gguf"
    roles: ["planner"]
  • model_id: name exposed by the node agent API
  • path: exact GGUF file to load
  • roles: role labels this model can satisfy when the node registers with a gateway

Common llama-server options as structured config

For common runtime tuning, the node agent supports structured model fields instead of requiring everything to go through raw server_args.

Supported structured fields:

  • ctx_size
  • batch_size
  • ubatch_size
  • threads
  • threads_batch
  • gpu_layers
  • main_gpu
  • tensor_split
  • flash_attn
  • alias

Example:

models:
  - model_id: "planner-gguf"
    path: "/models/SomePlannerModel.Q5_K_M.gguf"
    roles: ["planner"]
    ctx_size: 8192
    gpu_layers: 60
    threads: 8
    batch_size: 1024
    flash_attn: true
    server_args:
      parallel: 1

server_args is still supported as an escape hatch for less common llama-server flags.

Persistent server model

For each GPU device, the node agent starts a dedicated llama-server process, pinned via environment variables (e.g. CUDA_VISIBLE_DEVICES=0 for gpu:0) and bound to 127.0.0.1:<port>.
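
A minimal sketch of the pinning approach (not the agent's actual code; the llama-server flags shown are its standard --model/--host/--port options):

import os
import subprocess

def start_llama_server(gpu_index: int, port: int, model_path: str) -> subprocess.Popen:
    # Pin the server to one GPU via CUDA_VISIBLE_DEVICES and bind it to loopback only.
    env = dict(os.environ, CUDA_VISIBLE_DEVICES=str(gpu_index))
    return subprocess.Popen(
        ["llama-server",
         "--model", model_path,
         "--host", "127.0.0.1",
         "--port", str(port)],
        env=env,
    )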

In the scaffold, model switching is handled by restarting the server. The agent waits for the replacement llama-server to report readiness before proxying the first request; if startup or switching takes too long, the request fails with a 503 instead of passing through a transient upstream "Loading model" error. Device selection is still simple, but it is no longer hard-coded to the first GPU (a sketch of the heuristic follows the list):

  • first preference: a device that already has the requested model loaded
  • otherwise: the device with the most free VRAM and least queue pressure
  • requests are serialized per device
  • each device has a bounded pending-request limit for backpressure
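
A sketch of that heuristic, with illustrative (not actual) metrics field names:

from dataclasses import dataclass
from typing import Optional

@dataclass
class DeviceSnapshot:
    # Illustrative fields; the real DeviceMetrics type may differ.
    device: str                  # e.g. "gpu:0"
    loaded_model: Optional[str]  # model_id currently resident, if any
    free_vram_mb: int
    pending_requests: int

def pick_device(devices: list[DeviceSnapshot], model_id: str, max_pending: int) -> DeviceSnapshot:
    # Backpressure: skip devices whose bounded pending-request queue is full.
    eligible = [d for d in devices if d.pending_requests < max_pending]
    if not eligible:
        raise RuntimeError("all devices at max_pending_requests_per_device -> reject (503)")
    # First preference: a device that already has the requested model loaded.
    warm = [d for d in eligible if d.loaded_model == model_id]
    pool = warm or eligible
    # Otherwise: most free VRAM, then least queue pressure.
    return max(pool, key=lambda d: (d.free_vram_mb, -d.pending_requests))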

Backends

Adapters are implemented as runtime backends:

  • cuda: scaffold implementation (NVIDIA via nvidia-smi)
  • metal, rocm, sycl, vulkan: stubs with placeholders for device discovery and metrics

The framework keeps scheduling decisions backend-agnostic by standardizing on DeviceRef + DeviceMetrics + ensure_server(...).
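
One plausible shape for that contract, sketched as a Python Protocol (names beyond DeviceRef, DeviceMetrics, and ensure_server are assumptions, not the repo's actual definitions):

from dataclasses import dataclass
from typing import Protocol

@dataclass(frozen=True)
class DeviceRef:
    backend: str  # "cuda", "metal", "rocm", "sycl", or "vulkan"
    index: int    # e.g. gpu:0 -> index 0

@dataclass
class DeviceMetrics:
    free_vram_mb: int       # illustrative fields only
    utilization_pct: float

class RuntimeBackend(Protocol):
    def discover_devices(self) -> list[DeviceRef]:
        """Enumerate devices (the CUDA scaffold shells out to nvidia-smi)."""
        ...

    def metrics(self, device: DeviceRef) -> DeviceMetrics:
        """Report current utilization for scheduling decisions."""
        ...

    def ensure_server(self, device: DeviceRef, model_id: str) -> str:
        """Start or reuse a llama-server for model_id; return its local base URL."""
        ...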

Running

pip install -e .
rolemesh-node-agent --config configs/node_agent.example.yaml

Startup timing guards

Three config knobs control how long the node agent waits for a managed llama-server to become ready and how much per-device backpressure it applies:

llama_server_startup_timeout_s: 30.0
llama_server_probe_interval_s: 0.5
max_pending_requests_per_device: 2

  • llama_server_startup_timeout_s: maximum time to wait for a newly started or switched model
  • llama_server_probe_interval_s: polling interval for readiness checks
  • max_pending_requests_per_device: maximum in-flight plus queued requests allowed per device before new requests are rejected

The readiness probe checks the managed server's local GET /health and GET /v1/models endpoints.
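
A sketch of the probe loop implied by these knobs (the endpoint paths come from above; everything else is illustrative):

import time
import urllib.request

def wait_until_ready(base: str, timeout_s: float = 30.0, interval_s: float = 0.5) -> bool:
    # Poll the managed server until /health and /v1/models both answer,
    # or the startup timeout elapses (in which case the agent returns a 503).
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            for path in ("/health", "/v1/models"):
                urllib.request.urlopen(base + path, timeout=2.0).close()
            return True
        except OSError:  # URLError/HTTPError/timeouts all subclass OSError
            time.sleep(interval_s)
    return False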

Registering

If dispatcher_base_url is set in the node-agent config, the node agent will periodically call:

  • POST <dispatcher>/v1/nodes/heartbeat with latest device metrics.

Registration is currently manual from the node side (or can be added as a startup step).
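
For a manual registration or a one-off heartbeat, something like the following works; the payload shape here is hypothetical, since the actual schema is defined by the Dispatcher API:

import json
import urllib.request

DISPATCHER = "http://dispatcher.example:8080"  # your dispatcher_base_url

# Hypothetical payload; consult the Dispatcher's /v1/nodes/* schema for the real fields.
payload = {
    "node_id": "node-01",
    "devices": [{"device": "gpu:0", "free_vram_mb": 14336, "pending_requests": 0}],
}

req = urllib.request.Request(
    DISPATCHER + "/v1/nodes/heartbeat",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
    method="POST",
)
urllib.request.urlopen(req).close()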

Binding

By default the node agent listens on 127.0.0.1. If the dispatcher is on another machine:

  • set listen_host to a LAN/private IP (preferred), or to 0.0.0.0 only when combined with strict firewalling
  • keep the llama.cpp servers local-only (this is enforced by the CUDA adapter)