Node Agent

The RoleMesh Node Agent runs on each compute host and manages persistent llama.cpp servers (one per device, e.g. one per GPU). It can:

  • expose OpenAI-compatible endpoints locally (/v1/models, /v1/chat/completions)
  • register + heartbeat to the Dispatcher/Gateway (/v1/nodes/register, /v1/nodes/heartbeat)
  • report inventory + utilization (/v1/node/inventory)
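
For example, with the agent running you can exercise the local endpoints directly (the port and model name below are illustrative; use your agent's actual listen port and a model_id from your config):

curl -s http://127.0.0.1:8080/v1/models

curl -s http://127.0.0.1:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "planner-gguf", "messages": [{"role": "user", "content": "Hello"}]}'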

Where the weight file is configured

For the node agent, the actual model weights are specified directly in the node-agent config under models[].path:

models:
  - model_id: "planner-gguf"
    path: "/models/SomePlannerModel.Q5_K_M.gguf"
    roles: ["planner"]

  • model_id: name exposed by the node agent API
  • path: exact GGUF file to load
  • roles: role labels this model can satisfy when the node registers with a gateway

Common llama-server options as structured config

For common runtime tuning, the node agent supports structured model fields instead of requiring everything to go through raw server_args.

Supported structured fields:

  • ctx_size
  • batch_size
  • ubatch_size
  • threads
  • threads_batch
  • gpu_layers
  • main_gpu
  • tensor_split
  • flash_attn
  • alias

Example:

models:
  - model_id: "planner-gguf"
    path: "/models/SomePlannerModel.Q5_K_M.gguf"
    roles: ["planner"]
    ctx_size: 8192
    gpu_layers: 60
    threads: 8
    batch_size: 1024
    flash_attn: true
    server_args:
      parallel: 1

server_args is still supported as an escape hatch for less common llama-server flags.
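
As a rough sketch of what the structured fields amount to, the example above would translate to a llama-server invocation along these lines (the exact flag mapping is an agent implementation detail, flag spellings vary across llama.cpp versions, e.g. --flash-attn takes a value in newer builds, and the port and device pinning shown are placeholders):

CUDA_VISIBLE_DEVICES=0 llama-server \
  -m /models/SomePlannerModel.Q5_K_M.gguf \
  --ctx-size 8192 \
  --n-gpu-layers 60 \
  --threads 8 \
  --batch-size 1024 \
  --flash-attn \
  --parallel 1 \
  --host 127.0.0.1 --port 8081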

Persistent server model

For each GPU device, the node agent starts a dedicated llama-server process, pinned via environment variables (e.g. CUDA_VISIBLE_DEVICES=0 for gpu:0) and bound to 127.0.0.1:<port>.

In the scaffold, model switching is handled by restarting the server process. The agent now waits for the replacement llama-server to report readiness before proxying the first request; if startup or switching takes too long, the request fails with a 503 instead of passing through a transient upstream "Loading model" error. Device selection is still simple, but it is no longer hard-coded to the first GPU (a sketch of the policy follows the list):

  • first preference: a device that already has the requested model loaded
  • otherwise: the device with the most free VRAM and least queue pressure
  • requests are serialized per device
  • each device has a bounded pending-request limit for backpressure
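
A minimal sketch of that selection policy in Python, using hypothetical names (DeviceState, loaded_model, free_vram_bytes, and pending_requests are illustrative, not the agent's actual types):

from dataclasses import dataclass
from typing import Optional

@dataclass
class DeviceState:
    device: str                  # e.g. "gpu:0"
    loaded_model: Optional[str]  # model_id currently served, if any
    free_vram_bytes: int
    pending_requests: int        # in-flight plus queued

def pick_device(devices: list[DeviceState], model_id: str,
                max_pending: int = 2) -> Optional[DeviceState]:
    # Enforce the per-device backpressure limit first.
    candidates = [d for d in devices if d.pending_requests < max_pending]
    if not candidates:
        return None  # caller rejects the request (503)
    # First preference: a device that already has the requested model loaded.
    warm = [d for d in candidates if d.loaded_model == model_id]
    if warm:
        return min(warm, key=lambda d: d.pending_requests)
    # Otherwise: most free VRAM, breaking ties by least queue pressure.
    return max(candidates, key=lambda d: (d.free_vram_bytes, -d.pending_requests))

Requests routed to the chosen device are then serialized on that device's queue.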

Backends

Adapters are implemented as runtime backends:

  • cuda: scaffold implementation (NVIDIA via nvidia-smi)
  • metal, rocm, sycl, vulkan: stubs with placeholders for device discovery and metrics

The framework keeps scheduling decisions backend-agnostic by standardizing on a small interface: DeviceRef, DeviceMetrics, and ensure_server(...).
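
The rough shape of that interface, as a Python sketch (field and method signatures here are assumptions; the authoritative definitions live in the backend adapter code):

from dataclasses import dataclass
from typing import Protocol

@dataclass(frozen=True)
class DeviceRef:
    backend: str  # "cuda", "metal", "rocm", "sycl", or "vulkan"
    index: int    # device ordinal, e.g. 0 for gpu:0

@dataclass
class DeviceMetrics:
    free_vram_bytes: int
    total_vram_bytes: int
    utilization_pct: float

class RuntimeBackend(Protocol):
    def discover_devices(self) -> list[DeviceRef]: ...
    def get_metrics(self, device: DeviceRef) -> DeviceMetrics: ...
    def ensure_server(self, device: DeviceRef, model_id: str) -> str:
        """Start or reuse a llama-server for model_id on device; return its local base URL."""
        ...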

Running

pip install -e .
rolemesh-node-agent --config configs/node_agent.example.yaml

To keep local paths and host-specific values out of the tracked config, use an override file:

rolemesh-node-agent \
  --config configs/node_agent.example.yaml \
  --config-override configs/node_agent.local.yaml

Tracked base config should contain placeholders such as:

  • /path/to/model-weights
  • /path/to/llama-server

Your local override should contain the real machine-specific values.
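
A hypothetical pair, to make the split concrete (llama_server_path is an assumed key name; check configs/node_agent.example.yaml for the real schema):

# configs/node_agent.example.yaml (tracked) — placeholders only
llama_server_path: "/path/to/llama-server"
models:
  - model_id: "planner-gguf"
    path: "/path/to/model-weights/SomePlannerModel.Q5_K_M.gguf"

# configs/node_agent.local.yaml (untracked) — real machine-specific values
llama_server_path: "/opt/llama.cpp/build/bin/llama-server"
models:
  - model_id: "planner-gguf"
    path: "/data/models/SomePlannerModel.Q5_K_M.gguf"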

Startup timing guards

Two config knobs control how long the node agent waits for a managed llama-server to become ready, and a third bounds per-device queueing:

llama_server_startup_timeout_s: 30.0
llama_server_probe_interval_s: 0.5
max_pending_requests_per_device: 2

  • llama_server_startup_timeout_s: maximum time to wait for a newly started or switched model
  • llama_server_probe_interval_s: polling interval for readiness checks
  • max_pending_requests_per_device: maximum in-flight plus queued requests allowed per device before new requests are rejected

The readiness probe checks the managed server's local GET /health and GET /v1/models endpoints.
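
You can run the same probes by hand against a managed server (substitute the local port the agent assigned):

curl -sf http://127.0.0.1:<port>/health
curl -sf http://127.0.0.1:<port>/v1/models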

Registering

If dispatcher_base_url is set in the node-agent config, the node agent will:

  • register itself on startup via POST <dispatcher>/v1/nodes/register
  • periodically POST <dispatcher>/v1/nodes/heartbeat with the latest device metrics

The registration payload is derived from local models[] and advertises served_models, where each local model_id lists the roles it can satisfy.

For operator inspection, the node agent also exposes:

  • GET /v1/node/registration

That endpoint returns the exact registration payload the node would send to the dispatcher.
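
For example (the port is illustrative, and fields beyond served_models, model_id, and roles will vary; the ellipsis stands for whatever else the agent includes):

curl -s http://127.0.0.1:8080/v1/node/registration
# → {"served_models": [{"model_id": "planner-gguf", "roles": ["planner"]}], ...}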

Binding

By default the node agent listens on 127.0.0.1. If the dispatcher is on another machine:

  • set listen_host to a LAN/private IP (preferred), or to 0.0.0.0 only when combined with strict firewalling
  • keep the llama.cpp servers themselves bound to localhost (the CUDA adapter enforces this)
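
For example, in the node-agent config (addresses and port are placeholders):

listen_host: "192.168.1.20"                      # LAN IP of this node
dispatcher_base_url: "http://192.168.1.10:9000"  # where the dispatcher is reachable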