123 lines
4.0 KiB
Markdown
123 lines
4.0 KiB
Markdown
# Node Agent
|
|
|
|
The **RoleMesh Node Agent** runs on each compute host and manages **persistent** `llama.cpp` servers
|
|
(one per device, e.g. one per GPU). It can:
|
|
|
|
- expose OpenAI-compatible endpoints locally (`/v1/models`, `/v1/chat/completions`)
|
|
- register + heartbeat to the Dispatcher/Gateway (`/v1/nodes/register`, `/v1/nodes/heartbeat`)
|
|
- report inventory + utilization (`/v1/node/inventory`)
|
|
|
|
## Where the weight file is configured
|
|
|
|
For the node agent, the actual model weights are specified directly in the node-agent config under `models[].path`:
|
|
|
|
```yaml
|
|
models:
|
|
- model_id: "planner-gguf"
|
|
path: "/models/SomePlannerModel.Q5_K_M.gguf"
|
|
roles: ["planner"]
|
|
```
|
|
|
|
- `model_id`: name exposed by the node agent API
|
|
- `path`: exact GGUF file to load
|
|
- `roles`: role labels this model can satisfy when the node registers with a gateway
|
|
|
|
## Common llama-server options as structured config
|
|
|
|
For common runtime tuning, the node agent supports structured model fields instead of requiring everything to go
|
|
through raw `server_args`.
|
|
|
|
Supported structured fields:
|
|
- `ctx_size`
|
|
- `batch_size`
|
|
- `ubatch_size`
|
|
- `threads`
|
|
- `threads_batch`
|
|
- `gpu_layers`
|
|
- `main_gpu`
|
|
- `tensor_split`
|
|
- `flash_attn`
|
|
- `alias`
|
|
|
|
Example:
|
|
|
|
```yaml
|
|
models:
|
|
- model_id: "planner-gguf"
|
|
path: "/models/SomePlannerModel.Q5_K_M.gguf"
|
|
roles: ["planner"]
|
|
ctx_size: 8192
|
|
gpu_layers: 60
|
|
threads: 8
|
|
batch_size: 1024
|
|
flash_attn: true
|
|
server_args:
|
|
parallel: 1
|
|
```
|
|
|
|
`server_args` is still supported as an escape hatch for less common `llama-server` flags.
|
|
|
|
## Persistent server model
|
|
|
|
For each GPU device, the node agent starts a dedicated `llama-server` process, pinned via
|
|
environment variables (e.g. `CUDA_VISIBLE_DEVICES=0` for `gpu:0`) and bound to `127.0.0.1:<port>`.
|
|
|
|
Model switching is handled by **restart** in the scaffold.
|
|
The agent now waits for the replacement `llama-server` to report readiness before proxying the first request.
|
|
If startup or switching takes too long, the request fails with a `503` instead of passing through a transient upstream
|
|
"Loading model" error.
|
|
Device selection is still simple, but it is no longer hard-coded to the first GPU:
|
|
|
|
- first preference: a device that already has the requested model loaded
|
|
- otherwise: the device with the most free VRAM and least queue pressure
|
|
- requests are serialized per device
|
|
- each device has a bounded pending-request limit for backpressure
|
|
|
|
## Backends
|
|
|
|
Adapters are implemented as runtime backends:
|
|
|
|
- `cuda`: scaffold implementation (NVIDIA via `nvidia-smi`)
|
|
- `metal`, `rocm`, `sycl`, `vulkan`: stubs with placeholders for device discovery and metrics
|
|
|
|
The framework keeps scheduling decisions backend-agnostic by standardizing on:
|
|
`DeviceRef` + `DeviceMetrics` + `ensure_server(...)`.
|
|
|
|
## Running
|
|
|
|
```bash
|
|
pip install -e .
|
|
rolemesh-node-agent --config configs/node_agent.example.yaml
|
|
```
|
|
|
|
### Startup timing guards
|
|
|
|
Two config knobs control how long the node agent waits for a managed `llama-server` to become ready:
|
|
|
|
```yaml
|
|
llama_server_startup_timeout_s: 30.0
|
|
llama_server_probe_interval_s: 0.5
|
|
max_pending_requests_per_device: 2
|
|
```
|
|
|
|
- `llama_server_startup_timeout_s`: maximum time to wait for a newly started or switched model
|
|
- `llama_server_probe_interval_s`: polling interval for readiness checks
|
|
- `max_pending_requests_per_device`: maximum in-flight plus queued requests allowed per device before new requests are rejected
|
|
|
|
The readiness probe checks the managed server's local `GET /health` and `GET /v1/models` endpoints.
|
|
|
|
## Registering
|
|
|
|
If `dispatcher_base_url` is set in the node-agent config, the node agent will periodically call:
|
|
|
|
- `POST <dispatcher>/v1/nodes/heartbeat` with latest device metrics.
|
|
|
|
Registration is currently manual from the node side (or can be added as a startup step).
|
|
|
|
### Binding
|
|
|
|
By default the node agent listens on `127.0.0.1`. If the dispatcher is on another machine, set:
|
|
|
|
- `listen_host` to a LAN/private IP (preferred), or `0.0.0.0` only when combined with strict firewalling.
|
|
- Keep llama.cpp servers local-only (this is enforced by the CUDA adapter).
|