# Node Agent

The **RoleMesh Node Agent** runs on each compute host and manages **persistent** `llama.cpp` servers (one per device, e.g. one per GPU). It can:

- expose OpenAI-compatible endpoints locally (`/v1/models`, `/v1/chat/completions`)
- register + heartbeat to the Dispatcher/Gateway (`/v1/nodes/register`, `/v1/nodes/heartbeat`)
- report inventory + utilization (`/v1/node/inventory`)

## Persistent server model

For each GPU device, the node agent starts a dedicated `llama-server` process, pinned via environment variables (e.g. `CUDA_VISIBLE_DEVICES=0` for `gpu:0`) and bound to its own port on `127.0.0.1`. Model switching is handled by **restart** in the scaffold.

## Backends

Adapters are implemented as runtime backends:

- `cuda`: scaffold implementation (NVIDIA via `nvidia-smi`)
- `metal`, `rocm`, `sycl`, `vulkan`: stubs with placeholders for device discovery and metrics

The framework keeps scheduling decisions backend-agnostic by standardizing on `DeviceRef` + `DeviceMetrics` + `ensure_server(...)`.

## Running

```bash
pip install -e .
rolemesh-node-agent --config configs/node_agent.example.yaml
```

## Registering

If `dispatcher_base_url` is set in the node-agent config, the node agent periodically calls:

- `POST /v1/nodes/heartbeat` with the latest device metrics.

Registration is currently manual from the node side (or can be added as a startup step).
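For orientation, `configs/node_agent.example.yaml` might contain something like the fragment below. Only `dispatcher_base_url` is named in the text; every other key here is a hypothetical placeholder, not the scaffold's actual schema.

```yaml
# Hypothetical node-agent config sketch; consult the real
# configs/node_agent.example.yaml for the actual keys.
dispatcher_base_url: "http://dispatcher.example.internal:8080"  # omit to disable heartbeats
```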
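The per-device pinning described above can be sketched as follows. This is a minimal illustration, not the agent's actual implementation: the helper names (`build_server_command`, `start_server`) and the model path are hypothetical, while `--model`, `--host`, and `--port` are standard `llama-server` flags and `CUDA_VISIBLE_DEVICES` pinning mirrors the scaffold's approach.

```python
import os
import subprocess


def build_server_command(device_index: int, port: int, model_path: str):
    """Build the llama-server command line and environment for one GPU."""
    env = dict(os.environ)
    # Pin the process to a single GPU so each server owns exactly one device,
    # e.g. device_index=0 corresponds to the agent's "gpu:0".
    env["CUDA_VISIBLE_DEVICES"] = str(device_index)
    cmd = [
        "llama-server",
        "--model", model_path,
        "--host", "127.0.0.1",   # local-only binding, as in the scaffold
        "--port", str(port),     # one dedicated port per device
    ]
    return cmd, env


def start_server(device_index: int, port: int, model_path: str) -> subprocess.Popen:
    """Launch a persistent llama-server for one device.

    Model switching by *restart* would terminate this process and
    start a new one with a different model_path.
    """
    cmd, env = build_server_command(device_index, port, model_path)
    return subprocess.Popen(cmd, env=env)
```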
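The backend-agnostic contract (`DeviceRef` + `DeviceMetrics` + `ensure_server(...)`) might look roughly like the sketch below. The field names and method signatures are assumptions for illustration; only the three type/function names come from the text.

```python
from dataclasses import dataclass
from typing import Protocol


@dataclass(frozen=True)
class DeviceRef:
    """Backend-agnostic handle for one device, e.g. backend='cuda', index=0 -> 'gpu:0'."""
    backend: str
    index: int


@dataclass
class DeviceMetrics:
    """Utilization snapshot; exact fields are an assumption for this sketch."""
    utilization: float      # fraction in [0, 1]
    memory_used_mib: int
    memory_total_mib: int


class RuntimeBackend(Protocol):
    """What each adapter (cuda, metal, rocm, sycl, vulkan) would implement."""

    def discover_devices(self) -> list[DeviceRef]: ...

    def read_metrics(self, device: DeviceRef) -> DeviceMetrics: ...

    def ensure_server(self, device: DeviceRef, model_path: str, port: int) -> None: ...
```

Because the scheduler only ever sees `DeviceRef` and `DeviceMetrics`, a stub backend (say, `metal`) can be filled in later without touching any scheduling code.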
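A heartbeat call to the Dispatcher could be sketched as below. The `POST /v1/nodes/heartbeat` path comes from the text; the payload shape, `node_id` field, and helper names are assumptions, and the Dispatcher's actual schema may differ.

```python
import json
import urllib.request


def build_heartbeat_payload(node_id: str, device_metrics: dict) -> dict:
    """Assemble a heartbeat body; the schema here is illustrative only."""
    return {"node_id": node_id, "devices": device_metrics}


def send_heartbeat(dispatcher_base_url: str, payload: dict) -> int:
    """POST the latest device metrics to the Dispatcher and return the HTTP status."""
    req = urllib.request.Request(
        f"{dispatcher_base_url}/v1/nodes/heartbeat",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return resp.status
```

In the agent this would run on a timer whenever `dispatcher_base_url` is configured, with `device_metrics` populated from the active backend's `read_metrics(...)` calls.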