# Node Agent
The **RoleMesh Node Agent** runs on each compute host and manages **persistent** `llama.cpp` servers
(one per device, e.g. one per GPU). It can:
- expose OpenAI-compatible endpoints locally (`/v1/models`, `/v1/chat/completions`)
- register + heartbeat to the Dispatcher/Gateway (`/v1/nodes/register`, `/v1/nodes/heartbeat`)
- report inventory + utilization (`/v1/node/inventory`)
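Since the local endpoints are OpenAI-compatible, any plain HTTP client can talk to them. The sketch below shows a minimal chat-completion call against the agent; the base URL (port `8080`) and model name are illustrative assumptions, not values from the agent's config.

```python
import json
import urllib.request

# Hypothetical local bind address; the agent's actual address comes from its config.
BASE_URL = "http://127.0.0.1:8080"

def build_chat_request(model: str, prompt: str) -> dict:
    """Build a minimal OpenAI-compatible chat completion payload."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }

def chat(model: str, prompt: str) -> dict:
    """POST to the agent's local /v1/chat/completions endpoint."""
    req = urllib.request.Request(
        f"{BASE_URL}/v1/chat/completions",
        data=json.dumps(build_chat_request(model, prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())
```

`GET {BASE_URL}/v1/models` can be used the same way to list what the agent is currently serving.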
## Persistent server model
For each GPU device, the node agent starts a dedicated `llama-server` process, pinned via
environment variables (e.g. `CUDA_VISIBLE_DEVICES=0` for `gpu:0`) and bound to `127.0.0.1:<port>`.
In the scaffold, model switching is handled by **restart**: the device's server process is stopped and relaunched with the new model.
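The per-device lifecycle above can be sketched as follows. This is a simplified illustration, not the agent's actual implementation: the class and field names are assumptions, and only the CUDA case is shown (`CUDA_VISIBLE_DEVICES` pinning, restart on model change).

```python
import os
import subprocess
from typing import Optional

class DeviceServer:
    """Illustrative per-device state: one persistent llama-server process."""

    def __init__(self, device_index: int, port: int):
        self.device_index = device_index
        self.port = port
        self.model_path: Optional[str] = None
        self.proc: Optional[subprocess.Popen] = None

    def ensure_server(self, model_path: str) -> None:
        """Start llama-server for this device, restarting if the model changed."""
        if self.proc is not None and self.model_path == model_path:
            return  # already serving the requested model
        if self.proc is not None:
            self.proc.terminate()  # model switch: stop the old server first
            self.proc.wait()
        # Pin the process to one GPU and bind to loopback only.
        env = dict(os.environ, CUDA_VISIBLE_DEVICES=str(self.device_index))
        self.proc = subprocess.Popen(
            ["llama-server", "-m", model_path,
             "--host", "127.0.0.1", "--port", str(self.port)],
            env=env,
        )
        self.model_path = model_path
```

Restart-based switching trades a few seconds of downtime per switch for a much simpler state machine than in-place model reloading.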
## Backends
Adapters are implemented as runtime backends:
- `cuda`: scaffold implementation (NVIDIA via `nvidia-smi`)
- `metal`, `rocm`, `sycl`, `vulkan`: stubs with placeholders for device discovery and metrics
The framework keeps scheduling decisions backend-agnostic by standardizing on:
`DeviceRef` + `DeviceMetrics` + `ensure_server(...)`.
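A minimal sketch of what that backend-agnostic contract could look like. The field names and method signatures below are assumptions for illustration; only the three names mentioned above (`DeviceRef`, `DeviceMetrics`, `ensure_server`) come from the framework.

```python
from dataclasses import dataclass
from typing import Protocol

@dataclass(frozen=True)
class DeviceRef:
    backend: str  # "cuda", "metal", "rocm", "sycl", or "vulkan"
    index: int    # device ordinal, e.g. 0 for "gpu:0"

@dataclass
class DeviceMetrics:
    memory_used_mb: float
    memory_total_mb: float
    utilization_pct: float

class RuntimeBackend(Protocol):
    """Interface each backend adapter implements; the scheduler sees only this."""

    def discover_devices(self) -> list[DeviceRef]: ...

    def read_metrics(self, device: DeviceRef) -> DeviceMetrics: ...

    def ensure_server(self, device: DeviceRef, model_path: str) -> str:
        """Return the base URL of a running server for model_path on device."""
        ...
```

Because scheduling only touches this interface, adding a new accelerator backend means implementing discovery, metrics, and server lifecycle without touching the scheduler.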
## Running
```bash
pip install -e .
rolemesh-node-agent --config configs/node_agent.example.yaml
```
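For orientation, a config might look roughly like the fragment below. This is a guess at the shape, not the real schema; every key here is an assumption, and `configs/node_agent.example.yaml` in the repo is the authoritative example.

```yaml
# Illustrative only -- key names are assumptions, see configs/node_agent.example.yaml.
listen: 127.0.0.1:8080
dispatcher_base_url: http://dispatcher.internal:9000
devices:
  - gpu:0
  - gpu:1
```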
## Registering
If `dispatcher_base_url` is set in the node-agent config, the node agent will periodically call:
- `POST <dispatcher>/v1/nodes/heartbeat` with latest device metrics.
Registration is currently manual from the node side (or can be added as a startup step).
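The heartbeat call above can be sketched as a plain HTTP POST. The payload schema here is an assumption for illustration; only the `/v1/nodes/heartbeat` path comes from the source.

```python
import json
import urllib.request

def build_heartbeat(node_id: str, metrics: dict) -> bytes:
    """Serialize a heartbeat body; the exact field names are assumptions."""
    return json.dumps({"node_id": node_id, "devices": metrics}).encode()

def send_heartbeat(dispatcher_base_url: str, node_id: str, metrics: dict) -> int:
    """POST latest device metrics to <dispatcher>/v1/nodes/heartbeat."""
    req = urllib.request.Request(
        f"{dispatcher_base_url}/v1/nodes/heartbeat",
        data=build_heartbeat(node_id, metrics),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return resp.status
```

A startup-time call to `POST <dispatcher>/v1/nodes/register` would follow the same pattern if registration is automated later.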