# Deployment

## Quick start: single host

This is the recommended first deployment because it lets you verify the gateway before introducing discovery or node agents.
### 1. Start local backends

Example using `llamafile`:

```bash
llamafile --server -m /path/to/planner-model.gguf --host 127.0.0.1 --port 8011 --nobrowser
llamafile --server -m /path/to/writer-model.gguf --host 127.0.0.1 --port 8012 --nobrowser
```
### 2. Point the gateway at those backends

```yaml
version: 1
default_model: planner

auth:
  client_api_keys:
    - "change-me-client-key"

models:
  planner:
    type: proxy
    openai_model_name: planner
    proxy_url: http://127.0.0.1:8011
  writer:
    type: proxy
    openai_model_name: writer
    proxy_url: http://127.0.0.1:8012
```
### 3. Run the gateway

```bash
ROLE_MESH_CONFIG=configs/models.yaml uvicorn rolemesh_gateway.main:app --host 127.0.0.1 --port 8000
```
### 4. Smoke test

```bash
curl -sS http://127.0.0.1:8000/v1/models \
  -H 'X-Api-Key: change-me-client-key'

curl -sS -X POST http://127.0.0.1:8000/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -H 'X-Api-Key: change-me-client-key' \
  -d '{
    "model": "planner",
    "messages": [{"role":"user","content":"Say hello in 3 words."}]
  }'
```
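Once the smoke test passes, you can pull the advertised aliases out of the `/v1/models` response. The JSON below is an illustrative sample (the real payload shape may differ), saved locally so the filter can be tried without a running gateway:

```shell
# Illustrative sample of a /v1/models response (assumed shape, not authoritative).
cat > /tmp/rolemesh_models_sample.json <<'EOF'
{"object":"list","data":[{"id":"planner"},{"id":"writer"}]}
EOF

# Extract the advertised aliases from the "id" fields.
grep -o '"id":"[^"]*"' /tmp/rolemesh_models_sample.json | cut -d'"' -f4
```

In a live setup you would pipe the output of the first `curl` above into the same `grep`/`cut` filter.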
### Readiness and model advertisement

- `GET /health` only checks that the gateway process is up.
- `GET /ready` checks whether the configured default route is actually usable.
- `GET /v1/models` lists only aliases with a currently reachable upstream.
- Aliases that are configured but currently unavailable are reported in `rolemesh.unavailable_models`.
- Discovered nodes that have not checked in recently are marked stale and excluded from routing.
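As an illustration of the points above, a `/v1/models` response with one reachable alias and one unavailable alias might look like this (hypothetical shape; everything except the `rolemesh.unavailable_models` field name is an assumption):

```json
{
  "object": "list",
  "data": [
    { "id": "planner", "object": "model" }
  ],
  "rolemesh": {
    "unavailable_models": ["writer"]
  }
}
```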
### Stale node timeout

Registered nodes age out of discovered-role routing after a heartbeat timeout.

- Default timeout: `30` seconds.
- Configure with `ROLE_MESH_NODE_STALE_AFTER_S`.
- Stale nodes remain visible to operators in the gateway metadata, but they no longer receive traffic.
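A common rule of thumb (our assumption, not a project requirement) is to heartbeat at least three times per stale window, so a single missed beat does not drop the node from routing:

```shell
# Derive a heartbeat interval from the stale timeout (rule of thumb: timeout / 3).
STALE_AFTER_S="${ROLE_MESH_NODE_STALE_AFTER_S:-30}"   # gateway default is 30s
HEARTBEAT_S=$(( STALE_AFTER_S / 3 ))
echo "heartbeat every ${HEARTBEAT_S}s, nodes go stale after ${STALE_AFTER_S}s"
```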
## Worked example: multi-node role routing

For a concrete topology with:

- gateway on one host
- two node-agent processes on a dual-GPU machine
- one additional node-agent on a second machine
- roles `planner`, `writer`, and `critic`

see [EXAMPLE_MULTI_NODE.md](EXAMPLE_MULTI_NODE.md).
## Network binding and exposure (Step 2 hardening)
|
|
|
|
**Defaults are safe-by-default:** the gateway and node-agent CLIs default to binding on `127.0.0.1` (localhost).
|
|
This prevents accidental public exposure during development.
|
|
|
|
If you need remote access:
|
|
|
|
- Bind **only** to a **LAN/private** interface (e.g. `192.168.x.y`, `10.x.y.z`) and restrict ingress with a firewall/VPN.
|
|
- Do **not** bind to `0.0.0.0` on an Internet-routable host.
|
|
|
|
### Recommended firewall policy (examples)

Linux (UFW), allow only a private subnet to reach the gateway (8080) and node agents (8091):

```bash
sudo ufw allow from 192.168.0.0/16 to any port 8080 proto tcp
sudo ufw allow from 192.168.0.0/16 to any port 8091 proto tcp
sudo ufw deny 8080/tcp
sudo ufw deny 8091/tcp
```

If you're using Tailscale/WireGuard, prefer binding to the VPN interface address and limiting rules to that interface/subnet.
### Llama.cpp servers

The node agent starts persistent `llama-server` processes bound to **localhost only** (`127.0.0.1`).
This is intentional: the llama servers should never be reachable directly from the network; only the node agent should proxy to them.

This scaffold supports two patterns.
## Pattern A: Single host, proxy to localhost backends

- Run `llama-server` (or other OpenAI-compatible servers) on the host:
  - planner → `http://127.0.0.1:8011`
  - writer → `http://127.0.0.1:8012`
- Run the gateway:
  - either directly on the host (recommended for simplicity), or
  - in Docker with `network_mode: host` (Linux) if the upstream binds to 127.0.0.1
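For the Docker variant, a minimal compose sketch could look like this (the image name, config path, and mount are placeholders, not project-provided artifacts):

```yaml
services:
  gateway:
    image: rolemesh-gateway:latest   # placeholder image name
    network_mode: host               # lets the container reach 127.0.0.1 upstreams (Linux only)
    environment:
      ROLE_MESH_CONFIG: /etc/rolemesh/models.yaml
    volumes:
      - ./configs/models.yaml:/etc/rolemesh/models.yaml:ro
```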
## Pattern B: Multi-host (roles distributed across machines)

- Choose one machine to run the gateway (or run multiple gateways).
- Each backend host exposes an OpenAI-compatible server on the LAN, e.g.:
  - `http://10.0.0.12:8012` (writer)
  - `http://10.0.0.13:8011` (planner)
- Update `proxy_url` entries to those LAN URLs, **or** use discovery:
  - Set the model to `type: discovered` with `role: writer`, etc.
  - Choose `strategy: round_robin` or `strategy: random` per discovered alias.
  - Each host registers itself with the gateway.
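Putting the discovery bullets together, a discovered alias might be declared like this (field names follow the options above; the exact schema should be checked against the project's config reference):

```yaml
models:
  writer:
    type: discovered
    role: writer
    strategy: round_robin   # or: random
```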
### Minimal registration call

```bash
curl -sS -X POST http://GATEWAY:8000/v1/nodes/register \
  -H 'Content-Type: application/json' \
  -H 'X-RoleMesh-Node-Key: <node-key>' \
  -d '{"node_id":"gpu-box-1","base_url":"http://10.0.0.12:8012","roles":["writer"]}'
```
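If you script registration, it helps to build the JSON body from variables rather than hand-editing it. A small sketch (the variable names are ours; the payload fields match the call above):

```shell
# Assemble the registration payload from variables (fields as in the curl example).
NODE_ID="gpu-box-1"
BASE_URL="http://10.0.0.12:8012"
ROLES='["writer"]'
PAYLOAD=$(printf '{"node_id":"%s","base_url":"%s","roles":%s}' "$NODE_ID" "$BASE_URL" "$ROLES")
echo "$PAYLOAD"
```

Pass `"$PAYLOAD"` to `curl -d` exactly as in the registration call above.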
### Hardening checklist (recommended)

- Bind the gateway to localhost by default, and expose it explicitly only when needed.
- Configure API keys for:
  - inference endpoints via `auth.client_api_keys`
  - node registration and heartbeat via `auth.node_api_keys`
- Tune `ROLE_MESH_NODE_STALE_AFTER_S` for your heartbeat interval and failure tolerance.
- Consider mTLS if registration happens over untrusted networks.