# Deployment

## Quick start: single host
This is the recommended first deployment because it lets you verify the gateway before introducing discovery or node agents.
### 1. Start local backends

Example using llamafile:

```bash
llamafile --server -m /path/to/planner-model.gguf --host 127.0.0.1 --port 8011 --nobrowser
llamafile --server -m /path/to/writer-model.gguf --host 127.0.0.1 --port 8012 --nobrowser
```
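Before wiring up the gateway, you can confirm each backend answers on its own. This assumes the llamafile servers expose the OpenAI-compatible `/v1/models` endpoint, which llama.cpp-based servers generally do:

```bash
curl -sS http://127.0.0.1:8011/v1/models
curl -sS http://127.0.0.1:8012/v1/models
```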
### 2. Point the gateway at those backends

```yaml
version: 1
default_model: planner

auth:
  client_api_keys:
    - "change-me-client-key"

models:
  planner:
    type: proxy
    openai_model_name: planner
    proxy_url: http://127.0.0.1:8011
  writer:
    type: proxy
    openai_model_name: writer
    proxy_url: http://127.0.0.1:8012
```
### 3. Run the gateway

```bash
ROLE_MESH_CONFIG=configs/models.yaml uvicorn rolemesh_gateway.main:app --host 127.0.0.1 --port 8000
```
### 4. Smoke test

```bash
curl -sS http://127.0.0.1:8000/v1/models \
  -H 'X-Api-Key: change-me-client-key'

curl -sS -X POST http://127.0.0.1:8000/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -H 'X-Api-Key: change-me-client-key' \
  -d '{
    "model": "planner",
    "messages": [{"role":"user","content":"Say hello in 3 words."}]
  }'
```
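If you have `jq` installed, the reply text can be pulled out of the standard OpenAI-style response envelope (a convenience sketch; it assumes the upstream returns the usual `choices` array):

```bash
curl -sS -X POST http://127.0.0.1:8000/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -H 'X-Api-Key: change-me-client-key' \
  -d '{"model":"planner","messages":[{"role":"user","content":"Say hello in 3 words."}]}' \
  | jq -r '.choices[0].message.content'
```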
## Readiness and model advertisement

- `GET /health` only checks that the gateway process is up
- `GET /ready` checks whether the configured default route is actually usable
- `GET /v1/models` only lists aliases with a currently reachable upstream
- aliases that are configured but currently unavailable are reported in `rolemesh.unavailable_models`
- discovered nodes that have not checked in recently are marked stale and excluded from routing
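A client can split that model listing into routable and unavailable aliases. This is a hedged sketch: the OpenAI-style list envelope and the nested placement of `unavailable_models` under a `rolemesh` key are assumptions; only the key name and the behaviour (unreachable aliases are excluded from the listing) come from this document.

```python
# Sample response shape (assumed): OpenAI-style list plus RoleMesh metadata.
sample_response = {
    "object": "list",
    "data": [{"id": "planner", "object": "model"}],
    "rolemesh": {"unavailable_models": ["writer"]},
}

# Aliases in "data" have a reachable upstream; the rest are advertised
# separately so operators can see configured-but-down models.
routable = [m["id"] for m in sample_response["data"]]
unavailable = sample_response.get("rolemesh", {}).get("unavailable_models", [])

print(routable)      # ['planner']
print(unavailable)   # ['writer']
```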
## Stale node timeout

Registered nodes age out of discovered-role routing after a heartbeat timeout.

- default timeout: 30 seconds
- configure with `ROLE_MESH_NODE_STALE_AFTER_S`
- stale nodes remain visible to operators in the gateway metadata, but they no longer receive traffic
## Network binding and exposure (Step 2 hardening)

The defaults are safe: the gateway and node-agent CLIs bind to 127.0.0.1 (localhost) unless told otherwise. This prevents accidental public exposure during development.

If you need remote access:

- Bind only to a LAN/private interface (e.g. `192.168.x.y`, `10.x.y.z`) and restrict ingress with a firewall/VPN.
- Do not bind to `0.0.0.0` on an Internet-routable host.
### Recommended firewall policy (examples)

Linux (UFW): allow only a private subnet to reach the gateway (8080) and node agents (8091):

```bash
sudo ufw allow from 192.168.0.0/16 to any port 8080 proto tcp
sudo ufw allow from 192.168.0.0/16 to any port 8091 proto tcp
sudo ufw deny 8080/tcp
sudo ufw deny 8091/tcp
```
If you're using Tailscale/WireGuard, prefer binding to the VPN interface address and limiting rules to that interface/subnet.
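With Tailscale, for example, the gateway can be bound directly to the node's VPN address (a sketch; assumes Tailscale is installed and connected, and that `tailscale ip -4` prints this node's Tailscale IPv4 address):

```bash
ROLE_MESH_CONFIG=configs/models.yaml uvicorn rolemesh_gateway.main:app \
  --host "$(tailscale ip -4)" --port 8000
```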
## Llama.cpp servers

The node agent starts persistent llama-server processes bound to localhost only (127.0.0.1). This is intentional: the llama servers should never be reachable directly from the network; only the node agent should proxy to them.
This scaffold supports two patterns.
### Pattern A: Single host, proxy to localhost backends

- Run `llama-server` (or other OpenAI-compatible servers) on the host:
  - planner → `http://127.0.0.1:8011`
  - writer → `http://127.0.0.1:8012`
- Run the gateway:
  - either directly on the host (recommended for simplicity), or
  - in Docker with `network_mode: host` (Linux) if the upstream binds to 127.0.0.1
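The Docker variant might look like the compose fragment below. The image name, config path, and volume layout are assumptions for illustration; only the `network_mode: host` requirement comes from the text above.

```yaml
# docker-compose sketch for Pattern A (image name and paths are assumptions)
services:
  gateway:
    image: rolemesh-gateway:local
    network_mode: host   # Linux only; lets the container reach 127.0.0.1 upstreams
    environment:
      ROLE_MESH_CONFIG: /app/configs/models.yaml
    volumes:
      - ./configs:/app/configs:ro
```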
### Pattern B: Multi-host (roles distributed across machines)

- Choose one machine to run the gateway (or run multiple gateways)
- Each backend host exposes an OpenAI-compatible server on the LAN, e.g.:
  - `http://10.0.0.12:8012` (writer)
  - `http://10.0.0.13:8011` (planner)
- Update `proxy_url` entries to those LAN URLs, or use discovery:
  - Set the model to `type: discovered` with `role: writer`, etc.
  - Choose `strategy: round_robin` or `strategy: random` per discovered alias
  - Each host registers itself with the gateway.
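Assembled from the options named above, a discovery-based `models` section might look like this (a sketch; the exact schema may differ from your gateway version):

```yaml
models:
  writer:
    type: discovered
    role: writer
    strategy: round_robin
  planner:
    type: discovered
    role: planner
    strategy: random
```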
### Minimal registration call

```bash
curl -sS -X POST http://GATEWAY:8000/v1/nodes/register \
  -H 'Content-Type: application/json' \
  -H 'X-RoleMesh-Node-Key: <node-key>' \
  -d '{"node_id":"gpu-box-1","base_url":"http://10.0.0.12:8012","roles":["writer"]}'
```
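If you script registration from the node side, the body can be built programmatically. The field names (`node_id`, `base_url`, `roles`) come from the curl example above; the helper itself is hypothetical convenience code, not part of the gateway API.

```python
import json

def registration_payload(node_id: str, base_url: str, roles: list[str]) -> str:
    """Serialize the registration body shown in the curl example."""
    return json.dumps({"node_id": node_id, "base_url": base_url, "roles": roles})

body = registration_payload("gpu-box-1", "http://10.0.0.12:8012", ["writer"])
print(body)
```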
## Hardening checklist (recommended)

- Bind the gateway to localhost by default, and explicitly expose it when needed
- Configure API keys for:
  - inference endpoints via `auth.client_api_keys`
  - node registration and heartbeat via `auth.node_api_keys`
- Tune `ROLE_MESH_NODE_STALE_AFTER_S` for your heartbeat interval and failure tolerance
- Consider mTLS if registration happens over untrusted networks
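When filling in `auth.client_api_keys` and `auth.node_api_keys`, use high-entropy values rather than placeholders like `change-me-client-key`. This document does not specify a key format, so the sketch below assumes keys are opaque strings and any random URL-safe token works:

```python
import secrets

# 32 bytes of entropy, URL-safe base64 encoded (~43 characters each).
client_key = secrets.token_urlsafe(32)
node_key = secrets.token_urlsafe(32)

print(client_key)
print(node_key)
```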