RoleMesh-Gateway/docs/DEPLOYMENT.md

Deployment

Quick start: single host

This is the recommended first deployment because it lets you verify the gateway before introducing discovery or node agents.

1. Start local backends

Example using llamafile:

llamafile --server -m /path/to/planner-model.gguf --host 127.0.0.1 --port 8011 --nobrowser
llamafile --server -m /path/to/writer-model.gguf  --host 127.0.0.1 --port 8012 --nobrowser

2. Point the gateway at those backends

version: 1
default_model: planner
auth:
  client_api_keys:
    - "change-me-client-key"
models:
  planner:
    type: proxy
    openai_model_name: planner
    proxy_url: http://127.0.0.1:8011
  writer:
    type: proxy
    openai_model_name: writer
    proxy_url: http://127.0.0.1:8012

3. Run the gateway

ROLE_MESH_CONFIG=configs/models.yaml uvicorn rolemesh_gateway.main:app --host 127.0.0.1 --port 8000

4. Smoke test

curl -sS http://127.0.0.1:8000/v1/models \
  -H 'X-Api-Key: change-me-client-key'

curl -sS -X POST http://127.0.0.1:8000/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -H 'X-Api-Key: change-me-client-key' \
  -d '{
    "model": "planner",
    "messages": [{"role":"user","content":"Say hello in 3 words."}]
  }'

Readiness and model advertisement

  • GET /health only checks that the gateway process is up
  • GET /ready checks whether the configured default route is actually usable
  • GET /v1/models only lists aliases with a currently reachable upstream
  • aliases that are configured but currently unavailable are reported in rolemesh.unavailable_models
  • discovered nodes that have not checked in recently are marked stale and excluded from routing
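
The model advertisement can be consumed with a couple of jq filters. The payload below is an illustrative sample, not captured from a real gateway: it assumes the standard OpenAI list shape plus the rolemesh.unavailable_models extension described above.

```shell
# Illustrative response only; in practice fetch it with:
#   curl -sS http://127.0.0.1:8000/v1/models -H 'X-Api-Key: change-me-client-key'
response='{"object":"list","data":[{"id":"planner","object":"model"}],"rolemesh":{"unavailable_models":["writer"]}}'

# Aliases that currently have a reachable upstream
echo "$response" | jq -r '.data[].id'

# Aliases that are configured but whose upstream is currently down
echo "$response" | jq -r '.rolemesh.unavailable_models[]'
```

A monitoring script can alert whenever the second filter prints anything.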

Stale node timeout

Registered nodes age out of discovered-role routing after a heartbeat timeout.

  • default timeout: 30 seconds
  • configure with ROLE_MESH_NODE_STALE_AFTER_S
  • stale nodes remain visible for operators in the gateway metadata, but they no longer receive traffic
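
As a sketch, the timeout can be derived from the node heartbeat interval before starting the gateway. The "tolerate up to three missed heartbeats" rule of thumb is an assumption for illustration, not from this document:

```shell
# Assumption: tolerate up to three missed heartbeats before a node is
# treated as stale. With a 20 s heartbeat interval that gives 60 s.
HEARTBEAT_INTERVAL_S=20
export ROLE_MESH_NODE_STALE_AFTER_S=$((HEARTBEAT_INTERVAL_S * 3))
echo "ROLE_MESH_NODE_STALE_AFTER_S=$ROLE_MESH_NODE_STALE_AFTER_S"

# Then start the gateway as in the quick start:
#   ROLE_MESH_CONFIG=configs/models.yaml uvicorn rolemesh_gateway.main:app --host 127.0.0.1 --port 8000
```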

Worked example: multi-node role routing

For a concrete topology with:

  • gateway on one host
  • two node-agent processes on a dual-GPU machine
  • one additional node-agent on a second machine
  • roles planner, writer, and critic

see EXAMPLE_MULTI_NODE.md.

Network binding and exposure (Step 2 hardening)

The gateway and node-agent CLIs are safe by default: both bind to 127.0.0.1 (localhost) unless told otherwise. This prevents accidental public exposure during development.

If you need remote access:

  • Bind only to a LAN/private interface (e.g. 192.168.x.y, 10.x.y.z) and restrict ingress with a firewall/VPN.
  • Do not bind to 0.0.0.0 on an Internet-routable host.

On Linux with UFW, allow only a private subnet to reach the gateway (8000) and node agents (8091):

sudo ufw allow from 192.168.0.0/16 to any port 8000 proto tcp
sudo ufw allow from 192.168.0.0/16 to any port 8091 proto tcp
sudo ufw deny 8000/tcp
sudo ufw deny 8091/tcp

If you're using Tailscale/WireGuard, prefer binding to the VPN interface address and limiting rules to that interface/subnet.
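
For example, a sketch of binding to the tailnet address (`tailscale ip -4` prints this host's tailnet IPv4 address; the loopback fallback is just for illustration):

```shell
# Bind to the tailnet address when Tailscale is available, else stay on
# loopback so the gateway is never accidentally exposed.
if command -v tailscale >/dev/null 2>&1; then
  BIND_ADDR="$(tailscale ip -4)"
else
  BIND_ADDR=127.0.0.1
fi
echo "binding gateway to $BIND_ADDR"

# ROLE_MESH_CONFIG=configs/models.yaml uvicorn rolemesh_gateway.main:app --host "$BIND_ADDR" --port 8000
```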

Llama.cpp servers

The node agent starts persistent llama-server processes bound to localhost only (127.0.0.1). This is intentional: the llama servers should never be reachable directly from the network; only the node agent should proxy to them.
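
A quick way to verify this on the node host (assuming the backend ports from the quick start, 8011 and 8012) is to list the listening sockets and check that every match is on loopback:

```shell
# Print the local address of every listening TCP socket on the backend
# ports. Each printed address should start with 127.0.0.1; a 0.0.0.0 or
# LAN address here means a backend is network-reachable and misconfigured.
ss -ltn | awk '$4 ~ /:(8011|8012)$/ {print $4}'
```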

This scaffold supports two patterns.

Pattern A: Single host, proxy to localhost backends

  • Run llama-server (or other OpenAI-compatible servers) on the host:
    • planner → http://127.0.0.1:8011
    • writer → http://127.0.0.1:8012
  • Run gateway:
    • either directly on host (recommended for simplicity), or
    • in Docker with network_mode: host (Linux only), since the upstreams bind to 127.0.0.1

Pattern B: Multi-host (roles distributed across machines)

  • Choose one machine to run the gateway (or run multiple gateways)
  • Each backend host exposes an OpenAI-compatible server on LAN, e.g.:
    • http://10.0.0.12:8012 (writer)
    • http://10.0.0.13:8011 (planner)
  • Update proxy_url entries to those LAN URLs, or use discovery:
    • Set model to type: discovered with role: writer, etc.
    • Choose strategy: round_robin or strategy: random per discovered alias
    • Each host registers itself with the gateway.
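
As a sketch, a discovered alias in models.yaml could look like this. The type, role, and strategy fields follow the bullets above; openai_model_name is carried over from the proxy entries in the quick start and is an assumption here:

```yaml
models:
  writer:
    type: discovered
    openai_model_name: writer   # assumed, by analogy with the proxy entries
    role: writer                # route to nodes registered with this role
    strategy: round_robin       # or: random
```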

Minimal registration call

curl -sS -X POST http://GATEWAY:8000/v1/nodes/register \
  -H 'Content-Type: application/json' \
  -H 'X-RoleMesh-Node-Key: <node-key>' \
  -d '{"node_id":"gpu-box-1","base_url":"http://10.0.0.12:8012","roles":["writer"]}'

Hardening checklist

  • Bind the gateway to localhost by default, and expose it explicitly only when needed
  • Configure API keys for:
    • inference endpoints via auth.client_api_keys
    • node registration and heartbeat via auth.node_api_keys
  • Tune ROLE_MESH_NODE_STALE_AFTER_S for your heartbeat interval and failure tolerance
  • Consider mTLS if registration happens over untrusted networks