# Deployment

## Quick start: single host

This is the recommended first deployment because it lets you verify the gateway before introducing discovery or node agents.

### 1. Start local backends

Example using `llamafile`:

```bash
llamafile --server -m /path/to/planner-model.gguf --host 127.0.0.1 --port 8011 --nobrowser
llamafile --server -m /path/to/writer-model.gguf --host 127.0.0.1 --port 8012 --nobrowser
```

### 2. Point the gateway at those backends

```yaml
version: 1
default_model: planner

auth:
  client_api_keys:
    - "change-me-client-key"

models:
  planner:
    type: proxy
    openai_model_name: planner
    proxy_url: http://127.0.0.1:8011
  writer:
    type: proxy
    openai_model_name: writer
    proxy_url: http://127.0.0.1:8012
```

### 3. Run the gateway

```bash
ROLE_MESH_CONFIG=configs/models.yaml uvicorn rolemesh_gateway.main:app --host 127.0.0.1 --port 8000
```

### 4. Smoke test

```bash
curl -sS http://127.0.0.1:8000/v1/models \
  -H 'X-Api-Key: change-me-client-key'

curl -sS -X POST http://127.0.0.1:8000/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -H 'X-Api-Key: change-me-client-key' \
  -d '{
    "model": "planner",
    "messages": [{"role":"user","content":"Say hello in 3 words."}]
  }'
```

### Readiness and model advertisement

- `GET /health` only checks that the gateway process is up
- `GET /ready` checks whether the configured default route is actually usable
- `GET /v1/models` only lists aliases with a currently reachable upstream
- aliases that are configured but currently unavailable are reported in `rolemesh.unavailable_models`
- discovered nodes that have not checked in recently are marked stale and excluded from routing

### Stale node timeout

Registered nodes age out of discovered-role routing after a heartbeat timeout.
- default timeout: `30` seconds
- configure with `ROLE_MESH_NODE_STALE_AFTER_S`
- stale nodes remain visible to operators in the gateway metadata, but they no longer receive traffic

## Network binding and exposure (Step 2 hardening)

**Safe by default:** the gateway and node-agent CLIs bind to `127.0.0.1` (localhost) unless told otherwise. This prevents accidental public exposure during development.

If you need remote access:

- Bind **only** to a **LAN/private** interface (e.g. `192.168.x.y`, `10.x.y.z`) and restrict ingress with a firewall/VPN.
- Do **not** bind to `0.0.0.0` on an Internet-routable host.

### Recommended firewall policy (examples)

Linux (UFW): allow only a private subnet to reach the gateway (8080) and node agents (8091):

```bash
sudo ufw allow from 192.168.0.0/16 to any port 8080 proto tcp
sudo ufw allow from 192.168.0.0/16 to any port 8091 proto tcp
sudo ufw deny 8080/tcp
sudo ufw deny 8091/tcp
```

If you're using Tailscale/WireGuard, prefer binding to the VPN interface address and limiting firewall rules to that interface/subnet.

### Llama.cpp servers

The node agent starts persistent `llama-server` processes bound to **localhost only** (`127.0.0.1`). This is intentional: the llama servers should never be reachable directly from the network; only the node agent should proxy to them.

This scaffold supports two patterns.
## Pattern A: Single host, proxy to localhost backends

- Run `llama-server` (or other OpenAI-compatible servers) on the host:
  - planner → `http://127.0.0.1:8011`
  - writer → `http://127.0.0.1:8012`
- Run the gateway:
  - either directly on the host (recommended for simplicity), or
  - in Docker with `network_mode: host` (Linux) if the upstream binds to `127.0.0.1`

## Pattern B: Multi-host (roles distributed across machines)

- Choose one machine to run the gateway (or run multiple gateways)
- Each backend host exposes an OpenAI-compatible server on the LAN, e.g.:
  - `http://10.0.0.12:8012` (writer)
  - `http://10.0.0.13:8011` (planner)
- Update `proxy_url` entries to those LAN URLs, **or** use discovery:
  - set the model to `type: discovered` with `role: writer`, etc.
  - choose `strategy: round_robin` or `strategy: random` per discovered alias
  - each host registers itself with the gateway

### Minimal registration call

```bash
curl -sS -X POST http://GATEWAY:8000/v1/nodes/register \
  -H 'Content-Type: application/json' \
  -H 'X-RoleMesh-Node-Key: change-me-node-key' \
  -d '{"node_id":"gpu-box-1","base_url":"http://10.0.0.12:8012","roles":["writer"]}'
```

### Hardening checklist (recommended)

- Bind the gateway to localhost by default, and explicitly expose it only when needed
- Configure API keys for:
  - inference endpoints via `auth.client_api_keys`
  - node registration and heartbeat via `auth.node_api_keys`
- Tune `ROLE_MESH_NODE_STALE_AFTER_S` for your heartbeat interval and failure tolerance
- Consider mTLS if registration happens over untrusted networks
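As a companion to the curl registration call, here is a minimal Python sketch of a node registering itself. The endpoint path, JSON body, and `X-RoleMesh-Node-Key` header come from the example above; `build_registration` and `register` are illustrative helper names, and re-registering on a timer as the heartbeat is an assumption (the heartbeat endpoint itself is not shown in this document).

```python
import json
import urllib.request

def build_registration(node_id: str, base_url: str, roles: list[str]) -> dict:
    """Build the JSON body used by POST /v1/nodes/register."""
    return {"node_id": node_id, "base_url": base_url, "roles": roles}

def register(gateway: str, node_key: str, payload: dict) -> int:
    """POST the registration payload to the gateway; returns the HTTP status."""
    req = urllib.request.Request(
        f"{gateway}/v1/nodes/register",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            # Must match a key configured under auth.node_api_keys.
            "X-RoleMesh-Node-Key": node_key,
        },
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return resp.status
```

A node would call, for example, `register("http://GATEWAY:8000", "change-me-node-key", build_registration("gpu-box-1", "http://10.0.0.12:8012", ["writer"]))` and repeat it on an interval shorter than `ROLE_MESH_NODE_STALE_AFTER_S` so it keeps checking in before the gateway marks it stale.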