# RoleMesh Gateway

![RoleMesh Gateway logo](artwork/rolemesh_gateway_logo.png)

RoleMesh Gateway is a lightweight **OpenAI-compatible** API gateway that routes chat-completions requests to multiple locally hosted LLM backends (e.g., `llama.cpp`'s `llama-server`) **by role** (planner, writer, coder, reviewer, …). It is designed for **agentic workflows** that benefit from using different models for different steps, and for deployments where **different machines host different models** (e.g., a GPU box for fast inference and a big-RAM CPU box for large models).

## What you get

- OpenAI-compatible endpoints:
  - `GET /v1/models`
  - `POST /v1/chat/completions` (streaming and non-streaming)
- `GET /health` and `GET /ready`
- Model registry from `configs/models.yaml`
- Optional **node registration** so remote machines can announce role backends to the gateway
- Robust proxying with **explicit httpx timeouts** (no “hang forever”)
- Structured logging with request IDs

## Quick Start

This is the fastest path to a working local setup.

### 1. Install

```bash
python -m venv .venv
source .venv/bin/activate
pip install -e .
```

### 2. Start two OpenAI-compatible backends

Any backend that exposes `GET /v1/models` and `POST /v1/chat/completions` will work. One practical option is `llamafile` in server mode:

```bash
llamafile --server -m /path/to/planner-model.gguf --host 127.0.0.1 --port 8011 --nobrowser
llamafile --server -m /path/to/writer-model.gguf --host 127.0.0.1 --port 8012 --nobrowser
```

### 3. Create a gateway config

```yaml
version: 1
default_model: planner

auth:
  client_api_keys:
    - "change-me-client-key"

models:
  planner:
    type: proxy
    openai_model_name: planner
    proxy_url: http://127.0.0.1:8011
    defaults:
      temperature: 0
      max_tokens: 128
  writer:
    type: proxy
    openai_model_name: writer
    proxy_url: http://127.0.0.1:8012
    defaults:
      temperature: 0.6
      max_tokens: 256
```

Save that as `configs/models.yaml`.

### 4. Run the gateway

```bash
ROLE_MESH_CONFIG=configs/models.yaml uvicorn rolemesh_gateway.main:app --host 127.0.0.1 --port 8000
```

### 5. Verify it

```bash
curl -sS http://127.0.0.1:8000/v1/models \
  -H 'X-Api-Key: change-me-client-key'
```

```bash
curl -sS -X POST http://127.0.0.1:8000/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -H 'X-Api-Key: change-me-client-key' \
  -d '{
    "model": "planner",
    "messages": [{"role":"user","content":"Say hello in 3 words."}]
  }'
```

If you prefer the provided example file, copy `configs/models.example.yaml` and adjust the `proxy_url` values.

## Known Good Inference Backends

The gateway is designed to work with any backend that exposes OpenAI-compatible `GET /v1/models` and `POST /v1/chat/completions` endpoints. The following applications have been exercised successfully in this repository.

### Ollama

- Verified directly against `http://127.0.0.1:11434`
- Verified through RoleMesh Gateway proxy routing
- Tested with model `dolphin3:latest`

Example upstream:

```yaml
models:
  planner:
    type: proxy
    openai_model_name: planner
    proxy_url: http://127.0.0.1:11434
    defaults:
      model: dolphin3:latest
```

Note: when proxying to Ollama's OpenAI-compatible API, the upstream Ollama model name still needs to be supplied. One simple pattern is to set it in `defaults.model` and let the gateway inject it.
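With that pattern, clients keep addressing the role name and never need to know the Ollama model. As a minimal sketch (assuming the Quick Start gateway on port 8000 and the `planner` entry above):

```bash
# The client sends only the role name; per the defaults.model pattern
# above, the gateway is expected to inject "model": "dolphin3:latest"
# into the body it forwards to http://127.0.0.1:11434/v1/chat/completions.
curl -sS -X POST http://127.0.0.1:8000/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -H 'X-Api-Key: change-me-client-key' \
  -d '{
    "model": "planner",
    "messages": [{"role":"user","content":"Outline a 3-step plan."}]
  }'
```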
### Llamafile

- Verified directly with the newer `llamafile` runner in `tmp-codex/llamafile`
- Verified through RoleMesh Gateway proxy routing
- Verified role switching between two live backends
- Tested successfully with:
  - `phi-2.Q5_K_M.llamafile`
  - `rocket-3b.Q5_K_M.llamafile`

Example launch:

```bash
./llamafile --server -m /path/to/model.gguf --host 127.0.0.1 --port 8011 --nobrowser
```

### llama.cpp / llama-server

- Verified live through the RoleMesh Node Agent on NVIDIA GPUs
- Tested with `/home/netuser/bin/llama.cpp/build/bin/llama-server`
- Tested model load and model switching on Tesla P40 GPUs
- Tested successfully with:
  - `gemma-2b-it-q8_0.gguf`
  - `Mistral-7B-Instruct-v0.3-Q5_K_M.gguf`

The node agent now waits for `llama-server` readiness during model load or model switch before proxying the first request, which avoids transient "Loading model" failures on cold start.

## Multi-host (node registration)

If you want machines to host backends and “register” them dynamically, run a tiny node agent on each backend host (or just call the registration endpoint from your own tooling).

- Gateway endpoint: `POST /v1/nodes/register`
- The node payload describes which **roles** the node serves and the base URL where its OpenAI-compatible backend can be reached.

See `docs/DEPLOYMENT.md` and `docs/CONFIG.md`.

## Status

This repository is a **preliminary scaffold**:

- Proxying to OpenAI-compatible upstreams works.
- Registration and load-selection are implemented (basic round-robin).
- API-key auth for clients and nodes is available.
- Persistence is basic JSON-backed state, not a full service registry.
- Gateway proxying has been exercised live with Ollama and `llamafile`.
- Node-agent managed inference has been exercised live with `llama-server` on CUDA hardware.

## License

MIT. See `LICENSE`.

## Node Agent (per-host)

This repo also includes a **RoleMesh Node Agent** (`rolemesh-node-agent`) that can manage **persistent** `llama.cpp` servers (one per GPU) and report inventory/metrics back to the gateway.

- Sample config: `configs/node_agent.example.yaml`
- Docs: `docs/NODE_AGENT.md`

## Safe-by-default binding

The gateway and node agent default to binding on `127.0.0.1` to avoid accidental exposure. If you need remote access, bind only to private/LAN or VPN interfaces and firewall the ports, for example as sketched below.
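A minimal sketch of that advice (the LAN address `192.168.1.10`, the subnet `192.168.1.0/24`, and `ufw` are assumptions; substitute your own interface and firewall):

```bash
# Bind the gateway to one private interface rather than 0.0.0.0.
ROLE_MESH_CONFIG=configs/models.yaml uvicorn rolemesh_gateway.main:app \
  --host 192.168.1.10 --port 8000

# Permit only the trusted subnet to reach the port (ufw shown as one
# option; any firewall that restricts source addresses works).
sudo ufw allow from 192.168.1.0/24 to any port 8000 proto tcp
```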