# RoleMesh Gateway
RoleMesh Gateway is a lightweight, OpenAI-compatible API gateway that routes chat-completions requests to multiple locally hosted LLM backends (e.g., llama.cpp's `llama-server`) by role (planner, writer, coder, reviewer, …).
It is designed for agentic workflows that benefit from using different models for different steps, and for deployments where different machines host different models (e.g., a GPU box for fast inference, a big-RAM CPU box for large models).
## What you get
- OpenAI-compatible endpoints: `GET /v1/models`, `POST /v1/chat/completions` (streaming and non-streaming), `GET /health`, and `GET /ready`
- Model registry from `configs/models.yaml`
- Optional node registration so remote machines can announce role backends to the gateway
- Robust proxying with explicit httpx timeouts (no “hang forever”)
- Structured logging with request IDs
## Roles Are Project-Defined
The role names in this repository are examples, not a fixed taxonomy.

- `planner`, `writer`, `coder`, and `reviewer` are only sample aliases
- you can add, remove, or rename roles per project
- a role is simply the `model` alias clients send to RoleMesh Gateway
- each role can point at any OpenAI-compatible backend that fits that project's workflow

Examples of project-specific roles: `researcher`, `summarizer`, `tool-user`, `swe-backend`, `swe-frontend`, `test-writer`, `security-reviewer`.
If your workflow changes, update the `models:` section of your config rather than treating the example roles as required.
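For example, adding a hypothetical `researcher` role is just another entry under `models:`; the alias, port, and defaults below are placeholders, not part of the shipped config:

```yaml
models:
  researcher:
    type: proxy
    openai_model_name: researcher
    proxy_url: http://127.0.0.1:8013  # any OpenAI-compatible backend
    defaults:
      temperature: 0.2
      max_tokens: 512
```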
## Quick Start
This is the fastest path to a working local setup.
### 1. Install

```bash
python -m venv .venv
source .venv/bin/activate
pip install -e .
```
### 2. Start two OpenAI-compatible backends

Any backend that exposes `GET /v1/models` and `POST /v1/chat/completions` will work.
One practical option is llamafile in server mode:

```bash
llamafile --server -m /path/to/planner-model.gguf --host 127.0.0.1 --port 8011 --nobrowser
llamafile --server -m /path/to/writer-model.gguf --host 127.0.0.1 --port 8012 --nobrowser
```
### 3. Create a gateway config

```yaml
version: 1
default_model: planner

auth:
  client_api_keys:
    - "change-me-client-key"

models:
  planner:
    type: proxy
    openai_model_name: planner
    proxy_url: http://127.0.0.1:8011
    defaults:
      temperature: 0
      max_tokens: 128
  writer:
    type: proxy
    openai_model_name: writer
    proxy_url: http://127.0.0.1:8012
    defaults:
      temperature: 0.6
      max_tokens: 256
```

Save that as `configs/models.yaml`.
You are not limited to `planner` and `writer`. Those are just placeholders for whatever roles your project needs.
### 4. Run the gateway

```bash
ROLE_MESH_CONFIG=configs/models.yaml uvicorn rolemesh_gateway.main:app --host 127.0.0.1 --port 8000
```
### 5. Verify it

```bash
curl -sS http://127.0.0.1:8000/v1/models \
  -H 'X-Api-Key: change-me-client-key'

curl -sS -X POST http://127.0.0.1:8000/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -H 'X-Api-Key: change-me-client-key' \
  -d '{
    "model": "planner",
    "messages": [{"role":"user","content":"Say hello in 3 words."}]
  }'
```
If you prefer the provided example file, copy `configs/models.example.yaml` and adjust the `proxy_url` values.
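Streaming uses the same endpoint; the request below simply adds the standard OpenAI `stream` flag (role and prompt are illustrative):

```bash
# -N disables curl's output buffering so streamed chunks appear as they arrive.
curl -sS -N -X POST http://127.0.0.1:8000/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -H 'X-Api-Key: change-me-client-key' \
  -d '{
    "model": "writer",
    "stream": true,
    "messages": [{"role":"user","content":"Write one sentence about gateways."}]
  }'
```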
## Known Good Inference Backends
The gateway is designed to work with any backend that exposes OpenAI-compatible `GET /v1/models` and `POST /v1/chat/completions` endpoints. The following applications have been exercised successfully in this repository.
### Ollama
- Verified directly against `http://127.0.0.1:11434`
- Verified through RoleMesh Gateway proxy routing
- Tested with model `dolphin3:latest`
Example upstream:

```yaml
models:
  planner:
    type: proxy
    openai_model_name: planner
    proxy_url: http://127.0.0.1:11434
    defaults:
      model: dolphin3:latest
```
Note: when proxying to Ollama's OpenAI-compatible API, the upstream Ollama model name still needs to be supplied.
One simple pattern is to set it in `defaults.model` and let the gateway inject it.
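Under that pattern, a client keeps sending the role alias and never needs to know the Ollama model name; a minimal sketch against the Quick Start gateway (key and prompt are placeholders):

```bash
# The client asks for the "planner" role; with the config above, the
# gateway injects defaults.model (dolphin3:latest) before forwarding.
curl -sS -X POST http://127.0.0.1:8000/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -H 'X-Api-Key: change-me-client-key' \
  -d '{"model": "planner", "messages": [{"role":"user","content":"ping"}]}'
```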
### Llamafile
- Verified directly with the newer `llamafile` runner in `tmp-codex/llamafile`
- Verified through RoleMesh Gateway proxy routing
- Verified role switching between two live backends
- Tested successfully with `phi-2.Q5_K_M.llamafile` and `rocket-3b.Q5_K_M.llamafile`
Example launch:

```bash
./llamafile --server -m /path/to/model.gguf --host 127.0.0.1 --port 8011 --nobrowser
```
### llama.cpp / llama-server
- Verified live through the RoleMesh Node Agent on NVIDIA GPUs
- Tested with `/home/netuser/bin/llama.cpp/build/bin/llama-server`
- Tested model load and model switching on Tesla P40 GPUs
- Tested successfully with `gemma-2b-it-q8_0.gguf` and `Mistral-7B-Instruct-v0.3-Q5_K_M.gguf`
The node agent now waits for `llama-server` readiness during model load or model switch before proxying the first request, which avoids transient "Loading model" failures on cold start.
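For intuition, a rough shell equivalent of that readiness gate, using only the OpenAI-compatible surface this README already assumes (URL and retry budget are placeholders):

```bash
# Illustrative only: poll the upstream's /v1/models until it answers,
# then consider the backend ready for traffic.
UPSTREAM=http://127.0.0.1:8011
for _ in $(seq 1 60); do
  if curl -fsS "$UPSTREAM/v1/models" >/dev/null 2>&1; then
    echo "backend ready"
    break
  fi
  sleep 1
done
```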
## Multi-host (node registration)
If you want machines to host backends and “register” them dynamically, run a tiny node agent on each backend host (or just call the registration endpoint from your own tooling).
- Gateway endpoint: `POST /v1/nodes/register`
- Node payload describes which roles it serves and the base URL to reach its OpenAI-compatible backend.
See `docs/DEPLOYMENT.md` and `docs/CONFIG.md`.
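As a shape-only sketch of a registration call, the field names below (`node_id`, `base_url`, `roles`) are illustrative assumptions, not the documented schema; see `docs/CONFIG.md` for the real payload:

```bash
# Hypothetical payload shape; check docs/CONFIG.md for the actual fields.
# Node API-key auth may also apply (see Status below).
curl -sS -X POST http://127.0.0.1:8000/v1/nodes/register \
  -H 'Content-Type: application/json' \
  -d '{
    "node_id": "gpu-box-1",
    "base_url": "http://192.168.1.50:8011",
    "roles": ["coder", "reviewer"]
  }'
```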
## Status
This repository is a preliminary scaffold:
- Proxying to OpenAI-compatible upstreams works.
- Registration and load-selection are implemented (basic round-robin).
- API-key auth for clients and nodes is available.
- Persistence is basic JSON-backed state, not a full service registry.
- Gateway proxying has been exercised live with Ollama and `llamafile`.
- Node-agent managed inference has been exercised live with `llama-server` on CUDA hardware.
## Availability Semantics
RoleMesh Gateway now distinguishes between configured aliases and currently usable aliases.
- `GET /v1/models` advertises only aliases whose upstreams are reachable right now
- unavailable aliases are reported under `rolemesh.unavailable_models`
- `GET /ready` returns `200` only when the configured `default_model` is currently usable
- `GET /health` remains a lightweight process health check and does not probe upstreams
- discovered nodes are removed from routing once they become stale
This makes the API surface more truthful for clients that rely on the advertised role list.
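A quick way to see how the two probes differ, assuming the Quick Start setup on port 8000:

```bash
# /health reports process liveness only; it does not probe upstreams.
curl -sS -o /dev/null -w '%{http_code}\n' http://127.0.0.1:8000/health

# /ready returns 200 only while the default_model's upstream is usable.
curl -sS -o /dev/null -w '%{http_code}\n' http://127.0.0.1:8000/ready
```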
By default, registered nodes become stale after 30 seconds without a fresh heartbeat or registration event.
You can change that with `ROLE_MESH_NODE_STALE_AFTER_S`.
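For example, to tolerate two minutes between heartbeats, set it when launching the gateway (same command as the Quick Start):

```bash
ROLE_MESH_NODE_STALE_AFTER_S=120 \
ROLE_MESH_CONFIG=configs/models.yaml \
uvicorn rolemesh_gateway.main:app --host 127.0.0.1 --port 8000
```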
## License
MIT. See `LICENSE`.
## Node Agent (per-host)
This repo also includes a RoleMesh Node Agent (`rolemesh-node-agent`) that can manage persistent llama.cpp servers (one per GPU) and report inventory/metrics back to the gateway.
- Sample config: `configs/node_agent.example.yaml`
- Docs: `docs/NODE_AGENT.md`
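A plausible launch, assuming the entry point accepts a config path; the `--config` flag is an assumption, so check `docs/NODE_AGENT.md` for the actual CLI:

```bash
# Hypothetical invocation; see docs/NODE_AGENT.md for the real flags.
rolemesh-node-agent --config configs/node_agent.example.yaml
```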
## Safe-by-default binding
Gateway and node-agent default to binding on `127.0.0.1` to avoid accidental exposure. If you need remote access, bind only to private/LAN or VPN interfaces and firewall the ports.
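For example, to expose the gateway on a private LAN address (the address below is a placeholder; pair it with firewall rules):

```bash
ROLE_MESH_CONFIG=configs/models.yaml uvicorn rolemesh_gateway.main:app --host 192.168.1.10 --port 8000
```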
