# RoleMesh Gateway

RoleMesh Gateway is a lightweight **OpenAI-compatible** API gateway for routing chat-completions requests to multiple locally hosted LLM backends (e.g., `llama.cpp`'s `llama-server`) **by role** (planner, writer, coder, reviewer, …).

It is designed for **agentic workflows** that benefit from using different models for different steps, and for deployments where **different machines host different models** (e.g., a GPU box for fast inference, a big-RAM CPU box for large models).

## What you get

- OpenAI-compatible endpoints:
  - `GET /v1/models`
  - `POST /v1/chat/completions` (streaming and non-streaming)
  - `GET /health` and `GET /ready`
- Model registry from `configs/models.yaml`
- Optional **node registration** so remote machines can announce role backends to the gateway
- Robust proxying with **explicit httpx timeouts** (no “hang forever”)
- Structured logging with request IDs

## Roles Are Project-Defined

The role names in this repository are examples, not a fixed taxonomy.

- `planner`, `writer`, `coder`, and `reviewer` are only sample aliases
- you can add, remove, or rename roles per project
- a role is simply the `model` alias clients send to RoleMesh Gateway
- each role can point at any OpenAI-compatible backend that fits that project's workflow

Examples of project-specific roles:

- `researcher`
- `summarizer`
- `tool-user`
- `swe-backend`
- `swe-frontend`
- `test-writer`
- `security-reviewer`

If your workflow changes, update the `models:` section in config rather than treating the example roles as required.

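As a sketch, a project could define its own aliases directly in the `models:` section; the role names below are taken from the example list above, and the ports are placeholders, not part of the shipped example config:

```yaml
models:
  researcher:
    type: proxy
    proxy_url: http://127.0.0.1:8021   # placeholder port
  test-writer:
    type: proxy
    proxy_url: http://127.0.0.1:8022   # placeholder port
```

Clients then send `"model": "researcher"` or `"model": "test-writer"` in their requests, exactly as they would with the sample roles.
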
## Where model weights are defined

There are two different patterns in this project, and the model-weight location is defined in different places depending on which one you use.

### Proxy mode

In gateway proxy mode, the gateway does **not** point directly to a GGUF or other weight file.
It only points to an upstream inference server:

```yaml
models:
  planner:
    type: proxy
    proxy_url: http://127.0.0.1:8011
```

In that setup, the actual model weights are chosen by the upstream server itself. Examples:

- `llamafile --server -m /path/to/model.gguf ...`
- `llama-server -m /path/to/model.gguf ...`
- Ollama with `defaults.model: some-model-name`

So in proxy mode:

- RoleMesh alias `planner` -> upstream server at `proxy_url`
- upstream server -> actual weight file or model name

| Upstream type | Where weights/model are chosen | What RoleMesh config provides |
| --- | --- | --- |
| `llamafile --server` | CLI `-m /path/to/model.gguf` when the server starts | `proxy_url` |
| `llama-server` | CLI `-m /path/to/model.gguf` when the server starts | `proxy_url` |
| Ollama OpenAI-compatible API | request body `model`, often injected via `defaults.model` | `proxy_url` plus optional `defaults.model` |

### Node-agent mode

In node-agent mode, the weight file is defined explicitly in the node-agent config:

```yaml
models:
  - model_id: "planner-gguf"
    path: "/models/SomePlannerModel.Q5_K_M.gguf"
    roles: ["planner"]
```

In that setup:

- `model_id` is the model name exposed by the node agent
- `path` is the actual GGUF weight file to load
- `roles` are the role labels that node can serve if used with discovery

So in node-agent mode:

- node-agent `model_id` -> exact weight file path via `path`
- gateway discovered alias -> node role -> node-agent model load

## Quick Start

This is the fastest path to a working local setup.

### 1. Install

```bash
python -m venv .venv
source .venv/bin/activate
pip install -e .
```

### 2. Start two OpenAI-compatible backends

Any backend that exposes `GET /v1/models` and `POST /v1/chat/completions` will work. One practical option is `llamafile` in server mode:

```bash
llamafile --server -m /path/to/planner-model.gguf --host 127.0.0.1 --port 8011 --nobrowser
llamafile --server -m /path/to/writer-model.gguf --host 127.0.0.1 --port 8012 --nobrowser
```

### 3. Create a gateway config

```yaml
version: 1
default_model: planner
auth:
  client_api_keys:
    - "change-me-client-key"
models:
  planner:
    type: proxy
    openai_model_name: planner
    proxy_url: http://127.0.0.1:8011
    defaults:
      temperature: 0
      max_tokens: 128
  writer:
    type: proxy
    openai_model_name: writer
    proxy_url: http://127.0.0.1:8012
    defaults:
      temperature: 0.6
      max_tokens: 256
```

Save that as `configs/models.yaml`.

You are not limited to `planner` and `writer`. Those are just placeholders for whatever roles your project needs. In this proxy example, the actual weight files are defined by the two backend processes started in step 2, not by the gateway config.

### 4. Run the gateway

```bash
ROLE_MESH_CONFIG=configs/models.yaml uvicorn rolemesh_gateway.main:app --host 127.0.0.1 --port 8000
```

### 5. Verify it

```bash
curl -sS http://127.0.0.1:8000/v1/models \
  -H 'X-Api-Key: change-me-client-key'
```

```bash
curl -sS -X POST http://127.0.0.1:8000/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -H 'X-Api-Key: change-me-client-key' \
  -d '{
    "model": "planner",
    "messages": [{"role":"user","content":"Say hello in 3 words."}]
  }'
```

If you prefer the provided example file, copy `configs/models.example.yaml` and adjust the `proxy_url` values.

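Streaming can be verified on the same endpoint; a minimal sketch, assuming the standard OpenAI `"stream": true` request field (the `-N` flag stops curl from buffering the streamed chunks):

```bash
curl -sS -N -X POST http://127.0.0.1:8000/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -H 'X-Api-Key: change-me-client-key' \
  -d '{
    "model": "writer",
    "stream": true,
    "messages": [{"role":"user","content":"Write one short sentence."}]
  }'
```
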
## Known Good Inference Backends

The gateway is designed to work with any backend that exposes OpenAI-compatible `GET /v1/models` and `POST /v1/chat/completions` endpoints. The following applications have been exercised successfully in this repository.

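A quick way to confirm an upstream is compatible before wiring it into the gateway is to call those endpoints on the backend directly; the port here assumes the first Quick Start backend:

```bash
curl -sS http://127.0.0.1:8011/v1/models
```
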
### Ollama

- Verified directly against `http://127.0.0.1:11434`
- Verified through RoleMesh Gateway proxy routing
- Tested with model `dolphin3:latest`

Example upstream:

```yaml
models:
  planner:
    type: proxy
    openai_model_name: planner
    proxy_url: http://127.0.0.1:11434
    defaults:
      model: dolphin3:latest
```

Note: when proxying to Ollama's OpenAI-compatible API, the upstream Ollama model name still needs to be supplied. One simple pattern is to set it in `defaults.model` and let the gateway inject it.

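With that pattern, clients keep sending the RoleMesh alias and the gateway fills in the Ollama model name. For example, assuming the gateway from the Quick Start is listening on port 8000:

```bash
curl -sS -X POST http://127.0.0.1:8000/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -H 'X-Api-Key: change-me-client-key' \
  -d '{"model": "planner", "messages": [{"role":"user","content":"Hello"}]}'
```
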
### Llamafile

- Verified directly with the newer `llamafile` runner in `tmp-codex/llamafile`
- Verified through RoleMesh Gateway proxy routing
- Verified role switching between two live backends
- Tested successfully with:
  - `phi-2.Q5_K_M.llamafile`
  - `rocket-3b.Q5_K_M.llamafile`

Example launch:

```bash
./llamafile --server -m /path/to/model.gguf --host 127.0.0.1 --port 8011 --nobrowser
```

In this case, `/path/to/model.gguf` is where the actual weights are chosen, and RoleMesh only points to that running server.

### llama.cpp / llama-server

- Verified live through the RoleMesh Node Agent on NVIDIA GPUs
- Tested with `/home/netuser/bin/llama.cpp/build/bin/llama-server`
- Tested model load and model switching on Tesla P40 GPUs
- Tested successfully with:
  - `gemma-2b-it-q8_0.gguf`
  - `Mistral-7B-Instruct-v0.3-Q5_K_M.gguf`

The node agent now waits for `llama-server` readiness during model load or model switch before proxying the first request, which avoids transient "Loading model" failures on cold start.

## Multi-host (node registration)

If you want machines to host backends and “register” them dynamically, run a tiny node agent on each backend host (or just call the registration endpoint from your own tooling).

- Gateway endpoint: `POST /v1/nodes/register`
- Node payload describes which **roles** it serves and the base URL to reach its OpenAI-compatible backend (see the sketch below).

See: `docs/DEPLOYMENT.md` and `docs/CONFIG.md`.

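A registration call might look like the sketch below; the JSON field names and the node API-key header are illustrative assumptions, so check `docs/DEPLOYMENT.md` and `docs/CONFIG.md` for the actual payload schema and node authentication:

```bash
curl -sS -X POST http://127.0.0.1:8000/v1/nodes/register \
  -H 'Content-Type: application/json' \
  -H 'X-Api-Key: change-me-node-key' \
  -d '{
    "node_id": "gpu-box-1",
    "base_url": "http://192.168.1.50:8011",
    "roles": ["planner", "coder"]
  }'
```
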
## Status

This repository is a **preliminary scaffold**:

- Proxying to OpenAI-compatible upstreams works.
- Registration and load-selection are implemented (basic round-robin).
- API-key auth for clients and nodes is available.
- Persistence is basic JSON-backed state, not a full service registry.
- Gateway proxying has been exercised live with Ollama and `llamafile`.
- Node-agent managed inference has been exercised live with `llama-server` on CUDA hardware.

## Availability Semantics

RoleMesh Gateway now distinguishes between configured aliases and currently usable aliases.

- `GET /v1/models` advertises only aliases whose upstreams are reachable right now
- unavailable aliases are reported under `rolemesh.unavailable_models`
- `GET /ready` returns `200` only when the configured `default_model` is currently usable
- `GET /health` remains a lightweight process health check and does not probe upstreams
- discovered nodes are removed from routing once they become stale

This makes the API surface more truthful for clients that rely on the advertised role list.

By default, registered nodes become stale after `30` seconds without a fresh heartbeat or registration event. You can change that with `ROLE_MESH_NODE_STALE_AFTER_S`.

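For example, to treat nodes as stale after two minutes instead, set the variable when launching the gateway (same command as the Quick Start; the value `120` is illustrative):

```bash
ROLE_MESH_NODE_STALE_AFTER_S=120 \
ROLE_MESH_CONFIG=configs/models.yaml \
uvicorn rolemesh_gateway.main:app --host 127.0.0.1 --port 8000
```
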
## License

MIT. See `LICENSE`.

## Node Agent (per-host)

This repo also includes a **RoleMesh Node Agent** (`rolemesh-node-agent`) that can manage **persistent** `llama.cpp` servers (one per GPU) and report inventory/metrics back to the gateway.

- Sample config: `configs/node_agent.example.yaml`
- Docs: `docs/NODE_AGENT.md`

## Safe-by-default binding

The gateway and node agent default to binding on `127.0.0.1` to avoid accidental exposure. If you need remote access, bind only to private LAN or VPN interfaces and firewall the ports.
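For example, to expose the gateway on a LAN interface (the address is illustrative), bind uvicorn to that interface explicitly and restrict reachability with your firewall:

```bash
ROLE_MESH_CONFIG=configs/models.yaml uvicorn rolemesh_gateway.main:app --host 192.168.1.10 --port 8000
```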