Software to run multiple LLMs for agentic tasks over hosts on a defined network.

RoleMesh Gateway

RoleMesh Gateway logo

RoleMesh Gateway is a lightweight OpenAI-compatible API gateway for routing chat-completions requests to multiple locally hosted LLM backends (e.g., llama.cpp llama-server) by role (planner, writer, coder, reviewer, …).

It is designed for agentic workflows that benefit from using different models for different steps, and for deployments where different machines host different models (e.g., a GPU box for fast inference, a high-RAM CPU box for large models).

What you get

  • OpenAI-compatible endpoints:
    • GET /v1/models
    • POST /v1/chat/completions (streaming and non-streaming)
    • GET /health and GET /ready
  • Model registry from configs/models.yaml
  • Optional node registration so remote machines can announce role backends to the gateway
  • Robust proxying with explicit httpx timeouts (no “hang forever”)
  • Structured logging with request IDs

Roles Are Project-Defined

The role names in this repository are examples, not a fixed taxonomy.

  • planner, writer, coder, and reviewer are only sample aliases
  • you can add, remove, or rename roles per project
  • a role is simply the model alias clients send to RoleMesh Gateway
  • each role can point at any OpenAI-compatible backend that fits that project's workflow

Examples of project-specific roles:

  • researcher
  • summarizer
  • tool-user
  • swe-backend
  • swe-frontend
  • test-writer
  • security-reviewer

If your workflow changes, update the models: section in config rather than treating the example roles as required.
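
For instance, a project built around research tasks might define its models: section like this (the role names, ports, and URLs below are illustrative, not part of the shipped config):

```yaml
models:
  researcher:
    type: proxy
    proxy_url: http://127.0.0.1:8021   # hypothetical backend for research prompts
  summarizer:
    type: proxy
    proxy_url: http://127.0.0.1:8022   # hypothetical backend for summarization
```

Clients would then send "model": "researcher" or "model": "summarizer" exactly as they would send "planner" in the Quick Start example.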

Where model weights are defined

This project supports two patterns, and the model-weight location is defined in a different place depending on which one you use.

Proxy mode

In gateway proxy mode, the gateway does not point directly to a GGUF or other weight file. It only points to an upstream inference server:

models:
  planner:
    type: proxy
    proxy_url: http://127.0.0.1:8011

In that setup, the actual model weights are chosen by the upstream server itself. Examples:

  • llamafile --server -m /path/to/model.gguf ...
  • llama-server -m /path/to/model.gguf ...
  • Ollama with defaults.model: some-model-name

So in proxy mode:

  • RoleMesh alias planner -> upstream server at proxy_url
  • upstream server -> actual weight file or model name

In summary:

  • llamafile --server: weights are chosen by the CLI flag -m /path/to/model.gguf when the server starts; the RoleMesh config provides only proxy_url
  • llama-server: weights are chosen by the CLI flag -m /path/to/model.gguf when the server starts; the RoleMesh config provides only proxy_url
  • Ollama (OpenAI-compatible API): the model is chosen by the request-body model field, often injected via defaults.model; the RoleMesh config provides proxy_url plus an optional defaults.model

Node-agent mode

In node-agent mode, the weight file is defined explicitly in the node-agent config:

models:
  - model_id: "planner-gguf"
    path: "/models/SomePlannerModel.Q5_K_M.gguf"
    roles: ["planner"]

In that setup:

  • model_id is the model name exposed by the node agent
  • path is the actual GGUF weight file to load
  • roles are the role labels that node can serve if used with discovery

So in node-agent mode:

  • node-agent model_id -> exact weight file path via path
  • gateway discovered alias -> node role -> node-agent model load
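
The node-agent resolution chain above can be sketched in a few lines of Python; this is an illustrative model of the lookup, with hypothetical names, not the agent's actual code:

```python
# Mirrors the node-agent config snippet above: each entry maps a model_id
# and its weight path to the roles it can serve.
NODE_MODELS = [
    {"model_id": "planner-gguf",
     "path": "/models/SomePlannerModel.Q5_K_M.gguf",
     "roles": ["planner"]},
]

def resolve_weight_path(role: str) -> str:
    """Return the GGUF weight path of the first model serving `role`."""
    for model in NODE_MODELS:
        if role in model["roles"]:
            return model["path"]
    raise KeyError(f"no node model serves role {role!r}")

print(resolve_weight_path("planner"))  # /models/SomePlannerModel.Q5_K_M.gguf
```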

Quick Start

This is the fastest path to a working local setup.

1. Install

python -m venv .venv
source .venv/bin/activate
pip install -e .

2. Start two OpenAI-compatible backends

Any backend that exposes GET /v1/models and POST /v1/chat/completions will work. One practical option is llamafile in server mode:

llamafile --server -m /path/to/planner-model.gguf --host 127.0.0.1 --port 8011 --nobrowser
llamafile --server -m /path/to/writer-model.gguf  --host 127.0.0.1 --port 8012 --nobrowser

3. Create a gateway config

version: 1
default_model: planner
auth:
  client_api_keys:
    - "change-me-client-key"
models:
  planner:
    type: proxy
    openai_model_name: planner
    proxy_url: http://127.0.0.1:8011
    defaults:
      temperature: 0
      max_tokens: 128
  writer:
    type: proxy
    openai_model_name: writer
    proxy_url: http://127.0.0.1:8012
    defaults:
      temperature: 0.6
      max_tokens: 256

Save that as configs/models.yaml.

You are not limited to planner and writer. Those are just placeholders for whatever roles your project needs. In this proxy example, the actual weight files are defined by the two backend processes started in step 2, not by the gateway config.

4. Run the gateway

ROLE_MESH_CONFIG=configs/models.yaml uvicorn rolemesh_gateway.main:app --host 127.0.0.1 --port 8000

5. Verify it

curl -sS http://127.0.0.1:8000/v1/models \
  -H 'X-Api-Key: change-me-client-key'
curl -sS -X POST http://127.0.0.1:8000/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -H 'X-Api-Key: change-me-client-key' \
  -d '{
    "model": "planner",
    "messages": [{"role":"user","content":"Say hello in 3 words."}]
  }'

If you prefer the provided example file, copy configs/models.example.yaml and adjust the proxy_url values.

Local Overrides

Keep the tracked repo config generic and put machine-specific values in a separate local override file.

Examples of machine-specific values:

  • model weight paths
  • local llama-server binary path
  • LAN IPs and ports
  • local API keys

Supported launch patterns:

rolemesh-gateway --config configs/models.example.yaml --config-override configs/models.local.yaml
rolemesh-node-agent --config configs/node_agent.example.yaml --config-override configs/node_agent.local.yaml

You can also set the ROLE_MESH_CONFIG_OVERRIDE environment variable to point the gateway at an override file.

Tracked examples should use placeholders such as /path/to/model-weights, while your local override file contains the real values for your machine.
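
For example, a minimal local override might contain only the machine-specific keys, assuming the loader merges the override file on top of the base config (see docs/CONFIG.md for the authoritative merge behavior):

```yaml
# configs/models.local.yaml -- untracked, machine-specific values only
models:
  planner:
    proxy_url: http://192.168.1.50:8011   # hypothetical LAN IP of the GPU box
```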

Worked Deployment Example

For a concrete multi-machine example, including:

  • two node-agent processes on a dual-GPU host
  • another node-agent on a second host
  • project-defined roles planner, writer, and critic

see docs/EXAMPLE_MULTI_NODE.md.

Known Good Inference Backends

The gateway is designed to work with any backend that exposes OpenAI-compatible GET /v1/models and POST /v1/chat/completions endpoints. The following applications have been exercised successfully in this repository.

Ollama

  • Verified directly against http://127.0.0.1:11434
  • Verified through RoleMesh Gateway proxy routing
  • Tested with model dolphin3:latest

Example upstream:

models:
  planner:
    type: proxy
    openai_model_name: planner
    proxy_url: http://127.0.0.1:11434
    defaults:
      model: dolphin3:latest

Note: when proxying to Ollama's OpenAI-compatible API, the upstream Ollama model name still needs to be supplied. One simple pattern is to set it in defaults.model and let the gateway inject it.
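
That injection can be sketched as a simple merge; this is an illustrative stand-in for the gateway's behavior, assuming role defaults fill in anything the client omitted and that the concrete Ollama model name replaces the role alias:

```python
# Hypothetical sketch of defaults injection for a proxied Ollama backend.
ROLE_DEFAULTS = {"planner": {"model": "dolphin3:latest"}}

def build_upstream_payload(alias: str, client_payload: dict) -> dict:
    """Apply role defaults, then rewrite the alias to the upstream model."""
    defaults = ROLE_DEFAULTS.get(alias, {})
    payload = {**defaults, **client_payload}  # client values win
    if "model" in defaults:
        # The client sent the role alias; Ollama needs its real model name.
        payload["model"] = defaults["model"]
    return payload

req = {"model": "planner", "messages": [{"role": "user", "content": "hi"}]}
print(build_upstream_payload("planner", req)["model"])  # dolphin3:latest
```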

Llamafile

  • Verified directly with the newer llamafile runner in tmp-codex/llamafile
  • Verified through RoleMesh Gateway proxy routing
  • Verified role switching between two live backends
  • Tested successfully with:
    • phi-2.Q5_K_M.llamafile
    • rocket-3b.Q5_K_M.llamafile

Example launch:

./llamafile --server -m /path/to/model.gguf --host 127.0.0.1 --port 8011 --nobrowser

In this case, /path/to/model.gguf is where the actual weights are chosen, and RoleMesh only points to that running server.

llama.cpp / llama-server

  • Verified live through the RoleMesh Node Agent on NVIDIA GPUs
  • Tested with /home/netuser/bin/llama.cpp/build/bin/llama-server
  • Tested model load and model switching on Tesla P40 GPUs
  • Tested successfully with:
    • gemma-2b-it-q8_0.gguf
    • Mistral-7B-Instruct-v0.3-Q5_K_M.gguf

The node agent now waits for llama-server readiness during model load or model switch before proxying the first request, which avoids transient "Loading model" failures on cold start.
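
That wait can be sketched as a bounded poll loop; the probe is injected here so the sketch stays self-contained, and the real agent's endpoint and timing may differ:

```python
import time

def wait_until_ready(probe, timeout_s=120.0, interval_s=0.05):
    """Poll `probe()` (e.g. a llama-server health check) until it reports
    readiness, or give up after `timeout_s`. Returning only once the model
    is loaded avoids transient "Loading model" failures on cold start."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if probe():
            return True
        time.sleep(interval_s)
    return False

# Fake probe that reports ready on the third check.
checks = iter([False, False, True])
print(wait_until_ready(lambda: next(checks)))  # True
```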

Multi-host (node registration)

If you want machines to host backends and “register” them dynamically, run a tiny node agent on each backend host (or just call the registration endpoint from your own tooling).

  • Gateway endpoint: POST /v1/nodes/register
  • Node payload describes which concrete upstream model IDs it serves, which roles each model can satisfy, and the base URL to reach its OpenAI-compatible backend.
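
A registration payload consistent with that description might look like the following; the field names here are illustrative, so treat docs/DEPLOYMENT.md as authoritative:

```json
{
  "node_id": "gpu-box-1",
  "base_url": "http://192.168.1.50:8080",
  "models": [
    {"model_id": "qwen3-8b", "roles": ["tutor", "critic"]}
  ]
}
```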

In discovered mode, RoleMesh now treats these as separate concepts:

  • client alias: what the caller sends, such as tutor
  • role: the capability used for routing, such as tutor or critic
  • upstream model ID: the concrete model name served by the selected node, such as qwen3-8b

That means a node can advertise one served model for multiple roles, and the gateway can rewrite the forwarded request from a stable alias to the selected upstream model ID.
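
A minimal sketch of that routing chain, using the round-robin selection mentioned under Status (the aliases, addresses, and structures here are illustrative, not the gateway's internals):

```python
import itertools

# Client alias -> role: two aliases may share one role.
ALIASES = {"tutor": "tutor", "tutor-fast": "tutor"}

# Role -> registered (base_url, upstream model ID) pairs.
NODES = {"tutor": [("http://10.0.0.5:8080", "qwen3-8b"),
                   ("http://10.0.0.6:8080", "qwen3-8b")]}

_rr = {role: itertools.cycle(entries) for role, entries in NODES.items()}

def route(alias: str, payload: dict):
    """Pick the next node for the alias's role and rewrite `model`."""
    base_url, model_id = next(_rr[ALIASES[alias]])
    return base_url, {**payload, "model": model_id}

url, body = route("tutor", {"model": "tutor", "messages": []})
print(body["model"])  # qwen3-8b
```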

See: docs/DEPLOYMENT.md and docs/CONFIG.md.

Status

This repository is a preliminary scaffold:

  • Proxying to OpenAI-compatible upstreams works.
  • Registration and load-selection are implemented (basic round-robin).
  • API-key auth for clients and nodes is available.
  • Persistence is basic JSON-backed state, not a full service registry.
  • Gateway proxying has been exercised live with Ollama and llamafile.
  • Node-agent managed inference has been exercised live with llama-server on CUDA hardware.

Availability Semantics

RoleMesh Gateway now distinguishes between configured aliases and currently usable aliases.

  • GET /v1/models advertises only aliases whose upstreams are reachable right now
  • unavailable aliases are reported under rolemesh.unavailable_models
  • GET /ready returns 200 only when the configured default_model is currently usable
  • GET /health remains a lightweight process health check and does not probe upstreams
  • discovered nodes are removed from routing once they become stale

This makes the API surface more truthful for clients that rely on the advertised role list.

By default, registered nodes become stale after 30 seconds without a fresh heartbeat or registration event. You can change that with ROLE_MESH_NODE_STALE_AFTER_S.
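
The staleness rule can be sketched like this; the 30-second default matches ROLE_MESH_NODE_STALE_AFTER_S, while the data layout is illustrative:

```python
import time

STALE_AFTER_S = 30.0  # default; override via ROLE_MESH_NODE_STALE_AFTER_S

def live_nodes(last_seen, now=None):
    """Return node IDs whose last heartbeat or registration is fresh."""
    now = time.time() if now is None else now
    return [node for node, ts in last_seen.items()
            if now - ts <= STALE_AFTER_S]

seen = {"gpu-box": 100.0, "cpu-box": 50.0}
print(live_nodes(seen, now=120.0))  # ['gpu-box']  (cpu-box is 70s stale)
```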

License

MIT. See LICENSE.

Node Agent (per-host)

This repo also includes a RoleMesh Node Agent (rolemesh-node-agent) that can manage persistent llama.cpp servers (one per GPU) and report inventory/metrics back to the gateway.

  • Sample config: configs/node_agent.example.yaml
  • Docs: docs/NODE_AGENT.md

Safe-by-default binding

Gateway and node-agent default to binding on 127.0.0.1 to avoid accidental exposure. Bind only to private/LAN or VPN interfaces and firewall ports if you need remote access.