# RoleMesh Gateway

RoleMesh Gateway is a lightweight **OpenAI-compatible** API gateway for routing chat-completions requests to multiple locally hosted LLM backends (e.g., `llama.cpp`'s `llama-server`) **by role** (planner, writer, coder, reviewer, …).

It is designed for **agentic workflows** that benefit from using different models for different steps, and for deployments where **different machines host different models** (e.g., a GPU box for fast inference, a big-RAM CPU box for large models).

## What you get

- OpenAI-compatible endpoints:
  - `GET /v1/models`
  - `POST /v1/chat/completions` (streaming and non-streaming)
  - `GET /health` and `GET /ready`
- Model registry from `configs/models.yaml`
- Optional **node registration** so remote machines can announce role backends to the gateway
- Robust proxying with **explicit httpx timeouts** (no “hang forever”)
- Structured logging with request IDs

## Roles Are Project-Defined

The role names in this repository are examples, not a fixed taxonomy.

- `planner`, `writer`, `coder`, and `reviewer` are only sample aliases
- you can add, remove, or rename roles per project
- a role is simply the `model` alias clients send to RoleMesh Gateway
- each role can point at any OpenAI-compatible backend that fits that project's workflow

Examples of project-specific roles:

- `researcher`
- `summarizer`
- `tool-user`
- `swe-backend`
- `swe-frontend`
- `test-writer`
- `security-reviewer`

If your workflow changes, update the `models:` section in config rather than treating the example roles as required.

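As a sketch, a project could define its own aliases directly in the `models:` section; the role names below are taken from the example list above, and the ports are placeholders, not part of the shipped example config:

```yaml
models:
  researcher:
    type: proxy
    proxy_url: http://127.0.0.1:8021   # placeholder port
  test-writer:
    type: proxy
    proxy_url: http://127.0.0.1:8022   # placeholder port
```

Clients then send `"model": "researcher"` or `"model": "test-writer"` in their requests, exactly as they would with the sample roles.
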
## Where model weights are defined

There are two different patterns in this project, and the model-weight location is defined in different places depending on which one you use.

### Proxy mode

In gateway proxy mode, the gateway does **not** point directly to a GGUF or other weight file.
It only points to an upstream inference server:

```yaml
models:
  planner:
    type: proxy
    proxy_url: http://127.0.0.1:8011
```

In that setup, the actual model weights are chosen by the upstream server itself. Examples:

- `llamafile --server -m /path/to/model.gguf ...`
- `llama-server -m /path/to/model.gguf ...`
- Ollama with `defaults.model: some-model-name`

So in proxy mode:

- RoleMesh alias `planner` -> upstream server at `proxy_url`
- upstream server -> actual weight file or model name

| Upstream type | Where weights/model are chosen | What RoleMesh config provides |
| --- | --- | --- |
| `llamafile --server` | CLI `-m /path/to/model.gguf` when the server starts | `proxy_url` |
| `llama-server` | CLI `-m /path/to/model.gguf` when the server starts | `proxy_url` |
| Ollama OpenAI-compatible API | request body `model`, often injected via `defaults.model` | `proxy_url` plus optional `defaults.model` |

### Node-agent mode

In node-agent mode, the weight file is defined explicitly in the node-agent config:

```yaml
models:
  - model_id: "planner-gguf"
    path: "/models/SomePlannerModel.Q5_K_M.gguf"
    roles: ["planner"]
```

In that setup:

- `model_id` is the model name exposed by the node agent
- `path` is the actual GGUF weight file to load
- `roles` are the role labels that node can serve if used with discovery

So in node-agent mode:

- node-agent `model_id` -> exact weight file path via `path`
- gateway discovered alias -> node role -> node-agent model load

## Quick Start

This is the fastest path to a working local setup.

### 1. Install

```bash
python -m venv .venv
source .venv/bin/activate
pip install -e .
```

### 2. Start two OpenAI-compatible backends

Any backend that exposes `GET /v1/models` and `POST /v1/chat/completions` will work. One practical option is `llamafile` in server mode:

```bash
llamafile --server -m /path/to/planner-model.gguf --host 127.0.0.1 --port 8011 --nobrowser
llamafile --server -m /path/to/writer-model.gguf --host 127.0.0.1 --port 8012 --nobrowser
```

### 3. Create a gateway config

```yaml
version: 1
default_model: planner
auth:
  client_api_keys:
    - "change-me-client-key"
models:
  planner:
    type: proxy
    openai_model_name: planner
    proxy_url: http://127.0.0.1:8011
    defaults:
      temperature: 0
      max_tokens: 128
  writer:
    type: proxy
    openai_model_name: writer
    proxy_url: http://127.0.0.1:8012
    defaults:
      temperature: 0.6
      max_tokens: 256
```

Save that as `configs/models.yaml`.

You are not limited to `planner` and `writer`. Those are just placeholders for whatever roles your project needs. In this proxy example, the actual weight files are defined by the two backend processes started in step 2, not by the gateway config.

### 4. Run the gateway

```bash
ROLE_MESH_CONFIG=configs/models.yaml uvicorn rolemesh_gateway.main:app --host 127.0.0.1 --port 8000
```

### 5. Verify it

```bash
curl -sS http://127.0.0.1:8000/v1/models \
  -H 'X-Api-Key: change-me-client-key'
```

```bash
curl -sS -X POST http://127.0.0.1:8000/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -H 'X-Api-Key: change-me-client-key' \
  -d '{
    "model": "planner",
    "messages": [{"role":"user","content":"Say hello in 3 words."}]
  }'
```

If you prefer the provided example file, copy `configs/models.example.yaml` and adjust the `proxy_url` values.

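Streaming can be verified on the same endpoint; a minimal sketch, assuming the standard OpenAI `"stream": true` request field (the `-N` flag stops curl from buffering the streamed chunks):

```bash
curl -sS -N -X POST http://127.0.0.1:8000/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -H 'X-Api-Key: change-me-client-key' \
  -d '{
    "model": "writer",
    "stream": true,
    "messages": [{"role":"user","content":"Write one short sentence."}]
  }'
```
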
## Known Good Inference Backends

The gateway is designed to work with any backend that exposes OpenAI-compatible `GET /v1/models` and `POST /v1/chat/completions` endpoints. The following applications have been exercised successfully in this repository.

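A quick way to confirm an upstream is compatible before wiring it into the gateway is to call those endpoints on the backend directly; the port here assumes the first Quick Start backend:

```bash
curl -sS http://127.0.0.1:8011/v1/models
```
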
### Ollama

- Verified directly against `http://127.0.0.1:11434`
- Verified through RoleMesh Gateway proxy routing
- Tested with model `dolphin3:latest`

Example upstream:

```yaml
models:
  planner:
    type: proxy
    openai_model_name: planner
    proxy_url: http://127.0.0.1:11434
    defaults:
      model: dolphin3:latest
```

Note: when proxying to Ollama's OpenAI-compatible API, the upstream Ollama model name still needs to be supplied. One simple pattern is to set it in `defaults.model` and let the gateway inject it.

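With that pattern, clients keep sending the RoleMesh alias and the gateway fills in the Ollama model name. For example, assuming the gateway from the Quick Start is listening on port 8000:

```bash
curl -sS -X POST http://127.0.0.1:8000/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -H 'X-Api-Key: change-me-client-key' \
  -d '{"model": "planner", "messages": [{"role":"user","content":"Hello"}]}'
```
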
### Llamafile

- Verified directly with the newer `llamafile` runner in `tmp-codex/llamafile`
- Verified through RoleMesh Gateway proxy routing
- Verified role switching between two live backends
- Tested successfully with:
  - `phi-2.Q5_K_M.llamafile`
  - `rocket-3b.Q5_K_M.llamafile`

Example launch:

```bash
./llamafile --server -m /path/to/model.gguf --host 127.0.0.1 --port 8011 --nobrowser
```

In this case, `/path/to/model.gguf` is where the actual weights are chosen, and RoleMesh only points to that running server.

### llama.cpp / llama-server

- Verified live through the RoleMesh Node Agent on NVIDIA GPUs
- Tested with `/home/netuser/bin/llama.cpp/build/bin/llama-server`
- Tested model load and model switching on Tesla P40 GPUs
- Tested successfully with:
  - `gemma-2b-it-q8_0.gguf`
  - `Mistral-7B-Instruct-v0.3-Q5_K_M.gguf`

The node agent now waits for `llama-server` readiness during model load or model switch before proxying the first request, which avoids transient "Loading model" failures on cold start.

## Multi-host (node registration)

If you want machines to host backends and “register” them dynamically, run a tiny node agent on each backend host (or just call the registration endpoint from your own tooling).

- Gateway endpoint: `POST /v1/nodes/register`
- Node payload describes which **roles** it serves and the base URL to reach its OpenAI-compatible backend (see the sketch below).

See: `docs/DEPLOYMENT.md` and `docs/CONFIG.md`.

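A registration call might look like the sketch below; the JSON field names and the node API-key header are illustrative assumptions, so check `docs/DEPLOYMENT.md` and `docs/CONFIG.md` for the actual payload schema and node authentication:

```bash
curl -sS -X POST http://127.0.0.1:8000/v1/nodes/register \
  -H 'Content-Type: application/json' \
  -H 'X-Api-Key: change-me-node-key' \
  -d '{
    "node_id": "gpu-box-1",
    "base_url": "http://192.168.1.50:8011",
    "roles": ["planner", "coder"]
  }'
```
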
## Status

This repository is a **preliminary scaffold**:

- Proxying to OpenAI-compatible upstreams works.
- Registration and load-selection are implemented (basic round-robin).
- API-key auth for clients and nodes is available.
- Persistence is basic JSON-backed state, not a full service registry.
- Gateway proxying has been exercised live with Ollama and `llamafile`.
- Node-agent managed inference has been exercised live with `llama-server` on CUDA hardware.

## Availability Semantics

RoleMesh Gateway now distinguishes between configured aliases and currently usable aliases.

- `GET /v1/models` advertises only aliases whose upstreams are reachable right now
- unavailable aliases are reported under `rolemesh.unavailable_models`
- `GET /ready` returns `200` only when the configured `default_model` is currently usable
- `GET /health` remains a lightweight process health check and does not probe upstreams
- discovered nodes are removed from routing once they become stale

This makes the API surface more truthful for clients that rely on the advertised role list.

By default, registered nodes become stale after `30` seconds without a fresh heartbeat or registration event. You can change that with `ROLE_MESH_NODE_STALE_AFTER_S`.

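For example, to treat nodes as stale after two minutes instead, set the variable when launching the gateway (same command as the Quick Start; the value `120` is illustrative):

```bash
ROLE_MESH_NODE_STALE_AFTER_S=120 \
ROLE_MESH_CONFIG=configs/models.yaml \
uvicorn rolemesh_gateway.main:app --host 127.0.0.1 --port 8000
```
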
## License

MIT. See `LICENSE`.

## Node Agent (per-host)

This repo also includes a **RoleMesh Node Agent** (`rolemesh-node-agent`) that can manage **persistent** `llama.cpp` servers (one per GPU) and report inventory/metrics back to the gateway.

- Sample config: `configs/node_agent.example.yaml`
- Docs: `docs/NODE_AGENT.md`

## Safe-by-default binding

The gateway and node agent default to binding on `127.0.0.1` to avoid accidental exposure. If you need remote access, bind only to private LAN or VPN interfaces and firewall the ports.
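For example, to expose the gateway on a LAN interface (the address is illustrative), bind uvicorn to that interface explicitly and restrict reachability with your firewall:

```bash
ROLE_MESH_CONFIG=configs/models.yaml uvicorn rolemesh_gateway.main:app --host 192.168.1.10 --port 8000
```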