# RoleMesh Gateway
![RoleMesh Gateway logo](artwork/rolemesh_gateway_logo.png)
RoleMesh Gateway is a lightweight **OpenAI-compatible** API gateway for routing chat-completions requests to multiple
locally hosted LLM backends (e.g., `llama.cpp`'s `llama-server`) **by role** (planner, writer, coder, reviewer, …).
It is designed for **agentic workflows** that benefit from using different models for different steps, and for
deployments where **different machines host different models** (e.g., a GPU box for fast inference, a big-RAM CPU box for large models).
## What you get
- OpenAI-compatible endpoints:
  - `GET /v1/models`
  - `POST /v1/chat/completions` (streaming and non-streaming)
- `GET /health` and `GET /ready`
- Model registry from `configs/models.yaml`
- Optional **node registration** so remote machines can announce role backends to the gateway
- Robust proxying with **explicit httpx timeouts** (no “hang forever”); see the sketch below
- Structured logging with request IDs
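The “no hang forever” behavior comes from giving every upstream call an explicit timeout budget. A minimal sketch of that pattern with `httpx` (illustrative only; the gateway's actual timeout values and client wiring may differ):
```python
import httpx

# Every phase of the upstream call gets an explicit budget, so a stalled
# backend surfaces as a timeout error instead of an indefinite hang.
TIMEOUT = httpx.Timeout(connect=5.0, read=120.0, write=30.0, pool=5.0)

async def proxy_chat_completion(base_url: str, payload: dict) -> dict:
    async with httpx.AsyncClient(base_url=base_url, timeout=TIMEOUT) as client:
        resp = await client.post("/v1/chat/completions", json=payload)
        resp.raise_for_status()
        return resp.json()
```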
## Quick Start
This is the fastest path to a working local setup.
### 1. Install
```bash
python -m venv .venv
source .venv/bin/activate
pip install -e .
```
### 2. Start two OpenAI-compatible backends
Any backend that exposes `GET /v1/models` and `POST /v1/chat/completions` will work.
One practical option is `llamafile` in server mode:
```bash
llamafile --server -m /path/to/planner-model.gguf --host 127.0.0.1 --port 8011 --nobrowser
llamafile --server -m /path/to/writer-model.gguf --host 127.0.0.1 --port 8012 --nobrowser
```
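Before pointing the gateway at them, you can confirm each backend answers the OpenAI-compatible model listing. A quick check with `httpx` (ports match the commands above; adjust to your setup):
```python
import httpx

# Confirm both backends expose GET /v1/models before configuring the gateway.
for port in (8011, 8012):
    resp = httpx.get(f"http://127.0.0.1:{port}/v1/models", timeout=5.0)
    resp.raise_for_status()
    names = [m["id"] for m in resp.json().get("data", [])]
    print(f"port {port}: {names}")
```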
### 3. Create a gateway config
```yaml
version: 1
default_model: planner
auth:
  client_api_keys:
    - "change-me-client-key"
models:
  planner:
    type: proxy
    openai_model_name: planner
    proxy_url: http://127.0.0.1:8011
    defaults:
      temperature: 0
      max_tokens: 128
  writer:
    type: proxy
    openai_model_name: writer
    proxy_url: http://127.0.0.1:8012
    defaults:
      temperature: 0.6
      max_tokens: 256
```
Save that as `configs/models.yaml`.
### 4. Run the gateway
```bash
ROLE_MESH_CONFIG=configs/models.yaml uvicorn rolemesh_gateway.main:app --host 127.0.0.1 --port 8000
```
### 5. Verify it
```bash
curl -sS http://127.0.0.1:8000/v1/models \
  -H 'X-Api-Key: change-me-client-key'
```
```bash
curl -sS -X POST http://127.0.0.1:8000/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -H 'X-Api-Key: change-me-client-key' \
  -d '{
    "model": "planner",
    "messages": [{"role":"user","content":"Say hello in 3 words."}]
  }'
```
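Since the gateway speaks the OpenAI wire format, OpenAI-compatible clients work against it too. A sketch using the official `openai` Python package, assuming the gateway authenticates via the `X-Api-Key` header as in the curl examples above:
```python
from openai import OpenAI

# Point the client at the gateway; auth goes through the X-Api-Key header.
client = OpenAI(
    base_url="http://127.0.0.1:8000/v1",
    api_key="unused",  # the SDK requires a value, but the gateway key is passed below
    default_headers={"X-Api-Key": "change-me-client-key"},
)

reply = client.chat.completions.create(
    model="planner",
    messages=[{"role": "user", "content": "Say hello in 3 words."}],
)
print(reply.choices[0].message.content)
```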
If you prefer the provided example file, copy `configs/models.example.yaml` and adjust the `proxy_url` values.
## Known Good Inference Backends
The gateway is designed to work with any backend that exposes OpenAI-compatible `GET /v1/models` and
`POST /v1/chat/completions` endpoints. The following backends have been exercised successfully with this repository.
### Ollama
- Verified directly against `http://127.0.0.1:11434`
- Verified through RoleMesh Gateway proxy routing
- Tested with model `dolphin3:latest`
Example upstream:
```yaml
models:
  planner:
    type: proxy
    openai_model_name: planner
    proxy_url: http://127.0.0.1:11434
    defaults:
      model: dolphin3:latest
```
Note: when proxying to Ollama's OpenAI-compatible API, the upstream Ollama model name still needs to be supplied.
One simple pattern is to set it in `defaults.model` and let the gateway inject it.
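Conceptually the injection is a merge: role `defaults` fill in anything the client did not set, and the configured upstream model name replaces the role name before the request is forwarded. A rough sketch of the idea (not the gateway's actual code):
```python
def build_upstream_body(request_body: dict, role_defaults: dict) -> dict:
    # Start from the client request, fill unset fields from the role's defaults,
    # and always overwrite "model" with the configured upstream name.
    body = {**role_defaults, **request_body}
    if "model" in role_defaults:
        body["model"] = role_defaults["model"]
    return body

# Client asks for the "planner" role; Ollama receives "dolphin3:latest".
upstream_body = build_upstream_body(
    {"model": "planner", "messages": [{"role": "user", "content": "hi"}]},
    {"model": "dolphin3:latest"},
)
```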
### Llamafile
- Verified directly with the newer `llamafile` runner in `tmp-codex/llamafile`
- Verified through RoleMesh Gateway proxy routing
- Verified role switching between two live backends
- Tested successfully with:
  - `phi-2.Q5_K_M.llamafile`
  - `rocket-3b.Q5_K_M.llamafile`
Example launch:
```bash
./llamafile --server -m /path/to/model.gguf --host 127.0.0.1 --port 8011 --nobrowser
```
### llama.cpp / llama-server
- Verified live through the RoleMesh Node Agent on NVIDIA GPUs
- Tested with `/home/netuser/bin/llama.cpp/build/bin/llama-server`
- Tested model load and model switching on Tesla P40 GPUs
- Tested successfully with:
  - `gemma-2b-it-q8_0.gguf`
  - `Mistral-7B-Instruct-v0.3-Q5_K_M.gguf`
The node agent now waits for `llama-server` readiness during model load or model switch before proxying the first
request, which avoids transient "Loading model" failures on cold start.
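The readiness wait boils down to polling the freshly launched server until it responds. A sketch of that pattern, assuming the backend answers `GET /v1/models` once it can serve requests (the node agent's real probe endpoint and timings may differ):
```python
import time
import httpx

def wait_until_ready(base_url: str, timeout_s: float = 120.0, interval_s: float = 1.0) -> None:
    """Poll the backend until it answers, so the first proxied request
    doesn't hit a server that is still loading its model."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            if httpx.get(f"{base_url}/v1/models", timeout=5.0).status_code == 200:
                return
        except httpx.HTTPError:
            pass  # not listening yet; keep polling
        time.sleep(interval_s)
    raise TimeoutError(f"backend at {base_url} did not become ready in {timeout_s}s")
```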
## Multi-host (node registration)
If you want machines to host backends and “register” them dynamically, run a tiny node agent on each backend host
(or just call the registration endpoint from your own tooling).
- Gateway endpoint: `POST /v1/nodes/register`
- Node payload describes which **roles** it serves and the base URL to reach its OpenAI-compatible backend.
See: `docs/DEPLOYMENT.md` and `docs/CONFIG.md`.
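For illustration, a registration call might look like the sketch below; the payload field names here are hypothetical, and the authoritative schema lives in `docs/CONFIG.md`:
```python
import httpx

# Hypothetical payload shape: the real field names are defined in docs/CONFIG.md.
registration = {
    "node_id": "gpu-box-1",
    "base_url": "http://10.0.0.5:8011",
    "roles": ["planner", "coder"],
}

resp = httpx.post(
    "http://127.0.0.1:8000/v1/nodes/register",
    json=registration,
    headers={"X-Api-Key": "change-me-node-key"},  # assumes node keys use the same header
    timeout=10.0,
)
resp.raise_for_status()
```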
## Status
This repository is a **preliminary scaffold**:
- Proxying to OpenAI-compatible upstreams works.
- Node registration and backend selection are implemented (basic round-robin).
- API-key auth for clients and nodes is available.
- Persistence is basic JSON-backed state, not a full service registry.
- Gateway proxying has been exercised live with Ollama and `llamafile`.
- Node-agent managed inference has been exercised live with `llama-server` on CUDA hardware.
## License
MIT. See `LICENSE`.
## Node Agent (per-host)
This repo also includes a **RoleMesh Node Agent** (`rolemesh-node-agent`) that can manage **persistent** `llama.cpp` servers (one per GPU) and report inventory/metrics back to the gateway.
- Sample config: `configs/node_agent.example.yaml`
- Docs: `docs/NODE_AGENT.md`
## Safe-by-default binding
Gateway and node-agent default to binding on `127.0.0.1` to avoid accidental exposure. If you need remote access, bind only to private LAN or VPN interfaces and firewall the ports.