Zap

A high-performance LLM inference gateway in Rust. Route, load balance, rate-limit, and retry requests across multiple cloud LLM providers through a single API endpoint.

Why Zap?

Instead of calling LLM providers directly, Zap sits between your app and the providers:

Your app  →  Zap  →  Groq (primary)
                  →  Cerebras (fallback)

Zap makes sense when you've outgrown "one app, one provider." Use it when:

- you're mixing models (GPT-4o for hard problems, Groq for cheap fast calls, local Ollama for sensitive data)
- you want automatic failover so users don't notice an OpenAI outage
- you're cutting costs by routing simple queries to free or cheap providers
- multiple services all need LLM access and you're tired of managing keys and retry logic in each one

If you're only calling one provider, building a prototype, or need provider-specific features that don't fit a generic chat completions interface, just call the provider directly. Zap is for when LLM calls become plumbing that many things depend on, not a one-off integration.

Quickstart

1. Get free API keys from the providers you want to route to (Groq and Cerebras in the example below)

2. Configure config.toml

[server]
host = "0.0.0.0"
port = 8000

[queue]
max_size = 1000
timeout_secs = 300

[[backends]]
url = "https://api.groq.com/openai"
weight = 2                              # Gets 2x traffic (fastest)
health_path = "/v1/models"
api_key = "gsk_your_groq_key"
default_model = "llama-3.1-8b-instant"

[[backends]]
url = "https://api.cerebras.ai"
weight = 1
health_path = "/v1/models"
api_key = "csk-your_cerebras_key"
default_model = "llama3.1-8b"

[rate_limit]
requests_per_minute = 60

3. Build and run

cargo build --release
./target/release/zap

4. Send a request

# Short form — no model needed
curl http://localhost:8000/chat \
  -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "user", "content": "Hello!"}]}'

# OpenAI-compatible form
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "user", "content": "Hello!"}]}'

Features

Configuration

[[backends]]

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| url | string | required | Base URL of the provider |
| weight | integer | required | Routing weight (higher = more traffic) |
| health_path | string | "/health" | Health check endpoint path |
| api_key | string | none | API key injected as Authorization: Bearer <key> |
| default_model | string | none | Model name injected when the client omits it |
| chat_path | string | "/v1/chat/completions" | Chat completions endpoint path |

[queue]

| Field | Type | Description |
|-------|------|-------------|
| max_size | integer | Max queued requests before a 503 is returned |
| timeout_secs | integer | Request timeout in seconds |

[rate_limit]

| Field | Type | Description |
|-------|------|-------------|
| requests_per_minute | integer | Max requests per minute per API key |

Security: Add config.toml to .gitignore — it contains your API keys.

API

| Endpoint | Method | Description |
|----------|--------|-------------|
| /chat | POST | Short-form chat completions |
| /v1/chat/completions | POST | OpenAI-compatible chat completions |
| /health | GET | Returns {"status":"ok"} |
| /metrics | GET | Prometheus metrics |

Request body

| Field | Type | Required | Description |
|-------|------|----------|-------------|
| messages | array | yes | Conversation messages with role and content |
| model | string | no | Model override (uses the backend's default_model if omitted) |
| stream | boolean | no | Stream via SSE (default: false) |
| temperature | float | no | Sampling temperature (0.0 – 2.0) |
| max_tokens | integer | no | Max tokens to generate |

Examples

Streaming

curl --no-buffer http://localhost:8000/chat \
  -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "user", "content": "Write a poem about Rust"}], "stream": true}'
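Streamed responses arrive as OpenAI-style server-sent events, one `data:` line per chunk. A minimal sketch of consuming them in Python (the `parse_sse_line` helper is illustrative, not part of Zap):

```python
import json

def parse_sse_line(line: str):
    """Extract the content delta from one OpenAI-style SSE line.

    Returns the text chunk, or None for blank lines and the
    terminating "[DONE]" sentinel.
    """
    line = line.strip()
    if not line.startswith("data:"):
        return None
    payload = line[len("data:"):].strip()
    if payload == "[DONE]":
        return None
    event = json.loads(payload)
    # Each chunk carries an incremental delta, not the full message.
    return event["choices"][0]["delta"].get("content")

# Example: feed lines as they arrive (e.g. from an HTTP response iterator).
sample = [
    'data: {"choices": [{"delta": {"content": "Hello"}}]}',
    'data: {"choices": [{"delta": {"content": " world"}}]}',
    "data: [DONE]",
]
text = "".join(c for c in map(parse_sse_line, sample) if c)
print(text)  # Hello world
```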

With system prompt

curl http://localhost:8000/chat \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      {"role": "system", "content": "You are a concise coding assistant."},
      {"role": "user", "content": "Write fizzbuzz in Python"}
    ]
  }'

Multi-turn conversation

curl http://localhost:8000/chat \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      {"role": "user", "content": "What is 2 + 2?"},
      {"role": "assistant", "content": "4."},
      {"role": "user", "content": "Multiply that by 10"}
    ]
  }'

Python (OpenAI SDK)

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="any-key",
)

response = client.chat.completions.create(
    model="",  # uses backend default
    messages=[{"role": "user", "content": "Hello!"}],
)
print(response.choices[0].message.content)

With rate limiting

curl http://localhost:8000/chat \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer my-api-key" \
  -d '{"messages": [{"role": "user", "content": "Hello!"}]}'

Requests without an Authorization header share the "anonymous" rate-limit bucket.
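The per-key limiting above can be pictured as a fixed-window counter; this is an illustrative sketch of the behavior, not Zap's actual Rust implementation (which may use a different algorithm):

```python
import time
from collections import defaultdict
from typing import Optional

class FixedWindowLimiter:
    """Per-key fixed-window rate limiter: at most `limit` requests per
    `window` seconds; callers without a key share one "anonymous" bucket."""

    def __init__(self, limit: int, window: float = 60.0):
        self.limit = limit
        self.window = window
        self.counts = defaultdict(int)   # (key, window index) -> request count

    def allow(self, api_key: Optional[str]) -> bool:
        key = api_key or "anonymous"     # missing Authorization -> shared bucket
        bucket = (key, int(time.time() // self.window))
        if self.counts[bucket] >= self.limit:
            return False                 # caller would receive a 429
        self.counts[bucket] += 1
        return True

limiter = FixedWindowLimiter(limit=60)
print(limiter.allow("my-api-key"))  # True
print(limiter.allow(None))          # True (counts against "anonymous")
```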

Architecture

Client
  │
  ▼
[Rate Limiter]  ── 429 ──>  Client
  │
  ▼
[Bounded Queue]  ── 503 ──>  Client
  │
  ▼
[Dispatcher]  (spawns tokio task per request)
  │
  ▼
[Load Balancer]  (round-robin, weighted)
  │         │
  ▼         ▼
Groq    Cerebras     (+API key + model injection)
  │         │
  └────┬────┘
       ▼
[Retry w/ exponential backoff + jitter]
       │
       ▼
  Response ──> Client
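The load-balancing and retry stages can be sketched as follows. This is an illustrative Python model of the two policies named in the diagram (weighted round-robin, exponential backoff with full jitter); Zap's internal Rust implementation may differ in detail:

```python
import itertools
import random

def weighted_cycle(backends):
    """Endless weighted round-robin over a {name: weight} dict: a backend
    with weight 2 appears twice per pass, so it gets twice the traffic."""
    expanded = [name for name, w in backends.items() for _ in range(w)]
    return itertools.cycle(expanded)

def backoff_delays(retries, base=0.5, cap=8.0):
    """Exponential backoff with full jitter: the n-th delay is drawn
    uniformly from [0, min(cap, base * 2**n)] seconds."""
    return [random.uniform(0, min(cap, base * 2 ** n)) for n in range(retries)]

lb = weighted_cycle({"groq": 2, "cerebras": 1})
print([next(lb) for _ in range(6)])
# ['groq', 'groq', 'cerebras', 'groq', 'groq', 'cerebras']
print(backoff_delays(4))  # four randomized, capped delays
```

Full jitter spreads retries out so that many clients failing at once don't hammer a recovering backend in lockstep.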

Adding a new provider

Any OpenAI-compatible API works. Add a [[backends]] block to config.toml:

[[backends]]
url = "https://api.together.xyz"
weight = 1
health_path = "/v1/models"
api_key = "your_key"
default_model = "meta-llama/Llama-3.1-8B-Instruct"

For providers with non-standard paths, use chat_path:

chat_path = "/v1beta/openai/chat/completions"
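For instance, a provider that nests its OpenAI-compatible API under a versioned prefix could be configured like this (a hypothetical example; the URL, paths, and model name are illustrative, so substitute your provider's actual values):

```toml
[[backends]]
url = "https://generativelanguage.googleapis.com"
weight = 1
health_path = "/v1beta/openai/models"
api_key = "your_key"
default_model = "gemini-2.0-flash"
chat_path = "/v1beta/openai/chat/completions"
```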

Development

cargo build --release       # Build
cargo test                  # Run tests
cargo clippy -- -D warnings # Lint
cargo fmt                   # Format

License

MIT