Zap

A high-performance LLM inference gateway in Rust. Route, load balance, rate-limit, and retry requests across multiple cloud LLM providers through a single API endpoint.

Why Zap?

Instead of calling LLM providers directly, Zap sits between your app and the providers:

Your app  →  Zap  →  Groq (primary)
                  →  Cerebras (fallback)

Zap makes sense when you've outgrown "one app, one provider." Use it when:

- you're mixing models (GPT-4o for hard problems, Groq for cheap fast calls, local Ollama for sensitive data)
- you want automatic failover so users don't notice an OpenAI outage
- you're cutting costs by routing simple queries to free or cheap providers
- multiple services all need LLM access and you're tired of managing keys and retry logic in each one

If you're only calling one provider, building a prototype, or need provider-specific features that don't fit a generic chat completions interface, just call the provider directly. Zap is for when LLM calls become plumbing that many things depend on, not a one-off integration.

Quickstart

1. Get free API keys from the providers you want to route to (Groq and Cerebras in the example below)

2. Configure config.toml

[server]
host = "0.0.0.0"
port = 8000

[queue]
max_size = 1000
timeout_secs = 300

[[backends]]
url = "https://api.groq.com/openai"
weight = 2                              # Gets 2x traffic (fastest)
health_path = "/v1/models"
api_key = "gsk_your_groq_key"
default_model = "llama-3.1-8b-instant"

[[backends]]
url = "https://api.cerebras.ai"
weight = 1
health_path = "/v1/models"
api_key = "csk-your_cerebras_key"
default_model = "llama3.1-8b"

[rate_limit]
requests_per_minute = 60

3. Build and run

cargo build --release
./target/release/zap

4. Send a request

# Short form — no model needed
curl http://localhost:8000/chat \
  -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "user", "content": "Hello!"}]}'

# OpenAI-compatible form
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "user", "content": "Hello!"}]}'

Features

Configuration

[[backends]]

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| url | string | required | Base URL of the provider |
| weight | integer | required | Routing weight (higher = more traffic) |
| health_path | string | "/health" | Health check endpoint path |
| api_key | string | none | API key injected as Authorization: Bearer <key> |
| default_model | string | none | Model name injected when the client omits it |
| chat_path | string | "/v1/chat/completions" | Chat completions endpoint path |

[queue]

| Field | Type | Description |
|-------|------|-------------|
| max_size | integer | Max queued requests before a 503 is returned |
| timeout_secs | integer | Request timeout in seconds |

[rate_limit]

| Field | Type | Description |
|-------|------|-------------|
| requests_per_minute | integer | Max requests per minute per API key |

Security: Add config.toml to .gitignore — it contains your API keys.

API

| Endpoint | Method | Description |
|----------|--------|-------------|
| /chat | POST | Short-form chat completions |
| /v1/chat/completions | POST | OpenAI-compatible chat completions |
| /health | GET | Returns {"status":"ok"} |
| /metrics | GET | Prometheus metrics |

Request body

| Field | Type | Required | Description |
|-------|------|----------|-------------|
| messages | array | yes | Conversation messages with role and content |
| model | string | no | Model override (uses the backend's default_model if omitted) |
| stream | boolean | no | Stream via SSE (default: false) |
| temperature | float | no | Sampling temperature (0.0 – 2.0) |
| max_tokens | integer | no | Max tokens to generate |

Examples

Streaming

curl --no-buffer http://localhost:8000/chat \
  -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "user", "content": "Write a poem about Rust"}], "stream": true}'
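Streamed responses arrive as OpenAI-style server-sent events, one `data:` line per chunk. A minimal sketch of consuming them in Python (the `parse_sse_line` helper is illustrative, not part of Zap):

```python
import json

def parse_sse_line(line: str):
    """Extract the content delta from one OpenAI-style SSE line.

    Returns the text chunk, or None for blank lines and the
    terminating "[DONE]" sentinel.
    """
    line = line.strip()
    if not line.startswith("data:"):
        return None
    payload = line[len("data:"):].strip()
    if payload == "[DONE]":
        return None
    event = json.loads(payload)
    # Each chunk carries an incremental delta, not the full message.
    return event["choices"][0]["delta"].get("content")

# Example: feed lines as they arrive (e.g. from an HTTP response iterator).
sample = [
    'data: {"choices": [{"delta": {"content": "Hello"}}]}',
    'data: {"choices": [{"delta": {"content": " world"}}]}',
    "data: [DONE]",
]
text = "".join(c for c in map(parse_sse_line, sample) if c)
print(text)  # Hello world
```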

With system prompt

curl http://localhost:8000/chat \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      {"role": "system", "content": "You are a concise coding assistant."},
      {"role": "user", "content": "Write fizzbuzz in Python"}
    ]
  }'

Multi-turn conversation

curl http://localhost:8000/chat \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      {"role": "user", "content": "What is 2 + 2?"},
      {"role": "assistant", "content": "4."},
      {"role": "user", "content": "Multiply that by 10"}
    ]
  }'

Python (OpenAI SDK)

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="any-key",
)

response = client.chat.completions.create(
    model="",  # uses backend default
    messages=[{"role": "user", "content": "Hello!"}],
)
print(response.choices[0].message.content)

With rate limiting

curl http://localhost:8000/chat \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer my-api-key" \
  -d '{"messages": [{"role": "user", "content": "Hello!"}]}'

Requests without an Authorization header share the "anonymous" rate-limit bucket.
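The per-key limiting above can be pictured as a fixed-window counter; this is an illustrative sketch of the behavior, not Zap's actual Rust implementation (which may use a different algorithm):

```python
import time
from collections import defaultdict
from typing import Optional

class FixedWindowLimiter:
    """Per-key fixed-window rate limiter: at most `limit` requests per
    `window` seconds; callers without a key share one "anonymous" bucket."""

    def __init__(self, limit: int, window: float = 60.0):
        self.limit = limit
        self.window = window
        self.counts = defaultdict(int)   # (key, window index) -> request count

    def allow(self, api_key: Optional[str]) -> bool:
        key = api_key or "anonymous"     # missing Authorization -> shared bucket
        bucket = (key, int(time.time() // self.window))
        if self.counts[bucket] >= self.limit:
            return False                 # caller would receive a 429
        self.counts[bucket] += 1
        return True

limiter = FixedWindowLimiter(limit=60)
print(limiter.allow("my-api-key"))  # True
print(limiter.allow(None))          # True (counts against "anonymous")
```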

Architecture

Client
  │
  ▼
[Rate Limiter]  ── 429 ──>  Client
  │
  ▼
[Bounded Queue]  ── 503 ──>  Client
  │
  ▼
[Dispatcher]  (spawns tokio task per request)
  │
  ▼
[Load Balancer]  (round-robin, weighted)
  │         │
  ▼         ▼
Groq    Cerebras     (+API key + model injection)
  │         │
  └────┬────┘
       ▼
[Retry w/ exponential backoff + jitter]
       │
       ▼
  Response ──> Client
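The load-balancing and retry stages can be sketched as follows. This is an illustrative Python model of the two policies named in the diagram (weighted round-robin, exponential backoff with full jitter); Zap's internal Rust implementation may differ in detail:

```python
import itertools
import random

def weighted_cycle(backends):
    """Endless weighted round-robin over a {name: weight} dict: a backend
    with weight 2 appears twice per pass, so it gets twice the traffic."""
    expanded = [name for name, w in backends.items() for _ in range(w)]
    return itertools.cycle(expanded)

def backoff_delays(retries, base=0.5, cap=8.0):
    """Exponential backoff with full jitter: the n-th delay is drawn
    uniformly from [0, min(cap, base * 2**n)] seconds."""
    return [random.uniform(0, min(cap, base * 2 ** n)) for n in range(retries)]

lb = weighted_cycle({"groq": 2, "cerebras": 1})
print([next(lb) for _ in range(6)])
# ['groq', 'groq', 'cerebras', 'groq', 'groq', 'cerebras']
print(backoff_delays(4))  # four randomized, capped delays
```

Full jitter spreads retries out so that many clients failing at once don't hammer a recovering backend in lockstep.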

Adding a new provider

Any OpenAI-compatible API works. Add a [[backends]] block to config.toml:

[[backends]]
url = "https://api.together.xyz"
weight = 1
health_path = "/v1/models"
api_key = "your_key"
default_model = "meta-llama/Llama-3.1-8B-Instruct"

For providers with non-standard paths, use chat_path:

chat_path = "/v1beta/openai/chat/completions"
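For instance, a provider that nests its OpenAI-compatible API under a versioned prefix could be configured like this (a hypothetical example; the URL, paths, and model name are illustrative, so substitute your provider's actual values):

```toml
[[backends]]
url = "https://generativelanguage.googleapis.com"
weight = 1
health_path = "/v1beta/openai/models"
api_key = "your_key"
default_model = "gemini-2.0-flash"
chat_path = "/v1beta/openai/chat/completions"
```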

Development

cargo build --release       # Build
cargo test                  # Run tests
cargo clippy -- -D warnings # Lint
cargo fmt                   # Format

License

MIT