The conventional take on rate limiting is to pick an algorithm, point it at Redis, and move on. That covers roughly 40% of what production requires. The rest lives in the key you limit on, the order your middleware runs, and what your system does when the counter store stops responding.
I've implemented rate limiting on Callidus, a multi-tenant SaaS for UK aesthetic clinics built on React and Firebase. That includes a per-session AI rate limiter at 30 requests per hour for standard users and 100 per hour for pro tier, isolated from the rest of the booking API. That endpoint separation is what kept the Calia AI assistant from degrading booking throughput when a practitioner ran heavy consultation sessions. The algorithm choice mattered far less than the key strategy and middleware order. This post covers the decisions that show up in actual incidents.
Fixed Window Is the Wrong Default for Most SaaS APIs

Fixed window rate limiting has one critical flaw: deliberate double-bursting at window boundaries can consume twice the intended quota in two seconds.
Arcjet's breakdown of rate limiting algorithms describes the exact exploit: a client sends 100 requests at 12:00:59 and another 100 at 12:01:00 — 200 requests in two seconds against a 100/min limit. This isn't theoretical. Anyone reading your `X-RateLimit-Reset` header can time this precisely. The reset timestamp tells them when the window resets; exploiting the boundary is one line of client code.
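To make the boundary burst concrete, here is a minimal in-memory sketch of a fixed-window counter. The `FixedWindowLimiter` class and `sendBurst` helper are illustrative names rather than any library's API, and timestamps are plain milliseconds relative to the start of a window.

```typescript
// Minimal in-memory fixed-window counter, used only to illustrate the boundary burst.
class FixedWindowLimiter {
  private readonly limit: number;
  private readonly windowMs: number;
  private windowStart = -1;
  private count = 0;

  constructor(limit: number, windowMs: number) {
    this.limit = limit;
    this.windowMs = windowMs;
  }

  // Returns true if a request arriving at `nowMs` is allowed.
  allow(nowMs: number): boolean {
    const currentWindow = Math.floor(nowMs / this.windowMs) * this.windowMs;
    if (currentWindow !== this.windowStart) {
      // A new window resets the counter to zero, regardless of recent traffic.
      this.windowStart = currentWindow;
      this.count = 0;
    }
    if (this.count >= this.limit) return false;
    this.count += 1;
    return true;
  }
}

// Sends `n` requests at the same instant and returns how many were accepted.
function sendBurst(limiter: FixedWindowLimiter, n: number, nowMs: number): number {
  let accepted = 0;
  for (let i = 0; i < n; i++) {
    if (limiter.allow(nowMs)) accepted++;
  }
  return accepted;
}

const fixedWindow = new FixedWindowLimiter(100, 60000); // 100 requests per minute
const acceptedAcrossBoundary =
  sendBurst(fixedWindow, 100, 59000) + // last second of the first window
  sendBurst(fixedWindow, 100, 60000); // first instant of the next window
// acceptedAcrossBoundary is 200: double the nominal per-minute limit in about a second.
```

The counter does exactly what it was told to do. The flaw is in the window definition, not the implementation.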
Token bucket is the right default for most developer-facing REST APIs. Tokens accumulate over time up to a maximum capacity; each request consumes one. Idle clients bank capacity for controlled bursts without exceeding long-term average throughput. The trade-off is that a token bucket can't enforce a strict "no more than N requests in the last 60 seconds" guarantee, but for interactive APIs that's rarely the constraint that matters. What matters is that a user can send a burst of requests without hitting an arbitrary wall.
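The mechanics fit in a few lines. This is a single-process sketch under simplifying assumptions: the `TokenBucket` class is an illustrative name, the clock is passed in as milliseconds so the behavior is deterministic, and a production limiter would keep this state in a shared store rather than in memory.

```typescript
// In-memory token bucket: refills continuously, capped at `capacity`.
class TokenBucket {
  private readonly capacity: number;
  private readonly refillPerSecond: number;
  private tokens: number;
  private lastRefillMs: number;

  constructor(capacity: number, refillPerSecond: number, nowMs: number) {
    this.capacity = capacity;
    this.refillPerSecond = refillPerSecond;
    this.tokens = capacity; // start full so a new client can burst immediately
    this.lastRefillMs = nowMs;
  }

  // Refills based on elapsed time, then tries to spend one token.
  tryConsume(nowMs: number): boolean {
    const elapsedMs = Math.max(0, nowMs - this.lastRefillMs);
    const refill = (elapsedMs * this.refillPerSecond) / 1000;
    this.tokens = Math.min(this.capacity, this.tokens + refill);
    this.lastRefillMs = nowMs;
    if (this.tokens < 1) return false;
    this.tokens -= 1;
    return true;
  }
}

// Capacity 5, refilling one token per second.
const bucket = new TokenBucket(5, 1, 0);
```

A client that stays idle never banks more than `capacity` tokens, which is what keeps bursts bounded.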
Here's the algorithm selection table that actually guides the choice:
| Scenario | Algorithm |
|---|---|
| Public developer API, variable cadence | Token bucket |
| Auth endpoints, OTP, password reset | Sliding window log |
| Billing quotas (daily/monthly caps) | Fixed window |
| Outbound webhooks, smooth delivery | Leaky bucket |
| High-scale distributed systems | Sliding window counter |
Fixed window belongs on billing quotas — monthly API call limits where the period boundary doesn't create an exploitable edge. Not on the endpoints that carry your actual product traffic.
Why Does Your Bucket Key Matter More Than the Algorithm?
Your rate limit bucket key determines whose quota gets consumed — choose the wrong dimension and you've isolated nothing even with a correct algorithm.
Per-IP limiting is the default in almost every rate limiting tutorial. For B2B SaaS, it's wrong. Enterprise clients route traffic through corporate proxies. Fifty users can share one egress IP. You'll throttle your most valuable customers first while bot traffic from residential IPs passes freely. Per-IP is the right key for login endpoints and public unauthenticated routes. Not for your authenticated product API.
Per-tenant keying (`tenantId`) is the right default for any SaaS serving organizations rather than individual users. Each tenant gets an independent bucket, sized by plan tier. A Starter tenant gets 100 requests/minute; an Enterprise tenant gets 1,000. Their traffic never bleeds into each other's counter.
Per-endpoint granularity goes on top of per-tenant. Not all endpoints cost the same. A `GET /patients` list call is not the same as a `POST /reports/generate` call. On Callidus, the Calia AI assistant runs a separate rate limiter entirely — isolated so that a heavy consultation session doesn't reduce throughput for the booking flow. The composite key: `tenantId:endpoint`. For AI or compute-heavy endpoints, add `:tier` to vary limits by plan level.
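One way to build that composite key is sketched below. The route grouping in `endpointGroup` and the tier names are assumptions for illustration, not the Callidus implementation; the per-minute figures are the Starter and Enterprise numbers used above.

```typescript
type PlanTier = "starter" | "enterprise";

// Per-minute tenant limits by plan tier (figures from the example above).
const tenantLimitsPerMinute: Record<PlanTier, number> = {
  starter: 100,
  enterprise: 1000,
};

// Hypothetical route grouping: expensive routes get their own bucket.
function endpointGroup(pathname: string): string {
  if (pathname.startsWith("/api/ai")) return "ai";
  if (pathname.startsWith("/reports/generate")) return "reports";
  return "default";
}

// Builds `tenantId:endpoint`, appending `:tier` only for the expensive groups.
function bucketKey(tenantId: string, pathname: string, tier: PlanTier): string {
  const group = endpointGroup(pathname);
  // Encode the tenant id so a ":" inside it cannot collide with the separator.
  const base = `${encodeURIComponent(tenantId)}:${group}`;
  return group === "default" ? base : `${base}:${tier}`;
}
```

Grouping routes instead of keying on the raw path keeps the number of buckets small and stops path parameters such as record IDs from creating one counter per resource.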
Amazon's builders library defines the goal precisely: fairness in a multi-tenant system means every client is provided with a single-tenant experience. The noisy-neighbor problem is a keying problem, not a scaling problem. When one tenant's data export consumes headroom that belongs to another tenant's real-time dashboard, the fix is to key on the right dimension, not to add capacity.
How Do You Wire Rate Limiting in a Next.js SaaS?
Rate limiting in Next.js runs in Edge middleware using `@upstash/ratelimit`, keyed on a composite `tenantId` and endpoint identifier, returning `429` with `Retry-After` on breach.
Upstash is the practical answer here for architectural reasons. Edge runtimes such as Vercel Edge and Cloudflare Workers restrict the raw TCP sockets that standard Redis clients rely on. Upstash exposes an HTTP API alongside the Redis protocol, which makes it one of the few persistent counter stores an Edge function can reach without a workaround. Upstash's 2026 comparison of serverless Redis options puts their free tier at 500K commands per month, enough for rate limiting across several hundred daily active tenants before you pay anything.
```typescript
import { NextResponse } from "next/server";
import type { NextRequest } from "next/server";
import { Ratelimit } from "@upstash/ratelimit";
import { Redis } from "@upstash/redis";

// One HTTP-based Redis client shared by both limiters.
const redis = Redis.fromEnv();

const limiters = {
  default: new Ratelimit({
    redis,
    limiter: Ratelimit.tokenBucket(100, "1 m", 200), // refill 100/min, burst cap 200
    prefix: "rl:default",
  }),
  ai: new Ratelimit({
    redis,
    limiter: Ratelimit.tokenBucket(30, "1 h", 30), // refill 30/hour, burst cap 30
    prefix: "rl:ai",
  }),
};

export async function middleware(request: NextRequest) {
  // Assumes an upstream auth layer sets x-tenant-id; never trust a client-supplied value.
  const tenantId = request.headers.get("x-tenant-id") ?? "anon";
  const isAi = request.nextUrl.pathname.startsWith("/api/ai");
  const limiter = isAi ? limiters.ai : limiters.default;

  // `reset` is a Unix timestamp in milliseconds.
  const { success, remaining, reset, limit } = await limiter.limit(tenantId);

  if (!success) {
    // Clamp to zero so clock skew never produces a negative Retry-After.
    const retryAfterSeconds = Math.max(0, Math.ceil((reset - Date.now()) / 1000));
    return new NextResponse("Too Many Requests", {
      status: 429,
      headers: {
        "X-RateLimit-Limit": limit.toString(),
        "X-RateLimit-Remaining": "0",
        "X-RateLimit-Reset": reset.toString(),
        "Retry-After": retryAfterSeconds.toString(),
      },
    });
  }

  // Expose remaining capacity on successful responses too.
  const response = NextResponse.next();
  response.headers.set("X-RateLimit-Remaining", remaining.toString());
  return response;
}
```
Two things this code does that most examples skip. First, it sets `X-RateLimit-Remaining` on successful responses as well — clients that implement proactive backoff need this header on every response, not only on the 429. Second, it separates the AI limiter into its own bucket with its own prefix, so AI and non-AI traffic account against separate counters.
Return `429`, not `503`. Amazon's builders library makes this distinction explicit: `429` signals client misbehavior; `503` signals server failure. The difference matters for retry logic. A responsible client should not aggressively retry a `429` the same way it retries a transient server error. Send the right code, and clients that respect the spec will handle it correctly.
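On the client side, the distinction might translate into something like the sketch below. The `retryDelayMs` helper and its backoff constants are assumptions for illustration, and it only handles the delay-in-seconds form of `Retry-After`, not the HTTP-date form.

```typescript
// Returns how long to wait before retrying, or null if the request should not be retried.
function retryDelayMs(
  status: number,
  retryAfter: string | null,
  attempt: number,
): number | null {
  if (status === 429) {
    // Throttled: wait as long as the server asks, never less.
    const seconds = retryAfter === null ? NaN : Number(retryAfter);
    // Fall back to a conservative 60s when the header is missing or not a number.
    return Number.isFinite(seconds) && seconds >= 0 ? seconds * 1000 : 60000;
  }
  if (status === 503) {
    // Transient server failure: exponential backoff from 500ms, capped at 30s.
    return Math.min(30000, 500 * Math.pow(2, attempt));
  }
  return null; // other statuses are not retried in this sketch
}
```

A real client would also add jitter to the `503` branch so that many clients don't retry in lockstep; it is left out here to keep the arithmetic easy to follow.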
One additional threshold worth adding: warn before you reject. Emit an `X-RateLimit-Warning: true` header at 80% utilization so clients can back off before hitting the hard limit. That tends to mean fewer failed-request support tickets from enterprise customers who run automated scripts against your API.
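A small helper can compute that warning alongside the standard headers. This is a sketch: `rateLimitHeaders` is an illustrative name, `X-RateLimit-Warning` is a custom header rather than a standard one, and the 80% threshold is exposed as a parameter.

```typescript
// Builds rate limit headers and adds a warning once utilization crosses `warnAt`.
function rateLimitHeaders(
  limit: number,
  remaining: number,
  warnAt: number = 0.8,
): Record<string, string> {
  const safeRemaining = Math.max(0, remaining);
  const used = limit - safeRemaining;
  const headers: Record<string, string> = {
    "X-RateLimit-Limit": String(limit),
    "X-RateLimit-Remaining": String(safeRemaining),
  };
  if (limit > 0 && used / limit >= warnAt) {
    headers["X-RateLimit-Warning"] = "true";
  }
  return headers;
}
```

In a middleware, the returned entries would be copied onto the outgoing response with `response.headers.set(name, value)`.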
What Happens When Your Redis Goes Down?
When Redis is unreachable, failing open with a circuit breaker beats hard failure — rate limiting outages should not become product outages.
Have you ever had your rate limiter take down your API at 11pm because a Redis connection timed out? The control layer should be resilient even when the store it reads from isn't.
Three strategies in order of preference:
- Fail open with circuit breaker: when the Upstash HTTP request times out, allow the request through and log the failure. Set a 30-second fail-open window, then attempt recovery. Brief unguarded throughput is a smaller problem than a complete outage for most SaaS products.
- Postgres fallback counter: for standard Node.js API routes outside the Edge runtime, a Postgres-based fixed-window counter handles several hundred requests per second. This is the right choice for any serverless or early-stage SaaS MVP stack that wants rate limiting without adding Redis to the infrastructure bill yet:
```sql
-- Requires a primary key or unique constraint on bucket_key for ON CONFLICT.
INSERT INTO rate_limit_counters (bucket_key, count, window_start)
VALUES ($1, 1, date_trunc('minute', NOW()))
ON CONFLICT (bucket_key) DO UPDATE SET
  -- Restart the count when the stored window is older than the current minute.
  count = CASE
    WHEN rate_limit_counters.window_start < date_trunc('minute', NOW()) THEN 1
    ELSE rate_limit_counters.count + 1
  END,
  window_start = CASE
    WHEN rate_limit_counters.window_start < date_trunc('minute', NOW())
      THEN date_trunc('minute', NOW())
    ELSE rate_limit_counters.window_start
  END,
  updated_at = NOW()
RETURNING count, window_start;
```
- Cloudflare network-layer limiting: if your product sits behind Cloudflare, their network-layer rate limiting runs before traffic reaches your origin and handles volumetric abuse without touching your application. Combine this with application-layer limiting — Cloudflare catches the volumetric flood; your application enforces the per-tenant tier logic Cloudflare can't see.
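The first strategy, failing open behind a circuit breaker, might look like the sketch below. `FailOpenBreaker`, `withTimeout`, and `limitWithFailOpen` are illustrative names, and the `check` callback stands in for whatever call your limiter makes to the counter store. The 30-second window matches the figure above; the request timeout is left to the caller.

```typescript
type LimitResult = { success: boolean };

// Tracks whether the limiter is currently bypassed after a store failure.
class FailOpenBreaker {
  private readonly failOpenMs: number;
  private openUntilMs = 0;

  constructor(failOpenMs: number) {
    this.failOpenMs = failOpenMs;
  }

  isOpen(nowMs: number): boolean {
    return nowMs < this.openUntilMs;
  }

  recordFailure(nowMs: number): void {
    this.openUntilMs = nowMs + this.failOpenMs;
  }

  recordSuccess(): void {
    this.openUntilMs = 0;
  }
}

// Rejects if `work` has not settled within `timeoutMs`.
function withTimeout<T>(work: Promise<T>, timeoutMs: number): Promise<T> {
  return new Promise<T>((resolve, reject) => {
    const timer = setTimeout(() => reject(new Error("rate limit store timeout")), timeoutMs);
    work.then(
      (value) => { clearTimeout(timer); resolve(value); },
      (err) => { clearTimeout(timer); reject(err); },
    );
  });
}

// Allows the request through whenever the store is slow, down, or recently failed.
async function limitWithFailOpen(
  check: () => Promise<LimitResult>,
  breaker: FailOpenBreaker,
  timeoutMs: number,
  now: () => number = Date.now,
): Promise<LimitResult> {
  if (breaker.isOpen(now())) return { success: true }; // skip the store entirely
  try {
    const result = await withTimeout(check(), timeoutMs);
    breaker.recordSuccess();
    return result;
  } catch (err) {
    breaker.recordFailure(now());
    console.error("rate limiter unavailable, failing open", err);
    return { success: true };
  }
}

const breaker = new FailOpenBreaker(30000); // 30-second fail-open window
```

While the breaker is open, requests skip the store completely, so a struggling Redis isn't hit with a retry on every request. The first call after the window expires acts as the recovery probe.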
Rate limiting is control infrastructure. It has to stay up when the dependencies it sits on don't.
The Billing Tier Check Belongs Before the Rate Limit
Most SaaS codebases I review have the rate limiter in middleware and the billing tier check inside the route handler. That order is backwards, and the consequences are expensive.
A request from a free-trial tenant arrives. The rate limiter clears it — the per-minute throughput is fine. The request hits your route handler, which calls an LLM inference endpoint, generates a PDF, or hits a metered third-party API. Then the billing check refuses it. The expensive downstream operation has already been invoked.
Zuplo's cost protection analysis puts this precisely: billing alerts notify you after the damage is done. By the time a human reads a cost threshold email on a Saturday morning, the spend has accumulated. Enforcement has to be in the critical path, before the expensive operation.
Rate limits and billing quotas measure different things. A rate limit controls throughput: requests per minute. A quota controls total consumption: API calls this billing period, LLM tokens this month. A slow retry loop, one request every ten seconds, passes every rate limit check while burning through a monthly quota in hours. You need both checks, and both need to run before the downstream call.
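The arithmetic behind that retry loop is worth spelling out. The 10,000-call monthly quota below is a hypothetical figure chosen for illustration; the ten-second interval and the 100 requests/minute limit come from the examples above.

```typescript
// Requests per minute produced by a client that sends one request every `intervalSeconds`.
function requestsPerMinute(intervalSeconds: number): number {
  return 60 / intervalSeconds;
}

// Hours until a quota of `quotaCalls` is exhausted at that steady cadence.
function hoursToExhaust(quotaCalls: number, intervalSeconds: number): number {
  return (quotaCalls * intervalSeconds) / 3600;
}

const loopRate = requestsPerMinute(10); // 6 requests/minute, far below a 100/minute limit
const hoursLeft = hoursToExhaust(10000, 10); // about 27.8 hours for a 10,000-call quota
```

Six requests a minute never trips a 100/minute limit, yet the month's allowance is gone in just over a day.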
The correct middleware order:
```
1. Auth → is this request authenticated?
2. Billing quota → has this tenant's period allowance been exhausted?
3. Rate limit → is this tenant bursting above their per-minute limit?
4. Route handler
```
In practice, the billing quota check and the rate limit check often collapse into a single middleware pass when both counters live in the same Redis instance. The point is that both checks run before the downstream call, not inside it or after it.
Concretely: if your billing quota check lives only in a Stripe webhook handler or a nightly cron job, you have a gap. Any request that arrives between a quota exhaustion event and the next scheduled check will be allowed through. Middleware is the only enforcement layer that runs on every request.
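Here is a framework-agnostic sketch of that ordering. The `RequestContext` shape, the rejection reasons, and `runPipeline` are all hypothetical; in a real system each check would read from your auth layer and counter store rather than from fields on an object.

```typescript
// Hypothetical per-request facts, resolved before any expensive work starts.
interface RequestContext {
  authenticated: boolean;
  quotaUsed: number;
  quotaLimit: number;
  overRateLimit: boolean;
}

// A check returns a rejection reason, or null to let the request continue.
type Check = (ctx: RequestContext) => string | null;

// Order matters: auth, then billing quota, then rate limit.
const orderedChecks: Check[] = [
  (ctx) => (ctx.authenticated ? null : "unauthenticated"),
  (ctx) => (ctx.quotaUsed < ctx.quotaLimit ? null : "quota_exhausted"),
  (ctx) => (ctx.overRateLimit ? "rate_limited" : null),
];

// Runs every check first; the expensive handler is only invoked if all of them pass.
function runPipeline(ctx: RequestContext, expensiveHandler: () => string): string {
  for (const check of orderedChecks) {
    const rejection = check(ctx);
    if (rejection !== null) return rejection;
  }
  return expensiveHandler();
}
```

Because the handler is passed in as a function, a rejected request never reaches the LLM call, PDF render, or metered API behind it.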
What's your current bucket key? If it's IP address on a B2B product, that's the highest-return change available — more than algorithm selection, more than Redis provider. Key on `tenantId`, add endpoint granularity for your expensive operations, move the billing quota check upstream of the downstream call. Everything else is tuning.
