Tech Stack · 19 June 2026 · 9 min read

Rate Limiting Patterns for SaaS APIs in 2026

Token bucket, sliding window, or fixed window — the algorithm is rarely the main problem. Bucket key strategy, middleware order, and fallback behavior are where SaaS incidents live.

The conventional take on rate limiting is to pick an algorithm, point it at Redis, and move on. That covers roughly 40% of what production requires. The rest lives in the key you limit on, the order your middleware runs, and what your system does when the counter store stops responding.

I've implemented rate limiting on Callidus, a multi-tenant SaaS for UK aesthetic clinics built on React and Firebase. That includes a per-session AI rate limiter at 30 requests per hour for standard users and 100 per hour for pro tier, isolated from the rest of the booking API. That endpoint separation is what kept the Calia AI assistant from degrading booking throughput when a practitioner ran heavy consultation sessions. The algorithm choice mattered far less than the key strategy and middleware order. This post covers the decisions that show up in actual incidents.

Fixed Window Is the Wrong Default for Most SaaS APIs

Fixed window rate limiting has one critical flaw: deliberate double-bursting at window boundaries can consume twice the intended quota in two seconds.

Arcjet's breakdown of rate limiting algorithms describes the exact exploit: a client sends 100 requests at 12:00:59 and another 100 at 12:01:00 — 200 requests in two seconds against a 100/min limit. This isn't theoretical. Anyone reading your `X-RateLimit-Reset` header can time this precisely. The reset timestamp tells them when the window resets; exploiting the boundary is one line of client code.
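The boundary burst is easy to reproduce. The following is a minimal sketch, assuming an in-memory counter and timestamps passed in by the caller; `FixedWindowLimiter` and `sendBurst` are illustrative names, not part of any library.

```typescript
// Minimal in-memory fixed-window counter (illustrative only, not production code).
class FixedWindowLimiter {
  private readonly counts = new Map<number, number>();
  private readonly limit: number;
  private readonly windowMs: number;

  constructor(limit: number, windowMs: number) {
    this.limit = limit;
    this.windowMs = windowMs;
  }

  // Returns true when a request at timestamp `nowMs` fits in its window.
  allow(nowMs: number): boolean {
    const windowIndex = Math.floor(nowMs / this.windowMs);
    const used = this.counts.get(windowIndex) ?? 0;
    if (used >= this.limit) return false;
    this.counts.set(windowIndex, used + 1);
    return true;
  }
}

// Sends `n` requests at the same instant and counts how many are accepted.
function sendBurst(limiter: FixedWindowLimiter, n: number, nowMs: number): number {
  let accepted = 0;
  for (let i = 0; i < n; i++) {
    if (limiter.allow(nowMs)) accepted++;
  }
  return accepted;
}

const limiter = new FixedWindowLimiter(100, 60_000); // 100 requests per minute
const boundary = 60_000 * 1000; // an arbitrary minute boundary, in ms

const beforeBoundary = sendBurst(limiter, 100, boundary - 1_000); // the "12:00:59" burst
const afterBoundary = sendBurst(limiter, 100, boundary); // the "12:01:00" burst

// Both bursts land in different windows, so all 200 requests are accepted.
console.log(beforeBoundary + afterBoundary);
```

Each burst is counted against a different window, so neither one sees the other. That is the whole exploit.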

Token bucket is the right default for most developer-facing REST APIs. Tokens accumulate over time up to a maximum capacity; each request consumes one. Idle clients bank capacity for controlled bursts without exceeding long-term average throughput. You can't enforce a strict "no more than N requests in the last 60 seconds" guarantee, but for interactive APIs that's rarely the constraint that matters. What matters is that a user can send a burst of requests without hitting an arbitrary wall.
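To make the refill-and-consume mechanics concrete, here is a small in-memory sketch. It assumes a single process and a caller-supplied clock; the `TokenBucket` class and its parameter names are hypothetical, and a production limiter would keep this state in a shared store.

```typescript
// In-memory token bucket (illustrative sketch, single process only).
class TokenBucket {
  private tokens: number;
  private lastRefillMs: number;
  private readonly capacity: number;
  private readonly refillPerSecond: number;

  constructor(capacity: number, refillPerSecond: number, nowMs: number) {
    this.capacity = capacity;
    this.refillPerSecond = refillPerSecond;
    this.tokens = capacity; // start full so a new client can burst immediately
    this.lastRefillMs = nowMs;
  }

  // Refills based on elapsed time, then tries to take one token.
  tryConsume(nowMs: number): boolean {
    const elapsedSeconds = Math.max(0, nowMs - this.lastRefillMs) / 1000;
    this.tokens = Math.min(this.capacity, this.tokens + elapsedSeconds * this.refillPerSecond);
    this.lastRefillMs = nowMs;
    if (this.tokens < 1) return false;
    this.tokens -= 1;
    return true;
  }
}

// Capacity 5, refilling 1 token per second.
const bucket = new TokenBucket(5, 1, 0);
const burst = [0, 0, 0, 0, 0, 0].map((t) => bucket.tryConsume(t)); // sixth call is rejected
const afterTwoSeconds = [2000, 2000, 2000].map((t) => bucket.tryConsume(t)); // two tokens came back
console.log(burst, afterTwoSeconds);
```

The idle period is what buys the burst: capacity caps how much a client can bank, and the refill rate caps the long-term average.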

Here's the algorithm selection table that actually guides the choice:

| Scenario | Algorithm |
|---|---|
| Public developer API, variable cadence | Token bucket |
| Auth endpoints, OTP, password reset | Sliding window log |
| Billing quotas (daily/monthly caps) | Fixed window |
| Outbound webhooks, smooth delivery | Leaky bucket |
| High-scale distributed systems | Sliding window counter |

Fixed window belongs on billing quotas — monthly API call limits where the period boundary doesn't create an exploitable edge. Not on the endpoints that carry your actual product traffic.
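For comparison, the sliding window counter from the table blunts the boundary burst by weighting the previous window's count. This is a rough in-memory sketch under the usual approximation (previous count scaled by how much of the previous window still overlaps the trailing window); the class name and fields are illustrative.

```typescript
// Sliding window counter: approximates a rolling window from two fixed-window counts.
class SlidingWindowCounter {
  private currentWindow = 0;
  private currentCount = 0;
  private previousCount = 0;
  private readonly limit: number;
  private readonly windowMs: number;

  constructor(limit: number, windowMs: number) {
    this.limit = limit;
    this.windowMs = windowMs;
  }

  allow(nowMs: number): boolean {
    const windowIndex = Math.floor(nowMs / this.windowMs);
    if (windowIndex !== this.currentWindow) {
      // Roll forward. If more than one full window passed, the previous window is empty.
      this.previousCount = windowIndex === this.currentWindow + 1 ? this.currentCount : 0;
      this.currentCount = 0;
      this.currentWindow = windowIndex;
    }
    // Fraction of the current window that has elapsed, between 0 and 1.
    const elapsedFraction = (nowMs % this.windowMs) / this.windowMs;
    const estimated = this.previousCount * (1 - elapsedFraction) + this.currentCount;
    if (estimated >= this.limit) return false;
    this.currentCount += 1;
    return true;
  }
}

// Counts accepted requests out of `n` sent at the same instant.
function countAccepted(counter: SlidingWindowCounter, n: number, nowMs: number): number {
  let accepted = 0;
  for (let i = 0; i < n; i++) {
    if (counter.allow(nowMs)) accepted++;
  }
  return accepted;
}

const counter = new SlidingWindowCounter(100, 60_000);
const minuteBoundary = 60_000 * 1000;
const lateBurst = countAccepted(counter, 100, minuteBoundary - 1_000); // fills the old window
const boundaryBurst = countAccepted(counter, 100, minuteBoundary); // previous window still counts in full
const halfWayBurst = countAccepted(counter, 100, minuteBoundary + 30_000); // half the old count has aged out
console.log(lateBurst, boundaryBurst, halfWayBurst);
```

The second burst at the boundary is rejected outright because the previous window's 100 requests still carry full weight.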

Why Does Your Bucket Key Matter More Than the Algorithm?

Your rate limit bucket key determines whose quota gets consumed — choose the wrong dimension and you've isolated nothing even with a correct algorithm.

Per-IP limiting is the default in almost every rate limiting tutorial. For B2B SaaS, it's wrong. Enterprise clients route traffic through corporate proxies. Fifty users can share one egress IP. You'll throttle your most valuable customers first while bot traffic from residential IPs passes freely. Per-IP is the right key for login endpoints and public unauthenticated routes. Not for your authenticated product API.

Per-tenant keying (`tenantId`) is the right default for any SaaS serving organizations rather than individual users. Each tenant gets an independent bucket, sized by plan tier. A Starter tenant gets 100 requests/minute; an Enterprise tenant gets 1,000. Their traffic never bleeds into each other's counter.

Per-endpoint granularity goes on top of per-tenant. Not all endpoints cost the same. A `GET /patients` list call is not the same as a `POST /reports/generate` call. On Callidus, the Calia AI assistant runs a separate rate limiter entirely — isolated so that a heavy consultation session doesn't reduce throughput for the booking flow. The composite key: `tenantId:endpoint`. For AI or compute-heavy endpoints, add `:tier` to vary limits by plan level.
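A sketch of how such a composite key might be built. Everything here is an assumption for illustration: the `bucketKey` helper, the `Tier` names, and the choice to collapse numeric path segments so that `/patients/123` and `/patients/456` share one bucket.

```typescript
type Tier = "starter" | "pro" | "enterprise";

interface KeyParts {
  tenantId: string;
  method: string;
  path: string;
  tier?: Tier; // only added for endpoints whose limits vary by plan
}

// Strips the query string and replaces numeric segments with "*",
// so per-resource URLs map to one endpoint bucket.
function normalizeEndpoint(method: string, path: string): string {
  const cleanPath = path
    .replace(/\?.*$/, "")
    .split("/")
    .filter((segment) => segment.length > 0)
    .map((segment) => (/^\d+$/.test(segment) ? "*" : segment))
    .join("/");
  return `${method.toUpperCase()} /${cleanPath}`;
}

// Builds "tenantId:endpoint" or "tenantId:endpoint:tier".
function bucketKey(parts: KeyParts): string {
  const segments = [parts.tenantId, normalizeEndpoint(parts.method, parts.path)];
  if (parts.tier !== undefined) segments.push(parts.tier);
  return segments.join(":");
}

const listKey = bucketKey({ tenantId: "tenant_42", method: "get", path: "/patients/123?page=2" });
const reportKey = bucketKey({
  tenantId: "tenant_42",
  method: "post",
  path: "/reports/generate",
  tier: "pro",
});
console.log(listKey); // tenant_42:GET /patients/*
console.log(reportKey); // tenant_42:POST /reports/generate:pro
```

Normalizing the path matters as much as the tenant prefix: without it, every resource ID gets its own bucket and the endpoint limit never triggers.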

Amazon's builders library defines the goal precisely: fairness in a multi-tenant system means every client is provided with a single-tenant experience. The noisy-neighbor problem is a key problem, not a scaling problem. One tenant's data export consuming headroom that belongs to another tenant's real-time dashboard gets fixed by keying on the right dimension.

How Do You Wire Rate Limiting in a Next.js SaaS?

Rate limiting in Next.js runs in Edge middleware using `@upstash/ratelimit`, keyed on a composite `tenantId` and endpoint identifier, returning `429` with `Retry-After` on breach.

The reason Upstash is the practical answer here is architectural. Edge runtimes (Vercel Edge, Cloudflare Workers) restrict or lack the Node.js TCP sockets that standard Redis clients expect. Upstash exposes an HTTP API alongside the Redis protocol, which makes it one of the few persistent counter stores that reaches an Edge function without a workaround. Upstash's 2026 comparison of serverless Redis options puts their free tier at 500K commands per month, enough for rate limiting across several hundred daily active tenants before you pay anything.

```typescript
import { NextRequest, NextResponse } from "next/server";
import { Ratelimit } from "@upstash/ratelimit";
import { Redis } from "@upstash/redis";

// Reads UPSTASH_REDIS_REST_URL and UPSTASH_REDIS_REST_TOKEN from the environment.
const redis = Redis.fromEnv();

const limiters = {
  default: new Ratelimit({
    redis,
    limiter: Ratelimit.tokenBucket(100, "1 m", 200), // refill 100/min, burst cap 200
    prefix: "rl:default",
  }),
  ai: new Ratelimit({
    redis,
    limiter: Ratelimit.tokenBucket(30, "1 h", 30), // refill 30/hour, no extra burst
    prefix: "rl:ai",
  }),
};

export async function middleware(request: NextRequest) {
  // Assumes an upstream auth step sets x-tenant-id; never trust a client-supplied value.
  const tenantId = request.headers.get("x-tenant-id") ?? "anon";
  const isAi = request.nextUrl.pathname.startsWith("/api/ai");
  const limiter = isAi ? limiters.ai : limiters.default;

  // The prefix namespaces the counter, so the stored key is effectively prefix + tenantId.
  const { success, remaining, reset, limit } = await limiter.limit(tenantId);

  if (!success) {
    // `reset` is a Unix timestamp in milliseconds; clamp so Retry-After is never negative.
    const retryAfterSeconds = Math.max(0, Math.ceil((reset - Date.now()) / 1000));
    return new NextResponse("Too Many Requests", {
      status: 429,
      headers: {
        "X-RateLimit-Limit": limit.toString(),
        "X-RateLimit-Remaining": "0",
        "X-RateLimit-Reset": reset.toString(),
        "Retry-After": retryAfterSeconds.toString(),
      },
    });
  }

  const response = NextResponse.next();
  response.headers.set("X-RateLimit-Remaining", remaining.toString());
  return response;
}
```

Two things this code does that most examples skip. First, it sets `X-RateLimit-Remaining` on successful responses as well — clients that implement proactive backoff need this header on every response, not only on the 429. Second, it separates the AI limiter into its own bucket with its own prefix, so AI and non-AI traffic account against separate counters.

Return `429`, not `503`. Amazon's builders library makes this distinction explicit: `429` signals client misbehavior; `503` signals server failure. The difference matters for retry logic. A responsible client should not aggressively retry a `429` the same way it retries a transient server error. Send the right code, and clients that respect the spec will handle it correctly.
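On the client side, that distinction might look like the sketch below. The function name, the 60-second fallback, and the backoff constants are all assumptions chosen for illustration; the only fixed point is that `Retry-After` can carry either a number of seconds or an HTTP date.

```typescript
// Decides how long a client should wait before retrying, or null for "do not retry".
function retryDelayMs(
  status: number,
  retryAfter: string | null,
  attempt: number,
  nowMs: number,
): number | null {
  if (status === 429) {
    // Honor the server's instruction instead of retrying aggressively.
    if (retryAfter !== null && retryAfter.trim() !== "") {
      const seconds = Number(retryAfter);
      if (Number.isFinite(seconds)) return Math.max(0, seconds * 1000);
      const dateMs = Date.parse(retryAfter); // Retry-After may also be an HTTP date
      if (!Number.isNaN(dateMs)) return Math.max(0, dateMs - nowMs);
    }
    return 60_000; // assumed fallback when the header is missing or unparseable
  }
  if (status === 503) {
    // Transient server failure: exponential backoff, capped at 30 seconds.
    return Math.min(30_000, 500 * 2 ** attempt);
  }
  return null; // other statuses are not handled by this sketch
}

console.log(retryDelayMs(429, "12", 0, Date.now())); // 12000
console.log(retryDelayMs(503, null, 3, Date.now())); // 4000
```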

One additional threshold worth adding: warn before you reject. Emit an `X-RateLimit-Warning: true` header at 80% utilization so clients can back off before hitting the hard limit. You'll have significantly fewer failed-request support tickets from enterprise customers who run automated scripts against your API.
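One way to compute that warning alongside the standard headers, as a sketch. `rateLimitHeaders` is a hypothetical helper, and `X-RateLimit-Warning` is this article's own convention rather than a standardized header.

```typescript
// Builds response headers from a limiter result; adds a warning at or above `warnAt` utilization.
function rateLimitHeaders(
  limit: number,
  remaining: number,
  resetMs: number,
  warnAt: number = 0.8,
): Record<string, string> {
  const headers: Record<string, string> = {
    "X-RateLimit-Limit": String(limit),
    "X-RateLimit-Remaining": String(Math.max(0, remaining)),
    "X-RateLimit-Reset": String(resetMs),
  };
  // Utilization is the share of the quota already used in this window.
  const utilization = limit > 0 ? (limit - remaining) / limit : 1;
  if (utilization >= warnAt) headers["X-RateLimit-Warning"] = "true";
  return headers;
}

const comfortable = rateLimitHeaders(100, 60, 1_700_000_000_000); // 40% used, no warning
const nearLimit = rateLimitHeaders(100, 20, 1_700_000_000_000); // 80% used, warning set
console.log(comfortable, nearLimit);
```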

What Happens When Your Redis Goes Down?

When Redis is unreachable, failing open with a circuit breaker beats hard failure — rate limiting outages should not become product outages.

Have you ever had your rate limiter take down your API at 11pm because a Redis connection timed out? The control layer should be resilient even when the store it reads from isn't.

Three strategies in order of preference:

  1. Fail open with circuit breaker: when the Upstash HTTP request times out, allow the request through and log the failure. Set a 30-second fail-open window, then attempt recovery. Brief unguarded throughput is a smaller problem than a complete outage for most SaaS products.

  2. Postgres fallback counter: for standard Node.js API routes outside the Edge runtime, a Postgres-based fixed-window counter handles several hundred requests per second. This is the right choice for any serverless or early-stage SaaS MVP stack that wants rate limiting without adding Redis to the infrastructure bill yet:

```sql
INSERT INTO rate_limit_counters (bucket_key, count, window_start)
VALUES ($1, 1, date_trunc('minute', NOW()))
ON CONFLICT (bucket_key) DO UPDATE SET
  count = CASE
    WHEN rate_limit_counters.window_start < date_trunc('minute', NOW()) THEN 1
    ELSE rate_limit_counters.count + 1
  END,
  window_start = CASE
    WHEN rate_limit_counters.window_start < date_trunc('minute', NOW())
      THEN date_trunc('minute', NOW())
    ELSE rate_limit_counters.window_start
  END,
  updated_at = NOW()
RETURNING count, window_start;
```

  3. Cloudflare network-layer limiting: if your product sits behind Cloudflare, their network-layer rate limiting runs before traffic reaches your origin and handles volumetric abuse without touching your application. Combine this with application-layer limiting: Cloudflare catches the volumetric flood; your application enforces the per-tenant tier logic Cloudflare can't see.
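The fail-open strategy from the first item can be sketched as a thin wrapper around whatever limiter you call. This is an illustration under assumptions: `FailOpenLimiter`, `withTimeout`, and the 200 ms timeout are hypothetical, and the wrapped `check` function stands in for a call such as a limiter's `limit(key)`.

```typescript
type LimitFn = (key: string) => Promise<{ success: boolean }>;
type GuardedResult = { success: boolean; degraded: boolean };

// Rejects if `promise` does not settle within `ms` milliseconds.
function withTimeout<T>(promise: Promise<T>, ms: number): Promise<T> {
  return new Promise<T>((resolve, reject) => {
    const timer = setTimeout(() => reject(new Error(`timed out after ${ms}ms`)), ms);
    promise.then(
      (value) => {
        clearTimeout(timer);
        resolve(value);
      },
      (error) => {
        clearTimeout(timer);
        reject(error);
      },
    );
  });
}

class FailOpenLimiter {
  private openUntilMs = 0; // while now < openUntilMs, the store is skipped entirely
  private readonly check: LimitFn;
  private readonly timeoutMs: number;
  private readonly failOpenMs: number;
  private readonly now: () => number;

  constructor(
    check: LimitFn,
    timeoutMs: number = 200,
    failOpenMs: number = 30_000,
    now: () => number = () => Date.now(),
  ) {
    this.check = check;
    this.timeoutMs = timeoutMs;
    this.failOpenMs = failOpenMs;
    this.now = now;
  }

  async limit(key: string): Promise<GuardedResult> {
    // Circuit open: allow the request without touching the counter store.
    if (this.now() < this.openUntilMs) return { success: true, degraded: true };
    try {
      const result = await withTimeout(this.check(key), this.timeoutMs);
      return { success: result.success, degraded: false };
    } catch (error) {
      // Store failed or timed out: open the circuit, log, and let the request through.
      this.openUntilMs = this.now() + this.failOpenMs;
      console.warn("rate limiter unavailable, failing open", error);
      return { success: true, degraded: true };
    }
  }
}
```

The `degraded` flag is there so the caller can emit a metric; a fail-open window that nobody notices is its own incident.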

Rate limiting is control infrastructure. It has to stay up when the dependencies it sits on don't.

The Billing Tier Check Belongs Before the Rate Limit

Most SaaS codebases I review have the rate limiter in middleware and the billing tier check inside the route handler. That order is backwards, and the consequences are expensive.

A request from a free-trial tenant arrives. The rate limiter clears it — the per-minute throughput is fine. The request hits your route handler, which calls an LLM inference endpoint, generates a PDF, or hits a metered third-party API. Then the billing check refuses it. The expensive downstream operation has already been invoked.

Zuplo's cost protection analysis puts this precisely: billing alerts notify you after the damage is done. By the time a human reads a cost threshold email on a Saturday morning, the spend has accumulated. Enforcement has to be in the critical path, before the expensive operation.

Rate limits and billing quotas measure different things. A rate limit controls throughput: requests per minute. A quota controls total consumption: API calls this billing period, LLM tokens this month. A slow retry loop, one request every ten seconds, passes every rate limit check while burning through a monthly quota in hours. You need both checks, and both need to run before the downstream call.

The correct middleware order:

```
1. Auth → is this request authenticated?
2. Billing quota → has this tenant's period allowance been exhausted?
3. Rate limit → is this tenant bursting above their per-minute limit?
4. Route handler
```

In practice, steps 2 and 3 often collapse into a single middleware pass when both counters live in the same Redis instance. The point is that both checks run before the downstream call, not inside it or after it.

Concretely: if your billing quota check lives only in a Stripe webhook handler or a nightly cron job, you have a gap. Any request that arrives between a quota exhaustion event and the next scheduled check will be allowed through. Middleware is the only enforcement layer that runs on every request.
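A sketch of that ordering as a single admission function. The `Gates` interface and `admit` are hypothetical names, the three gate functions stand in for your real auth, quota, and rate limit lookups, and returning `429` for an exhausted quota is an assumption (some APIs prefer `402` or `403`).

```typescript
type Decision =
  | { allowed: true; tenantId: string }
  | { allowed: false; status: number; reason: string };

interface Gates {
  authenticate: (token: string | null) => Promise<string | null>; // resolves to tenantId or null
  quotaRemaining: (tenantId: string) => Promise<number>; // units left this billing period
  withinRateLimit: (tenantId: string) => Promise<boolean>; // per-minute throughput check
}

// Runs auth, then billing quota, then rate limit. The route handler only runs on `allowed: true`.
async function admit(token: string | null, gates: Gates): Promise<Decision> {
  const tenantId = await gates.authenticate(token);
  if (tenantId === null) {
    return { allowed: false, status: 401, reason: "unauthenticated" };
  }
  if ((await gates.quotaRemaining(tenantId)) <= 0) {
    return { allowed: false, status: 429, reason: "quota_exhausted" };
  }
  if (!(await gates.withinRateLimit(tenantId))) {
    return { allowed: false, status: 429, reason: "rate_limited" };
  }
  return { allowed: true, tenantId };
}
```

Because the function returns at the first failing gate, a tenant with an exhausted quota never consumes a rate limit token and never reaches the expensive downstream call.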


What's your current bucket key? If it's IP address on a B2B product, that's the highest-return change available — more than algorithm selection, more than Redis provider. Key on `tenantId`, add endpoint granularity for your expensive operations, move the billing quota check upstream of the downstream call. Everything else is tuning.

Dusko Licanin

Full-Stack Developer · Banja Luka, Bosnia

Full-stack developer shipping SaaS MVPs, web apps, and mobile apps 2× faster than agencies using AI-augmented workflows. Live portfolio: BookBed, Callidus, Pizzeria Bestek.

Frequently Asked Questions

What is the best rate limiting algorithm for a SaaS API?

Token bucket is the recommended default for most developer-facing SaaS APIs because it allows controlled short bursts while enforcing a long-term average rate. Each request consumes a token from a bucket that refills at a fixed rate up to a maximum capacity — so a client that's been idle for a minute can burst without being rejected. Sliding window log offers stricter fairness but stores a timestamp per request, creating memory pressure under high load. Use sliding window log for security-sensitive endpoints like authentication and OTP. Fixed window belongs on billing quota enforcement (daily or monthly API call caps), where the window boundary doesn't create an exploitable edge. For outbound delivery to customer webhooks, leaky bucket is the right choice — it smooths bursts into a steady stream that downstream endpoints can handle reliably.

Should I rate limit per IP or per tenant in a SaaS API?

Rate limit per tenant, not per IP, for any SaaS API serving organizations rather than anonymous traffic. Per-IP limiting fails for B2B products because enterprise clients often route all traffic through corporate proxies, meaning fifty users share one egress IP — your most important customers hit the rate limit first while bot traffic from residential IPs passes freely. Per-tenant keying assigns each organization its own independent counter, sized to their plan tier. Add per-endpoint granularity for expensive operations: the composite key `tenantId:endpoint` lets you enforce a tighter limit on AI inference or PDF generation without restricting cheaper list queries. Reserve per-IP rate limiting for login endpoints, password reset flows, and any public unauthenticated surface where tenant context isn't available. [Amazon's builders library](https://aws.amazon.com/builders-library/fairness-in-multi-tenant-systems/) describes this as the mechanism for providing every client a 'single-tenant experience' within shared infrastructure.

How do I set up Upstash rate limiting in a Next.js API?

Install `@upstash/ratelimit` and `@upstash/redis`, initialize a `Ratelimit` instance in your Next.js middleware file with `Redis.fromEnv()` and your chosen algorithm, then call `ratelimit.limit(tenantId)` on each request. Set `UPSTASH_REDIS_REST_URL` and `UPSTASH_REDIS_REST_TOKEN` in your environment variables — the SDK reads these automatically. The SDK returns `success`, `remaining`, `reset`, and `limit`, which you forward as `X-RateLimit-*` response headers. Upstash works in Next.js Edge middleware because it uses HTTP rather than TCP — Edge runtimes (Vercel Edge, Cloudflare Workers) block TCP connections, so standard Redis clients don't work there. For different limits on different endpoint groups, initialize multiple `Ratelimit` instances with different prefixes and select the right one based on `request.nextUrl.pathname`. The `tokenBucket` algorithm is the most suitable for interactive APIs; use the `slidingWindow` variant for stricter fairness on auth endpoints.

What is the difference between API rate limiting and API throttling?

Rate limiting rejects excess requests immediately with a 429 response; throttling queues or delays requests to deliver them at a smoothed rate. Rate limiting is the right choice for inbound SaaS API traffic because it gives clients immediate feedback via 429 plus Retry-After, doesn't consume server resources holding queued requests, and forces well-behaved clients to implement backoff. Throttling is better for outbound operations where smooth delivery matters — sending webhooks to customer endpoints, sending emails, or calling metered third-party APIs where bursts cause downstream failures. The distinction matters for your client contract: a rate-limited client knows it was rejected and can retry after the reset window; a throttled client assumes eventual delivery but has no guaranteed time. Most SaaS products want rate limiting for incoming requests and throttling only for outbound delivery pipelines.

Should I check billing quota before or after the rate limit?

Check billing quota before the rate limit to prevent expensive downstream operations from running on tenants who have exhausted their period allowance. Rate limits control per-minute throughput; quotas control total consumption over a billing period. A request can pass the rate limit check while the tenant has already consumed their monthly API call cap — if the billing check runs inside the route handler, the request will reach your downstream service (LLM, PDF renderer, third-party API) before being refused. [Zuplo's cost protection analysis](https://zuplo.com/learning-center/api-cost-protection-rate-limits-quotas-spending-caps) puts this clearly: billing alerts notify you after the damage is done; enforcement in middleware prevents it. The correct order is authentication → billing quota → rate limit → route handler. A slow retry loop that respects rate limits can still exhaust a monthly quota in hours — you need both checks, both running before the expensive call.