Tech Stack · 15 June 2026 · 9 min read

Webhook Reliability: Idempotency Keys and Dead-Letter Queues in 2026

Three layers every production SaaS webhook handler needs: signature verification, event-ID idempotency, and a dead-letter queue with a working replay path.

Every webhook handler you've ever written is one bad day away from double-billing a customer or provisioning an account that was never paid for.

Stripe will retry for three days. Your queue won't drop the message. The gap is in your own handler: the milliseconds between acknowledging receipt and recording the outcome to your database. That gap is where production billing incidents live.

This post covers the three layers every production SaaS webhook handler needs: signature verification, idempotency keyed to the event ID, and a dead-letter queue with a replay path. Stripe documents all three. Most implementations skip at least one.

What Breaks Without Idempotency?

Skipping idempotency means the first retry produces a second charge, a second subscription, or a second user account — and there will be a retry.

Stripe retries failed webhook deliveries for up to three days in live mode, with exponential backoff. Sandbox mode is gentler (three attempts over a few hours), which is why the bug hides during development and detonates in production. Stripe also doesn't guarantee event delivery order: creating a subscription can fire customer.subscription.created, invoice.created, invoice.paid, and charge.created, and they may arrive in any sequence. Your handler needs to survive receiving any of these out of order, multiple times.

You've felt this. Your handler takes 6 seconds because a downstream API is slow. Stripe's timeout window closes at 5. Stripe marks the delivery failed and retries at +5 minutes. By then the handler works fine. It processes the retry. Now you have a double-provisioned account. The customer contacts support. Your handler never logged the timeout because from its side it returned successfully — to nothing.

The naive check-then-act pattern is the first trap. Your handler SELECTs for an existing event_id, finds nothing, provisions the subscription, then INSERTs the event as processed. That sequence fails under concurrent delivery: two retries arriving within milliseconds of each other both SELECT before either has INSERTed. Both proceed, and two subscriptions are created.

A 2026 DZone analysis cataloging four "Phantom Write" failure modes across payment platforms found that eliminating the non-atomic check pattern removed 99.98% of duplicate transactions. The fix is atomic: INSERT INTO processed_webhooks (event_id, received_at) VALUES ($1, now()) ON CONFLICT (event_id) DO NOTHING RETURNING id. If the RETURNING clause yields no rows, the event was already claimed. Stop.
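The race is easier to see with the interleaving written out. The sketch below is a deterministic in-memory model, not database code: a `Set` stands in for the unique constraint on `event_id`, and `claim` is a hypothetical helper playing the role of the `INSERT ... ON CONFLICT DO NOTHING RETURNING id` statement.

```typescript
// Deterministic in-memory model of two concurrent deliveries of one event.
// A Set stands in for the unique constraint on processed_webhooks.event_id.

// Check-then-act: both deliveries run their SELECT before either INSERTs.
function provisionsWithCheckThenAct(eventId: string): number {
  const processed = new Set<string>();
  let provisions = 0;
  const seenByA = processed.has(eventId); // delivery A: SELECT finds nothing
  const seenByB = processed.has(eventId); // delivery B: SELECT finds nothing
  if (!seenByA) { provisions += 1; processed.add(eventId); } // A provisions, then INSERTs
  if (!seenByB) { provisions += 1; processed.add(eventId); } // B acts on its stale read
  return provisions;
}

// Hypothetical helper modelling the atomic claim:
// true means a row came back (this delivery owns the event), false means no rows.
function claim(processed: Set<string>, eventId: string): boolean {
  if (processed.has(eventId)) return false;
  processed.add(eventId);
  return true;
}

function provisionsWithAtomicClaim(eventId: string): number {
  const processed = new Set<string>();
  let provisions = 0;
  if (claim(processed, eventId)) provisions += 1; // delivery A wins the claim
  if (claim(processed, eventId)) provisions += 1; // delivery B gets no row and stops
  return provisions;
}
```

In this model the claim is atomic only because the function runs synchronously. In Postgres the same guarantee comes from the unique index on `event_id`, which is what makes the single-statement form safe across connections.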

Does Signature Verification Actually Matter?

Skipping it means anyone who knows your webhook URL can trigger your billing logic with fabricated payloads.

The Stripe-Signature header contains a Unix timestamp (t=) and an HMAC-SHA256 value (v1=). The signature is computed over {timestamp}.{raw_body} using your endpoint secret as the key. Stripe's default tolerance is five minutes — events older than that are rejected to prevent replay attacks. Use your SDK's constructEvent() helper. Don't verify manually unless you have a specific reason.
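To make the scheme concrete, here is a minimal sketch of the check described above, using only Node's built-in `crypto` module. It is meant to illustrate the mechanics, not to replace the SDK helper; the function names, the `nowSeconds` parameter, and the secret used in any example call are assumptions made for this illustration.

```typescript
import { createHmac, timingSafeEqual } from "node:crypto";

const DEFAULT_TOLERANCE_SECONDS = 300; // five minutes, as described above

// HMAC-SHA256 over "{timestamp}.{raw_body}", hex-encoded.
function computeSignature(secret: string, timestamp: number, rawBody: string): string {
  return createHmac("sha256", secret).update(`${timestamp}.${rawBody}`, "utf8").digest("hex");
}

// Illustrative verifier. nowSeconds is injected so the check is testable.
function verifySignature(
  rawBody: string,
  header: string,
  secret: string,
  nowSeconds: number,
  toleranceSeconds: number = DEFAULT_TOLERANCE_SECONDS,
): boolean {
  // Header shape: "t=<unix seconds>,v1=<hex hmac>". Other keys are ignored.
  let timestamp = Number.NaN;
  const candidates: string[] = [];
  for (const part of header.split(",")) {
    const idx = part.indexOf("=");
    if (idx === -1) continue;
    const key = part.slice(0, idx).trim();
    const value = part.slice(idx + 1).trim();
    if (key === "t") timestamp = Number(value);
    if (key === "v1") candidates.push(value);
  }
  if (!Number.isFinite(timestamp) || candidates.length === 0) return false;

  // Reject events older than the tolerance to limit replay attacks.
  if (nowSeconds - timestamp > toleranceSeconds) return false;

  const expected = Buffer.from(computeSignature(secret, timestamp, rawBody), "utf8");
  return candidates.some((candidate) => {
    const given = Buffer.from(candidate, "utf8");
    // timingSafeEqual throws on length mismatch, so compare lengths first.
    return given.length === expected.length && timingSafeEqual(given, expected);
  });
}
```

The constant-time comparison is the detail most hand-rolled verifiers miss, which is one more reason to prefer the SDK helper in production code.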

One subtlety that costs engineers a day: verification must run against the raw request body, before any JSON parsing. Express middleware that parses JSON changes the body representation and breaks the HMAC check. The pattern is express.raw({ type: 'application/json' }) scoped to the webhook route only, not the global body parser. Test this in local dev with stripe listen --forward-to localhost:3000/api/webhooks — a parsing middleware mismatch shows up immediately. Stripe also publishes its IP ranges; allowlist them at the network edge before application code runs.
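The raw-body requirement can be demonstrated without Express at all. The following sketch signs a payload as it might arrive on the wire, then signs the same payload after a parse and re-serialize round trip. The secret, timestamp, and payload are made-up values for illustration.

```typescript
import { createHmac } from "node:crypto";

const signPayload = (secret: string, payload: string): string =>
  createHmac("sha256", secret).update(payload, "utf8").digest("hex");

const exampleSecret = "whsec_example_only"; // hypothetical secret
const exampleTimestamp = 1750000000;        // hypothetical t= value

// Bytes as they might arrive on the wire: note the spacing after the colons and comma.
const wireBody = '{"id": "evt_123",  "type": "invoice.paid"}';

// What remains after a JSON body parser followed by re-serialization.
const reserializedBody = JSON.stringify(JSON.parse(wireBody));

const senderSignature = signPayload(exampleSecret, `${exampleTimestamp}.${wireBody}`);
const rawBodyMatches = senderSignature === signPayload(exampleSecret, `${exampleTimestamp}.${wireBody}`);
const parsedBodyMatches = senderSignature === signPayload(exampleSecret, `${exampleTimestamp}.${reserializedBody}`);

console.log({ rawBodyMatches, parsedBodyMatches });
```

The parsed object carries the same data, but the bytes differ once whitespace is normalized, so the HMAC no longer matches. That is why the raw parser has to be scoped to the webhook route.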

How Do You Build Webhook Idempotency That Doesn't Race?

The atomic claim is the core: insert the event ID before touching business state, using a database constraint that makes the insert a no-op if the ID already exists.

Ordering matters more than almost any other implementation detail. If you send a confirmation email first and then mark the event as processed, a crash between those two steps means the retry sends the email again. Hookdeck's idempotency guide puts it plainly: mark events as processed before executing side effects. The correct sequence:

  1. Verify signature. Return 400 immediately on failure.
  2. Enqueue the raw event to a durable queue. Return 200 immediately.
  3. In the worker: claim with INSERT ... ON CONFLICT DO NOTHING.
  4. If no rows returned: already processed. Ack the message. Stop.
  5. In the same database transaction: execute the business operation.
  6. Commit.

Step 2, returning 200 before processing, decouples delivery acknowledgment from processing time. Stripe interprets 2xx within its timeout window as "delivered." A handler that processes inline can exceed that window whenever a downstream dependency is slow. Stripe then retries, and you process twice.
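Steps 1 and 2 of the sequence above reduce to a very small receive path. The sketch below models it as a plain function so the control flow is visible; `DurableQueue`, `WebhookResponse`, and the injected `verify` callback are hypothetical names, and a real queue client would almost certainly be asynchronous.

```typescript
interface WebhookResponse { status: 200 | 400 }

// Hypothetical queue interface. Assumption: enqueue returns only after the
// message is durably stored, and throws if it could not be stored.
interface DurableQueue {
  enqueue(rawBody: string): void;
}

function receiveWebhook(
  rawBody: string,
  signatureHeader: string,
  verify: (rawBody: string, signatureHeader: string) => boolean,
  queue: DurableQueue,
): WebhookResponse {
  // Step 1: reject anything that fails verification before doing any work.
  if (!verify(rawBody, signatureHeader)) return { status: 400 };

  // Step 2: persist the untouched raw event, then acknowledge.
  // No business logic runs here, so the response time stays short.
  queue.enqueue(rawBody);
  return { status: 200 };
}
```

If `enqueue` throws, the function never returns 200, so the sender sees a failure and retries. That is the desired behavior when the queue is unavailable.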

Steps 5 and 6 are where many implementations introduce a second bug: executing the business operation and then committing the idempotency record as a separate transaction. A crash between those two commits leaves you in a state where the business operation happened but the event is still unrecorded. Next retry double-processes. Run both writes in a single transaction — the claim INSERT and the business state mutation commit together or not at all.
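The single-transaction rule can be sketched with an in-memory model. Here `inTransaction` is a hypothetical stand-in for a database transaction: work runs against a copy of the state, and the copy replaces the original only if the work finishes without throwing. The `failMidway` flag simulates a crash between the claim and the business write.

```typescript
interface LedgerState {
  claimedEventIds: Set<string>;
  provisionCount: number;
}

// Hypothetical transaction model: commit on normal return, discard on throw.
function inTransaction(state: LedgerState, work: (tx: LedgerState) => void): LedgerState {
  const tx: LedgerState = {
    claimedEventIds: new Set(state.claimedEventIds),
    provisionCount: state.provisionCount,
  };
  work(tx); // a throw here abandons tx, so nothing is committed
  return tx; // commit
}

function processEvent(state: LedgerState, eventId: string, failMidway = false): LedgerState {
  try {
    return inTransaction(state, (tx) => {
      if (tx.claimedEventIds.has(eventId)) return; // already processed: nothing to do
      tx.claimedEventIds.add(eventId);             // the claim
      if (failMidway) throw new Error("simulated crash after claim");
      tx.provisionCount += 1;                      // the business mutation
    });
  } catch {
    return state; // rollback: the claim and the mutation are discarded together
  }
}
```

Because the failed attempt leaves no claim behind, the retry is free to process the event, and because a successful attempt commits the claim with the mutation, a later duplicate becomes a no-op.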

For TTL: your dedup storage needs to outlive the retry window. Stripe's live-mode window is three days. A safe dedup TTL is seven days: the window plus margin. For Postgres-based setups, including the Supabase + Stripe pattern, a processed_webhooks table with an index on received_at and a background cleanup job that deletes rows older than seven days keeps things simple without adding a Redis dependency.
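The retention arithmetic is small enough to write down. This sketch models the cleanup as a filter over in-memory rows; in Postgres the equivalent would be a scheduled `DELETE` with a cutoff on the timestamp column. The four-day margin and the `ProcessedRow` shape are assumptions chosen to match the seven-day figure above.

```typescript
const DAY_MS = 24 * 60 * 60 * 1000;
const RETRY_WINDOW_DAYS = 3;  // live-mode retry window described above
const SAFETY_MARGIN_DAYS = 4; // assumption: margin that brings the TTL to seven days
const DEDUP_TTL_DAYS = RETRY_WINDOW_DAYS + SAFETY_MARGIN_DAYS;

interface ProcessedRow {
  eventId: string;
  receivedAt: Date;
}

// Returns the rows that should be kept; anything older than the TTL is dropped.
function pruneExpired(rows: ProcessedRow[], now: Date, ttlDays: number = DEDUP_TTL_DAYS): ProcessedRow[] {
  const cutoffMs = now.getTime() - ttlDays * DAY_MS;
  return rows.filter((row) => row.receivedAt.getTime() >= cutoffMs);
}
```

The property that matters is that the TTL is strictly longer than the longest possible gap between the first delivery and the last retry. Otherwise a late retry arrives after its dedup record has been deleted.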

What Does a Dead-Letter Queue Actually Need?

A functional DLQ needs four things: the original payload, the error with stack trace, a retry count, and a replay path your on-call team can reach at 3am.

A DLQ is what happens after the worker exhausts its retry budget and the event still hasn't processed. The options are: silently drop the event, hang the worker forever, or route it somewhere inspectable and replayable. Silently dropping a Stripe webhook event for invoice.payment_succeeded is not a decision you want made accidentally.

| Property | Main Queue | Dead-Letter Queue |
|---|---|---|
| Retention | 4 days | 14 days |
| Auto-retry | Yes (exponential backoff) | No (manual replay only) |
| Alert trigger | On repeated failure | On first entry |
| Stored context | Event payload | Payload + error + retry count + timestamps |

The 14-day DLQ retention recommendation gives you enough time to diagnose, write a fix, deploy, and replay without an ops scramble. The main queue only needs four days because events persisting beyond that without resolution are already permanent failures by another name.
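As a rough sketch of the stored context, the following shows a worker that gives a handler a fixed retry budget and, once the budget is spent, writes an entry with the payload, the last error and stack, the retry count, and the attempt timestamps. `DeadLetterEntry` and `processWithBudget` are hypothetical names, the dead-letter queue is an in-memory array, and the backoff between attempts is left out to keep the example short.

```typescript
interface DeadLetterEntry {
  payload: string;               // original event body, untouched
  errorMessage: string;          // final error
  stack: string | undefined;     // stack trace of the final error, when available
  retryCount: number;            // attempts made before giving up
  attemptTimestamps: number[];   // epoch milliseconds of each attempt
}

// Returns true if the handler succeeded within the budget, false if the
// event was routed to the dead-letter queue.
function processWithBudget(
  payload: string,
  handler: (payload: string) => void,
  maxAttempts: number,
  deadLetters: DeadLetterEntry[],
  now: () => number = Date.now,
): boolean {
  const attemptTimestamps: number[] = [];
  let lastError: Error = new Error("handler was never attempted");

  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    attemptTimestamps.push(now());
    try {
      handler(payload);
      return true;
    } catch (err) {
      lastError = err instanceof Error ? err : new Error(String(err));
    }
  }

  // Budget exhausted: keep everything on-call needs to diagnose and replay.
  deadLetters.push({
    payload,
    errorMessage: lastError.message,
    stack: lastError.stack,
    retryCount: attemptTimestamps.length,
    attemptTimestamps,
  });
  return false;
}
```

Storing the raw payload rather than a parsed or partially processed form means a replay feeds the worker exactly what the original delivery did.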

Replay needs rate limiting. Replaying 500 backed-up events simultaneously after an outage creates a thundering herd against your own database. Replay in batches with delays between them, and monitor DLQ depth dropping rather than queue depth spiking. Alert on DLQ depth — a single entry for invoice.payment_succeeded warrants a page. Ten events is an incident.
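One way to keep a replay from stampeding the database is to plan it as spaced batches before sending anything. The planner below is a minimal sketch; `planReplay`, the batch size, and the delay are illustrative choices rather than recommended values.

```typescript
interface ReplayBatch<T> {
  startAfterMs: number; // offset from the start of the replay
  events: T[];
}

// Splits a backlog into fixed-size batches, each scheduled a fixed delay
// after the previous one.
function planReplay<T>(events: T[], batchSize: number, delayBetweenBatchesMs: number): ReplayBatch<T>[] {
  if (!Number.isInteger(batchSize) || batchSize < 1) {
    throw new RangeError("batchSize must be a positive integer");
  }
  const plan: ReplayBatch<T>[] = [];
  for (let i = 0; i < events.length; i += batchSize) {
    plan.push({
      startAfterMs: (i / batchSize) * delayBetweenBatchesMs,
      events: events.slice(i, i + batchSize),
    });
  }
  return plan;
}
```

With 500 backed-up events, a batch size of 50, and a two-second gap, the plan is ten batches spread over eighteen seconds instead of one burst of 500.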

Observability Per Provider: The Part Usually Left as an Exercise

Most webhook guides end at the implementation. The operating half is knowing whether it's working.

Per-provider tracking means indexing your processed_webhooks table by event_id and event_type, then monitoring: failure rate by event type over the last 24 hours, DLQ inflow rate per hour, oldest unprocessed DLQ event age — the actual SLA clock — and percentage of events processed within 30 seconds of receipt. Configure separate alert thresholds for financial event types versus data event types. A failed invoice.payment_succeeded is not the same severity as a failed customer.updated.
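A minimal sketch of those measurements, computed over in-memory records rather than SQL. The record shape, the rule for what counts as a financial event, and both thresholds are assumptions for illustration; real thresholds depend on your traffic.

```typescript
interface ProcessingRecord {
  eventType: string;
  succeeded: boolean;
}

// Failure rate per event type, as a fraction between 0 and 1.
function failureRateByType(records: ProcessingRecord[]): Map<string, number> {
  const totals = new Map<string, { failed: number; total: number }>();
  for (const record of records) {
    const tally = totals.get(record.eventType) ?? { failed: 0, total: 0 };
    tally.total += 1;
    if (!record.succeeded) tally.failed += 1;
    totals.set(record.eventType, tally);
  }
  const rates = new Map<string, number>();
  totals.forEach((tally, eventType) => rates.set(eventType, tally.failed / tally.total));
  return rates;
}

// Age of the oldest dead-lettered event: the SLA clock. Zero when the DLQ is empty.
function oldestDlqAgeMs(enqueuedAtMs: number[], nowMs: number): number {
  return enqueuedAtMs.length === 0 ? 0 : nowMs - Math.min(...enqueuedAtMs);
}

// Assumption: invoice.* and charge.* events are treated as financial.
function isFinancialEvent(eventType: string): boolean {
  return eventType.startsWith("invoice.") || eventType.startsWith("charge.");
}

// Hypothetical thresholds: any failure pages for financial events,
// data events page only above a 5% failure rate.
function shouldPage(eventType: string, failureRate: number): boolean {
  const threshold = isFinancialEvent(eventType) ? 0 : 0.05;
  return failureRate > threshold;
}
```

Splitting the threshold by event class is what lets a single failed invoice.payment_succeeded page someone while a handful of failed customer.updated events stays on a dashboard.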

BookBed handles twenty distinct Stripe webhook events end-to-end: the full subscription lifecycle, customer.subscription.trial_will_end, invoice.payment_action_required with 3D Secure escalation, and the checkout-to-cancellation flow. With that breadth of events, knowing which type is failing, not just that something failed, is the difference between a two-minute diagnosis and a two-hour log spelunking session. On one production build, at 11pm on a Wednesday, webhooks started returning 500 because a schema migration had added a NOT NULL column without a default. Forty-three events landed in the DLQ in twenty minutes. The DLQ caught the data. The fix was a backfill, and no events were lost.

That's the argument for observability. Catching the 43-event spike before it becomes 4,300.

What the Docs Don't Say

A few things that took production incidents to learn. Not theory.

Stripe doesn't guarantee delivery order. Your handler for invoice.payment_succeeded cannot assume customer.subscription.created has already run. Fetch state fresh from your database. State derived from an expected event sequence is a slow-burn reliability bug: it works in staging and breaks under the ordering variance of real load.

Idempotency without version ordering produces its own class of problem. An older event arrives late and overwrites a newer state. For subscription state, the fix is to store the Stripe event's created timestamp alongside each state change and reject any event whose created predates the currently stored timestamp. One comparison. Much safer.
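The comparison looks like this in a minimal form. `SubscriptionState` and `applyIfNewer` are hypothetical names, and the state is a plain object rather than a database row.

```typescript
interface SubscriptionState {
  status: string;
  lastEventCreated: number; // the event's created timestamp, Unix seconds
}

interface IncomingEvent {
  created: number; // Unix seconds
  status: string;
}

// Applies the event only if it is not older than what is already stored.
function applyIfNewer(current: SubscriptionState | undefined, incoming: IncomingEvent): SubscriptionState {
  if (current !== undefined && incoming.created < current.lastEventCreated) {
    return current; // stale event: keep the newer state
  }
  // Events with an equal timestamp are applied in arrival order here.
  return { status: incoming.status, lastEventCreated: incoming.created };
}
```

Since `created` has one-second resolution, two events for the same subscription can share a timestamp. If that tie matters for your state machine, it needs a secondary rule, which this sketch does not attempt.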

The Stripe integration guide covers the overall billing pattern but doesn't flag the PgBouncer interaction: if you're using PgBouncer in transaction pooling mode, session-level advisory locks (pg_advisory_lock) won't work as an idempotency mechanism, because consecutive transactions from the same client can run on different server connections. Use INSERT ON CONFLICT instead. Transaction-mode pooling breaks session assumptions consistently, and this failure mode is silent: no error, just incorrect behavior.

The simplest correct idempotent webhook handler is about 40 lines of TypeScript. The patterns exist. The docs describe them. The gap between "documented" and "actually implemented" is where most SaaS billing bugs live.

What's your handler missing?

Dusko Licanin

Full-Stack Developer · Banja Luka, Bosnia

Full-stack developer shipping SaaS MVPs, web apps, and mobile apps 2× faster than agencies using AI-augmented workflows. Live portfolio: BookBed, Callidus, Pizzeria Bestek.

Frequently Asked Questions

What makes a webhook handler reliable in production?

A reliable webhook handler has three layers: signature verification to reject forged events, idempotency keyed to the event ID to prevent duplicate processing, and a dead-letter queue to capture events that exhaust the retry budget. Missing any one means relying on luck over design. Signature verification blocks fabricated payloads. Idempotency protects against [Stripe's three-day retry window](https://docs.stripe.com/webhooks) — Stripe doesn't guarantee exactly-once delivery, so deduplication is the consumer's job. The DLQ ensures no event is permanently lost when your handler encounters a bug or database issue under load.

What does a dead-letter queue do for webhook reliability?

A dead-letter queue captures webhook events that fail after exhausting all retry attempts, preserving the payload and error context so no event is permanently lost. Instead of silently dropping failed events, the DLQ stores the original payload, the final error and stack trace, the retry count, and timestamps for each attempt. Once the root cause is fixed, events can be replayed in controlled batches. [Hookdeck recommends 14-day retention for DLQs](https://hookdeck.com/webhooks/guides/dead-letter-queues-webhook-reliability) versus 4 days for the main queue, giving enough runway to diagnose, deploy a fix, and replay before events expire. Alert on DLQ depth — a single entry for a financial event type warrants immediate attention.

What are the key patterns for production Stripe webhooks?

Production Stripe webhook handling requires four things done in order: verify the Stripe-Signature header against the raw request body before any parsing, return 200 immediately and enqueue to a durable background queue, claim idempotency using the event ID with atomic INSERT ON CONFLICT in the worker, and maintain a dead-letter queue for events that exhaust retries. [Stripe retries for up to three days](https://docs.stripe.com/webhooks) in live mode with exponential backoff, so dedup TTL should be at least seven days. Stripe doesn't guarantee event order — fetch current state from your database rather than assuming prior events have already run.

What should a webhook retry strategy include?

A webhook retry strategy needs exponential backoff with jitter, a defined maximum attempt count, and a dead-letter queue for events that exhaust the budget. Exponential backoff prevents overwhelming recovering services. Jitter prevents a synchronized thundering herd when multiple services restart simultaneously on the same backoff schedule. The maximum attempt count prevents runaway retry loops consuming compute on permanently broken events. After the budget is exhausted, the [DLQ](https://hookdeck.com/webhooks/guides/dead-letter-queues-webhook-reliability) preserves the event for manual inspection and controlled replay — the difference between a recoverable incident and a data loss event.
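A minimal sketch of the delay calculation, using the variant often called full jitter, where each delay is drawn uniformly between zero and the capped exponential value. The function names, base, cap, and attempt limit are illustrative assumptions, and the random source is injectable so the behavior can be tested.

```typescript
// Delay before retry number `attempt` (1-based), in milliseconds.
function backoffDelayMs(
  attempt: number,
  baseMs: number,
  capMs: number,
  random: () => number = Math.random,
): number {
  const exponential = Math.min(capMs, baseMs * Math.pow(2, attempt - 1));
  // Full jitter: uniform in [0, exponential), so workers on the same
  // schedule do not retry in lockstep.
  return Math.floor(random() * exponential);
}

// The attempt limit keeps a permanently broken event from retrying forever.
function shouldRetry(attemptsMade: number, maxAttempts: number): boolean {
  return attemptsMade < maxAttempts;
}
```

For example, with a one-second base and a sixty-second cap, the upper bound for the delay grows 1s, 2s, 4s, 8s and stops growing at 60s. Once `shouldRetry` returns false, the event goes to the DLQ.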

How do I prevent duplicate webhook event processing with Postgres?

Use `INSERT INTO processed_webhooks (event_id) VALUES ($1) ON CONFLICT (event_id) DO NOTHING RETURNING id` to claim each event atomically before any business logic runs. If the RETURNING clause returns no rows, the event was already claimed by a prior delivery or a concurrent retry — skip processing. Execute business logic inside the same database transaction as the claim INSERT, so both commit together or not at all. A crash between a separate business commit and idempotency commit is the second failure mode — they must be atomic. Set a cleanup schedule to delete rows older than seven days, matching Stripe's retry window plus a safety margin.