Webhook Reliability: Idempotency Keys and Dead-Letter Queues in 2026

Key takeaways

Stripe retries failed webhook deliveries for up to three days in live mode with exponential backoff and does not guarantee event delivery order, so handlers must survive duplicate, out-of-order events.
The naive check-then-act idempotency pattern races under concurrent retries; an atomic INSERT ... ON CONFLICT (event_id) DO NOTHING RETURNING id claims the event before touching business state, and a 2026 DZone analysis found eliminating the non-atomic pattern removed 99.98% of duplicate transactions.
Signature verification must run on the raw request body before any JSON parsing, since Express middleware that parses JSON breaks the HMAC-SHA256 check on the Stripe-Signature header.
The correct sequence returns 200 immediately, then in a worker claims the event and runs the business operation plus the idempotency record in a single database transaction so a crash never leaves business state recorded but the event unmarked.
A functional dead-letter queue needs four things — original payload, error with stack trace, retry count, and a rate-limited replay path — with ~14-day retention and alerting on the first entry for financial events like invoice.payment_succeeded.

Every webhook handler you've ever written is one bad day away from double-billing a customer or provisioning an account that was never paid for.

Stripe will retry for three days. Your queue won't drop the message. The gap is in your own handler: the milliseconds between acknowledging receipt and recording the outcome to your database. That gap is where production billing incidents live.

This post covers the three layers every production SaaS webhook handler needs: signature verification, idempotency keyed to the event ID, and a dead-letter queue with a replay path. Stripe documents all three. Most implementations skip at least one.

What Breaks Without Idempotency?

Skipping idempotency means the first retry produces a second charge, a second subscription, or a second user account — and there will be a retry.

Stripe retries failed webhook deliveries for up to three days in live mode, with exponential backoff. Sandbox mode is gentler — three attempts over a few hours — which is why the bug hides during development and detonates in production. Stripe also doesn't guarantee event delivery order: creating a subscription fires customer.subscription.created, then invoice.created, then invoice.paid, then charge.created, in no guaranteed sequence. Your handler needs to survive receiving any of these out of order, multiple times.

You've felt this. Your handler takes 6 seconds because a downstream API is slow. Stripe's timeout window closes at 5. Stripe marks the delivery failed and retries at +5 minutes. By then the handler works fine. It processes the retry. Now you have a double-provisioned account. The customer contacts support. Your handler never logged the timeout because from its side it returned successfully — to nothing.

The naive check-then-act pattern is the first trap. Your handler SELECTs for an existing event_id, finds nothing, provisions the subscription, then INSERTs the event as processed. That sequence fails under concurrent delivery: two retries arriving within milliseconds of each other both SELECT before either has INSERTed. Both proceed, and two subscriptions are created.

A 2026 DZone analysis cataloging four "Phantom Write" failure modes across payment platforms found that eliminating the non-atomic check pattern removed 99.98% of duplicate transactions. The fix is atomic: INSERT INTO processed_webhooks (event_id, received_at) VALUES ($1, now()) ON CONFLICT (event_id) DO NOTHING RETURNING id. If the RETURNING clause yields no rows, the event was already claimed. Stop.
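A minimal sketch of that atomic claim in TypeScript. The Queryable interface is a hypothetical stand-in shaped like the query(text, values) method of a node-postgres pool, and claimEvent is an illustrative name rather than anything from Stripe's SDK. It assumes processed_webhooks has a unique constraint on event_id, which ON CONFLICT (event_id) requires.

```typescript
// Hypothetical stand-in for a Postgres client. node-postgres resolves
// query() to an object with a `rows` array, which is all this sketch uses.
interface Queryable {
  query(sql: string, params: unknown[]): Promise<{ rows: unknown[] }>;
}

// The statement from the text, unchanged. The unique constraint on event_id
// is what makes concurrent inserts for the same event collapse into one.
const CLAIM_SQL = `
  INSERT INTO processed_webhooks (event_id, received_at)
  VALUES ($1, now())
  ON CONFLICT (event_id) DO NOTHING
  RETURNING id`;

// Resolves to true if this caller claimed the event, and to false if the
// event was already claimed, in which case the caller should stop.
async function claimEvent(db: Queryable, eventId: string): Promise<boolean> {
  const result = await db.query(CLAIM_SQL, [eventId]);
  return result.rows.length === 1;
}
```

Unlike the SELECT-then-INSERT version, there is no window between the check and the write: the database decides the winner in one statement.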

Does Signature Verification Actually Matter?

Skipping it means anyone who knows your webhook URL can trigger your billing logic with fabricated payloads.

The Stripe-Signature header contains a Unix timestamp (t=) and an HMAC-SHA256 value (v1=). The signature is computed over {timestamp}.{raw_body} using your endpoint secret as the key. Stripe's default tolerance is five minutes — events older than that are rejected to prevent replay attacks. Use your SDK's constructEvent() helper. Don't verify manually unless you have a specific reason.

One subtlety that costs engineers a day: verification must run against the raw request body, before any JSON parsing. Express middleware that parses JSON changes the body representation and breaks the HMAC check. The pattern is express.raw({ type: 'application/json' }) scoped to the webhook route only, not the global body parser. Test this in local dev with stripe listen --forward-to localhost:3000/api/webhooks — a parsing middleware mismatch shows up immediately. Stripe also publishes its IP ranges; allowlist them at the network edge before application code runs.
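To see why the raw body matters, here is a rough sketch of the check the header format above implies, written with Node's built-in crypto module. It is for illustration only: in production the SDK's constructEvent() helper remains the better choice. The function name, parameter names, and the 300-second constant (the five-minute tolerance mentioned above) are assumptions of this sketch.

```typescript
import { createHmac, timingSafeEqual } from "node:crypto";

const TOLERANCE_SECONDS = 300; // five-minute tolerance described above

// Illustrative only: recomputes HMAC-SHA256 over "{timestamp}.{raw_body}"
// and compares it with every v1= value in the Stripe-Signature header.
function verifyStripeSignature(
  rawBody: string | Buffer,
  header: string,
  secret: string,
  nowSeconds: number = Math.floor(Date.now() / 1000),
): boolean {
  let timestamp = "";
  const candidates: string[] = [];
  for (const part of header.split(",")) {
    const idx = part.indexOf("=");
    if (idx === -1) continue;
    const key = part.slice(0, idx).trim();
    const value = part.slice(idx + 1).trim();
    if (key === "t") timestamp = value;
    if (key === "v1") candidates.push(value);
  }

  const t = Number(timestamp);
  if (timestamp === "" || !Number.isFinite(t)) return false;
  // Reject events outside the tolerance window to limit replay attacks.
  if (Math.abs(nowSeconds - t) > TOLERANCE_SECONDS) return false;

  const expected = createHmac("sha256", secret)
    .update(`${timestamp}.`)
    .update(rawBody) // the exact bytes received, never a re-serialized object
    .digest();

  return candidates.some((hex) => {
    const given = Buffer.from(hex, "hex");
    // timingSafeEqual throws on length mismatch, so check length first.
    return given.length === expected.length && timingSafeEqual(given, expected);
  });
}
```

If the body has been through JSON.parse and JSON.stringify, whitespace and key formatting can differ from what was signed, so the recomputed HMAC no longer matches. With express.raw() on the webhook route, the request body arrives as a Buffer and can be passed through unchanged.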

How Do You Build Webhook Idempotency That Doesn't Race?

The atomic claim is the core: insert the event ID before touching business state, using a database constraint that makes the insert a no-op if the ID already exists.

Ordering matters more than almost any other implementation detail. If you send a confirmation email first and then mark the event as processed, a crash between those two steps means the retry sends the email again. Hookdeck's idempotency guide puts it plainly: mark events as processed before executing side effects. The correct sequence:

1. Verify signature. Return 400 immediately on failure.
2. Enqueue the raw event to a durable queue. Return 200 immediately.
3. In the worker: claim with INSERT ... ON CONFLICT DO NOTHING.
4. If no rows returned: already processed. Ack the message. Stop.
5. In the same database transaction: execute the business operation.
6. Commit.

Step 2, returning 200 before processing, decouples delivery acknowledgment from processing time. Stripe interprets a 2xx response within its timeout window as "delivered." A handler that processes inline can overrun that window whenever a downstream call is slow; Stripe then retries, and you process twice.
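The receiving half (verify, enqueue, acknowledge) can be sketched without committing to a web framework. Everything named here is hypothetical: WebhookDeps, receiveWebhook, and the injected verify and enqueue functions stand in for your signature check and your durable queue client. Returning 500 when the enqueue fails is an assumption of this sketch; the point is that a non-2xx response lets the sender retry instead of losing the event.

```typescript
// Hypothetical dependencies: plug in the real signature check and queue client.
interface WebhookDeps {
  verify(rawBody: string, signatureHeader: string): boolean;
  enqueue(rawBody: string): Promise<void>;
}

// Returns the HTTP status code the route should send back.
async function receiveWebhook(
  rawBody: string,
  signatureHeader: string | undefined,
  deps: WebhookDeps,
): Promise<number> {
  // Reject unsigned or badly signed requests before doing any work.
  if (!signatureHeader || !deps.verify(rawBody, signatureHeader)) {
    return 400;
  }
  try {
    // Persist the raw event durably; no business logic runs on this path.
    await deps.enqueue(rawBody);
  } catch {
    // The event was not stored, so do not acknowledge it.
    return 500;
  }
  // Acknowledge as soon as the event is safely queued.
  return 200;
}
```

The only slow operation on the request path is the enqueue, so response time no longer depends on downstream APIs.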

Steps 5 and 6 are where many implementations introduce a second bug: executing the business operation and then committing the idempotency record as a separate transaction. A crash between those two commits leaves you in a state where the business operation happened but the event is still unrecorded. Next retry double-processes. Run both writes in a single transaction — the claim INSERT and the business state mutation commit together or not at all.
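One way to sketch the worker side so the claim and the business write share a transaction. TxClient is a hypothetical interface shaped like a single checked-out Postgres connection (for example a node-postgres client), and processOnce and applyBusinessChange are illustrative names. The sketch assumes applyBusinessChange issues all of its writes through the same client.

```typescript
// Hypothetical: one dedicated connection, so BEGIN/COMMIT apply to every query.
interface TxClient {
  query(sql: string, params?: unknown[]): Promise<{ rows: unknown[] }>;
}

type StripeEventLike = { id: string; type: string };

async function processOnce(
  client: TxClient,
  event: StripeEventLike,
  applyBusinessChange: (client: TxClient, event: StripeEventLike) => Promise<void>,
): Promise<"processed" | "duplicate"> {
  await client.query("BEGIN");
  try {
    // Claim first: the insert is a no-op if the event ID already exists.
    const claim = await client.query(
      `INSERT INTO processed_webhooks (event_id, received_at)
       VALUES ($1, now())
       ON CONFLICT (event_id) DO NOTHING
       RETURNING id`,
      [event.id],
    );
    if (claim.rows.length === 0) {
      await client.query("ROLLBACK");
      return "duplicate"; // already processed: ack the message and stop
    }
    // Business write happens inside the same transaction as the claim.
    await applyBusinessChange(client, event);
    await client.query("COMMIT");
    return "processed";
  } catch (err) {
    // Undo the claim too, so a later retry can process the event.
    await client.query("ROLLBACK");
    throw err;
  }
}
```

Because the claim row and the business mutation commit together, a crash before COMMIT leaves neither behind. Side effects that cannot be rolled back, such as outbound email, sit outside this guarantee and need their own handling.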

For TTL: your dedup storage needs to outlive the retry window. Stripe's live-mode window is three days. A safe dedup TTL is seven days: the window plus margin. For Postgres-based setups, including the Supabase + Stripe pattern, a processed_webhooks table with an index on received_at and a background cleanup job that deletes rows older than seven days keeps things simple without adding a Redis dependency.
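The cleanup job can be a single DELETE run on a schedule. A minimal sketch, assuming the timestamp column is the received_at used in the INSERT statement earlier; purgeOldClaims and the Queryable interface are hypothetical names, and the retention period is left as a parameter rather than hard-coded.

```typescript
// Hypothetical stand-in for a Postgres client; node-postgres reports the
// number of affected rows as `rowCount`.
interface Queryable {
  query(sql: string, params: unknown[]): Promise<{ rowCount: number | null }>;
}

// make_interval is a built-in Postgres function; $1 is the retention in days.
const PURGE_SQL = `
  DELETE FROM processed_webhooks
  WHERE received_at < now() - make_interval(days => $1::int)`;

// Deletes claim rows older than the retention period and returns how many went.
async function purgeOldClaims(db: Queryable, retentionDays: number): Promise<number> {
  // Dedup rows must outlive the three-day live-mode retry window.
  if (!Number.isInteger(retentionDays) || retentionDays <= 3) {
    throw new RangeError("retentionDays must be an integer greater than 3");
  }
  const result = await db.query(PURGE_SQL, [retentionDays]);
  return result.rowCount ?? 0;
}
```

The guard exists because purging too aggressively quietly turns late retries back into first-time events.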

What Does a Dead-Letter Queue Actually Need?

A functional DLQ needs four things: the original payload, the error with stack trace, a retry count, and a replay path your on-call team can reach at 3am.

A DLQ is what happens after the worker exhausts its retry budget and the event still hasn't processed. The options are: silently drop the event, hang the worker forever, or route it somewhere inspectable and replayable. Silently dropping a Stripe webhook event for invoice.payment_succeeded is not a decision you want made accidentally.
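A sketch of the decision a worker makes when processing fails: retry with backoff while budget remains, otherwise build a dead-letter record carrying the context listed above. The names DeadLetterEntry, onFailure, and BASE_DELAY_MS, along with the one-second base delay, are assumptions chosen for illustration.

```typescript
interface DeadLetterEntry {
  payload: string;           // original raw payload, unmodified
  errorMessage: string;
  errorStack: string | null; // stack trace when the thrown value had one
  retryCount: number;        // retries attempted before giving up
  deadLetteredAt: string;    // ISO 8601 timestamp
}

type FailureDecision =
  | { action: "retry"; delayMs: number }
  | { action: "dead-letter"; entry: DeadLetterEntry };

const BASE_DELAY_MS = 1000; // assumed base for exponential backoff

function onFailure(
  payload: string,
  error: unknown,
  retriesSoFar: number,
  maxRetries: number,
  now: Date = new Date(),
): FailureDecision {
  if (retriesSoFar < maxRetries) {
    // Exponential backoff: 1s, 2s, 4s, ... for retries 0, 1, 2, ...
    return { action: "retry", delayMs: BASE_DELAY_MS * Math.pow(2, retriesSoFar) };
  }
  // Budget exhausted: keep everything needed to diagnose and replay later.
  const err = error instanceof Error ? error : new Error(String(error));
  return {
    action: "dead-letter",
    entry: {
      payload,
      errorMessage: err.message,
      errorStack: err.stack ?? null,
      retryCount: retriesSoFar,
      deadLetteredAt: now.toISOString(),
    },
  };
}
```

Neither branch drops the event or blocks the worker, which covers the two accidental outcomes described above.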

Property	Main Queue	Dead-Letter Queue
Retention	4 days	14 days
Auto-retry	Yes (exponential backoff)	No — manual replay only
Alert trigger	On repeated failure	On first entry
Stored context	Event payload	Payload + error + retry count + timestamps

The 14-day DLQ retention recommendation gives you enough time to diagnose, write a fix, deploy, and replay without an ops scramble. The main queue only needs four days because events persisting beyond that without resolution are already permanent failures by another name.

Replay needs rate limiting. Replaying 500 backed-up events simultaneously after an outage creates a thundering herd against your own database. Replay in batches with delays between them, and monitor DLQ depth dropping rather than queue depth spiking. Alert on DLQ depth — a single entry for invoice.payment_succeeded warrants a page. Ten events is an incident.
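A minimal sketch of batched replay with a pause between batches. The function name, the option names, and the injectable sleep are all hypothetical; replayOne stands in for whatever re-submits one DLQ entry to your worker.

```typescript
interface ReplayOptions {
  batchSize: number;
  delayMs: number;                       // pause between batches
  sleep?: (ms: number) => Promise<void>; // injectable so tests need not wait
}

const defaultSleep = (ms: number): Promise<void> =>
  new Promise<void>((resolve) => {
    setTimeout(resolve, ms);
  });

async function replayInBatches<T>(
  entries: readonly T[],
  replayOne: (entry: T) => Promise<void>,
  options: ReplayOptions,
): Promise<{ replayed: number; failed: T[] }> {
  if (!Number.isInteger(options.batchSize) || options.batchSize < 1) {
    throw new RangeError("batchSize must be a positive integer");
  }
  const sleep = options.sleep ?? defaultSleep;
  const failed: T[] = [];
  let replayed = 0;

  for (let start = 0; start < entries.length; start += options.batchSize) {
    // Pause between batches, not before the first or after the last.
    if (start > 0) await sleep(options.delayMs);
    const batch = entries.slice(start, start + options.batchSize);
    // At most batchSize replays are in flight at any moment.
    const outcomes = await Promise.all(
      batch.map(async (entry) => {
        try {
          await replayOne(entry);
          return true;
        } catch {
          return false;
        }
      }),
    );
    outcomes.forEach((ok, i) => {
      if (ok) replayed += 1;
      else failed.push(batch[i]);
    });
  }
  return { replayed, failed };
}
```

Entries that fail again come back in failed rather than aborting the run, so they can stay in the DLQ for another look.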

Observability Per Provider: The Part Usually Left as an Exercise

Most webhook guides end at the implementation. The operating half is knowing whether it's working.

Per-provider tracking means indexing your processed_webhooks table by event_id and event_type, then monitoring: failure rate by event type over the last 24 hours, DLQ inflow rate per hour, oldest unprocessed DLQ event age — the actual SLA clock — and percentage of events processed within 30 seconds of receipt. Configure separate alert thresholds for financial event types versus data event types. A failed invoice.payment_succeeded is not the same severity as a failed customer.updated.
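As a rough illustration, three of those numbers can be computed from in-memory records as below. In practice they would more likely be SQL queries or metrics-pipeline aggregations; the WebhookRecord shape and the function names are invented for this sketch and are not columns the post defines.

```typescript
// Hypothetical record shape; timestamps are epoch milliseconds.
interface WebhookRecord {
  eventType: string;
  receivedAt: number;
  processedAt: number | null; // null while unprocessed
  failed: boolean;
  inDlq: boolean;
}

// Failure rate per event type among records received within the window.
function failureRateByType(
  records: readonly WebhookRecord[],
  now: number,
  windowMs: number,
): Record<string, number> {
  const totals: Record<string, { total: number; failed: number }> = {};
  for (const r of records) {
    if (now - r.receivedAt > windowMs) continue;
    const t = totals[r.eventType] ?? { total: 0, failed: 0 };
    t.total += 1;
    if (r.failed) t.failed += 1;
    totals[r.eventType] = t;
  }
  const rates: Record<string, number> = {};
  for (const type of Object.keys(totals)) {
    rates[type] = totals[type].failed / totals[type].total;
  }
  return rates;
}

// Age of the oldest event still in the DLQ, or null if the DLQ is empty.
function oldestDlqAgeMs(records: readonly WebhookRecord[], now: number): number | null {
  let oldest: number | null = null;
  for (const r of records) {
    if (!r.inDlq) continue;
    if (oldest === null || r.receivedAt < oldest) oldest = r.receivedAt;
  }
  return oldest === null ? null : now - oldest;
}

// Fraction of all records processed within thresholdMs of receipt.
function shareProcessedWithin(records: readonly WebhookRecord[], thresholdMs: number): number | null {
  if (records.length === 0) return null;
  const fast = records.filter(
    (r) => r.processedAt !== null && r.processedAt - r.receivedAt <= thresholdMs,
  ).length;
  return fast / records.length;
}
```

Keeping the per-type breakdown makes it straightforward to attach a stricter threshold to invoice.payment_succeeded than to customer.updated.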

BookBed handles twenty distinct Stripe webhook events end-to-end: the full subscription lifecycle, customer.subscription.trial_will_end, invoice.payment_action_required with 3D Secure escalation, and the checkout-to-cancellation flow. With that breadth of events, knowing which type is failing, not just that something failed, is the difference between a two-minute diagnosis and a two-hour log spelunking session. At 11pm on a Wednesday, webhooks on one production build started returning 500 because a schema migration had added a NOT NULL column without a default. Forty-three events landed in the DLQ in twenty minutes. The DLQ caught the data. The fix was a backfill, and no events were lost.

That's the argument for observability. Catching the 43-event spike before it becomes 4,300.

What the Docs Don't Say

A few things that took production incidents to learn. Not theory.

Stripe doesn't guarantee delivery order. Your handler for invoice.payment_succeeded cannot assume customer.subscription.created has already run. Fetch state fresh from your database. Derived state from expected event sequence is a slow-burn reliability bug — works in staging, breaks under real load ordering variance.

Idempotency without version ordering produces its own class of problem. An older event arrives late and overwrites a newer state. For subscription state, the fix is to store the Stripe event's created timestamp alongside each state change and reject any event whose created predates the currently stored timestamp. One comparison. Much safer.
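That comparison can be sketched as a small pure function. The SubscriptionState and SubscriptionEvent shapes and the name applyIfNotStale are hypothetical; created is the Stripe event's Unix timestamp in seconds. Following the rule as stated, an event is rejected only when its created is strictly earlier than the stored value, so events sharing the same second are still applied.

```typescript
interface SubscriptionState {
  status: string;
  lastEventCreated: number; // `created` of the event that produced this state
}

interface SubscriptionEvent {
  created: number; // Unix seconds
  status: string;
}

// Returns the state to store and whether the incoming event was applied.
function applyIfNotStale(
  current: SubscriptionState | null,
  incoming: SubscriptionEvent,
): { state: SubscriptionState | null; applied: boolean } {
  // An older event arriving late must not overwrite newer state.
  if (current !== null && incoming.created < current.lastEventCreated) {
    return { state: current, applied: false };
  }
  return {
    state: { status: incoming.status, lastEventCreated: incoming.created },
    applied: true,
  };
}
```

In SQL the same guard can sit in the WHERE clause of the UPDATE, so the comparison and the write happen in one statement.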

The Stripe integration guide covers the overall billing pattern but doesn't flag the PgBouncer interaction: if you're using PgBouncer in transaction pooling mode, session-level advisory locks (pg_advisory_lock) won't work as an idempotency mechanism, because the server connection holding the lock can be handed to a different client as soon as the transaction ends. Use INSERT ON CONFLICT instead. Transaction-mode pooling breaks session assumptions consistently, and this failure mode is silent: no error, just incorrect behavior.

The simplest correct idempotent webhook handler is about 40 lines of TypeScript. The patterns exist. The docs describe them. The gap between "documented" and "actually implemented" is where most SaaS billing bugs live.

What's your handler missing?

Part of the SaaS Backend Infrastructure: Jobs, Webhooks, Email, and Admin Tooling guide.
