Observability for SaaS: Logging, Monitoring, and Alerting

Observability for SaaS means being able to answer "what is broken, for whom, and why" from telemetry you already collect — without redeploying to add a print statement. It rests on three signals: logs (what happened), metrics (how much and how often), and traces (the path a single request took). Get those three right and you stop guessing during incidents and start reading the answer off a dashboard.

This is a supporting guide under the SaaS Backend Infrastructure pillar. If you have shipped an MVP and now have real users, the gap between "it works on my machine" and "I know it works in production" is exactly observability.

Key takeaways

Three signals, one purpose: logs tell you what happened, metrics tell you how often, traces tell you where in the request path it happened. You need all three for fast incident resolution.
Structured logs beat string logs. Emit JSON with a stable schema (level, timestamp, tenantId, requestId, message) so you can filter and aggregate instead of grepping.
Alert on symptoms, not causes. Page on "checkout error rate above 2% for 5 minutes," not on "CPU at 80%." Symptom alerts map to user pain; cause alerts create noise.
Every log line needs a correlation ID so you can reconstruct one user's journey across services. In multi-tenant SaaS, add tenantId too — you will want per-tenant error rates.
Start cheap. Your platform (Vercel, Firebase, Supabase) already emits most of what you need. Add a hosted tool (Sentry, Axiom, Grafana Cloud) only when free-tier logs stop being enough.

What is the difference between logging, monitoring, and observability?

These words get used interchangeably and they shouldn't be.

Logging is the act of recording discrete events: a request came in, a payment failed, a webhook was retried. Monitoring is watching known metrics against known thresholds — you decided in advance what to measure and what "bad" looks like. Observability is the broader property: can you ask new questions of your system without shipping new code? Monitoring answers "is the thing I expected to break, broken?" Observability answers "why is this specific tenant seeing 500s that nobody else is?"

The practical implication: monitoring catches the failures you predicted; observability is what saves you on the failure you didn't.

What should you actually log in a SaaS app?

Log enough to reconstruct an incident, not so much that you drown. A useful default for each request:

Inbound request: method, path, tenantId, authenticated user ID, requestId.
Outbound calls: every call to Stripe, your DB, an email provider — with duration and outcome.
Errors: full stack trace, the input that triggered it (with secrets redacted), and the requestId so it links back.
Business events: subscription created, plan upgraded, tenant signed up. These double as a cheap product analytics stream.

What not to log: raw passwords, full card numbers, API keys, personal data you don't need. Redact at the logging layer, not after the fact.
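Redacting at the logging layer can be as small as one function that every log call passes through. The sketch below is one way to do it, not a complete solution: the list of sensitive field names is a made-up starting point, and matching on key names will miss secrets embedded inside free-text strings.

```typescript
// Hypothetical list of field names to treat as sensitive; adjust to your own schema.
const SENSITIVE_KEYS = new Set(["password", "cardnumber", "apikey", "authorization", "token"]);

type Json = string | number | boolean | null | Json[] | { [key: string]: Json };

// Recursively copy a value, replacing every sensitive field with a placeholder.
// The input is never mutated, so the caller's object stays usable.
function redact(value: Json): Json {
  if (Array.isArray(value)) {
    return value.map(redact);
  }
  if (value !== null && typeof value === "object") {
    const out: { [key: string]: Json } = {};
    for (const [key, inner] of Object.entries(value)) {
      out[key] = SENSITIVE_KEYS.has(key.toLowerCase()) ? "[REDACTED]" : redact(inner);
    }
    return out;
  }
  return value;
}

// Usage: redact the triggering input before it is attached to an error log.
console.log(JSON.stringify(redact({ email: "a@example.com", password: "hunter2" })));
```

Because the function runs before anything is serialized, a secret never reaches the log store in the first place, which is the point of redacting at the logging layer rather than after the fact.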

The format matters more than people expect. Structured JSON logs let you run a query like level=error AND tenantId=abc123 across millions of lines in seconds. String logs force you to grep and hope. Here is what a structured line looks like:

{ "level": "error", "ts": "2026-06-21T10:12:03Z", "tenantId": "clinic_42", "requestId": "req_8f3a", "msg": "stripe charge failed", "code": "card_declined" }

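That one line answers four questions a plain console.log("charge failed") cannot: when it happened, which tenant was affected, which request triggered it, and why the charge failed.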
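A line like that usually comes from a small helper rather than hand-written JSON. Here is a minimal sketch, assuming the field names from the example above (level, ts, tenantId, requestId, msg); the function name formatLog and its parameters are illustrative, not from any particular logging library.

```typescript
type Level = "debug" | "info" | "warn" | "error";

interface LogContext {
  tenantId: string;
  requestId: string;
}

// Build one JSON log line with a stable schema.
// Core fields are written after `extra`, so extra data cannot overwrite them.
function formatLog(
  level: Level,
  ctx: LogContext,
  msg: string,
  extra: Record<string, unknown> = {},
  now: Date = new Date(),
): string {
  return JSON.stringify({
    ...extra,
    level,
    ts: now.toISOString(),
    tenantId: ctx.tenantId,
    requestId: ctx.requestId,
    msg,
  });
}

// Usage: one line per event, written to stdout for the platform to collect.
console.log(
  formatLog("error", { tenantId: "clinic_42", requestId: "req_8f3a" }, "stripe charge failed", {
    code: "card_declined",
  }),
);
```

Keeping the schema in one function means every service emits the same field names, which is what makes a filter on tenantId or requestId work across all of them.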

When I built Callidus, a clinic SaaS on React, TypeScript, and Firebase with per-tenant Firestore security rules keyed off tenantId JWT claims, the same tenantId that gated data access turned out to be the natural correlation key for logs. That is not a coincidence: your multi-tenancy boundary and your observability dimension should be the same field. If you want the security side of that, the SaaS Backend Infrastructure pillar covers the tenant-isolation foundation.
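Once tenantId is on every line, a per-tenant view is a short aggregation away. The sketch below computes, for each tenant, the fraction of its log lines that are at level error. That is only a rough proxy for a request error rate, and it assumes one JSON object per line with the level and tenantId fields used above; the function name is hypothetical.

```typescript
// Fraction of each tenant's log lines that have level "error".
// Lines that are not JSON objects, or that lack a string tenantId, are skipped.
function errorRateByTenant(lines: string[]): Map<string, number> {
  const counts = new Map<string, { total: number; errors: number }>();

  for (const line of lines) {
    let parsed: unknown;
    try {
      parsed = JSON.parse(line);
    } catch {
      continue; // tolerate stray non-JSON output
    }
    if (typeof parsed !== "object" || parsed === null) continue;

    const { tenantId, level } = parsed as { tenantId?: unknown; level?: unknown };
    if (typeof tenantId !== "string") continue;

    const entry = counts.get(tenantId) ?? { total: 0, errors: 0 };
    entry.total += 1;
    if (level === "error") entry.errors += 1;
    counts.set(tenantId, entry);
  }

  const rates = new Map<string, number>();
  counts.forEach((entry, tenantId) => {
    rates.set(tenantId, entry.errors / entry.total);
  });
  return rates;
}
```

In practice a hosted log tool would run this as a query rather than application code, but the shape is the same: group by tenantId, then compare one tenant against the rest.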

Which metrics matter for a SaaS backend?

Don't try to graph everything. Start with the four "golden signals" and a few business metrics:

Latency — how long requests take, tracked as percentiles (p50, p95, p99), never as an average. Averages hide the slow tail that users actually feel.
Traffic — requests per second, so you can tell a real outage from a quiet Sunday.
Errors — error rate as a percentage of traffic, split by endpoint and ideally by tenant.
Saturation — how full your resources are (DB connections, function concurrency, queue depth).
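To see why the latency signal above is tracked as percentiles, it helps to compute both on the same data. This is a small sketch using the nearest-rank definition of a percentile (other definitions interpolate and give slightly different numbers); the sample latencies are invented for illustration.

```typescript
// Nearest-rank percentile: the smallest sample with at least p% of samples at or below it.
function percentile(samples: number[], p: number): number {
  if (samples.length === 0) throw new Error("percentile needs at least one sample");
  if (p <= 0 || p > 100) throw new RangeError("p must be in (0, 100]");
  const sorted = samples.slice().sort((a, b) => a - b);
  const rank = Math.ceil((p * sorted.length) / 100); // 1-based rank, always >= 1
  return sorted[rank - 1] as number;
}

function mean(samples: number[]): number {
  return samples.reduce((sum, value) => sum + value, 0) / samples.length;
}

// Invented data: 95 requests at 100 ms and 5 requests at 3000 ms.
const latenciesMs = [
  ...Array.from({ length: 95 }, () => 100),
  ...Array.from({ length: 5 }, () => 3000),
];

console.log(mean(latenciesMs)); // 245
console.log(percentile(latenciesMs, 50)); // 100
console.log(percentile(latenciesMs, 99)); // 3000
```

The average of 245 ms describes no real request: most users saw 100 ms and a few saw 3 seconds. The p99 is the number that shows the slow tail.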

Then layer business metrics on top: signups per day, failed payments, webhook retry rate, background-job backlog. These are the ones a founder actually cares about, and they often surface problems before the technical metrics do — a spike in failed payments is a billing bug before it is a CPU graph.

If background work is a big part of your app, the queue-depth and retry metrics deserve their own dashboard. I cover the processing side in the guide on background jobs for SaaS, and the closely related problem of making webhooks survive failure in webhook reliability, idempotency, and dead-letter queues.

How do traces tie it all together?

A trace follows one request through every service it touches and stamps each hop with timing. When checkout is slow, a trace tells you instantly whether the time went to your database, to Stripe, or to your own code — instead of you bisecting logs by hand.
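You can approximate that per-hop timing without a tracing vendor by wrapping each outbound call. The following is a rough sketch, not a replacement for a real tracing library: the span names, the sleep stand-in, and the delays are all invented, and Date.now() is used for simplicity where a monotonic clock would be more accurate.

```typescript
interface Span {
  name: string;
  durationMs: number;
  ok: boolean;
}

// Run one hop, then record how long it took and whether it succeeded.
// The span is recorded in `finally`, so failed hops are timed too.
async function timed<T>(spans: Span[], name: string, hop: () => Promise<T>): Promise<T> {
  const start = Date.now();
  let ok = false;
  try {
    const result = await hop();
    ok = true;
    return result;
  } finally {
    spans.push({ name, durationMs: Date.now() - start, ok });
  }
}

// Pick the hop that consumed the most time in this request.
function slowestSpan(spans: Span[]): Span | undefined {
  return spans.reduce<Span | undefined>(
    (worst, span) => (worst === undefined || span.durationMs > worst.durationMs ? span : worst),
    undefined,
  );
}

// Stand-in for a real network call; the delays below are made up.
const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

async function handleCheckout(): Promise<Span[]> {
  const spans: Span[] = [];
  await timed(spans, "db.loadCart", () => sleep(5));
  await timed(spans, "stripe.charge", () => sleep(60));
  return spans;
}
```

Logging the collected spans alongside the request's ID gives a flat, single-service version of what a trace viewer shows: which hop the time went to.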

The mechanism is a correlation ID (often called a trace ID) generated at the edge and propagated through every downstream call and log line. Even without a full distributed-tracing vendor, you get most of the value just by generating one ID per request and including it in every log line. Then "show me everything that happened for req_8f3a" becomes a single filter.
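In a Node.js backend, one way to carry that ID without threading it through every function signature is AsyncLocalStorage from the standard library. Treat this as a sketch under assumptions: the helper names are hypothetical, the req_ prefix just mirrors the example ID above, and x-request-id is a common header convention rather than anything this guide prescribes.

```typescript
import { AsyncLocalStorage } from "node:async_hooks";
import { randomUUID } from "node:crypto";

interface RequestContext {
  requestId: string;
  tenantId: string;
}

const requestContext = new AsyncLocalStorage<RequestContext>();

// Run one request's handler inside its own context. Reuse an inbound ID if the
// caller sent one; otherwise mint a new one at the edge.
function withRequestContext<T>(tenantId: string, inboundId: string | undefined, handler: () => T): T {
  const requestId = inboundId ?? `req_${randomUUID()}`;
  return requestContext.run({ requestId, tenantId }, handler);
}

// Every log line picks up the current request's IDs automatically.
function logLine(level: string, msg: string): string {
  const ctx = requestContext.getStore();
  return JSON.stringify({
    level,
    ts: new Date().toISOString(),
    requestId: ctx?.requestId ?? null,
    tenantId: ctx?.tenantId ?? null,
    msg,
  });
}

// Headers to attach to outbound calls so downstream services can log the same ID.
function outboundHeaders(): Record<string, string> {
  const ctx = requestContext.getStore();
  return ctx ? { "x-request-id": ctx.requestId } : {};
}
```

The context survives await boundaries inside the handler, so code several calls deep can log with the right requestId without knowing where it came from.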

In a six-platform Flutter app like BookBed (iOS, Android, Web, macOS, Linux, Windows from one codebase, with bidirectional iCal sync and Stripe billing at EUR 9/mo up to 20 units), the value compounds: a bug report from a macOS user and the same bug from an Android user share a code path, and a trace ID in the client logs lets you connect the client-side report to the exact server request that failed. Without correlation, cross-platform bug reports are just vibes.

What makes an alert good instead of annoying?

The fastest way to make a team ignore alerts is to page them for things that don't matter. Good alerting follows a few rules:

Alert on symptoms users feel. "Login error rate above 5%" is actionable. "Memory at 70%" usually isn't — it might be totally normal.
Require duration. A one-second blip shouldn't page anyone. "Above threshold for 5 minutes" filters out noise.
Tier by severity. A page-me-at-3am alert (checkout is down) is different from a look-at-it-tomorrow alert (a nightly job ran 10% slow). Route them to different channels.
Make every alert actionable. If the on-call person can't do anything about it, it shouldn't be an alert — it should be a dashboard.
Kill flaky alerts immediately. One false page erodes trust in all of them. Tune or delete.
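The "symptom plus duration" rule from the list above is simple enough to express directly. Here is a minimal sketch, assuming you already have per-minute request and error counts for an endpoint; the bucket shape, function names, and numbers are illustrative, and hosted alerting tools implement the same idea through their own rule syntax.

```typescript
interface MinuteBucket {
  requests: number;
  errors: number;
}

// Error rate for one minute; a minute with no traffic counts as healthy.
function errorRate(bucket: MinuteBucket): number {
  return bucket.requests === 0 ? 0 : bucket.errors / bucket.requests;
}

// Page only when every one of the most recent `forMinutes` buckets is above the threshold.
// `buckets` is ordered oldest to newest; `threshold` is a fraction (0.02 means 2%).
function shouldPage(buckets: MinuteBucket[], threshold: number, forMinutes: number): boolean {
  if (forMinutes < 1 || buckets.length < forMinutes) return false;
  return buckets.slice(-forMinutes).every((bucket) => errorRate(bucket) > threshold);
}

const healthy: MinuteBucket = { requests: 1000, errors: 5 }; // 0.5%
const failing: MinuteBucket = { requests: 1000, errors: 40 }; // 4%

// A one-minute blip does not page.
console.log(shouldPage([healthy, healthy, healthy, healthy, failing], 0.02, 5)); // false
// Five consecutive bad minutes do.
console.log(shouldPage([healthy, failing, failing, failing, failing, failing], 0.02, 5)); // true
```

Requiring every bucket in the window to be bad is the strictest form of the duration rule. It trades a few minutes of detection delay for far fewer false pages.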

A simple starting set for a small SaaS: error-rate alert on your top three endpoints, a payment-failure alert, a "site is fully down" uptime check from an external monitor, and a background-job backlog alert. That is four kinds of alert, and together they cover the failures that actually lose you customers.

What tools should a small SaaS start with?

Resist the urge to build a Grafana cluster on day one. Your platform already gives you a lot:

Vercel surfaces function logs and basic analytics out of the box.
Firebase and Supabase both ship logs and usage dashboards; Supabase exposes Postgres logs and the Realtime layer directly.
For errors, a hosted error tracker like Sentry captures stack traces with context and groups duplicates — this is usually the first paid tool worth adding.
For log search and metrics, hosted options like Axiom, Grafana Cloud, or Better Stack have generous free tiers that cover an early-stage SaaS.

The progression that works: ship on your platform's built-in logs, add error tracking the moment you have paying users, add a log-search tool when grepping platform logs gets painful, and only consider self-hosted observability when cost or data-residency forces it. This same "use your platform's defaults first, graduate later" logic shows up in deployment decisions too; see the guide on multi-region SaaS deployment: when and how for the same trade-off applied to infrastructure.

When I build for clients solo — whether a clinic SaaS shipped over roughly ten weeks or a React and Supabase ordering app like Pizzeria Bestek with four languages and Supabase Realtime — observability is not a separate phase. It is correlation IDs and structured logs added as the code is written, because retrofitting them after an incident is how you spend a weekend grepping.

FAQ

(See FAQ section below.)

Frequently Asked Questions

What is the difference between monitoring and observability?

What should I log in a multi-tenant SaaS app?

What metrics should a small SaaS monitor first?

How do I avoid alert fatigue?
