← All guides

Guide · Database monitoring

How to monitor a SQL Server database without an APM

Your HTTP healthcheck is green, your TCP probe on 1433 is green, and SQL Server is quietly stuck in a 40-minute blocking chain with tempdb at 98%. Here's the minimum-viable SQL Server health monitoring setup that catches the silent failures — without paying for Datadog DBM or installing a SQL Server Agent extension.

Published 2026-05-22 · ~10 min read · StatusPulse Team

"The app is up" doesn't mean SQL Server is healthy

Most monitoring setups stop at one of two checks: an HTTP probe on /healthz that returns 200, or a TCP probe on port 1433 that confirms the SQL endpoint is accepting connections. Both can be green for hours while SQL Server is in a state nobody would call healthy.

The failure modes a TCP or HTTP probe misses, in SQL Server production, every week:

  • Availability Group replication is broken. The primary is fine, the secondary accepts connections, the log send queue blew past a gigabyte and the synchronisation_health column flipped to not_healthy at 03:14. Read-only workloads pointed at the listener are now reading minutes-old data — silently.
  • A blocking chain is growing. One session took a row lock, didn't commit, fifty more sessions are now stacked behind it waiting on LCK_M_X. The database itself is idle, the TCP probe is delighted, and your app is timing out at exactly the rate new requests arrive.
  • tempdb is full. A bad query spilled a sort to tempdb, autogrow hit the disk ceiling, and now every query that needs a worktable fails with Could not allocate space for object 'dbo.SORT temporary run storage'. Connection-only probes pass; anything more complex than SELECT 1 craters.
  • The database is SUSPECT or RECOVERY_PENDING. A storage hiccup or a botched restore left a database in a state where you can connect to the server but the database itself refuses queries. sys.databases.state_desc tells you in one column; SELECT 1 against master won't.
  • Statistics are stale and the plan flipped. A missing index, a parameter-sniff regression after a stats update, or a planner choosing a parallel scan on what used to be a seek — checkout suddenly takes 8 seconds. The HTTP probe still loads the homepage in 80 ms, because the homepage doesn't hit that query.

None of these show up as "SQL Server is down." All of them show up as "the product is broken." If you wait for your customers to tell you on a support ticket, you'll find out about every single one of them.

What "healthy SQL Server" actually looks like

Forget the dashboard with 80 metrics for a moment. The signals that matter for "is SQL Server healthy right now" boil down to six numbers:

  1. Connection success. Can a real client open a fresh connection, authenticate, and run a query within a few seconds? This subsumes TCP, TLS, login auth, and the database is not in recovery / suspect / restoring cases.
  2. Latency on a sentinel query. A trivial SELECT 1 end-to-end gives you the floor latency. When it doubles, something is wrong — saturated worker threads, a tempdb that's spilled, a wait chain on SOS_SCHEDULER_YIELD.
  3. AG replication lag. Either the size of the log send queue on the primary, or seconds behind on the secondary. Either is fine; both catastrophically diverging is what you care about. (If you're not on AlwaysOn, skip this; if you're on log shipping, the equivalent is the last restore log timestamp on the standby.)
  4. Active session count. How many sessions are actively running as opposed to sleeping. Crossing some fraction of your worker-thread budget (or the platform default — 576 for SQL Server on most modern boxes) is your warning shot.
  5. Blocking chain depth. Anything with blocking_session_id > 0 for more than a few minutes is suspicious. A blocking chain three or four deep that doesn't clear is the most common "the database is fine, the app is broken" pattern in SQL Server.
  6. Database state. Every database you care about should be ONLINE. Anything in SUSPECT, RECOVERY_PENDING, EMERGENCY, or OFFLINE means a human needs to look — fast.

Notice that all six are answered by queries against SQL Server's own DMVs and system catalogues. You don't need an agent. You don't need extended events. You need a thing that runs a SELECT on a schedule and tells you when the answer changes.

Why a generic APM is overkill for most teams

The textbook answer to "monitor SQL Server" is to buy Datadog or New Relic, install the agent, enable Database Monitoring (DBM), and let it ingest query samples, execution plans, and wait events. It's a great product. It's also wildly oversized for most Series A-B teams running one or two SQL Server instances.

What you actually pay for with a full APM on SQL Server:

  • An agent on every database host or sidecar. On Azure SQL Database, Azure SQL Managed Instance, RDS for SQL Server, or Cloud SQL for SQL Server, you don't get a host — you get an endpoint. Agent-based monitoring becomes a sidecar VM you run yourself, which is the opposite of why you went managed.
  • An observability stack you didn't ask for. APMs assume you also want distributed tracing, log ingestion, RUM, and synthetic checks. You'll pay per-host, per-GB, and per-million-spans whether you use them or not.
  • Per-database-host pricing that punishes growth. Datadog DBM lists around $70 per database host per month before ingest, on top of your APM seat. For a team running staging + prod + a read replica, you're at $200+/month for the DB layer alone before you've watched a single query.
  • Windows Authentication friction. If your SQL Server is on-prem with Windows Auth, the APM agent needs a Windows service account with the right delegation, and you end up debugging Kerberos every time someone rotates a password.

Most teams don't need query-level introspection at that price. They need a confident answer to "is SQL Server healthy?" with alerting on the failure modes that actually page humans. That's what StatusPulse's Database probe is built for: a real TDS connection, a real query, a real assertion, on a schedule, with a status page in front of it. No agent. No host accounting. No span ingest.

StatusPulse's Database probe is on the Business tier — it's not free, but it's a flat add-on, not a per-host meter. If you've already outgrown free uptime monitors but you're not ready to feed the APM beast, this is the gap it fills.

The minimum-viable probe set

Three probes will catch the overwhelming majority of "the database is wrong" incidents. Each maps to one of the signals from above and runs an actual T-SQL query through a real TDS connection.

Probe 1 — Connection + sentinel query

The base case. Opens a fresh connection with your probe login, runs SELECT 1, captures the auth-handshake time and total round-trip separately. If the server is unreachable, the login is locked out, the TLS cert has rotated, or the instance is just plain slow to accept connections, this probe goes Down or Degraded.

-- Probe query
SELECT 1;

On StatusPulse's Database probe this is the Level 2 auth + ping mode — no custom query needed, just host, port (usually 1433), database, credentials, and TLS. Treat the Degraded threshold as your sentinel-latency budget. 500 ms is a reasonable starting value for a same-region managed SQL Server (Azure SQL Database or RDS); tighten or loosen once you have a week of baseline data.

Probe 2 — AlwaysOn replication lag

Point this at any node of the Availability Group — the DMV joins give you the same answer regardless. The query asks how much log data is stacked up waiting to ship to the secondary, in kilobytes:

-- Run on any AG replica
SELECT TOP 1 drs.log_send_queue_size AS lag_kb
  FROM sys.dm_hadr_database_replica_states drs
  JOIN sys.availability_replicas ar
    ON drs.replica_id = ar.replica_id
 WHERE ar.replica_server_name <> @SERVERNAME
 ORDER BY drs.log_send_queue_size DESC;

Use the First col in [min..max] assertion with a range like 0..51200 (50 MB of unshipped log). The secondary is allowed to be a little behind; anything more flips the probe Down. If your AG has a strict RPO — say, a financial system where 10 seconds of lag is already a contract breach — narrow the range.

If you'd rather measure seconds behind instead of log bytes (more intuitive for read-only secondary traffic):

-- Estimated seconds behind on the secondary
SELECT TOP 1 drs.secondary_lag_seconds AS lag_seconds
  FROM sys.dm_hadr_database_replica_states drs
  JOIN sys.availability_replicas ar
    ON drs.replica_id = ar.replica_id
 WHERE ar.replica_server_name <> @SERVERNAME
   AND drs.secondary_lag_seconds IS NOT NULL
 ORDER BY drs.secondary_lag_seconds DESC;

Not on AlwaysOn? If you're on log shipping, the equivalent probe asks how recent the last applied log restore is on the standby:

-- Run on the log-shipped secondary, against msdb
SELECT DATEDIFF(second, MAX(restore_date), GETUTCDATE()) AS seconds_since_restore
  FROM msdb.dbo.restorehistory
 WHERE restore_type = 'L';

Probe 3 — Active sessions and blocking

Counts active sessions, which is the best proxy for "is the server under load" you can get without instrumenting your app. And — more importantly — counts how many of those sessions are blocked by someone else.

-- Active and blocked sessions, excluding system + the probe itself
SELECT count(*) AS active
  FROM sys.dm_exec_sessions s
 WHERE s.is_user_process = 1
   AND s.status = 'running'
   AND s.session_id <> @SPID;

Use First col in [min..max] with a ceiling well below your worker-thread budget. SQL Server's max_worker_threads default is 0 (auto-calculated) which usually lands at 576 — set the range to 0..400 as a safe ceiling. The probe flips Down the moment the count crosses, and you have a clear signal to look at blocking, parallelism, or whatever is starving the scheduler — before worker-thread exhaustion takes the whole instance down.

That's it. Three probes covers connection health, replication health, and session-pressure health — the three failure modes that actually page people. The product spec for every assertion mode and the engine-version drift detection lives in the Database probe help section if you want the full reference.

Probes that catch each failure mode

A cheat sheet mapping each failure mode from the first section to a concrete probe, query, and threshold. Run these on a 1-5 minute cadence — SQL Server can wait that long for you to notice.

Failure mode Query Assertion
Login broken / instance unreachable SELECT 1 None (query must succeed)
Database not ONLINE SELECT count(*) FROM sys.databases WHERE name = N'YourDb' AND state_desc = 'ONLINE' Row count at least 1
AG log send queue too large SELECT TOP 1 log_send_queue_size FROM sys.dm_hadr_database_replica_states ORDER BY log_send_queue_size DESC First col in 0..51200
AG sync health degraded SELECT count(*) FROM sys.dm_hadr_database_replica_states WHERE synchronization_health_desc <> 'HEALTHY' Row count exactly 0
Blocking chain not clearing SELECT count(*) FROM sys.dm_exec_requests WHERE blocking_session_id > 0 AND DATEDIFF(second, start_time, GETDATE()) > 60 Row count exactly 0
Long-running request SELECT count(*) FROM sys.dm_exec_requests WHERE session_id > 50 AND DATEDIFF(minute, start_time, GETDATE()) > 10 Row count exactly 0
tempdb close to full SELECT 100.0 * SUM(unallocated_extent_page_count) / SUM(total_page_count) AS pct_free FROM sys.dm_db_file_space_usage First col in 10..100 (Down when below 10% free)
Last full backup too old SELECT DATEDIFF(hour, MAX(backup_finish_date), GETDATE()) FROM msdb.dbo.backupset WHERE database_name = N'YourDb' AND type = 'D' First col in 0..36
SQL Server Agent job failed SELECT count(*) FROM msdb.dbo.sysjobhistory h JOIN msdb.dbo.sysjobs j ON h.job_id = j.job_id WHERE h.run_status = 0 AND h.step_id = 0 AND msdb.dbo.agent_datetime(h.run_date, h.run_time) > DATEADD(hour, -1, GETDATE()) Row count exactly 0

Each row drops into a StatusPulse Database probe verbatim. The assertion modes are first-class on the probe form — pick the mode from a dropdown, paste the expected value, save. Internally the probe runs the SQL through a real TDS connection on a schedule, evaluates the assertion, and reports Up / Degraded / Down with a sanitised error if the query itself fails.

A note on the backup and Agent-job probes: both query msdb, which doesn't exist on Azure SQL Database (single-database SKU). Use the sys.dm_database_backups DMV on Azure SQL DB instead, or just skip these probes if your backups are managed by the platform.

A probe login with the right grants

Don't reuse your app login for the probe. Create a dedicated SQL login or contained user with the absolute minimum grants — every query in this guide works with the following set:

-- On SQL Server (Windows / Linux / RDS / Managed Instance)
CREATE LOGIN statuspulse_probe WITH PASSWORD = N'<strong-random>';

USE master;
GRANT VIEW SERVER STATE      TO statuspulse_probe;  -- DMVs at server scope
GRANT VIEW SERVER PERFORMANCE STATE TO statuspulse_probe; -- SQL 2022+ AG DMVs
GRANT VIEW ANY DATABASE      TO statuspulse_probe;  -- sys.databases visibility

-- In each database you probe, including msdb if you query backups
USE YourDb;
CREATE USER statuspulse_probe FOR LOGIN statuspulse_probe;
GRANT VIEW DATABASE STATE TO statuspulse_probe;     -- db-scope DMVs
-- Optional: SELECT on specific tables if you assert against business data
-- GRANT SELECT ON dbo.orders TO statuspulse_probe;

On Azure SQL Database (single-DB SKU) the GRANT VIEW SERVER STATE equivalent is to add the user to ##MS_ServerStateReader## in master. The script above won't apply verbatim; adapt to the platform.

Rotate the probe password independently from the app's credentials, and store it encrypted at rest — StatusPulse's Database probe wraps it in AES-GCM before persisting, but the principle holds wherever you run probes.

What to alert on, what to ignore

Nine probes will generate dozens of state transitions a week if you page on every flap. The fastest way to make on-call hate the SQL Server dashboard is to wire every Degraded signal straight to PagerDuty.

A defensible split, after running this setup across several production SQL Server fleets:

  • Page (wake someone up): connection probe Down for 2 consecutive checks; database state not ONLINE; AG sync health degraded for 2 consecutive checks; tempdb below 10% free; blocking chain older than 60 seconds for 5 consecutive minutes. These are real, customer-affecting problems that only get worse with time.
  • Notify (Slack or email, no page): AG log send queue above the threshold for 10 minutes; active session count above 80% of worker-thread budget for 10 minutes; long-running request count above zero for 15 minutes; last full backup older than your RPO budget. These need a human to look, but they don't need a human at 03:00.
  • Dashboard only: engine version drift, Agent-job failures from the last hour (yes, even these — they self-clear), single-check Degraded transitions, sentinel latency between baseline and 2× baseline. Put them on the status page and the internal dashboard. Don't alert on them.

StatusPulse handles the consecutive-checks heuristic and the intermittent-failure suppression at the probe level, so you can wire the Slack channel and the on-call PagerDuty webhook to different incident severities without writing routing logic yourself. The Postgres companion to this article runs through the same pattern from the Postgres side; the grammar of the probes is identical, only the queries change.

Wrap-up

Three things to take away:

  1. "The app is up" and "SQL Server is healthy" are not the same statement. A TCP probe on 1433 can be green while tempdb is full, an AG secondary is hours behind, or a blocking chain is taking the application down.
  2. Six T-SQL queries against the DMVs answer the practical questions. You don't need an APM agent or a per-host-priced DBM tier to run them on a schedule.
  3. Alert on the failures that page humans (connection, database state, AG health, blocking, tempdb). Notify on the rest. Dashboard the noise. The fastest way to lose trust in your monitoring is to alert on everything.

If you'd rather not build the cron-plus-T-SQL-plus-alert-routing yourself, that's exactly what StatusPulse's Database probe ships. Business tier, no agent, no per-host fee, runs in US or EU regions.

Try StatusPulse's Database probe

5 probes, 1 status page, forever. No credit card. US or EU host — you choose.