
Guide · DNS probe

DNS monitoring — catch propagation lag and hijacks

The HTTP probe says Down, but the real cause is a TXT record someone edited two hours ago. Or the apex A is fine on Cloudflare but stale on Google's resolver. Or the MX flipped to an attacker-controlled host and inbound email has been silently diverted for a week. DNS monitoring is the layer that catches the failures HTTP probes never see, and the layer where the most expensive incidents originate.

Published 2026-05-22 · ~10 min read · StatusPulse Team

Why DNS failures cascade silently

The default monitoring story for a public service is: a synthetic HTTP probe hits /health from a few regions, asserts a 200, alerts when it doesn't get one. That works fine for application-layer outages, and it is precisely the wrong shape for DNS failures — the HTTP probe collapses the entire stack into a single binary signal.

When the HTTP probe flips Down, the on-call walks the stack from the top: load balancer, app, DB. By the time someone runs dig against the apex and notices the record points at an old elastic IP that was recycled last week, an hour has passed. Users reported "site won't load" on Twitter hours before the infrastructure caught up — because their resolvers cached the broken record at varying rates and the first wave hit the cliff before the synthetic probe's own resolver did.

The pattern is worse for email. Nothing in the standard HTTP-based monitoring kit watches MX or the sender-authentication TXT records. A junior admin "cleans up" a TXT record, the SPF string vanishes, and outbound mail starts failing SPF checks at every receiver. No alert fires anywhere: bounces accumulate, the deliverability dashboard takes a day to update, and you find out from a customer asking why their password reset never arrived. Pair this with an MX edited to point at a bogus host, and you have an undetected mail-flow takeover.

DNS monitoring fixes the polarity. Instead of asserting the end of the stack works, you assert each link holds the value you expect. The probe doesn't care whether /health returns 200; it cares whether the apex A still resolves to 93.184.216.34, whether MX still lists 10 mail.example.com, whether the SPF TXT still contains include:_spf.google.com.

What DNS records to actually monitor

Not every record needs a probe. A small SaaS with one product domain runs six to ten DNS probes; a mid-sized platform with marketing, app, API and mail sub-domains runs 20-40. The records that earn their probe slot, ranked by how often they cause incidents:

  • A / AAAA on the apex and critical sub-domains. example.com, www.example.com, api.example.com, app.example.com. The apex is the most common target for accidental edits during migrations — someone updates the CDN, forgets the apex flattener, and the A record points at the old origin. AAAA matters if you advertise IPv6; broken AAAA silently burns the Happy Eyeballs timeout for every IPv6-preferred user.
  • CNAME for vanity URLs. status.example.com, docs.example.com, email.example.com. When the provider rotates endpoints and forgets to email the customer, the vanity URL 404s and the only signal is the CNAME pointing at a host that no longer answers.
  • MX for inbound mail. Catastrophic when wrong, completely invisible to the HTTP probe. Assert the exact priority+host pair (e.g. 10 mail.example.com). For multi-MX setups, use any-match against the primary so that reordering or removing a secondary doesn't trip the probe.
  • TXT for SPF / DKIM / DMARC and verification tokens. Email-sender records at the apex, DKIM selectors at default._domainkey.example.com, DMARC at _dmarc.example.com, and the Google / Microsoft / Okta verification TXTs you don't think about until someone deletes them and SSO breaks. See email monitoring for the deliverability angle.
  • NS for nameserver drift. Especially when you run dual DNS providers (Route 53 + NS1, Cloudflare + Dyn). A misconfigured registrar update drops one provider from the delegation and your "redundant DNS" becomes a single point of failure nobody noticed.
  • CAA for certificate issuance authorisation. One CAA edit and Let's Encrypt's renewal pipeline returns CAA mismatch. The existing cert keeps serving for the 30-90 days it has left, then it expires and prod starts failing TLS at the edge. The SSL monitoring guide covers the cert side; the CAA record is the upstream cause.

Resolver pinning — catching propagation lag

DNS is eventually consistent. A record edited at the authoritative becomes visible at a public resolver only after that resolver's cached copy expires (TTL) and it fetches a fresh answer. Different pools refresh on different schedules: Cloudflare 1.1.1.1 tends to pick up changes quickly, Google 8.8.8.8 can be slower and varies by location, and ISP resolvers can hold a stale answer longer than the TTL permits.

During a migration this matters in two directions. Assert only against the system resolver and you may see a green probe while half your users hit the old record; assert only against one public resolver and you miss the window where another is still serving the old value. The fix is resolver pinning: run the same DNS probe several times, each pinned to a different resolver, and let the deltas tell the migration story.

dig @1.1.1.1 example.com A +short
93.184.216.34

dig @8.8.8.8 example.com A +short
198.51.100.42       # still the old value — propagation in progress

dig @9.9.9.9 example.com A +short
93.184.216.34

In StatusPulse this is the Resolver server field on the DNS probe. Leave it blank to use the prober region's system resolver. Set it to 1.1.1.1, 8.8.8.8 or 9.9.9.9 (Quad9) to bypass the system resolver and query that one directly. Create three probes (one per resolver, same target, same expected value) and the deltas are your propagation map.

A useful asymmetry: pin the probe at your own authoritative nameserver. dig @ns1.example.com gets you the source-of-truth answer, bypassing every public cache. Probing the authoritative is the "what should be served" check; probing public resolvers is "what is actually being served". The gap between them is propagation lag.
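The comparison between "what should be served" and "what is actually being served" can be sketched in a few lines of Python. This is a minimal illustration, not StatusPulse's implementation: the function name propagation_map is hypothetical, the answers are hard-coded from the dig output above rather than queried live, and the authoritative answer is assumed to be the new address.

```python
from typing import Dict, List, Set


def propagation_map(authoritative: List[str],
                    by_resolver: Dict[str, List[str]]) -> Dict[str, str]:
    """Label each public resolver 'converged' or 'stale' against the
    authoritative answer. Order is ignored: only the set of values matters."""
    truth: Set[str] = set(authoritative)
    return {
        resolver: "converged" if set(answers) == truth else "stale"
        for resolver, answers in by_resolver.items()
    }


# Answers copied from the dig output above (assumed, not queried live).
observed = {
    "1.1.1.1": ["93.184.216.34"],
    "8.8.8.8": ["198.51.100.42"],
    "9.9.9.9": ["93.184.216.34"],
}

status = propagation_map(["93.184.216.34"], observed)
stale = sorted(r for r, s in status.items() if s == "stale")

print(status)
print("still propagating:", stale)
```

With the sample data, only 8.8.8.8 is reported as stale. Once every resolver reads converged, the propagation tail is over.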

Assertion modes: exact, contains, any-match

A DNS query can return one value (a single CNAME), several in deterministic order (CAA), or several in non-deterministic order (round-robin A, multi-MX). The assertion mode picks how the probe compares the resolved set against your expected string.

Exact match joins every returned record into a single comma-separated string and compares it to the expected string. Use it when the record is stable and every change should fire: a single-value CNAME, a singleton CAA, the apex A on a service that doesn't round-robin.

Contains requires the expected string to appear as a substring of the joined resolved value. It is the forgiving default for TXT, where resolvers present quotes and whitespace differently and long SPF / DKIM records are split into multiple strings of at most 255 bytes that get reassembled on the way back.

Any-match walks each record in the response and asserts at least one equals the expected string exactly. The right choice for round-robin A and multi-MX: you assert a specific IP or MX is still in the pool without caring about ordering. Asserting 10 mail.example.com in any-match against a four-MX configuration passes as long as that priority+host pair is present.

The single most common false-Down on DNS probes is exact match against a round-robin A. The resolver shuffles, the joined string flips between "1.2.3.4, 5.6.7.8" and "5.6.7.8, 1.2.3.4", and the probe flaps every interval. If you're acknowledging the same DNS alert twice a day, switch it to any-match.
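The three modes are small enough to sketch directly, which makes the round-robin flap easy to reproduce. The function names and the ", " join separator are assumptions for illustration (the separator is taken from the "1.2.3.4, 5.6.7.8" example above); the probe's real internals may differ.

```python
from typing import List

SEPARATOR = ", "  # assumed join format, matching the example strings above


def exact(records: List[str], expected: str) -> bool:
    """Join the records in the order received and compare the whole string."""
    return SEPARATOR.join(records) == expected


def contains(records: List[str], expected: str) -> bool:
    """Pass when the expected text is a substring of the joined records."""
    return expected in SEPARATOR.join(records)


def any_match(records: List[str], expected: str) -> bool:
    """Pass when at least one record equals the expected text exactly."""
    return any(record == expected for record in records)


# The same round-robin A set, returned in two different orders.
first = ["1.2.3.4", "5.6.7.8"]
second = ["5.6.7.8", "1.2.3.4"]

print(exact(first, "1.2.3.4, 5.6.7.8"), exact(second, "1.2.3.4, 5.6.7.8"))
print(any_match(first, "1.2.3.4"), any_match(second, "1.2.3.4"))
print(contains(["v=spf1 include:_spf.google.com -all"], "include:_spf.google.com"))
```

Exact match passes the first ordering and fails the second, which is the flap described above. Any-match passes both because it never looks at position.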

SPF / DKIM / DMARC monitoring

Email-authentication TXT records are the highest-leverage DNS monitoring targets and the most commonly missed. An SPF record is a single TXT at the apex that authorises sending services. When you migrate from SendGrid to Postmark, the new include:spf.mtasv.net goes in and someone is supposed to remove the old include:sendgrid.net — and they forget, or remove both, or paste the wrong vendor's include. Receivers silently demote affected mail to spam and the deliverability drop takes a week to show up.

dig @1.1.1.1 example.com TXT +short
"v=spf1 include:_spf.google.com include:mailgun.org -all"

dig @1.1.1.1 default._domainkey.example.com TXT +short
"v=DKIM1; k=rsa; p=MIGfMA0GCSqGSIb3DQEBAQUAA4GNADCBiQKBgQ..."

dig @1.1.1.1 _dmarc.example.com TXT +short
"v=DMARC1; p=quarantine; rua=mailto:dmarc@example.com; pct=100"

Three probes cover the email-auth stack. Each is a DNS TXT probe in Contains mode against a stable substring:

  • SPF. Target example.com, type TXT, expected v=spf1 plus your most critical include: — usually the transactional sender. The full SPF string drifts as marketing tools come and go; asserting on the transactional include catches the catastrophic case (sender removed entirely) without flapping on every marketing tool change.
  • DKIM. Target <selector>._domainkey.example.com, type TXT, expected v=DKIM1. The public-key body changes when you rotate keys; asserting on the v=DKIM1 prefix is enough to catch deletion without binding you to a specific key.
  • DMARC. Target _dmarc.example.com, type TXT, expected v=DMARC1; p=. Asserts the policy is present at all. If you've moved from p=none to p=quarantine or p=reject, assert the exact policy — DMARC regressions are a deliverability event in their own right.
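As a rough sketch, the three probes above reduce to a table of targets and required substrings. The EMAIL_AUTH_PROBES layout and the failing_probes helper are hypothetical, the DKIM selector "default" is the one used in the dig example, and modelling "v=spf1 plus your critical include" as two separate substrings is an assumption made so the check survives other includes appearing in between.

```python
from typing import Dict, List

# Hypothetical probe table: target name -> substrings that must all appear
# in a single TXT record at that name.
EMAIL_AUTH_PROBES: Dict[str, List[str]] = {
    "example.com": ["v=spf1", "include:_spf.google.com"],
    "default._domainkey.example.com": ["v=DKIM1"],
    "_dmarc.example.com": ["v=DMARC1; p=quarantine"],
}


def failing_probes(answers: Dict[str, List[str]]) -> List[str]:
    """Return the targets where no TXT record contains every required substring."""
    failed = []
    for target, required in EMAIL_AUTH_PROBES.items():
        records = answers.get(target, [])
        if not any(all(s in record for s in required) for record in records):
            failed.append(target)
    return failed


# TXT values taken from the dig output above (key body shortened).
healthy = {
    "example.com": ["v=spf1 include:_spf.google.com include:mailgun.org -all"],
    "default._domainkey.example.com": ["v=DKIM1; k=rsa; p=MIGfMA0G"],
    "_dmarc.example.com": ["v=DMARC1; p=quarantine; rua=mailto:dmarc@example.com; pct=100"],
}

# Same zone after a bad edit: Google include removed, DMARC relaxed to p=none.
broken = dict(healthy)
broken["example.com"] = ["v=spf1 include:mailgun.org -all"]
broken["_dmarc.example.com"] = ["v=DMARC1; p=none; rua=mailto:dmarc@example.com"]

print(failing_probes(healthy))
print(failing_probes(broken))
```

The healthy zone produces an empty list. The edited zone fails the SPF and DMARC probes while DKIM stays green, which points straight at the records that changed.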

CAA monitoring is the same idea applied to certificate issuance. A CAA record like 0 issue "letsencrypt.org" at the apex tells public CAs that only Let's Encrypt is permitted to issue certs for the domain. Edit it, and your renewal pipeline starts failing at the CAA-check phase. The probe is a DNS CAA query against the apex with the expected string set to your authorised CA. Pair it with the SSL probe so the cert expiry watcher catches the downstream symptom and the CAA probe catches the upstream cause.

DNS hijack detection

A registrar compromise typically goes like this: the attacker gains access to your registrar account, edits the nameserver delegation to point at their own DNS, and serves their own answers for every record. Mail starts flowing through their MX, the website starts serving from their A — and your monitoring keeps querying the public resolver, which happily returns the attacker's records, because from the resolver's perspective the attacker IS the authoritative source.

The first signal in a value-pinned setup is the assertion failure. Every probe with an expected value flips Down within one interval of the edit propagating. Apex A at an unfamiliar IP. MX at a host you don't operate. NS listing a different nameserver provider than the one you pay for. The on-call sees five DNS probes fire simultaneously, looks at the targets, and the pattern is unmistakable — every record has been rewritten.

Pair this with the domain probe. The domain probe watches the WHOIS / RDAP record — registrar, registrant, nameserver delegation, expiry, EPP status codes. A registrar compromise updates the NS delegation at the registrar a few minutes before DNS edits propagate; the domain probe sees the NS change on WHOIS, the DNS probes see record values change after propagation. Both firing in the same incident is a high-confidence registrar-takeover signal. Watch for EPP status drops too — a domain that should sit at clientTransferProhibited suddenly showing ok is the canary that the registrar-level lock is gone.
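The correlation logic can be sketched as a small triage function. Everything here is illustrative: classify_incident, its labels, and the default threshold of five simultaneous failures are assumptions drawn from the scenario above, and the nameserver values are made up.

```python
from typing import Dict, Iterable, List, Set


def _normalise(names: Iterable[str]) -> Set[str]:
    """Compare nameserver names case-insensitively and without a trailing dot."""
    return {name.lower().rstrip(".") for name in names}


def classify_incident(dns_probe_ok: Dict[str, bool],
                      expected_ns: List[str],
                      registrar_ns: List[str],
                      min_failures: int = 5) -> str:
    """Combine DNS probe results with the delegation seen by the domain probe."""
    failed = [name for name, ok in dns_probe_ok.items() if not ok]
    ns_changed = _normalise(registrar_ns) != _normalise(expected_ns)
    if ns_changed and len(failed) >= min_failures:
        return "suspected registrar takeover"
    if ns_changed:
        return "delegation changed"
    if len(failed) >= min_failures:
        return "multiple records rewritten"
    if failed:
        return "single record drift"
    return "ok"


expected = ["ns1.example.com", "ns2.example.com"]

# Five value-pinned probes fail at once and the registrar lists unknown NS.
mass_failure = {"apex A": False, "www A": False, "MX": False,
                "SPF TXT": False, "NS": False, "CAA": True}
takeover = classify_incident(mass_failure, expected,
                             ["ns1.attacker.test", "ns2.attacker.test"])

# One probe fails and the delegation is unchanged (trailing dot, mixed case).
one_failure = {"apex A": True, "MX": True, "SPF TXT": False}
drift = classify_incident(one_failure, expected,
                          ["NS1.example.com.", "ns2.example.com."])

print(takeover)
print(drift)
```

The point of the split is routing: a single drifted record is a routine page, while mass failure combined with a delegation change should go straight to whoever can lock the registrar account.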

Common failure modes

DNS monitoring is full of edge cases that bite anyone who hasn't run it for a year. The ones worth knowing up front:

  • TTL too high during a migration. Records at TTL 86400 mean the propagation tail runs for a full day; resolver-pinned probes disagree for hours and you can't tell if the migration is healthy or stuck. Drop TTLs to 300 seconds 24-48 hours before any planned change (at least one full old-TTL period, so every cache has picked up the short TTL); propagation then takes minutes and probes converge fast.
  • Resolver caching makes assertions lag. Even at a low TTL, the prober region's system resolver caches answers. For real-time cutover visibility, pin the probe at the authoritative (dig @ns1.example.com), which doesn't cache its own zone.
  • Multi-provider authoritative drift. Run Route 53 + NS1 and update only one, and the two authoritatives serve different answers. Public resolvers get whichever NS they happened to query, so the mismatch looks like a flapping probe and a portion of the world's resolvers keep returning the old value until someone fixes the forgotten copy. Probe each authoritative directly by pinning the resolver at each NS; the drift becomes visible immediately. Then fix the discipline, not the symptom: change DNS through a single source of truth (Terraform, OctoDNS) that pushes to every provider in one run, never by hand at one provider's console.
  • NXDOMAIN vs NODATA. NXDOMAIN means "the name doesn't exist"; NODATA means "the name exists but not for this record type". Probing AAAA on an A-only host returns NODATA and a Down probe. Confirm the record type matches what's published.
  • Long TXT records get chunked. Each TXT string caps at 255 bytes; long SPF / DKIM records get split into multiple strings the resolver concatenates back — some with extra whitespace, some without. Use Contains mode for TXT and assert on a stable substring rather than the full record.
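For the chunking case, a minimal sketch of the reassembly step: dig prints a long TXT record as several quoted strings on one line, and the SPF and DKIM specifications say to concatenate them with nothing in between. The helper name join_txt_chunks is hypothetical and the key material below is a placeholder, not a real key.

```python
import re
from typing import List

# Matches each double-quoted string; backslash escapes are kept as written.
_QUOTED = re.compile(r'"((?:[^"\\]|\\.)*)"')


def join_txt_chunks(dig_line: str) -> str:
    """Concatenate the quoted strings of one TXT record with no separator.
    Falls back to the stripped input when the line has no quoted strings."""
    chunks: List[str] = _QUOTED.findall(dig_line)
    return "".join(chunks) if chunks else dig_line.strip()


# One SPF record split into two strings; the space lives inside the second chunk.
spf_line = '"v=spf1 include:_spf.google.com" " include:mailgun.org -all"'

# A DKIM record whose key is split mid-value (placeholder key material).
dkim_line = '"v=DKIM1; k=rsa; p=AAAA" "BBBB"'

spf = join_txt_chunks(spf_line)
dkim = join_txt_chunks(dkim_line)

print(spf)
print(dkim)
print("include:mailgun.org" in spf, dkim.startswith("v=DKIM1"))
```

Joining with a space instead of an empty string would corrupt the DKIM key, which is one reason a stable substring such as v=DKIM1 is a safer assertion than the full record.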

Wrap-up

DNS monitoring decouples "is the application working" from "are the records correct". An HTTP probe alone collapses every failure into a single Down and the on-call walks the stack top-down hunting for the cause. A value-pinned DNS probe tells you, before symptoms surface, which record drifted and which resolver pool is serving the bad answer. Pair it with the SSL probe for the cert-issuance chain, the domain probe for registrar-level changes, and the email probe for round-trip delivery — every leg of the DNS-dependent infrastructure gets a value assertion.

The recipe: six to ten DNS probes per domain covering apex A, critical sub-domains, MX, SPF / DKIM / DMARC TXTs, NS, and CAA. Resolver-pin two or three critical probes against 1.1.1.1, 8.8.8.8, 9.9.9.9. Use any-match for round-robin and multi-MX. Use contains for TXT. Drop TTLs to 300s before any planned change. When five DNS probes fire at the same time, treat it as registrar compromise until you've ruled it out — that's the failure mode that earns this layer its place. The full field reference lives in the DNS probe help page.
