When ICMP ping is the right monitor (and when it lies)
ICMP ping monitoring is the most-used and most-misused probe in our industry. It is the right tool for a small, well-defined set of jobs — and a confident liar for almost everything else. This is the opinionated guide on which side of that line your check belongs on.
The cardinal sin: ping as a website check
The single most common misuse of ICMP ping monitoring is pointing it at www.yourcompany.com and calling it a website uptime monitor. It will look like it works. It will even page you for some real outages. And then one Tuesday afternoon the application will be returning HTTP 500 on every request, the TLS cert will have expired four hours ago, the payments form will be silently dropping submissions, and your ping uptime monitor will be cheerfully green, because the load balancer in front of your app is still answering ICMP echo just fine.
Three structural reasons this happens, all of them independent of how well you configured the probe:
- The thing answering ping is not the thing serving your app. A modern web service sits behind at least one layer of indirection — an AWS ALB, a Cloudflare edge, an nginx ingress, an Azure Front Door. The host that responds to ICMP is the front door. The web app lives behind it, on a different IP, possibly in a different VPC. The front door can be perfectly healthy while every backend pod is crash-looping.
- Many load balancers don't answer ping at all. An AWS ALB answers ICMP echo on its public IP only if its security group allows inbound ICMP, and its listeners are L7-only either way. CloudFront and most CDN edges drop ICMP at the perimeter. So in the cases where ping does succeed against a cloud-fronted hostname, the reply is usually coming from the provider's edge, not from your origin.
- DNS resolves, ICMP succeeds, app is gone. Ping a hostname and you've proven exactly two things: DNS resolved, and something on that IP answered ICMP. Neither of those things implies your app is up. The hostname might point at the old IP from last month's migration. The IP might host a different service entirely. The application port might be closed and the host might still happily echo.
If the thing you actually care about is "can users reach my web app", the right probe is the HTTP probe, which terminates TLS, follows the redirect chain, inspects the status code and (if you want) asserts on response body. If you care about a specific port that isn't HTTP, the right probe is the TCP port probe. Ping is neither of those things.
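To make the difference concrete, here is a minimal sketch of what a TCP port check does that ping does not: it completes a handshake with the port the service actually listens on. The function name tcp_port_open and the 3-second default timeout are illustrative assumptions, not StatusPulse's implementation.

```python
import socket


def tcp_port_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP handshake to host:port completes within timeout.

    Hypothetical helper for illustration; a real probe would also record
    connect latency and distinguish refusals from timeouts.
    """
    try:
        # create_connection resolves the name and tries each address in turn.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS failures alike.
        return False
```

A host can answer ICMP echo while this returns False for the application port, which is the gap the paragraph above describes.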
What ping actually measures
Ping is three signals in a trench coat. People treat the "up/down" output of a ping monitor as one thing, but the underlying ICMP echo exchange produces three independent measurements, and every serious ICMP packet loss monitoring setup needs to think about each of them separately.
Run ping -c 5 1.1.1.1 and you'll get something like:
PING 1.1.1.1 (1.1.1.1) 56(84) bytes of data.
64 bytes from 1.1.1.1: icmp_seq=1 ttl=58 time=12.3 ms
64 bytes from 1.1.1.1: icmp_seq=2 ttl=58 time=11.8 ms
64 bytes from 1.1.1.1: icmp_seq=3 ttl=58 time=12.1 ms
64 bytes from 1.1.1.1: icmp_seq=4 ttl=58 time=14.9 ms
64 bytes from 1.1.1.1: icmp_seq=5 ttl=58 time=12.0 ms
--- 1.1.1.1 ping statistics ---
5 packets transmitted, 5 received, 0% packet loss, time 4006ms
rtt min/avg/max/mdev = 11.802/12.620/14.901/1.143 ms
Three signals:
- Round-trip time (RTT). The min/avg/max line. This is your network latency from probe to target and back. Useful for baselining, useless for "is it up": a host can have 400 ms RTT and still be perfectly reachable.
- Packet loss percentage. The 0% packet loss line. This is the binary "did the packet come back" signal aggregated over the batch. Partial loss (10-40%) means a flaky path; total loss (100%) means unreachable.
- Jitter. The mdev field (labelled "mean deviation"; iputils computes it as the standard deviation of the RTTs). High jitter means the path is unstable even when packets are getting through. That matters for VoIP, gaming, and real-time streaming, less so for "is the box alive".
A good RTT monitoring SaaS reports all three. A bad one collapses them into "up/down" and you lose the ability to tell the difference between "router is congested" and "router is dead". StatusPulse's Ping probe stores the average RTT and the packet-loss percentage per check and charts both on the probe detail page — set thresholds on whichever one matches your failure mode.
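As a rough illustration of keeping the three signals separate, the sketch below pulls loss, average RTT, and mdev out of a ping summary. It assumes the Linux iputils output format shown above (other platforms print different summary lines); PingStats and parse_ping_summary are hypothetical names.

```python
import re
from typing import NamedTuple, Optional


class PingStats(NamedTuple):
    loss_pct: float
    avg_ms: Optional[float]   # None when every packet was lost
    mdev_ms: Optional[float]  # None when every packet was lost


# Patterns assume iputils-style summary lines.
LOSS_RE = re.compile(r"([\d.]+)% packet loss")
RTT_RE = re.compile(
    r"rtt min/avg/max/mdev = ([\d.]+)/([\d.]+)/([\d.]+)/([\d.]+) ms"
)


def parse_ping_summary(output: str) -> PingStats:
    """Extract loss, average RTT, and mdev from ping's summary text."""
    loss = LOSS_RE.search(output)
    if loss is None:
        raise ValueError("no packet-loss line found")
    rtt = RTT_RE.search(output)
    if rtt is None:
        # ping omits the rtt line when nothing came back.
        return PingStats(float(loss.group(1)), None, None)
    return PingStats(
        float(loss.group(1)), float(rtt.group(2)), float(rtt.group(4))
    )
```

Keeping avg_ms and mdev_ms as None on total loss, rather than 0, avoids a dead target looking like a very fast one on a chart.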
The narrow set of cases where ping IS the right tool
Ping has its place. It's a small place. The cases where ICMP echo is genuinely the best probe — better than HTTP, better than TCP — all share one property: the thing you're checking is a piece of network infrastructure whose primary job is to forward packets, not to run an application.
- Routers, switches, gateways. A colo router, a VPN concentrator's public endpoint, a firewall's WAN interface, a managed-switch trunk. These devices typically respond to ICMP and not much else. You might have SSH on a management VLAN you can't reach from the public internet, and SNMP behind a community string you don't want to expose — ICMP is the one thing that works from anywhere and tells you the device is alive and routing. This is the original use case ping was designed for in 1983 and it is still the use case it's best at.
- Bare-metal physical-host liveness. Co-located servers that have no public HTTP surface, no exposed application ports, but exist on a public IP and answer ICMP. Useful for "is the box powered on and on the network" in lieu of any application-level signal.
- IoT gateways, printers, network appliances. A network printer at a remote office, a building HVAC controller, an industrial IoT gateway, a UPS with a management NIC. These devices often expose vendor-specific protocols you can't probe directly but respond to ICMP reliably. Ping is the lowest-common-denominator "is the device on the network" check.
- RTT-based capacity and weather monitoring. When the latency itself is the signal you care about, not reachability. Baselining the path from your edge POP to a peering partner, watching for transit congestion on a known route, detecting BGP rerouting that increased hop count. Pair the Ping probe with a Degraded threshold tuned to your typical RTT plus headroom, and you have cheap, continuous path-quality monitoring.
Notice what is not on that list. Web apps. APIs. Databases. Kubernetes services. Anything fronted by a cloud load balancer. If your check target is "an application", ping is the wrong shape.
The wider set where ping is wrong
By volume, more ping probes in the world are pointed at the wrong targets than at the right ones. The pattern is always the same: someone wants to monitor a service, picks ping because it's the first probe they learned, and now has a network reachability monitor masquerading as a service monitor.
- Any cloud-LB-fronted service. AWS ALB/NLB, GCP Cloud Load Balancing, Azure Application Gateway, Cloudflare, Fastly, Akamai. Some of these respond to ping, some don't, and the answer changes without notice as vendors update their edge fleets. The ALB you're pinging today might be replaced with one that drops ICMP tomorrow and your probe will flip Down for a non-incident. Use HTTP.
- Modern Kubernetes ingress. Whatever your ingress controller is (nginx, Traefik, Istio, Contour), it terminates HTTP/HTTPS. The pod's network namespace may or may not answer ICMP — it depends on the CNI, the NetworkPolicy, and whether the node's iptables happen to forward ICMP to the pod. None of that is what you want to assert. Use HTTP against the Service or Ingress URL.
- CDN-cached endpoints. A ping to cdn.example.com hits the nearest edge POP, which might be 5 ms away and answer ICMP in its sleep. That tells you nothing about whether the cache is serving fresh content, whether the origin is reachable from the edge, or whether the configured TLS cert is valid. Use HTTP and assert on a known cache header.
- SaaS APIs. If the API you depend on is api.stripe.com or graph.microsoft.com or any other vendor endpoint, ping monitoring tells you nothing the vendor's own status page doesn't already tell you, and several things that aren't true (anycast edge answering on behalf of an entire region). Use HTTP against a real endpoint with a real auth token.
Latency and loss thresholds: Degraded vs Down
The biggest config mistake on ping monitors is treating any RTT spike as Down. RTT and loss are different signals and they want different alerting rules.
RTT spike alone is not Down. If your normal RTT to a router is 40 ms and it jumps to 180 ms, the path is congested or has rerouted — the device is still reachable, traffic is still flowing, your users may notice slowness but nothing is broken in a "wake the on-call at 03:00" sense. That's Degraded, not Down. Set the latency threshold at 2-4x your normal RTT and route Degraded to a low-severity channel.
Partial loss is Degraded. Real internet paths legitimately drop packets. A wireless mesh hop, a congested transit link, a hardware queue overflow on a backbone router: any of these will show up as 10-40% loss on the ICMP path even though TCP retransmits hide it from your users. The right rule: if loss is at or above your threshold (we default to 20%, which is one of five packets), the probe goes Degraded. If loss is 100% for three consecutive cycles, then it's Down.
Total loss for N cycles is Down. One cycle of 100% loss is a hiccup. Three cycles in a row is an outage. The "consecutive failures" debounce is what keeps a single bad minute from paging anyone. Set N=3 on a 60-second interval and you'll page on a 3-minute total outage and ignore the 60-second blip.
A worked example. Suppose fping -c 10 -p 1000 router-1.colo normally reports:
router-1.colo : xmt/rcv/%loss = 10/10/0%, min/avg/max = 38.2/40.1/42.7
Sensible thresholds for that target: Degraded above 120 ms RTT (3x normal), Degraded at or above 20% loss (2 or more of 10 packets dropped), Down on 100% loss for 3 consecutive cycles. RTT spike alone won't page. Two packets lost will move it to Degraded on the dashboard but won't page. Three minutes of total silence pages.
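The three rules can be sketched as a small state machine. This is an illustration under stated assumptions, not StatusPulse's actual logic: the class name PingClassifier is hypothetical, loss at or above the threshold counts as Degraded (so 2 of 10 lost packets trips it, as in the worked example), and a cycle of total loss that has not yet reached the debounce count is reported as Degraded.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class PingClassifier:
    """Hypothetical Degraded/Down classifier for one ping target."""

    rtt_degraded_ms: float = 120.0   # 3x the ~40 ms baseline
    loss_degraded_pct: float = 20.0  # 2 of 10 packets
    down_after_cycles: int = 3       # consecutive cycles of 100% loss
    _total_loss_streak: int = field(default=0, init=False, repr=False)

    def observe(self, avg_rtt_ms: Optional[float], loss_pct: float) -> str:
        """Feed one check cycle; return 'Up', 'Degraded', or 'Down'."""
        if loss_pct >= 100.0:
            self._total_loss_streak += 1
            if self._total_loss_streak >= self.down_after_cycles:
                return "Down"
            # Assumption: a not-yet-debounced blackout shows as Degraded.
            return "Degraded"
        # Any reply at all resets the debounce counter.
        self._total_loss_streak = 0
        if loss_pct >= self.loss_degraded_pct:
            return "Degraded"
        if avg_rtt_ms is not None and avg_rtt_ms > self.rtt_degraded_ms:
            return "Degraded"
        return "Up"
```

On a 60-second interval, three consecutive calls with 100% loss produce Degraded, Degraded, Down, which matches the "three minutes of total silence pages" rule; only the Down result would be routed to the paging channel.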
IPv4 vs IPv6, and when dual-stack lies
If your network reachability monitor only pings IPv4, you are missing half the failure surface. Modern services are dual-stack — they have both an A and an AAAA record — and the two stacks fail independently. The classic gotcha: IPv4 works, IPv6 is silently broken (or vice versa), and because most ping tools default to v4, the monitor reports "fine" while half your users are seeing connection failures.
The standard ping tool defaults to whichever address family getaddrinfo returns first, which depends on system config and is almost never what you want. Force the family explicitly:
$ ping -4 -c 3 example.com
PING example.com (93.184.216.34) 56(84) bytes of data.
64 bytes from 93.184.216.34: icmp_seq=1 ttl=56 time=18.4 ms
...
$ ping -6 -c 3 example.com
PING example.com(2606:2800:220:1:248:1893:25c8:1946) 56 data bytes
64 bytes from 2606:2800:220:1:248:1893:25c8:1946: icmp_seq=1 ttl=55 time=18.6 ms
...
Three rules that survive contact with real dual-stack:
- If you serve IPv6, monitor IPv6. Run two probes against the same hostname, one with v4 preference, one with v6 preference. Each one targets a different address family and they alert independently. The StatusPulse Ping probe has a Prefer IPv6 toggle for exactly this — flip it on for the v6 probe and leave it off for the v4 one.
- Dual-stack-lies is a real failure mode. An ISP can have a v6 routing problem that affects only certain transit paths; a CDN can deploy a config that breaks AAAA resolution on a subset of POPs; an upstream firewall can be rate-limiting ICMPv6 Echo because someone copied an iptables rule without translating it to ip6tables. None of these show up on a v4-only probe.
- RTTs are not the same. IPv6 paths often traverse different transit than IPv4 — sometimes shorter, sometimes longer. Set RTT thresholds per probe; don't copy the v4 threshold to v6 without baselining.
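One way to sketch the "two probes, one per family" rule is to resolve each address family separately and build an explicit ping command for each. The helper names addresses_for and ping_argv are hypothetical; the -4 and -6 flags are the ones used in the transcripts above.

```python
import socket
from typing import List


def addresses_for(host: str, family: socket.AddressFamily) -> List[str]:
    """Return the distinct addresses `host` resolves to in one family."""
    try:
        infos = socket.getaddrinfo(
            host, None, family=family, type=socket.SOCK_STREAM
        )
    except socket.gaierror:
        # No record in this family, or resolution failed outright.
        return []
    seen: List[str] = []
    for info in infos:
        addr = info[4][0]  # sockaddr tuple starts with the address string
        if addr not in seen:
            seen.append(addr)
    return seen


def ping_argv(host: str, family: socket.AddressFamily, count: int = 5) -> List[str]:
    """Build a ping command that forces one address family."""
    flag = "-6" if family == socket.AF_INET6 else "-4"
    return ["ping", flag, "-c", str(count), host]
```

An empty list from addresses_for(host, socket.AF_INET6) for a service that is supposed to be dual-stack is itself worth alerting on, separately from any ICMP result.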
Common failure modes
Things ICMP packet loss monitoring will hand you that look like outages but aren't, and the inverse — things that look fine but aren't:
- Firewall silently drops ICMP. The most common false-Down. The host is up, the application is serving traffic, and ICMP echo gets dropped by the network firewall or the host firewall with no rejection sent back (ICMP has no equivalent of a TCP RST). You see 100% loss, the probe flips Down, the user-facing service is completely fine. Cloud hosts (EC2, Azure VM, GCE) are often in this category by default. If you're getting 100% loss but the service works, this is the first thing to check, and the right answer is usually to switch to a TCP or HTTP probe rather than fight the firewall.
- Dual-stack misconfig: v4 succeeds, v6 silently broken. Covered above; the failure mode where ping example.com from your laptop works fine (because your laptop is on v4) and a measurable slice of your users on v6-preferred ISPs (mobile carriers, especially) see timeouts. Only caught by an explicit v6 probe.
- Sleeping interfaces. Some embedded devices, low-power IoT gateways, and laptop NICs in power-save mode answer the first ICMP echo with a multi-hundred-millisecond delay because the radio or interface was idle. The first packet of a 5-packet batch shows up as 800 ms; the next four are 12 ms each. Don't set the RTT threshold based on the max; use the average, and if the average is dragged up by the wakeup packet, raise Packets per check to 10 so the outlier dilutes.
- Rate-limited ICMP responses. Linux kernels rate-limit outgoing ICMP (net.ipv4.icmp_ratelimit, default 1000 ms, applied to the message types selected by net.ipv4.icmp_ratemask; echo replies are not in the default mask but can be added to it). Cisco IOS, Junos, and most enterprise routers limit ICMP replies as well. Fire a high-frequency probe at one of these devices and you'll see synthetic packet loss that isn't real loss: the device is dropping replies to protect itself. Keep the per-check rate sane (5-10 packets per minute is plenty) and don't probe routers at sub-second intervals.
- Reverse-path filtering. The probe's source IP is on a network the target doesn't have a route back to (think: anycast source, asymmetric routing, NAT misconfig). Either the echo request is discarded by a strict reverse-path filter on the way in, or it arrives and the reply gets dropped on the way back. Looks like 100% loss from the probe, looks fine from anywhere with a normal return path. mtr from the probe region is the diagnostic: it'll show you exactly which hop stopped returning.
mtr in particular is the right diagnostic to reach for when a Ping probe reports loss and you want to know whether the loss is at the target or somewhere on the path:
$ mtr --report --report-cycles 30 -4 router-1.colo
HOST: probe-eu-1 Loss% Snt Last Avg
1.|-- 10.0.0.1 0.0% 30 0.4 0.5
2.|-- 100.64.0.1 0.0% 30 1.2 1.4
3.|-- ae-3.r01.lhr01.example 0.0% 30 8.2 8.4
4.|-- ae-12.r02.fra04.example 46.7% 30 38.1 38.5
5.|-- router-1.colo 0.0% 30 40.2 40.8
Hop 4 is showing 46.7% loss but hop 5 is at 0%. That's almost always rate-limited ICMP at the intermediate router replying to TTL-expired packets, not real loss, because hop 5 (the actual destination) is receiving every packet. The Ping probe would show 0% loss to router-1.colo and be correct.
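The reading rule above (loss only counts if it persists to the final hop) can be sketched in a few lines. This assumes the mtr --report layout shown in the transcript; Hop, parse_mtr_report, and the two helper functions are hypothetical names for illustration.

```python
import re
from typing import List, NamedTuple


class Hop(NamedTuple):
    index: int
    host: str
    loss_pct: float


# Matches report rows such as "4.|-- ae-12.r02.fra04.example 46.7% 30 ..."
HOP_RE = re.compile(r"^\s*(\d+)\.\|--\s+(\S+)\s+([\d.]+)%")


def parse_mtr_report(report: str) -> List[Hop]:
    """Parse hop rows from mtr --report text, skipping the header line."""
    hops = []
    for line in report.splitlines():
        m = HOP_RE.match(line)
        if m:
            hops.append(Hop(int(m.group(1)), m.group(2), float(m.group(3))))
    return hops


def loss_reaches_destination(hops: List[Hop]) -> bool:
    """True only if the final hop (the target itself) shows loss."""
    return bool(hops) and hops[-1].loss_pct > 0.0


def likely_rate_limited(hops: List[Hop]) -> List[Hop]:
    """Intermediate hops with loss while the destination shows none."""
    if not hops or loss_reaches_destination(hops):
        return []
    return [h for h in hops[:-1] if h.loss_pct > 0.0]
```

Run against the report above, likely_rate_limited returns only hop 4, and loss_reaches_destination is False, which is the same conclusion reached by eye.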
Wrap-up
ICMP ping monitoring is a sharp tool with a narrow blade. Use it for routers, switches, gateways, bare-metal liveness, embedded devices, and RTT-based path quality, and it will serve you for decades. Point it at a web app and it will lie to you until the day you stop using it.
The rules that survive a real on-call rotation: pick ping when the target is network infrastructure, not an application; treat RTT, packet loss, and total loss as three separate alerting tiers; require N consecutive cycles of 100% loss before paging anyone Down; probe IPv4 and IPv6 separately if you serve both; and the moment you find yourself fighting a firewall to get ICMP through, switch to a TCP or HTTP probe instead. See the Ping probe documentation for the exact fields and defaults StatusPulse exposes.