WebSocket health monitoring beyond TCP checks

A TCP port can be open while every WebSocket connection drops three seconds after upgrade. A load balancer can answer 200 OK while the Upgrade header is silently stripped. The difference between "the socket is healthy" and "we can ping the host" is several layers of protocol, and every one of them can fail independently. This is what a real WebSocket uptime check looks like.

Published 2026-05-22 · ~10 min read · StatusPulse Team

Why TCP-up doesn't mean WS-up

Every team running a real-time product has the same support ticket in its history: "everything looks green, but our users say chat is broken." The dashboard shows the host responding, the TCP port probe is happily completing three-way handshakes, the HTTP front page returns 200. And the WebSocket layer is on fire.

The reason is that WebSocket connectivity depends on at least five separate pieces of infrastructure doing the right thing, and a TCP-up check exercises one of them. Here are the failure modes a TCP probe will happily certify as green:

The load balancer accepts the TCP connection but doesn't forward the upgrade. NGINX in its default config strips Upgrade and Connection as hop-by-hop headers. Without the explicit proxy_set_header rules on the location block, the backend never sees the upgrade request. TCP probe green; WebSocket probe gets a 200 where it expected a 101.
The server returns 200 instead of 101. The endpoint is alive but isn't WebSocket-aware on that path — maybe the route was removed, maybe the deployment rolled back to a build without WebSocket support. To a TCP probe this is indistinguishable from a healthy upgrade.
Subprotocol negotiation fails. Your graphql-ws client connects to a server that only speaks graphql-transport-ws after a library upgrade. The TCP connects, the 101 returns, and every actual subscription times out because the framing conventions don't match.
The auth header is rejected post-handshake. The upgrade itself succeeds — bearer token validated, 101 returned — and then the server immediately closes with code 4401 because the token's scopes don't include the channel you're trying to join. A TCP probe doesn't get close to this layer.
An idle timeout kills the socket at 60s. The probe connection succeeds, but every real user session dies after a minute because the upstream timeout in the proxy is shorter than your keep-alive ping. A point-in-time TCP probe sees a 50ms green check.
A firewall passes connect but drops frames. Stateful firewalls and deep packet inspection appliances sometimes allow the TCP handshake and the HTTP upgrade but drop the binary frames that follow. The 101 returns; the first connection_init never gets through.

Every one of these is a real outage that paged a real on-call, and every one passes a TCP-up check. A WebSocket uptime check has to speak the protocol: do the upgrade, validate the 101, optionally assert the subprotocol, optionally exchange a frame. Anything less doesn't reproduce what your users' browsers are doing.

The 3-phase WebSocket health check

A correct WebSocket health probe runs through three logically distinct phases on every check. Each phase can fail independently, each has its own latency budget, and each points at a different layer of your stack when it goes wrong.

Phase 1 — TCP + TLS connect. DNS resolution of the host, TCP three-way handshake, and (for wss://) the TLS handshake including SNI and ALPN negotiation. This is identical to what an HTTPS probe would do for the same hostname. If this fails, the problem is at the network or TLS layer — bad DNS, firewall block, expired cert, no SNI route, port closed.

Phase 2 — HTTP upgrade. The probe sends a real GET with the WebSocket-handshake headers and validates the server's response. As discussed in our HTTP monitoring guide, every WebSocket connection starts as an HTTP/1.1 request that asks to switch protocols mid-stream. The wire looks like this:

GET /graphql HTTP/1.1
Host: api.example.com
Upgrade: websocket
Connection: Upgrade
Sec-WebSocket-Key: <16-byte base64 nonce>
Sec-WebSocket-Version: 13
Sec-WebSocket-Protocol: graphql-transport-ws
Origin: https://app.example.com
Authorization: Bearer eyJhbGciOiJI…

A healthy server replies with exactly one status line and a small set of headers:

HTTP/1.1 101 Switching Protocols
Upgrade: websocket
Connection: Upgrade
Sec-WebSocket-Accept: s3pPLMBiTxaQ9kYGzzhZRbK+xOo=
Sec-WebSocket-Protocol: graphql-transport-ws

101 Switching Protocols is the only acceptable response. A 200 means the route isn't WebSocket-aware. A 302 means the proxy is redirecting (it shouldn't: wss:// URLs are followed as-is). A 401 or 403 means the auth header was rejected before the upgrade. Any other 4xx or 5xx means something else broke. The Sec-WebSocket-Accept value must equal the base64-encoded SHA-1 of the client's Sec-WebSocket-Key concatenated with the magic GUID 258EAFA5-E914-47DA-95CA-C5AB0DC85B11. If it doesn't match, the server is misbehaving or you're not actually talking to a WebSocket implementation.
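The response checks above fit in a few lines of Python. This is a minimal sketch, not StatusPulse's implementation: the names expected_accept and check_upgrade_response and the diagnosis strings are made up for illustration, and the response is assumed to be already parsed into an integer status and a dict of lower-cased header names.

```python
import base64
import hashlib

# Fixed GUID that RFC 6455 defines for the Sec-WebSocket-Accept computation.
WS_GUID = "258EAFA5-E914-47DA-95CA-C5AB0DC85B11"


def expected_accept(sec_websocket_key: str) -> str:
    """Return base64(SHA-1(key + GUID)), the value the server must echo back."""
    digest = hashlib.sha1((sec_websocket_key + WS_GUID).encode("ascii")).digest()
    return base64.b64encode(digest).decode("ascii")


def check_upgrade_response(status: int, headers: dict, sent_key: str) -> str:
    """Return "ok" or a short diagnosis (hypothetical wording)."""
    if status == 200:
        return "route is not WebSocket-aware (200 instead of 101)"
    if 300 <= status < 400:
        return "proxy is redirecting the upgrade request"
    if status in (401, 403):
        return "auth rejected before the upgrade"
    if status != 101:
        return f"unexpected status {status}"
    # Only a 101 reaches this point; now verify the accept hash.
    if headers.get("sec-websocket-accept") != expected_accept(sent_key):
        return "Sec-WebSocket-Accept mismatch"
    return "ok"
```

Feeding expected_accept the sample key from RFC 6455, dGhlIHNhbXBsZSBub25jZQ==, yields the s3pPLMBiTxaQ9kYGzzhZRbK+xOo= value shown in the example response.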

Phase 3 — RTT frame exchange (optional). After the 101, the probe sends a small text frame and waits for a response. Two reception modes cover the field: Echo for echo servers where the reply must equal what was sent, and Any frame for servers that emit a welcome/handshake message on connect (Phoenix Channels, Hasura, GraphQL subscription endpoints). The phase passes if a non-control frame arrives within the configured RTT timeout.
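To make the frame exchange concrete, here is a rough Python sketch of the framing a probe needs for Phase 3. It is deliberately limited to short payloads (under 126 bytes) in a single frame. The function names and the mode strings "echo" and "any" are assumptions for illustration, not the product's actual configuration values.

```python
import os

OP_TEXT, OP_BINARY = 0x1, 0x2


def encode_text_frame(text: str, mask_key: bytes = b"") -> bytes:
    """Build one masked client-to-server text frame (payload < 126 bytes)."""
    payload = text.encode("utf-8")
    if len(payload) >= 126:
        raise ValueError("sketch only handles short payloads")
    mask_key = mask_key or os.urandom(4)  # clients must mask every frame
    masked = bytes(b ^ mask_key[i % 4] for i, b in enumerate(payload))
    # 0x81 = FIN bit + text opcode; 0x80 | length = MASK bit + 7-bit length
    return bytes([0x81, 0x80 | len(payload)]) + mask_key + masked


def decode_short_frame(data: bytes) -> tuple:
    """Decode an unmasked server frame with a 7-bit length: (opcode, payload)."""
    opcode = data[0] & 0x0F
    length = data[1] & 0x7F
    if data[1] & 0x80 or length >= 126:
        raise ValueError("sketch only handles short unmasked frames")
    return opcode, data[2:2 + length]


def rtt_phase_passes(mode: str, sent: str, opcode: int, payload: bytes) -> bool:
    """Echo: the reply must equal what was sent. Any frame: any data frame passes."""
    if opcode not in (OP_TEXT, OP_BINARY):
        return False  # control frames (close, ping, pong) do not count
    if mode == "echo":
        return payload == sent.encode("utf-8")
    return mode == "any"
```

A production probe would also need extended payload lengths, fragmentation, and ping/pong handling, all of which this sketch skips.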

Three phases, three failure boundaries. A probe that collapses them into "is the WebSocket up?" loses the information that makes diagnosis possible.
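One way to keep the phase boundaries visible in code is to give each phase its own function and its own timer. The sketch below covers Phases 1 and 2 using only the Python standard library and leaves the optional frame exchange out. The function names, return shapes, and the 5-second default timeout are assumptions for illustration, and error handling is intentionally thin.

```python
import base64
import hashlib
import os
import socket
import ssl
import time

WS_GUID = "258EAFA5-E914-47DA-95CA-C5AB0DC85B11"


def phase1_connect(host: str, port: int, use_tls: bool, timeout: float = 5.0):
    """Phase 1: DNS + TCP connect, plus TLS for wss://. Returns (socket, ms)."""
    start = time.monotonic()
    sock = socket.create_connection((host, port), timeout=timeout)
    if use_tls:
        context = ssl.create_default_context()
        sock = context.wrap_socket(sock, server_hostname=host)  # sends SNI
    return sock, (time.monotonic() - start) * 1000


def phase2_upgrade(sock, host: str, path: str = "/"):
    """Phase 2: send the upgrade, require 101 and a valid accept hash.

    Returns (ok, detail, ms). Bytes read past the headers are discarded here;
    a real probe would keep them for the frame exchange.
    """
    start = time.monotonic()
    key = base64.b64encode(os.urandom(16)).decode("ascii")
    request = (
        f"GET {path} HTTP/1.1\r\n"
        f"Host: {host}\r\n"
        "Upgrade: websocket\r\n"
        "Connection: Upgrade\r\n"
        f"Sec-WebSocket-Key: {key}\r\n"
        "Sec-WebSocket-Version: 13\r\n"
        "\r\n"
    )
    sock.sendall(request.encode("ascii"))
    raw = b""
    while b"\r\n\r\n" not in raw:
        chunk = sock.recv(4096)
        if not chunk:  # peer closed before finishing the headers
            break
        raw += chunk
    elapsed_ms = (time.monotonic() - start) * 1000

    lines = raw.split(b"\r\n\r\n", 1)[0].decode("latin-1").split("\r\n")
    parts = lines[0].split(" ", 2)
    status = parts[1] if len(parts) > 1 else "no response"
    headers = {}
    for line in lines[1:]:
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()

    if status != "101":
        return False, f"Server returned {status}; expected 101 Switching Protocols", elapsed_ms
    accept = base64.b64encode(
        hashlib.sha1((key + WS_GUID).encode("ascii")).digest()
    ).decode("ascii")
    if headers.get("sec-websocket-accept") != accept:
        return False, "Sec-WebSocket-Accept does not match the key we sent", elapsed_ms
    return True, "101 Switching Protocols", elapsed_ms
```

Because each function returns its own elapsed time, a caller can report connect and handshake latency separately and attribute a failure to the phase that raised or returned it.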

Per-phase latency: connect, handshake, RTT

The reason for splitting the phases at all is that each one measures a different part of your stack, and you almost always want to alert on them separately.

Connect latency rising means DNS, the load balancer, the TCP path, or TLS handshake are slow. It moves with infrastructure events — a new cert chain that adds an OCSP round-trip, a misconfigured DNS server, an upstream proxy under pressure. It rarely has anything to do with your application code.
Handshake latency rising (the time from end-of-TLS to receipt of the 101) points squarely at your application's upgrade path. The framework is slow to negotiate the subprotocol, an auth middleware is making a slow database call before deciding whether to allow the upgrade, or a connection pool is still cold after startup.
RTT latency rising tells you the connection is up but the application behind it is struggling. The first frame takes longer to compute, the subscription engine is backed up, a downstream message bus is slow. This is the user-visible "everything feels laggy" signal.

Alerting on a single end-to-end timer makes you choose between false alarms (if you set it tight enough to catch application slowness, network jitter pages you) and missed incidents (if you set it loose enough to ride out jitter, real application regressions hide inside the budget). Separate budgets per phase fix this. A reasonable default shape: Handshake degraded at 1000 ms, RTT degraded at 200 ms, RTT timeout (the hard Down line) at 5000 ms. Connect doesn't usually need its own threshold because the probe-orchestration timeout already covers it.
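As an illustration of separate budgets, the default shape above could be evaluated like this. It is a sketch under assumptions: the Thresholds and evaluate names are hypothetical, and checking the hard Down line before the degraded lines is one reasonable ordering, not a documented rule.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Thresholds:
    handshake_degraded_ms: float = 1000.0
    rtt_degraded_ms: float = 200.0
    rtt_timeout_ms: float = 5000.0  # the hard Down line


def evaluate(handshake_ms: float, rtt_ms: Optional[float],
             limits: Thresholds = Thresholds()) -> tuple:
    """Return (status, reason). rtt_ms is None when no frame arrived at all."""
    if rtt_ms is None or rtt_ms >= limits.rtt_timeout_ms:
        return "down", "rtt timeout"
    if handshake_ms >= limits.handshake_degraded_ms:
        return "degraded", "handshake slow"
    if rtt_ms >= limits.rtt_degraded_ms:
        return "degraded", "rtt slow"
    return "up", ""
```

The reason string is what makes the split useful: "handshake slow" and "rtt slow" send the on-call to different layers even though both show up as degraded.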

Subprotocol negotiation (graphql-ws, mqtt, wamp)

The Sec-WebSocket-Protocol header is the bit of the handshake that distinguishes "a WebSocket is open" from "the right application protocol is being spoken over it". It's also the one piece of the handshake that, if you don't assert on it, will let a Hello-World endpoint pass for your real production service.

The client sends a comma-separated list of subprotocols it's prepared to speak; the server picks one and echoes it back. Real-world tokens you'll meet:

graphql-transport-ws — the current GraphQL subscriptions protocol used by graphql-ws library 5.x, Apollo Server 4+, Hasura, AWS AppSync.
graphql-ws — the legacy subprotocol from the subscriptions-transport-ws library. Many deployments are mid-migration; graphql-ws monitoring needs to assert which one your server is actually negotiating because the wire framing differs.
mqtt and mqttv3.1 — MQTT over WebSocket for IoT brokers (HiveMQ, EMQX, AWS IoT Core).
wamp.2.json / wamp.2.msgpack — Web Application Messaging Protocol, used by Crossbar.io and Autobahn-based stacks.
signalr — Microsoft SignalR for ASP.NET Core real-time hubs.

If you don't tell the probe what subprotocol you expect, anything that returns a 101 will pass — including a placeholder route someone deployed during a Friday firefight. With the assertion on, a deployment that regresses from graphql-transport-ws back to graphql-ws (or vice versa) flips the probe Down within minutes, not whenever a customer complains.
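A subprotocol assertion is a small amount of logic. The sketch below assumes the probe already has the header value it offered and the value the server returned (or None when the header was absent). The function names and messages are hypothetical.

```python
from typing import Optional


def parse_offer(header_value: str) -> list:
    """Split a client Sec-WebSocket-Protocol header into its tokens."""
    return [token.strip() for token in header_value.split(",") if token.strip()]


def assert_subprotocol(offered: str, selected: Optional[str], expected: str) -> tuple:
    """Check the server's selected subprotocol against the expected token."""
    if selected is None:
        return False, "server selected no subprotocol"
    if selected not in parse_offer(offered):
        return False, f"server selected {selected!r}, which was never offered"
    if selected != expected:  # comparison is case-sensitive on purpose
        return False, f"expected {expected!r}, got {selected!r}"
    return True, "ok"
```

Offering both graphql-transport-ws and graphql-ws while expecting only the first is what lets the probe notice a regression to the legacy protocol instead of failing the handshake outright.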

Phoenix Channels monitoring is a slightly different shape: Phoenix doesn't use a custom subprotocol, but it does send a join/welcome message immediately after the upgrade. For Phoenix, set the probe to Level 3 (the Phase 3 frame exchange) with Any frame reception mode and a short RTT timeout. The assertion is "the welcome arrives", not "a specific subprotocol was chosen".

Custom headers and auth on the upgrade

One quirk of the WebSocket protocol that catches a lot of teams off guard: authentication happens on the upgrade request, not after the connection is open. The browser WebSocket API can't set arbitrary headers on the upgrade request, but every server-side client (including a monitoring probe) can, and that's where bearer tokens, API keys, and tenant routing headers go.

A typical authenticated upgrade looks like this:

GET /ws HTTP/1.1
Host: api.example.com
Upgrade: websocket
Connection: Upgrade
Sec-WebSocket-Key: <16-byte base64 nonce>
Sec-WebSocket-Version: 13
Authorization: Bearer eyJhbGciOiJIUzI1NiIs…
X-Tenant-Id: acme-prod
Origin: https://app.example.com

If the bearer token is invalid the server replies 401 Unauthorized — not a graceful WebSocket close, because the WebSocket connection was never opened. That's why a probe that only does TCP-up cannot detect auth regressions: the layer where auth is checked is one above what TCP-up measures. A probe that does the real upgrade catches token expiry, signing-key rotations gone wrong, scope-claim regressions, and tenant-routing misconfigurations — every one of them surfaces as a 401 or 403 in place of the expected 101.

Two implementation details worth pinning. First, the auth header is a bearer credential — store it encrypted at rest. StatusPulse's WebSocket probe holds the Authorization value AES-GCM-encrypted with a Key Vault master key, never logs it, never echoes it back after save. Second, Origin gets its own field because CSRF-aware servers reject the upgrade unless Origin matches an allowlist — browser-style enforcement applied to your probe. Set it to your real frontend URL.
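For a sense of what this looks like on the client side, here is a small Python sketch that assembles an authenticated upgrade request and produces a log-safe copy of the custom headers. The function names are hypothetical, and the token in the usage example is a placeholder. Encryption at rest is out of scope for the sketch.

```python
import base64
import os
from typing import Optional


def build_upgrade_request(host: str, path: str, extra_headers: dict,
                          key: Optional[str] = None) -> bytes:
    """Assemble the HTTP/1.1 upgrade request, including custom headers."""
    key = key or base64.b64encode(os.urandom(16)).decode("ascii")
    lines = [
        f"GET {path} HTTP/1.1",
        f"Host: {host}",
        "Upgrade: websocket",
        "Connection: Upgrade",
        f"Sec-WebSocket-Key: {key}",
        "Sec-WebSocket-Version: 13",
    ]
    for name, value in extra_headers.items():
        # Reject CR/LF so a configured value cannot smuggle extra headers.
        if any(ch in name + value for ch in "\r\n"):
            raise ValueError("CR/LF not allowed in header names or values")
        lines.append(f"{name}: {value}")
    return ("\r\n".join(lines) + "\r\n\r\n").encode("ascii")


def redact(headers: dict) -> dict:
    """Copy of the headers that is safe to log: Authorization is hidden."""
    return {
        name: ("<redacted>" if name.lower() == "authorization" else value)
        for name, value in headers.items()
    }
```

A call might look like build_upgrade_request("api.example.com", "/ws", {"Authorization": "Bearer test-token", "X-Tenant-Id": "acme-prod", "Origin": "https://app.example.com"}), with redact applied to the same dict before anything is written to a log.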

WebSocket over HTTP/2 (RFC 8441)

The classic WebSocket upgrade is HTTP/1.1 only because the HTTP/2 binary framing layer has no equivalent of the Connection: Upgrade mechanism. RFC 8441 (Bootstrapping WebSockets with HTTP/2, published 2018) closes that gap: an HTTP/2 server that sets SETTINGS_ENABLE_CONNECT_PROTOCOL = 1 in its settings frame can accept an extended CONNECT request that opens a WebSocket inside an HTTP/2 stream. That gives you WebSocket multiplexing over an existing HTTP/2 connection — no separate TCP/TLS handshake per socket — with the same upgrade flow conceptually but different wire framing.
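To show how the wire shape differs, the sketch below builds the header list for an RFC 8441 extended CONNECT request as plain name/value pairs. The function name is hypothetical, and sending the list through an actual HTTP/2 stack is left out. Note what is missing compared with the HTTP/1.1 handshake: no Upgrade, no Connection, and no Sec-WebSocket-Key, because the :protocol pseudo-header takes over that role.

```python
from typing import Optional, Sequence


def extended_connect_headers(authority: str, path: str,
                             subprotocols: Sequence[str] = (),
                             origin: Optional[str] = None,
                             secure: bool = True) -> list:
    """Header list for a WebSocket bootstrap over HTTP/2 (RFC 8441)."""
    headers = [
        (":method", "CONNECT"),
        (":protocol", "websocket"),
        (":scheme", "https" if secure else "http"),  # wss maps to https
        (":path", path),
        (":authority", authority),
        ("sec-websocket-version", "13"),
    ]
    if subprotocols:
        headers.append(("sec-websocket-protocol", ", ".join(subprotocols)))
    if origin:
        headers.append(("origin", origin))
    return headers
```

A server that accepts the request answers with a 2xx status on that stream rather than 101 Switching Protocols, and WebSocket frames then flow as the stream's data.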

.NET's ClientWebSocket supports RFC 8441 from .NET 7 onwards, and falls back to HTTP/1.1 transparently if the server doesn't advertise the setting. The catch: real-world support across reverse proxies and CDN edges is patchy. NGINX added it in 1.25.3 but only behind explicit configuration. Older HAProxy versions don't have it. Several CDN edges still negotiate down to HTTP/1.1.

The pragmatic rule: leave the probe on HTTP/1.1 by default. Flip to HTTP/2 only when you specifically know your server stack supports the extended CONNECT end-to-end and you want the probe to verify that the HTTP/2 path keeps working. If you flip it and the probe still passes, check the negotiated protocol version in the response detail. Over HTTP/2 a successful extended CONNECT is answered with a 2xx status rather than 101 Switching Protocols, so a 101 means the probe silently fell back to HTTP/1.1 and you're not actually testing what you intended to test.
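A fallback check can be as simple as comparing what was requested with what was negotiated. The sketch assumes the probe exposes the negotiated HTTP version as a string ("1.1" or "2") alongside the status code. Both the function name and that reporting shape are hypothetical. It relies on RFC 8441's rule that a successful extended CONNECT is answered with a 2xx status, while the HTTP/1.1 upgrade is answered with 101.

```python
def check_negotiated_version(requested: str, negotiated: str, status: int) -> tuple:
    """Return (ok, detail) for a probe configured for a given HTTP version."""
    if requested == "2" and negotiated != "2":
        return False, f"requested HTTP/2 but negotiated HTTP/{negotiated} (silent fallback)"
    if negotiated == "2":
        success = 200 <= status < 300  # extended CONNECT succeeds with 2xx
    else:
        success = status == 101        # classic upgrade succeeds with 101
    if not success:
        return False, f"unexpected status {status} over HTTP/{negotiated}"
    return True, f"WebSocket established over HTTP/{negotiated}"
```

Treating the fallback as a failure, rather than a warning, is a design choice that only makes sense for probes explicitly meant to guard the HTTP/2 path.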

Common failure modes

A short field guide to the patterns that show up most often in production WebSocket outages, what they look like in the probe, and where to point your debugging:

Load balancer strips the Upgrade header. The probe error is "Server returned 200; expected 101 Switching Protocols". The TCP probe stays green. Fix at the proxy: in NGINX, the location block needs proxy_http_version 1.1; plus proxy_set_header Upgrade $http_upgrade; and proxy_set_header Connection "upgrade";. In Azure Application Gateway, "WebSocket support" has to be explicitly enabled on the listener.
Rate-limiter triggers on connection attempts. A WAF or rate-limit rule counting connections-per-minute from a single IP flags the probe as abuse and starts returning 429s. The fix is either to allowlist the probe's source range or to relax the rule. StatusPulse publishes its outbound IP ranges per-region so you can allowlist precisely.
Idle timeout shorter than keep-alive. Your application sends keep-alive pings every 60s; the upstream proxy's idle timeout is 30s; real connections die mid-conversation. A short-lived probe doesn't catch this directly — but the RTT chart starts jittering as proxy reconnect overhead leaks into the latency budget. If RTT jitters while handshake stays flat, the connection layer is the suspect.
Custom binary subprotocol needs a real send/receive. A connect-and-101 probe will pass against a SignalR or MQTT-over-WS endpoint that's broken at the application layer. Level 3 (the Phase 3 frame exchange) with a protocol-specific init payload (e.g. {"type":"connection_init"} for graphql-ws) is what catches "the upgrade works but the framing handshake doesn't".
Subprotocol case mismatch. The probe compares the negotiated subprotocol token case-sensitively, so graphql-ws and GraphQL-WS are different tokens. If your assertion suddenly fails after a server upgrade, check whether the new library version normalised the casing.
HTTP/2 silent downgrade. See above. The probe doesn't break, but you're testing the HTTP/1.1 path instead of the HTTP/2 path you intended.
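For the protocol-specific init case in the list above, a GraphQL endpoint makes a compact example: the probe sends a connection_init message as its first text frame and treats a connection_ack reply as proof that the framing handshake works. The sketch below covers only the message construction and the reply check. The names are hypothetical, and servers that require auth parameters in the init payload would need more than this.

```python
import json

# First message a GraphQL-over-WebSocket client sends after the upgrade.
INIT_PAYLOAD = json.dumps({"type": "connection_init"})


def is_connection_ack(frame_text: str) -> bool:
    """True if the server's reply frame is a JSON connection_ack message."""
    try:
        message = json.loads(frame_text)
    except json.JSONDecodeError:
        return False  # not JSON at all: wrong protocol or a broken endpoint
    return isinstance(message, dict) and message.get("type") == "connection_ack"
```

Any other reply, including a close frame or a timeout, then shows up as an application-layer failure even though the 101 came back cleanly.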

For the full configuration surface — RTT modes, RTT timeout and degraded thresholds, custom JSON headers, ignore-cert toggle, the SSRF guard against 169.254.169.254 — see the WebSocket probe help page. And if you're building protocol-aware probes for sister stacks, the gRPC Health probe guide covers the same shape applied to gRPC's standard health-check service.

Wrap-up

"The TCP port is open" is not a WebSocket health check. A real websocket health probe has to speak the protocol — perform the upgrade, validate the 101 Switching Protocols response, optionally assert the negotiated subprotocol, and optionally exchange a frame to confirm the application layer behind the socket is responsive. Anything less leaves you discovering chat, collaboration, trading, or multiplayer outages from support tickets instead of from your monitoring.

The discipline is in the per-phase split. Connect, handshake, and RTT latency point at three different layers, and a probe that surfaces them separately turns "WebSocket is slow" into "the upgrade endpoint is slow but the application is fine" in a single chart glance. Subprotocol assertion catches the silent library-version regression; the upgrade-time auth header catches token and scope regressions; HTTP/2 (RFC 8441) is there when your stack supports it end-to-end. Wire those in and the "TCP-up but socket-down" class of incident moves from invisible to obvious.
