WebSocket health monitoring beyond TCP checks
A TCP port can be open while every WebSocket connection drops
three seconds after upgrade. A load balancer can answer 200 OK
while the Upgrade header is silently stripped. The
difference between "the socket is healthy" and "we can ping the
host" is several layers of protocol, and every one of them can
fail independently. This is what a real WebSocket uptime check
looks like.
Why TCP-up doesn't mean WS-up
Every team running a real-time product has the same support ticket in its history: "everything looks green, but our users say chat is broken." The dashboard shows the host responding, the TCP port probe is happily completing three-way handshakes, the HTTP front page returns 200. And the WebSocket layer is on fire.
The reason is that WebSocket connectivity depends on at least five separate pieces of infrastructure doing the right thing, and a TCP-up check exercises one of them. Here are the failure modes a TCP probe will happily certify as green:
- The load balancer accepts the TCP connection but doesn't forward the upgrade. NGINX in its default config strips Upgrade and Connection as hop-by-hop headers. Without the explicit proxy_set_header rules on the location block, the backend never sees the upgrade request. TCP probe green; WebSocket probe gets a 200 where it expected a 101.
- The server returns 200 instead of 101. The endpoint is alive but isn't WebSocket-aware on that path: maybe the route was removed, maybe the deployment rolled back to a build without WebSocket support. To a TCP probe this is indistinguishable from a healthy upgrade.
- Subprotocol negotiation fails. Your graphql-ws client connects to a server that only speaks graphql-transport-ws after a library upgrade. The TCP connects, the 101 returns, and every actual subscription times out because the framing conventions don't match.
- The auth header is rejected post-handshake. The upgrade itself succeeds (bearer token validated, 101 returned) and then the server immediately closes with code 4401 because the token's scopes don't include the channel you're trying to join. A TCP probe doesn't get close to this layer.
- An idle timeout kills the socket at 60s. The probe connection succeeds, but every real user session dies after a minute because the upstream timeout in the proxy is shorter than your keep-alive ping. A point-in-time TCP probe sees a 50ms green check.
- A firewall passes connect but drops frames. Stateful firewalls and deep packet inspection appliances sometimes allow the TCP handshake and the HTTP upgrade but drop the binary frames that follow. The 101 returns; the first connection_init never gets through.
Every one of these is a real outage that paged a real on-call, and every one passes a TCP-up check. A WebSocket uptime check has to speak the protocol: do the upgrade, validate the 101, optionally assert the subprotocol, optionally exchange a frame. Anything less doesn't reproduce what your users' browsers are doing.
The 3-phase WebSocket health check
A correct WebSocket health probe runs through three logically distinct phases on every check. Each phase can fail independently, each has its own latency budget, and each points at a different layer of your stack when it goes wrong.
Phase 1 — TCP + TLS connect. DNS resolution
of the host, TCP three-way handshake, and (for
wss://) the TLS handshake including SNI and ALPN
negotiation. This is identical to what an HTTPS probe would
do for the same hostname. If this fails, the problem is at
the network or TLS layer — bad DNS, firewall block, expired
cert, no SNI route, port closed.
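Phase 1 can be sketched with the Python standard library alone. This is a minimal illustration, not StatusPulse's implementation: the function names, the failure labels, and the 5-second default timeout are all assumptions made for the example.

```python
import socket
import ssl
import time


def classify_connect_error(exc: BaseException) -> str:
    """Map a Phase 1 exception to a coarse failure label (labels are illustrative)."""
    if isinstance(exc, socket.gaierror):
        return "dns"
    if isinstance(exc, ssl.SSLCertVerificationError):
        return "certificate"
    if isinstance(exc, ssl.SSLError):
        return "tls"
    if isinstance(exc, ConnectionRefusedError):
        return "port-closed"
    if isinstance(exc, (socket.timeout, TimeoutError)):
        return "timeout"
    return "network"


def connect_phase(host: str, port: int = 443, timeout: float = 5.0):
    """Run DNS + TCP + TLS and return (tls_socket, elapsed_ms).

    Failures propagate as OSError / ssl.SSLError so the caller can pass
    them to classify_connect_error().
    """
    context = ssl.create_default_context()
    # The classic WebSocket upgrade runs over HTTP/1.1, so offer that via ALPN.
    context.set_alpn_protocols(["http/1.1"])
    started = time.monotonic()
    raw = socket.create_connection((host, port), timeout=timeout)
    try:
        # server_hostname sets SNI and enables hostname verification.
        tls = context.wrap_socket(raw, server_hostname=host)
    except Exception:
        raw.close()
        raise
    elapsed_ms = (time.monotonic() - started) * 1000.0
    return tls, elapsed_ms
```

A firewall that silently drops packets usually surfaces here as a timeout rather than a refusal, which is why the two get separate labels.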
Phase 2 — HTTP upgrade. The probe sends a
real GET with the WebSocket-handshake headers
and validates the server's response. As discussed in our
HTTP monitoring guide,
every WebSocket connection starts as an HTTP/1.1 request that
asks to switch protocols mid-stream. The wire looks like this:
GET /graphql HTTP/1.1
Host: api.example.com
Upgrade: websocket
Connection: Upgrade
Sec-WebSocket-Key: <16-byte base64 nonce>
Sec-WebSocket-Version: 13
Sec-WebSocket-Protocol: graphql-transport-ws
Origin: https://app.example.com
Authorization: Bearer eyJhbGciOiJI…
A healthy server replies with exactly one status line and a small set of headers:
HTTP/1.1 101 Switching Protocols
Upgrade: websocket
Connection: Upgrade
Sec-WebSocket-Accept: s3pPLMBiTxaQ9kYGzzhZRbK+xOo=
Sec-WebSocket-Protocol: graphql-transport-ws
101 Switching Protocols is the only acceptable
response. A 200 means the route isn't WebSocket-aware. A 302
means the proxy is redirecting (it shouldn't, because
wss:// URLs are followed as-is). A 401 or 403
means the auth header was rejected before the upgrade. Any
other 4xx or 5xx means something else broke. The
Sec-WebSocket-Accept value must equal the
base64-encoded SHA-1 hash of the client's
Sec-WebSocket-Key concatenated with the magic
GUID 258EAFA5-E914-47DA-95CA-C5AB0DC85B11. If
it doesn't match, the server is misbehaving or you're not
actually talking to a WebSocket implementation.
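The Phase 2 checks can be expressed as a short sketch. The function names and diagnosis strings below are hypothetical; the hash rule and the status-code meanings are the ones described above. The sample key in the usage note is the example nonce from RFC 6455, which produces the Sec-WebSocket-Accept value shown in the sample response.

```python
import base64
import hashlib

WS_GUID = "258EAFA5-E914-47DA-95CA-C5AB0DC85B11"


def expected_accept(sec_websocket_key: str) -> str:
    """base64(SHA-1(key + GUID)): the value the server must send back."""
    digest = hashlib.sha1((sec_websocket_key + WS_GUID).encode("ascii")).digest()
    return base64.b64encode(digest).decode("ascii")


def check_upgrade_response(raw_head: bytes, sec_websocket_key: str) -> str:
    """Return "ok" or a short diagnosis for a response head (status line + headers)."""
    lines = raw_head.decode("iso-8859-1").split("\r\n")
    parts = lines[0].split(" ", 2)
    if len(parts) < 2 or not parts[1].isdigit():
        return "malformed status line"
    status = int(parts[1])

    headers = {}
    for line in lines[1:]:
        if not line:
            break  # a blank line ends the header block
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()

    if status == 200:
        return "route is not WebSocket-aware (200 instead of 101)"
    if status in (301, 302, 307, 308):
        return "proxy is redirecting the upgrade"
    if status in (401, 403):
        return "auth rejected before the upgrade"
    if status != 101:
        return f"unexpected status {status}"
    if headers.get("sec-websocket-accept") != expected_accept(sec_websocket_key):
        return "Sec-WebSocket-Accept mismatch"
    return "ok"
```

Calling expected_accept("dGhlIHNhbXBsZSBub25jZQ==") returns s3pPLMBiTxaQ9kYGzzhZRbK+xOo=, so a response carrying any other value for that key fails the check even when the status is 101.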
Phase 3 — RTT frame exchange (optional). After the 101, the probe sends a small text frame and waits for a response. Two reception modes cover the field: Echo for echo servers where the reply must equal what was sent, and Any frame for servers that emit a welcome/handshake message on connect (Phoenix Channels, Hasura, GraphQL subscription endpoints). The phase passes if a non-control frame arrives within the configured RTT timeout.
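For Phase 3, the probe has to build a masked client frame and read an unmasked server frame. The sketch below covers only single, unfragmented frames and leaves out socket I/O and the RTT timeout; the function names and the "echo" / "any" mode strings are assumptions chosen to mirror the two reception modes described above.

```python
import os
import struct

OP_TEXT, OP_BINARY = 0x1, 0x2


def encode_text_frame(text: str, mask: bytes = b"") -> bytes:
    """Build one masked, unfragmented text frame (client-to-server frames must be masked)."""
    payload = text.encode("utf-8")
    mask = mask or os.urandom(4)
    length = len(payload)
    if length < 126:
        header = bytes([0x80 | OP_TEXT, 0x80 | length])
    elif length < 65536:
        header = bytes([0x80 | OP_TEXT, 0x80 | 126]) + struct.pack("!H", length)
    else:
        header = bytes([0x80 | OP_TEXT, 0x80 | 127]) + struct.pack("!Q", length)
    masked = bytes(b ^ mask[i % 4] for i, b in enumerate(payload))
    return header + mask + masked


def decode_frame(data: bytes):
    """Parse one unmasked server frame and return (opcode, payload)."""
    opcode = data[0] & 0x0F
    length = data[1] & 0x7F
    offset = 2
    if length == 126:
        (length,) = struct.unpack("!H", data[2:4])
        offset = 4
    elif length == 127:
        (length,) = struct.unpack("!Q", data[2:10])
        offset = 10
    return opcode, data[offset:offset + length]


def rtt_phase_passes(mode: str, sent: str, opcode: int, payload: bytes) -> bool:
    """Apply the Echo / Any frame rule to the first frame received."""
    if opcode not in (OP_TEXT, OP_BINARY):
        return False  # ping, pong and close are control frames
    if mode == "echo":
        return payload == sent.encode("utf-8")
    return True  # "any": any non-control frame counts
```

A real probe would start a timer before sending the frame and stop it when decode_frame returns, which gives the RTT figure for this phase.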
Three phases, three failure boundaries. A probe that collapses them into "is the WebSocket up?" loses the information that makes diagnosis possible.
Per-phase latency: connect, handshake, RTT
The reason for splitting the phases at all is that each one measures a different part of your stack, and you almost always want to alert on them separately.
- Connect latency rising means DNS, the load balancer, the TCP path, or TLS handshake are slow. It moves with infrastructure events — a new cert chain that adds an OCSP round-trip, a misconfigured DNS server, an upstream proxy under pressure. It rarely has anything to do with your application code.
- Handshake latency rising — the time from end-of-TLS to receipt of the 101 — points squarely at your application's upgrade path. The framework is slow to negotiate the subprotocol, an auth middleware is making a slow database call before deciding whether to allow the upgrade, or a startup-cost connection pool is cold.
- RTT latency rising tells you the connection is up but the application behind it is struggling. The first frame takes longer to compute, the subscription engine is backed up, a downstream message bus is slow. This is the user-visible "everything feels laggy" signal.
Alerting on a single end-to-end timer makes you choose between false alarms (if you set it tight enough to catch application slowness, network jitter pages you) and missed incidents (if you set it loose enough to ride out jitter, real application regressions hide inside the budget). Separate budgets per phase fix this. A reasonable default shape: Handshake degraded at 1000 ms, RTT degraded at 200 ms, RTT timeout (the hard Down line) at 5000 ms. Connect doesn't usually need its own threshold because the probe-orchestration timeout already covers it.
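The default budgets above translate into a small classifier. This is a sketch under stated assumptions: the Budgets type, the state strings, and the choice to return the list of phases that tripped are illustrative, while the three threshold values are the defaults quoted in the paragraph.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Budgets:
    handshake_degraded_ms: float = 1000.0
    rtt_degraded_ms: float = 200.0
    rtt_timeout_ms: float = 5000.0  # the hard Down line


def classify(handshake_ms: float, rtt_ms: Optional[float],
             budgets: Budgets = Budgets()) -> Tuple[str, List[str]]:
    """Return (state, phases_over_budget). rtt_ms=None means no frame arrived."""
    if rtt_ms is None or rtt_ms >= budgets.rtt_timeout_ms:
        return "down", ["rtt"]
    slow = []
    if handshake_ms >= budgets.handshake_degraded_ms:
        slow.append("handshake")
    if rtt_ms >= budgets.rtt_degraded_ms:
        slow.append("rtt")
    return ("degraded" if slow else "up"), slow
```

Returning the offending phases alongside the state keeps the alert text specific: a 1500 ms handshake with a 40 ms RTT reports degraded because of the handshake alone.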
Subprotocol negotiation (graphql-ws, mqtt, wamp)
The Sec-WebSocket-Protocol header is the bit of
the handshake that distinguishes "a WebSocket is open" from
"the right application protocol is being spoken over it".
It's also the one piece of the handshake that, if you don't
assert on it, will let a Hello-World endpoint pass for your
real production service.
The client sends a comma-separated list of subprotocols it's prepared to speak; the server picks one and echoes it back. Real-world tokens you'll meet:
- graphql-transport-ws: the current GraphQL subscriptions protocol used by graphql-ws library 5.x, Apollo Server 4+, Hasura, AWS AppSync.
- graphql-ws: the legacy subprotocol from the subscriptions-transport-ws library. Many deployments are mid-migration; graphql-ws monitoring needs to assert which one your server is actually negotiating because the wire framing differs.
- mqtt and mqttv3.1: MQTT over WebSocket for IoT brokers (HiveMQ, EMQX, AWS IoT Core).
- wamp.2.json / wamp.2.msgpack: Web Application Messaging Protocol, used by Crossbar.io and Autobahn-based stacks.
- signalr: Microsoft SignalR for ASP.NET Core real-time hubs.
If you don't tell the probe what subprotocol you expect,
anything that returns a 101 will pass — including a
placeholder route someone deployed during a Friday
firefight. With the assertion on, a deployment that
regresses from graphql-transport-ws back to
graphql-ws (or vice versa) flips the probe Down
within minutes, not whenever a customer complains.
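A subprotocol assertion is a small check once the 101 response has been parsed. The helper names and diagnosis strings here are hypothetical; the comparison is deliberately exact, so a token that differs only in casing counts as a different subprotocol.

```python
from typing import List, Optional


def offer_header(subprotocols: List[str]) -> str:
    """Value for the request's Sec-WebSocket-Protocol header."""
    return ", ".join(subprotocols)


def assert_subprotocol(offered: List[str], negotiated: Optional[str],
                       expected: str) -> str:
    """Return "ok" or a diagnosis. negotiated is the response header value, if any."""
    if negotiated is None:
        return "server selected no subprotocol"
    if negotiated not in offered:
        return f"server selected {negotiated!r}, which was not offered"
    if negotiated != expected:
        return f"expected {expected!r}, server negotiated {negotiated!r}"
    return "ok"
```

Offering both GraphQL tokens while expecting only graphql-transport-ws makes a regression visible: the server can still complete the handshake with graphql-ws, and the assertion reports exactly which token came back.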
Phoenix Channels monitoring is a slightly different shape: Phoenix doesn't use a custom subprotocol, but it does send a join/welcome message immediately after the upgrade. For Phoenix, set the probe to Level 3 (the Phase 3 frame exchange) with Any frame reception mode and a short RTT timeout. The assertion is "the welcome arrives", not "a specific subprotocol was chosen".
Custom headers and auth on the upgrade
One quirk of the WebSocket protocol that catches a lot of teams off guard: authentication happens on the upgrade request, not after the connection is open. The browser WebSocket API can't set arbitrary headers on the upgrade request, but every server-side client (including a monitoring probe) can, and that's where bearer tokens, API keys, and tenant routing headers go.
A typical authenticated upgrade looks like this:
GET /ws HTTP/1.1
Host: api.example.com
Upgrade: websocket
Connection: Upgrade
Sec-WebSocket-Key: <16-byte base64 nonce>
Sec-WebSocket-Version: 13
Authorization: Bearer eyJhbGciOiJIUzI1NiIs…
X-Tenant-Id: acme-prod
Origin: https://app.example.com
If the bearer token is invalid the server replies
401 Unauthorized — not a graceful WebSocket
close, because the WebSocket connection was never opened.
That's why a probe that only does TCP-up cannot detect auth
regressions: the layer where auth is checked is one
above what TCP-up measures. A probe that does the
real upgrade catches token expiry, signing-key rotations
gone wrong, scope-claim regressions, and tenant-routing
misconfigurations — every one of them surfaces as a 401 or
403 in place of the expected 101.
Two implementation details worth pinning. First, the auth
header is a bearer credential — store it encrypted at rest.
StatusPulse's WebSocket probe holds the Authorization value
AES-GCM-encrypted with a Key Vault master key, never logs
it, never echoes it back after save. Second,
Origin gets its own field because CSRF-aware
servers reject the upgrade unless Origin matches an
allowlist — browser-style enforcement applied to your
probe. Set it to your real frontend URL.
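A probe-side sketch of the authenticated upgrade, assuming hypothetical helper names: it generates the nonce, serialises the request with Origin and custom headers, and masks the Authorization value before anything is logged. The token in the usage note is a placeholder, and encryption at rest is out of scope for this example.

```python
import base64
import os
from typing import Dict, Optional


def new_websocket_key() -> str:
    """A fresh Sec-WebSocket-Key: 16 random bytes, base64-encoded."""
    return base64.b64encode(os.urandom(16)).decode("ascii")


def build_upgrade_request(host: str, path: str, key: str,
                          origin: Optional[str] = None,
                          extra_headers: Optional[Dict[str, str]] = None) -> bytes:
    """Serialise the HTTP/1.1 upgrade request, including auth and routing headers."""
    lines = [
        f"GET {path} HTTP/1.1",
        f"Host: {host}",
        "Upgrade: websocket",
        "Connection: Upgrade",
        f"Sec-WebSocket-Key: {key}",
        "Sec-WebSocket-Version: 13",
    ]
    if origin:
        lines.append(f"Origin: {origin}")
    for name, value in (extra_headers or {}).items():
        if any(c in name + value for c in "\r\n"):
            raise ValueError("header contains a line break")  # blocks header injection
        lines.append(f"{name}: {value}")
    return ("\r\n".join(lines) + "\r\n\r\n").encode("ascii")


def redact(headers: Dict[str, str]) -> Dict[str, str]:
    """Copy of headers that is safe to log: the Authorization value is masked."""
    return {k: ("***" if k.lower() == "authorization" else v)
            for k, v in headers.items()}
```

For example, build_upgrade_request("api.example.com", "/ws", new_websocket_key(), origin="https://app.example.com", extra_headers={"Authorization": "Bearer test-token", "X-Tenant-Id": "acme-prod"}) produces a request shaped like the one shown earlier in this section, and redact() on the same header dictionary is what should reach the logs.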
WebSocket over HTTP/2 (RFC 8441)
The classic WebSocket upgrade is HTTP/1.1 only because the
HTTP/2 binary framing layer has no equivalent of the
Connection: Upgrade mechanism. RFC 8441
(Bootstrapping WebSockets with HTTP/2, published
2018) closes that gap: an HTTP/2 server that sets
SETTINGS_ENABLE_CONNECT_PROTOCOL = 1 in its
settings frame can accept an extended CONNECT
request that opens a WebSocket inside an HTTP/2 stream.
That gives you WebSocket multiplexing over an existing
HTTP/2 connection — no separate TCP/TLS handshake per
socket — with the same upgrade flow conceptually but
different wire framing.
.NET's ClientWebSocket supports RFC 8441 from
.NET 7 onwards, and falls back to HTTP/1.1 transparently
if the server doesn't advertise the setting. The catch:
real-world support across reverse proxies and CDN edges is
patchy. NGINX added it in 1.25.3 but only behind explicit
configuration. Older HAProxy versions don't have it.
Several CDN edges still negotiate down to HTTP/1.1.
The pragmatic rule: leave the probe on HTTP/1.1 by default. Flip to HTTP/2 only when you specifically know your server stack supports the extended CONNECT end-to-end and you want the probe to verify that the HTTP/2 path keeps working. If you flip it and the probe still reports a 101, check the negotiated protocol version in the response detail: a successful extended CONNECT is answered with a 2xx status rather than a 101, so a 101 points to a silent fallback to HTTP/1.1, which means you're not actually testing what you intended to test.
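To make the difference in wire framing concrete, here is a sketch of the request headers an RFC 8441 extended CONNECT carries, plus a downgrade check. The function names and the "2" / "1.1" version strings are assumptions for the example; how a probe learns the negotiated version depends on the HTTP client it uses.

```python
from typing import List, Optional, Tuple


def extended_connect_headers(authority: str, path: str,
                             subprotocol: Optional[str] = None) -> List[Tuple[str, str]]:
    """Headers for an RFC 8441 extended CONNECT.

    There is no Upgrade, Connection or Sec-WebSocket-Key here: the
    :protocol pseudo-header replaces that part of the HTTP/1.1 handshake.
    """
    headers = [
        (":method", "CONNECT"),
        (":protocol", "websocket"),
        (":scheme", "https"),
        (":path", path),
        (":authority", authority),
        ("sec-websocket-version", "13"),
    ]
    if subprotocol:
        headers.append(("sec-websocket-protocol", subprotocol))
    return headers


def downgrade_detected(requested_version: str, negotiated_version: str) -> bool:
    """True when HTTP/2 was requested but the connection used another version."""
    return requested_version == "2" and negotiated_version != "2"
```

A probe configured for HTTP/2 could treat downgrade_detected("2", "1.1") as a warning rather than a failure, since the socket still works but the path under test is not the intended one.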
Common failure modes
A short field guide to the patterns that show up most often in production WebSocket outages, what they look like in the probe, and where to point your debugging:
- Load balancer strips the Upgrade header. The probe error is "Server returned 200; expected 101 Switching Protocols". The TCP probe stays green. Fix at the proxy: in NGINX, the location block needs proxy_http_version 1.1; plus proxy_set_header Upgrade $http_upgrade; and proxy_set_header Connection "upgrade";. In Azure Application Gateway, "WebSocket support" has to be explicitly enabled on the listener.
- Rate-limiter triggers on connection attempts. A WAF or rate-limit rule counting connections-per-minute from a single IP flags the probe as abuse and starts returning 429s. The fix is either to allowlist the probe's source range or to relax the rule. StatusPulse publishes its outbound IP ranges per-region so you can allowlist precisely.
- Idle timeout shorter than keep-alive. Your application sends keep-alive pings every 60s; the upstream proxy's idle timeout is 30s; real connections die mid-conversation. A short-lived probe doesn't catch this directly — but the RTT chart starts jittering as proxy reconnect overhead leaks into the latency budget. If RTT jitters while handshake stays flat, the connection layer is the suspect.
- Custom binary subprotocol needs a real send/receive. A connect-and-101 probe will pass against a SignalR or MQTT-over-WS endpoint that's broken at the application layer. Level 3 with a protocol-specific init payload (e.g. {"type":"connection_init"} for graphql-ws) is what catches "the upgrade works but the framing handshake doesn't".
- Subprotocol case mismatch. The assertion compares the Sec-WebSocket-Protocol value case-sensitively: graphql-ws and GraphQL-WS are different tokens. If your assertion suddenly fails after a server upgrade, check whether the new library version normalised the casing.
- HTTP/2 silent downgrade. See above. The probe doesn't break, but you're testing the HTTP/1.1 path instead of the HTTP/2 path you intended.
For the full configuration surface — RTT modes, RTT timeout
and degraded thresholds, custom JSON headers, ignore-cert
toggle, the SSRF guard against
169.254.169.254 — see the
WebSocket probe help page.
And if you're building protocol-aware probes for sister
stacks, the gRPC Health
probe guide covers the same shape applied to gRPC's
standard health-check service.
Wrap-up
"The TCP port is open" is not a WebSocket health check.
A real WebSocket health probe has to speak the
protocol — perform the upgrade, validate the
101 Switching Protocols response, optionally
assert the negotiated subprotocol, and optionally exchange
a frame to confirm the application layer behind the socket
is responsive. Anything less leaves you discovering chat,
collaboration, trading, or multiplayer outages from support
tickets instead of from your monitoring.
The discipline is in the per-phase split. Connect, handshake, and RTT latency point at three different layers, and a probe that surfaces them separately turns "WebSocket is slow" into "the upgrade endpoint is slow but the application is fine" in a single chart glance. Subprotocol assertion catches the silent library-version regression; the upgrade-time auth header catches token and scope regressions; HTTP/2 (RFC 8441) is there when your stack supports it end-to-end. Wire those in and the "TCP-up but socket-down" class of incident moves from invisible to obvious.
Try StatusPulse's WebSocket probe
5 probes free; WebSocket probe from Pro ($19/mo). US or EU host — you choose.