Cron heartbeat patterns that catch silent failures
Exit-code monitoring is fine for the loud failures. The two failure modes that take down your business — jobs that ran but did nothing, and jobs that didn't run at all — need a different shape of probe. This is the cron job monitoring pattern that catches both.
The three failure modes of a cron job
Every team that has been running cron for more than a year has a war story that ends with "and we only found out three weeks later". They almost always trace back to one of three failure modes, and they are not equally easy to catch:
- (a) Ran and failed loudly. The script exits non-zero, cron emails root, your log aggregator sees a stack trace, your APM fires an alert. This is the easy case. Any half-decent monitoring picks it up.
- (b) Ran but did nothing useful. The script thinks it succeeded. It exited zero and no exceptions were thrown, but a misconfigured filter selected zero rows, an empty input bucket returned an empty list, an upstream API silently returned a 200 with no data. The backup ran for 0.4 seconds and uploaded an empty tarball. This is the partial-success failure, and it is the most common cause of "we discovered the broken pipeline three weeks later".
- (c) Didn't run at all. The cron daemon wasn't restarted after the package upgrade. The Kubernetes CronJob's imagePullPolicy picked up a broken image and the pod has been crash-looping for a month. The GitHub Actions runner pool ran out of minutes. The SYSTEMD_TIMER never started because the unit file had a typo. This is the killer. The job is just gone and the silence is indistinguishable from "nothing interesting happened".
Mode (a) is the easy 20% of incidents. Modes (b) and (c) are the 80% — and they are exactly what exit-code monitoring misses.
Why exit-code monitoring misses two of them
The default mental model is: "my job exits zero on success, non-zero on failure, and I alert on the non-zeros." That model rests on two unstated assumptions: that exit zero means the job did useful work, and that the job ran in the first place.
For mode (b): exit-zero only tells you the
process didn't throw. It says nothing about whether the process
did the thing it was supposed to do. A backup script can
cheerfully exit 0 after the find in
front of tar returned no files. A reconciliation
job can exit 0 after processing zero rows because
the cursor was already at the end of the table. An export
script can exit 0 after writing an empty CSV
because the SQL filter was wrong. Every one of those is an
outage and every one of those passes exit-code monitoring.
For mode (c): there is no exit code to monitor when the process never started. If you are watching for non-zero exits, the absence of any exit at all is the loudest possible failure and the quietest possible signal — there is literally nothing to alert on. The job stops running, the inbox stops receiving cron mail, and everything looks fine.
What you need is a probe that flips the polarity: instead of alerting when something bad happened, alert when something good stopped happening. That's a heartbeat.
The heartbeat pattern: absence is the signal
Every probe in a typical monitoring tool — HTTP, TCP, DNS,
certificate expiry, database SELECT 1 — is
pull-based. The monitor reaches out to your service and
checks the response. That's the wrong shape for cron jobs,
because the job has no inbound network surface to poll. The
heartbeat pattern inverts the flow.
Each cron job gets a unique URL. At the end of every successful
run, the job POSTs to that URL. The monitor knows
the expected schedule, and it tracks the time since the last
successful ping. If the ping doesn't arrive within
schedule + grace_period, the monitor flips Down and
pages your on-call.
The mechanics fall out of one observation: the absence of a signal is the signal. There is no "I'm failing!" message to send when the cron daemon dies, no exit code to capture when the pod was never scheduled. The only piece of information that survives every failure mode is "a ping we expected didn't arrive", and that's what the heartbeat probe is built around.
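The monitor-side check described above can be sketched in a few lines. This is a minimal illustration for a simple-interval schedule, not StatusPulse's actual implementation; the function name `is_down` and its parameters are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def is_down(last_ping: datetime, interval: timedelta,
            grace: timedelta, now: datetime) -> bool:
    """Return True when the expected ping is overdue by more than the grace period."""
    # The next ping is expected one interval after the last one;
    # the grace period is the extra slack before the probe flips Down.
    deadline = last_ping + interval + grace
    return now > deadline


if __name__ == "__main__":
    last = datetime(2024, 1, 1, 3, 0, tzinfo=timezone.utc)
    day, half_hour = timedelta(days=1), timedelta(minutes=30)
    # 20 minutes late: still inside the grace window
    print(is_down(last, day, half_hour, datetime(2024, 1, 2, 3, 20, tzinfo=timezone.utc)))
    # 31 minutes late: the silence has become the alert
    print(is_down(last, day, half_hour, datetime(2024, 1, 2, 3, 31, tzinfo=timezone.utc)))
```

Note that nothing in the check depends on the job reporting a failure. A schedule given as a cron expression would compute the next expected time differently, but the comparison against the deadline stays the same.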
StatusPulse's Heartbeat probe
implements this directly: a 32-byte CSPRNG token per probe, a
unique URL of the form
https://api.statuspulse.ai/api/heartbeat/<token>,
a configurable expected schedule (simple interval or 5-field
cron), and a configurable grace period. Each successful
POST resets the "next expected" clock. The
Heartbeat probe is on the
Starter plan and above
($5/mo) — not on Free, where the probe budget is reserved for
outbound HTTP / SSL / TCP. If you've used Cronitor or Healthchecks.io,
this is the same shape; if you're comparing tools, our
UptimeRobot comparison
covers how heartbeat support differs across uptime vendors
(UptimeRobot has it, but capped and without partial-success
payloads).
Concrete recipes
The pattern is uniform: run the job, on success call the heartbeat URL, on failure either skip the call (the absence will alert) or explicitly POST a failure payload. The syntax varies by platform.
Linux cron + curl
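The classic case. Daily backup at 03:00 UTC (by default cron fires in the
host's time zone, so this assumes the host clock is set to UTC). The
&& is load-bearing: it means
curl runs only if the backup
exited zero. A semicolon would ping unconditionally, which is
exactly the silent-success failure we're trying to catch. Keep the
entry on one line, because crontab has no shell-style backslash
line continuation.
0 3 * * * /usr/local/bin/run-backup && curl -fsS --retry 3 https://api.statuspulse.ai/api/heartbeat/<token>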
The -fsS flag combo is non-negotiable.
-f makes curl exit non-zero on an HTTP error response, 4xx as
well as 5xx (without it, curl exits 0 on a server error and your
heartbeat reports "fine" even when StatusPulse returned an error). -s
silences the progress meter so cron doesn't email you every
night. -S re-enables error output so when
something does break, you see it. --retry 3
absorbs transient network blips on the path between your
server and the receiver.
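For jobs written in Python rather than shell, the same behaviour can be approximated with the standard library. The sketch below is an illustration, not an official client: `send_ping` and `ping_with_retry` are hypothetical names, and unlike curl's --retry (which retries only errors it considers transient) this version retries on any failure.

```python
import sys
import time
import urllib.request
from typing import Callable


def send_ping(url: str, timeout: float = 10.0) -> None:
    """POST an empty body. urlopen raises HTTPError on 4xx/5xx, much like curl -f."""
    request = urllib.request.Request(url, data=b"", method="POST")
    with urllib.request.urlopen(request, timeout=timeout):
        pass


def ping_with_retry(url: str, send: Callable[[str], None] = send_ping,
                    attempts: int = 4, delay: float = 1.0) -> bool:
    """Try the ping up to `attempts` times; return True on the first success."""
    for attempt in range(attempts):
        try:
            send(url)
            return True
        except OSError as exc:  # URLError, HTTPError and socket timeouts derive from OSError
            print(f"heartbeat attempt {attempt + 1} failed: {exc}", file=sys.stderr)
            if attempt + 1 < attempts:
                time.sleep(delay * (2 ** attempt))  # simple exponential backoff
    return False
```

The default of four attempts mirrors curl's one initial try plus three retries. Call `ping_with_retry` only after the job's real work has succeeded, which is the Python equivalent of the && in the crontab entry.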
Kubernetes CronJob
Pod-restart loops and bad image pulls can break a CronJob without producing any signal in your APM. Heartbeat catches the silence. Two patterns work — inline curl, or a sidecar that runs after the main container. The inline version is simpler and almost always enough:
apiVersion: batch/v1
kind: CronJob
metadata:
  name: scrape-upstream
spec:
  schedule: "*/15 * * * *"
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: OnFailure
          containers:
            - name: scrape
              image: ghcr.io/acme/scrape:1.4.2
              command: ["/bin/sh", "-c"]
              args:
                - |
                  /app/scrape.sh && \
                  curl -fsS --retry 3 "$HEARTBEAT_URL"
              env:
                - name: HEARTBEAT_URL
                  valueFrom:
                    secretKeyRef:
                      name: statuspulse-heartbeat
                      key: url
Store the URL as a Secret — it's a bearer
credential. Anyone with the URL can ping it and keep your
probe falsely Up. Treat it like an API key.
GitHub Actions scheduled workflow
Cron monitoring for GitHub Actions is its own
sub-problem: Actions has no built-in alerting for a workflow
that simply stops running. If GitHub disables your scheduled
workflow after 60 days of repo inactivity (in public repositories,
they will), or the runner pool runs out, your workflow is gone and
you find out in the next quarterly review.
name: nightly-export
on:
  schedule:
    - cron: '15 2 * * *'
jobs:
  export:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Run export
        run: ./scripts/export.sh
      - name: Ping StatusPulse on success
        if: success()
        run: curl -fsS --retry 3 "$STATUSPULSE_HEARTBEAT_URL"
        env:
          STATUSPULSE_HEARTBEAT_URL: ${{ secrets.STATUSPULSE_HEARTBEAT_URL }}
The if: success() guard is the Actions equivalent
of cron's && — the step runs only when
every previous step in the job succeeded. Don't use
if: always() here unless you also POST an explicit
failure payload (see partial-success below), because
always() turns the heartbeat into "the workflow
file parsed", which is not what you want to assert.
Windows Task Scheduler + PowerShell
Same pattern, different syntax. Register the script below as the action of a scheduled task; Task Scheduler hands stdout and stderr off like any other task.
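$ErrorActionPreference = "Stop"   # turn non-terminating errors into exceptions
try {
    & "C:\Scripts\cleanup.ps1"
    # Success: an empty POST resets the heartbeat clock
    Invoke-WebRequest -Uri $env:HEARTBEAT_URL `
        -Method POST -UseBasicParsing -TimeoutSec 10
} catch {
    # Failure: report it explicitly, then rethrow so the task records the error
    Invoke-WebRequest -Uri $env:HEARTBEAT_URL `
        -Method POST -ContentType "application/json" `
        -Body '{"success":false,"message":"cleanup threw"}' `
        -UseBasicParsing -TimeoutSec 10
    throw
}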
Airflow DAGs
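Two options. The simplest: a final BashOperator
task that fires curl, with
trigger_rule=TriggerRule.ALL_SUCCESS so it only
runs when all of its direct upstream tasks succeeded.
ALL_SUCCESS is already Airflow's default; spelling it out
documents the intent. In a DAG with several branches, make the
heartbeat task downstream of every leaf task.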
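from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.utils.trigger_rule import TriggerRule

with DAG(
    "nightly_export",
    schedule="15 2 * * *",  # the `schedule` argument needs Airflow 2.4+
    start_date=datetime(2024, 1, 1),  # placeholder date; Airflow needs a start_date to schedule runs
    catchup=False,
) as dag:
    # The trailing space stops Airflow from treating the .sh path as a Jinja template file
    export = BashOperator(task_id="export", bash_command="./export.sh ")
    heartbeat = BashOperator(
        task_id="heartbeat",
        bash_command='curl -fsS --retry 3 "$HEARTBEAT_URL"',
        trigger_rule=TriggerRule.ALL_SUCCESS,
    )
    export >> heartbeat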
The cleaner option: a DAG-level
on_success_callback that hits the URL, with a
matching on_failure_callback that POSTs the
failure payload. That keeps the DAG graph tidy and gets you
the partial-success payload pattern for free.
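A rough sketch of the callback variant follows. It assumes Airflow 2.x, where `DAG(...)` accepts `on_success_callback` and `on_failure_callback` and calls each with a context dictionary. The helper names are hypothetical, the failure body reuses the shape from the PowerShell recipe above, and `HEARTBEAT_URL` is assumed to be set in the environment where callbacks execute.

```python
import json
import os
import urllib.request
from typing import Optional


def build_request(url: str, payload: Optional[dict] = None) -> urllib.request.Request:
    """Empty POST for a plain success ping; JSON POST when a payload is supplied."""
    if payload is None:
        return urllib.request.Request(url, data=b"", method="POST")
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        url, data=body, method="POST",
        headers={"Content-Type": "application/json"},
    )


def notify_success(context: dict) -> None:
    """DAG-level success callback: reset the heartbeat clock."""
    request = build_request(os.environ["HEARTBEAT_URL"])
    urllib.request.urlopen(request, timeout=10).close()


def notify_failure(context: dict) -> None:
    """DAG-level failure callback: send an explicit failure payload."""
    payload = {"success": False, "message": f"DAG run {context.get('run_id')} failed"}
    request = build_request(os.environ["HEARTBEAT_URL"], payload)
    urllib.request.urlopen(request, timeout=10).close()


# Wiring (not executed here, requires Airflow):
#   DAG("nightly_export", ...,
#       on_success_callback=notify_success,
#       on_failure_callback=notify_failure)
```

Because the callbacks live outside the task graph, a heartbeat outage never shows up as a failed task. Whether that is desirable depends on how strictly you want the DAG run to reflect the ping.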
Partial-success POSTs for "ran but processed nothing"
The recipes above catch mode (c) — the job didn't run.
Mode (b) — the job ran but processed zero of the thousand
items it should have — needs one more piece. Empty
POST bodies say "I'm alive", but they say nothing
about whether the run was actually useful. The fix is to send
a tiny JSON payload with the run's outcome counts.
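#!/usr/bin/env bash
set -euo pipefail   # abort before the ping if the import itself fails
RESULT=$(/usr/local/bin/run-import)   # prints e.g. "processed=842 skipped=12"
# grep -P needs GNU grep; if no count is found the script aborts and no ping is sent
PROCESSED=$(echo "$RESULT" | grep -oP 'processed=\K\d+')
curl -fsS --retry 3 \
  -H "Content-Type: application/json" \
  -d "{\"processed\":$PROCESSED}" \
  "https://api.statuspulse.ai/api/heartbeat/<token>"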
With Capture payload enabled on the probe (it's off by
default — turn it on intentionally because the body is stored
up to 4 KB, and you don't want PII in there), the receiver
keeps the JSON. You can then either eyeball the recent payloads
in the probe's history, or wire a downstream rule: if the
processed count is below a threshold for N consecutive runs,
POST to …/<token>/fail from the job itself
and the probe flips Down with the message attached.
The pattern that works in production: every job ends with
either a success payload (with counts) or, if the counts are
implausibly low, an explicit failure payload. The
/fail endpoint flips the probe Down immediately
with the supplied message surfaced in the alert.
That covers mode (b) properly — a job that ran and processed
zero rows tells you so, in the same channel as a job that
crashed.
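The "success payload or explicit failure" decision can be sketched as a small pure function. The name `plan_heartbeat`, the `min_expected` threshold, and the `message` field in the failure body are assumptions for illustration; only the /fail suffix and the `processed` count come from the text above.

```python
import json
from typing import Tuple


def plan_heartbeat(base_url: str, processed: int, min_expected: int) -> Tuple[str, bytes]:
    """Return the (url, json_body) pair for the POST that should end this run."""
    if processed < min_expected:
        # Implausibly low count: report an explicit failure instead of a success ping
        message = f"processed {processed} items, expected at least {min_expected}"
        body = {"message": message, "processed": processed}
        return base_url.rstrip("/") + "/fail", json.dumps(body).encode("utf-8")
    # Normal run: success ping with the outcome count attached
    return base_url, json.dumps({"processed": processed}).encode("utf-8")
```

Keeping the decision separate from the HTTP call makes the threshold easy to unit-test. Picking `min_expected` is the real work: it should come from what a normal run of the job looks like.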
Grace windows and clock drift
The grace period is the window between "expected ping time" and "we flip Down". Setting it too tight is the most common way teams sour on heartbeat monitoring — they get false-Down alerts every other week, learn to ignore the alerts, and miss the real one.
Real-world cron has jitter. A backup that takes 5 minutes most nights can take 25 the night a large customer onboarded. Kubernetes CronJobs add pod-scheduling, image-pull, and node-pressure delays that can run to minutes on a cold node. GitHub Actions runners can take 2-5 minutes to pick up a job under load. The cron daemon itself drifts — clock-skewed VMs on a busy hypervisor can be a minute off, and DST shifts local-time crons by an hour twice a year.
Useful defaults that don't false-alarm:
- Daily / nightly jobs (backups, ETLs): 10-60 minutes of grace. The job's actual runtime variance is the floor; add a comfortable margin for runtime regressions and infrastructure jitter.
- Hourly jobs: 5-10 minutes.
- Every-15-minutes jobs: 2-5 minutes.
- High-frequency (every 1-5 minutes): 60-120 seconds. Looser than that is fine if you also POST partial-success counts, because an explicit failure payload flags a bad run even when the timing tolerance is loose.
Schedule the cron expression itself in UTC and stop thinking about local time for infrastructure. DST shifts have caused more 4am pages than they have caught real incidents.
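One way to turn "runtime variance is the floor, plus a margin" into a number is sketched below. The heuristic (spread between the slowest and fastest recent run, times a safety factor) and the name `suggest_grace` are illustrative assumptions, not a StatusPulse recommendation.

```python
from datetime import timedelta
from typing import Sequence


def suggest_grace(runtimes: Sequence[timedelta], margin: float = 2.0,
                  floor: timedelta = timedelta(minutes=2)) -> timedelta:
    """Suggest a grace period from the spread of recent runtimes."""
    if not runtimes:
        return floor  # no history yet: fall back to the minimum
    spread = max(runtimes) - min(runtimes)
    # Scale the observed spread by a safety margin, but never go below the floor
    return max(spread * margin, floor)
```

For the backup in the example above (5 minutes most nights, 25 on a heavy night) the spread is 20 minutes, and a margin of 2.0 suggests a 40-minute grace period, inside the 10-60 minute range given for nightly jobs.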
What to alert on, what to ignore
Three rules that survive contact with a real on-call rotation:
- Alert on grace-period silence. This is the whole point of the probe. The ping didn't arrive within expected + grace — page someone. The probe goes Down, your watchers get the usual email / Slack / Teams / SMS notification, and the incident exists in the status page record.
- Ignore single failed POSTs unless N in a row. A transient network blip between your runner and the receiver should not page you. The retry flag (--retry 3 on curl, equivalent on every other client) absorbs the blip. Set the alert threshold on missed schedules, not on individual HTTP failures from the job to the receiver.
- Dedup duplicate runs from retries. If your workflow retries on failure (Actions offers re-runs, Kubernetes Jobs have backoffLimit, Airflow has retries=3), several pings can arrive within the same scheduled window. That's fine: the receiver treats them as idempotent, and the probe stays Up after the first one. The rate-limit (200 requests per 10 minutes per token-hash) is the only ceiling, and you should not be anywhere near it.
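The "N in a row" rule is easy to get subtly wrong, so here is a minimal sketch of it. The function name `should_page` and the default threshold of 3 are illustrative assumptions; `outcomes` is a history of ping results, oldest first, with True meaning the ping was delivered.

```python
from typing import Sequence


def should_page(outcomes: Sequence[bool], threshold: int = 3) -> bool:
    """True only when the most recent `threshold` outcomes are all failures."""
    if threshold <= 0 or len(outcomes) < threshold:
        return False  # not enough history to call it a streak
    return not any(outcomes[-threshold:])
```

A single success anywhere in the last `threshold` results resets the streak, which is what keeps one transient blip from paging anyone.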
One pattern to avoid: alerting the on-call on every individual
job failure (the /fail POST) and also on
missed schedules. That doubles the noise. Pick one channel for
"the job tried and failed" and one for "the job didn't try" —
most teams route the two to different severities, with missed
schedules being the higher one because mode (c) is harder to
recover from than mode (a).
Wrap-up
Exit-code monitoring is fine for the loud failures. The silent ones (the empty backup, the dead cron daemon, the disabled GitHub Actions workflow, the crash-looping CronJob) need a probe that asserts "something good happened recently" rather than "nothing bad happened". Heartbeats are that probe, and the payload-extended version (partial-success POSTs) also covers mode (b), the job that ran but did nothing useful.
The recipe is uniform across platforms: run the job, on
success call a unique URL, on failure either skip the call or
POST an explicit failure. The differences across cron,
Kubernetes, Actions, Task Scheduler, and Airflow are
syntactic. The discipline is in the details:
&& not ;, -fsS
on every curl, a grace period that respects real-world
jitter, and treating the URL like the bearer credential it
is. Once it's wired, the silent-failure class of incident
collapses — you stop discovering broken pipelines three weeks
late, and the discovery moves to the next morning, where it
belongs.
Try StatusPulse's Heartbeat probe
5 probes free; Heartbeat probe from Starter ($5/mo). US or EU host — you choose.