DEV TOOLBOX · CRON DIALECT
>cronpreview

Build, validate, and preview cron expressions across Unix, AWS EventBridge, GitHub Actions, Vercel, Kubernetes, Quartz, and Spring — with timezone-correct next-run previews.

Guide · 5 min read

Monitoring cron jobs: the failure mode you almost certainly have

Cron jobs fail silently. The standard observability stack does not catch them. Here is how to find out before your users do.

The default outcome of a broken cron job is that nothing happens. No alert, no email, no page. The next morning your batch report is missing, the cache is stale, the daily metric rollup is empty. You find out about it from a customer.

Standard monitoring catches the wrong things. Process uptime tells you the cron daemon is running, not that any specific job has run. Log volume looks normal, because a job that never ran produces no error lines — only an absence. CPU and memory look fine.

The pattern that actually works: deadman switches

Instead of monitoring "is the job running", monitor "did the job run by the time it was supposed to". The job pings a known endpoint at the end of every successful execution. The monitoring system expects that ping on a known cadence. If the ping is late by more than a grace window, it pages.

#!/bin/bash
set -euo pipefail
./run-nightly-rollup.sh
# Only reached if the job succeeded: set -e skips the ping on failure,
# and the missed heartbeat is what pages you.
curl -fsS -m 10 --retry 3 "https://hc-ping.com/<your-uuid>"

One wrapper, one curl, problem mostly solved. The free tiers of Healthchecks.io and Better Stack's heartbeat monitors both work this way. Healthchecks.io is open source, so you can self-host it if you don't want a vendor in the loop.
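To schedule the wrapper, something like the following crontab entry works — the script path and log location are placeholders, and the explicit PATH matters because cron's default environment is minimal:

```crontab
# m  h  dom mon dow  command
# Assumes the wrapper above is saved as /usr/local/bin/nightly-rollup (hypothetical path).
PATH=/usr/local/bin:/usr/bin:/bin
17 2 * * * nightly-rollup >> /var/log/nightly-rollup.log 2>&1
```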

For multi-step jobs: ping per phase

A job that runs three things back-to-back hides phase-level failures from a single heartbeat. Either split the job, or use the "start" / "success" / "fail" modes most heartbeat services support:

PING="https://hc-ping.com/<uuid>"
curl -fsS -m 10 "$PING/start"
./step-1.sh  || { curl -fsS -m 10 "$PING/fail"; exit 1; }
./step-2.sh  || { curl -fsS -m 10 "$PING/fail"; exit 1; }
./step-3.sh  || { curl -fsS -m 10 "$PING/fail"; exit 1; }
curl -fsS -m 10 "$PING"
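Heartbeat services generally also accept a trailing exit status on the ping URL — Healthchecks.io treats a trailing 0 as success and any nonzero status as a failure — which lets a single EXIT trap replace the per-step fail pings. A self-contained sketch of that shape, with the steps and the curl call stubbed out as shell functions so it runs anywhere:

```shell
#!/bin/bash
set -euo pipefail

PING="https://hc-ping.com/<uuid>"     # same placeholder endpoint as above
ping() { echo "PING $PING/$1"; }      # stub; real version: curl -fsS -m 10 "$PING/$1"

step_1() { true; }                    # stubs standing in for ./step-N.sh
step_2() { true; }
step_3() { true; }

ping start
trap 'rc=$?; ping "$rc"' EXIT         # fires on every exit path: /0 on success,
step_1                                # the failing command's status otherwise
step_2
step_3
```

With set -e, the first failing step aborts the script and the trap reports that step's exit code; there is no way to forget a fail ping on a new step.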

GitHub Actions: the special case

GitHub's scheduler is best-effort and can skip runs entirely during peak load. A deadman switch on a GHA cron is mandatory if anyone is going to depend on it. The peak windows we have observed empirically are roughly :00 and :30 past the hour, UTC. If you schedule at 0 9 * * * you are competing with hundreds of thousands of other workflows. Offset by an odd minute (17 9 * * *) and your queue time drops by minutes, not seconds.
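Putting both pieces together — an offset schedule plus a final heartbeat step — a workflow might look like this (the workflow name, job script, and HC_UUID repository secret are placeholders):

```yaml
name: nightly-rollup
on:
  schedule:
    - cron: "17 9 * * *"   # offset from :00/:30 to dodge peak queueing; UTC
jobs:
  rollup:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: ./run-nightly-rollup.sh
      - name: Deadman ping
        if: success()
        run: curl -fsS -m 10 --retry 3 "https://hc-ping.com/${{ secrets.HC_UUID }}"
```

If GitHub silently skips the scheduled run, no step executes, the ping never fires, and the heartbeat monitor pages — which is exactly the failure mode a best-effort scheduler needs covered.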

Kubernetes: also a special case

A CronJob with concurrencyPolicy: Allow (the default) and a slow job will pile up overlapping Jobs when the cluster is under load. Set concurrencyPolicy: Forbid unless you have a specific reason not to, and set activeDeadlineSeconds on the job template so a runaway job can't pin the cluster.
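A sketch of a CronJob manifest with those settings — the name, schedule, image, and timeout values are placeholders to adapt:

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: nightly-rollup
spec:
  schedule: "17 2 * * *"
  concurrencyPolicy: Forbid        # skip a run if the previous one is still going
  startingDeadlineSeconds: 300     # give up on a run that can't start within 5 min
  jobTemplate:
    spec:
      activeDeadlineSeconds: 3600  # kill a runaway job after an hour
      backoffLimit: 1
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: rollup
              image: example.com/rollup:latest   # placeholder image
              command: ["/app/run-nightly-rollup.sh"]
```

The heartbeat curl still belongs inside the container's script: Kubernetes can tell you a Job failed, but only a deadman switch tells you it never started.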

Spotted something wrong? hello@cronpreview.com.