
Heartbeat Monitoring: When Your Job Should Ping You

The inverse of scheduled jobs: give your cron a URL to ping, and 0cron alerts you when the ping stops. Grace periods, token generation, and PostgreSQL interval arithmetic.

Thales & Claude | March 25, 2026 | 14 min read

Tags: 0cron, monitoring, heartbeat, rust, postgresql, alerting

Most of 0cron works in one direction: we call your endpoints on a schedule. You define a job, give us a URL and a cron expression, and we make the HTTP request at the right time. Simple.

But there is a class of problems where the direction needs to reverse. You have a backup script running on your own server. A CI pipeline that should complete every hour. A data sync job managed by a third-party service. You cannot point 0cron at these because you do not control their invocation -- they are already running somewhere else. What you need to know is whether they are still running. Whether they completed. Whether something broke at 3am and nobody noticed until Monday.

This is heartbeat monitoring, and it is built into 0cron as a first-class feature. The concept is simple: we give you a URL. Your job pings that URL when it completes. If we do not receive a ping within the expected window plus a grace period, we alert you. Silence means failure.

The entire implementation is 105 lines of Rust.

The Data Model

Every heartbeat monitor needs to track a few things: who owns it, what schedule it expects, how much slack to allow, and when it last heard from the job. Here is the PostgreSQL schema.

CREATE TABLE monitors (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    team_id UUID REFERENCES teams(id),
    name VARCHAR(255) NOT NULL,
    ping_token VARCHAR(64) UNIQUE NOT NULL,
    schedule_cron VARCHAR(100) NOT NULL,
    grace_period_seconds INTEGER DEFAULT 300,
    timezone VARCHAR(50) DEFAULT 'UTC',
    status VARCHAR(20) DEFAULT 'active',
    last_ping_at TIMESTAMPTZ,
    notification_config JSONB,
    created_at TIMESTAMPTZ DEFAULT NOW()
);

The key fields deserve explanation.

ping_token is a unique 64-character hex string. This is the identifier embedded in the ping URL. When your job hits https://0cron.dev/v1/ping/abc123def..., the token maps to this monitor. We use tokens instead of monitor IDs for two reasons: tokens are unguessable (they are 32 random bytes, hex-encoded), and they decouple the ping endpoint from the internal data model. If we ever restructure our UUID scheme, ping URLs remain stable.

schedule_cron defines when pings are expected. If your backup runs at 0 2 * * * (2am daily), the monitor knows that a ping should arrive shortly after 2am. The cron expression is parsed and validated at creation time using the same cron crate that powers the job scheduler.

grace_period_seconds defaults to 300 (5 minutes). This is the window after the expected ping time during which we do not alert. Backup jobs take variable time. Network latency exists. A job that runs at 2:00am and pings at 2:04am is not late -- it is within tolerance. Five minutes is a sensible default; users can increase it for long-running jobs or decrease it for time-sensitive ones.

last_ping_at is nullable. A brand-new monitor has never received a ping. This null state is important -- we do not alert on monitors that have never pinged, because the user might still be setting up their integration. The first ping establishes the baseline.

notification_config is JSONB, which means each monitor can have its own alert routing. One monitor might notify via Slack, another via email, a third via a webhook. The same multi-channel notification system from article 5 in this series powers monitor alerts.
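As a concrete illustration, a per-monitor config could look something like the following. The field names here are hypothetical -- the article does not pin down the JSONB shape -- but the idea is that each monitor carries its own routing:

```json
{
  "channels": [
    { "type": "slack", "webhook_url": "https://hooks.slack.com/services/EXAMPLE" },
    { "type": "email", "to": "oncall@example.com" }
  ]
}
```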

Creating a Monitor

Monitor creation validates the cron expression and generates the ping token. This is the entry point.

pub async fn create_monitor(
    team_id: Uuid, name: &str, schedule_cron: &str,
    grace_period_seconds: i32, timezone: &str,
) -> AppResult<Monitor> {
    schedule_cron.parse::<cron::Schedule>()
        .map_err(|e| AppError::Validation(format!("Invalid cron expression: {e}")))?;
    let ping_token = generate_ping_token();
    Ok(Monitor {
        id: Uuid::new_v4(), team_id, name: name.to_string(),
        ping_token, schedule_cron: schedule_cron.to_string(),
        grace_period_seconds, timezone: timezone.to_string(),
        status: "active".to_string(), last_ping_at: None,
        notification_config: None, created_at: Utc::now(),
    })
}

The cron expression validation happens immediately. If someone passes 0 2 * * (four fields instead of five) or 60 2 * * * (minute 60 does not exist), the error is returned before the monitor is created. This is a pattern we use throughout 0cron: validate at the boundary, not in the processing pipeline. By the time data enters the database, it is known to be well-formed.
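To make the boundary-validation idea concrete, here is a simplified, std-only stand-in for the check. The real code delegates the full grammar to the cron crate; this sketch only demonstrates the rejection of the two malformed cases mentioned above:

```rust
// Simplified stand-in for boundary validation (the production code uses the
// `cron` crate): reject expressions without five whitespace-separated fields,
// and reject a numeric minute field outside 0-59.
fn quick_check(expr: &str) -> Result<(), String> {
    let fields: Vec<&str> = expr.split_whitespace().collect();
    if fields.len() != 5 {
        return Err(format!("expected 5 fields, got {}", fields.len()));
    }
    if let Ok(min) = fields[0].parse::<u8>() {
        if min > 59 {
            return Err(format!("minute {min} does not exist"));
        }
    }
    Ok(())
}

fn main() {
    assert!(quick_check("0 2 * * *").is_ok());
    assert!(quick_check("0 2 * *").is_err()); // four fields
    assert!(quick_check("60 2 * * *").is_err()); // minute 60
    println!("boundary validation rejects malformed expressions");
}
```

Because the check runs before anything is constructed, a malformed expression never reaches the Monitor struct, let alone the database.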

The function returns a Monitor struct rather than inserting directly. The caller (the HTTP handler) handles the database insert, which keeps the domain logic testable without requiring a database connection.

Token Generation

The ping token is the security boundary. Anyone who knows the token can ping the monitor, and anyone who cannot guess the token cannot send false pings. This is security through entropy, and it works because the token space is astronomically large.

fn generate_ping_token() -> String {
    use rand::Rng;
    let mut rng = rand::thread_rng();
    let bytes: Vec<u8> = (0..32).map(|_| rng.gen()).collect();
    hex::encode(bytes)
}

32 random bytes, hex-encoded to 64 characters. That is 256 bits of entropy. For context, there are approximately 10^77 possible tokens -- within a few orders of magnitude of the estimated number of atoms in the observable universe (around 10^80). Brute-forcing a ping token is not a practical attack vector.

We use rand::thread_rng(), a cryptographically secure generator that is seeded (and periodically reseeded) from the operating system's entropy source -- on Linux, /dev/urandom via the getrandom syscall. This is not a hand-rolled PRNG with a guessable seed. It draws on the same entropy source used for TLS key generation.

The hex encoding is deliberate. Tokens appear in URLs (/v1/ping/abc123...), so they must be URL-safe. Hex encoding produces only [0-9a-f] characters, which never need percent-encoding. We could use base64url for shorter tokens, but 64 hex characters are still reasonable for a URL, and hex is simpler to debug when looking at logs.
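The encoding step is small enough to show without the hex crate. This std-only sketch mirrors what hex::encode produces -- two lowercase hex characters per byte, so 32 bytes become a 64-character, URL-safe string:

```rust
// Std-only mirror of `hex::encode`: each byte maps to two lowercase hex
// characters, so 32 bytes always yield a 64-character token drawn from
// [0-9a-f] -- no character ever needs percent-encoding in a URL.
fn encode_hex(bytes: &[u8]) -> String {
    bytes.iter().map(|b| format!("{b:02x}")).collect()
}

fn main() {
    let bytes = [0u8; 32]; // stand-in for 32 random bytes
    let token = encode_hex(&bytes);
    assert_eq!(token.len(), 64);
    assert!(token.chars().all(|c| c.is_ascii_hexdigit()));
    println!("{token}");
}
```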

Recording Pings

When a job completes and hits the ping endpoint, the handler calls record_ping. This is intentionally the simplest possible operation.

pub async fn record_ping(ping_token: &str, db: &PgPool) -> AppResult<()> {
    let now = Utc::now();
    let result = sqlx::query("UPDATE monitors SET last_ping_at = $1 WHERE ping_token = $2")
        .bind(now).bind(ping_token).execute(db).await?;
    if result.rows_affected() == 0 {
        return Err(AppError::NotFound(format!("Monitor with token '{ping_token}' not found")));
    }
    Ok(())
}

One SQL statement. Update the last_ping_at timestamp where the token matches. If no rows were affected, the token does not exist.

This simplicity is a design choice. The ping endpoint is the most performance-sensitive path in the monitoring system. Your jobs call it after every execution, potentially thousands of times per day across all users. It needs to be fast and reliable. A single UPDATE statement with an indexed column (ping_token has a UNIQUE constraint, which implies an index) completes in microseconds.

We do not record ping history (yet). We do not store the HTTP headers or body of the ping request. We do not track ping frequency or compute statistics. All we do is record the most recent ping timestamp. This is sufficient for the core use case -- detecting silence -- and avoids the storage and complexity overhead of a full ping audit log.

The API endpoint that calls this function is GET /v1/ping/{token}. Yes, GET, not POST. This is another deliberate decision. GET requests are the simplest possible HTTP request. They work with curl https://0cron.dev/v1/ping/TOKEN -- no flags, no body, no content-type header. They work with wget. They work with a bare HTTP client in any language. They even work by pasting the URL in a browser. The lower the friction for pinging, the more likely users are to integrate it.

Detecting Overdue Monitors

The heart of the system is the overdue check. A background task runs periodically (every minute) and queries for monitors that should have been pinged but were not.

pub async fn check_monitors(db: &PgPool) -> AppResult<Vec<Monitor>> {
    let overdue = sqlx::query_as::<_, Monitor>(
        "SELECT * FROM monitors WHERE status = 'active'
         AND last_ping_at IS NOT NULL
         AND last_ping_at + (grace_period_seconds || ' seconds')::interval < NOW()",
    ).fetch_all(db).await?;
    Ok(overdue)
}

This query is worth unpacking because it does something clever with PostgreSQL's interval arithmetic.

The expression last_ping_at + (grace_period_seconds || ' seconds')::interval constructs a timestamp representing "the last ping time plus the grace period." If a monitor's last_ping_at is 2026-03-11 02:03:00 and its grace_period_seconds is 300, this evaluates to 2026-03-11 02:08:00. If the current time (NOW()) is past that deadline, the monitor is overdue.

The grace_period_seconds || ' seconds' concatenation builds the string '300 seconds', which PostgreSQL casts to an interval type. This is idiomatic PostgreSQL -- interval arithmetic is a first-class feature, not a hack. The database handles timezone conversions, leap seconds, and all the edge cases that would be error-prone in application code.
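If the same predicate lived in application code instead of SQL, it would look like the sketch below. This is only an illustration of the arithmetic -- the production check stays in PostgreSQL, which is why the Rust side never needs code like this:

```rust
use std::time::{Duration, SystemTime};

// Application-side mirror of the SQL predicate
// `last_ping_at + (grace_period_seconds || ' seconds')::interval < NOW()`.
// A monitor is overdue once `now` passes the last ping plus the grace window.
fn is_overdue(last_ping: SystemTime, grace_secs: u64, now: SystemTime) -> bool {
    match last_ping.checked_add(Duration::from_secs(grace_secs)) {
        Some(deadline) => now > deadline,
        None => false, // timestamp overflow: treat as not overdue
    }
}

fn main() {
    let ping = SystemTime::UNIX_EPOCH + Duration::from_secs(1_000);
    // 400 seconds have elapsed against a 300-second grace period: overdue.
    assert!(is_overdue(ping, 300, ping + Duration::from_secs(400)));
    // 200 seconds elapsed: still inside the grace window.
    assert!(!is_overdue(ping, 300, ping + Duration::from_secs(200)));
    println!("deadline arithmetic matches the SQL predicate");
}
```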

The last_ping_at IS NOT NULL clause is the "never pinged" guard we mentioned earlier. A new monitor that has not yet received its first ping is not considered overdue. This prevents false alerts during setup. Once the first ping arrives, the clock starts ticking.

The status = 'active' filter excludes paused monitors. Users can pause monitoring during maintenance windows without triggering alerts.

The Use Cases

Heartbeat monitoring is simple in concept but broad in application. Here are the scenarios we designed for.

Backup jobs. Your database backup runs via crontab on your own server. You add curl https://0cron.dev/v1/ping/TOKEN as the last line of the script. If the backup fails, crashes, or the server goes down entirely, the curl never executes, and 0cron alerts you. This catches failures that silent cron jobs hide -- the backup has been failing for two weeks, but nobody checked because cron does not notify on absence.
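In crontab form, the whole integration is one line. TOKEN stands in for a real ping token, and the curl flags shown are one reasonable choice rather than anything 0cron prescribes:

```shell
# Run the backup at 2am; ping 0cron only if the backup script exited 0.
0 2 * * * /usr/local/bin/backup.sh && curl -fsS --retry 3 https://0cron.dev/v1/ping/TOKEN > /dev/null
```

Chaining with && means a failed backup also suppresses the ping, so nonzero exits and outright crashes both surface the same way: as silence.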

CI/CD pipelines. A deployment pipeline should complete within 30 minutes. Add a ping at the end of the pipeline, set the grace period to 1800 seconds. If a deployment hangs on a flaky test, a stuck container pull, or a deadlocked migration, the lack of a ping triggers an alert before users notice the stale deployment.

External service health. You rely on a third-party data provider that should deliver files to your SFTP server every 6 hours. Your processing script runs after ingestion and pings 0cron when complete. If the provider fails to deliver, your script never runs, and 0cron alerts you. You now have monitoring for a service you do not control.

Scheduled tasks in application frameworks. Django has management commands. Rails has rake tasks. Laravel has scheduled commands. These run inside the application server and can fail silently if the process crashes, the queue backs up, or a dependency times out. Adding a ping at the end of each task creates an external watchdog that does not depend on the application's own health.

IoT and edge devices. A sensor node should report every 15 minutes. If it goes silent, the battery may have died, the network may be down, or the firmware may have crashed. The ping-and-silence pattern works for any system that should communicate regularly.

Grace Periods: The Art of Tolerance

The grace period is the difference between a useful monitor and an annoying one. Without it, a job that runs at 2:00am and takes 4 minutes would trigger an alert at 2:01am. That is not a failure -- that is normal execution time.

The default grace period of 300 seconds (5 minutes) works for most short-running jobs. But the right value depends on the job.

For a quick health check that should complete in seconds, 60 seconds is appropriate. For a database backup that processes gigabytes, 3600 seconds (one hour) might be needed. For a weekly report that aggregates a month of data, you might set it even higher.

The grace period is not just about execution time. It also absorbs network jitter, DNS resolution delays, and transient outages in the ping path itself. If 0cron's ping endpoint is briefly unreachable (during a deployment, for example), a generous grace period prevents false alerts.

We deliberately did not implement "smart" grace periods that auto-adjust based on historical ping timing. That kind of feature sounds appealing but introduces unpredictability. When an ops engineer sets a 5-minute grace period, they want exactly 5 minutes. Adaptive thresholds that silently change would make incident response harder -- "why didn't the alert fire at 5 minutes like I configured?" is not a question anyone wants to answer at 3am.

The Notification Pipeline

When check_monitors returns a non-empty list of overdue monitors, each one enters the same notification pipeline used by job execution alerts. The notification_config JSONB field on each monitor specifies which channels to use.

A monitor might be configured to send alerts to Slack for the on-call team, email for the manager, and a webhook for the incident management system. The notification system fans out across all configured channels in parallel. This reuse is one of the benefits of building multi-channel notifications as a standalone subsystem (covered in article 5) rather than coupling it to a specific feature.

The alert message includes the monitor name, how long it has been since the last ping, what the expected schedule is, and a direct link to the monitor's detail page in the dashboard. Enough context to diagnose without logging into 0cron.

What We Did Not Build (Yet)

Heartbeat monitoring at 105 lines is deliberately minimal. Here is what we chose to defer.

Ping history. Currently, we only store last_ping_at. A full history would enable trend analysis (is this job getting slower over time?), uptime calculations, and retrospective incident reports. This is planned but not essential for launch.

Ping payload processing. The GET endpoint accepts no body. A future enhancement would let POST pings include exit codes, execution times, or error messages. The monitor could then distinguish between "the job ran but failed" and "the job did not run at all."

Automatic monitor creation from jobs. If you create a cron job in 0cron, we could automatically create a corresponding heartbeat monitor with matching schedule. This would give users push-based monitoring of their pull-based jobs -- double coverage with zero configuration.

Escalation policies. The current system alerts once when a monitor becomes overdue. A production-grade monitoring system would escalate: first Slack, then email, then phone call. This requires a time-based state machine for each alert, which adds significant complexity.

These are all good features. None of them are necessary for the core value proposition: tell me when my job stops running. The 105-line implementation delivers that value. Everything else is enhancement.

105 Lines

Let us put this in perspective. The entire heartbeat monitoring feature -- data model, token generation, ping recording, overdue detection -- is 105 lines of Rust. Not 105 lines excluding tests (there are no tests yet). Not 105 lines excluding comments. One hundred and five lines, total.

This is possible because we made aggressive scoping decisions. One timestamp instead of a history table. GET instead of POST. Fixed grace periods instead of adaptive thresholds. A flat query instead of a background state machine. Each decision removed complexity without removing value.

It is also possible because Rust's type system and PostgreSQL's interval arithmetic handle the hard parts. The cron expression is validated by the cron crate. The timestamp arithmetic is done by PostgreSQL. The random token generation uses the OS entropy source. Our code is glue between well-tested building blocks.

When someone asks how a two-person team (one human CEO, one AI CTO) ships features this fast, this is the answer: choose the right tools, scope ruthlessly, and write only the code that your dependencies do not already handle.

---

This is article 8 of 10 in the "How We Built 0cron" series.

1. Why the World Needs a $2 Cron Job Service
2. 4 Agents, 1 Product: Building 0cron in a Single Session
3. Building a Cron Scheduler Engine in Rust
4. "Every Day at 9am": Natural Language Schedule Parsing
5. Multi-Channel Notifications: Email, Slack, Discord, Telegram, Webhooks
6. Stripe Integration for a $1.99/month SaaS
7. From Static HTML to SvelteKit Dashboard Overnight
8. Heartbeat Monitoring: When Your Job Should Ping You (you are here)
9. Encrypted Secrets, API Keys, and Security
10. From Abidjan to Production: Launching 0cron.dev
