1. Overview
A distributed job scheduler is a system that reliably executes tasks (jobs) at specified times or intervals across a cluster of machines. It powers background processing in virtually every large-scale application: cron jobs, ETL pipelines, report generation, cache warming, billing cycles, machine learning training, etc.
This design is directly inspired by production systems such as:
- Quartz + ZooKeeper (used by Twitter, LinkedIn)
- Celery Beat + Redis/RabbitMQ
- Apache Airflow (DAG-based)
- Kubernetes CronJobs + Controller
- Azkaban / Chronos (Mesos)
- Hangfire / Sidekiq Pro (Ruby ecosystem)
The system must provide exactly-once or at-least-once semantics, high availability, horizontal scalability, and precise scheduling even under failures, clock drift, and network partitions.
2. Functional Requirements
- Job types:
- One-time delayed jobs (run at 2025-12-31T23:59:59Z)
- Recurring jobs via cron expressions (* * * * * Unix cron or Quartz syntax)
- DAG workflows (job dependencies)
- API:
- schedule(job_definition) → job_id
- cancel(job_id), pause/resume, trigger_now(job_id)
- get_status(job_id) → PENDING | RUNNING | SUCCESS | FAILED | RETRIES_EXHAUSTED
- Execution guarantees:
- At-least-once by default (idempotent jobs recommended)
- Exactly-once via deduplication (job_id + version)
- Retries & timeouts: Exponential backoff, max retries, dead-letter queue.
- Observability: Metrics (queue depth, latency, success rate), tracing, logs.
3. Non-Functional Requirements
- Scale: 10M+ active jobs, 100K+ executions/second peak.
- Latency: Jobs fire within ±1 second of scheduled time (p99).
- Availability: 99.99% — no missed jobs even during node/region failures.
- Durability: No job loss on crash (persistent storage).
- Clock tolerance: Handle clock skew up to 30 seconds.
4. Capacity Estimation
- 1 billion recurring jobs (average large company)
- Average frequency: 1 job/min → ~700 executions/second average, 10K/sec peak
- Storage: Job metadata ~1 KB/job → 1 TB total
- Queue throughput: 10K–50K messages/sec → Kafka/RabbitMQ/NATS easily handles
5. High-Level Architecture (Production-Grade 2025)

6. Core Design Decisions
6.1 Leader-Based Scheduling (Recommended for Precision)
- Single active leader (elected via Raft in etcd/CockroachDB or ZooKeeper)
- Leader maintains in-memory timer wheel/min-heap of next execution times
- On becoming leader:
- Load all recurring jobs from Config DB
- Compute next run time for each
- Insert into timer structure
- When timer fires → publish job execution message to Queue
- Followers keep warm copy of timer state (for <1s failover)
Advantages: Perfect ±1s accuracy, no duplicate executions Alternative (decentralized): Shard time buckets via consistent hashing → more complex, risk of duplicates/misses
6.2 Job Persistence & State
- PostgreSQL/CockroachDB table:
jobs (
job_id UUID PRIMARY KEY,
cron_expr TEXT,
next_run_at TIMESTAMPTZ,
status TEXT,
payload JSONB,
version BIGINT, -- for optimistic locking
last_execution_result JSONB
)- On job completion → worker updates next_run_at (computed from cron) + status
6.3 Dispatch Queue
- Kafka topic job-executions partitioned by job_id % 128
- Message: {job_id, execution_id, scheduled_time, payload}
- Workers use consumer groups → multiple workers per partition for parallelism
6.4 Exactly-Once Semantics
- Worker:
- Pull message
- Insert executions row with status RUNNING (unique constraint on execution_id)
- Execute job (idempotent!)
- Update row to SUCCESS/FAILED
- If step 2 fails → another worker will pick it up (at-least-once)
- Idempotency key in payload ensures safe retries
6.5 Clock Skew & Precision
- All times in UTC
- Leader uses logical clock from Config DB (not wall clock)
- When job fires, include scheduled_for timestamp → worker logs drift
6.6 Failure Scenarios Handled
- Leader dies → new leader elected in <2s, reloads jobs from DB → may miss jobs in that window → mitigated by workers checking for missed recurring jobs on startup
- Worker dies mid-execution → timeout (default 30min) → re-queue
- Queue partition unavailable → Kafka guarantees durability
7. Detailed Scheduling Flow Example
Job: “Send daily report” → cron 0 9 * * * (9 AM daily)
- API receives schedule request → inserts into Config DB with next_run_at = 2025-12-03T09:00:00Z
- Leader polls DB (or watches via CHANGELOG) → inserts into timer wheel
- At 2025-12-03T08:59:59Z → timer fires → publish to Kafka
- Worker pulls message → executes Python script → updates DB: next_run_at = 2025-12-04T09:00:00Z, status=SUCCESS
- Leader sees update → re-inserts next run time
8. Scaling & Performance Summary
| Component | Technology | Scaling Strategy | Expected Performance |
|---|---|---|---|
| Leader Election | etcd/Raft or ZooKeeper | Built-in HA | <2s failover |
| Timer | In-memory wheel + DB backup | Single leader (sufficient) | 10M+ jobs in memory |
| Queue | Kafka/Pulsar | Add brokers + partitions | 100K+ msg/sec |
| Workers | Kubernetes Deployment | HPA on queue lag/CPU | 1000+ pods → millions jobs/day |
| DB | CockroachDB/PostgreSQL | Sharded by job_id hash | 50K writes/sec |
This architecture is used by virtually every major tech company in 2025 (Uber, Airbnb, Netflix, etc.) and provides the optimal balance of precision, reliability, and operational simplicity for distributed cron-like scheduling at global scale.




