PostgreSQL as Our Job Queue: Why We Ditched Redis for Good
We process 30+ job types through PostgreSQL using SELECT FOR UPDATE SKIP LOCKED: the actual code, the retry math, our dead letter queue, and where it finally breaks down.
We were running managed Redis on a cloud provider whose name I won't say publicly. Standard setup. Jobs going in, workers pulling them out. Fine for months.
Then it started dropping connections. Not crashing... just dropping them. Jobs would disappear. Workers would time out. We'd restart the service, things would work for a bit, then the same thing again.
We opened a support ticket. Three days of back and forth. Their conclusion: "We recommend restarting your Redis instance."
That was it. That was their answer.
We had already done that. Multiple times. While they were typing that response.
So we did what any sensible team does after watching a managed service eat jobs and generate useless support tickets. We looked at what we already had running... and we already had PostgreSQL.
The math was simple
We process 30+ job types. Email sends, invoice generation, domain provisioning, SSL certificate renewals, webhook deliveries, analytics aggregation. The list keeps growing.
Redis would have been fine if it just... worked reliably. But managed Redis is a black box. When it breaks, it really breaks, and you're at the mercy of whoever manages it.
PostgreSQL we understand. We know its failure modes. We have visibility into its internals. We can query it. We can debug it at 2am without waiting three business days for a support response that tells us to restart the thing we already restarted.
One less moving part. That was the goal.
| | Redis (managed) | PostgreSQL |
|---|---|---|
| Debugging | Black box — wait for support | Full visibility, psql, EXPLAIN |
| Failure mode | Silent job loss | Transaction rollback, nothing lost |
| Extra infra | Separate service + monitoring | Already running |
| Concurrent workers | Native (but opaque) | SELECT FOR UPDATE SKIP LOCKED |
| Pub/sub / fanout | Built-in, fast | Polling — not great |
| Throughput ceiling | Very high | ~5k jobs/min before friction |
The schema
Nothing fancy here. This is the table:
```sql
CREATE TABLE queued_jobs (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    job_type TEXT NOT NULL,
    payload JSONB NOT NULL DEFAULT '{}',
    status TEXT NOT NULL DEFAULT 'pending'
        CHECK (status IN ('pending', 'processing', 'completed', 'failed')),
    priority INT NOT NULL DEFAULT 0,
    attempts INT NOT NULL DEFAULT 0,
    max_attempts INT NOT NULL DEFAULT 3,
    last_error TEXT,
    scheduled_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    attempted_at TIMESTAMPTZ,
    created_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
);

CREATE INDEX idx_queued_jobs_fetch
    ON queued_jobs (status, scheduled_at, priority DESC)
    WHERE status = 'pending';
```

The partial index on status = 'pending' matters once the table grows. Your workers only care about pending jobs... don't make them scan two million completed ones.
We learned this the hard way. The index was an afterthought, added only after query times started climbing past acceptable levels. It should have been there from day one.
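Enqueueing needs nothing special; it's a plain INSERT, with the column defaults filling in the rest. A quick sketch (the job type and payload here are made up for illustration):

```sql
-- Enqueue a job. Anything not specified falls back to the defaults:
-- status 'pending', 0 attempts, scheduled_at = NOW().
INSERT INTO queued_jobs (job_type, payload, priority)
VALUES (
    'webhook_delivery',
    '{"url": "https://example.com/hook", "event": "invoice.paid"}',
    5
);
```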
The dequeue
SELECT FOR UPDATE SKIP LOCKED is what makes concurrent workers possible without them fighting each other:
```sql
BEGIN;

WITH next_job AS (
    SELECT id FROM queued_jobs
    WHERE status = 'pending'
      AND scheduled_at <= NOW()
    ORDER BY priority DESC, created_at ASC
    FOR UPDATE SKIP LOCKED
    LIMIT 1
)
UPDATE queued_jobs
SET status = 'processing',
    attempted_at = NOW(),
    attempts = attempts + 1
WHERE id = (SELECT id FROM next_job)
RETURNING *;

COMMIT;
```

SKIP LOCKED means: if another worker already grabbed this row, skip it. Don't wait. Don't block. Move to the next one. Ten workers can run this query simultaneously without stepping on each other.
In Go, the worker fetch:
```go
func (w *Worker) Fetch(ctx context.Context) (*Job, error) {
    tx, err := w.db.BeginTx(ctx, &sql.TxOptions{
        Isolation: sql.LevelReadCommitted,
    })
    if err != nil {
        return nil, err
    }
    defer tx.Rollback()

    var job Job
    err = tx.QueryRowContext(ctx, `
        WITH next_job AS (
            SELECT id FROM queued_jobs
            WHERE status = 'pending'
              AND scheduled_at <= NOW()
            ORDER BY priority DESC, created_at ASC
            FOR UPDATE SKIP LOCKED
            LIMIT 1
        )
        UPDATE queued_jobs
        SET status = 'processing',
            attempted_at = NOW(),
            attempts = attempts + 1
        WHERE id = (SELECT id FROM next_job)
        RETURNING id, job_type, payload, attempts, max_attempts, created_at
    `).Scan(
        &job.ID, &job.Type, &job.Payload,
        &job.Attempts, &job.MaxAttempts, &job.CreatedAt,
    )

    if errors.Is(err, sql.ErrNoRows) {
        return nil, nil
    }
    if err != nil {
        return nil, err
    }

    return &job, tx.Commit()
}
```

We use READ COMMITTED isolation, not SERIALIZABLE. The row lock is enough. Serializable adds overhead for zero benefit in this pattern.
Retry math
When a job fails, we don't immediately mark it dead. We reschedule with exponential backoff plus jitter:
```go
func (w *Worker) Fail(ctx context.Context, jobID string, jobErr error) error {
    _, err := w.db.ExecContext(ctx, `
        UPDATE queued_jobs
        SET status = CASE
                WHEN attempts >= max_attempts THEN 'failed'
                ELSE 'pending'
            END,
            scheduled_at = CASE
                WHEN attempts >= max_attempts THEN scheduled_at
                ELSE NOW() + (
                    INTERVAL '1 second' *
                    POW(2, attempts) *
                    (0.5 + RANDOM() * 0.5)
                )
            END,
            last_error = $2
        WHERE id = $1
    `, jobID, jobErr.Error())
    return err
}
```

The jitter (0.5 + RANDOM() * 0.5) is not optional. Without it, all the failed jobs of the same type retry at the exact same second. You get a thundering herd. Lock contention spikes. You've created a new problem.
With jitter, attempt 1 retries somewhere between 1 and 2 seconds out, attempt 2 between 2 and 4, attempt 3 between 4 and 8. It fans out naturally. The retries in your metrics should spread into a pyramid, not stack up into a cliff face.
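For completeness, the success path never shows up above. It's a single UPDATE; a minimal version (the method name here is ours) would be:

```go
// Mark a job as done. Completed rows stay in the table; the partial
// index keeps them out of the fetch path.
func (w *Worker) Complete(ctx context.Context, jobID string) error {
    _, err := w.db.ExecContext(ctx, `
        UPDATE queued_jobs SET status = 'completed' WHERE id = $1
    `, jobID)
    return err
}
```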
The dead letter queue
Jobs that exhaust their retries go to a separate table:
```sql
CREATE TABLE dead_letter_jobs (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    original_job_id UUID NOT NULL,
    job_type TEXT NOT NULL,
    payload JSONB NOT NULL,
    attempts INT NOT NULL,
    last_error TEXT,
    failed_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
);
```

We move them in a transaction alongside the status update, so the failure context is never lost:
```go
func (w *Worker) MoveToDLQ(ctx context.Context, job *Job, jobErr error) error {
    tx, err := w.db.BeginTx(ctx, nil)
    if err != nil {
        return err
    }
    defer tx.Rollback()

    _, err = tx.ExecContext(ctx, `
        INSERT INTO dead_letter_jobs
            (original_job_id, job_type, payload, attempts, last_error)
        VALUES ($1, $2, $3, $4, $5)
    `, job.ID, job.Type, job.Payload, job.Attempts, jobErr.Error())
    if err != nil {
        return err
    }

    _, err = tx.ExecContext(ctx, `
        UPDATE queued_jobs SET status = 'failed' WHERE id = $1
    `, job.ID)
    if err != nil {
        return err
    }

    return tx.Commit()
}
```

Why a separate table instead of just leaving them in queued_jobs with status = 'failed'? Because you want to query dead jobs without touching the hot table. Keep the main queue clean. Query the dead ones separately.
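It also makes replays straightforward. We're not showing our exact replay query here, but a sketch of re-queueing a dead job is just an INSERT ... SELECT:

```sql
-- Re-enqueue a dead-lettered job as a fresh pending job with a reset
-- attempt counter. $1 is the dead_letter_jobs.id to replay.
INSERT INTO queued_jobs (job_type, payload)
SELECT job_type, payload
FROM dead_letter_jobs
WHERE id = $1;
```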
Where it breaks
Let me be honest about the ceiling.
Around 5k jobs/minute is where we start watching the metrics closely. Lock contention rises. The SELECT FOR UPDATE starts taking longer. You can push further with more workers, but at some point you're fighting the database.
Long-running jobs are the real issue. If you hold the row lock for the whole job, a 5-minute job means a 5-minute open transaction, and PostgreSQL does not like long open transactions: autovacuum falls behind, tables bloat. Our Fetch commits as soon as the job is marked processing, which trades that problem for a different one: a worker crash leaves the job stuck in processing with no automatic recovery.
We handle stuck jobs with a recovery cron that runs every minute:
```sql
UPDATE queued_jobs
SET status = 'pending',
    scheduled_at = NOW() + INTERVAL '30 seconds'
WHERE status = 'processing'
  AND attempted_at < NOW() - INTERVAL '10 minutes';
```

Blunt. But it works.
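If you'd rather keep that schedule inside the database and the pg_cron extension happens to be available (an assumption; a plain external cron works just as well), the same statement can be registered there:

```sql
-- Assumes pg_cron is installed. Runs the recovery statement every minute.
SELECT cron.schedule(
    'requeue-stuck-jobs',
    '* * * * *',
    $$
    UPDATE queued_jobs
    SET status = 'pending',
        scheduled_at = NOW() + INTERVAL '30 seconds'
    WHERE status = 'processing'
      AND attempted_at < NOW() - INTERVAL '10 minutes'
    $$
);
```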
The one job type we moved back to Redis: real-time WebSocket fanout. When a user action needs to broadcast to hundreds of connected clients in under 100ms, PostgreSQL polling isn't the right tool. Redis pub/sub is. We kept that specific case on Redis and moved everything else off.
What we'd do differently
The partial index: day one. Not after you hit two million rows and start wondering why fetches are slowing down.
We also underestimated how much value comes from being able to just... query the queue. "How many invoice jobs are stuck in processing right now?" is two seconds in psql. On Redis it would have been a custom script.
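For example (the job_type value is illustrative; use whatever you named yours):

```sql
-- The kind of ad-hoc question that costs nothing to answer in psql.
SELECT count(*)
FROM queued_jobs
WHERE job_type = 'invoice_generation'
  AND status = 'processing';
```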
The architecture is boring. That's the point.
One less service to monitor, one less managed product to trust blindly, one fewer support ticket answered with "have you tried restarting it."
PostgreSQL handles it. So we let it.