PostgreSQL as Our Job Queue: Why We Ditched Redis for Good
We process 30+ job types through PostgreSQL using SELECT FOR UPDATE SKIP LOCKED: the actual code, the retry math, our dead letter queue, and where it finally breaks down.
We were running managed Redis on a cloud provider whose name I won't say publicly. Standard setup. Jobs going in, workers pulling them out. Fine for months.
Then it started dropping connections. Not crashing... just dropping them. Jobs would disappear. Workers would time out. We'd restart the service, things would work for a bit, then the same thing again.
We opened a support ticket. Three days of back and forth. Their conclusion: "We recommend restarting your Redis instance."
That was it. That was their answer.
We had already done that. Multiple times. While they were typing that response.
So we did what any sensible team does after watching a managed service eat jobs and generate useless support tickets. We looked at what we already had running... and we already had PostgreSQL.
The math was simple
We process 30+ job types. Email sends, invoice generation, domain provisioning, SSL certificate renewals, webhook deliveries, analytics aggregation. The list keeps growing.
Redis would have been fine if it just... worked reliably. But managed Redis is a black box. When it breaks, it really breaks, and you're at the mercy of whoever manages it.
PostgreSQL we understand. We know its failure modes. We have visibility into its internals. We can query it. We can debug it at 2am without waiting three business days for a support response that tells us to restart the thing we already restarted.
One less moving part. That was the goal.
| | Redis (managed) | PostgreSQL |
|---|---|---|
| Debugging | Black box — wait for support | Full visibility, psql, EXPLAIN |
| Failure mode | Silent job loss | Transaction rollback, nothing lost |
| Extra infra | Separate service + monitoring | Already running |
| Concurrent workers | Native (but opaque) | SELECT FOR UPDATE SKIP LOCKED |
| Pub/sub / fanout | Built-in, fast | Polling — not great |
| Throughput ceiling | Very high | ~5k jobs/min before friction |
The schema
Nothing fancy here. This is the table:
```sql
CREATE TABLE queued_jobs (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    job_type TEXT NOT NULL,
    payload JSONB NOT NULL DEFAULT '{}',
    status TEXT NOT NULL DEFAULT 'pending'
        CHECK (status IN ('pending', 'processing', 'completed', 'failed')),
    priority INT NOT NULL DEFAULT 0,
    attempts INT NOT NULL DEFAULT 0,
    max_attempts INT NOT NULL DEFAULT 3,
    last_error TEXT,
    scheduled_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    attempted_at TIMESTAMPTZ,
    created_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
);

CREATE INDEX idx_queued_jobs_fetch
    ON queued_jobs (status, scheduled_at, priority DESC)
    WHERE status = 'pending';
```

The partial index on status = 'pending' matters once the table grows. Your workers only care about pending jobs... don't make them scan two million completed ones.
We learned this the hard way. The index was an afterthought, added only after query times started climbing past acceptable levels. It should have been there from day one.
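Enqueueing needs nothing special; it's a plain INSERT, with the column defaults filling in the rest. A quick sketch (the job type and payload here are made up for illustration):

```sql
-- Enqueue a job. Anything not specified falls back to the defaults:
-- status 'pending', 0 attempts, scheduled_at = NOW().
INSERT INTO queued_jobs (job_type, payload, priority)
VALUES (
    'webhook_delivery',
    '{"url": "https://example.com/hook", "event": "invoice.paid"}',
    5
);
```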
The dequeue
SELECT FOR UPDATE SKIP LOCKED is what makes concurrent workers possible without them fighting each other:
```sql
BEGIN;

WITH next_job AS (
    SELECT id FROM queued_jobs
    WHERE status = 'pending'
      AND scheduled_at <= NOW()
    ORDER BY priority DESC, created_at ASC
    FOR UPDATE SKIP LOCKED
    LIMIT 1
)
UPDATE queued_jobs
SET status = 'processing',
    attempted_at = NOW(),
    attempts = attempts + 1
WHERE id = (SELECT id FROM next_job)
RETURNING *;

COMMIT;
```

SKIP LOCKED means: if another worker already grabbed this row, skip it. Don't wait. Don't block. Move to the next one. Ten workers can run this query simultaneously without stepping on each other.
In Go, the worker fetch:
```go
func (w *Worker) Fetch(ctx context.Context) (*Job, error) {
    tx, err := w.db.BeginTx(ctx, &sql.TxOptions{
        Isolation: sql.LevelReadCommitted,
    })
    if err != nil {
        return nil, err
    }
    defer tx.Rollback()

    var job Job
    err = tx.QueryRowContext(ctx, `
        WITH next_job AS (
            SELECT id FROM queued_jobs
            WHERE status = 'pending'
              AND scheduled_at <= NOW()
            ORDER BY priority DESC, created_at ASC
            FOR UPDATE SKIP LOCKED
            LIMIT 1
        )
        UPDATE queued_jobs
        SET status = 'processing',
            attempted_at = NOW(),
            attempts = attempts + 1
        WHERE id = (SELECT id FROM next_job)
        RETURNING id, job_type, payload, attempts, max_attempts, created_at
    `).Scan(
        &job.ID, &job.Type, &job.Payload,
        &job.Attempts, &job.MaxAttempts, &job.CreatedAt,
    )

    if errors.Is(err, sql.ErrNoRows) {
        return nil, nil
    }
    if err != nil {
        return nil, err
    }

    return &job, tx.Commit()
}
```

We use READ COMMITTED isolation, not SERIALIZABLE. The row lock is enough. Serializable adds overhead for zero benefit in this pattern.
Retry math
When a job fails, we don't immediately mark it dead. We reschedule with exponential backoff plus jitter:
```go
func (w *Worker) Fail(ctx context.Context, jobID string, jobErr error) error {
    _, err := w.db.ExecContext(ctx, `
        UPDATE queued_jobs
        SET status = CASE
                WHEN attempts >= max_attempts THEN 'failed'
                ELSE 'pending'
            END,
            scheduled_at = CASE
                WHEN attempts >= max_attempts THEN scheduled_at
                ELSE NOW() + (
                    INTERVAL '1 second' *
                    POW(2, attempts) *
                    (0.5 + RANDOM() * 0.5)
                )
            END,
            last_error = $2
        WHERE id = $1
    `, jobID, jobErr.Error())
    return err
}
```

The jitter (0.5 + RANDOM() * 0.5) is not optional. Without it, all the failed jobs of the same type retry at the exact same second. You get a thundering herd. Lock contention spikes. You've created a new problem.
With jitter, attempt 1 retries somewhere between 1 and 2 seconds out, attempt 2 between 2 and 4, attempt 3 between 4 and 8. It fans out naturally. The retries in your metrics should spread into a pyramid, not stack up into a cliff face.
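For completeness, the success path never shows up above. It's a single UPDATE; a minimal version (the method name here is ours) would be:

```go
// Mark a job as done. Completed rows stay in the table; the partial
// index keeps them out of the fetch path.
func (w *Worker) Complete(ctx context.Context, jobID string) error {
    _, err := w.db.ExecContext(ctx, `
        UPDATE queued_jobs SET status = 'completed' WHERE id = $1
    `, jobID)
    return err
}
```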
The dead letter queue
Jobs that exhaust their retries go to a separate table:
```sql
CREATE TABLE dead_letter_jobs (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    original_job_id UUID NOT NULL,
    job_type TEXT NOT NULL,
    payload JSONB NOT NULL,
    attempts INT NOT NULL,
    last_error TEXT,
    failed_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
);
```

We move them in a transaction alongside the status update, so the failure context is never lost:
```go
func (w *Worker) MoveToDLQ(ctx context.Context, job *Job, jobErr error) error {
    tx, err := w.db.BeginTx(ctx, nil)
    if err != nil {
        return err
    }
    defer tx.Rollback()

    _, err = tx.ExecContext(ctx, `
        INSERT INTO dead_letter_jobs
            (original_job_id, job_type, payload, attempts, last_error)
        VALUES ($1, $2, $3, $4, $5)
    `, job.ID, job.Type, job.Payload, job.Attempts, jobErr.Error())
    if err != nil {
        return err
    }

    _, err = tx.ExecContext(ctx, `
        UPDATE queued_jobs SET status = 'failed' WHERE id = $1
    `, job.ID)
    if err != nil {
        return err
    }

    return tx.Commit()
}
```

Why a separate table instead of just leaving them in queued_jobs with status = 'failed'? Because you want to query dead jobs without touching the hot table. Keep the main queue clean. Query the dead ones separately.
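It also makes replays straightforward. We're not showing our exact replay query here, but a sketch of re-queueing a dead job is just an INSERT ... SELECT:

```sql
-- Re-enqueue a dead-lettered job as a fresh pending job with a reset
-- attempt counter. $1 is the dead_letter_jobs.id to replay.
INSERT INTO queued_jobs (job_type, payload)
SELECT job_type, payload
FROM dead_letter_jobs
WHERE id = $1;
```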
Where it breaks
Let me be honest about the ceiling.
Around 5k jobs/minute is where we start watching the metrics closely. Lock contention rises. The SELECT FOR UPDATE starts taking longer. You can push further with more workers, but at some point you're fighting the database.
Long-running jobs are the real issue. If you hold the row lock for the whole job, a 5-minute job means a 5-minute open transaction, and PostgreSQL does not like long open transactions: autovacuum falls behind, tables bloat. Our Fetch commits as soon as the job is marked processing, which trades that problem for a different one: a worker crash leaves the job stuck in processing with no automatic recovery.
We handle stuck jobs with a recovery cron that runs every minute:
```sql
UPDATE queued_jobs
SET status = 'pending',
    scheduled_at = NOW() + INTERVAL '30 seconds'
WHERE status = 'processing'
  AND attempted_at < NOW() - INTERVAL '10 minutes';
```

Blunt. But it works.
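If you'd rather keep that schedule inside the database and the pg_cron extension happens to be available (an assumption; a plain external cron works just as well), the same statement can be registered there:

```sql
-- Assumes pg_cron is installed. Runs the recovery statement every minute.
SELECT cron.schedule(
    'requeue-stuck-jobs',
    '* * * * *',
    $$
    UPDATE queued_jobs
    SET status = 'pending',
        scheduled_at = NOW() + INTERVAL '30 seconds'
    WHERE status = 'processing'
      AND attempted_at < NOW() - INTERVAL '10 minutes'
    $$
);
```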
The one job type we moved back to Redis: real-time WebSocket fanout. When a user action needs to broadcast to hundreds of connected clients in under 100ms, PostgreSQL polling isn't the right tool. Redis pub/sub is. We kept that specific case on Redis and moved everything else off.
What we'd do differently
The partial index: day one. Not after you hit two million rows and start wondering why fetches are slowing down.
We also underestimated how much value comes from being able to just... query the queue. "How many invoice jobs are stuck in processing right now?" is two seconds in psql. On Redis it would have been a custom script.
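For example (the job_type value is illustrative; use whatever you named yours):

```sql
-- The kind of ad-hoc question that costs nothing to answer in psql.
SELECT count(*)
FROM queued_jobs
WHERE job_type = 'invoice_generation'
  AND status = 'processing';
```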
The architecture is boring. That's the point.
One less service to monitor, one less managed product to trust blindly, one fewer support ticket answered with "have you tried restarting it."
PostgreSQL handles it. So we let it.