Despite the critique in my previous post, I still believe Redlock is a valuable algorithm, both as a practical tool and as an entry point into distributed systems thinking.

In my experience, its theoretical downsides rarely translate into real-world catastrophes. Why? Because engineers who choose Redis typically aren’t betting everything on it. They treat Redis as fast but fallible: a caching layer, a rate limiter. When you design with that mindset, you naturally build in safeguards such as database unique constraints, idempotency keys, and retry-safe operations. The failure modes Kleppmann describes are real, but they’re survivable in systems that don’t assume perfection.

Put differently: if your system would suffer catastrophic damage from a rare double-lock scenario, Redis probably wasn’t the right choice anyway. But for the vast majority of use cases, where “mostly correct, very fast” beats “perfectly correct, noticeably slower”, Redlock earns its place in your toolkit.

With that context in place, this post walks through how to implement Redlock using Go and Redis.

Why Redlock?

In the single-instance post, we explored how to implement a distributed lock using a single Redis instance. We addressed the question of what happens if the client crashes, but we left one critical question unanswered: what if Redis itself fails?

Antirez introduced Redlock to address this gap. Rather than relying on a single Redis instance, the algorithm uses multiple independent instances to achieve higher availability through quorum-based consensus. It’s also storage-agnostic: you can implement it with MySQL, PostgreSQL, MongoDB, or any data store that supports atomic conditional writes.

The algorithm treats the servers (Redis instances) as “dumb” and the client (which acquires and releases the lock) as “smart”, placing all coordination logic on the client side.

The Algorithm

The Redlock algorithm operates in three phases: acquire, validate, and release. Here’s the step-by-step breakdown:

Acquire Phase

  1. Generate a token: The client generates a unique random value (e.g., a UUID). This token serves as proof of ownership and prevents other clients from accidentally releasing the lock.

  2. Attempt to lock all instances: The client sends SET key token NX PX ttl to all N Redis instances, either sequentially or in parallel, where:

    • NX ensures the key is only set if it doesn’t already exist
    • PX ttl sets the expiration in milliseconds
  3. Track elapsed time: The client records the start time before attempting any locks. This is critical because lock validity decreases as acquisition takes longer.
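
For concreteness, here is a minimal sketch of the acquire phase using the go-redis v9 client and google/uuid for the token. acquireAll is a hypothetical helper written for this post, not part of any library; per-instance timeouts and parallel requests are omitted for brevity, and later sketches reuse these imports.

import (
    "context"
    "time"

    "github.com/google/uuid"
    "github.com/redis/go-redis/v9"
)

// acquireAll attempts SET key token NX PX ttl on every instance and reports
// how many instances were locked and how long acquisition took.
func acquireAll(ctx context.Context, clients []*redis.Client, key string, ttl time.Duration) (token string, acquired int, elapsed time.Duration) {
    token = uuid.NewString() // unique proof of ownership
    start := time.Now()      // record before the first attempt
    for _, client := range clients {
        // Equivalent to SET key token NX PX ttl
        ok, err := client.SetNX(ctx, key, token, ttl).Result()
        if err == nil && ok {
            acquired++
        }
    }
    return token, acquired, time.Since(start)
}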

Validate Phase

  1. Check for quorum: The lock is considered acquired if and only if:

    • The client successfully locked at least N/2 + 1 instances (a strict majority)
    • The total elapsed time is less than the lock TTL
  2. Calculate remaining validity: The effective lock validity is TTL - elapsed_time. If this remaining time is too short for the protected operation, the client should release the lock immediately rather than proceed.
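
Continuing the sketch above, the quorum and validity check might look like this. checkQuorum is again a hypothetical helper; the clock drift adjustment discussed later in this post is left out here.

// checkQuorum decides whether the lock counts as acquired: a strict majority
// of instances must be locked and the remaining validity must be positive.
func checkQuorum(totalInstances, acquired int, ttl, elapsed time.Duration) (validity time.Duration, ok bool) {
    quorum := totalInstances/2 + 1
    validity = ttl - elapsed
    return validity, acquired >= quorum && validity > 0
}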

Release Phase

  1. Release on all instances: The client sends a release command to all instances (not just the ones it successfully locked). The release uses a Lua script to ensure atomic check-and-delete:

    if redis.call("GET", KEYS[1]) == ARGV[1] then
        return redis.call("DEL", KEYS[1])
    else
        return 0
    end
    
  2. Handle partial failures: If some releases fail due to network issues or crashed nodes, the locks will expire naturally via TTL. This is safe because the token ensures no other client can accidentally release them.

Note: The client MUST issue the release command to all instances, including the ones where acquisition failed, not just the ones it successfully locked.
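
A sketch of the release phase with go-redis v9, again using a hypothetical helper: the script above is sent to every instance, and individual failures are ignored because any lock we could not delete will expire on its own.

const releaseScript = `
if redis.call("GET", KEYS[1]) == ARGV[1] then
    return redis.call("DEL", KEYS[1])
else
    return 0
end`

// releaseAll runs the check-and-delete script on every instance, whether or
// not acquisition succeeded there. Errors are deliberately ignored: locks we
// could not delete will expire via their TTL.
func releaseAll(ctx context.Context, clients []*redis.Client, key, token string) {
    for _, client := range clients {
        _ = client.Eval(ctx, releaseScript, []string{key}, token).Err()
    }
}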

Redlock algorithm: quorum-based acquire and release

The Clock Drift Problem

As noted above, we track elapsed time during the acquire phase. But there’s a subtle problem: clocks lie.

The Problem

In a distributed system, each machine has its own clock, and these clocks drift apart over time. Even with NTP synchronization, clocks can differ by tens or hundreds of milliseconds. This matters for Redlock because:

  1. Client’s clock vs. Redis’s clock: When the client measures “30 seconds have passed,” Redis might think only 29.5 seconds have passed, or 30.5.

  2. Redis instances disagree: Each Redis instance uses its own clock for TTL expiration. If you lock 5 instances with a 30-second TTL, they won’t all expire at exactly the same moment.

  3. The danger zone: Imagine Client A acquires a lock with a 10-second TTL. If Client A’s clock runs slow, what it perceives as 8 seconds (seemingly safe) might actually be 10.5 seconds of wall-clock time. The lock has already expired on Redis, but Client A still believes it holds the lock.

The following diagram illustrates a more concrete scenario: the client acquires the lock on Redis A quickly, but Redis B has severe network lag. By the time Redis B responds, the lock on Redis A has already expired:

Clock drift validity problem: lock expires before client knows

The Solution: Drift Factor

Redlock addresses this by introducing a clock drift factor. Instead of using the full TTL, the algorithm subtracts a safety margin:

validity_time = TTL - elapsed_time - (TTL * CLOCK_DRIFT_FACTOR) - CLOCK_DRIFT_BUFFER

Where:

  • CLOCK_DRIFT_FACTOR is typically 0.01 (1% of TTL)
  • CLOCK_DRIFT_BUFFER is a small constant (e.g., 2ms) for network round-trip variance

For a 30-second lock with 500ms acquisition time:

validity = 30000 - 500 - (30000 * 0.01) - 2
         = 30000 - 500 - 300 - 2
         = 29198ms

This conservative estimate ensures that even if clocks drift by up to 1%, the client won’t overstay its lock.
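
In Go, the calculation could be expressed as below; the constant names are illustrative, not taken from any particular library. For a 30-second TTL and 500ms of acquisition time it returns the same 29198ms as above.

const (
    clockDriftFactor = 0.01                 // 1% of the TTL
    clockDriftBuffer = 2 * time.Millisecond // covers network round-trip variance
)

// driftedValidity returns how long the lock can safely be assumed to be held.
func driftedValidity(ttl, elapsed time.Duration) time.Duration {
    drift := time.Duration(float64(ttl)*clockDriftFactor) + clockDriftBuffer
    return ttl - elapsed - drift
}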

Note: Clock drift is one of Kleppmann’s main critiques of Redlock. The drift factor is a heuristic that must be tuned through experimentation, not a guarantee.

Implementation

Just like the single-instance version, I have implemented the Redlock algorithm in Go. The code is available at github.com/trviph/redlock. The library provides a DistributedLock type that implements the full Redlock algorithm with quorum consensus and clock drift validation.

Here is how you can use it:

// 1. Create a client for each Redis instance
client1 := redis.NewClient(&redis.Options{Addr: "redis1:6379"})
client2 := redis.NewClient(&redis.Options{Addr: "redis2:6379"})
client3 := redis.NewClient(&redis.Options{Addr: "redis3:6379"})

// 2. Wrap them in Lock instances
locks := []*redlock.Lock{
    redlock.NewLock(client1),
    redlock.NewLock(client2),
    redlock.NewLock(client3),
}

// 3. Create the DistributedLock
// You can customize drift factor and retry behavior via options
dl := redlock.NewDistributedLock(locks,
    redlock.WithClockDriftFactor(0.01), // 1% drift factor
)

// 4. Acquire the lock
// This will return successfully only if N/2 + 1 instances are locked
// AND the validity check passes (considering drift and network time).
fencing, err := dl.Acquire(ctx, "my-critical-resource", 10*time.Second)
if err != nil {
    // Handle error (e.g., ErrLockAlreadyHeld, ErrValidityExpired)
    return
}
defer dl.Release(ctx, "my-critical-resource", fencing)

// Do critical work...

The DistributedLock API is designed to be drop-in compatible with the single-instance Lock. It supports:

  • Quorum-based Acquire: Automatically retries until a majority is locked.
  • Drift Validation: Ensures the lock is strictly safe even with clock skew.
  • Fail-safe Release: Releases on all instances so that, wherever possible, no zombie locks remain.
  • Watchdog Support: You can use the same Watch pattern to automatically extend the lock while your work is running.

Check out the repository for more comprehensive documentation and examples.

References