Beyond the Lock: Why Fencing Tokens Are Essential
A lock isn’t enough. Discover how fencing tokens prevent data corruption from stale locks and “zombie” processes.

Your payment processor just charged a customer twice. Your inventory system thinks you have 47 widgets when there are only 23. Both disasters happened despite using distributed locks. The culprit? Your process held a lock that had already expired, became a “zombie,” and corrupted data while believing it was still protected.
This isn’t a rare edge case. It’s a fundamental property of distributed systems that most developers don’t discover until production.
The Illusion of Safety
Distributed locks feel safe. You acquire a lock from Redis or your database, perform your critical work, and release it. The pattern is simple:
import { createLock } from "syncguard/redis";
import Redis from "ioredis";

const redis = new Redis();
const lock = createLock(redis);

// UNSAFE: No protection against stale locks
async function processPayment(orderId: string) {
  await lock(
    async () => {
      const order = await db.getOrder(orderId);
      await paymentGateway.charge(order.amount);
      await db.updateOrder(orderId, { status: "paid" });
    },
    { key: `payment:${orderId}`, ttlMs: 30000 },
  );
}
This code looks correct. You’re using a proper distributed lock with a 30-second timeout. But there’s a critical flaw that becomes visible only under production conditions.
The problem is subtle: you’re treating a distributed lock as a binary state (locked/unlocked), just like an in-process mutex. But a distributed lock isn’t a mutex. It’s a lease — a time-bound grant of exclusive access that can expire while you’re still using it.
When Locks Lie: The Zombie Process Problem
Here’s what actually happens in production:

Timeline:
T=0s: Process A acquires the lock with a 30-second TTL and starts processing a payment.
T=5s: Process A enters a stop-the-world garbage collection pause. In the JVM, these pauses can last for minutes in pathological cases. But even a 35-second pause is enough to break everything.
T=30s: The lock expires in Redis. From the lock service’s perspective, Process A has died.
T=31s: Process B successfully acquires the same lock and begins processing the payment.
T=33s: Process B charges the customer’s card and updates the database.
T=40s: Process A wakes up from its GC pause, completely unaware that 35 seconds have passed. From its perspective, only microseconds elapsed. It believes it still holds the lock.
T=41s: Process A charges the customer’s card again and overwrites Process B’s database update.
Result: The customer is charged twice. Data corruption. A production incident.
This isn’t a bug in your lock implementation. This isn’t a bug in Redis. This is how distributed systems work. A process can be paused at any moment by:
- Garbage collection (stop-the-world pauses lasting seconds or even minutes)
- OS process preemption (your process gets swapped out)
- Virtual memory page faults (which require slow disk I/O)
- Network delays (requests hang for seconds or minutes)
The fundamental issue: only two parties are involved — the client and the lock manager. The client thinks it holds the lock. The lock manager knows the lease expired. But there’s no one to stop the client from proceeding with stale authorization.
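To see why the client cannot fix this alone, consider the obvious workaround: re-check the lock just before writing. Here is a minimal sketch of why that fails (isStillHeld and db are hypothetical placeholders, not part of any library mentioned here). The gap between the check and the write is itself a window for a pause:

// BROKEN WORKAROUND: re-checking the lock before the write does not help.
// (isStillHeld and db are placeholders for illustration.)
async function updateOrderUnsafely(orderId: string, status: string) {
  if (await isStillHeld(`payment:${orderId}`)) {
    // The process can be paused RIGHT HERE for 35 seconds.
    // The lease expires, another process takes over, and this write
    // still proceeds with stale authorization.
    await db.updateOrder(orderId, { status });
  }
}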
The Three-Party Protocol: Enter Fencing Tokens
The solution requires a mental model shift. We need a third party to validate whether operations should be accepted: the resource itself.

A fencing token is a monotonically increasing number issued by the lock service with every successful lock acquisition. Each time any process acquires the lock for a given resource, the token increases. Process A gets token 33. When the lock expires and Process B acquires it, Process B gets token 34.
The protocol works like this (a minimal code sketch follows the list):
- Client acquires lock and receives a token: { ok: true, lockId: "…", fence: "000000000000033" }
- Client includes the token in every write: All operations to the protected resource must carry the fence token
- Resource checks the token: Before executing any write, the resource compares the incoming token against the last token it saw
- Resource rejects stale tokens: If incoming_token <= last_seen_token, reject the write
- Resource accepts and updates: If incoming_token > last_seen_token, accept the write and store the new token
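To make the comparison rule concrete, here is a minimal in-memory sketch of steps 3–5. A real resource persists the last-seen token durably, as the PostgreSQL example later in this article shows:

// Minimal sketch of the resource-side rule: track the highest fence token
// seen per resource and reject anything that is not strictly greater.
const lastSeen = new Map<string, string>();

function acceptWrite(resource: string, fence: string): boolean {
  const prev = lastSeen.get(resource);
  // Tokens are fixed-width zero-padded strings, so plain string
  // comparison matches numeric ordering.
  if (prev !== undefined && fence <= prev) {
    return false; // stale token: reject the write
  }
  lastSeen.set(resource, fence); // accept and remember the new token
  return true;
}

acceptWrite("order:42", "000000000000034"); // true - first/newer token
acceptWrite("order:42", "000000000000033"); // false - stale zombie write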
Now let’s replay the zombie process scenario with fencing tokens:

Timeline:
T=0s: Process A acquires the lock and receives token 33, then enters a GC pause.
T=31s: Lock expires. Process B acquires the lock and receives token 34.
T=33s: Process B charges the payment gateway (an unfenced operation) and writes to the database with token 34. The database has no previously recorded token (NULL), so it accepts the write and stores 34 as the last-seen token.
T=40s: Process A wakes up from its GC pause. It charges the payment gateway again (creating a duplicate charge), then attempts to update the database with its stale token 33.
T=41s: The database validates: 33 > 34 is false. Write rejected. The database responds with an error.
Result: Database integrity preserved — the zombie process cannot corrupt order state. However, the payment gateway was charged twice because it doesn’t participate in fencing. This demonstrates why idempotency keys are needed for external APIs (covered in “When Fencing Isn’t Possible”).
The key insight: the resource doesn’t trust the client’s claim of holding the lock. The resource validates the token against reality. Even a process with a stale view of the world cannot corrupt data.
How SyncGuard Implements Fencing Tokens
SyncGuard provides fencing tokens out-of-the-box for all its backends (Redis, PostgreSQL, Firestore). The implementation varies by backend, but the API is consistent.

Backend Implementation
Redis: Uses atomic INCR on a per-key fence counter. The increment and lock acquisition happen in a single Lua script for atomicity:
-- Within the atomic acquire script
local fenceKey = KEYS[3] -- Per-resource counter key
local fence = string.format("%015d", redis.call('INCR', fenceKey))
-- Store fence in lock data and return it
PostgreSQL: Uses a dedicated fence_counters table with database-enforced atomicity. The counter increment happens within the same transaction as lock acquisition.
Firestore: Uses Firestore transactions with per-key counter documents. The transaction ensures the counter increment and lock creation are atomic.
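For intuition, here is a sketch of what the PostgreSQL flavor could look like. It is illustrative only, not SyncGuard's actual code: it assumes a fence_counters(key, value) table with key as the primary key, and it omits the lock-row bookkeeping that would share the same transaction:

import type { PoolClient } from "pg";

// Illustrative only - not SyncGuard's actual implementation.
// Assumes: CREATE TABLE fence_counters (key TEXT PRIMARY KEY, value BIGINT NOT NULL);
async function acquireWithFence(client: PoolClient, key: string): Promise<string> {
  await client.query("BEGIN");
  try {
    // Atomically bump the per-key counter (creating it on first use).
    const { rows } = await client.query(
      `INSERT INTO fence_counters (key, value)
       VALUES ($1, 1)
       ON CONFLICT (key) DO UPDATE SET value = fence_counters.value + 1
       RETURNING value`,
      [key],
    );
    // Format as the 15-digit zero-padded string described below.
    const fence = String(rows[0].value).padStart(15, "0");
    // ...the lock row itself would be created here, in the same transaction...
    await client.query("COMMIT");
    return fence;
  } catch (err) {
    await client.query("ROLLBACK");
    throw err;
  }
}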
All backends return fence tokens as 15-digit zero-padded strings (e.g., "000000000000042"):
- Monotonically increasing per resource key
- Lexicographically comparable (use string comparison: fenceA > fenceB)
- Guaranteed unique even across crashes and restarts
- No parsing needed — just compare strings directly (see the note on padding below)
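The zero-padding is what makes plain string comparison safe. Unpadded numeric strings compare lexicographically, which diverges from numeric order as soon as the digit count changes:

// Why fixed width matters: unpadded numeric strings compare
// lexicographically, which diverges from numeric order.
const wrong = "99" < "100";                          // false (lexicographic)
const right = "000000000000099" < "000000000000100"; // true  (padded)

// Tokens can therefore be compared directly, no parsing needed:
const isNewer = (a: string, b: string): boolean => a > b;
isNewer("000000000000100", "000000000000099"); // true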
Resource-Side Implementation
The resource (your database, file system, or API) must actively participate in the fencing protocol. This requires three steps:
1. Add a fence token column to your data model:
ALTER TABLE orders ADD COLUMN last_fence_token VARCHAR(15);
2. Validate fence tokens on every write:
UPDATE orders
SET
  status = $1,
  last_fence_token = $2,
  updated_at = NOW()
WHERE
  order_id = $3
  -- CRITICAL: only accept strictly greater tokens. The IS NULL arm
  -- allows the very first fenced write, when no token has been seen yet.
  AND (last_fence_token IS NULL OR last_fence_token < $2)
RETURNING *;
3. Check the result to detect fenced-out operations:
async function updateOrderWithFencing(
  orderId: string,
  updates: { status: string },
  fence: string,
): Promise<boolean> {
  const result = await db.query(
    `UPDATE orders
     SET status = $1, last_fence_token = $2, updated_at = NOW()
     WHERE order_id = $3
       AND (last_fence_token IS NULL OR last_fence_token < $2)
     RETURNING *`,
    [updates.status, fence, orderId],
  );
  // If no rows were updated, our fence token was stale
  return result.rowCount > 0;
}
Putting It All Together
Here’s the safe pattern with SyncGuard:
import { createRedisBackend } from "syncguard/redis";
import Redis from "ioredis";

const redis = new Redis();
const backend = createRedisBackend(redis);

// SAFE: Database validates fence token before accepting writes
async function processPayment(orderId: string) {
  await using lock = await backend.acquire({
    key: `payment:${orderId}`,
    ttlMs: 30000,
  });
  if (!lock.ok) {
    throw new Error("Could not acquire lock");
  }

  // Extract the fence token - a monotonically increasing number
  const { fence } = lock; // e.g., "000000000000042"

  const order = await db.getOrder(orderId);
  await paymentGateway.charge(order.amount);

  // Database validates: only accepts writes with fence > last_seen_fence
  const updated = await db.updateOrderWithFencing(
    orderId,
    { status: "paid" },
    fence,
  );

  if (!updated) {
    // Our fence token was stale - another process with a higher token won
    // This means our lock expired and we're a "zombie process"
    throw new Error("Stale lock - operation rejected by resource");
  }

  // Lock automatically released when exiting 'await using' block
}
If Process A pauses during payment processing and its lock expires, Process B will acquire a new lock with a higher fence token. When Process A wakes up and attempts to update the database with its stale fence token, the database rejects the write. The payment is processed exactly once.
When Fencing Isn’t Possible
Not all systems can participate in the fencing protocol. Third-party REST APIs, legacy systems, or external services may not support custom token validation. In these cases, you have several options:
Idempotency Keys: Many payment gateways and external APIs support idempotency keys. Use a unique request ID (like {orderId}-{fence}) to prevent duplicate processing:
await paymentGateway.charge({
  amount: order.amount,
  idempotencyKey: `order-${orderId}-${fence}`,
});
Optimistic Concurrency Control: Use version numbers or ETags if the external system supports them. Before updating, check that the version hasn’t changed since you read it.
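As a sketch of the ETag flavor, assume a hypothetical external API that follows standard HTTP conditional-request semantics: it returns an ETag on read and honors If-Match on write (the URL and payload are illustrative):

// Hypothetical external API using standard HTTP conditional requests.
const res = await fetch("https://api.example.com/orders/42");
const etag = res.headers.get("ETag"); // version identifier from the read
const order = await res.json();

// The write succeeds only if the resource is unchanged since we read it.
const update = await fetch("https://api.example.com/orders/42", {
  method: "PUT",
  headers: { "If-Match": etag ?? "", "Content-Type": "application/json" },
  body: JSON.stringify({ ...order, status: "paid" }),
});

if (update.status === 412) {
  // 412 Precondition Failed: someone updated the resource after our read.
  // Re-read and decide whether to retry or abort.
}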
Move to a Fencing-Capable Resource: Use your own database as a proxy. Instead of writing directly to the external API, write to your database with fence token validation, then have a separate process (idempotent worker) sync to the external system.
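One common shape for this is an outbox-style table: the fenced write and a record of the pending external call land in your own database atomically, and a worker drains the table using an idempotency key. A sketch, reusing client, fence, and orderId from the earlier examples (table and column names are illustrative):

// Sketch of the proxy pattern: fenced write plus outbox row, one transaction.
await client.query("BEGIN");
const { rowCount } = await client.query(
  `UPDATE orders
   SET status = $1, last_fence_token = $2
   WHERE order_id = $3
     AND (last_fence_token IS NULL OR last_fence_token < $2)`,
  ["paid", fence, orderId],
);
if ((rowCount ?? 0) > 0) {
  // Enqueue the external call only if the fenced write was accepted.
  await client.query(
    `INSERT INTO outbox (idempotency_key, action, payload)
     VALUES ($1, 'charge', $2)
     ON CONFLICT (idempotency_key) DO NOTHING`,
    [`order-${orderId}-${fence}`, JSON.stringify({ orderId })],
  );
}
await client.query("COMMIT");
// A separate worker drains the outbox and calls the external API,
// passing idempotency_key so retries cannot double-charge.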
Compensating Transactions: Design operations to be reversible. If you detect a duplicate operation after the fact, have a process to undo it.
The key principle: if you can’t validate fence tokens at the resource, you must use another mechanism to ensure idempotency.
When You Need Fencing Tokens
Not every lock requires fencing tokens. The decision depends on what the lock is protecting.
You NEED fencing tokens when:
- The lock is for correctness, not just efficiency
- Failures would cause data corruption, not just duplicate work
- Examples: financial transactions, inventory updates, order state machines, account balance modifications
You might NOT need fencing tokens when:
- The lock is for efficiency only (e.g., preventing duplicate cache computations)
- Your system can tolerate occasional duplicates
- Idempotency alone provides sufficient protection
- Operations are commutative (order doesn’t matter)
Architectural alternatives to consider:
- Idempotency keys: For external APIs that support them
- Optimistic concurrency control: Use version numbers or timestamps
- Event sourcing: Immutable append-only logs eliminate update conflicts
- CRDTs: For operations that are naturally commutative
The rule of thumb: if a duplicate or out-of-order operation would corrupt your data, you need either fencing tokens or an equivalent mechanism.
The Bigger Picture: Locks vs Leases
The fundamental lesson is a shift in mental model. A distributed lock is not a mutex. It’s a lease — a time-bound, probabilistic grant of permission.
Leases can expire while you’re using them. This is not a failure mode. This is normal operation in distributed systems. Process pauses, network delays, and clock skew are not bugs to be fixed — they are fundamental properties of the environment.
Fencing tokens upgrade this probabilistic safety to deterministic correctness. Instead of hoping your process doesn’t pause, you build a system where even a paused process cannot cause harm. The resource becomes the final arbiter of operation validity.
This is the essence of defensive programming in distributed systems: assume your view of the world is stale. Don’t trust the client’s claim of holding a lock. Validate at the resource level with monotonically increasing tokens.
Conclusion
If you’re using distributed locks for data correctness, and you’re not using fencing tokens (or an equivalent mechanism), you have a latent data corruption bug. It’s not a matter of “if” but “when.”
The zombie process problem is real. GC pauses, network delays, and process preemption happen in production. Your distributed lock will expire while your process is paused. Without fencing tokens, that process will wake up and corrupt your data.
Fencing tokens solve this problem by making the resource an active participant in the safety protocol. The resource doesn’t trust the client’s claim of authorization — it validates every operation against a monotonically increasing token.
The cost is modest: an extra column in your database, an extra check in your write queries. The benefit is enormous: deterministic correctness instead of probabilistic hope.
Build systems that are safe by design. Use fencing tokens.
SyncGuard is a TypeScript library that provides distributed locking with built-in fencing token support for Redis, PostgreSQL, and Firestore. Learn more at kriasoft.com/syncguard/ or check out the source code on GitHub.