Resilience patterns that earn their keep in production

Most outages I've traced back to a root cause didn't start as a total failure. They started as one slow dependency. A database that got a little sluggish, a downstream API that began timing out on a fraction of requests, a cache node that fell over. The interesting question is never why that one thing degraded. It's why the degradation of one component took the whole system down with it.

That gap, between a single dependency wobbling and an entire platform going dark, is what resilience patterns are supposed to close. Circuit breakers, bulkheads, retry budgets, backpressure: each one exists to keep a local failure local. The catch is that every one of them adds moving parts, and a misconfigured resilience pattern is fully capable of causing the exact outage it was meant to prevent. So the patterns worth adding are the ones whose failure modes you actually understand. This is a tour of the four I reach for, what they cost, and the ways they bite back.

The failure you're actually defending against

The enemy is the cascading failure, and the mechanism is almost always resource exhaustion. Picture a request path three services deep. The deepest service slows down. The middle service is now holding open connections and threads waiting on it, so its own pool drains. The front service, waiting on the middle one, drains too. Within seconds a problem isolated to one leaf node has consumed every thread in the call chain, and requests that never even touch the slow dependency start failing because there's nothing left to serve them.
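To put rough numbers on that drain, here is a back-of-the-envelope sketch using Little's law (calls in flight equal arrival rate times the time each call takes). The 200 requests per second, the two latencies, and the name workersInUse are illustrative assumptions, not measurements from a real system.

```typescript
// Little's law: average calls in flight = arrival rate x time per call.
function workersInUse(requestsPerSecond: number, latencyMs: number): number {
  return (requestsPerSecond * latencyMs) / 1000;
}

// Hypothetical service taking 200 requests per second.
const healthy = workersInUse(200, 50); // 10 workers busy at 50ms per call
const degraded = workersInUse(200, 5_000); // 1000 workers busy at 5s per call
```

The traffic is unchanged, but the slow dependency now demands a hundred times the workers, and a pool sized for the healthy case is gone almost immediately.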

Retries make this worse, not better, when they're naive. Say each layer makes up to three attempts at a failed call. One genuine failure at the bottom becomes three attempts from the layer above, and each of those, seen as a failure, becomes three more from the layer above that. Three layers of three attempts each turn a single failed request into 3 × 3 × 3 = 27 requests slamming a service that is already struggling. (If you count three retries on top of the original call, that's four attempts per layer and 4 × 4 × 4 = 64.) That's retry amplification, and it's why a brief blip can harden into a sustained outage: the recovering service gets buried under the retry traffic before it can catch its breath.
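The multiplication generalizes to any depth. A small sketch, assuming every layer makes the same number of total attempts and every attempt fails (the function name is mine, purely for illustration):

```typescript
// Worst-case calls reaching the deepest service when each of `retryingLayers`
// layers makes up to `attemptsPerLayer` attempts and every attempt fails.
function worstCaseLeafCalls(attemptsPerLayer: number, retryingLayers: number): number {
  return Math.pow(attemptsPerLayer, retryingLayers);
}

const threeByThree = worstCaseLeafCalls(3, 3); // 27
const noRetries = worstCaseLeafCalls(1, 3); // 1: a single attempt per layer never amplifies
```

The growth is exponential in depth, which is why trimming attempts at any one layer helps less than capping retries for the system as a whole.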

So the goal of every pattern below is the same: contain the blast radius. Fail fast instead of failing slow, and make sure your recovery behavior never becomes the thing that prevents recovery.

Timeouts: the pattern everyone skips

Before the named patterns, the unglamorous one. A huge share of cascading failures trace back to a missing or absurdly generous timeout. If you call a dependency with no timeout, or a 30-second one because that's the framework default, you've handed that dependency the power to hold your threads hostage. A slow response is often worse than an error, because an error returns immediately and a slow response ties up a worker for the full duration.

Set aggressive, explicit timeouts on every network call, sized to the dependency's real latency rather than a round number that felt safe. If a service answers in 50ms at p99, a 500ms timeout is generous; a 30-second one is a liability. The timeout is what converts "hang until everything is exhausted" into "fail quickly so the patterns downstream can do their job." Everything else here assumes you've already done this. None of the fancier patterns can save a system whose threads are all parked on calls that will never return in time.
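As a minimal sketch of what an explicit timeout can look like in TypeScript: the helper name withTimeout and the "TimeoutError" name are my own assumptions, and the wrapped work is assumed to honor the AbortSignal it is given.

```typescript
// Rejects if `work` has not settled within `timeoutMs`, and aborts the signal
// so the underlying call can release its connection.
function withTimeout<T>(
  work: (signal: AbortSignal) => Promise<T>,
  timeoutMs: number,
): Promise<T> {
  const controller = new AbortController();
  return new Promise<T>((resolve, reject) => {
    const timer = setTimeout(() => {
      controller.abort();
      const err = new Error(`timed out after ${timeoutMs}ms`);
      err.name = "TimeoutError"; // hypothetical marker so callers can tell it apart
      reject(err);
    }, timeoutMs);

    work(controller.signal).then(
      (value) => {
        clearTimeout(timer);
        resolve(value);
      },
      (err) => {
        clearTimeout(timer);
        reject(err);
      },
    );
  });
}

// Usage, assuming a runtime with fetch available:
//   withTimeout((signal) => fetch(url, { signal }), 500)
```

Passing the signal through matters: rejecting the promise frees the caller, but only an aborted request frees the socket underneath it.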

Circuit breakers, and the flapping problem

A circuit breaker wraps a dependency and watches its failure rate. While failures stay low it's closed and calls pass through normally. Once failures cross a threshold it trips open, and every subsequent call fails instantly without even touching the dependency. After a cooldown it goes half-open, letting a small number of probe requests through to test whether the dependency has recovered. If the probes succeed it closes again; if they fail it snaps back open.

The point is failing fast. When the breaker is open, callers get an immediate error instead of waiting on a timeout, which means they stop burning threads and connections on a service that can't answer. A minimal version in TypeScript, which counts consecutive failures rather than tracking a true failure rate, makes the state machine concrete:

type State = "closed" | "open" | "half-open";

class CircuitBreaker {
  private state: State = "closed";
  private failures = 0; // consecutive failures since the last success
  private openedAt = 0; // timestamp (ms) of the most recent trip
  private halfOpenProbes = 0; // probes admitted in the current half-open phase

  constructor(
    private readonly failureThreshold = 5,
    private readonly cooldownMs = 30_000,
    private readonly probeLimit = 3,
  ) {}

  async call<T>(fn: () => Promise<T>, now: number): Promise<T> {
    if (this.state === "open") {
      // Still cooling down: fail fast without touching the dependency.
      if (now - this.openedAt < this.cooldownMs) {
        throw new Error("circuit open");
      }
      // Cooldown elapsed: let a limited number of probes through.
      this.state = "half-open";
      this.halfOpenProbes = 0;
    }

    // Half-open with every probe slot already taken: keep rejecting.
    if (this.state === "half-open" && this.halfOpenProbes >= this.probeLimit) {
      throw new Error("circuit open");
    }

    try {
      if (this.state === "half-open") this.halfOpenProbes++;
      const result = await fn();
      this.onSuccess();
      return result;
    } catch (err) {
      this.onFailure(now);
      throw err;
    }
  }

  // Naive on purpose: a single success closes the breaker.
  private onSuccess() {
    this.failures = 0;
    this.state = "closed";
  }

  // In half-open the counter is still at the threshold, so one failed probe reopens.
  private onFailure(now: number) {
    this.failures++;
    if (this.failures >= this.failureThreshold) {
      this.state = "open";
      this.openedAt = now;
    }
  }
}

I'm passing the current time in as the now argument rather than reading the clock inside the breaker, which keeps it trivially testable: you feed it timestamps and assert on the state transitions without faking timers.

Now the failure mode nobody warns you about: flapping. A service that's partially degraded, healthy enough to pass a couple of probe requests but not healthy enough to handle real load, will pass the half-open test, get the full traffic dumped back on it, fall over again, and trip the breaker open once more. The breaker oscillates between open and closed every cooldown cycle, and from the outside the service looks like it's stuttering on and off rather than recovering.

The fix is hysteresis. Don't close on a single successful probe. Require several consecutive successful probe windows before declaring the dependency healthy, and consider ramping traffic back gradually instead of restoring it all at once. The naive breaker above closes on the first success, which is exactly the behavior that flaps under partial degradation. That's the difference between a textbook circuit breaker and one that survives a real brownout.
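Here is one way the hysteresis could look, as a sketch rather than a finished breaker. I've simplified "probe windows" to consecutive successful probes, split the API into allow / recordSuccess / recordFailure so it can be exercised without async calls, and left out probe limiting and gradual traffic ramping. The class name and the successesToClose parameter are assumptions of mine.

```typescript
type BreakerState = "closed" | "open" | "half-open";

class HysteresisBreaker {
  private state: BreakerState = "closed";
  private consecutiveFailures = 0;
  private consecutiveProbeSuccesses = 0;
  private openedAt = 0;

  constructor(
    private readonly failureThreshold = 5,
    private readonly cooldownMs = 30_000,
    private readonly successesToClose = 3, // hypothetical default
  ) {}

  // True if a call may be attempted at time `now` (ms).
  allow(now: number): boolean {
    if (this.state === "open" && now - this.openedAt >= this.cooldownMs) {
      this.state = "half-open";
      this.consecutiveProbeSuccesses = 0;
    }
    return this.state !== "open";
  }

  recordSuccess(): void {
    if (this.state === "half-open") {
      // Only close after several probes in a row have succeeded.
      this.consecutiveProbeSuccesses++;
      if (this.consecutiveProbeSuccesses >= this.successesToClose) {
        this.state = "closed";
        this.consecutiveFailures = 0;
      }
    } else if (this.state === "closed") {
      this.consecutiveFailures = 0;
    }
  }

  recordFailure(now: number): void {
    if (this.state === "open") return; // late result from before the trip
    if (this.state === "half-open") {
      this.trip(now); // any failed probe reopens and restarts the cooldown
      return;
    }
    this.consecutiveFailures++;
    if (this.consecutiveFailures >= this.failureThreshold) this.trip(now);
  }

  current(): BreakerState {
    return this.state;
  }

  private trip(now: number): void {
    this.state = "open";
    this.openedAt = now;
  }
}
```

The asymmetry is the point: one failed probe is enough to reopen, but closing takes a run of successes, so a dependency that can only survive a couple of requests never gets the full load handed back.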

This is one of those patterns where reaching for complexity early genuinely pays off, the same way I argued for justifying simpler designs in my piece on moving from microservices back to a modular monolith. The breaker is worth it precisely at the integration boundaries where one flaky dependency can take down everything calling it.

Bulkheads: isolating the blast radius

A circuit breaker stops you from hammering a failing dependency. A bulkhead stops one failing dependency from starving every other call you make. The name comes from ship design, where the hull is divided into watertight compartments so a breach in one doesn't flood the whole vessel.

In practice a bulkhead is resource isolation. Instead of one shared connection pool or thread pool serving every downstream dependency, you carve out a separate, bounded pool per dependency. If the recommendation service goes slow and saturates its pool, the checkout calls running through a different pool keep flowing. The slow dependency can fully exhaust its own allocation without touching anyone else's.

class Bulkhead {
  private active = 0;
  private queue: Array<() => void> = [];

  constructor(
    private readonly maxConcurrent: number,
    private readonly maxQueue: number,
  ) {}

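  async run<T>(fn: () => Promise<T>): Promise<T> {
    if (this.active >= this.maxConcurrent) {
      if (this.queue.length >= this.maxQueue) {
        throw new Error("bulkhead full");
      }
      // Wait until a finishing call hands its slot over to us directly.
      await new Promise<void>((resolve) => this.queue.push(resolve));
    } else {
      this.active++;
    }
    try {
      return await fn();
    } finally {
      const next = this.queue.shift();
      if (next) {
        // Transfer the slot to the waiter without decrementing, so a new
        // caller can't slip in first and push us past maxConcurrent.
        next();
      } else {
        this.active--;
      }
    }
  }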
}

The trade-off is real: you're partitioning capacity ahead of time, so each dependency gets a fixed slice rather than drawing freely from one big pool. That means lower peak throughput for any single dependency in exchange for guaranteed isolation. I reach for bulkheads when a service fans out to several dependencies of unequal importance, where I can't let a slow non-critical call drown the critical path. Pair it with the circuit breaker and the two cover different failures: the bulkhead caps how much damage a slow dependency can do while it's degrading, and the breaker cuts it off entirely once it's clearly down.

Retries without the storm

Retries are the most dangerous resilience pattern because they feel free and they aren't. Done well they paper over genuinely transient failures: a dropped packet, a momentary leader election, a brief GC pause. Done badly they're an amplifier pointed at a service that's already on fire.

Three rules keep retries from becoming the outage. First, exponential backoff: wait longer between each attempt, doubling the delay, so you're not machine-gunning a struggling dependency. Second, jitter: add randomness to those delays. Without it, every client that failed at the same instant retries at the same instant, and you get a thundering herd, synchronized spikes that knock the service back down the moment it recovers. Full jitter, where you pick a random delay between zero and the backoff ceiling, is a strong default. I went deep on this exact problem, backoff and jitter for syncs that don't lose records, in my write-up on handling n8n pagination, rate limits, and retries.
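The first two rules fit in a few lines. This is a sketch of full jitter as described above; the function names, the base and cap values in the usage, and the injectable random source (there so tests can be deterministic) are my own choices.

```typescript
// Exponential backoff ceiling: base * 2^attempt, capped at capMs.
// `attempt` is zero-based: 0 for the first retry.
function backoffCeiling(attempt: number, baseMs: number, capMs: number): number {
  return Math.min(capMs, baseMs * Math.pow(2, attempt));
}

// Full jitter: a uniformly random delay between zero and the backoff ceiling.
function fullJitterDelay(
  attempt: number,
  baseMs: number,
  capMs: number,
  random: () => number = Math.random,
): number {
  return random() * backoffCeiling(attempt, baseMs, capMs);
}

// With a 100ms base and 5s cap, the ceilings run 100, 200, 400, 800, ... 5000.
const ceilings = [0, 1, 2, 3].map((attempt) => backoffCeiling(attempt, 100, 5_000));
```

Because each client draws its own random delay, clients that failed together no longer retry together, which is what breaks up the herd.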

Third, and most overlooked, a retry budget. Backoff and jitter spread retries out in time but don't cap their total volume. A retry budget does: it ties the number of retries to a percentage of normal traffic, so retries can never balloon past a fixed fraction of your real request rate. Google's SRE book describes a per-client retry budget of 10%, meaning retries are allowed to add at most a tenth on top of baseline load. Once that budget is spent in a given window, further retries are simply dropped. This is the single mechanism that defuses the 27-request amplification from earlier: no matter how many layers are retrying, the system as a whole can't multiply its own traffic into the ground.
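A retry budget can be as simple as two counters per window. The sketch below is one possible shape, assuming a fixed window that resets all at once; real implementations often use sliding windows or token buckets. The class name, the method names, and the 10-second window are hypothetical.

```typescript
class RetryBudget {
  private windowStart = 0;
  private requests = 0;
  private retries = 0;

  constructor(
    private readonly ratio = 0.1, // retries allowed per first-attempt request
    private readonly windowMs = 10_000, // hypothetical window length
  ) {}

  // Call once for every original (non-retry) request.
  recordRequest(now: number): void {
    this.roll(now);
    this.requests++;
  }

  // Returns true and spends budget if a retry is allowed right now.
  tryAcquireRetry(now: number): boolean {
    this.roll(now);
    if (this.retries + 1 > this.requests * this.ratio) return false;
    this.retries++;
    return true;
  }

  // Start a fresh window once the current one has expired.
  private roll(now: number): void {
    if (now - this.windowStart >= this.windowMs) {
      this.windowStart = now;
      this.requests = 0;
      this.retries = 0;
    }
  }
}
```

With 100 requests in a window and a 10% ratio, the first ten retries are admitted and the eleventh is refused, however many callers are asking.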

Two more retry rules worth burning in. Only retry idempotent operations, or operations you've made idempotent with an idempotency key, because retrying a non-idempotent write is how you double-charge a customer. And never retry through an open circuit breaker. The breaker's whole job is to fail fast; if your retry logic catches that fast failure and tries again, you've defeated the breaker and re-armed the amplifier. Wire them so a tripped breaker short-circuits the retry loop entirely.
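Wiring the retry loop to respect the breaker might look like the sketch below. It assumes the breaker marks its fast-fail error with the name "CircuitOpenError", which the minimal breaker earlier does not do (it throws a plain Error), and it assumes the caller only passes idempotent operations. The sleep function is injected so a backoff-with-jitter delay can be plugged in.

```typescript
// Hypothetical marker: the breaker is assumed to throw an error with this name.
const CIRCUIT_OPEN = "CircuitOpenError";

function circuitOpenError(): Error {
  const err = new Error("circuit open");
  err.name = CIRCUIT_OPEN;
  return err;
}

// Retries `fn` up to `maxAttempts` times in total. `fn` must be idempotent.
async function retryIdempotent<T>(
  fn: () => Promise<T>,
  maxAttempts: number,
  sleep: (attempt: number) => Promise<void>,
): Promise<T> {
  let lastError: unknown = new Error("maxAttempts must be at least 1");
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      return await fn();
    } catch (err) {
      // A tripped breaker is a deliberate fast failure: surface it, never retry it.
      if (err instanceof Error && err.name === CIRCUIT_OPEN) throw err;
      lastError = err;
      if (attempt < maxAttempts - 1) await sleep(attempt);
    }
  }
  throw lastError;
}
```

The breaker check comes before anything else in the catch block, so an open circuit costs exactly one call no matter how generous the attempt limit is.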

Backpressure: when you're the one drowning

Circuit breakers, bulkheads, and retries are about protecting yourself from a dependency. Backpressure is the inverse: protecting yourself, and your callers, when you're the one being overwhelmed. A service that accepts work faster than it can process it has only bad options. It can buffer indefinitely until memory runs out, or it can keep accepting and let latency climb until every request times out anyway.

Backpressure makes the queue an explicit, bounded thing and pushes the limit back upstream. The cleanest version is an asynchronous boundary: instead of synchronous request-response all the way down, you put a bounded queue or a log between the fast producer and the slower consumer, and the consumer pulls work at the rate it can actually handle. When the buffer fills, producers are told to slow down or back off rather than piling on more. This is one of the structural arguments for event-driven designs, and I covered the production mechanics of that decoupling in my piece on Kafka patterns that hold up in production.
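The core of that bounded boundary is small. A minimal in-memory sketch, assuming a single process (a real system would more likely put a broker or log here); the BoundedQueue name and its methods are illustrative:

```typescript
class BoundedQueue<T> {
  private items: T[] = [];

  constructor(private readonly capacity: number) {}

  // Producer side: returns false instead of growing without limit.
  // A false return is the backpressure signal to slow down or back off.
  offer(item: T): boolean {
    if (this.items.length >= this.capacity) return false;
    this.items.push(item);
    return true;
  }

  // Consumer side: pull the next item when ready for more work.
  poll(): T | undefined {
    return this.items.shift();
  }

  size(): number {
    return this.items.length;
  }
}

const queue = new BoundedQueue<string>(2);
const first = queue.offer("a"); // true
const second = queue.offer("b"); // true
const third = queue.offer("c"); // false: full, the producer must back off
```

The important design choice is that offer refuses rather than blocks or buffers: the producer finds out immediately that the consumer is behind, while memory use stays fixed.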

When you can't decouple and you're still over capacity, the honest move is load shedding: deliberately rejecting a fraction of requests with a fast error so the requests you do accept get served properly. A service that sheds 20% of load and cleanly serves the other 80% is in far better shape than one that accepts everything and degrades all of it into timeout territory. Shedding the lowest-priority traffic first, best-effort background work before user-facing requests, keeps the degradation graceful. Be careful about where load-balancer health checks fall in that ordering: shed them and the balancer may pull an otherwise working instance out of rotation. I covered how this played out in my write-up on e-commerce backends under Black Friday load: the systems that stayed up were the ones willing to say no early, not the ones that tried to absorb everything.
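Priority-aware shedding can be sketched as an admission check against in-flight work. The three priority classes and the 70/90/100 percent thresholds below are hypothetical; the right classes and cut-offs depend entirely on your traffic.

```typescript
type Priority = "critical" | "normal" | "background";

// Percent of capacity at which each class starts being rejected (assumed values).
const SHED_AT_PERCENT: Record<Priority, number> = {
  background: 70,
  normal: 90,
  critical: 100,
};

class LoadShedder {
  private inFlight = 0;

  constructor(private readonly maxInFlight: number) {}

  // Returns false (reject fast) once utilization reaches this class's threshold.
  tryAdmit(priority: Priority): boolean {
    if (this.inFlight * 100 >= this.maxInFlight * SHED_AT_PERCENT[priority]) {
      return false;
    }
    this.inFlight++;
    return true;
  }

  // Call when an admitted request finishes, whether it succeeded or failed.
  release(): void {
    if (this.inFlight > 0) this.inFlight--;
  }
}
```

As load climbs, background work is turned away first, then normal requests, and critical requests are refused only when the service is completely full.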

What I actually reach for, and in what order

The mistake I see most often isn't skipping resilience patterns. It's bolting on all four at once, day one, before there's any traffic to justify them, and ending up with a system so wrapped in breakers and pools that nobody can reason about its behavior under load. Each pattern is operational complexity you have to tune, monitor, and debug. So I add them in roughly this order.

Timeouts first, always, on every network call, no exceptions. They're the cheapest insurance and the precondition for everything else. Then circuit breakers at the genuine integration boundaries, the calls to other teams' services and third-party APIs where a failure isn't yours to fix and failing fast is the only sane response. Bulkheads come next, but only once I have a service fanning out to multiple dependencies where one can realistically starve the others. And retries I treat with suspicion: a retry budget at the system level before per-call retries, never retries without jitter, never retries through an open breaker.

Test all of this the way it'll actually be exercised, which means injecting the failures rather than hoping. Chaos experiments, killing a dependency or adding latency to it in a controlled window, are the only way to find out whether your breaker actually trips, your bulkhead actually isolates, and your retries actually back off instead of storming. A resilience pattern you've never seen fire is a resilience pattern you don't know works.
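At the smallest scale, fault injection can be a wrapper around the dependency call in a test or staging build. This sketch is an assumption-laden stand-in for a real chaos tool: the FaultConfig shape and the function name are mine, and the random source is injectable so a test can force the failure path.

```typescript
interface FaultConfig {
  failureRate: number; // probability in [0, 1] that a call fails outright
  addedLatencyMs: number; // extra delay applied before every call
}

// Wraps `fn` so that calls are delayed and randomly failed per `config`.
function withInjectedFaults<T>(
  fn: () => Promise<T>,
  config: FaultConfig,
  random: () => number = Math.random,
): () => Promise<T> {
  return async () => {
    if (config.addedLatencyMs > 0) {
      await new Promise<void>((resolve) => setTimeout(resolve, config.addedLatencyMs));
    }
    if (random() < config.failureRate) {
      throw new Error("injected fault");
    }
    return fn();
  };
}
```

Point a breaker, bulkhead, or retry loop at the wrapped function, turn the failure rate or latency up, and assert on what the pattern actually does.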

The libraries are mature and worth using over hand-rolled code in real systems. On the JVM, Resilience4j covers breakers, bulkheads, retries, and rate limiting in one toolkit. On .NET, Polly does the same. If you run a service mesh, a lot of this (timeouts, retries with budgets, circuit breaking) lives in the mesh layer and can be configured per-route without touching application code. The hand-rolled examples here are for understanding the mechanism, not for shipping.

None of these patterns make a system reliable on their own. What they do is bound the damage, so the failure you can't prevent stays the size it started at instead of growing to swallow everything around it. Start with the failure mode you've actually watched happen. Add the one pattern that would have contained it. Tune it against an injected fault until you trust it. Then, and only then, reach for the next one.
