Pattern: Circuit Breaker

Prevent cascading failures when calling downstream services by tracking failure rates and temporarily halting requests when a service is overwhelmed or unavailable. Protects both the caller (from wasting resources on retries) and the downstream service (from further overload).

When to Use

  • Calling an external or cross-module service (Nutrient API, uploader queue, external API) that may be temporarily unavailable
  • The downstream service has its own scaling constraints (connection pool limits, queue depth limits, rate limits)
  • Failed requests consume significant resources (Lambda concurrency, SQS visibility timeouts, API rate limit quotas)
  • You want to fail fast rather than queue up retries that will also fail

When NOT to Use

  • Internal function calls within the same Lambda/container
  • Database operations (use @retry_on_db_conflict instead)
  • SQS message processing (use DLQ + max receive count instead)
  • One-off requests where retry is acceptable

Architecture

Caller (Lambda/Processor)
┌─────────────────────────────────┐
│         Circuit Breaker         │
│                                 │
│  CLOSED ──→ OPEN ──→ HALF_OPEN  │
│    │          │          │      │
│    ▼          ▼          ▼      │
│  Pass all   Reject     Test one │
│  requests   all        request  │
│             (fail fast)         │
└────────────────┬────────────────┘
                 ▼
        Downstream Service
        (uploader queue, Nutrient API, etc.)

State Machine

           ┌─────────┐
           │ CLOSED  │ ← Normal operation
           └────┬────┘
                │ failure count >= threshold (5)
                ▼
           ┌─────────┐
           │  OPEN   │ ← All requests rejected immediately
           └────┬────┘
                │ timeout period elapsed (30s)
                ▼
           ┌───────────┐
           │ HALF_OPEN │ ← Test with one request
           └─────┬─────┘
        ┌────────┴────────┐
        │                 │
    success            failure
        │                 │
        ▼                 ▼
   ┌─────────┐      ┌─────────┐
   │ CLOSED  │      │  OPEN   │
   └─────────┘      └─────────┘

Implementation

class CircuitOpenError extends Error {}

class CircuitBreaker {
  private state: 'CLOSED' | 'OPEN' | 'HALF_OPEN' = 'CLOSED';
  private failureCount = 0;
  private lastFailureTime: number | null = null;

  constructor(
    private readonly failureThreshold: number = 5,
    private readonly timeoutMs: number = 30_000,
  ) {}

  async call<T>(fn: () => Promise<T>): Promise<T> {
    if (this.state === 'OPEN') {
      if (Date.now() - this.lastFailureTime! > this.timeoutMs) {
        // Cooldown elapsed: let one request through to probe the service.
        this.state = 'HALF_OPEN';
      } else {
        // Fail fast without touching the downstream service.
        throw new CircuitOpenError('Circuit breaker is OPEN — request rejected');
      }
    }

    try {
      const result = await fn();
      this.onSuccess();
      return result;
    } catch (error) {
      this.onFailure();
      throw error;
    }
  }

  private onSuccess(): void {
    this.failureCount = 0;
    this.state = 'CLOSED';
  }

  private onFailure(): void {
    this.failureCount++;
    this.lastFailureTime = Date.now();
    // In HALF_OPEN the count is still at or above the threshold,
    // so a failed probe request immediately reopens the circuit.
    if (this.failureCount >= this.failureThreshold) {
      this.state = 'OPEN';
    }
  }
}
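The threshold bookkeeping can be exercised in isolation. The sketch below is a simplified, synchronous mirror of the class's onSuccess/onFailure handlers (not the class itself, whose call() is async), showing exactly when the breaker trips:

```typescript
type State = 'CLOSED' | 'OPEN' | 'HALF_OPEN';

// Simplified synchronous mirror of the breaker's counting logic, for illustration only.
class BreakerBookkeeping {
  state: State = 'CLOSED';
  failureCount = 0;
  constructor(private readonly threshold: number = 5) {}

  recordFailure(): void {
    this.failureCount++;
    if (this.failureCount >= this.threshold) this.state = 'OPEN';
  }

  recordSuccess(): void {
    this.failureCount = 0;
    this.state = 'CLOSED';
  }
}

const b = new BreakerBookkeeping(5);
for (let i = 0; i < 4; i++) b.recordFailure();
const before = b.state; // four consecutive failures: still CLOSED
b.recordFailure();
const after = b.state;  // fifth failure crosses the threshold: OPEN
```

Note that a single success resets the count to zero, which is why the threshold counts consecutive failures rather than a failure rate.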

Configuration

Parameter          Default        Purpose
failureThreshold   5              Consecutive failures before opening the circuit
timeoutMs          30,000 (30s)   Time in OPEN state before testing with HALF_OPEN

Tuning guidance:

  • High-latency services (Nutrient API): higher timeout (60s+) to give the service time to recover
  • Queue-based services (uploader SQS): lower threshold (3) since queue failures indicate systemic issues
  • Rate-limited APIs: align the timeout with the rate limit reset window
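The guidance above might be captured as per-service settings; the service keys and the rate-limit window below are illustrative, not a real registry:

```typescript
// Illustrative per-service breaker settings following the tuning guidance.
const breakerSettings = {
  nutrientApi:    { failureThreshold: 5, timeoutMs: 60_000 }, // high latency: allow 60s+ to recover
  uploaderQueue:  { failureThreshold: 3, timeoutMs: 30_000 }, // queue failures are systemic: trip sooner
  rateLimitedApi: { failureThreshold: 5, timeoutMs: 90_000 }, // match the API's rate-limit reset window
} as const;
```

Each entry maps directly onto the CircuitBreaker constructor, e.g. `new CircuitBreaker(breakerSettings.uploaderQueue.failureThreshold, breakerSettings.uploaderQueue.timeoutMs)`.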

Design Considerations

  1. Circuit breaker is in-memory — resets on Lambda cold start. This is acceptable because cold starts indicate a fresh execution context and the downstream service may have recovered.

  2. Combine with DLQ — messages that fail due to an open circuit should still be retried via SQS visibility timeout or DLQ redrive after the service recovers.

  3. Log state transitions — CLOSED→OPEN and OPEN→HALF_OPEN transitions should be logged at WARN level for operational visibility.

  4. Don't use where retries must complete — if the downstream service might have partially processed the request, failing fast could skip a retry the system depends on for consistency. Use the circuit breaker only for requests that are safe to skip or defer.
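Consideration 3 can be sketched as a small formatting helper; `formatTransition` is a hypothetical hook you would call wherever the class assigns `this.state`, passing the result to your logger:

```typescript
type BreakerState = 'CLOSED' | 'OPEN' | 'HALF_OPEN';

// Hypothetical helper for consideration 3. CLOSED→OPEN and OPEN→HALF_OPEN
// are the operationally interesting edges, so anything not returning to
// CLOSED is logged at WARN.
function formatTransition(from: BreakerState, to: BreakerState): string {
  const level = to === 'CLOSED' ? 'INFO' : 'WARN';
  return `[${level}] circuit breaker ${from} -> ${to}`;
}

const line = formatTransition('CLOSED', 'OPEN');
// a real implementation would hand this string to the service's structured logger
```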

Real-World Usage

  • documentexchanger — UploaderCoordinator wraps Nutrient uploader queue messaging in a circuit breaker (threshold: 5, timeout: 30s). Prevents the exchanger from flooding the uploader when it's overwhelmed. See reference-implementations/documentexchanger.md.