# Article Review: Group 2 — Retry, Backoff, and Jitter

## Articles Reviewed
- Timeouts, retries, and backoff with jitter — Marc Brooker / Amazon Builders' Library
- Retry with backoff pattern — AWS Prescriptive Guidance
## Key Concepts

### Timeout Selection (Brooker)
Choose timeouts based on downstream service latency percentiles. At Amazon, they pick an acceptable false-timeout rate (e.g., 0.1%) and use the corresponding latency percentile (p99.9). Two pitfalls:

- Connection timeout and request timeout must be set separately — TLS handshakes can be slow
- Pre-establish connections before receiving traffic to avoid cold-start timeout spikes
### Retries Are "Selfish"
Retries demand more server resources. When failures are caused by overload, retries make the overload worse and can prevent recovery entirely.
### Retry Amplification in Deep Stacks
A 5-layer call chain with 3 tries per call at each layer = 3^5 = 243x load on the bottom service. Amazon's best practice: retry at a single point in the stack, not at every layer.
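The amplification arithmetic can be stated as a one-liner (the function name is illustrative):

```python
def worst_case_amplification(tries_per_layer: int, layers: int) -> int:
    """Worst-case number of calls hitting the bottom service when every
    layer independently makes `tries_per_layer` attempts per incoming call."""
    return tries_per_layer ** layers

# 3 tries per call across a 5-layer stack: 3**5 == 243
```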
### Token Bucket Over Circuit Breakers
Brooker argues circuit breakers are modal and hard to test. Amazon prefers token bucket retry limiting — allow retries while tokens exist, then throttle to a fixed rate. AWS SDK has this built in since 2016.
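A sketch of the token-bucket idea in the spirit of the SDK's retry quota (the class, parameters, and refund policy here are illustrative, not the SDK's actual defaults):

```python
import threading

class RetryTokenBucket:
    """Client-side retry limiter: retries spend tokens, successful calls
    slowly earn them back. Under sustained failure, retries are throttled
    smoothly instead of flipping to a fully 'open' circuit-breaker mode."""

    def __init__(self, capacity=10.0, retry_cost=5.0, refund=1.0):
        self.capacity = capacity
        self.tokens = capacity
        self.retry_cost = retry_cost
        self.refund = refund
        self.lock = threading.Lock()

    def acquire_retry(self) -> bool:
        """Return True if a retry is allowed right now."""
        with self.lock:
            if self.tokens >= self.retry_cost:
                self.tokens -= self.retry_cost
                return True
            return False

    def record_success(self):
        """Refund a fraction of a token on each successful call."""
        with self.lock:
            self.tokens = min(self.capacity, self.tokens + self.refund)
```

Because throttling depends on the continuous token balance rather than a binary open/closed state, behavior degrades gradually, which is the testability advantage Brooker cites over circuit breakers.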
### Jitter — The Critical Missing Piece
Jitter adds randomness to backoff to prevent thundering herd on retry. Two key insights:
- Jitter is not just for retries — apply to all periodic work (timers, cron jobs, polling, health checks) to spread load spikes
- Use consistent jitter, not random — same host should produce same jitter value every time. This makes synchronized retries produce reproducible patterns, which humans can debug. Random jitter makes failures happen randomly, making troubleshooting harder.
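Consistent jitter can be derived by hashing a stable host identifier, so the offset is deterministic per host but spread across the fleet (hash choice and the spread fraction below are illustrative):

```python
import hashlib

def consistent_jitter_seconds(host_id: str, interval: float,
                              spread: float = 0.1) -> float:
    """Deterministic per-host jitter: the same host always gets the same
    offset within +/- spread*interval, so load patterns are reproducible
    when debugging, unlike random jitter."""
    digest = hashlib.sha256(host_id.encode()).digest()
    fraction = int.from_bytes(digest[:8], "big") / 2**64  # uniform in [0, 1)
    return (fraction * 2 - 1) * spread * interval
```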
### When to Retry
- Only retry on transient/server errors (429, 503, 504)
- Fail fast on client errors (4xx except 429)
- Operations MUST be idempotent for safe retry
- Eventual consistency blurs the client/server error line
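The status-code rules above can be condensed into a small predicate (a sketch; treating all remaining 5xx as potentially transient is a judgment call, not something the articles mandate):

```python
# Statuses the guidance explicitly names as retryable.
RETRYABLE_STATUSES = {429, 503, 504}

def should_retry(status: int) -> bool:
    """Retry only transient errors; fail fast on client errors."""
    if status in RETRYABLE_STATUSES:
        return True               # throttling / server overload
    if 400 <= status < 500:
        return False              # client error: retrying won't help
    return 500 <= status < 600    # other 5xx: assumed transient here
```

Remember the eventual-consistency caveat: a 404 immediately after a write may actually be transient, so the 4xx rule is a default, not absolute.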
## Mapping to NGE Architecture

### What We Do Right
- Two-tier retry strategy — DB-level retry (`@retry_on_db_conflict`) handles MySQL deadlocks in-process; SQS requeue handles processing failures at the message level. These are at different layers but non-amplifying because they serve different failure modes.
- Capped exponential backoff — SQS requeue caps at 900s (15 min). DB retry caps at 3 attempts. Both prevent unbounded retry.
- Exception-based retry decisions — Our exception hierarchy (PermanentFailure → DLQ, RecoverableException → requeue, SilentSuccess → skip) maps cleanly to the "know which failures are worth retrying" guidance.
- Idempotent handlers — Our handler pattern supports safe retries because document processing is idempotent (processing the same document twice produces the same result).
- Single-layer retry for SQS — SQS requeue happens at the handler level only, not at intermediate layers. No amplification risk in the SNS→SQS→Lambda chain.
### Issues Found

#### 1. MEDIUM: No Jitter in DB Retry or SQS Requeue
Neither the DB retry decorator (`@retry_on_db_conflict`) nor the SQS requeue adds jitter. When multiple Lambda invocations hit a MySQL deadlock simultaneously (common during bulk imports), they all retry at exactly the same intervals — causing another deadlock storm. The SQS requeue has the same issue at a larger time scale.
Recommendation: Add full jitter to both:

```python
# DB retry with full jitter
delay = random.uniform(0, base_delay * (2 ** attempt))

# SQS requeue with full jitter
delay = random.uniform(0, min(base_delay * (2 ** retry_count), max_delay))
```

Full jitter (a random delay between 0 and the exponential cap) provides the best spread according to the companion "Exponential Backoff and Jitter" blog post.
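For the DB side, full jitter could be wired into a retry decorator along these lines (a sketch only; the decorator name, defaults, and exception handling are illustrative, not our actual `@retry_on_db_conflict` implementation):

```python
import functools
import random
import time

def retry_with_full_jitter(max_attempts=3, base_delay=0.1, max_delay=5.0,
                           retryable=(Exception,)):
    """Illustrative decorator: exponential backoff with full jitter.
    Sleeps uniform(0, min(max_delay, base_delay * 2**attempt)) between
    attempts, re-raising after the final failed attempt."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            for attempt in range(max_attempts):
                try:
                    return fn(*args, **kwargs)
                except retryable:
                    if attempt == max_attempts - 1:
                        raise
                    cap = min(max_delay, base_delay * (2 ** attempt))
                    time.sleep(random.uniform(0, cap))
        return wrapper
    return decorator
```

In practice `retryable` would be narrowed to the deadlock exception type rather than `Exception`.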
#### 2. MEDIUM: NgePageService Has No Timeout or Retry (Existing Backlog Item)
NgePageService makes synchronous HTTP calls with no configured timeout and no retry. Per Brooker's guidance:

- Set connection timeout + request timeout based on p99.9 latency
- Add retry with jittered backoff for transient failures (5xx, timeouts)
- Use a token bucket or max retry count to limit amplification
This is already backlog item #4. The articles reinforce that this needs both timeout AND retry with jitter, not just retry.
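A transport-agnostic sketch of combining both pieces (everything here is illustrative; the caller would pass a zero-arg callable that already applies timeouts, e.g. requests' `timeout=(connect, read)` tuple):

```python
import random
import time

def call_with_retry(do_request, max_attempts=3, base_delay=0.2,
                    max_delay=2.0, transient=(TimeoutError,)):
    """Illustrative wrapper: retries transient failures with capped,
    fully jittered backoff, re-raising after the final attempt.
    `do_request` is responsible for its own connect/read timeouts."""
    for attempt in range(max_attempts):
        try:
            return do_request()
        except transient:
            if attempt == max_attempts - 1:
                raise
            cap = min(max_delay, base_delay * (2 ** attempt))
            time.sleep(random.uniform(0, cap))
```

Usage would look like `call_with_retry(lambda: session.get(url, timeout=(2, 10)))`, keeping timeout selection and retry policy as separate, independently tunable concerns.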
#### 3. LOW: No Retry Budget / Token Bucket
Our retry mechanisms don't have a global rate limit. During a sustained downstream failure (e.g., RDS failover), every in-flight message will retry with backoff simultaneously. While the backoff spreads retries over time, there's no cap on total retry throughput.
AWS SDK's built-in token bucket provides this. For our custom SQS requeue, we could add a per-Lambda retry counter that stops retrying if the error rate exceeds a threshold.
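If we do implement the per-Lambda counter, a minimal sketch could track recent outcomes and cut off retries past a failure-rate threshold (class name, window, and threshold are all illustrative):

```python
from collections import deque

class RetryBudget:
    """Sliding-window retry budget: stop granting retries once the
    failure rate over the last `window` calls crosses `threshold`."""

    def __init__(self, window=50, threshold=0.5):
        self.outcomes = deque(maxlen=window)  # True = success
        self.threshold = threshold

    def record(self, success: bool):
        self.outcomes.append(success)

    def retry_allowed(self) -> bool:
        if not self.outcomes:
            return True
        failure_rate = self.outcomes.count(False) / len(self.outcomes)
        return failure_rate < self.threshold
```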
Recommendation: Low priority — our SQS maxReceiveCount + DLQ acts as a natural circuit breaker. Monitor for now; implement if we see retry storms during RDS failovers.
#### 4. INFO: Consistent Jitter for Periodic Tasks
Brooker specifically calls out that jitter for scheduled work should be consistent per host (deterministic, same value each time) rather than random. This makes load patterns reproducible for debugging.
Relevant if we add periodic health checks, polling, or cron-triggered batch processing in future modules. Not applicable to our current retry patterns (where randomness is desirable).
## New Backlog Items
| Item | Priority | Related Backlog |
|---|---|---|
| Add full jitter to `@retry_on_db_conflict` | MEDIUM | New |
| Add full jitter to SQS requeue backoff | MEDIUM | New |
| Add timeout + jittered retry to NgePageService HTTP calls | MEDIUM | #4 (existing) — enhanced with timeout guidance |
## Summary
The Amazon Builders' Library article provides the theoretical foundation — retries are selfish, jitter prevents thundering herd, token buckets beat circuit breakers, and retry at one layer only. The AWS Prescriptive Guidance article adds implementation patterns. Our NGE architecture follows most best practices (capped backoff, exception-based retry decisions, idempotent handlers, single-layer retry). The primary gap is no jitter anywhere — both the DB retry decorator and SQS requeue use pure exponential backoff, which can cause synchronized retry storms during concurrent failures like deadlock cascades.