Black Friday hits your backend like nothing else. Traffic can spike to 10x normal volume or more within minutes, and every slow query, unindexed lookup, or misconfigured cache becomes a customer-facing outage. The architecture patterns that work fine at steady-state traffic need specific preparation for peak events: aggressive caching, queue-based order processing, and graceful degradation for non-critical features.
What breaks first
The failure modes are pretty consistent across the e-commerce platforms I've seen put under peak load:
- Traffic spikes that hit 10-50x normal volume within minutes
- Database contention from concurrent sessions and transactions
- Payment processing slowdowns under high transaction volume
- Inventory conflicts from concurrent purchases of the same SKU
- Cache invalidation races when popular items get edited mid-sale
Steady-state architecture rarely handles this without preparation. The cache hit ratio you live with at normal traffic isn't good enough when 10x as many requests are competing for the same hot keys.
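To put rough numbers on that, here is a back-of-the-envelope sketch. The function names and the traffic figures are illustrative assumptions, not measurements: requests that miss the cache fall through to the origin, so origin load is total traffic times the miss rate.

```python
def origin_rps(total_rps: float, hit_ratio: float) -> float:
    """Requests per second that miss the cache and reach the origin."""
    if not 0.0 <= hit_ratio <= 1.0:
        raise ValueError("hit_ratio must be between 0 and 1")
    return total_rps * (1.0 - hit_ratio)


def hit_ratio_to_hold_origin_flat(current_hit_ratio: float, multiplier: float) -> float:
    """Hit ratio needed at `multiplier` times the traffic to keep origin load unchanged."""
    return 1.0 - (1.0 - current_hit_ratio) / multiplier


# Assumed example: 1,000 rps on a normal day with a 90% hit ratio
baseline = origin_rps(1_000, 0.90)   # origin load on a normal day
peak = origin_rps(10_000, 0.90)      # same hit ratio at 10x traffic
needed = hit_ratio_to_hold_origin_flat(0.90, 10)
```

With these assumed numbers the origin goes from roughly 100 rps to roughly 1,000 rps at the same 90% hit ratio, which is the entire normal-day request volume arriving uncached. Holding origin load flat at 10x traffic would take a hit ratio of about 99%.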
The cost of getting it wrong
A site that goes down on Black Friday doesn't just lose that day's revenue. Customers who hit a 503 during checkout don't come back and try again later. They go to a competitor, and they remember.
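The commonly quoted industry numbers are stark. Cart abandonment is often put at around 70% at baseline and 85% or more once page load times exceed 3 seconds, and a 1-second delay in checkout is widely cited as costing about 7% of conversions. Treat those as rough figures rather than guarantees for your store. And if your site actually crashes, recovery can take hours while your competitors absorb your traffic.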
The flip side: if your site stays fast while others buckle, you pick up their customers. That's why companies that sell physical products invest disproportionately in peak-traffic infrastructure relative to normal capacity. Peak-traffic windows also attract fraud, which makes a real-time fraud detection pipeline worth having in place before the holiday rush.
Building a backend that holds at 10x
Here's the architecture I lean on. It's not exotic, just deliberate.
Step 1: horizontal scaling with proper pool sizing
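SQLAlchemy's default pool (5 connections plus 10 overflow per process) won't survive Black Friday. Size the pool for your peak concurrent request count, not your steady state, and keep the total across all worker processes under the database's connection limit.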
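import json
import uuid

from flask import Flask, jsonify
from flask_sqlalchemy import SQLAlchemy
from redis import Redis

app = Flask(__name__)
app.config['SQLALCHEMY_DATABASE_URI'] = 'postgresql://user:pass@localhost/ecommerce'
# Flask-SQLAlchemy 3.x takes pool settings through SQLALCHEMY_ENGINE_OPTIONS.
# (2.x used SQLALCHEMY_POOL_SIZE and SQLALCHEMY_MAX_OVERFLOW; a key named
# SQLALCHEMY_POOL_OVERFLOW is not recognized and would be silently ignored.)
app.config['SQLALCHEMY_ENGINE_OPTIONS'] = {'pool_size': 20, 'max_overflow': 30}

db = SQLAlchemy(app)
redis_client = Redis(host='localhost', port=6379, db=0, decode_responses=True)

# Lua script: delete the lock only if the caller still owns it (atomic check-and-delete)
RELEASE_LOCK_SCRIPT = """
if redis.call('get', KEYS[1]) == ARGV[1] then
    return redis.call('del', KEYS[1])
end
return 0
"""


class Product(db.Model):
    __tablename__ = 'products'

    id = db.Column(db.Integer, primary_key=True)
    name = db.Column(db.String(100), nullable=False)
    price = db.Column(db.Float, nullable=False)  # Float for brevity; prefer db.Numeric for money
    inventory = db.Column(db.Integer, nullable=False)
    category = db.Column(db.String(50), nullable=False)


class BlackFridayBackend:
    def __init__(self):
        self.cache_ttl = 300  # 5 minutes
        self.rate_limit_requests = 100
        self.rate_limit_window = 60  # 1 minute
        self.lock_ttl = 10  # seconds before an abandoned lock expires

    def get_product_with_cache(self, product_id):
        """Get a product, serving it from Redis when possible."""
        cache_key = f"product:{product_id}"

        # Try cache first. Values are stored as JSON; never eval() a cached
        # string, because that executes whatever ends up in Redis.
        cached_product = redis_client.get(cache_key)
        if cached_product:
            return jsonify({"data": json.loads(cached_product), "source": "cache"})

        # Fall back to the database (uses the pooled connections configured above)
        product = db.session.get(Product, product_id)
        if not product:
            return jsonify({"error": "Product not found"}), 404

        product_data = {
            "id": product.id,
            "name": product.name,
            "price": product.price,
            "inventory": product.inventory,
            "category": product.category,
        }

        # Cache the result as JSON
        redis_client.setex(cache_key, self.cache_ttl, json.dumps(product_data))
        return jsonify({"data": product_data, "source": "database"})

    def handle_concurrent_purchases(self, product_id, quantity):
        """Handle concurrent purchases with a per-product Redis lock."""
        if quantity <= 0:
            return {"error": "Quantity must be positive"}, 400

        lock_key = f"inventory_lock:{product_id}"
        token = str(uuid.uuid4())  # identifies this request as the lock owner

        # Try to acquire the lock; it expires on its own if this process dies
        if not redis_client.set(lock_key, token, nx=True, ex=self.lock_ttl):
            return {"error": "Product is being purchased, please try again"}, 429

        try:
            # Check inventory
            product = db.session.get(Product, product_id)
            if not product or product.inventory < quantity:
                return {"error": "Insufficient inventory"}, 400

            # Update inventory
            product.inventory -= quantity
            db.session.commit()
            remaining = product.inventory

            # Invalidate the cached product so readers see the new stock level
            redis_client.delete(f"product:{product_id}")
            return {"success": True, "remaining_inventory": remaining}
        except Exception:
            db.session.rollback()
            raise
        finally:
            # Release the lock only if it is still ours; a plain DELETE could
            # remove a lock that another request acquired after ours expired.
            redis_client.eval(RELEASE_LOCK_SCRIPT, 1, lock_key, token)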
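The Redis lock above serializes purchases per product, but it is an extra moving part. As a point of comparison, here is a minimal sketch of a different technique, not the one used in the listing: letting the database enforce the stock check with a single conditional UPDATE. It uses the standard-library sqlite3 module only so the example runs on its own; the `reserve_stock` helper is hypothetical, and the table and column names are assumed to mirror the `Product` model. The same UPDATE statement is valid in PostgreSQL.

```python
import sqlite3


def reserve_stock(conn: sqlite3.Connection, product_id: int, quantity: int) -> bool:
    """Decrement inventory only if enough stock remains; return True on success."""
    if quantity <= 0:
        raise ValueError("quantity must be positive")
    with conn:  # commits on success, rolls back if an exception is raised
        cur = conn.execute(
            "UPDATE products SET inventory = inventory - ? "
            "WHERE id = ? AND inventory >= ?",
            (quantity, product_id, quantity),
        )
    # rowcount is 1 when the row passed the stock check, 0 when it did not
    return cur.rowcount == 1


# Standalone demo against an in-memory database
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE products (id INTEGER PRIMARY KEY, inventory INTEGER NOT NULL)")
conn.execute("INSERT INTO products (id, inventory) VALUES (1, 3)")
conn.commit()

first = reserve_stock(conn, 1, 2)    # 3 in stock, asks for 2
second = reserve_stock(conn, 1, 2)   # 1 in stock, asks for 2
remaining = conn.execute("SELECT inventory FROM products WHERE id = 1").fetchone()[0]
```

Because the check and the decrement happen in one statement, two concurrent buyers cannot both pass the stock check for the last unit. Whether this replaces the Redis lock or sits underneath it as a safety net depends on how much other work the purchase path does while holding the lock.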
Step 2: layered caching with sane TTLs
Different data has different volatility, so it should have different TTLs. Hot product pages can sit in cache longer than search results. Sessions get the longest TTL of all because regenerating them is expensive.
import json
from functools import wraps

# Reuses redis_client, db, Product and jsonify from the Step 1 listing.


class BlackFridayCache:
    def __init__(self):
        self.redis_client = redis_client
        self.cache_layers = {
            'hot_products': 600,     # 10 minutes for popular products
            'categories': 1800,      # 30 minutes for category data
            'user_sessions': 3600,   # 1 hour for user sessions
            'search_results': 300    # 5 minutes for search results
        }

    def cache_with_invalidation(self, cache_type, key_func):
        """Decorator that caches a view's "data" payload with a per-layer TTL.

        Entries are invalidated by TTL expiry only.
        """
        def decorator(func):
            @wraps(func)
            def wrapper(*args, **kwargs):
                cache_key = f"{cache_type}:{key_func(*args, **kwargs)}"
                ttl = self.cache_layers.get(cache_type, 300)

                # Try cache first (JSON, not eval, so cached text is never executed)
                cached_result = self.redis_client.get(cache_key)
                if cached_result:
                    return jsonify({"data": json.loads(cached_result), "source": "cache"})

                # Execute the view; it may return a response or a (response, status) tuple
                result = func(*args, **kwargs)
                response = result[0] if isinstance(result, tuple) else result
                status = result[1] if isinstance(result, tuple) else response.status_code
                payload = response.get_json(silent=True)

                # Cache only successful responses that carry a "data" payload
                if status == 200 and isinstance(payload, dict) \
                        and 'data' in payload and 'error' not in payload:
                    self.redis_client.setex(cache_key, ttl, json.dumps(payload['data']))
                return result
            return wrapper
        return decorator

    def preload_black_friday_data(self):
        """Preload critical data before Black Friday traffic."""
        # Preload products from the categories expected to be busiest
        # (a stand-in for a real popularity ranking)
        popular_products = Product.query.filter(
            Product.category.in_(['electronics', 'clothing', 'home'])
        ).limit(100).all()

        for product in popular_products:
            cache_key = f"product:{product.id}"
            product_data = {
                "id": product.id,
                "name": product.name,
                "price": product.price,
                "inventory": product.inventory,
                "category": product.category,
            }
            self.redis_client.setex(
                cache_key, self.cache_layers['hot_products'], json.dumps(product_data)
            )

        # Preload category data
        categories = db.session.query(Product.category).distinct().all()
        for (category_name,) in categories:
            cache_key = f"category:{category_name}"
            category_products = Product.query.filter_by(category=category_name).limit(20).all()
            category_data = [{"id": p.id, "name": p.name, "price": p.price} for p in category_products]
            self.redis_client.setex(
                cache_key, self.cache_layers['categories'], json.dumps(category_data)
            )
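One side effect of the preload step above is that every preloaded product key gets the same TTL, so they all expire at roughly the same moment and the database takes the refill in one burst. A common mitigation, sketched here as an assumption rather than something the listing does, is to add a little random jitter to each TTL. The `jittered_ttl` helper and its 10% spread are hypothetical.

```python
import random


def jittered_ttl(base_ttl: int, spread: float = 0.1, rng=None) -> int:
    """Return base_ttl shifted by up to +/- spread (a fraction of base_ttl)."""
    if not 0.0 <= spread < 1.0:
        raise ValueError("spread must be in [0, 1)")
    rng = rng or random
    offset = rng.uniform(-spread, spread) * base_ttl
    return max(1, int(base_ttl + offset))


# Seeded generator so the demo is repeatable; production code would not seed
rng = random.Random(42)
ttls = [jittered_ttl(600, spread=0.1, rng=rng) for _ in range(100)]
```

In the preload loop, the TTL argument passed to `setex` would become `jittered_ttl(600)`, spreading expiries for a 10-minute TTL across roughly a 2-minute window.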
Step 3: splitting hot paths into separate services
Once your monolith is saturating CPU during peak, the next move is splitting the hot paths. Product reads, inventory updates, and payment processing have different scaling profiles, so co-locating them on the same fleet wastes capacity. A round-robin client across product/inventory/payment service pools, with async failover when an instance returns an error, is usually enough.
import asyncio

import aiohttp


class BlackFridayMicroservices:
    def __init__(self, timeout_seconds=2.0):
        self.services = {
            'product_service': ['http://product-1:5000', 'http://product-2:5000'],
            'inventory_service': ['http://inventory-1:5001', 'http://inventory-2:5001'],
            'payment_service': ['http://payment-1:5002', 'http://payment-2:5002']
        }
        self.service_index = {service: 0 for service in self.services}
        # Without a timeout, a hung instance would never trigger failover
        self.timeout = aiohttp.ClientTimeout(total=timeout_seconds)

    def get_service_url(self, service_name):
        """Round-robin load balancing for services"""
        urls = self.services[service_name]
        index = self.service_index[service_name]
        self.service_index[service_name] = (index + 1) % len(urls)
        return urls[index]

    async def call_service_async(self, service_name, endpoint, data=None):
        """Call a service, failing over to the next instance on error.

        Each instance is tried at most once. Retrying a POST is only safe if
        the endpoint is idempotent (payments in particular need an
        idempotency key, or a retry can charge twice).
        """
        method = "GET" if data is None else "POST"
        attempts = len(self.services[service_name])
        last_error = None

        # One session per call keeps the example short; a long-lived shared
        # session reuses connections and is the better choice under load.
        async with aiohttp.ClientSession(timeout=self.timeout) as session:
            for _ in range(attempts):
                url = f"{self.get_service_url(service_name)}{endpoint}"
                try:
                    async with session.request(method, url, json=data) as response:
                        response.raise_for_status()  # treat HTTP error statuses as failures
                        return await response.json()
                except (aiohttp.ClientError, asyncio.TimeoutError) as exc:
                    last_error = exc  # fail over to the next instance

        raise last_error
What I've learned the hard way
Scale before the spike, not during. Reactive autoscaling has a cold-start tax that you cannot afford when traffic is doubling every two minutes. Pre-warm your fleet to expected peak the morning of the sale and treat it as fixed capacity.
Build monitoring that's actually useful during an incident. Average response time is useless during a spike; you need p95 and p99 split by endpoint. CPU is misleading without queue depth alongside it.
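A small sketch of why the average hides the problem. The latency samples are made up for illustration, and the nearest-rank `percentile` helper is one of several reasonable percentile definitions, so treat the exact values as an assumption of this example.

```python
import math
from collections import defaultdict
from statistics import mean


def percentile(samples, pct):
    """Nearest-rank percentile: smallest sample with at least pct% of samples at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def latency_summary(records):
    """Group (endpoint, latency_ms) pairs and report mean, p95 and p99 per endpoint."""
    by_endpoint = defaultdict(list)
    for endpoint, latency_ms in records:
        by_endpoint[endpoint].append(latency_ms)
    return {
        endpoint: {
            "mean": mean(values),
            "p95": percentile(values, 95),
            "p99": percentile(values, 99),
        }
        for endpoint, values in by_endpoint.items()
    }


# Hypothetical sample: 94 fast checkouts, 6 stuck behind a lock for 4 seconds
records = [("/checkout", 50)] * 94 + [("/checkout", 4000)] * 6
records += [("/products", 20)] * 100
summary = latency_summary(records)
```

In this made-up sample the checkout mean is 287 ms, which looks tolerable, while p95 is already 4,000 ms. Mixing in the healthy `/products` traffic would drag the overall average down further, which is why the split by endpoint matters.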
Plan for graceful degradation. If recommendations are down, the product page still loads. If reviews are slow, the page renders without them and fetches them async. Pick the features you can disable and wire feature flags ahead of time so you can flip them from a dashboard during the event.
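Here is one way that wiring can look, as a minimal sketch. The in-process `FEATURE_FLAGS` dict, the `with_fallback` helper, and the simulated recommendation outage are all hypothetical; in a real deployment the flags would live wherever your dashboard writes them.

```python
from typing import Callable, Dict, TypeVar

T = TypeVar("T")

# Hypothetical flag store; stands in for Redis, a config service, or similar
FEATURE_FLAGS: Dict[str, bool] = {"recommendations": True, "reviews": True}


def with_fallback(feature: str, call: Callable[[], T], fallback: T) -> T:
    """Run `call` if the feature is enabled; return `fallback` if it is disabled or fails."""
    if not FEATURE_FLAGS.get(feature, False):
        return fallback
    try:
        return call()
    except Exception:
        # A non-critical dependency failed: degrade instead of failing the page
        return fallback


def fetch_recommendations(product_id: int) -> list:
    """Simulated outage of a non-critical service."""
    raise TimeoutError("recommendation service unavailable")


def render_product_page(product_id: int) -> dict:
    """Build the page payload; recommendations are optional."""
    return {
        "product_id": product_id,
        "recommendations": with_fallback(
            "recommendations", lambda: fetch_recommendations(product_id), []
        ),
    }


page = render_product_page(42)            # service is down, page still renders
FEATURE_FLAGS["recommendations"] = False  # flag flipped off during the event
page_when_disabled = render_product_page(42)
```

In both cases the page renders with an empty recommendations list. The difference is that with the flag off, the failing service is never called, so it stops consuming request time and threads.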
Load test the way traffic actually arrives. A linear ramp tells you almost nothing useful. Real Black Friday traffic looks like a step function: zero, then 10x, then sustained for hours. Run the test that mirrors that shape.
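The two shapes are easy to compare as target-rate functions that a load generator could sample once per second. This is a sketch with assumed numbers (100 rps baseline, 1,000 rps peak), and both function names are hypothetical rather than part of any load-testing tool.

```python
def linear_ramp_rps(t, baseline, peak, ramp_seconds):
    """Target rate for a linear ramp from baseline to peak, then hold at peak."""
    if t >= ramp_seconds:
        return peak
    return baseline + (peak - baseline) * t / ramp_seconds


def step_rps(t, baseline, peak, sale_start):
    """Target rate for a step: baseline until the sale opens, then peak immediately."""
    return peak if t >= sale_start else baseline


baseline, peak = 100, 1000
# Sample both profiles at a few points in time (seconds since test start)
ramp = [linear_ramp_rps(t, baseline, peak, 600) for t in (0, 60, 300, 600)]
step = [step_rps(t, baseline, peak, 60) for t in (0, 59, 60, 600)]
```

One minute into the ramp the system is still at under a fifth of peak and has had time to warm caches and scale out. One minute into the step profile it has been at full peak since the instant the sale opened, which is the case that exposes cold caches and slow autoscaling.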
Write the runbook before the day. Who pages whom at 3am, who has authority to disable a feature, what the rollback steps are. The day of the sale is not when to figure out who has database admin access.
CDN everything that can be cached. Product images, CSS, JS, marketing pages, even the JSON for the homepage product grid if you can stand the TTL. Every request you serve from the edge is one your origin doesn't see.
The database is usually where it breaks. Make sure your hot queries are indexed, your connection pool is sized correctly, and you've got read replicas absorbing the catalog reads.
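"Sized correctly" has an upper bound as well as a lower one. Each worker process holds its own SQLAlchemy pool, so the fleet-wide total is what the database sees. A quick sketch of that arithmetic follows; the instance and worker counts are assumptions for illustration, the helper names are hypothetical, and 100 is PostgreSQL's default `max_connections`.

```python
def total_db_connections(instances, workers_per_instance, pool_size, max_overflow):
    """Worst-case connections the fleet can open: every pool fully overflowed."""
    return instances * workers_per_instance * (pool_size + max_overflow)


def fits_connection_budget(total, max_connections, reserved=10):
    """Check the fleet total against the server limit, keeping some slots for admin use."""
    return total <= max_connections - reserved


# Assumed fleet: 10 instances x 4 workers, with pool_size=20 and max_overflow=30
total = total_db_connections(10, 4, pool_size=20, max_overflow=30)
fits_default = fits_connection_budget(total, max_connections=100)
```

With these assumed numbers the fleet can open up to 2,000 connections, far beyond a default PostgreSQL limit of 100. The usual ways out are a connection pooler such as PgBouncer in front of the database, a higher server limit backed by enough memory, or smaller per-process pools.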
Deployment notes
Auto-scaling groups need to be configured for rapid scale-out, not scale-in. Set the scale-out cooldown low (60 seconds) and the scale-in cooldown high (10 minutes) so a momentary dip doesn't return capacity you'll need 30 seconds later. The ECS Fargate vs EKS comparison covers the tradeoffs when picking the underlying platform, and zero-downtime deployments become non-negotiable once you can't afford a maintenance window.
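The asymmetric cooldowns can be modeled in a few lines. This is a simplified sketch of the decision logic, not any cloud provider's API: the `ScalingPolicy` class, the `decide` function, and the utilization thresholds are assumptions, while the 60-second and 10-minute cooldowns come from the paragraph above.

```python
from dataclasses import dataclass


@dataclass
class ScalingPolicy:
    scale_out_cooldown: int = 60       # seconds; low, so capacity arrives fast
    scale_in_cooldown: int = 600       # seconds; high, so dips don't shed capacity
    scale_out_threshold: float = 0.70  # assumed utilization trigger
    scale_in_threshold: float = 0.30   # assumed utilization trigger


def decide(policy: ScalingPolicy, utilization: float, now: int, last_action_at: int) -> str:
    """Return 'scale_out', 'scale_in', or 'hold' for one evaluation tick."""
    elapsed = now - last_action_at
    if utilization >= policy.scale_out_threshold and elapsed >= policy.scale_out_cooldown:
        return "scale_out"
    if utilization <= policy.scale_in_threshold and elapsed >= policy.scale_in_cooldown:
        return "scale_in"
    return "hold"


policy = ScalingPolicy()
busy = decide(policy, utilization=0.90, now=70, last_action_at=0)       # past 60 s cooldown
brief_dip = decide(policy, utilization=0.20, now=70, last_action_at=0)  # inside 600 s cooldown
long_quiet = decide(policy, utilization=0.20, now=600, last_action_at=0)
```

A dip 70 seconds after the last scaling action returns "hold", so the capacity stays in place, while sustained load 70 seconds after the last action is allowed to scale out again.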
On monitoring, set alerts on leading indicators (queue depth, p99 latency, error rate trending up) rather than lagging ones (CPU at 95%, traffic at 10x). By the time CPU is pinned, you're already in the incident.
For backups, verify your point-in-time recovery works before the event, not after. A backup you've never restored is a hope, not a backup.
Run a peak-load drill 1-2 weeks before. Push synthetic traffic to your staging environment at the volume you expect, with the same traffic shape. Whatever falls over is what you fix in the remaining time.
Where these patterns show up
The same patterns apply outside Black Friday. Anything with a peak-traffic shape benefits from the same toolkit: multi-vendor marketplaces during sale events, flash-sale apps with hard time windows, ticket platforms during on-sale moments, gaming services during launch day, streaming during a big game. The numbers differ, the failure modes don't.
Wrap-up
The patterns I keep coming back to: caching, queues, locks, async fan-out, graceful degradation. None of it is novel. What's hard is being disciplined enough to apply all of it before the sale, not after the first 503 hits production.
Two things matter more than any specific technical choice. First, you have to actually run the load test, with realistic traffic shape, in something close to production. Second, you have to write the runbook before the day. Everything else is implementation detail.
Next steps
- Pick the right autoscaling shape (fast scale-out, slow scale-in) and pre-warm capacity before the event
- Add the caching layers, sized to your hottest pages, with TTLs you've actually tested
- Run a load test that matches the real traffic shape, not a clean linear ramp
- Write the incident runbook and walk the on-call team through it before sale day