Performance Debugging War Stories
Some performance bugs are obvious: a missing index, an N+1 query, a memory leak. Others are demons that hide in the shadows, appearing only under specific conditions, laughing at your profilers and mocking your metrics.
Here are the war stories from the trenches — the performance bugs that took days to find and minutes to fix, along with the systematic approaches that finally cornered them.
War Story #1: The Ghost in the Load Balancer
⚔️ The Battlefield
Symptoms: API responses randomly took 30+ seconds, but only for 2% of requests. No pattern in timing, user ID, or request type. 98% of requests were lightning fast (<100ms).
The red herrings:
- Database queries looked fine (all under 50ms)
- Application metrics showed normal CPU and memory usage
- Third-party API calls were fast
- No correlation with request size or complexity
The breakthrough: Added request-level tracing with correlation IDs. Discovered that slow requests were spending 29 seconds in "network time" — not in our application at all.
// Added timing middleware to track request journey
app.use((req, res, next) => {
  req.startTime = process.hrtime.bigint()
  req.correlationId = generateId()

  const originalSend = res.send
  res.send = function(data) {
    const duration = Number(process.hrtime.bigint() - req.startTime) / 1000000
    logger.info('request_completed', {
      correlationId: req.correlationId,
      method: req.method,
      path: req.path,
      duration: duration,
      statusCode: res.statusCode
    })
    return originalSend.call(this, data)
  }

  next()
})
The culprit: AWS Application Load Balancer was occasionally routing requests to unhealthy instances that were in the process of shutting down. The instances would hold connections for 30 seconds before timing out.
The fix: Proper health check configuration and graceful shutdown handling.
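A sketch of the graceful-shutdown half of that fix, assuming an Express app behind the load balancer (the health-check tuning itself lives in the ALB target group configuration and isn't shown; the /healthz path is illustrative):

// Sketch: fail health checks during shutdown and drain in-flight requests
let shuttingDown = false

// Point the load balancer's health check here so the instance is
// pulled out of rotation as soon as shutdown begins
app.get('/healthz', (req, res) => {
  res.status(shuttingDown ? 503 : 200).end()
})

const server = app.listen(3000)

process.on('SIGTERM', () => {
  shuttingDown = true
  // Stop accepting new connections; let in-flight requests finish
  server.close(() => process.exit(0))
  // Hard deadline so the process never holds connections for 30 seconds
  setTimeout(() => process.exit(1), 10000).unref()
})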
Lesson learned: Performance problems aren't always in your code. Sometimes they're in the infrastructure layer you can't see.
War Story #2: The Midnight Memory Thief
⚔️ The Battlefield
Symptoms: Node.js application would run perfectly for 3-4 days, then suddenly crash with out-of-memory errors. Always happened between 2-4 AM EST.
The investigation tools:
// Periodic memory sampler, logged via setInterval below
const memoryUsage = () => {
  const used = process.memoryUsage()
  // bytes -> MB, rounded to two decimal places
  const memInfo = {
    rss: Math.round(used.rss / 1024 / 1024 * 100) / 100,
    heapTotal: Math.round(used.heapTotal / 1024 / 1024 * 100) / 100,
    heapUsed: Math.round(used.heapUsed / 1024 / 1024 * 100) / 100,
    external: Math.round(used.external / 1024 / 1024 * 100) / 100,
    timestamp: new Date().toISOString()
  }

  // Sampled every minute; warn once heap usage exceeds 200 MB
  if (memInfo.heapUsed > 200) {
    logger.warn('high_memory_usage', memInfo)
  }

  return memInfo
}

setInterval(memoryUsage, 60000)
The false leads:
- No obvious memory leaks in application code
- Garbage collection seemed to be working
- Memory usage was stable during normal business hours
The breakthrough: Noticed that crashes always happened during automated backup jobs. The backup process was triggering a CSV export that loaded entire database tables into memory.
// The memory killer
async function exportUsers() {
  const users = await db.users.findAll() // Loading 2M+ records
  const csvData = users.map(user =>
    `${user.id},${user.email},${user.name}`
  ).join('\n')
  return csvData
}
The fix: Streaming CSV export with batched queries.
// Memory-efficient streaming export
async function* streamUsers() {
  let offset = 0
  const batchSize = 1000

  while (true) {
    const users = await db.users.findAll({
      limit: batchSize,
      offset: offset
    })

    if (users.length === 0) break

    for (const user of users) {
      yield `${user.id},${user.email},${user.name}\n`
    }

    offset += batchSize
  }
}
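To keep memory flat end to end, the generator can be piped straight to the HTTP response instead of being joined into one big string. A usage sketch (Readable.from and stream/promises need Node 12+ and 15+ respectively; the route path is illustrative):

// Usage sketch: stream the CSV to the client without buffering it
const { Readable } = require('stream')
const { pipeline } = require('stream/promises')

app.get('/admin/export/users.csv', async (req, res) => {
  res.setHeader('Content-Type', 'text/csv')
  // pipeline handles back-pressure: batches are only fetched as the client reads
  await pipeline(Readable.from(streamUsers()), res)
})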
Lesson learned: Scheduled jobs and background processes are often performance blind spots. Monitor them separately from user-facing traffic.
War Story #3: The Query That Broke Physics
⚔️ The Battlefield
Symptoms: Database query that should return in 5ms was taking 15+ seconds, but only for certain user IDs. Same query structure, same indexes, completely different performance.
The mystery deepens:
-- Fast for most users (< 5ms)
SELECT * FROM orders
WHERE user_id = 12345
AND created_at > '2023-01-01'
ORDER BY created_at DESC
LIMIT 10
-- Slow for some users (15+ seconds)
SELECT * FROM orders
WHERE user_id = 67890 -- Same query, different user_id
AND created_at > '2023-01-01'
ORDER BY created_at DESC
LIMIT 10
The investigation: Used PostgreSQL's query execution plans to understand the difference.
-- Analyze query plans
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM orders
WHERE user_id = 67890
AND created_at > '2023-01-01'
ORDER BY created_at DESC
LIMIT 10
The revelation: PostgreSQL was choosing different execution plans based on data distribution. Some users had 1,000 orders, others had 100,000. The query planner was making suboptimal choices for high-volume users.
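A standard way to confirm that kind of skew (a hypothetical check, not the exact query from the original investigation):

-- How uneven is the per-user order count?
SELECT user_id, count(*) AS order_count
FROM orders
GROUP BY user_id
ORDER BY order_count DESC
LIMIT 10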
The data distribution problem:
- 90% of users had <100 orders (fast queries)
- 5% of users had 1,000-10,000 orders (slow queries)
- 5% of users had 10,000+ orders (very slow queries)
The fix: Composite index optimized for the actual query pattern.
-- Original index
CREATE INDEX idx_orders_user_id ON orders(user_id)
CREATE INDEX idx_orders_created_at ON orders(created_at)
-- Optimized composite index
CREATE INDEX idx_orders_user_created_desc
ON orders(user_id, created_at DESC)
-- Query planner now uses optimal path for all users
Lesson learned: Data distribution matters as much as data structure. Test your queries against realistic data volumes and patterns.
The Systematic Approach to Performance Debugging
Phase 1: Establish Baseline
Before you can fix performance, you need to know what "normal" looks like.
// Comprehensive performance baseline
const performanceMetrics = {
  // Application metrics
  responseTime: {
    p50: 0,
    p95: 0,
    p99: 0
  },
  // System metrics
  cpu: process.cpuUsage(),
  memory: process.memoryUsage(),
  // Business metrics
  throughput: 0, // requests per second
  errorRate: 0, // percentage of failed requests
  // Infrastructure metrics
  dbConnectionPool: db.pool.totalCount,
  activeConnections: db.pool.totalCount - db.pool.idleCount
}
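Those p50/p95/p99 placeholders have to come from somewhere. A minimal sketch of computing them from the per-request durations logged by the timing middleware (the durations array and the nearest-rank method are assumptions, not the original implementation):

const durations = [] // pushed to by the request-timing middleware

// Nearest-rank percentile over a sorted array of millisecond durations
function percentile(sorted, p) {
  if (sorted.length === 0) return 0
  const rank = Math.ceil((p / 100) * sorted.length)
  return sorted[Math.min(sorted.length, rank) - 1]
}

function summarizeResponseTimes() {
  const sorted = [...durations].sort((a, b) => a - b)
  return {
    p50: percentile(sorted, 50),
    p95: percentile(sorted, 95),
    p99: percentile(sorted, 99)
  }
}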
Phase 2: Reproduce the Problem
If you can't reproduce it consistently, you can't fix it reliably.
Load testing strategy:
// Artillery.js load test configuration
{
  "config": {
    "target": "http://localhost:3000",
    "phases": [
      { "duration": 60, "arrivalRate": 10 },
      { "duration": 120, "arrivalRate": 20 },
      { "duration": 60, "arrivalRate": 50 }
    ]
  },
  "scenarios": [
    {
      "name": "User journey simulation",
      "weight": 70,
      "flow": [
        { "get": { "url": "/api/dashboard" } },
        { "think": 2 },
        { "post": { "url": "/api/search", "json": { "query": "test" } } },
        { "think": 3 },
        { "get": { "url": "/api/profile" } }
      ]
    }
  ]
}
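Assuming the configuration above is saved as load-test.json, Artillery runs it with artillery run load-test.json and reports latency percentiles when the run finishes.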
Phase 3: Instrument Everything
Add logging and metrics at every layer to understand where time is spent.
// Request tracing through entire stack
class RequestTracer {
  constructor(correlationId) {
    this.correlationId = correlationId
    this.spans = []
    this.startTime = Date.now()
  }

  startSpan(name) {
    const span = {
      name,
      startTime: Date.now(),
      correlationId: this.correlationId
    }
    this.spans.push(span)
    return span
  }

  endSpan(span) {
    span.endTime = Date.now()
    span.duration = span.endTime - span.startTime
    logger.info('span_completed', {
      correlationId: this.correlationId,
      spanName: span.name,
      duration: span.duration
    })
  }

  getSummary() {
    const totalDuration = Date.now() - this.startTime
    return {
      correlationId: this.correlationId,
      totalDuration,
      spans: this.spans.map(s => ({
        name: s.name,
        duration: s.duration,
        percentage: (s.duration / totalDuration * 100).toFixed(2)
      }))
    }
  }
}
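A usage sketch for a single route (the correlation-ID middleware from War Story #1 is assumed to have run; db.widgets and req.user are placeholder names):

app.get('/api/dashboard', async (req, res) => {
  const tracer = new RequestTracer(req.correlationId)

  const dbSpan = tracer.startSpan('db_query')
  const widgets = await db.widgets.findAll({ where: { userId: req.user.id } })
  tracer.endSpan(dbSpan)

  const serializeSpan = tracer.startSpan('serialize_response')
  const payload = widgets.map(w => w.toJSON())
  tracer.endSpan(serializeSpan)

  logger.info('request_trace', tracer.getSummary())
  res.json(payload)
})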
Phase 4: Eliminate Variables
Isolate the problem by removing complexity systematically.
The binary search approach:
- Disable half of the middleware (or background jobs, or feature flags). If the problem persists, the culprit is in the half still running.
- If the problem disappears, the culprit is in the half you disabled.
- Repeat the split on the suspect half until you isolate the exact cause (a sketch of one way to toggle middleware without editing code each time follows this list).
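One way to make that bisection repeatable instead of hand-editing code on every iteration is to gate middleware behind an environment variable (all names here are illustrative):

// Sketch: enable or disable middleware per deploy via DEBUG_MIDDLEWARE=auth,logging
const enabled = (process.env.DEBUG_MIDDLEWARE || 'auth,logging,rateLimit,compression').split(',')

const middleware = {
  auth: authMiddleware,
  logging: loggingMiddleware,
  rateLimit: rateLimitMiddleware,
  compression: compressionMiddleware
}

for (const [name, handler] of Object.entries(middleware)) {
  if (enabled.includes(name)) app.use(handler)
}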
Common Performance Anti-Patterns
The Hidden N+1 Query
The most common performance killer, often hidden in seemingly innocent code.
// Looks innocent, kills performance
async function getUsersWithPosts() {
  const users = await User.findAll() // 1 query
  for (const user of users) {
    user.posts = await Post.findByUserId(user.id) // N queries
  }
  return users
}

// Optimized version
async function getUsersWithPosts() {
  return await User.findAll({
    include: [{ model: Post }] // 1 query with JOIN
  })
}
The Accidental Full Table Scan
When your WHERE clause doesn't match your indexes.
-- Full table scan (slow)
SELECT * FROM users WHERE LOWER(email) = 'user@example.com'
-- Index scan (fast)
SELECT * FROM users WHERE email = 'user@example.com'
-- Requires: CREATE INDEX idx_users_email ON users(email)
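If the case-insensitive lookup is genuinely required, PostgreSQL can also index the expression itself, so the first form gets an index scan too (index name is illustrative):

-- Lets WHERE LOWER(email) = ... use an index
CREATE INDEX idx_users_email_lower ON users (LOWER(email))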
The Memory Leak Disguised as Caching
Caches that grow forever eventually become memory leaks.
// Memory leak waiting to happen
const cache = new Map()
function getCachedData(key) {
  if (!cache.has(key)) {
    cache.set(key, expensiveOperation(key))
  }
  return cache.get(key)
}
// Bounded cache with TTL
// (option names per lru-cache v7+; older versions used maxAge instead of ttl)
const LRU = require('lru-cache')
const cache = new LRU({
  max: 1000,
  ttl: 1000 * 60 * 10 // 10 minutes
})
The Performance Debugging Toolkit
Essential Tools
- Application Performance Monitoring: DataDog, New Relic, Sentry
- Database Profiling: pg_stat_statements, MySQL slow query log
- Memory Profiling: Node.js heap snapshots, Chrome DevTools
- Load Testing: Artillery, k6, Apache Bench
- Network Analysis: Wireshark, tcpdump
The Performance Debugging Checklist
- ✅ Establish baseline metrics
- ✅ Add comprehensive logging
- ✅ Profile under realistic load
- ✅ Check database query plans
- ✅ Monitor memory usage patterns
- ✅ Analyze network latency
- ✅ Test with production-like data
The Bottom Line
Performance debugging is detective work. The obvious suspects are usually innocent, and the real culprit is hiding in plain sight.
Build observability into your systems from day one. Instrument your code like you're building a flight recorder — when things go wrong, you'll need the data to understand what happened.
Remember: The best performance bug is the one you prevent by measuring and monitoring from the beginning. The second-best is the one you can reproduce consistently.
Dealing with mysterious performance issues? Let's debug it together — I love a good performance mystery.