Performance Debugging War Stories

December 2024 • 9 min read • Performance Engineering

Some performance bugs are obvious: a missing index, an N+1 query, a memory leak. Others are demons that hide in the shadows, appearing only under specific conditions, laughing at your profilers and mocking your metrics.

Here are the war stories from the trenches — the performance bugs that took days to find and minutes to fix, along with the systematic approaches that finally cornered them.

War Story #1: The Ghost in the Load Balancer

⚔️ The Battlefield

Symptoms: API responses randomly took 30+ seconds, but only for 2% of requests. No pattern in timing, user ID, or request type. 98% of requests were lightning fast (<100ms).

The red herrings:

  • Database queries looked fine (all under 50ms)
  • Application metrics showed normal CPU and memory usage
  • Third-party API calls were fast
  • No correlation with request size or complexity

The breakthrough: Added request-level tracing with correlation IDs. Discovered that slow requests were spending 29 seconds in "network time" — not in our application at all.

// Added timing middleware to track request journey
app.use((req, res, next) => {
    req.startTime = process.hrtime.bigint()
    req.correlationId = generateId()
    
    const originalSend = res.send
    res.send = function(data) {
        const duration = Number(process.hrtime.bigint() - req.startTime) / 1000000
        logger.info('request_completed', {
            correlationId: req.correlationId,
            method: req.method,
            path: req.path,
            duration: duration,
            statusCode: res.statusCode
        })
        return originalSend.call(this, data)
    }
    next()
})

The culprit: AWS Application Load Balancer was occasionally routing requests to unhealthy instances that were in the process of shutting down. The instances would hold connections for 30 seconds before timing out.

The fix: Proper health check configuration and graceful shutdown handling.
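
A minimal sketch of the graceful shutdown side, assuming an Express app started with app.listen(): stop accepting new connections on SIGTERM, let in-flight requests drain, and exit before held connections can sit on the load balancer's 30-second timeout.

const server = app.listen(3000)

process.on('SIGTERM', () => {
    // Stop accepting new connections; in-flight requests are allowed to finish
    server.close(() => process.exit(0))

    // Safety net: force exit if draining takes longer than the deregistration window
    setTimeout(() => process.exit(1), 25000).unref()
})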

Lesson learned: Performance problems aren't always in your code. Sometimes they're in the infrastructure layer you can't see.

War Story #2: The Midnight Memory Thief

⚔️ The Battlefield

Symptoms: Node.js application would run perfectly for 3-4 days, then suddenly crash with out-of-memory errors. Always happened between 2-4 AM EST.

The investigation tools:

// Added memory tracking middleware
const memoryUsage = () => {
    const used = process.memoryUsage()
    const memInfo = {
        rss: Math.round(used.rss / 1024 / 1024 * 100) / 100,
        heapTotal: Math.round(used.heapTotal / 1024 / 1024 * 100) / 100,
        heapUsed: Math.round(used.heapUsed / 1024 / 1024 * 100) / 100,
        external: Math.round(used.external / 1024 / 1024 * 100) / 100,
        timestamp: new Date().toISOString()
    }
    
    // Sampled once a minute by setInterval below; warn only when heap exceeds 200 MB
    if (memInfo.heapUsed > 200) {
        logger.warn('high_memory_usage', memInfo)
    }
    
    return memInfo
}

setInterval(memoryUsage, 60000)

The false leads:

  • No obvious memory leaks in application code
  • Garbage collection seemed to be working
  • Memory usage was stable during normal business hours

The breakthrough: Noticed that crashes always happened during automated backup jobs. The backup process was triggering a CSV export that loaded entire database tables into memory.

// The memory killer
async function exportUsers() {
    const users = await db.users.findAll() // Loading 2M+ records
    const csvData = users.map(user => 
        `${user.id},${user.email},${user.name}`
    ).join('\n')
    return csvData
}

The fix: Streaming CSV export with batched queries.

// Memory-efficient streaming export
async function* streamUsers() {
    let offset = 0
    const batchSize = 1000
    
    while (true) {
        const users = await db.users.findAll({
            limit: batchSize,
            offset: offset
        })
        
        if (users.length === 0) break
        
        for (const user of users) {
            yield `${user.id},${user.email},${user.name}\n`
        }
        
        offset += batchSize
    }
}

Lesson learned: Scheduled jobs and background processes are often performance blind spots. Monitor them separately from user-facing traffic.
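
A sketch of what that separate monitoring can look like, reusing the same logger as above: wrap each scheduled job so its duration and heap growth get logged under their own event name instead of blending into request metrics (the wrapper and job names are illustrative).

// Hypothetical wrapper for scheduled jobs
async function runTrackedJob(name, job) {
    const heapBefore = process.memoryUsage().heapUsed
    const start = Date.now()
    try {
        await job()
    } finally {
        logger.info('job_completed', {
            job: name,
            durationMs: Date.now() - start,
            heapDeltaMb: Math.round((process.memoryUsage().heapUsed - heapBefore) / 1024 / 1024)
        })
    }
}

// runTrackedJob('nightly_user_export', exportUsers)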

War Story #3: The Query That Broke Physics

⚔️ The Battlefield

Symptoms: Database query that should return in 5ms was taking 15+ seconds, but only for certain user IDs. Same query structure, same indexes, completely different performance.

The mystery deepens:

-- Fast for most users (< 5ms)
SELECT * FROM orders 
WHERE user_id = 12345 
AND created_at > '2023-01-01' 
ORDER BY created_at DESC 
LIMIT 10

-- Slow for some users (15+ seconds)
SELECT * FROM orders 
WHERE user_id = 67890  -- Same query, different user_id
AND created_at > '2023-01-01' 
ORDER BY created_at DESC 
LIMIT 10

The investigation: Used PostgreSQL's query execution plans to understand the difference.

-- Analyze query plans
EXPLAIN (ANALYZE, BUFFERS) 
SELECT * FROM orders 
WHERE user_id = 67890 
AND created_at > '2023-01-01' 
ORDER BY created_at DESC 
LIMIT 10

The revelation: PostgreSQL was choosing different execution plans based on data distribution. Some users had 1,000 orders, others had 100,000. The query planner was making suboptimal choices for high-volume users.

The data distribution problem:

  • 90% of users had <100 orders (fast queries)
  • 5% of users had 1,000-10,000 orders (slow queries)
  • 5% of users had 10,000+ orders (very slow queries)

The fix: Composite index optimized for the actual query pattern.

-- Original indexes (two separate single-column indexes)
CREATE INDEX idx_orders_user_id ON orders(user_id);
CREATE INDEX idx_orders_created_at ON orders(created_at);

-- Optimized composite index
CREATE INDEX idx_orders_user_created_desc 
ON orders(user_id, created_at DESC)

-- Query planner now uses optimal path for all users

Lesson learned: Data distribution matters as much as data structure. Test your queries against realistic data volumes and patterns.

The Systematic Approach to Performance Debugging

Phase 1: Establish Baseline

Before you can fix performance, you need to know what "normal" looks like.

// Comprehensive performance baseline
const performanceMetrics = {
    // Application metrics
    responseTime: {
        p50: 0,
        p95: 0,
        p99: 0
    },
    
    // System metrics
    cpu: process.cpuUsage(),
    memory: process.memoryUsage(),
    
    // Business metrics
    throughput: 0, // requests per second
    errorRate: 0,  // percentage of failed requests
    
    // Infrastructure metrics
    dbConnectionPool: db.pool.totalCount,
    idleConnections: db.pool.idleCount,
    activeConnections: db.pool.totalCount - db.pool.idleCount
}
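
The percentile fields above start at zero; here is a minimal sketch of filling them in from durations collected by the timing middleware (a simple sort-based calculation, adequate for a baseline snapshot):

const responseTimes = [] // push each request's duration here from the middleware

function percentile(sorted, p) {
    if (sorted.length === 0) return 0
    return sorted[Math.min(sorted.length - 1, Math.floor(sorted.length * p))]
}

function snapshotResponseTimes() {
    const sorted = [...responseTimes].sort((a, b) => a - b)
    return {
        p50: percentile(sorted, 0.50),
        p95: percentile(sorted, 0.95),
        p99: percentile(sorted, 0.99)
    }
}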

Phase 2: Reproduce the Problem

If you can't reproduce it consistently, you can't fix it reliably.

Load testing strategy:

// Artillery.js load test configuration
{
  "config": {
    "target": "http://localhost:3000",
    "phases": [
      { "duration": 60, "arrivalRate": 10 },
      { "duration": 120, "arrivalRate": 20 },
      { "duration": 60, "arrivalRate": 50 }
    ]
  },
  "scenarios": [
    {
      "name": "User journey simulation",
      "weight": 70,
      "flow": [
        { "get": { "url": "/api/dashboard" }},
        { "think": 2 },
        { "post": { "url": "/api/search", "json": {"query": "test"} }},
        { "think": 3 },
        { "get": { "url": "/api/profile" }}
      ]
    }
  ]
}

Phase 3: Instrument Everything

Add logging and metrics at every layer to understand where time is spent.

// Request tracing through entire stack
class RequestTracer {
    constructor(correlationId) {
        this.correlationId = correlationId
        this.spans = []
        this.startTime = Date.now()
    }
    
    startSpan(name) {
        const span = {
            name,
            startTime: Date.now(),
            correlationId: this.correlationId
        }
        this.spans.push(span)
        return span
    }
    
    endSpan(span) {
        span.endTime = Date.now()
        span.duration = span.endTime - span.startTime
        
        logger.info('span_completed', {
            correlationId: this.correlationId,
            spanName: span.name,
            duration: span.duration
        })
    }
    
    getSummary() {
        const totalDuration = Date.now() - this.startTime
        return {
            correlationId: this.correlationId,
            totalDuration,
            spans: this.spans.map(s => ({
                name: s.name,
                duration: s.duration,
                percentage: (s.duration / totalDuration * 100).toFixed(2)
            }))
        }
    }
}
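
A hypothetical usage sketch, wrapping a single database call in a span inside a route handler (generateId, db, logger, and req.user are assumed from the earlier examples):

app.get('/api/orders', async (req, res) => {
    const tracer = new RequestTracer(req.correlationId || generateId())

    const dbSpan = tracer.startSpan('db.orders.recent')
    const orders = await db.orders.findAll({ where: { userId: req.user.id }, limit: 10 })
    tracer.endSpan(dbSpan)

    logger.info('request_trace', tracer.getSummary())
    res.json(orders)
})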

Phase 4: Eliminate Variables

Isolate the problem by removing complexity systematically.

The binary search approach:

  1. Disable half the middleware and re-run the failing scenario
  2. If the problem persists, the culprit is in the half still enabled; if it disappears, it is in the half you disabled
  3. Split the suspect half again and repeat
  4. Stop when a single middleware or code path remains: that is your cause (see the sketch after this list)
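
A minimal sketch of making that bisection painless, assuming an Express app: register each middleware under a name and switch groups off with an environment variable between runs, so nothing needs to be commented out.

const express = require('express')
const app = express()

// Illustrative middleware list; a real app would include auth, compression, etc.
const middleware = [
    ['json-parser', express.json()],
    ['request-logger', (req, res, next) => { console.log(req.method, req.path); next() }],
    ['timing', (req, res, next) => { req.startTime = Date.now(); next() }]
]

const disabled = new Set((process.env.DISABLED_MIDDLEWARE || '').split(','))

for (const [name, handler] of middleware) {
    if (!disabled.has(name)) app.use(handler)
}

// DISABLED_MIDDLEWARE=request-logger,timing node server.js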

Common Performance Anti-Patterns

The Hidden N+1 Query

The most common performance killer, often hidden in seemingly innocent code.

// Looks innocent, kills performance
async function getUsersWithPosts() {
    const users = await User.findAll() // 1 query
    
    for (const user of users) {
        user.posts = await Post.findByUserId(user.id) // N queries
    }
    
    return users
}

// Optimized version
async function getUsersWithPosts() {
    return await User.findAll({
        include: [{ model: Post }] // 1 query with JOIN
    })
}

The Accidental Full Table Scan

When your WHERE clause doesn't match your indexes.

-- Full table scan (slow)
SELECT * FROM users WHERE LOWER(email) = 'user@example.com'

-- Index scan (fast) 
SELECT * FROM users WHERE email = 'user@example.com'
-- Requires: CREATE INDEX idx_users_email ON users(email)
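
If the case-insensitive lookup is genuinely required, an expression index is an option; a sketch in PostgreSQL syntax (idx_users_email_lower is an illustrative name):

-- Lets WHERE LOWER(email) = ... use an index scan instead of a full table scan
CREATE INDEX idx_users_email_lower ON users (LOWER(email))

SELECT * FROM users WHERE LOWER(email) = 'user@example.com'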

The Memory Leak Disguised as Caching

Caches that grow forever eventually become memory leaks.

// Memory leak waiting to happen
const cache = new Map()

function getCachedData(key) {
    if (!cache.has(key)) {
        cache.set(key, expensiveOperation(key))
    }
    return cache.get(key)
}

// Bounded cache with TTL
const LRU = require('lru-cache')
const cache = new LRU({
    max: 1000,
    ttl: 1000 * 60 * 10 // 10 minutes
})
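
With that in place, the getCachedData helper above works unchanged, since lru-cache exposes the same has/get/set methods; entries older than ten minutes or beyond the 1,000-item cap are evicted automatically.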

The Performance Debugging Toolkit

Essential Tools

  • Application Performance Monitoring: DataDog, New Relic, Sentry
  • Database Profiling: pg_stat_statements, MySQL slow query log
  • Memory Profiling: Node.js heap snapshots, Chrome DevTools
  • Load Testing: Artillery, k6, Apache Bench
  • Network Analysis: Wireshark, tcpdump

The Performance Debugging Checklist

  1. ✅ Establish baseline metrics
  2. ✅ Add comprehensive logging
  3. ✅ Profile under realistic load
  4. ✅ Check database query plans
  5. ✅ Monitor memory usage patterns
  6. ✅ Analyze network latency
  7. ✅ Test with production-like data

The Bottom Line

Performance debugging is detective work. The obvious suspects are usually innocent, and the real culprit is hiding in plain sight.

Build observability into your systems from day one. Instrument your code like you're building a flight recorder — when things go wrong, you'll need the data to understand what happened.

Remember: The best performance bug is the one you prevent by measuring and monitoring from the beginning. The second-best is the one you can reproduce consistently.

Dealing with mysterious performance issues? Let's debug it together — I love a good performance mystery.
