Monitoring That Actually Helps

December 2024 • 10 min read • DevOps Strategy

At 3 AM, you get an alert: "System Down." You scramble to your laptop, heart racing, only to find that everything looks fine. Users are happy, servers are running, but some obscure metric crossed a threshold nobody remembers setting.

Bad monitoring is worse than no monitoring. It creates alert fatigue, wastes engineering time, and fails when you actually need it. Here's how to build observability that catches real problems before your users do.

The Four Pillars of Useful Monitoring

1. User-Impact Metrics (The Only Ones That Matter)

Stop monitoring your servers. Start monitoring your user experience.

What users actually care about:

  • Page load time: How fast does your app feel?
  • Error rate: What percentage of requests fail?
  • Feature availability: Can users complete core workflows?
  • Business metrics: Are signups, purchases, or key actions working?

Real example: CPU was at 90%, memory was high, disk I/O was maxed. The monitoring system was screaming. But response times were still under 200ms and error rates were 0.01%. Users didn't care about our resource utilization — they cared that the app was fast and worked.
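
To make "feature availability" concrete, here's a minimal synthetic check sketch. It assumes Node 18+ for the global fetch, and the endpoint URL and check name are placeholders; feed the result into whatever metrics pipeline you already have.

// Minimal synthetic check: can a user still complete a core workflow?
// Assumes Node 18+ (global fetch); the URL below is a placeholder.
const CORE_WORKFLOW_URL = 'https://example.com/api/checkout/health'

async function probeCoreWorkflow() {
    const start = Date.now()
    try {
        const res = await fetch(CORE_WORKFLOW_URL)
        // Record what users actually feel: did it work, and how fast?
        console.log(JSON.stringify({
            check: 'core_workflow',
            ok: res.ok,
            status: res.status,
            duration_ms: Date.now() - start
        }))
    } catch (err) {
        console.log(JSON.stringify({ check: 'core_workflow', ok: false, error: err.message }))
    }
}

// Run once a minute, ideally from outside your own network
setInterval(probeCoreWorkflow, 60 * 1000)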

2. Leading Indicators (Problems Before They Happen)

Don't wait for things to break. Monitor the metrics that predict failures.

// Instead of alerting when the disk is already 95% full...
if (diskUsage > 0.95) {
    alert("DISK FULL!")
}

// ...alert when the disk will be full within 4 hours
const HOURS = 60 * 60 * 1000                            // milliseconds per hour
const growthRate = calculateDiskGrowthRate()            // bytes written per millisecond
const timeToFull = (diskSpace - diskUsed) / growthRate  // milliseconds until full
if (timeToFull < 4 * HOURS) {
    alert(`Disk will be full in ${(timeToFull / HOURS).toFixed(1)} hours`)
}

3. Context-Rich Alerts

Your alerts should tell a story, not just report a number.

Bad alert: "Response time: 2.3s"

Good alert: "API response time is 2.3s (normal: 150ms). Started 10 minutes ago. Affects login endpoint. 45% of users impacted. Runbook: link"
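
One way to keep alerts honest is to build them from a template that won't fire without context. A rough sketch, with field names that are purely illustrative rather than any particular tool's schema:

// Sketch: require context before an alert can be constructed at all.
// Field names are illustrative, not a specific alerting tool's schema.
function buildAlert({ metric, current, baseline, startedAt, scope, impactPct, runbookUrl }) {
    return {
        summary: `${metric} is ${current} (normal: ${baseline})`,
        startedAt,                                  // when the anomaly began
        scope,                                      // e.g. "login endpoint"
        impact: `${impactPct}% of users impacted`,
        runbook: runbookUrl                         // the clear next step
    }
}

buildAlert({
    metric: 'API response time',
    current: '2.3s',
    baseline: '150ms',
    startedAt: '10 minutes ago',
    scope: 'login endpoint',
    impactPct: 45,
    runbookUrl: 'https://wiki.example.com/runbooks/api-latency' // placeholder
})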

4. Actionable Intelligence

Every alert should have a clear next step. If there's nothing you can do about it at 3 AM, don't wake people up.

The Anti-Patterns That Kill Teams

Metric Explosion

Tracking everything is tracking nothing. I've seen dashboards with 200+ metrics where finding useful information was impossible.

The 5-Dashboard Rule: If you need more than 5 dashboards to understand your system health, you're tracking too much noise.

Alert Fatigue

When everything is urgent, nothing is urgent. Teams start ignoring alerts, which defeats the entire purpose.

The phone test: Would you want to be woken up at 3 AM for this alert? If not, don't make it an alert — make it a dashboard metric.

Vanity Metrics

Tracking impressive-sounding metrics that don't correlate with user experience or business outcomes.

Vanity metrics to avoid:

  • Server uptime (users care about service uptime)
  • Total requests per second (without context about normal levels)
  • Memory usage percentage (without impact on performance)
  • Database connection count (without relating to user experience)

Building a Monitoring Strategy That Works

Start with the Golden Signals

Google's Site Reliability Engineering team identified four metrics that matter most:

  • Latency: How long requests take
  • Traffic: How much demand is on your system
  • Errors: The rate of failed requests
  • Saturation: How close to capacity you are

Get these four right before adding anything else.
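
If you run Express, a rough sketch of capturing the first three signals per request might look like the following; the metrics client is assumed (any StatsD-style library will do), and saturation is usually better scraped from the host than measured per request.

// Latency, traffic, and errors per request; saturation (CPU, memory,
// queue depth) usually comes from the infrastructure layer instead.
// `metrics` is an assumed StatsD-style client.
app.use((req, res, next) => {
    const start = Date.now()
    res.on('finish', () => {
        metrics.increment('http.requests')                        // traffic
        metrics.histogram('http.latency_ms', Date.now() - start)  // latency
        if (res.statusCode >= 500) {
            metrics.increment('http.errors')                      // errors
        }
    })
    next()
})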

The Service Level Objective (SLO) Approach

Define what "working" means in measurable terms.

// Example SLOs for a web application
const SLOs = {
    availability: {
        target: 99.9, // 99.9% uptime
        measurement: "successful HTTP responses / total HTTP responses"
    },
    latency: {
        target: 95, // 95% of requests complete in under 200ms
        measurement: "95th percentile response time < 200ms"
    },
    freshness: {
        target: 99, // 99% of data updates within 5 minutes
        measurement: "data updates processed within 5 minutes"
    }
}
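
Once the SLOs are written down, checking them is just arithmetic over your existing counters. A minimal sketch, assuming you can query successful and total response counts from your metrics store:

// Sketch: turn raw counters into an SLI and compare it to the SLO target.
// The counter values are placeholders for whatever your metrics store exposes.
function checkAvailabilitySLO(successfulResponses, totalResponses, targetPct = 99.9) {
    const sliPct = (successfulResponses / totalResponses) * 100
    return { sli: sliPct.toFixed(3), target: targetPct, meetingSLO: sliPct >= targetPct }
}

checkAvailabilitySLO(999120, 999600) // → { sli: '99.952', target: 99.9, meetingSLO: true }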

Error Budgets: Making Reliability a Feature

Instead of aiming for 100% uptime (impossible), budget for acceptable failure rates.

99.9% availability = 43 minutes of downtime per month

Use your error budget to balance new features vs. stability work. If you're burning through your error budget too quickly, prioritize reliability. If you're not using it, you can afford to move faster.
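
The arithmetic behind that trade-off fits in a few lines; a minimal sketch, assuming a 30-day month:

// Error-budget arithmetic: a 99.9% monthly target leaves roughly 43 minutes
// of downtime to "spend" on incidents and risky changes.
function errorBudgetMinutes(targetPct, daysInMonth = 30) {
    const totalMinutes = daysInMonth * 24 * 60        // 43,200 for a 30-day month
    return totalMinutes * (1 - targetPct / 100)
}

function budgetRemaining(targetPct, downtimeMinutesSoFar) {
    return errorBudgetMinutes(targetPct) - downtimeMinutesSoFar
}

errorBudgetMinutes(99.9)    // ≈ 43.2 minutes per month
budgetRemaining(99.9, 30)   // ≈ 13.2 minutes left, so favor reliability work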

Tools and Implementation

The Three-Layer Monitoring Stack

Layer 1: Infrastructure Monitoring

  • CPU, memory, disk, network basics
  • Tools: Prometheus + Grafana, DataDog, New Relic
  • Focus: Resource saturation and capacity planning

Layer 2: Application Performance Monitoring (APM)

  • Request traces, database query performance, error tracking
  • Tools: Sentry, DataDog APM, New Relic, Honeycomb
  • Focus: User experience and performance bottlenecks

Layer 3: Business Metrics

  • Signups, purchases, feature usage, customer success metrics
  • Tools: Custom dashboards, Mixpanel, Amplitude
  • Focus: Business impact and user behavior

Custom Metrics That Matter

Build domain-specific monitoring for your unique business logic.

// Example: E-commerce checkout monitoring
// (`metrics`, `alerting`, and `processOrder` are placeholders for your own
// instrumentation clients and business logic)
app.post('/checkout', async (req, res) => {
    const startTime = Date.now()
    
    try {
        const order = await processOrder(req.body)
        
        // Success metrics
        metrics.increment('checkout.success')
        metrics.histogram('checkout.duration', Date.now() - startTime)
        metrics.increment('revenue', order.total)
        
        res.json({ success: true, orderId: order.id })
    } catch (error) {
        // Failure metrics
        metrics.increment('checkout.failure')
        metrics.increment(`checkout.failure.${error.type}`)
        
        // Alert on payment processing failures specifically
        if (error.type === 'payment_processing') {
            alerting.critical('Payment processing failure', {
                error: error.message,
                userId: req.user.id,
                amount: req.body.total
            })
        }
        
        res.status(500).json({ error: 'Checkout failed' })
    }
})

Alert Design Principles

The Three Types of Alerts

1. Critical Alerts (Page someone immediately)

  • User-facing services are down
  • Data loss in progress
  • Security breach detected

2. Warning Alerts (Handle during business hours)

  • Services degraded but functional
  • Capacity approaching limits
  • Non-critical features failing

3. Info Alerts (Log for analysis)

  • Deployment notifications
  • Capacity changes
  • Unusual but not harmful patterns
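
These three tiers map naturally onto a routing rule, so that only critical alerts page anyone. A sketch, where pager, ticketQueue, logger, and onCall are placeholders for your own tooling:

// Route alerts by severity so only critical ones wake someone up.
// `pager`, `ticketQueue`, `logger`, and `onCall` are placeholders.
function routeAlert(alert) {
    switch (alert.severity) {
        case 'critical':
            pager.page(onCall, alert)            // page someone immediately
            break
        case 'warning':
            ticketQueue.create(alert)            // handle during business hours
            break
        case 'info':
        default:
            logger.info('alert_recorded', alert) // log for analysis
    }
}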

Alert Fatigue Prevention

Time-based suppression: Don't alert on the same issue multiple times within a short window.

Dependency-aware alerting: If the database is down, don't also alert about all the services that depend on it.

Escalation policies: Start with the person on call, escalate to the team lead after 15 minutes, escalate to management after 45 minutes.
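
Time-based suppression in particular is cheap to sketch: remember when each alert key last fired and stay quiet inside the window. This in-memory version is just to show the idea; in practice the state lives in your alerting tool.

// Sketch of time-based suppression with an in-memory dedup window.
const lastFired = new Map()
const SUPPRESSION_WINDOW_MS = 15 * 60 * 1000 // 15 minutes

function shouldFire(alertKey, now = Date.now()) {
    const previous = lastFired.get(alertKey)
    if (previous && now - previous < SUPPRESSION_WINDOW_MS) {
        return false // same issue, same window: stay quiet
    }
    lastFired.set(alertKey, now)
    return true
}

if (shouldFire('disk-full:db-01')) {
    // send the alert
}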

Observability for Debugging

Distributed Tracing

When requests span multiple services, you need to trace the entire journey.

// Add correlation IDs to track requests across services
app.use((req, res, next) => {
    req.correlationId = req.headers['x-correlation-id'] || generateId() // e.g. crypto.randomUUID()
    res.setHeader('x-correlation-id', req.correlationId)
    next()
})

// Log with correlation ID for tracing
logger.info('Processing payment', {
    correlationId: req.correlationId,
    userId: req.user.id,
    amount: req.body.amount
})
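
The other half of tracing is forwarding that ID on every outgoing call so downstream services log the same value. A sketch, assuming Node 18+ fetch and a placeholder URL:

// Forward the correlation ID to downstream services on outgoing calls.
// Assumes Node 18+ (global fetch); the URL is a placeholder.
async function callPaymentService(req, payload) {
    return fetch('https://payments.internal.example.com/charge', {
        method: 'POST',
        headers: {
            'content-type': 'application/json',
            'x-correlation-id': req.correlationId
        },
        body: JSON.stringify(payload)
    })
}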

Structured Logging

Make your logs searchable and analyzable.

// Bad: String concatenation logs
console.log(`User ${userId} purchased ${productName} for ${amount}`)

// Good: Structured logs
logger.info('purchase_completed', {
    event_type: 'purchase',
    user_id: userId,
    product_id: productId,
    product_name: productName,
    amount: amount,
    currency: 'USD',
    timestamp: new Date().toISOString()
})

The Monitoring Maturity Model

Level 1: Basic Health Checks

  • Is the server up?
  • Are core services responding?
  • Basic error rate tracking

Level 2: Performance Monitoring

  • Response time percentiles
  • Resource utilization trends
  • Database performance tracking

Level 3: User Experience Monitoring

  • Real user monitoring (RUM)
  • Business metrics integration
  • Customer journey tracking

Level 4: Predictive Monitoring

  • Anomaly detection
  • Capacity forecasting
  • Proactive issue prevention

Common Monitoring Mistakes

Monitoring the Wrong Layer

Monitoring CPU usage when you should be monitoring user experience. Infrastructure metrics are useful for capacity planning, but user-impact metrics should drive alerting.

Tool Sprawl

Using 12 different monitoring tools that don't talk to each other. Pick one or two that integrate well and give you a unified view.

No Runbooks

Alerts without runbooks are just stress generators. Every alert should have a runbook that explains how to investigate and resolve the issue.

The Bottom Line

Good monitoring is invisible when everything is working and invaluable when things break. It should help you understand your system's behavior, catch problems early, and debug issues quickly.

Start simple: monitor what users care about. Add complexity only when you feel the pain of not having more detailed observability.

Remember: The goal isn't to have pretty dashboards. It's to keep your users happy and your systems reliable with minimal human intervention.

Struggling with alert fatigue or monitoring strategy? Let's chat about building observability that actually helps instead of hurts.
