
The Developer's Guide to Web Application Monitoring and Alerting

Master web application monitoring, performance monitoring, uptime monitoring, and error tracking to ensure application health in production environments.

ScanlyApp Team

QA Testing and Automation Experts

12 min read


It's 2:47 AM. Your phone is blowing up, but not with monitoring alerts. Users can't access your application. The dashboard shows everything green. Your monitoring tools say "All Systems Operational." Yet Twitter tells a different story: hundreds of users reporting errors.

This disconnect between monitoring systems and reality destroys businesses. You're not truly monitoring your application if you learn about outages from angry users on social media.

Web application monitoring isn't just about collecting metrics—it's about gaining actionable insights that let you fix problems before users notice them. Proper performance monitoring, uptime monitoring, and error tracking transform reactive firefighting into proactive application health management.

In this comprehensive guide, you'll learn how to build a monitoring and alerting strategy that actually works—catching issues early, providing context for debugging, and keeping your team informed without drowning them in noise.

Why Most Monitoring Fails

Before building better monitoring, understand why traditional approaches fall short:

The False Positive Problem

Scenario: Alerts fire constantly for non-issues. Team learns to ignore notifications. When a real outage occurs, alerts blend into noise.

| Alert Noise Level | Team Response Time | Outcome |
| --- | --- | --- |
| <5 alerts/week | 2-5 minutes | Issues caught early |
| 10-20 alerts/week | 15-30 minutes | Some issues missed |
| 50+ alerts/week | Hours, or ignored entirely | Critical outages unnoticed |

The principle: Every alert must be actionable. If it doesn't require immediate action, it's not an alert—it's a metric.

The Vanity Metric Trap

Tracking impressive-looking numbers that don't inform decisions:

Vanity Metrics (look good, mean little):

  • Total page views
  • Raw server uptime percentage
  • Average response time (hides outliers)
  • Total error count (without context)

Actionable Metrics (drive improvement):

  • 95th/99th percentile response times
  • Error rate as percentage of requests
  • Conversion funnel drop-off points
  • User-impacting incidents
  • Time to resolution
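Percentiles deserve a concrete example, since averages hide exactly the outliers you care about. A minimal sketch of computing them from raw samples (the `percentile` helper is illustrative, using the nearest-rank method):

```javascript
// Compute a percentile from raw samples (nearest-rank method)
function percentile(samples, p) {
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[Math.max(rank - 1, 0)];
}

const responseTimesMs = [110, 120, 125, 130, 140, 150, 400, 900, 1500, 2100];
console.log(percentile(responseTimesMs, 50)); // 140: looks healthy
console.log(percentile(responseTimesMs, 95)); // 2100: the tail the average hides
```

The average of those samples is 567.5ms, which describes no real user's experience; the p95 shows what your slowest users actually see.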

The Alert Fatigue Cycle

More metrics added
      ↓
More alerts configured
      ↓
Too many alerts fire
      ↓
Team ignores alerts
      ↓
Critical issues missed
      ↓
"We need better monitoring!"
      ↓
(Cycle repeats)

Breaking the cycle: Start with fewer, high-quality alerts. Add selectively based on actual incidents.

The Four Pillars of Effective Monitoring

Comprehensive web application monitoring rests on four foundations:

1. Application Performance Monitoring (APM)

Track how your application performs from the user's perspective:

Key Metrics:

  • Response Time: How long requests take (p50, p95, p99)
  • Throughput: Requests per second
  • Error Rate: Failed requests / total requests
  • Apdex Score: User satisfaction metric

// Instrument your application with APM
const apm = require('elastic-apm-node').start({
  serviceName: 'my-app',
  serverUrl: 'https://apm.example.com',
});

app.use((req, res, next) => {
  const transaction = apm.startTransaction(req.path, 'request');

  res.on('finish', () => {
    transaction.result = res.statusCode >= 400 ? 'error' : 'success';
    transaction.end();
  });

  next();
});

// Tracking custom performance metrics
function processOrder(order) {
  // startSpan returns null when there is no active transaction
  const span = apm.startSpan('Process Order');

  try {
    // Business logic
    validateOrder(order);
    chargePayment(order);
    createShipment(order);
  } catch (error) {
    apm.captureError(error);
    throw error;
  } finally {
    // End the span exactly once, on both success and failure
    if (span) span.end();
  }
}
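The Apdex score from the metrics list above is simple to derive from raw response times. A minimal sketch, assuming a 500ms target threshold (the standard Apdex formula counts responses up to 4x the target as "tolerating"):

```javascript
// Apdex = (satisfied + tolerating / 2) / total samples
// Satisfied: t <= T. Tolerating: T < t <= 4T. Everything else: frustrated.
function apdex(responseTimesMs, targetMs = 500) {
  const satisfied = responseTimesMs.filter((t) => t <= targetMs).length;
  const tolerating = responseTimesMs.filter(
    (t) => t > targetMs && t <= 4 * targetMs
  ).length;
  return (satisfied + tolerating / 2) / responseTimesMs.length;
}

// 7 satisfied, 2 tolerating, 1 frustrated => (7 + 1) / 10 = 0.8
apdex([100, 200, 250, 300, 350, 400, 450, 800, 1500, 3000]); // 0.8
```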

2. Infrastructure Monitoring

Track the health of servers, databases, and infrastructure:

System Metrics:

  • CPU utilization
  • Memory usage
  • Disk I/O
  • Network throughput

Database Metrics:

  • Query performance
  • Connection pool usage
  • Slow queries
  • Replication lag

Container/Orchestration Metrics:

  • Pod/container health
  • Resource limits
  • Restart frequency
  • Deployment status

# Prometheus monitoring configuration
scrape_configs:
  - job_name: 'application'
    static_configs:
      - targets: ['app:3000']

  - job_name: 'postgres'
    static_configs:
      - targets: ['postgres-exporter:9187']

  - job_name: 'redis'
    static_configs:
      - targets: ['redis-exporter:9121']

3. Error Tracking and Logging

Capture and contextualize application errors:

const Sentry = require('@sentry/node');

Sentry.init({
  dsn: process.env.SENTRY_DSN,
  environment: process.env.NODE_ENV,
  tracesSampleRate: 0.1, // Sample 10% of transactions for performance

  beforeSend(event, hint) {
    // Add custom context
    event.tags = {
      ...event.tags,
      version: process.env.APP_VERSION,
      deployment: process.env.DEPLOY_ID,
    };
    return event;
  },
});

// Structured logging with context
const winston = require('winston');

const logger = winston.createLogger({
  level: 'info',
  format: winston.format.json(),
  defaultMeta: {
    service: 'my-app',
    version: process.env.APP_VERSION,
  },
  transports: [
    new winston.transports.File({ filename: 'error.log', level: 'error' }),
    new winston.transports.File({ filename: 'combined.log' }),
  ],
});

// Log with context
logger.error('Payment processing failed', {
  orderId: order.id,
  userId: user.id,
  amount: order.total,
  errorCode: error.code,
  errorMessage: error.message,
});

4. User Experience Monitoring (Real User Monitoring - RUM)

Track actual user experiences in production:

// Client-side RUM implementation
import { onCLS, onINP, onLCP, onFCP, onTTFB } from 'web-vitals';

function sendToAnalytics({ name, value, id, ...extra }) {
  const body = JSON.stringify({
    metric: name,
    value: Math.round(value),
    id,
    page: window.location.pathname,
    timestamp: Date.now(),
    ...extra, // carry through custom fields (e.g. `action` below)
  });

  // Use sendBeacon for reliability
  navigator.sendBeacon('/api/metrics', body);
}

// Track Core Web Vitals
onCLS(sendToAnalytics); // Cumulative Layout Shift
onINP(sendToAnalytics); // Interaction to Next Paint (replaced FID as a Core Web Vital)
onLCP(sendToAnalytics); // Largest Contentful Paint
onFCP(sendToAnalytics); // First Contentful Paint
onTTFB(sendToAnalytics); // Time to First Byte

// Track custom user actions
function trackUserAction(action, metadata) {
  sendToAnalytics({
    name: 'user_action',
    value: Date.now(),
    id: generateId(),
    action,
    ...metadata,
  });
}

// Usage
document.getElementById('checkoutButton').addEventListener('click', () => {
  trackUserAction('checkout_initiated', {
    cartValue: calculateCartTotal(),
    itemCount: getCartItemCount(),
  });
});

Building Your Monitoring Stack

Modern Monitoring Architecture

┌──────────────────────────────────────────────┐
│  Application Instrumentation                 │
│  • APM SDK (OpenTelemetry, Datadog, etc.)   │
│  • Error tracking (Sentry, Rollbar)         │
│  • Custom metrics                            │
└──────────────┬───────────────────────────────┘
               ↓
┌──────────────────────────────────────────────┐
│  Metrics Collection & Storage                │
│  • Time-series database (Prometheus, InfluxDB)│
│  • Log aggregation (ELK, Loki)              │
│  • Traces (Jaeger, Tempo)                    │
└──────────────┬───────────────────────────────┘
               ↓
┌──────────────────────────────────────────────┐
│  Visualization & Dashboards                  │
│  • Grafana                                   │
│  • Kibana                                    │
│  • Custom dashboards                         │
└──────────────┬───────────────────────────────┘
               ↓
┌──────────────────────────────────────────────┐
│  Alerting & Notification                     │
│  • AlertManager                              │
│  • PagerDuty                                 │
│  • Slack/Teams/Email                         │
└──────────────────────────────────────────────┘

Choosing Your Monitoring Tools

| Tool Category | Open Source Option | Commercial Option | Best For |
| --- | --- | --- | --- |
| APM | OpenTelemetry | Datadog, New Relic | Performance tracking |
| Metrics | Prometheus | Datadog, SignalFx | Infrastructure monitoring |
| Logs | ELK Stack (Elasticsearch, Logstash, Kibana) | Splunk, Sumo Logic | Log analysis |
| Error Tracking | Sentry (self-hosted) | Sentry, Rollbar, Bugsnag | Exception monitoring |
| Uptime | Blackbox Exporter | Pingdom, UptimeRobot | Availability checks |
| RUM | Plausible Analytics | Google Analytics, Amplitude | User behavior |
| Synthetic | Playwright/Puppeteer | ScanlyApp, Datadog Synthetics | Proactive testing |
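For the uptime category, a minimal synthetic check needs nothing more than an HTTP request with a timeout. A sketch using Node 18+'s built-in `fetch` (the health-check URL is a placeholder):

```javascript
// Minimal synthetic uptime check: request with a timeout,
// record status and latency for alerting on failures.
async function checkUptime(url, timeoutMs = 5000) {
  const start = Date.now();
  try {
    const res = await fetch(url, {
      signal: AbortSignal.timeout(timeoutMs),
    });
    return { up: res.ok, statusCode: res.status, latencyMs: Date.now() - start };
  } catch (err) {
    // Timeout, DNS failure, connection refused, etc.
    return { up: false, error: err.message, latencyMs: Date.now() - start };
  }
}

// Run from several regions on a schedule; alert on consecutive failures
// rather than a single blip to avoid noise.
// checkUptime('https://example.com/health').then(console.log);
```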

Creating Effective Alerts

The Alert Design Framework

Every alert should answer:

  1. What is happening? (specific, actionable description)
  2. Why does it matter? (business/user impact)
  3. What should I do? (runbook link or next steps)

# Well-designed alert configuration
- alert: HighErrorRate
  expr: |
    (
      sum(rate(http_requests_total{status=~"5.."}[5m]))
      /
      sum(rate(http_requests_total[5m]))
    ) > 0.01
  for: 5m
  labels:
    severity: critical
    team: backend
  annotations:
    summary: 'High error rate detected: {{ $value | humanizePercentage }}'
    description: |
      Error rate is {{ $value | humanizePercentage }} (threshold: 1%)

      **Impact**: Users experiencing failures
      **Affected service**: {{ $labels.service }}
      **Dashboard**: https://grafana.example.com/d/errors
      **Runbook**: https://wiki.example.com/runbooks/high-error-rate

    # Provide context for debugging
    helpful_queries: |
      Recent errors: `kubectl logs -l app={{ $labels.service }} --since=10m | grep ERROR`
      Error breakdown: Check Sentry dashboard

Alert Severity Levels

| Severity | Response Time | Notification | Examples |
| --- | --- | --- | --- |
| Critical | Immediate (page on-call) | Phone call, SMS, Slack | Site down, payment processing broken, data loss |
| High | 15 minutes | Slack, email | Elevated error rate, performance degradation, dependency failure |
| Medium | 1 hour | Slack, email (business hours) | Increased latency, elevated CPU, disk space warning |
| Low | 1 day | Email, ticket | Certificate expiring soon, deprecated feature usage |
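The severity table maps naturally onto a routing function in your alerting layer. A minimal sketch (channel names and the policy object are illustrative, not from any specific tool):

```javascript
// Route an alert to channels based on the severity table above.
// Response windows are in minutes; channel names are illustrative.
const SEVERITY_POLICY = {
  critical: { responseMinutes: 0, channels: ['phone', 'sms', 'slack'] },
  high: { responseMinutes: 15, channels: ['slack', 'email'] },
  medium: { responseMinutes: 60, channels: ['slack', 'email'], businessHoursOnly: true },
  low: { responseMinutes: 24 * 60, channels: ['email', 'ticket'] },
};

function routeAlert(alert) {
  const policy = SEVERITY_POLICY[alert.severity];
  if (!policy) throw new Error(`Unknown severity: ${alert.severity}`);
  return { ...alert, ...policy };
}

routeAlert({ title: 'Site down', severity: 'critical' });
// channels: phone, sms, slack; immediate response expected
```

Keeping the policy in one data structure makes it easy to audit and change without touching routing logic.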

Alert Thresholds

Setting appropriate thresholds prevents noise:

Baseline Establishment:

// Collect baseline metrics over 2-4 weeks
const baseline = {
  errorRate: {
    p50: 0.001, // 0.1%
    p95: 0.005, // 0.5%
    p99: 0.01, // 1%
  },
  responseTime: {
    p50: 120, // 120ms
    p95: 450, // 450ms
    p99: 1200, // 1.2s
  },
};

// Set alerts at p99 + margin
const alertThresholds = {
  errorRate: baseline.errorRate.p99 * 1.5, // 1.5%
  responseTime: baseline.responseTime.p99 * 1.3, // 1.56s
};

Dynamic Thresholds:

# Alert on anomalies rather than fixed thresholds
- alert: AnomalousResponseTime
  expr: |
    (
      rate(http_request_duration_seconds_sum[5m])
      /
      rate(http_request_duration_seconds_count[5m])
    ) > (
      avg_over_time(
        rate(http_request_duration_seconds_sum[5m])[1h:5m]
      ) * 1.5  # 50% above 1-hour average
    )
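The same anomaly-based idea can be sketched in application code: compare the current value against a rolling baseline and flag large deviations. A minimal sketch (the 1.5x multiplier mirrors the Prometheus rule above; the window contents are illustrative):

```javascript
// Flag a value as anomalous when it exceeds the rolling average
// of recent samples by a given multiplier (1.5x, as in the rule above).
function isAnomalous(history, current, multiplier = 1.5) {
  if (history.length === 0) return false; // no baseline yet
  const avg = history.reduce((sum, v) => sum + v, 0) / history.length;
  return current > avg * multiplier;
}

const lastHourLatenciesMs = [120, 130, 125, 118, 122, 128];
isAnomalous(lastHourLatenciesMs, 135); // false: within normal range
isAnomalous(lastHourLatenciesMs, 250); // true: well above 1.5x the baseline
```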

Building Dashboards That Matter

The Dashboard Hierarchy

1. Executive Dashboard (for leadership):

  • Key business metrics
  • Overall system health
  • Recent incidents
  • Week-over-week trends

2. Service Health Dashboard (for on-call engineers):

  • Real-time error rates
  • Response time percentiles
  • Request throughput
  • Dependency status

3. Deep Dive Dashboards (for debugging):

  • Detailed traces
  • Log correlation
  • Resource utilization
  • Database performance

Dashboard Design Principles

Do This:

  • Show trends, not just current values
  • Use percentiles (p95, p99), not averages
  • Include comparison periods (day/week/month ago)
  • Group related metrics together
  • Add links to relevant log queries
  • Display SLO progress

Avoid This:

  • Too many metrics on one screen
  • Metrics without context
  • Unclear axis labels
  • Mixing drastically different scales
  • Showing every data point (overwhelming)

Sample Dashboard Configuration

{
  "dashboard": {
    "title": "Application Health",
    "panels": [
      {
        "title": "Request Rate",
        "type": "graph",
        "queries": ["sum(rate(http_requests_total[5m])) by (service)"],
        "yAxisLabel": "requests/second"
      },
      {
        "title": "Error Rate",
        "type": "graph",
        "queries": ["sum(rate(http_requests_total{status=~\"5..\"}[5m])) / sum(rate(http_requests_total[5m]))"],
        "yAxisLabel": "error rate",
        "thresholds": [{ "value": 0.01, "color": "red" }]
      },
      {
        "title": "Response Time (95th percentile)",
        "type": "graph",
        "queries": ["histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service))"],
        "yAxisLabel": "seconds"
      }
    ]
  }
}

Incident Response Workflow

The Incident Lifecycle

1. Detection → 2. Acknowledgment → 3. Investigation → 4. Mitigation → 5. Resolution → 6. Post-Mortem

Automated Incident Management

// Incident detection and escalation
// (incidentTracker, slack, statusPage, sleep, and generateId are assumed helpers)
class IncidentManager {
  async detectIncident(alert) {
    // Check if this is a new incident or continuation
    const existingIncident = await this.findRelatedIncident(alert);

    if (existingIncident) {
      await this.updateIncident(existingIncident, alert);
    } else {
      await this.createIncident(alert);
    }
  }

  async createIncident(alert) {
    const incident = {
      id: generateId(),
      severity: alert.severity,
      title: alert.summary,
      description: alert.description,
      startTime: Date.now(),
      status: 'open',
      affectedServices: alert.services,
      oncallEngineer: await this.getOncallEngineer(alert.team),
    };

    // Create incident in tracking system
    await incidentTracker.create(incident);

    // Notify on-call engineer
    await this.notifyOncall(incident);

    // Create communication channel
    await slack.createChannel(`incident-${incident.id}`);

    // Post to status page if user-impacting
    if (this.isUserImpacting(incident)) {
      await statusPage.postIncident(incident);
    }

    return incident;
  }

  async notifyOncall(incident) {
    const engineer = incident.oncallEngineer;

    // Escalation path
    const notifications = [
      { method: 'slack', delay: 0 },
      { method: 'sms', delay: 2 * 60 * 1000 }, // 2 min
      { method: 'phone', delay: 5 * 60 * 1000 }, // 5 min
    ];

    for (const notification of notifications) {
      await sleep(notification.delay);

      // Check if acknowledged
      const ack = await this.checkAcknowledgment(incident.id);
      if (ack) break;

      await this.sendNotification(engineer, notification.method, incident);
    }
  }
}

SLOs and Error Budgets

Defining Service Level Objectives

SLO Structure:

  • SLI (Service Level Indicator): Metric being measured
  • SLO (Service Level Objective): Target for that metric
  • SLA (Service Level Agreement): Customer promise with consequences

# Example SLOs
slos:
  - name: api_availability
    description: 'API returns successful responses'
    sli:
      metric: http_requests_total
      good_filter: 'status!~"5.."'
      total_filter: ''
    slo_target: 0.999 # 99.9% success rate
    window: 30d

  - name: api_latency
    description: 'API responds within acceptable time'
    sli:
      metric: http_request_duration_seconds
      threshold: 0.5 # 500ms
    slo_target: 0.95 # 95% under threshold
    window: 30d

Error Budget Calculation

function calculateErrorBudget(slo) {
  const targetUptime = slo.target; // e.g., 0.999 (99.9%)
  const windowDays = 30;

  const totalMinutes = windowDays * 24 * 60;
  const allowedDowntimeMinutes = totalMinutes * (1 - targetUptime);

  // Query actual uptime
  const actualUptime = getActualUptime(windowDays);
  const actualDowntimeMinutes = totalMinutes * (1 - actualUptime);

  const remainingBudget = allowedDowntimeMinutes - actualDowntimeMinutes;
  const budgetPercentage = (remainingBudget / allowedDowntimeMinutes) * 100;

  return {
    allowedDowntime: allowedDowntimeMinutes,
    actualDowntime: actualDowntimeMinutes,
    remaining: remainingBudget,
    percentage: budgetPercentage,
    status: budgetPercentage > 0 ? 'healthy' : 'exceeded',
  };
}

// Example for 99.9% SLO over 30 days
// Allowed downtime: 43.2 minutes/month
// If actual downtime: 20 minutes
// Remaining budget: 23.2 minutes (53.7%)

Using Error Budgets for Decision Making

Policy example:

  • >50% budget remaining: Ship features freely
  • 25-50% budget: Increase testing, review changes carefully
  • 10-25% budget: Feature freeze, focus on reliability
  • <10% budget: Emergency mode, halt all feature work
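This policy can be enforced as a simple gate in a deploy pipeline rather than left to judgment calls. A minimal sketch, assuming a remaining-budget percentage like the one computed earlier (the policy names are illustrative):

```javascript
// Map remaining error budget to a release policy (thresholds from the list above)
function releasePolicy(budgetPercentage) {
  if (budgetPercentage > 50) return 'ship-freely';
  if (budgetPercentage > 25) return 'extra-review';
  if (budgetPercentage > 10) return 'feature-freeze';
  return 'emergency-reliability-only';
}

releasePolicy(53.7); // 'ship-freely': matches the earlier 99.9% SLO example
releasePolicy(8); // 'emergency-reliability-only': halt feature work
```

A CI step that calls this function and fails non-reliability deploys below the freeze threshold turns the policy from a document into an enforced contract.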

Connecting Monitoring to Quality Culture

Web application monitoring isn't separate from quality assurance—it's continuous validation that your application works in production. Implementing continuous testing in CI/CD catches issues before deployment, while monitoring validates behavior after.

Understanding common website bugs helps you know what to monitor. And combining synthetic monitoring with real user monitoring provides complete visibility.

Build Monitoring That Works

You now understand how to implement comprehensive web application monitoring that provides actionable insights, not just data. You know how to track application health, set up effective alerts, build meaningful dashboards, and respond to incidents systematically.

The difference between reactive firefighting and proactive reliability engineering is monitoring done right.

Comprehensive Application Monitoring with ScanlyApp

ScanlyApp augments your monitoring stack with synthetic uptime monitoring and comprehensive application testing:

24/7 Synthetic Monitoring – Continuous testing of critical user journeys
Multi-Region Uptime Checks – Global availability validation
Performance Tracking – Response time monitoring and regression detection
Error Detection – JavaScript console monitoring and exception tracking
Visual Regression Monitoring – Catch UI breaks automatically
Integrated Alerting – Smart notifications when issues arise

Start Your Free Trial →

Add comprehensive synthetic monitoring to your observability stack in 2 minutes.


Questions about building a monitoring strategy for your specific architecture? Talk to our observability experts—we're here to help you gain true visibility into your application.