# The Developer's Guide to Web Application Monitoring and Alerting
It's 2:47 AM. Users can't access your application, but your phone is silent. The dashboard shows everything green, and your monitoring tools say "All Systems Operational." Twitter tells a different story: hundreds of users reporting errors.
This disconnect between monitoring systems and reality destroys businesses. You're not truly monitoring your application if you learn about outages from angry users on social media.
Web application monitoring isn't just about collecting metrics—it's about gaining actionable insights that let you fix problems before users notice them. Proper performance monitoring, uptime monitoring, and error tracking transform reactive firefighting into proactive application health management.
In this comprehensive guide, you'll learn how to build a monitoring and alerting strategy that actually works—catching issues early, providing context for debugging, and keeping your team informed without drowning them in noise.
## Why Most Monitoring Fails
Before building better monitoring, understand why traditional approaches fall short:
### The False Positive Problem
Scenario: Alerts fire constantly for non-issues. Team learns to ignore notifications. When a real outage occurs, alerts blend into noise.
| Alert Noise Level | Team Response Time | Incident Severity |
|---|---|---|
| <5 alerts/week | 2-5 minutes | Issues caught early |
| 10-20 alerts/week | 15-30 minutes | Some issues missed |
| 50+ alerts/week | Hours or ignored | Critical outages unnoticed |
The principle: Every alert must be actionable. If it doesn't require immediate action, it's not an alert—it's a metric.
### The Vanity Metric Trap
Tracking impressive-looking numbers that don't inform decisions:
❌ Vanity Metrics (look good, mean little):
- Total page views
- Raw server uptime percentage
- Average response time (hides outliers)
- Total error count (without context)
✅ Actionable Metrics (drive improvement):
- 95th/99th percentile response times
- Error rate as percentage of requests
- Conversion funnel drop-off points
- User-impacting incidents
- Time to resolution
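These actionable metrics are straightforward to compute from raw request data. A minimal sketch (the `requests` record shape here is hypothetical, not from any particular APM SDK):

```javascript
// Nearest-rank percentile on a pre-sorted array of numbers
function percentile(sortedValues, p) {
  const idx = Math.ceil((p / 100) * sortedValues.length) - 1;
  return sortedValues[Math.max(0, idx)];
}

// Derive p95/p99 latency and error rate from raw request samples
function actionableMetrics(requests) {
  const durations = requests.map((r) => r.durationMs).sort((a, b) => a - b);
  const errors = requests.filter((r) => r.statusCode >= 500).length;
  return {
    p95: percentile(durations, 95),
    p99: percentile(durations, 99),
    errorRate: errors / requests.length, // failures as a fraction of all requests
  };
}
```

Note how a handful of slow requests moves p99 dramatically while barely touching the average, which is exactly why averages hide outliers.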
### The Alert Fatigue Cycle
```
More metrics added
        ↓
More alerts configured
        ↓
Too many alerts fire
        ↓
Team ignores alerts
        ↓
Critical issues missed
        ↓
"We need better monitoring!"
        ↓
(Cycle repeats)
```
Breaking the cycle: Start with fewer, high-quality alerts. Add selectively based on actual incidents.
## The Four Pillars of Effective Monitoring
Comprehensive web application monitoring rests on four foundations:
### 1. Application Performance Monitoring (APM)
Track how your application performs from the user's perspective:
Key Metrics:
- Response Time: How long requests take (p50, p95, p99)
- Throughput: Requests per second
- Error Rate: Failed requests / total requests
- Apdex Score: User satisfaction metric
```javascript
// Instrument your application with APM
const apm = require('elastic-apm-node').start({
  serviceName: 'my-app',
  serverUrl: 'https://apm.example.com',
});

app.use((req, res, next) => {
  const transaction = apm.startTransaction(req.path, 'request');
  res.on('finish', () => {
    transaction.result = res.statusCode >= 400 ? 'error' : 'success';
    transaction.end();
  });
  next();
});

// Tracking custom performance metrics
function processOrder(order) {
  const span = apm.startSpan('Process Order');
  try {
    // Business logic
    validateOrder(order);
    chargePayment(order);
    createShipment(order);
  } catch (error) {
    apm.captureError(error);
    throw error;
  } finally {
    // startSpan returns null outside an active transaction
    if (span) span.end();
  }
}
```
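The Apdex score listed under Key Metrics can be derived from the same response-time data. A sketch of the standard formula, assuming a target threshold T (requests at or under T count as satisfied, under 4T as tolerating, the rest as frustrated):

```javascript
// Apdex = (satisfied + tolerating / 2) / total samples
function apdex(durationsMs, targetMs) {
  const satisfied = durationsMs.filter((d) => d <= targetMs).length;
  const tolerating = durationsMs.filter(
    (d) => d > targetMs && d <= 4 * targetMs
  ).length;
  return (satisfied + tolerating / 2) / durationsMs.length;
}
```

A score near 1.0 means nearly all users are satisfied; scores below about 0.7 usually warrant investigation.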
### 2. Infrastructure Monitoring
Track the health of servers, databases, and infrastructure:
System Metrics:
- CPU utilization
- Memory usage
- Disk I/O
- Network throughput
Database Metrics:
- Query performance
- Connection pool usage
- Slow queries
- Replication lag
Container/Orchestration Metrics:
- Pod/container health
- Resource limits
- Restart frequency
- Deployment status
```yaml
# Prometheus monitoring configuration
scrape_configs:
  - job_name: 'application'
    static_configs:
      - targets: ['app:3000']
  - job_name: 'postgres'
    static_configs:
      - targets: ['postgres-exporter:9187']
  - job_name: 'redis'
    static_configs:
      - targets: ['redis-exporter:9121']
```
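Each scrape target above must serve a `/metrics` endpoint in the Prometheus text exposition format. In a real Node service you would use a client library such as prom-client; this dependency-free sketch just shows what the scraped payload looks like:

```javascript
// Hand-rolled counter rendered in the Prometheus text exposition format.
// A real application should use a client library (e.g. prom-client) instead.
const counters = { http_requests_total: 0 };

function incRequests() {
  counters.http_requests_total += 1;
}

function renderMetrics() {
  // HELP and TYPE comment lines, then one line per sample
  return [
    '# HELP http_requests_total Total HTTP requests handled.',
    '# TYPE http_requests_total counter',
    `http_requests_total ${counters.http_requests_total}`,
  ].join('\n');
}
```

Serving this string from a `/metrics` route with a `text/plain` content type is all Prometheus needs to begin scraping.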
### 3. Error Tracking and Logging
Capture and contextualize application errors:
```javascript
const Sentry = require('@sentry/node');

Sentry.init({
  dsn: process.env.SENTRY_DSN,
  environment: process.env.NODE_ENV,
  tracesSampleRate: 0.1, // Sample 10% of transactions for performance
  beforeSend(event, hint) {
    // Add custom context to every event
    event.tags = {
      ...event.tags,
      version: process.env.APP_VERSION,
      deployment: process.env.DEPLOY_ID,
    };
    return event;
  },
});

// Structured logging with context
const winston = require('winston');

const logger = winston.createLogger({
  level: 'info',
  format: winston.format.json(),
  defaultMeta: {
    service: 'my-app',
    version: process.env.APP_VERSION,
  },
  transports: [
    new winston.transports.File({ filename: 'error.log', level: 'error' }),
    new winston.transports.File({ filename: 'combined.log' }),
  ],
});

// Log with context (order, user, and error come from the surrounding handler)
logger.error('Payment processing failed', {
  orderId: order.id,
  userId: user.id,
  amount: order.total,
  errorCode: error.code,
  errorMessage: error.message,
});
```
### 4. User Experience Monitoring (Real User Monitoring - RUM)
Track actual user experiences in production:
```javascript
// Client-side RUM implementation
import { onCLS, onFID, onLCP, onFCP, onTTFB } from 'web-vitals';

function sendToAnalytics({ name, value, id, ...extra }) {
  const body = JSON.stringify({
    metric: name,
    value: Math.round(value),
    id,
    page: window.location.pathname,
    timestamp: Date.now(),
    ...extra, // forward custom fields like `action` and metadata
  });
  // Use sendBeacon so the request survives page unloads
  navigator.sendBeacon('/api/metrics', body);
}

// Track Core Web Vitals
onCLS(sendToAnalytics); // Cumulative Layout Shift
onFID(sendToAnalytics); // First Input Delay
onLCP(sendToAnalytics); // Largest Contentful Paint
onFCP(sendToAnalytics); // First Contentful Paint
onTTFB(sendToAnalytics); // Time to First Byte

// Track custom user actions
function trackUserAction(action, metadata) {
  sendToAnalytics({
    name: 'user_action',
    value: Date.now(),
    id: generateId(),
    action,
    ...metadata,
  });
}

// Usage
document.getElementById('checkoutButton').addEventListener('click', () => {
  trackUserAction('checkout_initiated', {
    cartValue: calculateCartTotal(),
    itemCount: getCartItemCount(),
  });
});
```
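On the server side, beaconed samples are typically aggregated at the 75th percentile, since Google's Core Web Vitals thresholds are assessed at p75 of page loads. A sketch, assuming the `{ metric, value }` payload shape used above:

```javascript
// Aggregate beacon payloads into p75 per metric (Core Web Vitals
// thresholds are defined against the 75th percentile of page loads)
function p75ByMetric(samples) {
  const grouped = {};
  for (const { metric, value } of samples) {
    if (!grouped[metric]) grouped[metric] = [];
    grouped[metric].push(value);
  }
  const result = {};
  for (const [metric, values] of Object.entries(grouped)) {
    values.sort((a, b) => a - b);
    const idx = Math.ceil(0.75 * values.length) - 1; // nearest-rank p75
    result[metric] = values[idx];
  }
  return result;
}
```

Comparing these p75 values against the published "good" thresholds (for example, LCP under 2.5s) tells you whether real users are actually having a good experience.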
## Building Your Monitoring Stack

### Modern Monitoring Architecture
```
┌──────────────────────────────────────────────┐
│ Application Instrumentation                  │
│ • APM SDK (OpenTelemetry, Datadog, etc.)     │
│ • Error tracking (Sentry, Rollbar)           │
│ • Custom metrics                             │
└──────────────┬───────────────────────────────┘
               ↓
┌──────────────────────────────────────────────┐
│ Metrics Collection & Storage                 │
│ • Time-series database (Prometheus, InfluxDB)│
│ • Log aggregation (ELK, Loki)                │
│ • Traces (Jaeger, Tempo)                     │
└──────────────┬───────────────────────────────┘
               ↓
┌──────────────────────────────────────────────┐
│ Visualization & Dashboards                   │
│ • Grafana                                    │
│ • Kibana                                     │
│ • Custom dashboards                          │
└──────────────┬───────────────────────────────┘
               ↓
┌──────────────────────────────────────────────┐
│ Alerting & Notification                      │
│ • AlertManager                               │
│ • PagerDuty                                  │
│ • Slack/Teams/Email                          │
└──────────────────────────────────────────────┘
```
### Choosing Your Monitoring Tools
| Tool Category | Open Source Option | Commercial Option | Best For |
|---|---|---|---|
| APM | OpenTelemetry | Datadog, New Relic | Performance tracking |
| Metrics | Prometheus | Datadog, SignalFx | Infrastructure monitoring |
| Logs | ELK Stack (Elasticsearch, Logstash, Kibana) | Splunk, Sumo Logic | Log analysis |
| Error Tracking | Sentry (self-hosted) | Sentry, Rollbar, Bugsnag | Exception monitoring |
| Uptime | Blackbox Exporter | Pingdom, UptimeRobot | Availability checks |
| RUM | Plausible Analytics | Google Analytics, Amplitude | User behavior |
| Synthetic | Playwright/Puppeteer | ScanlyApp, Datadog Synthetics | Proactive testing |
## Creating Effective Alerts

### The Alert Design Framework
Every alert should answer:
- What is happening? (specific, actionable description)
- Why does it matter? (business/user impact)
- What should I do? (runbook link or next steps)
```yaml
# Well-designed alert configuration
- alert: HighErrorRate
  expr: |
    (
      sum(rate(http_requests_total{status=~"5.."}[5m]))
      /
      sum(rate(http_requests_total[5m]))
    ) > 0.01
  for: 5m
  labels:
    severity: critical
    team: backend
  annotations:
    summary: 'High error rate detected: {{ $value | humanizePercentage }}'
    description: |
      Error rate is {{ $value | humanizePercentage }} (threshold: 1%)
      **Impact**: Users experiencing failures
      **Affected service**: {{ $labels.service }}
      **Dashboard**: https://grafana.example.com/d/errors
      **Runbook**: https://wiki.example.com/runbooks/high-error-rate
    # Provide context for debugging
    helpful_queries: |
      Recent errors: `kubectl logs -l app={{ $labels.service }} --since=10m | grep ERROR`
      Error breakdown: Check Sentry dashboard
```
### Alert Severity Levels
| Severity | Response Time | Notification | Examples |
|---|---|---|---|
| Critical | Immediate (page on-call) | Phone call, SMS, Slack | Site down, payment processing broken, data loss |
| High | 15 minutes | Slack, email | Elevated error rate, performance degradation, dependency failure |
| Medium | 1 hour | Slack, email (business hours) | Increased latency, elevated CPU, disk space warning |
| Low | 1 day | Email, ticket | Certificate expiring soon, deprecated feature usage |
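A severity table like this is easy to encode as a routing policy so every alert reaches the right channels automatically. A sketch mirroring the table above (channel names and response windows are illustrative, not a real PagerDuty or Opsgenie configuration):

```javascript
// Map each severity to notification channels and a response-time budget
const ESCALATION = {
  critical: { channels: ['phone', 'sms', 'slack'], maxResponseMinutes: 0 },
  high: { channels: ['slack', 'email'], maxResponseMinutes: 15 },
  medium: { channels: ['slack', 'email'], maxResponseMinutes: 60, businessHoursOnly: true },
  low: { channels: ['email', 'ticket'], maxResponseMinutes: 24 * 60 },
};

function routeAlert(alert) {
  const policy = ESCALATION[alert.severity];
  if (!policy) throw new Error(`Unknown severity: ${alert.severity}`);
  return { ...alert, ...policy };
}
```

Keeping the mapping in one place means a severity change in the table is a one-line code change, not a hunt through scattered alert definitions.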
### Alert Thresholds
Setting appropriate thresholds prevents noise:
Baseline Establishment:
```javascript
// Collect baseline metrics over 2-4 weeks
const baseline = {
  errorRate: {
    p50: 0.001, // 0.1%
    p95: 0.005, // 0.5%
    p99: 0.01, // 1%
  },
  responseTime: {
    p50: 120, // 120ms
    p95: 450, // 450ms
    p99: 1200, // 1.2s
  },
};

// Set alert thresholds at p99 plus a margin
const alertThresholds = {
  errorRate: baseline.errorRate.p99 * 1.5, // 1.5%
  responseTime: baseline.responseTime.p99 * 1.3, // 1.56s
};
```
Dynamic Thresholds:
```yaml
# Alert on anomalies rather than fixed thresholds
- alert: AnomalousResponseTime
  expr: |
    (
      rate(http_request_duration_seconds_sum[5m])
      /
      rate(http_request_duration_seconds_count[5m])
    ) > (
      avg_over_time(
        (
          rate(http_request_duration_seconds_sum[5m])
          /
          rate(http_request_duration_seconds_count[5m])
        )[1h:5m]
      ) * 1.5  # 50% above the 1-hour average latency
    )
```
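The same anomaly rule can be expressed in plain code for systems without PromQL: flag latency whenever the current short-window average exceeds the trailing baseline by 50%.

```javascript
// Flag when the current 5-minute average exceeds the trailing
// 1-hour average (supplied as a list of 5-minute averages) by `factor`
function isAnomalous(currentAvgMs, hourlyAvgsMs, factor = 1.5) {
  const baseline =
    hourlyAvgsMs.reduce((sum, v) => sum + v, 0) / hourlyAvgsMs.length;
  return currentAvgMs > baseline * factor;
}
```

The advantage over a fixed threshold is that the rule adapts: a service that is always slow at 9 AM won't page you every morning.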
## Building Dashboards That Matter

### The Dashboard Hierarchy
1. Executive Dashboard (for leadership):
- Key business metrics
- Overall system health
- Recent incidents
- Week-over-week trends
2. Service Health Dashboard (for on-call engineers):
- Real-time error rates
- Response time percentiles
- Request throughput
- Dependency status
3. Deep Dive Dashboards (for debugging):
- Detailed traces
- Log correlation
- Resource utilization
- Database performance
### Dashboard Design Principles
✅ Do This:
- Show trends, not just current values
- Use percentiles (p95, p99), not averages
- Include comparison periods (day/week/month ago)
- Group related metrics together
- Add links to relevant log queries
- Display SLO progress
❌ Avoid This:
- Too many metrics on one screen
- Metrics without context
- Unclear axis labels
- Mixing drastically different scales
- Showing every data point (overwhelming)
### Sample Dashboard Configuration
```json
{
  "dashboard": {
    "title": "Application Health",
    "panels": [
      {
        "title": "Request Rate",
        "type": "graph",
        "queries": ["sum(rate(http_requests_total[5m])) by (service)"],
        "yAxisLabel": "requests/second"
      },
      {
        "title": "Error Rate",
        "type": "graph",
        "queries": ["sum(rate(http_requests_total{status=~\"5..\"}[5m])) / sum(rate(http_requests_total[5m]))"],
        "yAxisLabel": "error rate",
        "thresholds": [{ "value": 0.01, "color": "red" }]
      },
      {
        "title": "Response Time (95th percentile)",
        "type": "graph",
        "queries": ["histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service))"],
        "yAxisLabel": "seconds"
      }
    ]
  }
}
```
## Incident Response Workflow

### The Incident Lifecycle
1. Detection → 2. Acknowledgment → 3. Investigation → 4. Mitigation → 5. Resolution → 6. Post-Mortem
### Automated Incident Management
```javascript
// Incident detection and escalation
// (incidentTracker, slack, statusPage, sleep, and generateId are
// assumed helpers provided elsewhere in the codebase)
class IncidentManager {
  async detectIncident(alert) {
    // Check if this is a new incident or a continuation of one
    const existingIncident = await this.findRelatedIncident(alert);
    if (existingIncident) {
      await this.updateIncident(existingIncident, alert);
    } else {
      await this.createIncident(alert);
    }
  }

  async createIncident(alert) {
    const incident = {
      id: generateId(),
      severity: alert.severity,
      title: alert.summary,
      description: alert.description,
      startTime: Date.now(),
      status: 'open',
      affectedServices: alert.services,
      oncallEngineer: await this.getOncallEngineer(alert.team),
    };

    // Create incident in tracking system
    await incidentTracker.create(incident);

    // Notify on-call engineer
    await this.notifyOncall(incident);

    // Create communication channel
    await slack.createChannel(`incident-${incident.id}`);

    // Post to status page if user-impacting
    if (this.isUserImpacting(incident)) {
      await statusPage.postIncident(incident);
    }

    return incident;
  }

  async notifyOncall(incident) {
    const engineer = incident.oncallEngineer;

    // Escalation path: Slack immediately, then SMS, then phone
    const notifications = [
      { method: 'slack', delay: 0 },
      { method: 'sms', delay: 2 * 60 * 1000 }, // 2 minutes later
      { method: 'phone', delay: 5 * 60 * 1000 }, // 5 minutes after that
    ];

    for (const notification of notifications) {
      await sleep(notification.delay);

      // Stop escalating once the incident is acknowledged
      const ack = await this.checkAcknowledgment(incident.id);
      if (ack) break;

      await this.sendNotification(engineer, notification.method, incident);
    }
  }
}
```
## SLOs and Error Budgets

### Defining Service Level Objectives
SLO Structure:
- SLI (Service Level Indicator): Metric being measured
- SLO (Service Level Objective): Target for that metric
- SLA (Service Level Agreement): Customer promise with consequences
```yaml
# Example SLOs
slos:
  - name: api_availability
    description: 'API returns successful responses'
    sli:
      metric: http_requests_total
      good_filter: 'status!~"5.."'
      total_filter: ''
    slo_target: 0.999  # 99.9% success rate
    window: 30d

  - name: api_latency
    description: 'API responds within acceptable time'
    sli:
      metric: http_request_duration_seconds
      threshold: 0.5  # 500ms
    slo_target: 0.95  # 95% under threshold
    window: 30d
```
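Evaluating an availability SLO like `api_availability` reduces to a ratio of good events to total events over the window. A minimal sketch:

```javascript
// Compute an availability SLI from raw counts and compare to the target
function evaluateSlo(goodEvents, totalEvents, target) {
  const sli = goodEvents / totalEvents;
  return { sli, target, met: sli >= target };
}
```

The same shape works for the latency SLO: count requests under the 500ms threshold as "good" and all requests as "total".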
### Error Budget Calculation
```javascript
function calculateErrorBudget(slo) {
  const targetUptime = slo.target; // e.g., 0.999 (99.9%)
  const windowDays = 30;
  const totalMinutes = windowDays * 24 * 60;
  const allowedDowntimeMinutes = totalMinutes * (1 - targetUptime);

  // Query actual uptime
  const actualUptime = getActualUptime(windowDays);
  const actualDowntimeMinutes = totalMinutes * (1 - actualUptime);

  const remainingBudget = allowedDowntimeMinutes - actualDowntimeMinutes;
  const budgetPercentage = (remainingBudget / allowedDowntimeMinutes) * 100;

  return {
    allowedDowntime: allowedDowntimeMinutes,
    actualDowntime: actualDowntimeMinutes,
    remaining: remainingBudget,
    percentage: budgetPercentage,
    status: budgetPercentage > 0 ? 'healthy' : 'exceeded',
  };
}

// Example for a 99.9% SLO over 30 days:
// Allowed downtime: 43.2 minutes/month
// If actual downtime is 20 minutes,
// remaining budget: 23.2 minutes (53.7%)
```
### Using Error Budgets for Decision Making
Policy example:
- >50% budget remaining: Ship features freely
- 25-50% budget: Increase testing, review changes carefully
- 10-25% budget: Feature freeze, focus on reliability
- <10% budget: Emergency mode, halt all feature work
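This policy can be wired directly into a deploy pipeline as a release gate. A sketch using the thresholds above (the mode names are illustrative):

```javascript
// Map remaining error budget (percent) to a release mode
function releasePolicy(budgetRemainingPercent) {
  if (budgetRemainingPercent > 50) return 'ship-freely';
  if (budgetRemainingPercent > 25) return 'increased-review';
  if (budgetRemainingPercent > 10) return 'feature-freeze';
  return 'emergency-reliability-only';
}
```

A CI step could call `releasePolicy` with the current budget from `calculateErrorBudget` and block deploys automatically in the lower tiers, turning the policy from a document into an enforced rule.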
## Connecting Monitoring to Quality Culture
Web application monitoring isn't separate from quality assurance—it's continuous validation that your application works in production. Implementing continuous testing in CI/CD catches issues before deployment, while monitoring validates behavior after.
Understanding common website bugs helps you know what to monitor. And combining synthetic monitoring with real user monitoring provides complete visibility.
## Build Monitoring That Works
You now understand how to implement comprehensive web application monitoring that provides actionable insights, not just data. You know how to track application health, set up effective alerts, build meaningful dashboards, and respond to incidents systematically.
The difference between reactive firefighting and proactive reliability engineering is monitoring done right.
## Comprehensive Application Monitoring with ScanlyApp
ScanlyApp augments your monitoring stack with synthetic uptime monitoring and comprehensive application testing:
✅ 24/7 Synthetic Monitoring – Continuous testing of critical user journeys
✅ Multi-Region Uptime Checks – Global availability validation
✅ Performance Tracking – Response time monitoring and regression detection
✅ Error Detection – JavaScript console monitoring and exception tracking
✅ Visual Regression Monitoring – Catch UI breaks automatically
✅ Integrated Alerting – Smart notifications when issues arise
Add comprehensive synthetic monitoring to your observability stack in 2 minutes.
Questions about building a monitoring strategy for your specific architecture? Talk to our observability experts—we're here to help you gain true visibility into your application.