SLOs and Error Budgets: The Developer Guide to Shipping Faster Without Breaking Things
Your team ships fast. Maybe too fast. Last week's deployment caused a 30-minute outage. The week before, a performance regression made the app unusable for premium customers. Your VP of Engineering wants "more stability," but your product manager is pushing for faster feature delivery. How do you quantify what's acceptable?
Enter Service Level Objectives (SLOs) and error budgets: the framework that transforms subjective reliability discussions ("we need more uptime!") into objective, measurable targets ("we commit to 99.9% availability, which allows 43 minutes of downtime per month").
SLOs represent a commitment to your users about the service quality they can expect. Error budgets quantify how much failure is acceptable. Together, they create a framework for making data-driven decisions about:
- When to deploy (is the error budget exhausted?)
- When to halt features and fix tech debt (error budget burned)
- How much risk to take (error budget remaining)
- Whether to roll back or forward (impact on SLO)
This guide explains SLOs and error budgets from first principles, shows you how to define meaningful objectives for your service, and provides practical implementation examples to start using them today.
Understanding SLI, SLO, and SLA
Three related but distinct concepts form the foundation:
graph TD
A[Service Level Indicator<br/>SLI] --> B[Service Level Objective<br/>SLO]
B --> C[Service Level Agreement<br/>SLA]
A1[Measurement<br/>What we measure] --> A
B1[Target<br/>What we promise internally] --> B
C1[Contract<br/>What we promise customers] --> C
style A fill:#bbdefb
style B fill:#c5e1a5
style C fill:#fff9c4
Service Level Indicator (SLI)
A quantitative measure of service behavior.
Examples:
- Request success rate
- Request latency (p95, p99)
- System throughput
- Data durability
// Example SLI definitions
interface SLI {
name: string;
description: string;
measurement: () => Promise<number>;
}
const requestSuccessRateSLI: SLI = {
name: 'request_success_rate',
description: 'Percentage of HTTP requests that return 2xx or 3xx status',
measurement: async () => {
const total = await metrics.query('sum(http_requests_total)');
const successful = await metrics.query('sum(http_requests_total{status=~"2..|3.."})');
return (successful / total) * 100;
},
};
const requestLatencySLI: SLI = {
name: 'request_latency_p95',
description: '95th percentile of request duration',
measurement: async () => {
return await metrics.query('histogram_quantile(0.95, http_request_duration_seconds)');
},
};
Service Level Objective (SLO)
A target value or range for an SLI.
Examples:
- 99.9% of requests succeed (availability SLO)
- 95% of requests complete in < 200ms (latency SLO)
- 99% of writes are durable within 1 minute (durability SLO)
interface SLO {
name: string;
sli: SLI;
target: number;
window: string; // time window
unit: string;
}
const availabilitySLO: SLO = {
name: 'API Availability',
sli: requestSuccessRateSLI,
target: 99.9, // 99.9%
window: '30d', // rolling 30 days
unit: '%',
};
const latencySLO: SLO = {
name: 'API Latency P95',
sli: requestLatencySLI,
target: 200, // 200ms
window: '30d',
unit: 'ms',
};
Service Level Agreement (SLA)
A contractual commitment to customers, often with financial penalties.
Example:
- "We guarantee 99.95% uptime. If we fail, you get a 10% service credit."
Critical distinction: SLOs should be stricter than SLAs to provide a buffer.
| Metric | SLA | SLO | Buffer |
|---|---|---|---|
| Availability | 99.95% | 99.99% | 5x safety margin |
| Latency P95 | < 500ms | < 200ms | 2.5x safety margin |
Reason: The SLO buffer allows you to catch and fix issues before violating the SLA.
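That buffer can be expressed as the ratio of the downtime each target tolerates. A minimal sketch (the `sloBuffer` helper is hypothetical, not part of the examples above):

```typescript
// Safety margin between an SLA and a stricter internal SLO: the ratio of
// downtime the SLA tolerates to downtime the SLO tolerates.
function sloBuffer(slaPercent: number, sloPercent: number): number {
  return (100 - slaPercent) / (100 - sloPercent);
}

// A 99.95% SLA with a 99.99% internal SLO: the SLA tolerates 0.05% downtime,
// the SLO only 0.01%, so you have a 5x margin before the SLA is at risk.
console.log(sloBuffer(99.95, 99.99)); // ≈ 5
```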
Calculating Error Budgets
Error budget = (1 - SLO) × time window
It represents the amount of failure you can tolerate while still meeting your SLO.
Availability Error Budget
// error-budget-calculator.ts
interface ErrorBudget {
slo: number; // percentage (e.g., 99.9)
windowDays: number;
allowedDowntimeMinutes: number;
allowedFailedRequests: number;
totalRequests: number;
}
function calculateErrorBudget(sloPercent: number, windowDays: number, requestsPerSecond: number): ErrorBudget {
// Total time in window
const totalMinutes = windowDays * 24 * 60;
// Allowed downtime
const allowedUptimePercent = sloPercent;
const allowedDowntimePercent = 100 - allowedUptimePercent;
const allowedDowntimeMinutes = (totalMinutes * allowedDowntimePercent) / 100;
// Total requests in window
const totalRequests = requestsPerSecond * windowDays * 24 * 60 * 60;
// Allowed failed requests
const allowedFailedRequests = Math.floor((totalRequests * allowedDowntimePercent) / 100);
return {
slo: sloPercent,
windowDays,
allowedDowntimeMinutes,
allowedFailedRequests,
totalRequests,
};
}
// Example: 99.9% SLO over 30 days, 1000 req/s
const budget = calculateErrorBudget(99.9, 30, 1000);
console.log(`SLO: ${budget.slo}%`);
console.log(`Time window: ${budget.windowDays} days`);
console.log(`Allowed downtime: ${budget.allowedDowntimeMinutes.toFixed(2)} minutes`);
console.log(`Total requests: ${budget.totalRequests.toLocaleString()}`);
console.log(`Allowed failures: ${budget.allowedFailedRequests.toLocaleString()}`);
// Output:
// SLO: 99.9%
// Time window: 30 days
// Allowed downtime: 43.2 minutes
// Total requests: 2,592,000,000
// Allowed failures: 2,592,000
SLO vs Downtime Lookup Table
| SLO | Downtime per Year | Downtime per Month | Downtime per Week | Downtime per Day |
|---|---|---|---|---|
| 90% | 36.5 days | 3 days | 16.8 hours | 2.4 hours |
| 95% | 18.25 days | 1.5 days | 8.4 hours | 1.2 hours |
| 99% | 3.65 days | 7.2 hours | 1.68 hours | 14.4 minutes |
| 99.5% | 1.83 days | 3.6 hours | 50.4 minutes | 7.2 minutes |
| 99.9% | 8.76 hours | 43.2 minutes | 10.1 minutes | 1.44 minutes |
| 99.95% | 4.38 hours | 21.6 minutes | 5.04 minutes | 43.2 seconds |
| 99.99% | 52.6 minutes | 4.32 minutes | 1.01 minutes | 8.64 seconds |
| 99.999% | 5.26 minutes | 25.9 seconds | 6.05 seconds | 0.86 seconds |
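Every row in the table follows directly from the error budget formula. A small helper (hypothetical, not used elsewhere in this guide) reproduces the numbers:

```typescript
// Allowed downtime in minutes for a given SLO over a given period.
function allowedDowntimeMinutes(sloPercent: number, periodDays: number): number {
  return periodDays * 24 * 60 * (1 - sloPercent / 100);
}

console.log(allowedDowntimeMinutes(99.9, 30));   // ≈ 43.2 minutes per month
console.log(allowedDowntimeMinutes(99.99, 365)); // ≈ 52.6 minutes per year
```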
Error Budget Consumption Tracking
Real-Time Budget Monitoring
// error-budget-monitor.ts
interface BudgetStatus {
slo: number;
windowStart: Date;
windowEnd: Date;
totalRequests: number;
failedRequests: number;
currentSuccessRate: number;
errorBudgetAllowed: number;
errorBudgetConsumed: number;
errorBudgetRemaining: number;
percentConsumed: number;
projectedBudgetBurn: number;
}
async function getErrorBudgetStatus(slo: SLO, windowDays: number = 30): Promise<BudgetStatus> {
const windowEnd = new Date();
const windowStart = new Date(windowEnd.getTime() - windowDays * 24 * 60 * 60 * 1000);
// Query metrics
const totalRequests = await queryMetric(`sum(increase(http_requests_total[${windowDays}d]))`);
const failedRequests = await queryMetric(`sum(increase(http_requests_total{status=~"5.."}[${windowDays}d]))`);
const currentSuccessRate = ((totalRequests - failedRequests) / totalRequests) * 100;
// Calculate budget
const errorBudgetAllowed = Math.floor((totalRequests * (100 - slo.target)) / 100);
const errorBudgetConsumed = failedRequests;
const errorBudgetRemaining = errorBudgetAllowed - errorBudgetConsumed;
const percentConsumed = (errorBudgetConsumed / errorBudgetAllowed) * 100;
// Project future burn rate
const daysElapsed = (new Date().getTime() - windowStart.getTime()) / (1000 * 60 * 60 * 24);
const burnRate = errorBudgetConsumed / daysElapsed;
const projectedBudgetBurn = ((burnRate * windowDays) / errorBudgetAllowed) * 100;
return {
slo: slo.target,
windowStart,
windowEnd,
totalRequests,
failedRequests,
currentSuccessRate,
errorBudgetAllowed,
errorBudgetConsumed,
errorBudgetRemaining,
percentConsumed,
projectedBudgetBurn,
};
}
// Usage with alerting
async function checkErrorBudget(slo: SLO) {
const status = await getErrorBudgetStatus(slo, 30);
console.log(`\nError Budget Status for ${slo.name}`);
console.log(`SLO Target: ${status.slo}%`);
console.log(`Current Success Rate: ${status.currentSuccessRate.toFixed(3)}%`);
console.log(`\nError Budget:`);
console.log(` Allowed: ${status.errorBudgetAllowed.toLocaleString()} failures`);
console.log(` Consumed: ${status.errorBudgetConsumed.toLocaleString()} failures`);
console.log(` Remaining: ${status.errorBudgetRemaining.toLocaleString()} failures`);
console.log(` Percent Used: ${status.percentConsumed.toFixed(2)}%`);
console.log(`\nProjected Budget Burn: ${status.projectedBudgetBurn.toFixed(2)}%`);
// Alert thresholds
if (status.percentConsumed > 100) {
console.error('CRITICAL: Error budget exhausted! SLO violated.');
alertOncall({
severity: 'critical',
message: `${slo.name} SLO violated. Error budget at ${status.percentConsumed.toFixed(0)}%`,
});
} else if (status.percentConsumed > 80) {
console.warn('WARNING: Error budget 80% consumed');
alertTeam({
severity: 'warning',
message: `${slo.name} error budget at ${status.percentConsumed.toFixed(0)}%. Slow down deployments.`,
});
} else if (status.projectedBudgetBurn > 100) {
console.warn('WARNING: Projected to exceed error budget');
alertTeam({
severity: 'warning',
message: `${slo.name} projected to exceed error budget (${status.projectedBudgetBurn.toFixed(0)}% burn rate)`,
});
} else {
console.log('OK: Error budget healthy');
}
}
Multi-Window Alerting (Burn Rate)
Fast-burning error budgets need immediate attention. Use multiple time windows:
// burn-rate-alerts.ts
interface BurnRateAlert {
lookbackWindow: string;
burnRateThreshold: number;
errorBudgetThreshold: number;
severity: 'warning' | 'critical';
}
const burnRateAlerts: BurnRateAlert[] = [
// Fast burn - immediate action needed
{
lookbackWindow: '1h',
burnRateThreshold: 14.4, // 14.4x burn rate
errorBudgetThreshold: 2, // 2% of 30-day budget consumed
severity: 'critical',
},
// Medium burn - investigate soon
{
lookbackWindow: '6h',
burnRateThreshold: 6, // 6x burn rate
errorBudgetThreshold: 5,
severity: 'warning',
},
// Slow burn - keep an eye on it
{
lookbackWindow: '3d',
burnRateThreshold: 1, // Equal to expected
errorBudgetThreshold: 10,
severity: 'warning',
},
];
async function checkBurnRates(slo: SLO) {
for (const alert of burnRateAlerts) {
const errorRate = await queryMetric(
`(1 - sum(rate(http_requests_total{status=~"2..|3.."}[${alert.lookbackWindow}])) / sum(rate(http_requests_total[${alert.lookbackWindow}]))) * 100`,
);
const expectedErrorRate = 100 - slo.target; // e.g., 0.1% for 99.9% SLO
const burnRate = errorRate / expectedErrorRate;
// Percent of the 30-day error budget consumed in this window:
// window failures relative to the total failures the 30-day budget allows.
const windowErrors = await queryMetric(`sum(increase(http_requests_total{status=~"5.."}[${alert.lookbackWindow}]))`);
const total30d = await queryMetric(`sum(increase(http_requests_total[30d]))`);
const budgetConsumed = (windowErrors / (total30d * ((100 - slo.target) / 100))) * 100;
if (burnRate > alert.burnRateThreshold && budgetConsumed > alert.errorBudgetThreshold) {
alertTeam({
severity: alert.severity,
message: `High error budget burn rate: ${burnRate.toFixed(1)}x over ${alert.lookbackWindow}`,
details: {
window: alert.lookbackWindow,
errorRate: `${errorRate.toFixed(3)}%`,
budgetConsumed: `${budgetConsumed.toFixed(2)}%`,
},
});
}
}
}
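The threshold numbers in the `burnRateAlerts` config are not arbitrary. A burn rate of 1 spends the budget exactly as fast as the full window allows, so the burn rate needed to consume a given fraction of an N-day budget within a short lookback window follows from a simple ratio (sketched with a hypothetical helper):

```typescript
// Burn rate needed to consume `budgetFraction` of a `budgetDays`-day error
// budget within `windowHours`. A burn rate of 1 exhausts the budget exactly
// at the end of the full window.
function requiredBurnRate(budgetFraction: number, windowHours: number, budgetDays: number = 30): number {
  return (budgetFraction * budgetDays * 24) / windowHours;
}

console.log(requiredBurnRate(0.02, 1));  // 14.4 — 2% of a 30-day budget in 1h
console.log(requiredBurnRate(0.05, 6));  // 6    — 5% in 6h
console.log(requiredBurnRate(0.1, 72));  // 1    — 10% in 3d
```

These are exactly the three thresholds used above: the faster the budget is burning, the shorter the window that should page someone.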
Choosing Good SLOs
The Golden Signals
Start with the four golden signals from Google's SRE book:
graph TD
A[SLO Categories] --> B[Latency]
A --> C[Traffic]
A --> D[Errors]
A --> E[Saturation]
B --> B1[Request duration<br/>p50, p95, p99]
C --> C1[Requests per second<br/>Throughput]
D --> D1[Error rate<br/>Failed requests %]
E --> E1[Resource utilization<br/>CPU, Memory, Disk]
style B fill:#bbdefb
style C fill:#c5e1a5
style D fill:#ffccbc
style E fill:#fff9c4
Example SLOs by Service Type
API Service
const apiSLOs: SLO[] = [
{
name: 'API Availability',
sli: requestSuccessRateSLI,
target: 99.9,
window: '30d',
unit: '%',
},
{
name: 'API Latency P95',
sli: requestLatencyP95SLI,
target: 200,
window: '30d',
unit: 'ms',
},
{
name: 'API Latency P99',
sli: requestLatencyP99SLI,
target: 500,
window: '30d',
unit: 'ms',
},
];
Background Job Processor
const jobProcessorSLOs: SLO[] = [
{
name: 'Job Success Rate',
sli: jobSuccessRateSLI,
target: 99.5,
window: '30d',
unit: '%',
},
{
name: 'Job Processing Time P95',
sli: jobProcessingTimeP95SLI,
target: 60000, // 1 minute
window: '7d',
unit: 'ms',
},
{
name: 'Job Queue Depth',
sli: jobQueueDepthSLI,
target: 1000,
window: '1d',
unit: 'jobs',
},
];
Data Pipeline
const dataPipelineSLOs: SLO[] = [
{
name: 'Data Freshness',
sli: dataFreshnessSLI,
target: 15, // minutes
window: '7d',
unit: 'minutes',
},
{
name: 'Data Completeness',
sli: dataCompletenessSLI,
target: 99.99,
window: '30d',
unit: '%',
},
{
name: 'Pipeline Success Rate',
sli: pipelineSuccessRateSLI,
target: 99.0,
window: '30d',
unit: '%',
},
];
SLO Definition Best Practices
| Principle | Good ✅ | Bad ❌ |
|---|---|---|
| User-centric | "95% of page loads complete in < 2s" | "Database replication lag < 5s" |
| Measurable | "P95 latency < 200ms" | "System is fast" |
| Achievable | 99.9% (three nines), realistic | 99.9999% (six nines) for a startup |
| Business-aligned | "Error rate doesn't exceed refund policy" | "Zero errors ever" |
| Simple | "Request success rate > 99.9%" | "Weighted score of 7 metrics" |
Using Error Budgets for Decision Making
Deployment Gating
// deployment-gate.ts
async function canDeploy(slo: SLO): Promise<boolean> {
const status = await getErrorBudgetStatus(slo, 30);
// Policy: Don't deploy if error budget > 80% consumed
if (status.percentConsumed > 80) {
console.log(`Deployment blocked: Error budget ${status.percentConsumed.toFixed(0)}% consumed`);
console.log(`Focus on reliability before deploying new features.`);
return false;
}
// Policy: Don't deploy if burn rate projects budget exhaustion
if (status.projectedBudgetBurn > 100) {
console.log(`Deployment blocked: Projected to exceed error budget`);
console.log(`Current burn rate: ${status.projectedBudgetBurn.toFixed(0)}%`);
return false;
}
console.log(`Deployment approved: Error budget ${status.percentConsumed.toFixed(0)}% consumed`);
return true;
}
// CI/CD integration
async function deploymentPipeline() {
const criticalSLOs = [availabilitySLO, latencySLO];
for (const slo of criticalSLOs) {
const allowed = await canDeploy(slo);
if (!allowed) {
process.exit(1); // Block deployment
}
}
// All SLOs healthy - proceed with deployment
console.log('All SLOs healthy. Proceeding with deployment...');
deploy();
}
Feature Velocity vs Reliability
// velocity-calculator.ts
interface VelocityDecision {
errorBudgetRemaining: number;
recommendedDeploymentFrequency: string;
recommendedChangeSizeRisk: 'low' | 'medium' | 'high';
canExpediteFeatures: boolean;
}
function calculateVelocityPolicy(budgetStatus: BudgetStatus): VelocityDecision {
const remaining = budgetStatus.errorBudgetRemaining;
const percentRemaining = 100 - budgetStatus.percentConsumed;
if (percentRemaining > 50) {
return {
errorBudgetRemaining: remaining,
recommendedDeploymentFrequency: 'Multiple per day',
recommendedChangeSizeRisk: 'high',
canExpediteFeatures: true,
};
} else if (percentRemaining > 20) {
return {
errorBudgetRemaining: remaining,
recommendedDeploymentFrequency: 'Daily',
recommendedChangeSizeRisk: 'medium',
canExpediteFeatures: false,
};
} else {
return {
errorBudgetRemaining: remaining,
recommendedDeploymentFrequency: 'Weekly or less',
recommendedChangeSizeRisk: 'low',
canExpediteFeatures: false,
};
}
}
Implementing SLOs: A Step-by-Step Guide
Step 1: Identify User Journeys
Map the critical paths users take through your service:
// user-journeys.ts
interface UserJourney {
name: string;
steps: string[];
importance: 'critical' | 'high' | 'medium' | 'low';
}
const userJourneys: UserJourney[] = [
{
name: 'User Authentication',
steps: ['POST /api/auth/login', 'GET /api/user/profile'],
importance: 'critical',
},
{
name: 'Product Purchase',
steps: ['GET /api/products/:id', 'POST /api/cart/add', 'POST /api/checkout', 'POST /api/payment/process'],
importance: 'critical',
},
{
name: 'View Dashboard',
steps: ['GET /api/dashboard', 'GET /api/analytics'],
importance: 'high',
},
];
Step 2: Define SLIs for Each Journey
// journey-slis.ts
interface JourneySLI {
journey: UserJourney;
availabilitySLI: SLI;
latencySLI: SLI;
}
const purchaseJourneySLI: JourneySLI = {
journey: userJourneys[1], // Product Purchase
availabilitySLI: {
name: 'purchase_journey_availability',
description: 'Percentage of successful purchase flows',
measurement: async () => {
// Measure end-to-end journey success
const total = await queryMetric('sum(purchase_attempts_total)');
const successful = await queryMetric('sum(purchase_success_total)');
return (successful / total) * 100;
},
},
latencySLI: {
name: 'purchase_journey_latency_p95',
description: 'P95 time from cart to payment confirmation',
measurement: async () => {
return await queryMetric('histogram_quantile(0.95, purchase_duration_seconds_bucket)');
},
},
};
Step 3: Set Initial SLO Targets
Start with what you're currently achieving, then improve:
// baseline-slo.ts
async function establishBaselineSLO(sli: SLI, days: number = 90): Promise<number> {
// Collect one measurement per day of history. In practice, query the
// historical value for each day (e.g. with a time-offset query); sampling
// the live value in a loop would just repeat today's number.
const measurements: number[] = [];
for (let i = 0; i < days; i++) {
const value = await sli.measurement(); // replace with a per-day historical query
measurements.push(value);
}
// Use a high percentile of the daily values as the initial target. For
// "lower is better" SLIs (latency), P99 is conservative; for "higher is
// better" SLIs (success rate), use a low percentile (e.g. P1) instead.
measurements.sort((a, b) => a - b);
const p99Index = Math.floor(measurements.length * 0.99);
const baseline = measurements[p99Index];
console.log(`Current performance (P99 of daily values): ${baseline.toFixed(2)}`);
console.log(`Recommended initial SLO: ${baseline.toFixed(2)}`);
return baseline;
}
Step 4: Implement Monitoring and Alerting
# prometheus-rules.yml
groups:
- name: slo_alerts
interval: 30s
rules:
# High burn rate alert (1 hour window)
- alert: HighErrorBudgetBurnRate1h
expr: |
(
sum(rate(http_requests_total{status=~"5.."}[1h])) /
sum(rate(http_requests_total[1h]))
) > 14.4 * (1 - 0.999)
for: 2m
labels:
severity: critical
annotations:
summary: 'High error budget burn rate detected'
description: 'Error budget burning at 14.4x normal rate over 1 hour'
# Error budget exhausted
- alert: ErrorBudgetExhausted
expr: |
(
sum(increase(http_requests_total{status=~"5.."}[30d])) /
sum(increase(http_requests_total[30d]))
) > (1 - 0.999)
labels:
severity: critical
annotations:
summary: 'SLO violated - error budget exhausted'
description: '30-day error budget has been exceeded'
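The `14.4 * (1 - 0.999)` expression in the first rule is the burn-rate threshold translated into a raw error ratio. A standalone sketch of what that evaluates to:

```typescript
// A burn-rate alert fires when the observed error ratio exceeds
// burnRate × (1 − SLO as a fraction).
function alertThreshold(sloPercent: number, burnRate: number): number {
  return burnRate * (1 - sloPercent / 100);
}

// 14.4x burn on a 99.9% SLO: alert when more than ~1.44% of requests
// fail over the lookback window.
console.log(alertThreshold(99.9, 14.4)); // ≈ 0.0144
```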
Step 5: Build SLO Dashboard
// slo-dashboard.ts
interface SLODashboard {
slos: Array<{
name: string;
target: number;
current: number;
status: 'healthy' | 'warning' | 'critical';
errorBudget: {
allowed: number;
consumed: number;
remaining: number;
percentUsed: number;
};
}>;
overallHealth: number;
}
async function generateSLODashboard(slos: SLO[]): Promise<SLODashboard> {
const dashboard: SLODashboard = {
slos: [],
overallHealth: 0,
};
for (const slo of slos) {
const current = await slo.sli.measurement();
const budgetStatus = await getErrorBudgetStatus(slo, 30);
let status: 'healthy' | 'warning' | 'critical' = 'healthy';
if (budgetStatus.percentConsumed > 100) {
status = 'critical';
} else if (budgetStatus.percentConsumed > 80) {
status = 'warning';
}
dashboard.slos.push({
name: slo.name,
target: slo.target,
current,
status,
errorBudget: {
allowed: budgetStatus.errorBudgetAllowed,
consumed: budgetStatus.errorBudgetConsumed,
remaining: budgetStatus.errorBudgetRemaining,
percentUsed: budgetStatus.percentConsumed,
},
});
}
// Calculate overall health
const healthyCount = dashboard.slos.filter((s) => s.status === 'healthy').length;
dashboard.overallHealth = (healthyCount / dashboard.slos.length) * 100;
return dashboard;
}
Real-World Example: E-Commerce Platform
The Situation
E-commerce platform with frequent deployments (10/day) experiencing occasional outages and customer complaints about slow checkout.
The SLOs
const ecommerceSLOs: SLO[] = [
{
name: 'Checkout Availability',
sli: checkoutSuccessRateSLI,
target: 99.95, // Very strict - money involved
window: '30d',
unit: '%',
},
{
name: 'Checkout Latency P95',
sli: checkoutLatencyP95SLI,
target: 1000, // 1 second
window: '30d',
unit: 'ms',
},
{
name: 'Product Browse Availability',
sli: browseSuccessRateSLI,
target: 99.9, // Less strict than checkout
window: '30d',
unit: '%',
},
];
The Error Budget Policy
| Error Budget Remaining | Deployment Policy | Change Size | Testing Requirements |
|---|---|---|---|
| > 50% | Deploy freely, 5-10x/day | Large changes OK | Standard CI/CD |
| 20-50% | Deploy cautiously, 1-2x/day | Medium changes | + Canary deployment |
| 5-20% | Deploy only critical fixes | Small changes only | + Manual QA sign-off |
| < 5% | Freeze all non-critical deploys | Emergency only | + VP approval |
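A policy table like this is most useful when tooling can enforce it. A sketch that encodes the tiers above (the `DeployPolicy` shape and `policyFor` name are illustrative):

```typescript
interface DeployPolicy {
  deployment: string;
  changeSize: string;
  testing: string;
}

// Map remaining error budget (percent of the 30-day budget) to the
// deployment policy tiers from the table above.
function policyFor(budgetRemainingPercent: number): DeployPolicy {
  if (budgetRemainingPercent > 50) {
    return { deployment: 'Deploy freely, 5-10x/day', changeSize: 'Large changes OK', testing: 'Standard CI/CD' };
  }
  if (budgetRemainingPercent > 20) {
    return { deployment: 'Deploy cautiously, 1-2x/day', changeSize: 'Medium changes', testing: 'Standard CI/CD + canary deployment' };
  }
  if (budgetRemainingPercent > 5) {
    return { deployment: 'Deploy only critical fixes', changeSize: 'Small changes only', testing: '+ Manual QA sign-off' };
  }
  return { deployment: 'Freeze all non-critical deploys', changeSize: 'Emergency only', testing: '+ VP approval' };
}

console.log(policyFor(65).deployment); // "Deploy freely, 5-10x/day"
console.log(policyFor(3).deployment);  // "Freeze all non-critical deploys"
```

Wiring this into the `canDeploy` gate shown earlier keeps the policy in one place instead of in tribal knowledge.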
The Results
Before SLOs:
- 10 deployments/day
- 2-3 incidents/month
- Unclear when to deploy
- Debates about "acceptable downtime"
After SLOs:
- Deployment frequency varies with error budget
- 0.5 incidents/month
- Data-driven deployment decisions
- Objective reliability targets
Conclusion
SLOs and error budgets transform reliability from a philosophical debate into an engineering discipline. They provide:
- Clarity: Specific, measurable reliability targets
- Balance: Framework for reliability vs. velocity tradeoffs
- Accountability: Clear ownership of reliability outcomes
- Objectivity: Data-driven deployment and risk decisions
To start using SLOs:
- Choose 2-3 critical user journeys
- Define availability and latency SLIs
- Set achievable SLO targets (start with current performance)
- Calculate and track error budgets
- Use error budgets to gate deployments
Remember: Perfect reliability (100% uptime) is impossible and economically irrational. SLOs help you find the right balance for your business: reliable enough to keep users happy, but not so strict that it paralyzes innovation.
Ready to implement SLOs and error budgets in your engineering organization? Sign up for ScanlyApp and get automated SLO monitoring, error budget tracking, and intelligent deployment gating integrated into your CI/CD pipeline.
