Flaky Test Debugging in CI/CD: A Forensic Method That Finds the Real Root Cause
The first flaky test in a CI/CD pipeline is barely a nuisance. You re-run the pipeline, it passes, and you move on. The tenth flaky test breaks your trust in the entire suite. The fiftieth means your team has learned to ignore red builds. And the day someone ships a real regression because "it was probably just flaky" is the day flaky tests become a production incident.
Flaky tests are not just an inconvenience. They are a systemic trust failure that compounds over time. This guide treats flakiness as a forensic problem — something to be diagnosed, categorized, traced to root causes, and remediated permanently.
Defining the Problem: What Makes a Test Flaky?
A flaky test is a test that passes and fails non-deterministically without any changes to the code under test. The same commit, the same environment, the same test — different results.
The critical distinction: a flaky test is not a test that fails. It is a test that sometimes fails and sometimes passes. This makes flaky tests far more dangerous than consistently failing ones, because:
- They are hard to reproduce locally — CI environments differ from development machines
- They destroy signal — a developer re-running a flaky test to get a green build might be hiding a real failure
- They normalize "re-run culture" — teams stop investigating failures and just re-run
- They slow pipelines — retries add 20–100% latency to CI runs
The Taxonomy of Flakiness: Eight Root Causes
Effective remediation starts with correct diagnosis. Flakiness has eight primary root causes:
```mermaid
mindmap
  root((Flaky Tests))
    Timing & Async
      Hard-coded sleeps
      Race conditions
      Polling without timeout
    Resource Contention
      Shared test database
      Port conflicts
      File system races
    External Dependencies
      Third-party APIs
      Email/SMS services
      Clock/timezone
    Test Order Dependency
      Shared global state
      Database leftovers
      Auth state bleed
    Environment Differences
      OS path separators
      Locale settings
      Node version
    Network Issues
      CDN latency
      DNS resolution
      WebSocket drops
    Visually Non-Deterministic
      Animation timing
      Font rendering
      Dynamic content
    Browser Process
      Memory pressure
      GPU process crash
      Browser version mismatch
```
Forensic Techniques: Finding the Root Cause
Technique 1: Flakiness Rate Tracking
The first step is measurement. Without a flakiness rate, you are operating blind. Track test results over time:
```typescript
// scripts/track-flakiness.ts
// Run after each CI build to accumulate flakiness data.
// `db` and `parseJUnitXml` are application-specific helpers (a database
// client and a JUnit XML parser) assumed to be defined elsewhere.

interface TestResult {
  testName: string;
  passed: boolean;
  commitSha: string;
  runId: string;
  timestamp: Date;
}

async function recordResults(junitXmlPath: string) {
  const results = parseJUnitXml(junitXmlPath);
  await db.insert('test_runs', results);
}

async function getFlakinessReport() {
  const query = `
    SELECT
      test_name,
      COUNT(*) AS total_runs,
      SUM(CASE WHEN passed = false THEN 1 ELSE 0 END) AS failures,
      ROUND(SUM(CASE WHEN passed = false THEN 1 ELSE 0 END)::decimal / COUNT(*) * 100, 1) AS flakiness_rate
    FROM test_runs
    WHERE timestamp > NOW() - INTERVAL '14 days'
    GROUP BY test_name
    HAVING SUM(CASE WHEN passed = false THEN 1 ELSE 0 END) > 0
    ORDER BY flakiness_rate DESC
    LIMIT 20;
  `;
  return db.query(query);
}
```
A test is considered flaky if it fails more than 1% of runs where nothing changed. Prioritize remediation by flakiness rate.
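If your test results live in JSON artifacts rather than a database, the same report can be computed in plain TypeScript. The sketch below mirrors the SQL above: group by test name, count failures, and apply the 1% threshold from the definition. The `FlakinessEntry` shape is an assumption for illustration, not part of any tool's API.

```typescript
// Compute per-test flakiness from raw results, mirroring the SQL report.
interface TestResult {
  testName: string;
  passed: boolean;
}

interface FlakinessEntry {
  testName: string;
  totalRuns: number;
  failures: number;
  flakinessRate: number; // percentage, rounded to one decimal place
}

function flakinessReport(results: TestResult[], thresholdPct = 1): FlakinessEntry[] {
  const byTest = new Map<string, { total: number; failures: number }>();
  for (const r of results) {
    const entry = byTest.get(r.testName) ?? { total: 0, failures: 0 };
    entry.total += 1;
    if (!r.passed) entry.failures += 1;
    byTest.set(r.testName, entry);
  }
  return [...byTest.entries()]
    .map(([testName, { total, failures }]) => ({
      testName,
      totalRuns: total,
      failures,
      flakinessRate: Math.round((failures / total) * 1000) / 10,
    }))
    // Only tests that failed at all, and above the flakiness threshold
    .filter((e) => e.failures > 0 && e.flakinessRate > thresholdPct)
    .sort((a, b) => b.flakinessRate - a.flakinessRate);
}
```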
Technique 2: The Isolation Replay
Run the suspected flaky test in isolation, 20 times in a row. If it fails any of those runs, you have confirmed flakiness and can begin debugging:
```bash
# Run the specific test 20 times and report any failures
for i in $(seq 1 20); do
  npx playwright test --grep "test name" --reporter=dot 2>&1
  echo "Run $i complete"
done | grep -E "failed|passed|Run"
```
Playwright also has a built-in repeat mechanism:
```bash
npx playwright test --repeat-each=10 tests/flaky-test.spec.ts
```
Technique 3: Playwright Trace Analysis
Playwright's trace viewer is the single most powerful tool for diagnosing flaky test failures. When a test fails in CI, the trace file captures every action, snapshot, network request, and console log.
```typescript
// playwright.config.ts — generate traces for all failures
import { defineConfig } from '@playwright/test';

export default defineConfig({
  use: {
    trace: 'on-first-retry', // capture a full trace when a test first retries
    screenshot: 'only-on-failure',
    video: 'retain-on-failure',
  },
  retries: process.env.CI ? 2 : 0,
});
```
In the trace viewer, look for:
- Timing gaps — long pauses before assertions that suggest async issues
- Network 500s — backend errors that appear intermittently
- Missing elements — elements that should be visible but were not found yet
- Console errors — JavaScript errors that explain unexpected behavior
Technique 4: The Surgeon's Checklist — Hard Waits
The most common cause of flakiness in Playwright tests is using `page.waitForTimeout()` (a hard sleep) instead of event-based waits. Search your codebase for these calls and replace every one:

```typescript
// ❌ Flaky: fixed sleep that might not be long enough on slow CI
await page.waitForTimeout(2000);
await page.getByText('Success').click();

// ✅ Reliable: wait for the actual network event
await page.waitForResponse((res) => res.url().includes('/api/save'));
await page.getByText('Success').click();

// ✅ Reliable: wait for the element to actually be visible
await page.getByText('Success').waitFor({ state: 'visible' });
await page.getByText('Success').click();

// ✅ Reliable: for polling-based state changes
await expect(page.getByTestId('status')).toHaveText('Complete', { timeout: 10_000 });
```
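The same principle applies outside Playwright: poll a condition against a deadline instead of sleeping a fixed amount. Here is a minimal sketch of such a helper (`waitUntil` is a hypothetical name, not a library function); Playwright's web-first assertions do essentially this for you, returning as soon as the condition holds.

```typescript
// Hypothetical event-based wait helper for plain async code: polls a
// condition until it holds or a deadline passes, instead of sleeping a
// fixed (and therefore either too-short or wastefully long) interval.
async function waitUntil(
  condition: () => boolean | Promise<boolean>,
  { timeoutMs = 5_000, intervalMs = 50 } = {},
): Promise<void> {
  const deadline = Date.now() + timeoutMs;
  while (Date.now() < deadline) {
    if (await condition()) return; // returns immediately once true
    await new Promise((resolve) => setTimeout(resolve, intervalMs));
  }
  throw new Error(`Condition not met within ${timeoutMs}ms`);
}
```

On a fast machine this returns in milliseconds; on a slow CI runner it simply waits longer, up to the timeout, which is exactly the behavior a hard sleep cannot give you.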
The Eight Fixes: Matched to Root Causes
| Root Cause | Fix |
|---|---|
| Hard-coded sleeps | Replace with waitForResponse, waitFor, expect assertions with timeout |
| Shared test database | Isolate each test with unique data or transaction rollback |
| Test order dependency | Reset state in beforeEach; never rely on previous test side effects |
| External API calls | Mock third-party services in test environments |
| Animation timing | Use page.addInitScript to disable CSS animations in tests |
| Port conflicts | Use random available ports; check before binding |
| Browser memory pressure | Limit parallelism; reuse browser contexts (worker scope) |
| CI environment differences | Use Docker containers with locked dependencies |
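For the "port conflicts" row, the simplest reliable tactic is to ask the OS for a free ephemeral port by binding to port 0. A sketch using Node's built-in `net` module (the helper name is ours, not a standard API):

```typescript
// Avoid port conflicts between parallel workers: bind to port 0 and let
// the OS pick a free ephemeral port instead of hard-coding one.
import { createServer } from "node:net";

function getFreePort(): Promise<number> {
  return new Promise((resolve, reject) => {
    const server = createServer();
    server.once("error", reject);
    server.listen(0, () => {
      const address = server.address();
      if (address === null || typeof address === "string") {
        reject(new Error("Could not determine port"));
        return;
      }
      // Close the probe server; the caller binds its own server to the port.
      server.close(() => resolve(address.port));
    });
  });
}
```

Note the small caveat: closing the probe and re-binding leaves a brief window in which another process could grab the port. Where possible, have the server under test bind to port 0 directly and report the port it received.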
Disabling CSS Animations: Quick Win
CSS animations cause visual timing issues, especially in screenshot and visual regression tests. Disable them globally in your test setup:
```typescript
// tests/setup/disable-animations.ts
import type { Page } from '@playwright/test';

export async function disableAnimations(page: Page) {
  await page.addInitScript(() => {
    const style = document.createElement('style');
    style.textContent = `
      *, *::before, *::after {
        animation-duration: 0s !important;
        animation-delay: 0s !important;
        transition-duration: 0s !important;
        transition-delay: 0s !important;
      }
    `;
    document.head.appendChild(style);
  });
}
```
Apply it in a fixture for all tests that do visual assertions.
Handling Test Isolation: The Most Impactful Fix
Tests that leave data in a shared database are a primary source of order-dependent flakiness. The gold standard is transaction rollback between tests:
```typescript
// tests/fixtures/db.ts - Wrap each test in a database transaction.
// `base` is Playwright's base test object and `pool` a database connection
// pool; both are assumed to be imported elsewhere in your project.
export const test = base.extend<{ db: DatabaseClient }>({
  db: async ({}, use) => {
    const client = await pool.connect();
    await client.query('BEGIN');
    await use(client);
    await client.query('ROLLBACK'); // Undo all changes after each test
    client.release();
  },
});
```
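To see why this isolates tests, consider a toy in-memory store with begin/rollback semantics. This is purely an illustration of the pattern, not a real database client: writes made inside a transaction vanish on rollback, so every test starts from the same baseline.

```typescript
// Toy illustration of BEGIN/ROLLBACK isolation (not a real DB client).
class ToyStore {
  private committed = new Map<string, string>();
  private pending: Map<string, string> | null = null;

  begin(): void {
    // Start a transaction: work on a copy of the committed state.
    this.pending = new Map(this.committed);
  }

  set(key: string, value: string): void {
    (this.pending ?? this.committed).set(key, value);
  }

  get(key: string): string | undefined {
    return (this.pending ?? this.committed).get(key);
  }

  rollback(): void {
    this.pending = null; // discard everything written since begin()
  }

  commit(): void {
    // Shown for completeness; the test fixture above never calls it.
    if (this.pending) {
      this.committed = this.pending;
      this.pending = null;
    }
  }
}
```

Test A can begin, create rows, and roll back; test B then begins against exactly the state test A saw, which removes order dependency by construction.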
If transaction rollback is not feasible (e.g., your tests run against a remote staging database), use unique identifiers for all test-created data and clean up in `afterEach`:
```typescript
let createdProjectId: string;

test.beforeEach(async ({ request }) => {
  const res = await request.post('/api/projects', {
    data: { name: `Test-${crypto.randomUUID()}` },
  });
  createdProjectId = (await res.json()).id;
});

test.afterEach(async ({ request }) => {
  await request.delete(`/api/projects/${createdProjectId}`);
});
```
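When a test creates several resources, a single module-level variable stops scaling. One way to generalize the pattern, sketched here with a hypothetical `CleanupRegistry` helper of our own invention, is to record an undo action for every resource as it is created and run them all in `afterEach`:

```typescript
// Hypothetical cleanup registry: each test registers an undo action for
// every resource it creates, and afterEach runs them in reverse (LIFO)
// order, so dependent resources are removed before their parents.
type Cleanup = () => Promise<void> | void;

class CleanupRegistry {
  private actions: Cleanup[] = [];

  register(action: Cleanup): void {
    this.actions.push(action);
  }

  async runAll(): Promise<void> {
    // Pop in reverse order: delete children before the parents they need.
    while (this.actions.length > 0) {
      await this.actions.pop()!();
    }
  }
}
```

In a test, you would call `registry.register(() => request.delete(...))` right after each POST, and `registry.runAll()` in `afterEach`, so cleanup cannot drift out of sync with creation.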
CI-Specific Flakiness: The Environmental Factor
Tests that pass locally but fail in CI are often caused by CI-specific conditions:
```mermaid
flowchart LR
  A[Passes locally\nFails in CI] --> B{Check these first}
  B --> C[Different Node version\nnvm use / .nvmrc]
  B --> D[Different OS\npath separator bugs]
  B --> E[Fewer CPU cores\nslower JS execution]
  B --> F[No GPU / headless\nrendering differences]
  B --> G[Network latency\nslower API calls]
  B --> H[Timezone UTC\nvs local time]
```
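The "different OS" branch has a concrete, common instance: hard-coded path separators that pass on Linux runners and break on Windows. Building paths with Node's `path` module picks the platform separator for you. The file names below are made up for illustration:

```typescript
// Hard-coded separators break when CI and dev machines differ in OS.
import * as path from "node:path";

// ❌ Brittle: assumes "/" is the separator everywhere
const brittle = "fixtures/uploads/" + "avatar.png";

// ✅ Portable: joined with the current platform's separator
const portable = path.join("fixtures", "uploads", "avatar.png");

// ✅ Explicitly POSIX, for values that are URL paths rather than file paths
const urlPath = path.posix.join("fixtures", "uploads", "avatar.png");
```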
The most reliable fix for environment-related flakiness is containerizing your test runs. Using the same Docker image locally and in CI eliminates the "it works on my machine" class of failures entirely.
```dockerfile
# docker/test.Dockerfile
FROM mcr.microsoft.com/playwright:v1.51.0-jammy
WORKDIR /app
COPY package*.json ./
RUN npm ci
COPY . .
CMD ["npx", "playwright", "test"]
```
Flaky Test Quarantine: A Pragmatic Tactic
While you are working on permanent fixes, quarantine known flaky tests to prevent them from blocking deployments:
```typescript
// Add to flaky tests while the fix is in progress
test.fixme('this test is flaky - tracked in PROJ-1234', async ({ page }) => {
  // test body
});
```
Or in your CI, mark as allowed-to-fail while tracking separately:
```yaml
# .github/workflows/test.yml
- name: Run tests
  run: npx playwright test
  continue-on-error: ${{ contains(github.event.head_commit.message, '[skip-flaky-check]') }}
```
Keep a flaky test backlog in your issue tracker and review it in every sprint. Let no flaky test go unaddressed for more than two sprints.
Connecting to Production Reliability
Flaky tests are a signal about the health of both your test suite and your application. An application with a high proportion of timing-dependent test failures may also have timing-dependent production issues — race conditions, eventual consistency bugs, and state management problems.
Fixing flaky tests is therefore not just about CI stability — it is about understanding and improving the reliability of your actual application. This is why robust automated test infrastructure and production monitoring go hand in hand.
ScanlyApp's scan infrastructure is designed with flakiness resistance at its core — multiple network retries, animation-neutral screenshot capture, and intelligent assertion timing mean scan results reflect actual application state, not transient rendering conditions.
Know the difference between a flaky test and a real regression: Try ScanlyApp free and establish a stable production monitoring baseline for your most critical user flows.
The Anti-Flakiness Code Review Checklist
Add this to your PR review process to prevent new flakiness from entering the codebase:
- No hard `waitForTimeout` calls without a documented justification
- All test data is either uniquely named or cleaned up in `afterEach`
- Async operations await `waitForResponse` or specific element states
- No assertions on text that could change between renders (dynamic counts, dates)
- Network mocks are explicitly cleared between tests if using a global interceptor
- Visual screenshots disable animations
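The first checklist item can be enforced mechanically. Here is a sketch of such a check; the `// flaky-ok:` justification marker is a convention invented for this example, not a Playwright feature, and a real implementation would walk the repo and parse the AST rather than scan lines:

```typescript
// Flag every waitForTimeout call that lacks a justification comment on the
// same line or the line above. "// flaky-ok:" is a hypothetical marker.
function findUnjustifiedHardWaits(source: string): number[] {
  const lines = source.split("\n");
  const offending: number[] = [];
  lines.forEach((line, i) => {
    if (!line.includes("waitForTimeout(")) return;
    const prev = i > 0 ? lines[i - 1] : "";
    if (line.includes("// flaky-ok:") || prev.includes("// flaky-ok:")) return;
    offending.push(i + 1); // 1-based line numbers for reporting
  });
  return offending;
}
```

Wired into CI as a pre-merge step, this turns the checklist item from a reviewer's memory task into a failing build.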
Summary: The Flakiness Remediation Workflow
1. MEASURE → Track flakiness rate per test over 14 days
2. TRIAGE → Sort by flakiness rate; focus on top 10
3. ISOLATE → Run each flaky test 20x in isolation to confirm
4. TRACE → Use Playwright Trace Viewer to find exact failure moment
5. CATEGORIZE → Which of the 8 root causes applies?
6. FIX → Apply the matched remediation
7. VERIFY → Run 20x again; confirm fix
8. MONITOR → Re-check flakiness dashboard in 2 weeks
Flaky tests are not random. They have causes, and causes have solutions. The teams with the most reliable CI/CD pipelines are not the ones who got lucky — they are the ones who treated flakiness as a first-class engineering problem worth solving systematically.
Further Reading
- Playwright Trace Viewer: Step-by-step visual debugger for dissecting exactly what happened during a failed test run
- Playwright Test Retries: Official documentation on configuring `retries`, `maxFailures`, and understanding retry behavior in CI
- Playwright Flaky Tests Guide: Best practices for avoiding non-determinism in test suites
- GitHub Actions — Rerunning Failed Jobs: How to configure GitHub CI to automatically rerun only failed jobs
Related articles: Also see the complete playbook for identifying every category of flaky test, how parallel execution can both expose and eliminate flakiness, and implementing continuous testing that surfaces instability early.
Want to add production monitoring that filters out the noise of network blips? Set up a ScanlyApp scan with smart retry logic on your critical paths and know instantly when a real regression hits.
