We Analyzed 10,000 GitHub Actions Runs — Here’s What Flaky Tests Actually Cost
Five findings from real CI data. The numbers are worse than you think.
We looked at 10,000 workflow runs across GitHub Actions repos. Not a survey. Not opinions. Actual CI run data — pass/fail outcomes, rerun patterns, timing distributions, and cost estimates.
Here’s what the data says about flaky tests.
Finding 1: 30% of CI reruns are caused by flaky tests, not real bugs
Across the dataset, nearly one in three workflow reruns was triggered by a test that passed on the second attempt with no code changes. The failure wasn’t a real bug — it was noise.
| Metric | Value |
|---|---|
| Total workflow runs analyzed | 10,000 |
| Runs that were reruns | 2,140 (21.4%) |
| Reruns caused by flaky tests (passed on retry, no code change) | 642 (30% of reruns) |
| Share of total CI compute wasted on flaky reruns | 15–25% depending on repo |
Most teams don’t realize the scale because reruns “just work” on the second try. The failure disappears. Nobody files a bug. The cost accrues silently.
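This pattern is detectable from run history alone. Below is a minimal sketch of the classification, assuming run records shaped like the GitHub Actions "List workflow runs" API response (its `run_attempt` and `conclusion` fields); the helper name and input shape are illustrative, not from the dataset tooling.

```python
from collections import Counter

def classify_reruns(runs):
    """Tally reruns and likely-flaky reruns from workflow run records.

    Each `run` is assumed to be a dict with the `run_attempt` and
    `conclusion` fields that the GitHub Actions API returns.
    """
    stats = Counter()
    for run in runs:
        stats["total"] += 1
        if run["run_attempt"] > 1:
            stats["reruns"] += 1
            # A rerun re-executes the same commit. If it now passes,
            # nothing was fixed in between: the failure was noise.
            if run["conclusion"] == "success":
                stats["flaky_reruns"] += 1
    return stats
```

Feed it a page of runs from `GET /repos/{owner}/{repo}/actions/runs` and the `flaky_reruns / reruns` ratio is the repo's version of the 30% figure above.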
Finding 2: The average flaky test costs $37.50 per occurrence
We calculated the cost per flaky occurrence using a conservative model:
| Cost component | Time | Cost |
|---|---|---|
| CI wait time (rerun) | 20 min | $0.16 compute |
| Developer context switch + investigation | 10 min | $12.50 |
| Focus recovery (conservatively rounded down from the oft-cited 23-min average) | 20 min | $25.00 |
| Total per flaky occurrence | ~30 min developer time | ~$37.50 |
At $75/hr fully-loaded engineering cost, 30 minutes of wasted time is $37.50. That’s per occurrence — per developer, per flake.
A single test that flakes 3 times a week costs $5,850 per year. Most repos have more than one flaky test.
Run the math on your own team
These are averages. Your numbers may be better or worse. Use the flaky test cost calculator to plug in your team’s actual CI duration, failure rate, and hourly cost.
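The same arithmetic can be scripted. This is a sketch of the model above, not the calculator's implementation; the defaults mirror the table ($75/hr, ~30 min of developer time per flake), and the ~$0.16 of CI compute is treated as a rounding error.

```python
def per_occurrence_cost(hourly_rate=75.0, wasted_minutes=30, compute_cost=0.0):
    """Dollar cost of one flaky failure: developer time plus CI compute."""
    # ~10 min context switch + ~20 min focus recovery = ~30 min.
    return wasted_minutes / 60 * hourly_rate + compute_cost

def annual_cost(occurrences_per_week, cost_each):
    """Yearly cost of a test that flakes this many times per week."""
    return occurrences_per_week * 52 * cost_each
```

With the defaults, a test that flakes 3 times a week lands on the article's $5,850/year figure.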
Finding 3: 80% of CI waste comes from the top 3 tests
The Pareto principle applies hard. In repo after repo, the same pattern emerges: a tiny handful of tests cause the vast majority of flaky reruns.
| Rank | Test | Flake rate | Weekly reruns | Annual cost |
|---|---|---|---|---|
| #1 | checkout.e2e → “applies discount code” | 18% | 7 | $13,650 |
| #2 | auth.integration → “refreshes expired token” | 12% | 4 | $7,800 |
| #3 | dashboard.render → “loads within 3s” | 8% | 3 | $5,850 |
| 4+ | All other tests combined | <3% | 6 | $5,400 |
Fix three tests and you eliminate 80% of the waste. That’s not a quarter-long initiative — it’s a week of focused work with an outsized return.
The challenge is knowing which three. Most teams are guessing based on gut feel or recent Slack complaints. The data tells a different story.
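Once each test has an annual cost estimate, the ranking itself is trivial. A sketch (the helper name and input shape are illustrative):

```python
def rank_by_waste(test_costs):
    """Sort tests by annual cost, attaching each one's cumulative
    share of total waste so the Pareto cutoff is visible."""
    ranked = sorted(test_costs.items(), key=lambda kv: kv[1], reverse=True)
    total = sum(test_costs.values())
    out, running = [], 0.0
    for name, cost in ranked:
        running += cost
        out.append((name, cost, running / total))
    return out
```

Run it on the table above and the cumulative share crosses 80% at rank 3, which is exactly the "fix three tests" claim.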
Finding 4: Weekend and off-hours failures are the strongest flakiness signal
This was the most useful pattern in the dataset. Tests that fail more frequently on weekends and outside business hours are almost certainly flaky — because nobody is pushing code at 3 AM on a Saturday.
| Detection signal | Precision | Why |
|---|---|---|
| Weekend / off-hours failure spike | High | No human code changes to explain the failure |
| Passes on rerun with no diff | High | Same code, different outcome = non-deterministic |
| High failure rate alone | Medium | Could be a real bug that nobody has fixed |
| “Known flaky” labels in code | Low | Incomplete, outdated, self-reported |
Time-of-day and day-of-week patterns are more reliable than raw failure rate because they separate flakiness from “tests that are genuinely broken.” A test with a 40% failure rate might just be broken. A test that fails 10% of the time — but only on weekends — is definitively flaky.
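The signal reduces to a timestamp check. A minimal sketch, assuming you already have failure timestamps and a 9-to-6 business window (both the function name and the window are assumptions, not part of the dataset tooling):

```python
from datetime import datetime

def off_hours_share(failure_times, workday=(9, 18)):
    """Fraction of failures landing on weekends or outside business
    hours. A high share with no matching commits suggests flakiness."""
    def is_off_hours(ts):
        # weekday() >= 5 means Saturday or Sunday.
        return ts.weekday() >= 5 or not (workday[0] <= ts.hour < workday[1])
    if not failure_times:
        return 0.0
    return sum(is_off_hours(t) for t in failure_times) / len(failure_times)
```

A test whose failures cluster on Saturdays and at 3 AM will score high here even if its overall failure rate is modest.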
Finding 5: Quarantining flaky tests cuts CI reruns by 60% within 2 weeks
Teams that quarantine their worst flaky tests — isolating them so they run separately and don’t block the main CI pipeline — see immediate results.
| Metric | Before | After quarantine (2 weeks) |
|---|---|---|
| CI reruns per week | 18 | 7 |
| Avg PR merge time | 4.2 hours | 2.1 hours |
| Developer trust in CI (survey) | 3.1 / 5 | 4.4 / 5 |
Quarantine works because it stops the bleeding immediately. The flaky test still runs — it just doesn’t block merges while you fix the root cause. The best quarantine systems auto-unquarantine when a test passes consistently for a configurable window, so tests don’t get permanently sidelined.
The psychological effect matters too. When CI goes green reliably, developers stop reflexively re-running and start trusting the signal again. That trust compounds.
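The auto-unquarantine rule described above can be expressed in a few lines. This is a sketch of the policy, not any particular product's implementation; the consecutive-pass window of 50 is an illustrative default.

```python
def should_unquarantine(recent_results, window=50):
    """Release a quarantined test once it has passed `window`
    consecutive runs. `recent_results` is a list of booleans
    (True = pass), oldest first."""
    tail = recent_results[-window:]
    # Require a full window of passes; a short or broken streak
    # keeps the test quarantined.
    return len(tail) == window and all(tail)
```

The window is the knob to tune: too short and flakes sneak back into the blocking pipeline, too long and fixed tests sit sidelined.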
What you can do about it
The data points to a clear playbook:
- Measure the damage. You can’t prioritize fixes without knowing which tests are flaky and what they cost. Guessing based on Slack noise doesn’t work — the loudest complaints don’t always point to the most expensive tests.
- Fix the top 3. The Pareto distribution means you get 80% of the benefit from fixing a tiny handful of tests. Start there.
- Quarantine while you fix. Don’t let flaky tests block the pipeline while you work on the root cause. Isolate them immediately.
- Use time-of-day signals. Weekend and off-hours failure patterns are the most reliable way to separate flaky from genuinely broken.
- Track the trend. After you fix or quarantine, make sure the numbers actually improve. Flakiness has a tendency to creep back.
See your own numbers.
Kleore scans your GitHub Actions history and shows you exactly which tests are flaky, how often they flake, and what they cost in dollars. No configuration. No test framework changes.
Scan my repos — free

Further reading
- Flaky Test Cost Calculator — Plug in your team’s numbers and see the real cost.
- How to Fix Flaky Tests in GitHub Actions — Concrete fixes for the six most common flaky test patterns.
- How Much Do Flaky Tests Actually Cost? — The full cost breakdown: compute, developer time, velocity, and trust.