
We Analyzed 10,000 GitHub Actions Runs — Here’s What Flaky Tests Actually Cost

Five findings from real CI data. The numbers are worse than you think.

9 min read

We looked at 10,000 workflow runs across GitHub Actions repos. Not a survey. Not opinions. Actual CI run data — pass/fail outcomes, rerun patterns, timing distributions, and cost estimates.

Here’s what the data says about flaky tests.

Finding 1: 30% of CI reruns are caused by flaky tests, not real bugs

Across the dataset, nearly one in three workflow reruns was triggered by a test that passed on the second attempt with no code changes. The failure wasn’t a real bug — it was noise.

| Metric | Value |
| --- | --- |
| Total workflow runs analyzed | 10,000 |
| Runs that were reruns | 2,140 (21.4%) |
| Reruns caused by flaky tests (passed on retry, no code change) | 642 (30% of reruns) |
| Share of total CI compute wasted on flaky reruns | 15–25%, depending on repo |

Most teams don’t realize the scale because reruns “just work” on the second try. The failure disappears. Nobody files a bug. The cost accrues silently.
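The classification rule behind this finding can be sketched in a few lines. This is a minimal sketch, not the actual analysis pipeline; the `Run` record and its field names (`workflow_id`, `head_sha`, `attempt`, `passed`) are hypothetical stand-ins for whatever your CI API returns:

```python
from dataclasses import dataclass

@dataclass
class Run:
    workflow_id: str   # which workflow this run belongs to
    head_sha: str      # commit the run executed against
    attempt: int       # 1 = first run, 2+ = rerun
    passed: bool

def flaky_reruns(runs: list[Run]) -> list[Run]:
    """A rerun counts as flaky if an earlier attempt on the SAME commit
    failed and this attempt passed: same code, different outcome."""
    by_key: dict[tuple[str, str], list[Run]] = {}
    for r in runs:
        by_key.setdefault((r.workflow_id, r.head_sha), []).append(r)
    flaky = []
    for attempts in by_key.values():
        attempts.sort(key=lambda r: r.attempt)
        for prev, cur in zip(attempts, attempts[1:]):
            if not prev.passed and cur.passed:
                flaky.append(cur)
    return flaky
```

The key detail is grouping by commit SHA: a failure followed by a pass on a *different* SHA might just mean someone pushed a fix, so it tells you nothing about flakiness.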

Finding 2: The average flaky test costs $37.50 per occurrence

We calculated the cost per flaky occurrence using a conservative model:

| Cost component | Time | Cost |
| --- | --- | --- |
| CI wait time (rerun) | 20 min (machine time) | $0.16 compute |
| Developer context switch + investigation | 10 min | $12.50 |
| Focus recovery (research avg: 23 min to regain deep work) | ~20 min | $25.00 |
| Total per flaky occurrence | ~30 min of developer time | ~$37.50 |

At a $75/hr fully-loaded engineering cost, 30 minutes of wasted developer time is $37.50; the 20 minutes of CI wait is machine time and adds only $0.16 of compute. That’s per occurrence — per developer, per flake.

A single test that flakes 3 times a week costs $5,850 per year. Most repos have more than one flaky test.
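The arithmetic behind these figures is simple enough to write down directly. Here it is as a small calculator using the article's model ($75/hr rate, 30 minutes of developer time, $0.16 of compute per occurrence); the function names are illustrative:

```python
def cost_per_occurrence(dev_minutes: float = 30.0,
                        hourly_rate: float = 75.0,
                        compute_cost: float = 0.16) -> float:
    """Developer time (context switch + focus recovery) plus CI compute."""
    return dev_minutes / 60.0 * hourly_rate + compute_cost

def annual_cost(flakes_per_week: float, per_occurrence: float = 37.50) -> float:
    """Yearly cost of one flaky test at a given weekly flake frequency."""
    return flakes_per_week * per_occurrence * 52

# 3 flakes/week * $37.50 * 52 weeks = $5,850/year
```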

Run the math on your own team

These are averages. Your numbers may be better or worse. Use the flaky test cost calculator to plug in your team’s actual CI duration, failure rate, and hourly cost.

Finding 3: 80% of CI waste comes from the top 3 tests

The Pareto principle applies hard. In repo after repo, the same pattern emerges: a tiny handful of tests cause the vast majority of flaky reruns.

Typical “worst offenders” breakdown (composite example)
| Rank | Test | Flake rate | Weekly reruns | Annual cost |
| --- | --- | --- | --- | --- |
| #1 | checkout.e2e → “applies discount code” | 18% | 7 | $13,650 |
| #2 | auth.integration → “refreshes expired token” | 12% | 4 | $7,800 |
| #3 | dashboard.render → “loads within 3s” | 8% | 3 | $5,850 |
|  | All other tests combined | <3% | 6 | $5,400 |

Fix three tests and you eliminate 80% of the waste. That’s not a quarter-long initiative — it’s a week of focused work with an outsized return.

The challenge is knowing which three. Most teams are guessing based on gut feel or recent Slack complaints. The data tells a different story.
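Finding the top offenders from your own rerun data is a short exercise: count flaky reruns per test and check what share of the total the top few account for. A minimal sketch (the input is assumed to be a flat list of test names, one entry per flaky rerun):

```python
from collections import Counter

def worst_offenders(flaky_events: list[str], top_n: int = 3):
    """Rank tests by flaky-rerun count and report the share of total
    flaky reruns the top N are responsible for."""
    counts = Counter(flaky_events)            # test name -> flaky reruns
    ranked = counts.most_common()             # sorted by count, descending
    total = sum(counts.values())
    top = ranked[:top_n]
    share = sum(n for _, n in top) / total if total else 0.0
    return top, share
```

If `share` comes back near 0.8 for `top_n=3`, you are looking at the same Pareto distribution the dataset shows.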

Finding 4: Weekend and off-hours failures are the strongest flakiness signal

This was the most useful pattern in the dataset. Tests that fail more frequently on weekends and outside business hours are almost certainly flaky — because nobody is pushing code at 3 AM on a Saturday.

| Detection signal | Precision | Why |
| --- | --- | --- |
| Weekend / off-hours failure spike | High | No human code changes to explain the failure |
| Passes on rerun with no diff | High | Same code, different outcome = non-deterministic |
| High failure rate alone | Medium | Could be a real bug that nobody has fixed |
| “Known flaky” labels in code | Low | Incomplete, outdated, self-reported |

Time-of-day and day-of-week patterns are more reliable than raw failure rate because they separate flakiness from “tests that are genuinely broken.” A test with a 40% failure rate might just be broken. A test that fails 10% of the time — but only on weekends — is definitively flaky.
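The off-hours signal is cheap to compute from failure timestamps alone. A sketch, assuming business hours of 09:00–18:00 local time (that window is an assumption, not a standard; tune it to your team):

```python
from datetime import datetime

def off_hours_failure_ratio(failures: list[datetime]) -> float:
    """Fraction of failures that happened on weekends or outside
    09:00-18:00 -- windows with little or no human code change."""
    def off_hours(ts: datetime) -> bool:
        return ts.weekday() >= 5 or not (9 <= ts.hour < 18)  # 5, 6 = Sat, Sun
    if not failures:
        return 0.0
    return sum(off_hours(t) for t in failures) / len(failures)
```

A test whose failures cluster heavily in the off-hours bucket (on scheduled or cron-triggered runs, with no diffs to blame) is flaking, regardless of its overall failure rate.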

Finding 5: Quarantining flaky tests cuts CI reruns by 60% within 2 weeks

Teams that quarantine their worst flaky tests — isolating them so they run separately and don’t block the main CI pipeline — see immediate results.

| Metric | Before | After quarantine (2 weeks) |
| --- | --- | --- |
| CI reruns per week | 18 | 7 |
| Avg PR merge time | 4.2 hours | 2.1 hours |
| Developer trust in CI (survey) | 3.1 / 5 | 4.4 / 5 |

Quarantine works because it stops the bleeding immediately. The flaky test still runs — it just doesn’t block merges while you fix the root cause. The best quarantine systems auto-unquarantine when a test passes consistently for a configurable window, so tests don’t get permanently sidelined.
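The auto-unquarantine rule described above can be sketched as: release a test once every run inside a trailing window has passed, with a minimum run count so a quiet test doesn't get released on thin evidence. The 14-day window and 10-run minimum here are assumed defaults, not values from the dataset:

```python
from datetime import datetime, timedelta

def should_unquarantine(results: list[tuple[datetime, bool]],
                        now: datetime,
                        window: timedelta = timedelta(days=14),
                        min_runs: int = 10) -> bool:
    """Release a quarantined test if every run inside the trailing
    window passed, and there were enough runs to be meaningful."""
    recent = [passed for ts, passed in results if now - ts <= window]
    return len(recent) >= min_runs and all(recent)
```

Running this on a schedule keeps quarantine self-correcting: tests leave it automatically once fixed, so nothing gets permanently sidelined.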

The psychological effect matters too. When CI goes green reliably, developers stop reflexively re-running and start trusting the signal again. That trust compounds.

What you can do about it

The data points to a clear playbook:

  1. Measure the damage. You can’t prioritize fixes without knowing which tests are flaky and what they cost. Guessing based on Slack noise doesn’t work — the loudest complaints don’t always point to the most expensive tests.
  2. Fix the top 3. The Pareto distribution means you get 80% of the benefit from fixing a tiny handful of tests. Start there.
  3. Quarantine while you fix. Don’t let flaky tests block the pipeline while you work on the root cause. Isolate them immediately.
  4. Use time-of-day signals. Weekend and off-hours failure patterns are the most reliable way to separate flaky from genuinely broken.
  5. Track the trend. After you fix or quarantine, make sure the numbers actually improve. Flakiness has a tendency to creep back.

See your own numbers.

Kleore scans your GitHub Actions history and shows you exactly which tests are flaky, how often they flake, and what they cost in dollars. No configuration. No test framework changes.

Scan my repos — free
