Somewhere in your CI pipeline right now, there's a test that's been failing intermittently for months. You know the one. It fails; someone hits rerun. It passes; the PR merges. Nobody files a bug. Nobody fixes it. Everyone just moves on.
Eventually, one dismissed failure becomes two, becomes ten, and at some point, the team has quietly recalibrated what a red build means—not a signal worth investigating, just noise to push through.
Once that shift happens, it compounds. Developers stop reading failure logs carefully. QA leads stop triaging every red build. Real failures start getting dismissed alongside the flaky ones.
Why Flaky Tests Stop Getting Fixed
Every flaky test eventually becomes a backlog ticket. And once in the backlog, tickets for flaky tests have a way of staying there.
These tickets—“fix flaky test in checkout flow,” “investigate intermittent auth failure”—sit below the fold, underneath the features and the incidents. Sprint after sprint, they get pushed, not because the team doesn’t care, but because there’s always something more urgent.
Fixing the underlying test means reproducing an intermittent failure, tracing the root cause, and validating the fix: work that rarely fits into a sprint that's already full.
What Ignoring Flaky Tests Is Actually Costing You
The most visible cost is CI time. Every rerun is a full pipeline execution: the same compute, the same minutes, the same bill. Part of that cost is structural. Most teams run their entire test suite on every change, regardless of what changed, which means every build executes hundreds or thousands of tests that have no bearing on the code being tested. For teams running dozens of builds a day across multiple services, that waste compounds quickly.
The less visible cost is engineer time, and it compounds in ways that often don't show up in any dashboard. One CloudBees customer found that QA sign-off on a single change stretched to eight hours when the suite couldn't be trusted to give a clean result, with root-cause analysis on individual flaky failures consuming days of investigation time on top of that.
A 2024 peer-reviewed case study put a number on it: at least 2.5% of productive developer time consumed by flaky-test overhead alone—time spent investigating failures that aren't real, repairing tests that shouldn't be broken, and triaging noise instead of shipping code. At that rate, across 30 developers over five years, the overhead adds up to approximately 6,600 developer-hours—or roughly 3.75 full developer-years.
The same study found that manually investigating a failed build costs $5.67 in developer time, compared to $0.02 for an automatic rerun. That math explains exactly why rerun-by-default becomes standard practice; in the short term, it's cheaper. But if you only treat the symptom, the underlying problem continues to grow.
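The arithmetic behind those figures is easy to reproduce. The sketch below assumes roughly 1,760 productive hours per developer-year (an assumption chosen to match the totals above, not a figure taken from the study):

```python
# Back-of-the-envelope cost of flaky-test overhead.
# ASSUMPTION: 1,760 productive hours per developer-year.
HOURS_PER_DEV_YEAR = 1760

def flaky_overhead_hours(developers: int, years: float, overhead_rate: float) -> float:
    """Total developer-hours lost to flaky-test overhead."""
    return developers * years * HOURS_PER_DEV_YEAR * overhead_rate

hours = flaky_overhead_hours(developers=30, years=5, overhead_rate=0.025)
print(f"{hours:,.0f} hours ≈ {hours / HOURS_PER_DEV_YEAR:.2f} developer-years")
# → 6,600 hours ≈ 3.75 developer-years
```

A 2.5% overhead sounds small per sprint, which is exactly why it never gets prioritized; only over a multi-year horizon does it resolve into whole developer-years.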
And then there's the cost that's hardest to quantify: the bugs that get through. Real failures get dismissed as noise until something the suite flagged three sprints ago breaks in production. Production bugs carry costs that go well beyond the fix, including:
Engineering time to identify, debug, and resolve something CI should have caught
QA effort to re-verify functionality that was already tested
Customer support, service disruption, and lost trust
Features delayed while the team firefights instead of ships
Every one of these is a cost that a trustworthy test suite would have prevented.
How Teams Handle Flaky Tests, and Why None of It Works
When a flaky test shows up, there are three things teams typically do. Each makes sense in the moment. None of them solves the underlying problem.
Retry Logic
Retry logic is the most common approach, and the easiest to justify, because it works. When a flaky test fails and the build reruns and passes, the pipeline moves on with no apparent harm done. The cost per rerun is negligible.
The problem is the noise that accumulates. Every retry that succeeds makes the next failure more likely to be retried, and the one after that as well. Over time, the rerun stops being a response to an unexpected failure and becomes the default response to any failure. The signal degrades a little more each time.
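The dynamic is easy to see numerically. A hypothetical sketch: if a flaky test passes any given run with probability p, and attempts are independent, the chance that an initial run plus n retries eventually goes green is 1 − (1 − p)^(n+1). Even a badly flaky test "passes" almost every time with a retry or two:

```python
def chance_of_green(p_pass: float, reruns: int) -> float:
    """Probability that a flaky test passes at least once across the
    initial run plus `reruns` retries (independent attempts assumed)."""
    return 1 - (1 - p_pass) ** (reruns + 1)

# A test that passes only 60% of the time looks fine with two retries:
print(f"{chance_of_green(0.6, 2):.1%}")  # → 93.6%
```

That is why retry logic "works": it reliably converts a broken test into a green build, which is precisely what makes it so corrosive to the signal.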
Mark and Ignore
Quarantining a test feels like a responsible middle ground: the team acknowledges the problem without letting it block the pipeline. But a quarantined test is still a broken test. The pipeline stays green while the underlying problem goes unfixed, the suite shrinks, and the backlog ticket becomes permanent.
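In practice, quarantining often amounts to little more than a skip list. A minimal sketch (the test names are hypothetical); note that nothing in this mechanism ever shrinks the list:

```python
# Hypothetical quarantine list: failures from these tests are reported
# but no longer turn the build red.
QUARANTINED = {
    "test_checkout_flow_total",   # hypothetical flaky test
    "test_auth_token_refresh",    # hypothetical flaky test
}

def build_status(failing_tests: list[str]) -> str:
    """Green if every failing test is quarantined, red otherwise."""
    real_failures = [name for name in failing_tests if name not in QUARANTINED]
    return "red" if real_failures else "green"

print(build_status(["test_auth_token_refresh"]))                  # → green
print(build_status(["test_auth_token_refresh", "test_payment"]))  # → red
```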
Manual Triage
Manual triage is the most expensive workaround, and also the least effective at scale. A 2020 study of flaky tests across six Microsoft engineering projects found that even when researchers ran individual flaky tests 500 times each in a controlled environment, they could only reproduce the failure in 25% to 43% of cases.
If reproducing a flaky test takes 500 attempts under controlled conditions and still fails more than half the time, the average developer investigating a red build on a deadline has almost no chance of getting to the root cause. So they rerun it instead.
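Those odds can be made concrete with a simple model (an illustration assuming independent runs, not a result from the study): if at most 43% of flaky tests showed any failure within 500 controlled runs, the implied per-run failure probability is tiny, and a developer rerunning the test a handful of times has almost no chance of seeing the failure again:

```python
def per_run_failure_prob(repro_rate: float, runs: int) -> float:
    """Per-run failure probability implied by seeing at least one
    failure in `runs` attempts with probability `repro_rate`."""
    return 1 - (1 - repro_rate) ** (1 / runs)

def chance_of_repro(q: float, attempts: int) -> float:
    """Probability of at least one failure in `attempts` runs."""
    return 1 - (1 - q) ** attempts

# Best case from the study: 43% of flaky tests reproduced within 500 runs.
q = per_run_failure_prob(0.43, 500)
# A developer who reruns the failing test ten times on a deadline:
print(f"{chance_of_repro(q, 10):.1%}")  # → 1.1%
```

Under those assumptions, ten reruns give roughly a one-in-ninety chance of reproducing the failure, which is why triage-by-rerun so rarely reaches a root cause.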
Fixing the Signal, Not Just the Noise
The three workarounds have something in common: they're all ways of living with the problem rather than solving it. Retry logic absorbs the interruption. Mark-and-ignore removes the test from view. Manual triage investigates individual failures without ever asking whether the suite as a whole is still doing its job.
The reason none of them work isn't a lack of effort, but a lack of visibility. Most teams have no reliable way to know which tests in their suite are flaky, how often they fail, whether the rate is getting worse, or which failures are worth investigating versus which ones are noise. Without that picture, the only rational response to a red build is to rerun it and hope.
That changes when teams gain visibility. Flaky tests stop being invisible background problems and become something that can actually be prioritized and fixed. And when test selection is scoped to the specific change being tested, the suite stops generating noise in the first place. Failures mean something. The signal is worth trusting again.
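The idea of scoping the suite to the change can be sketched in a few lines. The mapping here is hypothetical and hand-built; real selection tools derive it from coverage history or predictive models rather than a static table:

```python
# Hypothetical map from source modules to the tests that exercise them.
TESTS_BY_MODULE: dict[str, set[str]] = {
    "checkout.py": {"test_checkout_total", "test_checkout_empty_cart"},
    "auth.py": {"test_login", "test_token_refresh"},
    "search.py": {"test_search_ranking"},
}

def select_tests(changed_files: list[str]) -> set[str]:
    """Run only the tests mapped to the files touched by this change."""
    selected: set[str] = set()
    for path in changed_files:
        selected |= TESTS_BY_MODULE.get(path, set())
    return selected

print(sorted(select_tests(["auth.py"])))
# → ['test_login', 'test_token_refresh']
```

A change to `auth.py` triggers two tests instead of the whole suite; the fewer irrelevant tests that run, the fewer opportunities an unrelated flaky test has to turn the build red.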
What Changes When Your Test Suite Is Trustworthy Again
Teams don't ignore flaky tests because they don't care. They ignore them because they've never had a clear way to see them, prioritize them, or know which ones are actually worth fixing.
CloudBees Smart Tests is an AI-driven test intelligence solution that changes that:
Flaky test detection: Test Suite Health Insights automatically surfaces which tests are unreliable and how often they fail.
AI-driven predictive test selection: Runs only the tests most likely to catch a real failure for the specific change being tested, cutting CI time without cutting coverage.
Automated triage: Groups failures by root cause and routes them to the right owner, so investigative work stops disappearing into sprint overhead.
The result is a pipeline that means something again. Builds run faster because the suite is scoped to what matters. Failures get investigated because they're worth investigating. And bugs that would have slipped through a noisy, untrusted suite get caught in CI, where fixing them costs a fraction of what they'd cost in production.
The sprint planning conversation shifts from managing noise to shipping software.