Don't Let Flaky Tests Destroy Your App

"To really understand something is to be liberated from it" - Ross Ashcroft

Why should I care so much about flaky tests?

What makes flaky tests so pernicious is that they undermine trust. A reliable test works well enough that when it fails, we divert our attention to the failure and look for a problem. Even a broken test fails consistently enough for us to ignore it. In contrast, like Aesop's fable of the Boy Who Cried Wolf, a flaky test calls for our attention needlessly, threatening our ability to take decisive action.

A productive team makes good decisions quickly enough for those decisions to be meaningful. Flaky tests waste time. If tests return unreliable results too often, team members will start to distrust the test results. If a team distrusts the test results, it will want to ignore them. If the team ignores the test results, eventually it will want to make decisions without running tests at all.

Decisions made without good information will eventually be bad decisions. Bad decisions made quickly enough to be meaningful will lead to defects released to production. More defects released to production lead to dissatisfied users. Dissatisfied users become former users. Former users don't generate revenue. No revenue, no business. As in the old proverb: for want of a test, the business was lost!

How big of a problem are flaky tests? Testers at Google have found that 16 percent of their tests have some degree of flakiness. If Google, for all its resources, hasn't been able to eliminate flaky tests, odds are you have some as well.

Treat every automated test as a hypothesis

Software tests can be automated at any level of the test pyramid. For the purposes of this article, "automated tests" will mean end-to-end UI tests, since these exercise an entire system from a perspective close to a typical user's experience.
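
To make that scope concrete, here is a minimal sketch of the kind of end-to-end UI test this article has in mind, written against Playwright's Python API. The URL, the expected title, and the test name are placeholder values for illustration, not part of any real app under test.

```python
# A minimal end-to-end UI test sketch using Playwright's Python API.
# The URL, expected title, and test name are illustrative placeholders.
from playwright.sync_api import sync_playwright

def test_homepage_shows_title():
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        page.goto("https://example.com")         # drive the app the way a user would
        assert "Example Domain" in page.title()  # check what the user would actually see
        browser.close()
```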

Most automated tests are written so that, under normal operating conditions, a passing test means the app is functioning as expected and a failing test means it is not. These are true positive and true negative test results, respectively.

| | App OK | App Is Not OK |
| --- | --- | --- |
| Test Passed | True Positive | False Positive: the test passed, but the app is not functioning as expected |
| Test Failed | False Negative: the test failed, but the app is functioning as expected | True Negative |

False positive and false negative results can occur even with reliable tests. Remember how I mentioned "normal operating conditions" above? For now, though, let us assume that our tests are running on a stable foundation.

This means that, for our purposes, flaky tests are tests that return false positive or false negative results too often to be useful.
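
One way to see whether a test fits that definition is to rerun it against unchanged code and check how often its verdict changes. The sketch below assumes a hypothetical run_test callable that returns True on a pass; it is an illustration, not a feature of any particular test runner.

```python
import random

def measure_flakiness(run_test, runs=50):
    """Rerun a test against unchanged code and report its pass rate.
    A pass rate strictly between 0 and 1 means the test is flaky:
    the code didn't change, but the verdict did."""
    passes = sum(1 for _ in range(runs) if run_test())
    return passes / runs

# Stand-in for a real test: it passes 90% of the time for no good reason.
flaky_test = lambda: random.random() < 0.9

print(f"Pass rate over 50 runs: {measure_flakiness(flaky_test):.0%}")
```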

Why not just retry flaky tests until they pass?

At some point, the idea of the "lazy developer" being a good developer became part of software engineering lore. Applied to test automation, that laziness shows up as the retry: most test runners provide the ability to automatically retry tests that fail. But should you use it?
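
For example, assuming the pytest-rerunfailures plugin is installed, the retry is one decorator away (the test name is a made-up placeholder):

```python
import pytest

# With the pytest-rerunfailures plugin, a failing test is quietly rerun
# up to three times before it is reported as a failure.
@pytest.mark.flaky(reruns=3, reruns_delay=1)
def test_checkout_flow():
    ...
```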

This is only my opinion, but just retrying a test until it changes from red to green is an antipattern that will eventually lead to much bigger problems. How do you know that a passing result is reliable? How many times do you need to run a flaky test before you can trust the results? Formally, you can use a binomial distribution to answer this, but most people can relate to the example of a coin flip.

You would probably agree that a coin flip is a random way to make a decision. Each outcome (heads or tails) should be equally likely, and the next outcome should be independent of any previous outcome. However, if you flip a coin and get "heads" 100 times in a row, wouldn't you be tempted to think that the result of the next coin flip might be "heads" also? Maybe you would want to test the coin?

Applying this example to automated testing, one might be tempted to think of an automated test as a coin that is expected to only land on "heads." It may occasionally land on "tails," but one might presume that a "tails" result was caused by some transient disturbance in the underlying system (a gust of wind, perhaps) and retry the test.

Unlike coin flips, though, automated test outcomes may not be fully independent (for example, tests that depend on the test environment or on previous tests to provide their initial conditions). Additionally, we may not be able to retry a test enough times to rely on the aggregate results before a decision has to be made.
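
To put rough numbers on the retry problem, here is a small sketch of the binomial reasoning mentioned above. The pass probability is invented for illustration, not measured from any real suite.

```python
def prob_any_pass(retries, p_pass):
    """Probability of at least one pass in `retries` independent attempts,
    i.e. the complement of the binomial probability of zero passes."""
    return 1 - (1 - p_pass) ** retries

# Invented numbers: suppose a regression breaks the feature, but the flaky
# test still passes 30% of the time anyway. "Retry until green" then lets
# the regression slip through surprisingly often.
for retries in (1, 2, 3, 5):
    print(f"{retries} attempt(s): {prob_any_pass(retries, 0.30):.0%} chance of a misleading green")
```

Even with these made-up numbers, a couple of retries is enough to turn a probable failure into a probable, and misleading, pass.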

Commit to Excellence

Since we can't retry away flaky test results, we need to investigate our tests to understand why they are flaky. As "Automation Panda" Andrew Knight warns:

Test failures indicate a problem – either in test code, product code, or infrastructure.

Automated end-to-end UI tests sit at the top of the test pyramid. As a result, the apps we are testing run on top of systems and infrastructure that can fail under a variety of conditions. While each of these failures has its remedy, implementing them requires cooperation across the entire product team, in much the same way as it takes the entire team to plan and build the app.

As testers, we may be tempted to regard flaky tests as just part of the job. However, the value of automated testing increases the faster and more frequently tests are run. Rallying our teams to improve the testability of our apps and their systems will enable faster, more frequent testing. In turn, faster and more frequent testing will pay off in fewer defects, more satisfied users, and a stronger business.