Flaky Tests: Diagnosing and Fixing - CMO & CTO (An AI Generated Experiment to the past)

Flaky tests have a way of showing up right when the coffee runs out. Picture this: it is late, the pull request is merged, Jenkins goes green, and the team chat posts the celebratory emoji. Ten minutes later the same build turns red on a rerun with no code change. Same test. Same seed. Same everything. If you have a Travis badge on your repo or a Jenkins wallboard in the hallway, you know this pain. Today is about hunting these ghosts and getting your sleep back.

What makes a test flaky

A flaky test is one that sometimes fails for reasons outside of the code change under review. It erodes trust, slows merges, and makes people click rebuild like a video game. Most flakiness I see in JUnit falls into a few buckets:

Timing and async. Sleep-based tests that guess when a thread or future will finish.
Concurrency. Shared mutable state, races, missing fences, or reliance on thread scheduling.
Order dependence. Tests pass alone but fail when run after a sibling that left state behind.
External calls. Network, file system, databases, or the system clock. Any slow or unreliable dependency leaks into tests.
Randomness. Data generated with new Random() with no fixed seed.
Environment. Time zone, locale, CPU speed, file permissions, or differing JVM settings on CI.
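The first bucket is the easiest to picture in code. Here is a minimal sketch, using only the JDK, of a test helper that waits on a signal from the background work instead of guessing with a sleep. The class and method names (LatchWaitDemo, runAndAwait) are made up for illustration, and the Thread.sleep inside the worker only simulates work of unknown duration.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical example: a worker publishes a result on a background thread.
class LatchWaitDemo {

    // Waits for the worker to signal completion instead of sleeping a guessed amount.
    // Returns the published value, or throws if the timeout elapses first.
    static String runAndAwait(long workMillis, long timeoutMillis) {
        AtomicReference<String> result = new AtomicReference<>();
        CountDownLatch done = new CountDownLatch(1);
        ExecutorService pool = Executors.newSingleThreadExecutor();
        try {
            pool.submit(() -> {
                try {
                    Thread.sleep(workMillis); // simulated work of unknown duration
                    result.set("ready");
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                } finally {
                    done.countDown(); // signal the waiting test thread
                }
            });
            // The timeout is an upper bound, not a guess: a fast worker returns immediately.
            if (!done.await(timeoutMillis, TimeUnit.MILLISECONDS)) {
                throw new IllegalStateException("worker did not finish within " + timeoutMillis + " ms");
            }
            return result.get();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting", e);
        } finally {
            pool.shutdownNow();
        }
    }

    public static void main(String[] args) {
        System.out.println(runAndAwait(50, 5_000));
    }
}
```

The generous timeout costs nothing on a fast machine and buys headroom on a loaded CI agent, which is the opposite trade from a fixed sleep.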

How to diagnose like a practitioner

Start by making the failure boring. You want the same failure every time on your laptop. Here is a short playbook that works in Maven and Gradle with JUnit 4.

Run it many times. Loop the single test method locally. On CI, configure a one-time rerun to collect logs. Maven Surefire can rerun failing tests with rerunFailingTestsCount. Do not rely on reruns as a permanent fix.
One fork, clean JVM. Disable fork reuse when chasing flakiness (in Surefire, reuseForks=false) so static state left behind in a reused JVM does not hide the bug. Parallel test execution can mask or trigger races, so toggle it intentionally.
Randomize order. Let the build run tests in a different order to reveal hidden coupling. Surefire supports runOrder=random. Keep a log of the order when a failure appears.
Shake the environment. Switch locale and time zone. Set the clock to near midnight or month end. Run on a busy machine to slow things down. A lot of date formatting and rounding bugs show up here.
Remove guessing. Replace sleeps with waits that react to a condition. Awaitility is great for this. Poll until a latch or state is ready instead of waiting an arbitrary number of milliseconds.
Freeze time. Use a clock you can control. In Java 8 and later, pass a java.time.Clock to your code and provide a fixed clock in tests.
Control randomness. Seed your generators with a known value and print the seed on failure. When a failure hits, you can replay the same data set.
Clean setup and teardown. Use JUnit @Before and @After to create and dispose resources every time. TemporaryFolder keeps files isolated. Close sockets and streams. Avoid singletons in tests.
Stub the world. Tests should not depend on WiFi, DNS, or a staging database. Use fakes and mocks. Keep true integration checks in their own suite.
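Two of the steps above, freezing time and controlling randomness, come down to the same move: inject the source instead of reading it globally. The sketch below assumes a made-up class under test (CouponIssuer) and an arbitrary seed and date; only java.time.Clock and java.util.Random are real APIs here.

```java
import java.time.Clock;
import java.time.Instant;
import java.time.LocalDate;
import java.time.ZoneId;
import java.time.ZoneOffset;
import java.util.Locale;
import java.util.Random;

// Hypothetical class under test: time and randomness are injected, never read globally.
class CouponIssuer {
    private final Clock clock;
    private final Random random;

    CouponIssuer(Clock clock, Random random) {
        this.clock = clock;
        this.random = random;
    }

    // Coupons expire on the last day of the current month in the clock's zone.
    LocalDate expiryDate() {
        LocalDate today = LocalDate.now(clock);
        return today.withDayOfMonth(today.lengthOfMonth());
    }

    // Six-digit code drawn from the injected generator; Locale.ROOT keeps digits stable.
    String nextCode() {
        return String.format(Locale.ROOT, "%06d", random.nextInt(1_000_000));
    }
}

class CouponIssuerDemo {
    public static void main(String[] args) {
        long seed = 42L; // arbitrary; print it so a failure can be replayed
        Instant nearMidnight = Instant.parse("2016-02-29T23:59:59Z");

        // Same instant, two zones: the calendar day (and the month) differs.
        CouponIssuer utc = new CouponIssuer(Clock.fixed(nearMidnight, ZoneOffset.UTC), new Random(seed));
        CouponIssuer tokyo = new CouponIssuer(Clock.fixed(nearMidnight, ZoneId.of("Asia/Tokyo")), new Random(seed));

        System.out.println("seed=" + seed + " utc expiry=" + utc.expiryDate() + " code=" + utc.nextCode());
        System.out.println("seed=" + seed + " tokyo expiry=" + tokyo.expiryDate() + " code=" + tokyo.nextCode());
    }
}
```

With the clock pinned one second before midnight UTC on a leap day, the UTC issuer expires coupons on 2016-02-29 while the Tokyo issuer, already in March, expires them on 2016-03-31. That is exactly the kind of month-end surprise that only shows up on a CI box in another time zone. Both issuers print the same code because they share a seed.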

On CI, capture more context. Thread dumps on timeout, full stack traces, and test order output make triage faster. Jenkins has a Flaky Test Handler plugin to surface repeat offenders while you fix them, but treat it like a cast, not a lifestyle.
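For the thread dump on timeout, here is a rough sketch using only the JDK. The names (TimeoutDiagnostics, runOrDump, the "test-under-watch" thread name) are invented for this example; a real suite would more likely wire the same idea into its test runner or the CI job's timeout handling.

```java
import java.util.Map;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

class TimeoutDiagnostics {

    // Renders the stack of every live thread as text, suitable for a CI log.
    static String dumpAllThreads() {
        StringBuilder out = new StringBuilder();
        for (Map.Entry<Thread, StackTraceElement[]> e : Thread.getAllStackTraces().entrySet()) {
            Thread t = e.getKey();
            out.append('"').append(t.getName()).append("\" state=").append(t.getState()).append('\n');
            for (StackTraceElement frame : e.getValue()) {
                out.append("    at ").append(frame).append('\n');
            }
        }
        return out.toString();
    }

    // Runs the task on a named thread. Returns null if it finished in time,
    // or a dump of all threads if the timeout elapsed first.
    static String runOrDump(Runnable task, long timeoutMillis) {
        ExecutorService pool = Executors.newSingleThreadExecutor(r -> {
            Thread t = new Thread(r, "test-under-watch");
            t.setDaemon(true); // a stuck task must not keep the JVM alive
            return t;
        });
        Future<?> future = pool.submit(task);
        try {
            future.get(timeoutMillis, TimeUnit.MILLISECONDS);
            return null;
        } catch (TimeoutException e) {
            return dumpAllThreads(); // captured before the stuck thread is interrupted
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("task failed", e.getCause());
        } finally {
            pool.shutdownNow();
        }
    }
}
```

The point of taking the dump before interrupting anything is that the log shows where the test was actually stuck, not where it landed after cleanup.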

What your team should do about it

Flaky tests are a people problem as much as a code problem. A red build that is not trusted slows every review. It turns release day into roulette. Set a few ground rules:

Green means green. If the build is red, somebody owns it. No merges to main until it is fixed or quarantined.
Quarantine with intent. Move known flaky tests to a nightly suite and file a ticket with an owner and a clear reason. Add a due date. Bring them back once stable.
Tag and track. Use @Category or a custom marker like @Flaky temporarily, and report the failure rate over time. Aim for zero flaky tests in the main suite.
Make it part of the Definition of Done. New code ships with deterministic tests. If a test needs sleep or network, it belongs in an integration lane.
Rotate a deflake owner. A weekly point person keeps momentum and avoids drive-by patches that hide the root cause.
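A custom marker can carry the ticket and owner along with the test, so quarantine never becomes anonymous. The sketch below is one possible shape, not a JUnit API: the @Flaky annotation, its ticket and owner fields, the stand-in test class, and the ticket id are all assumptions made up for illustration.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.Method;
import java.util.ArrayList;
import java.util.List;

// Hypothetical marker: records why a test is quarantined and who owns the fix.
@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.METHOD)
@interface Flaky {
    String ticket();
    String owner();
}

// Stand-in for a test class; the method names and ticket id are made up.
class CheckoutTests {
    void totalsAddUp() { }

    @Flaky(ticket = "BUILD-123", owner = "alice")
    void sendsReceiptEmail() { }
}

class FlakyReport {
    // Lists every quarantined method on a class with its ticket and owner.
    static List<String> describe(Class<?> testClass) {
        List<String> lines = new ArrayList<>();
        for (Method m : testClass.getDeclaredMethods()) {
            Flaky flaky = m.getAnnotation(Flaky.class);
            if (flaky != null) {
                lines.add(testClass.getSimpleName() + "." + m.getName()
                        + " ticket=" + flaky.ticket() + " owner=" + flaky.owner());
            }
        }
        return lines;
    }

    public static void main(String[] args) {
        describe(CheckoutTests.class).forEach(System.out::println);
    }
}
```

Because ticket and owner have no defaults, nobody can tag a test as flaky without saying who is on the hook for it. A nightly job could print this report so the quarantine list stays visible.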

Your challenge for the week

Ready to get your CI heartbeat steady? Work through this list:

Pull the last ten red builds and list the top three flaky tests by name and failure rate.
Quarantine them into a nightly run and create tickets with owners and context links to failing logs.
For each, remove sleeps, seed randomness, and inject a controllable clock. Replace external calls with fakes.
Turn on random test order for a day and fix any order-dependent tests you find.
Add a pre-merge job that runs the main suite twice. It is cheap insurance and catches hidden flakiness fast.
Post a short write-up in the repo about your test rules so new contributors do not reintroduce the same patterns.
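For the first item on the list, ranking by failure rate is simple arithmetic once the results are in one place. This is a sketch under two assumptions of mine: the per-build outcomes have already been pulled into a list of records (the Run type here is invented), and a test counts as flaky only if it both passed and failed across those builds, since one that always fails is broken rather than flaky.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

class FlakeRanking {

    // One test outcome from one build. The shape is illustrative.
    static final class Run {
        final String testName;
        final boolean passed;

        Run(String testName, boolean passed) {
            this.testName = testName;
            this.passed = passed;
        }
    }

    // Returns up to `limit` entries like "name failures/total", highest failure rate first.
    // Tests that never failed or never passed are left out (assumption: not flaky).
    static List<String> topFlaky(List<Run> runs, int limit) {
        Map<String, int[]> counts = new LinkedHashMap<>(); // name -> {failures, total}
        for (Run r : runs) {
            int[] c = counts.computeIfAbsent(r.testName, k -> new int[2]);
            if (!r.passed) {
                c[0]++;
            }
            c[1]++;
        }

        List<Map.Entry<String, int[]>> mixed = new ArrayList<>();
        for (Map.Entry<String, int[]> e : counts.entrySet()) {
            int[] c = e.getValue();
            if (c[0] > 0 && c[0] < c[1]) {
                mixed.add(e);
            }
        }

        // Stable sort: ties keep the order in which tests were first seen.
        mixed.sort(Comparator.comparingDouble(
                (Map.Entry<String, int[]> e) -> (double) e.getValue()[0] / e.getValue()[1]).reversed());

        List<String> out = new ArrayList<>();
        for (int i = 0; i < Math.min(limit, mixed.size()); i++) {
            int[] c = mixed.get(i).getValue();
            out.add(String.format(Locale.ROOT, "%s %d/%d", mixed.get(i).getKey(), c[0], c[1]));
        }
        return out;
    }
}
```

Ten builds is a small sample, so treat the ranking as a place to start looking, not a verdict.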

Clean tests are not about purity. They are about speed. The faster you can trust a signal from JUnit on Jenkins or Travis, the faster you ship. That is the whole point.
