Testing eventual consistent systems? Settle down

9 min readMar 20, 2024

Have you ever seen a Pull/Merge Request to a test like this ?

Stabilize test with sleep?

A typical time sensitive test. Time sensitive sets are pain — they are unstable, something slows down and the test fail randomly, you increase the sleep time and then you realize that the test is wasting time sleeping, you lower the sleep time and the tests starts failing again.

The ugly

Alright, so there is a time-sensitive test. Although it is a bad practice for a number of reasons, I can accept that there are some “legal” time-sensitive tests. Typical case in my opinion is testing some expiration mechanism:

    public function testJobWatcherExpiration(): void
    {
        $watcher = new DispatchedJobsWatcher(1); // expiration seconds

        $job1 = $this->createJobMock();

        $watcher->markJobDispatched($job1);

        // before expiration threshold - job is filtered
        $jobsList = $watcher->filterRecentlyDispatchedJobs([$job1]);
        self::assertCount(0, $jobsList);

        // after the expiration threshold - job is not filtered 
        //    (i.e. it can be dispatched again)
        sleep(2);
        $jobsList = $watcher->filterRecentlyDispatchedJobs([$job1]);
        self::assertCount(1, $jobsList);
    }

The DispatchedJobsWatcher class has an internal cache that ensures $job1 is not re-dispatched until a specified timeout elapses.

There is a couple of options how the class could be designed better (here “better” meaning better testable). For example in Java I would let the DispatchedJobsWatcher accept the Clock as dependency and supply mock clock in the test (or a shorter version). The above code is in PHP and there is no built in clock class. Adding a custom one would mean that I have to test that as well. It’s a lot more lines of code whose business value is questionable.

The code above is a trade-off between clean code, right code and overall simplicity. In this case I choose simplicity, because the tests only takes 2 seconds (and the test suite runs for half an hour, so there is not much harm in this).

In either case the timing of the test is not a problem. The test is perfectly stable with sleep(2) and there is no need to make pull requests like the one at the beginning.

I could not deprive you of this great but fake poster. The sad thing is that it is completely unrelated to this article.

The Bad

Then there are “illegal” time-sensitive tests. The pull request mentioned at the beginning comes from a test like this:

    public function testImportBasic(): void
    {
        $outputs = $this->getUploader()->runImport(self::BASIC_CONFIG, 10);
        sleep(3);

        $storageClient = $this->createClient();
        $foundFiles = $storageClient->listStorageFiles('tags:(artifact)', 10);

        usort($foundFiles, fn ($a, $b) => strcmp($a['name'], $b['name']));
        self::assertSame($foundFiles[0]['name'], 'first-file');
        self::assertSame($foundFiles[1]['name'], 'second-file');
        self::assertSame($foundFiles[9]['name'], 'last-file');
    }

Shamefully I have to admit that I wrote a test like this many times. And this is an entirely different story. If you ever worked with Elastic or a similar non-ACID database (etcd is one we hit often), you probably recognize the problem at first glance — you insert a record, you can even get that record immediately, but searching/listing that record is eventualy consistent. Time sensitivity of the test here is a direct result of the of eventual consistency.

I hear the shouting — such tests are not according to best practices for tests! — So let’s review some of these:

Write simple tests
Write fast tests
Test one scenario per test
Name your tests properly
Don’t duplicate production logic
Write isolated deterministic tests
Don’t overuse mocks and stubs
Execute test suite often and automatically
Avoid test interdependence
Only test the public behavior

The test si neither simple, nor fast. It does not probably contain one scenario and is definitively not isolated and not deterministic. As bonus, it creates test interdependence (more on that later).

One might argue that the above mentioned best practices are too zealous — I prefer these:

Fast: It isn’t uncommon for mature projects to have thousands of unit tests. Unit tests should take little time to run. Milliseconds.
Isolated: Unit tests are standalone, can be run in isolation, and have no dependencies on any outside factors such as a file system or database.
Repeatable: Running a unit test should be consistent with its results, that is, it always returns the same result if you don’t change anything in between runs.
Self-Checking: The test should be able to automatically detect if it passed or failed without any human interaction.
Timely: A unit test shouldn’t take a disproportionately long time to write compared to the code being tested. If you find testing the code taking a large amount of time compared to writing the code, consider a design that is more testable.

Again, the test in question is neither Fast, not Isolated nor Repeatable. Regardless of which set of best practices we pick, there are certainly some that we break.

The thing is — the test in question is not a unit test! In my work I integrate many different APIs and some of them are eventualy consistent or perhaps even weakly consistent. At some point we need to add functional tests (one might even call them integration tests). Here I’m talking strictly about the functional tests.

Apart from the obvious create-list problem, you get all kinds of side effects — it creates hidden dependencies between tests. For example in a setup method — you’re preparing a test fixture — list some records, delete them and create a new ones in the database. Bam! In the test, there are more records than you created, because the list in the setup method ran before the database reached consistency (thus the fixture was not perfect). The same applies to test teardown. The same applies transitively to methods relying on the list. E.g.

public function testDeleteOutdatedRecords(): void
{
  createOutdatedRecord();
  createOutdatedRecord();
  $deletedRecords = deleteOutdatedRecords();
  self::assertCount(2, $deletedRecords); 
  // fails because deleteOutdatedRecords uses list internally
}

So now we’re even breaking another rule — we no longer test the public behavior, because we have to be very aware of the internal implementation.

I hear the guy in the back — “go and refresh the Elastic index to reach consistency”. How? We’re testing against an API that is backed up by eventual consistent database. I just happen to know it is backed up by Elastic, I do not have direct access to it.

“So let’s add a refresh-index endpoint!” Well, that’s the worst idea of them all. The API is eventual consistent for a reason, adding an endpoint to essentially make it ACID is no good.

This is the real poster. Though it is still completely unrelated to this article.

The Good

Let’s backup a bit, the only reason we’re doing all this is because we really need to wait. The thing that is stupid is to wait a static amount of time. The root cause of the time sensitivity is eventual consistency. So let’s wait for the system to reach the consistent state, which will happen eventually (not within a set amount of time). Let’s wait for Eventually.

That’s how Settle was born. Settle is a function I wrote years ago and then spilled it all over our tests. Essentially it’s this:

    public function settle(callable $comparator, callable $getCurrentValue)
    {
        $attempt = 1;
        while (true) {
            $currentValue = $getCurrentValue();
            $valueMatches = $comparator($currentValue);

            if ($valueMatches) {
                return $currentValue;
            }

            if ($attempt >= $this->maxAttempts) {
                throw new RuntimeException('Failed to settle condition');
            }

            sleep((int) min(2 ** $attempt, $this->maxAttemptsDelay));
            $attempt++;
        }
    }

Full code available in Github. Now I can then rewrite the test with Settle:

public function testImportBasic(): void
{
  $outputs = $this->getUploader()->runImport(self::BASIC_CONFIG, 10);

  $storageClient = $this->createClient();
  $this->settle(
    10,
    function() use ($storageClient) {
      return count($storageClient->listStorageFiles('tags:(artifact)', 10));
    },
  ));

  $foundFiles = $storageClient->listStorageFiles('tags:(artifact)', 10);
  usort($foundFiles, fn ($a, $b) => strcmp($a['name'], $b['name']));
  self::assertSame($foundFiles[0]['name'], 'first-file');
  self::assertSame($foundFiles[1]['name'], 'second-file');
  self::assertSame($foundFiles[9]['name'], 'last-file');
}

The test calls the listStorageFiles method to list up to 10 files and it expects to get exactly 10 files. We simply wait until that it is true and then carry on with the test.

My colleagues were reserved to this for some time — it doesn’t look right in the tests. The rejection can be sort of divided into two arguments:

we’re programming in test,
and it’s against best practices.

Solving the first one is easy — we just move the function into a library. What’s the difference between calling self::assert and self::settle ?

The latter argument is more deeper. If we put this in the test it becomes a written proof that the test is not isolated, not deterministic, and if feels like we acknowledge that we’re not following the best practices. With the sleep inside the test, the test looked like that it followed best practices.

But then again these are not unit tests. What is confusing is that we run them the same way (with phpunit or pyunit or junit). Sometimes even in the same test suite (nobody’s perfect). Requiring isolation on a an integration test is non-sense. Speed is determined by the speed of the systems integrated not solely by the speed of the system under test. So the only virtue that remains important and feasible is Repeatability.

AAAA

To reach repeatability, we need to reach (eventual) consistency.

I could’ve written this sentence at the beginning and make the article shorter, but then you would miss the story. Still something feels wrong about the Settle function — It breaks the common AAA pattern of tests.

Arrange, Act, Assert is a common pattern when unit testing. As the name implies, it consists of three main actions:

Arrange objects, create and set them up as necessary.
Act on an object.
Assert that something is as expected.

All of that is true, until you realize, acknowledge and embrace weak or eventual consistency. When we are testing a weak or eventually consistent system, we need to move from AAA to AAAA.

Arrange objects, create and set them up as necessary.
Act on an object.
Achieve eventual consistency.
Assert that something is as expected.

Achieve eventual consistency could’ve better named “Await eventual consistency”, but then the “await” word is sort of already taken. Anyway, embracing the AAAA approach makes test structure clear again. However it is not completely trivial to do it right (neither is AAA to be fair). Consider the following version of the test:

public function testImportBasic(): void
{
  $outputs = $this->getUploader()->runImport(self::BASIC_CONFIG, 10);

  $storageClient = $this->createClient();
  $foundFiles = $storageClient->listStorageFiles('tags:(artifact)', 10);
  self::assertNotNull($foundFiles);

  $this->settle(
    'first-file',
    function() use ($storageClient) {
      return $storageClient->listStorageFiles('tags:(artifact)', 10)[0]['name'];
    },
  ));

  $foundFiles = $storageClient->listStorageFiles('tags:(artifact)', 10);
  usort($foundFiles, fn ($a, $b) => strcmp($a['name'], $b['name']));
  self::assertSame($foundFiles[1]['name'], 'second-file');
  self::assertSame($foundFiles[9]['name'], 'last-file');
}

This is wrong on a couple of levels — first it does not really work, because the listStorageFiles method does not order the results (but that’s an implementation detail). But most importantly, in this case the Settle just replaces a random assert. Also the AAAA structure is broken, because there are assertions before achieving the eventual consistency. This makes the test unstable and convoluted.

The important part is that the eventual consistency condition is not an assertion. It looks sort of the same (it’s a condition), but servers different purpose and eventually (pun intended!) should not be confused. The settle condition should not fail. It can only fail due to a timeout and the timeout is there just for technical reasons so that the tests do not freeze.

The runImport method creates 10 files. The state of eventual consistency means that the listStorageFiles method returns 10 files. That’s the condition to wait for, that is the definition of the consistent state. Waiting for anything else either makes the test unstable and requires more waiting conditions.

Until now i have mostly assumed an eventually consistent system — more specifically a CALM/ACID 2.0 system. The above tests assume the logical monotonicity. However with Settle we can still handle even non-monotonic systems, e.g.:

public function testImportBasic(): void
{
  $outputs = $this->getUploader()->runImport(self::BASIC_CONFIG, 10);
  $storageClient = $this->createClient();

  $foundFiles = this->settle(
    comparator: fn($v) => count($v) === 10,
    function() use ($storageClient) {
      return $storageClient->listStorageFiles('tags:(artifact)', 10);
    },
  ));

  usort($foundFiles, fn ($a, $b) => strcmp($a['name'], $b['name']));
  self::assertSame($foundFiles[1]['name'], 'second-file');
  self::assertSame($foundFiles[9]['name'], 'last-file');
}

In this case the Settle method returns the actual result that was valid at the time when the system reached eventual consistency (even if it was lost later on). It’s not an issue if the system does not stay in the consistent state, you just have to write the test more carefully, but it is still repeatable and reliable.

So that’s pretty much it. Remember AAAA tests and feel free to use our Settle library. Or write your own, if you re not using PHP. But remember that, testing eventual consistency is not a problem if you settle down and keep testing.

Testing eventual consistent systems? Settle down

The ugly

The Bad

The Good

AAAA

Written by Ondřej Popelka