Skip to main content

Bring Deployment distribution experiment back

10 min read
Christopher Kujawa
Chaos Engineer @ Zeebe

We encountered recently a severe bug zeebe#9877 and I was wondering why we haven't spotted it earlier, since we have chaos experiments for it. I realized two things:

  1. The experiments only check for parts of it (BPMN resource only). The production code has changed, and a new feature has been added (DMN) but the experiments/tests haven't been adjusted.
  2. More importantly we disabled the automated execution of the deployment distribution experiment because it was flaky due to a missing standalone gateway in Camunda Cloud SaaS zeebe-io/zeebe-chaos#61. This is no longer the case, see Standalone Gateway in CCSaaS

On this chaos day I want to bring the automation of this chaos experiment back to life. If I have still time I want to enhance the experiment.

TL;DR; The experiment still worked, and our deployment distribution is still resilient against network partitions. It also works with DMN resources. I can enable the experiment again, and we can close zeebe-io/zeebe-chaos#61. Unfortunately, we were not able to reproduce zeebe#9877 but we did some good preparation work for it.

Standalone Gateway in CCSaaS

4 min read
Christopher Kujawa
Chaos Engineer @ Zeebe

We recently introduced the Zeebe Standalone Gateway in CCSaaS. Today I wanted to do a first simple chaos experiment with the gateway, where we just restart one gateway.

Ideally in the future we could enable some gateway chaos experiments again, which we currently only support for helm.

TL;DR; Our Camunda Cloud clusters can handle gateway restarts without issues.

High Snapshot Frequency

4 min read
Christopher Kujawa
Chaos Engineer @ Zeebe

Today we wanted to experiment with the snapshot interval and verify that a high snapshot frequency will not impact our availability (#21).

TL;DR; The chaos experiment succeeded 馃挭 We were able to prove our hypothesis.

Handling of Big Variables

6 min read
Christopher Kujawa
Chaos Engineer @ Zeebe

New Year;:tada:New Chaos馃悞

This time I wanted to experiment with "big" variables. Zeebe supports a maxMessageSize of 4 MB, which is quite big. In general, it should be clear that using big variables will cause performance issues, but today I also want to find out whether the system can handle big variables (~1 MB) at all.

TL;DR; Our Chaos experiment failed! Zeebe and Camunda Cloud is not able to handle (per default) big variables (~1 MB) without issues.

Worker count should not impact performance

3 min read
Christopher Kujawa
Chaos Engineer @ Zeebe

In this chaos day we experimented with the worker count, since we saw recently that it might affect the performance (throughput) negatively if there are more workers deployed. This is related to #7955 and #8244.

We wanted to prove, that even if we have more workers deployed the throughput of the process instance execution should not have an negative impact.

TL;DR; We were not able to prove our hypothesis. Scaling of workers can have a negative impact on performance. Check out the third chaos experiment.

Not produce duplicate Keys

6 min read
Christopher Kujawa
Chaos Engineer @ Zeebe

Due to some incidents and critical bugs we observed in the last weeks, I wanted to spent some time to understand the issues better and experiment how we could detect them. One of the issue we have observed was that keys were generated more than once, so they were no longer unique (#8129). I will describe this property in the next section more in depth.

TL;DR; We were able to design an experiment which helps us to detect duplicated keys in the log. Further work should be done to automate such experiment and run it agains newer versions.

Throughput on big state

4 min read
Christopher Kujawa
Chaos Engineer @ Zeebe

In this chaos day we wanted to prove the hypothesis that the throughput should not significantly change even if we have bigger state, see zeebe-chaos#64

This came up due observations from the last chaos day. We already had a bigger investigation here zeebe#7955.

TL;DR; We were not able to prove the hypothesis. Bigger state, more than 100k+ process instances in the state, seems to have an big impact on the processing throughput.

Recovery (Fail Over) time

5 min read
Christopher Kujawa
Chaos Engineer @ Zeebe

In the last quarter we worked on a new "feature" which is called "building state on followers". In short, it means that the followers apply the events to build there state, which makes regular snapshot replication unnecessary and allows faster role transition between Follower-to-Leader. In this chaos day I wanted to experiment a bit with this property, we already did some benchmarks here. Today, I want to see how it behaves with larger state (bigger snapshots), since this needed to be copied in previous versions of Zeebe, and the broker had to replay more than with the newest version.

If you want to now more about build state on followers check out the ZEP

TL;DR; In our experiment we had almost no downtime, with version 1.2, the new leader was very fast able to pick up the next work (accept new commands).

Old-Clients

3 min read
Christopher Kujawa
Chaos Engineer @ Zeebe

It has been awhile since the last post, I'm happy to be back.

In today's chaos day we want to verify the hypothesis from zeebe-chaos#34 that old clients can't disrupt a running cluster.

It might happen that after upgrading your Zeebe to the newest shiny version, you might forget to update some of your workers or starters etc. This should normally not an issue since Zeebe is backwards compatible, client wise since 1.x. But what happens when older clients are used. Old clients should not have a negative effect on a running cluster.

TLDR Older clients (0.26) have no negative impact on a running cluster (1.2), and clients after 1.x are still working with the latest version.

Slow Network

6 min read
Christopher Kujawa
Chaos Engineer @ Zeebe

On a previous Chaos Day we played around with ToxiProxy , which allows injecting failures on the network level. For example dropping packages, causing latency etc.

Last week @Deepthi mentioned to me that we can do similar things with tc, which is a built-in linux command. Today I wanted to experiment with latency between leader and followers using tc.

TL;DR; The experiment failed; With adding 100ms network delay to the Leader we broke the complete processing throughput. 馃挜