Blog | Zeebe Chaos

Slow Network

July 6, 2021 · 6 min read

Chaos Engineer @ Zeebe

On a previous Chaos Day we played around with ToxiProxy, which allows injecting failures at the network level, for example dropping packets or adding latency.

Last week @Deepthi mentioned to me that we can do similar things with tc, which is a built-in Linux command. Today I wanted to experiment with latency between the leader and its followers using tc.

TL;DR; The experiment failed: adding 100 ms of network delay to the leader broke the processing throughput completely. 💥
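To make the kind of delay injection described above a bit more concrete, here is a minimal sketch of how such tc invocations could be assembled. It assumes the netem queueing discipline and an interface called eth0; both are assumptions for illustration and not necessarily what was used in the experiment. The helper names are hypothetical, and the sketch only builds the command lines, it does not run them.

```python
import shlex
from typing import List


def netem_delay_command(interface: str, delay_ms: int) -> List[str]:
    """Build a tc command that adds a fixed delay to outgoing packets on `interface`."""
    if delay_ms <= 0:
        raise ValueError("delay_ms must be positive")
    # netem is the Linux network emulation qdisc; "delay" adds a fixed latency.
    return ["tc", "qdisc", "add", "dev", interface, "root", "netem", "delay", f"{delay_ms}ms"]


def netem_clear_command(interface: str) -> List[str]:
    """Build a tc command that removes the root qdisc again, undoing the delay."""
    return ["tc", "qdisc", "del", "dev", interface, "root"]


if __name__ == "__main__":
    # "eth0" is an assumed interface name; check the actual name inside the pod.
    print(shlex.join(netem_delay_command("eth0", 100)))
    print(shlex.join(netem_clear_command("eth0")))
```

Running these commands for real requires the NET_ADMIN capability in the target container, so the sketch keeps command construction separate from execution.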

Full Disk Recovery

June 8, 2021 · 8 min read

Christopher Kujawa

Chaos Engineer @ Zeebe

On this chaos day we wanted to experiment with out-of-disk (OOD) recovery and Elasticsearch (ELS) connection issues. This is related to the following issues from our hypothesis backlog: zeebe-chaos#32 and zeebe-chaos#14. This time @Nico joined me.

TL;DR The experiment was successful 💪 and we found several things in the dashboard which we can improve :)

Time travel Experiment

May 25, 2021 · 9 min read

Christopher Kujawa

Chaos Engineer @ Zeebe

Recently we ran a Game Day where a lot of messages with a high TTL were stored in the state. This was based on an earlier incident that we had seen in production. One suggested approach to resolve that incident was to increase the time, such that all messages are removed from the state. Because of this, and because summer and winter time shifts can cause evil bugs in other systems, we wanted to find out how our system handles time shifts. Phil joined me as participant and observer. There is a related issue that covers this topic as well, zeebe-chaos#3.

TL;DR; Zeebe is able to handle time shifts back and forth, without observable issues. Operate seems to dislike it.

Corrupted Snapshot Experiment Investigation

April 29, 2021 · 8 min read

Christopher Kujawa

Chaos Engineer @ Zeebe

A while ago we wrote an experiment which should verify that followers are not able to become leader if they have a corrupted snapshot. You can find that specific experiment here. This experiment was executed regularly against Production-M and Production-S Camunda Cloud cluster plans. With the latest changes in the upcoming 1.0 release, we changed how snapshot corruption is detected on followers.

NEW If a follower is restarted and has a corrupted snapshot it will detect it on bootstrap and will refuse to start related services and crash. This means the pod will end in a crash loop, until this is manually fixed.

OLD The follower only detects the corrupted snapshot on becoming leader when opening the database. On the restart of a follower this will not be detected.

The behavior change caused our automated chaos experiments to fail, since we corrupt the snapshot on followers and restart followers in a later experiment. For this reason we had to disable the execution of the snapshot corruption experiment, see the related issue zeebe-io/zeebe-cluster-testbench#303.

On this chaos day we wanted to investigate whether we can improve the experiment and bring it back. For reference, I also opened an issue to discuss the current corruption detection approach, zeebe#6907.

BPMN meets Chaos Engineering

April 3, 2021 · 8 min read

Christopher Kujawa

Chaos Engineer @ Zeebe

On the first of April (2021) we ran our Spring Hackday at Camunda. This is an event where the developers at Camunda come together to work on projects they like or on new ideas and approaches they want to try out. This time we (Philipp and I) wanted to orchestrate our Chaos Experiments with BPMN. If you already know how we automated our chaos experiments before, you can skip the next section and jump directly to the Hackday Project section.

To understand this blog post, make sure that you have a basic understanding of Zeebe, Camunda Cloud and Chaos Engineering. Read the following resources to get a better understanding.

Set file immutable

March 30, 2021 · 7 min read

Christopher Kujawa

Chaos Engineer @ Zeebe

This chaos day was a bit different. Actually I wanted to experiment again with Camunda Cloud and verify that our high load chaos experiments are now working with the newest cluster plans, see zeebe-cluster-testbench#135. Unfortunately I found out that our test chaos cluster was broken in a way that we were not able to create new clusters. Luckily this was fixed at the end of the day, thanks to @immi :)

Because of these circumstances I thought about different things to experiment with, and I remembered that on the last chaos day we worked with patching running deployments in order to add more capabilities. This allowed us to create IP routes and experiment with the Zeebe deployment distribution. While doing this I read through the list of Linux capabilities and found out that we can mark files as immutable, which might be interesting for a chaos experiment.

On this chaos day I planned to find out how marking a file immutable affects our brokers, and I made the following hypothesis: if a leader has a write error which is not recoverable, it will step down and another leader should take over. I put this in our hypothesis backlog (zeebe-chaos#52).

In order to really run this kind of experiment I need to find out whether marking a file immutable will cause any problems and, if not, how I can cause write errors that affect the broker. Unfortunately it turned out that immutable files do not cause issues on already opened file channels, but I found some other bugs and issues, which you can read about below.

In the next chaos days I will search for a way to cause write errors proactively, so we can verify that our system can handle such issues.

Camunda Cloud network partition

March 23, 2021 · 8 min read

Christopher Kujawa

Chaos Engineer @ Zeebe

This time Deepthi joined me on my regular Chaos Day. 🎉

On the second-to-last chaos day I created an automated chaos experiment which verifies that the deployments are distributed after a network partition. Later it turned out that this doesn't work for Camunda Cloud, only for our helm setup. The issue was that on our Camunda Cloud Zeebe clusters we had no NET_ADMIN capability to create IP routes (used for the network partitions). After we discussed this with our SREs, they proposed a good way to overcome this: when running chaos experiments which are network related, we patch our target cluster to add this capability. This means we don't need to add such functionality in Camunda Cloud and the related zeebe operate/controller. Big thanks to Immi and David for providing this fix.

TL;DR;

We were able to enhance the deployment distribution experiment and run it in Camunda Cloud via testbench. We have enabled the experiment for Production M and L cluster plans. We had to adjust the rights for the testbench service account to make this work.
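As an illustration of what patching a target cluster to add the NET_ADMIN capability could look like, here is a minimal sketch that builds a Kubernetes strategic merge patch for a pod template. The container name "zeebe" and the helper name are assumptions made for this example; this is not the actual patch used by testbench, and how the real patch is applied is not shown here.

```python
import json
from typing import Any, Dict


def net_admin_patch(container_name: str) -> Dict[str, Any]:
    """Build a strategic merge patch that adds NET_ADMIN to one container.

    With a strategic merge patch, Kubernetes merges the containers list by
    container name, so only the named container's securityContext is touched.
    """
    return {
        "spec": {
            "template": {
                "spec": {
                    "containers": [
                        {
                            "name": container_name,
                            "securityContext": {
                                "capabilities": {"add": ["NET_ADMIN"]}
                            },
                        }
                    ]
                }
            }
        }
    }


if __name__ == "__main__":
    # "zeebe" is an assumed container name; use the name from the real pod spec.
    print(json.dumps(net_admin_patch("zeebe"), indent=2))
```

The resulting JSON could be handed to a tool such as kubectl patch against the workload that runs the brokers. Note that changing the pod template typically triggers a restart of the affected pods.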

Fault-tolerant processing of process instances

March 9, 2021 · 6 min read

Christopher Kujawa

Chaos Engineer @ Zeebe

Today I wanted to add another chaos experiment to extend our collection of automated chaos experiments. This time we will deploy a process model (with a timer start event), restart a node and complete the process instance via zbctl.

TL;DR;

I was able to create the chaos toolkit experiment. It shows us that we are able to restore our state after fail-over, which means we can trigger timer start events to create process instances even if the process models have been deployed before the fail-over. Plus we are able to complete these instances.

Automating Deployment Distribution Chaos Experiment

February 23, 2021 · 7 min read

Christopher Kujawa

Chaos Engineer @ Zeebe

This time I wanted to use the ChaosToolkit to automate the chaos experiment that I ran on the last chaos day. For a recap check out the last chaos day summary.

TL;DR;

I was able to automate the deployment distribution chaos experiment successfully and deployed it on testbench for a Production-M cluster plan.

Deployment Distribution

January 26, 2021 · 11 min read

Christopher Kujawa

Chaos Engineer @ Zeebe

On this chaos day we wanted to experiment a bit with deployments and their distribution.

We ran a chaos experiment in which we deployed multiple workflows and disconnected two leaders. We verified whether the deployments are distributed later. The chaos experiment was successful and showed a bit how fault tolerant deployment distribution is. 💪