This chaos day was a bit different. Originally I wanted to experiment with Camunda Cloud again and verify that our high-load chaos experiments now work with the newest cluster plans, see zeebe-cluster-testbench#135.
Unfortunately, I found out that our test chaos cluster was broken in a way that prevented us from creating new clusters. Luckily this was fixed by the end of the day, thanks to @immi :)
Because of these circumstances I thought about other things to experiment with, and I remembered that on the last chaos day we patched running deployments in order to add more capabilities.
This allowed us to create IP routes and experiment with the Zeebe deployment distribution. While doing this I read through the list of Linux capabilities and found out that we can mark files as immutable, which might be interesting for a chaos experiment.
On this chaos day I planned to find out how marking a file immutable affects our brokers, and I made the following hypothesis: if a leader hits a write error that is not recoverable, it will step down and another leader should take over. I put this in our hypothesis backlog (zeebe-chaos#52).
To really run this kind of experiment, I first need to find out whether marking a file immutable causes any problems and, if not, how I can cause write errors that affect the broker.
Unfortunately, it turned out that marking files immutable does not cause issues on already opened file channels, but I found some other bugs/issues, which you can read about below.
On the next chaos days I will search for a way to cause write errors proactively, so we can verify that our system can handle such issues.