
Disconnect Leader and one Follower

8 min read
Christopher Kujawa
Chaos Engineer @ Zeebe

Happy new year everyone 🎉

This time I wanted to verify the following hypothesis: disconnecting the Leader and one Follower should not disrupt the cluster (#45). But in order to do that, we need to extract the Leader and Follower nodes for a partition from the topology. Luckily, in December we got an external contribution which allows us to print the zbctl status as JSON. This gives us more possibilities, since we can extract values from it much more easily, as sketched below.

TL;DR The experiment was successful 👍
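As a rough sketch (not necessarily the script used in the experiment), the Leader and one Follower for a partition could be pulled out of the JSON topology roughly like this. The zbctl flags and the JSON field names are assumptions based on the Zeebe topology response, so adjust them to the actual output:

```python
import json
import subprocess

# Assumption: zbctl can print the cluster topology as JSON (the exact flag may differ).
raw = subprocess.run(
    ["zbctl", "status", "--insecure", "--output", "json"],
    capture_output=True, text=True, check=True,
).stdout
topology = json.loads(raw)

PARTITION = 1
leader, follower = None, None
# Assumption: each broker entry lists its partitions together with its role per partition.
for broker in topology["brokers"]:
    for partition in broker["partitions"]:
        if partition["partitionId"] != PARTITION:
            continue
        if partition["role"] == "LEADER":
            leader = broker["nodeId"]
        elif follower is None and partition["role"] == "FOLLOWER":
            follower = broker["nodeId"]

print(f"partition {PARTITION}: leader={leader}, follower={follower}")
```

With the leader and one follower identified, the experiment can then disconnect exactly these two nodes from each other.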

Message Correlation after Failover

4 min read
Christopher Kujawa
Chaos Engineer @ Zeebe

Today I finally wanted to implement an experiment which I had postponed for a long time, see #24. The problem was that previously we were not able to determine on which partition a message was published, so we could not assert that it was published on the correct partition. With #4794 this is now possible, which was, by the way, a community contribution. 🎉
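Conceptually, a published message is routed to a partition derived from its correlation key, which is what makes such an assertion possible at all. The tiny sketch below only illustrates that idea; the hash used here is a stand-in, not the algorithm Zeebe actually uses:

```python
# Illustration only: the target partition of a published message is a
# deterministic function of its correlation key. The hash below is a
# placeholder and NOT Zeebe's real hashing algorithm.
def expected_partition(correlation_key: str, partition_count: int) -> int:
    digest = sum(correlation_key.encode())   # placeholder hash
    return (digest % partition_count) + 1    # Zeebe partition ids start at 1

print(expected_partition("order-42", 3))
```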

Many Job Timeouts

4 min read
Christopher Kujawa
Chaos Engineer @ Zeebe

In the last game day (on Friday, 06.11.2020) I wanted to test whether we can break a partition if many messages time out at the same time. I sent a large number of messages with decreasing TTLs, all targeting a specific point in time, so that they would all time out at the same moment. I expected that the processor would then try to time out all of them at once and break because the batch is too big. Fortunately this didn't happen; the processor was able to handle it.

I wanted to verify the same with job timeouts.
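A small sketch of how such a batch of decreasing TTLs could be computed so that everything expires at (roughly) the same instant; the message count and names are made up for illustration:

```python
import time

MESSAGE_COUNT = 1_000
# All messages should expire at the same instant, roughly five minutes from now.
target_expiry_ms = int(time.time() * 1000) + 5 * 60 * 1000

for i in range(MESSAGE_COUNT):
    now_ms = int(time.time() * 1000)
    ttl_ms = max(target_expiry_ms - now_ms, 0)  # TTL shrinks the later we publish
    # Publish message i with a time-to-live of ttl_ms via the client of your
    # choice (zbctl, a Zeebe client library, ...); printed here for illustration only.
    print(f"message-{i}: ttl={ttl_ms} ms")
```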

Investigate failing Chaos Tests

5 min read
Christopher Kujawa
Chaos Engineer @ Zeebe

Today, as part of the Chaos Day, I wanted to investigate why our current Chaos Tests are failing and why they have broken our target cluster; see the related issue #5688.

TL;DR

We found three new bugs regarding reprocessing detection and deployment distribution, but we were still not able to reproduce the real issue.

Non-graceful Shutdown Broker

2 min read
Christopher Kujawa
Chaos Engineer @ Zeebe

Today I did not have much time for the chaos day, because of writing a game day summary, an incident review, taking part in incidents, etc. So, enough chaos for one day :)

But I wanted to merge the PR from Peter and test how our brokers behave if they are not shut down gracefully. I did that on Wednesday (21-10-2020).
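One way to simulate a non-graceful shutdown is to kill the broker pod without a grace period. A minimal sketch using the Kubernetes Python client; the pod name and namespace are placeholders, not the actual values from our clusters:

```python
from kubernetes import client, config

config.load_kube_config()
core = client.CoreV1Api()

# Placeholders: adjust to the actual broker pod and namespace of your cluster.
POD_NAME = "zeebe-broker-0"
NAMESPACE = "default"

# grace_period_seconds=0 kills the pod immediately instead of letting the
# broker shut down gracefully.
core.delete_namespaced_pod(name=POD_NAME, namespace=NAMESPACE, grace_period_seconds=0)
```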

Gateway memory consumption

4 min read
Christopher Kujawa
Chaos Engineer @ Zeebe

In the last weeks I checked multiple benchmarks and clusters during incidents. Often I had the feeling that the memory consumption of the gateway is not ideal, or that there is a memory leak. I wanted to experiment with this memory consumption. Since, while investigating https://github.com/zeebe-io/zeebe/issues/5641, we saw a high memory spike when the gateway was not able to talk to other nodes, I suspected that some bugs might be hiding here.

Multiple Leader Changes

4 min read
Christopher Kujawa
Chaos Engineer @ Zeebe

Today I wanted to add a new chaostoolkit experiment which we can automate. We already have experiments like restarting followers and leaders of a partition, but in the past multiple restarts/leader changes in a short period of time also caused issues. This is why I created #39.
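The idea behind the experiment, as a rough sketch: restart the current leader several times in a row to force repeated leader changes. The helpers `find_leader_pod` and `delete_pod` are hypothetical (they could be built on the zbctl topology parsing and pod deletion shown above):

```python
import time

# Hypothetical helpers: locating the leader pod of a partition (e.g. via the
# zbctl JSON topology) and deleting a pod (e.g. via kubectl or the Kubernetes API).
from chaos_helpers import delete_pod, find_leader_pod  # hypothetical module

PARTITION = 1
LEADER_CHANGES = 3

for _ in range(LEADER_CHANGES):
    leader_pod = find_leader_pod(PARTITION)
    delete_pod(leader_pod)  # forces a new leader election for the partition
    time.sleep(30)          # short pause so a new leader can be elected
```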

Experiment with Camunda Cloud

4 min read
Christopher Kujawa
Chaos Engineer @ Zeebe

In order to make our chaos experiments more realistic, we have set up a new GKE cluster which is similar to the Camunda Cloud GKE cluster. It allows us to test and experiment with Zeebe clusters that have the same configuration as Zeebe clusters in Camunda Cloud.

As part of the chaos day, I ran the same benchmark we normally run in our GKE cluster with our configuration against the Camunda Cloud Zeebe clusters.