Blog | Zeebe Chaos

Experiment with Low Load

August 6, 2020 · 3 min read

Christopher Kujawa

Chaos Engineer @ Zeebe

Run a benchmark with low load
Investigate last benchmark failures

Experiment without Exporters

July 30, 2020 · 6 min read

Christopher Kujawa

Chaos Engineer @ Zeebe

Run a chaos experiment without exporters

Big Multi Instance

July 16, 2020 · 3 min read

Christopher Kujawa

Chaos Engineer @ Zeebe

Investigated and fixed the automated chaos experiments - they work again with 88c404f and cd8d685
Closed some issues in the backlog
Ran a chaos experiment with a bigger multi instance to reach maxMessageSize

Experiment with Timers and Huge Variables

July 9, 2020 · 4 min read

Christopher Kujawa

Chaos Engineer @ Zeebe

Failure documentation about RAFT
Added chaos day summaries to repo
Ran a chaos experiment with a lot of timers
Ran a chaos experiment with huge variables

Extract K8 resources from namespace

July 2, 2020 · 2 min read

Christopher Kujawa

Chaos Engineer @ Zeebe

Research: Read about DiRT (Disaster Recovery Tests) at Google - gave me some new ideas for more game days
Failure documentation about Log Appender

Unfortunately, I had no time today for a new chaos experiment, but I spent some time with @pihme investigating how we can run the cluster plans in our GKE. We made a bit of progress. I'm finally able to create cluster plans in the ultratest and can extract all resource definitions via the command line.

kubectl get pvc,configmap,service,deployment,statefulset,cronjob,storageclasses -o yaml --export | sed -e '/resourceVersion: "[0-9]\+"/d' -e '/uid: [a-z0-9-]\+/d' -e '/selfLink: [a-z0-9A-Z/]\+/d' -e '/status:/d' -e '/phase:/d' -e '/creationTimestamp:/d' > s-cluster.yaml
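To make it easier to see what the sed filter in the command above strips, here is a minimal sketch that applies the same expressions to a small, made-up manifest fragment instead of real cluster output. The sample content and the variable names `sample` and `filtered` are assumptions for illustration only, and the `\+` quantifier assumes GNU sed.

```shell
# Hypothetical manifest fragment, only for illustration (not real cluster output).
sample='metadata:
  name: demo
  creationTimestamp: "2020-07-02T10:00:00Z"
  resourceVersion: "12345"
  selfLink: /api/v1/namespaces/demo/services/demo
  uid: 0a1b2c3d-aaaa-bbbb-cccc-0123456789ab
spec:
  type: ClusterIP
status:
  phase: Active'

# Same sed expressions as in the kubectl pipeline above (GNU sed assumed for \+).
filtered=$(printf '%s\n' "$sample" | sed \
  -e '/resourceVersion: "[0-9]\+"/d' \
  -e '/uid: [a-z0-9-]\+/d' \
  -e '/selfLink: [a-z0-9A-Z/]\+/d' \
  -e '/status:/d' \
  -e '/phase:/d' \
  -e '/creationTimestamp:/d')

# Print what is left after the server-generated fields are dropped.
printf '%s\n' "$filtered"
```

Each expression deletes a single matching line, so `/status:/d` removes only the `status:` line itself. Nested status keys other than `phase:` would stay behind in the output, which may be worth checking in the generated file before applying it.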

We now need to find a way to successfully deploy it in our cluster - it hasn't been successful yet. We thought about using kustomize to adjust some of the values they use. It would be much easier to just deploy the operator they use in our GKE cloud and use the CRD to deploy the cluster plans. I think we need to investigate a bit more to find the best approach. In the end, I would like to run our chaos experiments against clusters that correspond to the real deployed ones.

Gateway Network Partition

June 25, 2020 · 3 min read

Christopher Kujawa

Chaos Engineer @ Zeebe

Documented failure cases for AsyncSnapshotDirector. Gave me some ideas where it might make sense to reinstall a partition. Discussed a bit with @Deepthi
Our automated chaos experiments are still not running. I need some time to fix that, but I had none today.
Ran a chaos experiment together with @pihme, where we did a network partition with the gateway.

Correlate Message after failover

June 18, 2020 · One min read

Christopher Kujawa

Chaos Engineer @ Zeebe

Documented failure cases for the engine and stream processor. I think we already handle almost all failure cases I can think of, except problems on reading, which I think can't be handled.
Checked what the current issue is with the automated chaos experiments. It seems to be an infra problem. You can check the discussion in #infra. It might be affected by Infra-1292
Ran a chaos experiment, where we correlate a message after failover.

High CPU load on Standalone Gateway

June 11, 2020 · 2 min read

Christopher Kujawa

Chaos Engineer @ Zeebe

Updated failure cases documentation for exporter based on review
Documented failure cases for ZeebeDB
Wrote a chaostoolkit experiment based on the last manual chaos experiment
Ran a chaos experiment with @Deepthi, where we put high CPU load on the standalone gateway https://github.com/zeebe-io/zeebe-chaos/issues/28

First Chaos Day!

June 4, 2020 · 2 min read

Christopher Kujawa

Chaos Engineer @ Zeebe

First Chaos day 🎉

Documented failure cases for the exporter (some already existed, it seems), which gave me a new idea for a ZEP
Introduced Peter to our Chaos Repository, discussed the hypothesis backlog a bit, and reopened the Chaos Trello board where we will organize ourselves
Ran a chaos experiment, where we put high CPU load on the Leader #6