
Multiple Leader Changes

4 min read
Christopher Kujawa
Chaos Engineer @ Zeebe

Today I wanted to add a new chaostoolkit experiment which we can automate. We already have experiments like restarting followers and leaders for a partition, but in the past multiple restarts/leader changes in a short period of time also caused issues. This is why I created #39.
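To illustrate the idea, here is a minimal sketch of how such an experiment could force several leader changes in quick succession. The pod names and loop count are assumptions for illustration, not the actual experiment definition from #39:

# A minimal sketch, assuming broker pods named zeebe-broker-*.
# Restarting the same broker repeatedly forces repeated leader
# elections for the partitions it leads.
for round in 1 2 3; do
  kubectl delete pod zeebe-broker-0
  kubectl wait --for=condition=ready pod zeebe-broker-0 --timeout=120s
done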

Experiment with Camunda Cloud

4 min read
Christopher Kujawa
Chaos Engineer @ Zeebe

In order to make our chaos experiments more realistic, we have set up a new GKE cluster which is similar to the Camunda Cloud GKE cluster. It allows us to test and experiment with Zeebe clusters that have the same configuration as Zeebe clusters in Camunda Cloud.

As part of the chaos day I ran the same benchmark we normally run in our GKE cluster, with our configuration, against the Camunda Cloud Zeebe clusters.

Extract K8s resources from namespace

2 min read
Christopher Kujawa
Chaos Engineer @ Zeebe
  • Research: Read about DiRT (Disaster Recovery Testing) @ Google - gave me some new ideas for more game days
  • Failure documentation about Log Appender

Unfortunately I had no time today for a new chaos experiment, but I spent some time with @pihme investigating how we can run the cluster plans in our GKE cluster. We made a bit of progress. I'm finally able to create cluster plans in the ultratest and can extract all resource definitions via the command line:

# Export all relevant resource definitions from the namespace and strip
# the server-generated fields (resourceVersion, uid, selfLink, status, ...)
# so the manifests can be re-applied elsewhere.
# Note: --export is deprecated and was removed in kubectl 1.18.
kubectl get pvc,configmap,service,deployment,statefulset,cronjob,storageclasses \
  -o yaml --export \
  | sed -e '/resourceVersion: "[0-9]\+"/d' \
        -e '/uid: [a-z0-9-]\+/d' \
        -e '/selfLink: [a-z0-9A-Z/]\+/d' \
        -e '/status:/d' \
        -e '/phase:/d' \
        -e '/creationTimestamp:/d' \
  > s-cluster.yaml

We now need to find a way to successfully deploy it in our cluster - we haven't been successful yet. We thought about using kustomize to adjust some of the values they use. Much easier would be to just deploy the operator they use in our GKE cluster and use the CRD to deploy the cluster plans. I think we need to investigate a bit more what the best approach is here. In the end I would like to run our chaos experiments against clusters which correspond to the real deployed ones.
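For the kustomize route, a minimal sketch could look like the following; the target namespace is a placeholder, and real patches would go where the comment indicates:

# Minimal kustomize sketch, reusing the s-cluster.yaml extracted above.
cat > kustomization.yaml <<'EOF'
resources:
  - s-cluster.yaml
namespace: our-test-namespace   # hypothetical target namespace
# patches to adjust their values would go here
EOF
kubectl apply -k .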

Gateway Network Partition

3 min read
Christopher Kujawa
Chaos Engineer @ Zeebe
  • Documented failure cases for the AsyncSnapshotDirector. Gave me some ideas where it might make sense to reinstall a partition. Discussed a bit with @Deepthi.
  • Our automated chaos experiments are still not running. I need to set aside some time for that, but had none today.
  • Ran a chaos experiment together with @pihme, where we created a network partition between the gateway and a broker (see the sketch after this list).
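A rough sketch of how such a partition can be induced from the gateway side. The pod names are placeholders, and it assumes the gateway image ships iproute2:

# Hypothetical: make a broker unreachable from the gateway by
# blackholing its pod IP (pod names are illustrative).
BROKER_IP=$(kubectl get pod zeebe-broker-0 -o jsonpath='{.status.podIP}')
kubectl exec zeebe-gateway-0 -- ip route add unreachable "$BROKER_IP"
# ... observe how the gateway and clients behave during the partition ...
kubectl exec zeebe-gateway-0 -- ip route del unreachable "$BROKER_IP"   # heal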

Correlate Message after failover

One min read
Christopher Kujawa
Chaos Engineer @ Zeebe
  • Documented failure cases for the engine and stream processor. We already handle almost all failure cases I can think of, except problems on reading, which I think can't be handled.
  • Checked what the current issue is with the automated chaos experiments. It seems to be an infra problem; you can check the discussion in #infra. It might be affected by Infra-1292.
  • Ran a chaos experiment where we correlate a message after failover (see the sketch below).
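A minimal sketch of that experiment flow with zbctl. The process ID, message name, correlation key, and pod name are illustrative; it assumes a deployed process that waits for the published message:

# Hypothetical flow (assumes a deployed process "await-message" waiting
# for the message "order-received", correlated via orderId):
zbctl create instance await-message --variables '{"orderId": "1"}'
kubectl delete pod zeebe-broker-0    # force a failover of the partition leader
zbctl publish message "order-received" --correlationKey "1"
# The message should still be correlated to the waiting instance.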