
Multiple Leader Changes

4 min read
Christopher Kujawa
Chaos Engineer @ Zeebe

Today I wanted to add a new chaostoolkit experiment which we can automate. We already have experiments like restarting followers and leaders for a partition, but in the past multiple restarts/leader changes in a short period of time also caused issues. This is why I created #39.
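To illustrate the idea, here is a minimal sketch of how such an experiment could force several leader changes in quick succession. The pod names and loop count are assumptions for illustration, not the actual experiment definition from #39:

# A minimal sketch, assuming broker pods named zeebe-broker-*.
# Restarting the same broker repeatedly forces repeated leader
# elections for the partitions it leads.
for round in 1 2 3; do
  kubectl delete pod zeebe-broker-0
  kubectl wait --for=condition=ready pod zeebe-broker-0 --timeout=120s
done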

Experiment with Camunda Cloud

4 min read
Christopher Kujawa
Chaos Engineer @ Zeebe

In order to make our chaos experiments more realistic, we have set up a new GKE cluster which is similar to the Camunda Cloud GKE cluster. It allows us to test and experiment with Zeebe clusters that have the same configuration as Zeebe clusters in Camunda Cloud.

As part of the chaos day I ran the same benchmark we normally run in our GKE cluster, with our configuration, against the Camunda Cloud Zeebe clusters.

Extract K8s resources from namespace

2 min read
Christopher Kujawa
Chaos Engineer @ Zeebe
  • Research: Read about DiRT (Disaster Recovery Testing) @ Google - gave me some new ideas for more game days
  • Failure documentation about Log Appender

Unfortunately I had no time today for a new chaos experiment, but I spent some time with @pihme investigating how we can run the cluster plans in our GKE cluster. We made a bit of progress. I'm finally able to create cluster plans in the ultratest and can extract all resource definitions via the command line:

# Export all relevant resource definitions from the namespace and strip
# the server-generated fields (resourceVersion, uid, selfLink, status, ...)
# so the manifests can be re-applied elsewhere.
# Note: --export is deprecated and was removed in kubectl 1.18.
kubectl get pvc,configmap,service,deployment,statefulset,cronjob,storageclasses \
  -o yaml --export \
  | sed -e '/resourceVersion: "[0-9]\+"/d' \
        -e '/uid: [a-z0-9-]\+/d' \
        -e '/selfLink: [a-z0-9A-Z/]\+/d' \
        -e '/status:/d' \
        -e '/phase:/d' \
        -e '/creationTimestamp:/d' \
  > s-cluster.yaml

We now need to find a way to successfully deploy it in our cluster - we haven't been successful yet. We thought about using kustomize to adjust some of the values they use. Much easier would be to just deploy the operator they use in our GKE cluster and use the CRD to deploy the cluster plans. I think we need to investigate a bit more what the best approach is here. In the end I would like to run our chaos experiments against clusters which correspond to the real deployed ones.
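For the kustomize route, a minimal sketch could look like the following; the target namespace is a placeholder, and real patches would go where the comment indicates:

# Minimal kustomize sketch, reusing the s-cluster.yaml extracted above.
cat > kustomization.yaml <<'EOF'
resources:
  - s-cluster.yaml
namespace: our-test-namespace   # hypothetical target namespace
# patches to adjust their values would go here
EOF
kubectl apply -k .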

Gateway Network Partition

3 min read
Christopher Kujawa
Chaos Engineer @ Zeebe
  • Documented failure cases for the AsyncSnapshotDirector. Gave me some ideas where it might make sense to reinstall a partition. Discussed a bit with @Deepthi.
  • Our automated chaos experiments are still not running. I need to set aside some time for that, but had none today.
  • Ran a chaos experiment together with @pihme, where we created a network partition between the gateway and a broker (see the sketch after this list).
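A rough sketch of how such a partition can be induced from the gateway side. The pod names are placeholders, and it assumes the gateway image ships iproute2:

# Hypothetical: make a broker unreachable from the gateway by
# blackholing its pod IP (pod names are illustrative).
BROKER_IP=$(kubectl get pod zeebe-broker-0 -o jsonpath='{.status.podIP}')
kubectl exec zeebe-gateway-0 -- ip route add unreachable "$BROKER_IP"
# ... observe how the gateway and clients behave during the partition ...
kubectl exec zeebe-gateway-0 -- ip route del unreachable "$BROKER_IP"   # heal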

Correlate Message after failover

One min read
Christopher Kujawa
Chaos Engineer @ Zeebe
  • Documented failure cases for the engine and stream processor. We already handle almost all failure cases I can think of, except problems on reading, which I think can't be handled.
  • Checked what the current issue is with the automated chaos experiments. It seems to be an infra problem; you can check the discussion in #infra. It might be affected by Infra-1292.
  • Ran a chaos experiment where we correlate a message after failover (see the sketch below).
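A minimal sketch of that experiment flow with zbctl. The process ID, message name, correlation key, and pod name are illustrative; it assumes a deployed process that waits for the published message:

# Hypothetical flow (assumes a deployed process "await-message" waiting
# for the message "order-received", correlated via orderId):
zbctl create instance await-message --variables '{"orderId": "1"}'
kubectl delete pod zeebe-broker-0    # force a failover of the partition leader
zbctl publish message "order-received" --correlationKey "1"
# The message should still be correlated to the waiting instance.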