Extract K8 resources from namespace

· 2 min read
Christopher Kujawa
Chaos Engineer @ Zeebe
  • Research: Read about DiRT (Disaster Recovery Tests) at Google, which gave me some new ideas for more game days
  • Failure documentation about Log Appender

Unfortunately, I had no time for a new chaos experiment today, but I spent some time with @pihme investigating how we can run the cluster plans in our GKE. We made a bit of progress: I'm finally able to create cluster plans in the ultratest and can extract all resource definitions via the command line.

kubectl get pvc,configmap,service,deployment,statefulset,cronjob,storageclasses -o yaml --export | sed -e '/resourceVersion: "[0-9]\+"/d' -e '/uid: [a-z0-9-]\+/d' -e '/selfLink: [a-z0-9A-Z/]\+/d' -e '/status:/d' -e '/phase:/d' -e '/creationTimestamp:/d' > s-cluster.yaml
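The cleanup half of that pipeline can be pulled out into a small filter, which makes it easier to try on a saved manifest without touching the cluster. This is only a sketch: the function name `strip_cluster_fields` is made up for illustration, and the patterns are the same ones as in the one-liner, just written with `[x][x]*` instead of the GNU-specific `\+`. Note that it deletes single lines only, so nested content under `status:` would stay behind. Also keep in mind that `--export` was deprecated in kubectl 1.14 and removed in 1.18, so newer clients need to drop that flag.

```shell
# Hypothetical helper (not part of any tool): reads manifests on stdin and
# drops the cluster-specific lines, using the same patterns as the one-liner.
strip_cluster_fields() {
  sed -e '/resourceVersion: "[0-9][0-9]*"/d' \
      -e '/uid: [a-z0-9-][a-z0-9-]*/d' \
      -e '/selfLink: [a-z0-9A-Z\/][a-z0-9A-Z\/]*/d' \
      -e '/status:/d' \
      -e '/phase:/d' \
      -e '/creationTimestamp:/d'
}
```

It could then be used as `kubectl get pvc,configmap,service,deployment,statefulset,cronjob,storageclasses -o yaml | strip_cluster_fields > s-cluster.yaml`.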

We now need to find a way to deploy it successfully in our cluster; we haven't managed that yet. We thought about using kustomize to adjust some of the values they use. It would be much easier to just deploy the operator they use in our GKE cloud and use the CRD to deploy the cluster plans. I think we need to investigate a bit more to find the best approach. In the end, I would like to run our chaos experiments against clusters that correspond to the real deployed ones.

Gateway Network Partition

· 3 min read
Christopher Kujawa
Chaos Engineer @ Zeebe
  • Documented failure cases for AsyncSnapshotDirector. This gave me some ideas about where it might make sense to reinstall a partition. Discussed it a bit with @Deepthi
  • Our automated chaos experiments are still not running. Fixing that needs some time, which I didn't have today.
  • Ran a chaos experiment together with @pihme, where we did a network partition with the gateway.

Correlate Message after failover

· One min read
Christopher Kujawa
Chaos Engineer @ Zeebe
  • Documented failure cases for the engine and stream processor. I think we already handle almost all the failure cases I can think of, except problems on reading, which I think can't be handled.
  • Checked what the current issue is with the automated chaos experiments. It seems to be an infra problem. You can check the discussion in #infra. It might be related to Infra-1292.
  • Ran a chaos experiment, where we correlate a message after failover.

First Chaos Day!

· 2 min read
Christopher Kujawa
Chaos Engineer @ Zeebe

First Chaos day 🎉

  • Documented failure cases for the exporter (some already existed, it seemed), which gave me a new idea for a ZEP
  • Introduced Peter to our Chaos Repository, discussed the hypothesis backlog a bit, and reopened the Chaos Trello board where we will organize ourselves
  • Ran a chaos experiment, where we put high CPU load on the Leader #6