
Experiment with Camunda Cloud

· 4 min read
Christopher Kujawa
Chaos Engineer @ Zeebe

In order to make our chaos experiments more realistic, we have set up a new GKE cluster which is similar to the Camunda Cloud GKE cluster. It allows us to test and experiment with Zeebe clusters that have the same configuration as Zeebe clusters in Camunda Cloud.

As part of this chaos day I ran the same benchmark we normally run in our own GKE cluster, with our configuration, against the Camunda Cloud Zeebe clusters.

Configurations of Zeebe Clusters

The following table highlights the different configurations of the different Zeebe clusters (cluster types).

| Name | Our Default | Prod S | Prod M | Prod L |
|---|---|---|---|---|
| Partitions | 3 | 1 | 4 | 8 |
| Nodes | 3 | 3 | 3 | 6 |
| Replication | 3 | 3 | 3 | 3 |
| Snapshot Period | 15 min | 5 min | 5 min | 5 min |
| CPU_THREADS | 4 | 1 | 4 | 4 |
| IO_THREADS | 4 | 2 | 4 | 4 |
| CPU Limit | 5 | 1 | 4 | 4 |
| CPU Request | 5 | 200m | 200m | 200m |
| RAM Limit | 12Gi | 2Gi | 8Gi | 8Gi |
| RAM Request | 12Gi | 250Mi | 250Mi | 250Mi |
| Gateway | Standalone | Embedded | Embedded | Embedded |
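
To make these cluster types a bit more concrete, the following is a rough sketch of how the Prod S values from the table would map onto a broker StatefulSet. The manifest shape, image version, and environment variable names are illustrative assumptions based on the Zeebe broker configuration options, not the actual Camunda Cloud manifests.

```yaml
# Sketch only: Prod S values from the table mapped onto a broker StatefulSet.
# The image version and env var names are assumptions; the real Camunda Cloud
# manifests are not reproduced here.
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: zeebe-broker
spec:
  serviceName: zeebe-broker
  replicas: 3                               # Nodes
  selector:
    matchLabels:
      app: zeebe-broker
  template:
    metadata:
      labels:
        app: zeebe-broker
    spec:
      containers:
        - name: zeebe
          image: camunda/zeebe:0.24.2       # assumed version
          env:
            - name: ZEEBE_BROKER_CLUSTER_CLUSTERSIZE
              value: "3"                    # Nodes
            - name: ZEEBE_BROKER_CLUSTER_PARTITIONSCOUNT
              value: "1"                    # Partitions
            - name: ZEEBE_BROKER_CLUSTER_REPLICATIONFACTOR
              value: "3"                    # Replication
            - name: ZEEBE_BROKER_THREADS_CPUTHREADCOUNT
              value: "1"                    # CPU_THREADS
            - name: ZEEBE_BROKER_THREADS_IOTHREADCOUNT
              value: "2"                    # IO_THREADS
            - name: ZEEBE_BROKER_DATA_SNAPSHOTPERIOD
              value: "5m"                   # Snapshot Period
          resources:
            requests:
              cpu: 200m                     # CPU Request
              memory: 250Mi                 # RAM Request
            limits:
              cpu: "1"                      # CPU Limit
              memory: 2Gi                   # RAM Limit
```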

Benchmarks

| Name | Our Default | Prod S | Prod M | Prod L |
|---|---|---|---|---|
| General | base | prods | prods | prods |
| Resources | base | prods | prods | prods |
| Disk usage | | prods | prods | prods |
| Latency | base | prods | prods | prods |
| Working | base | prods | prods | prods |

In general we can see that the clusters didn't survive for long. This is also visible on our Camunda Cloud status page.

status

I think this is related to the preemptible nodes, the high load, the long restarts, and the fact that pods are restarted after 15 minutes when they do not become ready. One of the reasons why restarting takes so long is fixed now with #5189, so I hope that this gets better. But currently it is an issue, since on start up a broker replicates a snapshot and reprocesses, and if this takes longer than 15 minutes the pod is restarted because of this probe configuration: `Liveness: http-get http://:9600/ready delay=900s timeout=1s period=15s #success=1 #failure=3`. After such a restart the pod hasn't gained anything; it just has to start the complete procedure again. In Kubernetes we can see a high restart count for the pods.
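
Expressed as a pod spec snippet, that probe configuration looks roughly like the following; this is reconstructed from the kubectl describe output above, so the actual chart template may differ in detail.

```yaml
# Liveness probe reconstructed from the `kubectl describe` output above;
# the real chart/template may differ in detail.
livenessProbe:
  httpGet:
    path: /ready
    port: 9600
  initialDelaySeconds: 900   # delay=900s -> the pod gets 15 min to become ready
  timeoutSeconds: 1
  periodSeconds: 15
  successThreshold: 1
  failureThreshold: 3        # 3 failed checks after the delay trigger a restart
```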

It is interesting to look at the Working panel of Prod S: there we can clearly see how often a pod is actually preempted or a leader change happens.

prod-s

It is a known issue that the nodes are currently preempted quite often in Camunda Cloud, and they are working on a solution for it.

Throughput

If we take a look at the Working panels we can see that throughput in Camunda Cloud scales with the partition count (or prod cluster size). For Prod S we reach on average ~24 workflow instance creations/completions per second, for Prod M ~46 per second, and for Prod L ~99 per second.

To be fair, I ran the benchmark on these cluster sizes with only three workers, each with 8 threads and an activation count of 120, completing a job after a 150 ms delay, plus a starter which creates 100 workflow instances per second (see the sketch below). Normally we use 12 workers in our benchmarks and start 300 workflow instances per second. I tried that with the Prod L cluster, but it failed quite fast after increasing the load; here we probably need to investigate further. With our own cluster setup we reach on average ~147 workflow instance creations/completions per second.
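
As a rough illustration of that load profile, the worker and starter deployments could look like the following. The image names and environment variable names are hypothetical placeholders and not the actual benchmark project's configuration; only the numbers (3 workers, 8 threads, activation count 120, 150 ms completion delay, 100 instances per second) come from the run described above.

```yaml
# Illustration only: images and env var names are hypothetical placeholders;
# the numbers are the ones used in this benchmark run.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: benchmark-worker
spec:
  replicas: 3                        # three workers
  selector:
    matchLabels:
      app: benchmark-worker
  template:
    metadata:
      labels:
        app: benchmark-worker
    spec:
      containers:
        - name: worker
          image: example/zeebe-benchmark-worker:latest   # hypothetical image
          env:
            - name: WORKER_THREADS
              value: "8"             # 8 threads per worker
            - name: MAX_JOBS_ACTIVE
              value: "120"           # activation count
            - name: COMPLETION_DELAY_MS
              value: "150"           # complete a job after 150 ms
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: benchmark-starter
spec:
  replicas: 1
  selector:
    matchLabels:
      app: benchmark-starter
  template:
    metadata:
      labels:
        app: benchmark-starter
    spec:
      containers:
        - name: starter
          image: example/zeebe-benchmark-starter:latest  # hypothetical image
          env:
            - name: INSTANCES_PER_SECOND
              value: "100"           # start 100 workflow instances per second
```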

Latency

If we take a look at the latency, we can see that it seems a bit problematic in the Prod M cluster, while Prod S and Prod L look similar. With our default cluster we get the best latency. This might be worth a closer look as well.

Other Observations

While observing the benchmarks, I noticed that some metrics are missing.

For example, the Gateway metrics are not shown:

gw

Furthermore, I saw that all container-related and PVC-related metrics are missing. I was not able to check the IO metrics, the CPU metrics, or others.

io

If we want to run more tests and chaos experiments we need to fix these missing metrics first. I opened a new issue for it: #242

Running automated Chaos experiments

In order to run automated chaos experiments in our new GKE cluster, I had to create a new ServiceAccount and RoleBindings, so that our Jenkins can access the new Kubernetes cluster and our experiments can delete and create resources.
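
A minimal sketch of such a ServiceAccount and RoleBinding is shown below; the names, the namespace, and the resource/verb lists are assumptions for illustration, not the actual manifests used.

```yaml
# Sketch only: names, namespace, and the resource/verb lists are illustrative
# assumptions, not the actual manifests.
apiVersion: v1
kind: ServiceAccount
metadata:
  name: chaos-jenkins
  namespace: chaos
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: chaos-experiments
  namespace: chaos
rules:
  - apiGroups: ["", "apps"]
    resources: ["pods", "services", "configmaps", "deployments", "statefulsets"]
    verbs: ["get", "list", "watch", "create", "delete"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: chaos-experiments
  namespace: chaos
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: chaos-experiments
subjects:
  - kind: ServiceAccount
    name: chaos-jenkins
    namespace: chaos
```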

Participants

  • @zelldon