
30 posts tagged with "availability"


Continuing SST Partitioning toggle

· 9 min read
Christopher Kujawa
Chaos Engineer @ Zeebe

Today we want to continue with the experiment from the last Chaos day, but this time with a bit more load. This should make sure that we trigger the compaction of RocksDB and cause the SST partitioning to actually happen.

The reasons stay the same: we want to find out whether it would be possible to enable and disable the flag/configuration without issues.

TL;DR; Today's experiments succeeded 🚀. We were able to show that even with a higher number of process instances (bigger state) we can easily disable and enable the SST partitioning flag without issues. I also got confirmation from a RocksDB contributor that our observations are correct, and that we can easily toggle this feature without issues.
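For reference, the flag we toggled lives in the broker's experimental RocksDB settings. A minimal configuration sketch, assuming the property path used in recent Zeebe versions (check your version's docs to be sure):

```yaml
# Broker application.yaml - sketch; property path assumed from recent Zeebe versions
zeebe:
  broker:
    experimental:
      rocksdb:
        # Enable/disable SST partitioning; we flipped this between restarts
        enableSstPartitioning: true
```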

SST Partitioning toggle

· 7 min read
Christopher Kujawa
Chaos Engineer @ Zeebe

On this chaos day I wanted to experiment with a new experimental feature we released recently: the partitioning of the SST files in RocksDB. This is an experimental feature from RocksDB, which we have now made available to our users as well, since we have seen great performance benefits, especially with larger runtime data.

I wanted to experiment a bit with the SST partitioning and find out whether it would be possible to enable and disable the flag/configuration without issues.

TL;DR; The first experiment was successful: it looks like we can enable and disable the partitioning without impacting the execution of an existing process instance (PI). We need to experiment a bit more with larger data sets, to force RocksDB compaction, before we can be fully sure.

Gateway Termination

· 8 min read
Christopher Kujawa
Chaos Engineer @ Zeebe

In today's chaos day, we wanted to experiment with the gateway and resiliency of workers.

We have seen in recent weeks some issues within our benchmarks when gateways have been restarted, see zeebe#11975.

We did a similar experiment in the past; today we want to focus on self-managed setups (benchmarks with our Helm charts). Ideally, we can automate this soon as well.

Today Nicolas joined me on the chaos day 🎉

TL;DR; We were able to show that the workers (clients) can reconnect after a gateway is shut down. Furthermore, we discovered a potential performance issue at lower load, which impacts process execution latency (zeebe#12311).

Recursive call activity

· 5 min read
Christopher Kujawa
Chaos Engineer @ Zeebe

Long time no see. Happy to do my first chaos day this year. In the last week we have implemented interesting features, which I would like to experiment with. Batch processing was one of them.

TL;DR; Chaos experiment failed. 💥 Batch processing doesn't seem to respect the configured limit, which causes issues with processing and influences the health of the system. We found a bug 💪
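For context, the limit mentioned above is the broker's command batch size. A hedged configuration sketch, assuming the `maxCommandsInBatch` property under the broker's processing settings (name and default may differ per version):

```yaml
# Broker application.yaml - sketch; property name assumed, verify against your Zeebe version
zeebe:
  broker:
    processing:
      # Upper bound of commands the engine should process in one batch
      maxCommandsInBatch: 100
```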

Message Correlation after Network Partition

· 10 min read
Christopher Kujawa
Chaos Engineer @ Zeebe

In the last weeks, we made several changes in our core components: we introduced some new abstractions and changed how we communicate between partitions.

Due to these changes, we thought it might make sense to run some more chaos experiments in that area, since our benchmarks also recently found some interesting edge cases.

Today we experimented with Message Correlation and what happens when a network partition disturbs the correlation process.

TL;DR; The experiment was partially successful (after a retry): we were able to publish messages during a network partition that were correlated after the network partition healed. We still need to verify whether we can also publish messages before a network partition and create the related instances during the partition.

Bring Deployment distribution experiment back

· 10 min read
Christopher Kujawa
Chaos Engineer @ Zeebe

We recently encountered a severe bug, zeebe#9877, and I was wondering why we hadn't spotted it earlier, since we have chaos experiments for it. I realized two things:

  1. The experiments only check for parts of it (BPMN resources only). The production code has changed and a new feature has been added (DMN), but the experiments/tests haven't been adjusted.
  2. More importantly, we disabled the automated execution of the deployment distribution experiment because it was flaky due to a missing standalone gateway in Camunda Cloud SaaS (zeebe-io/zeebe-chaos#61). This is no longer the case, see Standalone Gateway in CCSaaS.

On this chaos day I want to bring the automation of this chaos experiment back to life. If I still have time, I want to enhance the experiment.

TL;DR; The experiment still worked, and our deployment distribution is still resilient against network partitions. It also works with DMN resources. I can enable the experiment again, and we can close zeebe-io/zeebe-chaos#61. Unfortunately, we were not able to reproduce zeebe#9877, but we did some good preparation work for it.

Standalone Gateway in CCSaaS

· 4 min read
Christopher Kujawa
Chaos Engineer @ Zeebe

We recently introduced the Zeebe Standalone Gateway in CCSaaS. Today I wanted to do a first simple chaos experiment with the gateway, where we just restart one gateway.

Ideally, in the future we could enable some gateway chaos experiments again, which we currently only support for Helm.

TL;DR; Our Camunda Cloud clusters can handle gateway restarts without issues.

High Snapshot Frequency

· 4 min read
Christopher Kujawa
Chaos Engineer @ Zeebe

Today we wanted to experiment with the snapshot interval and verify that a high snapshot frequency will not impact our availability (#21).

TL;DR; The chaos experiment succeeded 💪 We were able to prove our hypothesis.
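For reference, the snapshot interval is controlled via the broker's data settings. A sketch of how one could raise the snapshot frequency for such an experiment (the value below is illustrative, not necessarily what we used):

```yaml
# Broker application.yaml - sketch; PT1M is an illustrative value for a high snapshot frequency
zeebe:
  broker:
    data:
      # Take a snapshot every minute instead of the default period
      snapshotPeriod: PT1M
```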

Handling of Big Variables

· 6 min read
Christopher Kujawa
Chaos Engineer @ Zeebe

New Year 🎉 New Chaos 🐒

This time I wanted to experiment with "big" variables. Zeebe supports a maxMessageSize of 4 MB, which is quite big. In general, it should be clear that using big variables will cause performance issues, but today I also want to find out whether the system can handle big variables (~1 MB) at all.

TL;DR; Our chaos experiment failed! Zeebe and Camunda Cloud are not able to handle big variables (~1 MB) by default without issues.
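For context, the maxMessageSize mentioned above is a network setting on the broker (and gateway). A configuration sketch, assuming the standard property path:

```yaml
# Broker application.yaml - sketch of the maxMessageSize setting referenced above
zeebe:
  broker:
    network:
      # Maximum size of a single message; large variables must fit into records below this limit
      maxMessageSize: 4MB
```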

Recovery (Fail Over) time

· 5 min read
Christopher Kujawa
Chaos Engineer @ Zeebe

In the last quarter we worked on a new "feature" called "building state on followers". In short, it means that the followers apply the events to build their state, which makes regular snapshot replication unnecessary and allows faster role transitions from follower to leader. On this chaos day I wanted to experiment a bit with this property; we already did some benchmarks here. Today, I want to see how it behaves with larger state (bigger snapshots), since this needed to be copied in previous versions of Zeebe, and the broker had to replay more than with the newest version.

If you want to know more about building state on followers, check out the ZEP.

TL;DR; In our experiment we had almost no downtime with version 1.2; the new leader was able to pick up the next work (accept new commands) very quickly.