
6 posts tagged with "data"


9 min read
Christopher Kujawa

Today we want to continue the experiment from the last Chaos day, but this time with more load. This should ensure that we trigger RocksDB compaction and that SST partitioning actually takes place.

The reasons stay the same: we want to find out whether it is possible to enable and disable the flag/configuration without issues.

TL;DR; Today's experiments succeeded 🚀. We were able to show that even with a higher number of process instances (a bigger state) we can easily disable and enable the SST partitioning flag without issues. I also got confirmation from a RocksDB contributor that our observations are correct and that we can safely toggle this feature.

7 min read
Christopher Kujawa

On this Chaos day I wanted to experiment with a new experimental feature we released recently: partitioning of the SST files in RocksDB. This is an experimental RocksDB feature that we have now made available to our users as well, since we have seen great performance benefits, especially with larger runtime data.

I wanted to experiment a bit with the SST partitioning and find out whether it would be possible to enable and disable the flag/configuration without issues.

TL;DR; The first experiment was successful: it looks like we can enable and disable the partitioning without impacting the execution of an existing process instance (PI). To be fully sure, we need to experiment a bit more with larger data sets that force RocksDB compaction.
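For context, SST partitioning in RocksDB is configured through an SstPartitionerFactory on the options. The minimal RocksJava sketch below only illustrates that underlying mechanism; the prefix length and database path are assumptions for the example, and in Zeebe the feature is toggled through an experimental broker configuration flag rather than application code like this.

```java
import org.rocksdb.Options;
import org.rocksdb.RocksDB;
import org.rocksdb.RocksDBException;
import org.rocksdb.SstPartitionerFixedPrefixFactory;

public class SstPartitioningSketch {
  public static void main(String[] args) throws RocksDBException {
    RocksDB.loadLibrary();

    try (final Options options = new Options().setCreateIfMissing(true)) {
      // Start a new SST file during compaction whenever the key prefix changes.
      // The prefix length of 8 bytes is an assumption for this sketch.
      options.setSstPartitionerFactory(new SstPartitionerFixedPrefixFactory(8));

      try (final RocksDB db = RocksDB.open(options, "/tmp/sst-partitioning-demo")) {
        db.put("key-1".getBytes(), "value-1".getBytes());
      }
    }
  }
}
```

Because the partitioner only influences how compaction splits data into SST files, the stored key-value pairs themselves stay unchanged, which is consistent with the observation that the setting can be switched on and off.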

6 min read
Christopher Kujawa

Due to some incidents and critical bugs we observed in the last weeks, I wanted to spend some time understanding the issues better and experimenting with how we could detect them. One of the issues we observed was that keys were generated more than once, so they were no longer unique (#8129). I will describe this property in more depth in the next section.

TL;DR; We were able to design an experiment which helps us detect duplicated keys in the log. Further work should be done to automate this experiment and run it against newer versions.

9 min read
Christopher Kujawa

Recently we ran a Game day where a lot of messages with a high TTL were stored in the state. This was based on an earlier incident we had seen in production. One suggested approach to resolve that incident was to advance the time such that all messages are removed from the state. Because of this, and because summer and winter time shifts can cause nasty bugs in other systems, we wanted to find out how our system handles time shifts. Phil joined me as participant and observer. There is a related issue which covers this topic as well, zeebe-chaos#3.

TL;DR; Zeebe is able to handle time shifts back and forth, without observable issues. Operate seems to dislike it.

6 min read
Christopher Kujawa

Today I wanted to add another chaos experiment to grow our collection of automated chaos experiments. This time we will deploy a process model (with a timer start event), restart a node, and complete the process instance via zbctl.

TL;DR;

I was able to create the Chaos Toolkit experiment. It shows that we are able to restore our state after failover, which means timer start events can still trigger and create process instances even if the process model was deployed before the failover. We are also able to complete these instances.
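As a rough illustration of that verification step (the experiment itself uses zbctl), the same deploy-and-complete flow could be expressed with the Zeebe Java client roughly as below. The gateway address, the resource name timer-start.bpmn, and the job type benchmark-task are assumptions for this sketch, not values taken from the experiment.

```java
import io.camunda.zeebe.client.ZeebeClient;
import io.camunda.zeebe.client.api.response.ActivateJobsResponse;
import io.camunda.zeebe.client.api.response.ActivatedJob;

public class TimerStartEventCheck {
  public static void main(String[] args) {
    try (final ZeebeClient client =
        ZeebeClient.newClientBuilder()
            .gatewayAddress("localhost:26500") // assumed local gateway
            .usePlaintext()
            .build()) {

      // Deploy the process model containing the timer start event
      // (resource name is an assumption for this sketch).
      client.newDeployResourceCommand()
          .addResourceFromClasspath("timer-start.bpmn")
          .send()
          .join();

      // After the timer has fired (and the node has been restarted in the
      // experiment), complete the created instance by working off its job.
      final ActivateJobsResponse response =
          client.newActivateJobsCommand()
              .jobType("benchmark-task") // assumed job type
              .maxJobsToActivate(1)
              .send()
              .join();

      for (final ActivatedJob job : response.getJobs()) {
        client.newCompleteCommand(job.getKey()).send().join();
      }
    }
  }
}
```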

11 min read
Christopher Kujawa

On this Chaos day we wanted to experiment a bit with deployments and their distribution.

We ran a chaos experiment in which we deployed multiple workflows and disconnected two leaders, and then verified whether the deployments were distributed later. The chaos experiment was successful and showed how fault tolerant deployment distribution is. 💪