
Experiment with Low Load

· 3 min read
Christopher Kujawa
Chaos Engineer @ Zeebe
  • Run a benchmark with low load
  • Investigate last benchmark failures

Chaos Experiment

We currently seem to have issues with RocksDB, which sometimes generates a lot of SST files during the broker lifetime. This causes the snapshot replication to fail at some point. It is especially problematic after pods get restarted, since a follower normally needs either to be caught up via the log, or the leader will send a snapshot to the follower. If the snapshot contains a lot of files, this gets problematic. In order to understand this better, we would like to find out how we can reproduce it. We expect this to happen only on low load; the assumption is that RocksDB will not trigger compaction as often, because we are not reaching a certain threshold. See the related issue #4887.
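
To get a feeling for how many SST files a snapshot actually contains, we can simply count them on a broker's data volume. The following is only a minimal sketch; the snapshot path used here is an assumption about the data layout and needs to be adjusted to the real partition directory of the deployment.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.stream.Stream;

public class CountSstFiles {

  public static void main(final String[] args) throws IOException {
    // Hypothetical snapshot directory on the broker's data volume; adjust to
    // the actual partition/snapshot layout of your deployment.
    final Path snapshotDir =
        Path.of("/usr/local/zeebe/data/raft-partition/partitions/1/snapshots");

    try (final Stream<Path> files = Files.walk(snapshotDir)) {
      final long sstFiles =
          files
              .filter(Files::isRegularFile)
              .filter(p -> p.getFileName().toString().endsWith(".sst"))
              .count();
      System.out.println("SST files in snapshots: " + sstFiles);
    }
  }
}
```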

Expected

When creating workflow instances and completing them directly afterwards, we normally expect that nothing is left over and nothing should accumulate. If we do it on low load, which means one workflow instance creation and completion per second, then we expect the same. Furthermore, we expect that the used resources are lower than on higher load.
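
To illustrate what low load means here, the following sketch creates one workflow instance per second via the Zeebe Java client. The gateway address and the process id `benchmark` are assumptions and depend on the concrete benchmark setup (older client versions also use different package names).

```java
import io.camunda.zeebe.client.ZeebeClient;

public class LowLoadStarter {

  public static void main(final String[] args) throws InterruptedException {
    // Assumed gateway address; in the benchmark this would point to the deployed gateway.
    try (final ZeebeClient client =
        ZeebeClient.newClientBuilder()
            .gatewayAddress("localhost:26500")
            .usePlaintext()
            .build()) {

      while (true) {
        // One workflow instance creation per second (= the "low load" of this experiment).
        client
            .newCreateInstanceCommand()
            .bpmnProcessId("benchmark") // assumed process id of the benchmark workflow
            .latestVersion()
            .send()
            .join();
        Thread.sleep(1_000);
      }
    }
  }
}
```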

Actual

We have set up our default benchmark with three partitions, three nodes, and replication factor three. We start one workflow instance at a time. We run one worker, which completes the related workflow.
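
The single worker can be sketched in a similar way: it subscribes to the job type of the benchmark workflow and completes every job immediately, so instances finish right after creation. The job type `benchmark-task` is again an assumption about the benchmark workflow.

```java
import io.camunda.zeebe.client.ZeebeClient;
import java.util.concurrent.CountDownLatch;

public class BenchmarkWorker {

  public static void main(final String[] args) throws InterruptedException {
    try (final ZeebeClient client =
        ZeebeClient.newClientBuilder()
            .gatewayAddress("localhost:26500")
            .usePlaintext()
            .build()) {

      // Open a single worker that completes every job of the assumed benchmark job type.
      client
          .newWorker()
          .jobType("benchmark-task") // assumed job type of the benchmark workflow's task
          .handler((jobClient, job) -> jobClient.newCompleteCommand(job.getKey()).send().join())
          .open();

      // Block so the worker keeps polling for jobs.
      new CountDownLatch(1).await();
    }
  }
}
```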

In the general overview we can see that we start and complete one workflow instance per second.

general

The resource consumption looks ok.

resource

But the RocksDB used size and the number of snapshot files seem to be increasing.

rocks1 rocks2 snapshot

I will let the benchmark run a bit longer and I think we need to investigate this issue further.

Investigation Last Benchmark

Every week I create a benchmark on the chaos day and let it run until the next chaos day. Since I did not have much time for trying out other experiments today, and I saw that the old benchmark has problems, I decided to investigate why the processing went down. I collected my observations here and use this more as a summary, since it is not yet clear what the issue is.

In the general overview we can see that the processing is on average under 100 workflow instances per second; normally we would expect something around 130. We can also see that the processing in general is not that stable.

general

If we take a look at the partitions separately, we can see that partition two died quite early.

Partition 1, last 7 days: partition-1-general-7-days
Partition 2, last 7 days: partition-2-general-7-days
Partition 3, last 7 days: partition-3-general-7-days

We can see that the processing seems to have stopped for partition two and never comes back. This seems to have happened on 31-07-2020, around 2 pm. With the resource panel we can also see that a node preemption happened at this time, since all pods seem to have been rescheduled. This can be seen based on the different colors of the graphs.

resources

In the logs we can see that all brokers are getting closed.

Interestingly, the day after we see a continuously growing disk usage, which never gets reduced.

disk-usage

Currently I don't understand what the issue was and why the processing never comes back. I created an issue to further investigate this: #5127. I will keep the old benchmark running and will set up a separate one in order to reproduce this.

Participants

  • @zelldon