
30 posts tagged with "availability"


Using flow control to handle bottleneck on exporting

· 5 min read
Rodrigo Lopes
Associate Software Engineer @ Zeebe

Zeebe 8.6 introduces a new unified flow control mechanism that can limit user commands (by default it aims for 200 ms response times) and rate limit writes of new records in general (disabled by default). Limiting the write rate is a new feature that can be used to prevent building up an excessive exporting backlog. There are two ways to limit the write rate: setting a static limit, or enabling throttling that dynamically adjusts the write rate based on the exporting backlog and rate. In these experiments, we test both ways of limiting the write rate and observe the effects on processing and exporting.

TL;DR; Both setting a static write rate limit and enabling throttling of the write rate can be used to prevent building up an excessive exporting backlog. For users, this will be seen as backpressure, because processing speed is limited by the rate at which processing results can be written.
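To give an intuition for the dynamic variant, here is a minimal Java sketch of how a backlog-based throttle could derive a write limit from the current exporting rate and backlog. The class name, fields, and scaling rule are purely illustrative assumptions and do not mirror Zeebe's actual flow control implementation or its configuration keys.

```java
/**
 * Illustrative sketch only: derives a throttled write rate (records/second)
 * from the current exporting backlog and exporting rate. All names and the
 * scaling rule are hypothetical, not Zeebe's actual implementation.
 */
final class WriteRateThrottle {

  private final long acceptableBacklog; // backlog size tolerated before throttling
  private final long minimumWriteRate;  // never throttle below this rate

  WriteRateThrottle(final long acceptableBacklog, final long minimumWriteRate) {
    this.acceptableBacklog = acceptableBacklog;
    this.minimumWriteRate = minimumWriteRate;
  }

  long throttledWriteRate(final long exportingRate, final long backlog) {
    if (backlog <= acceptableBacklog) {
      // Backlog is small enough: no write rate limit is applied.
      return Long.MAX_VALUE;
    }
    // The further the backlog grows past the threshold, the closer the write
    // rate is pushed towards the rate at which records are actually exported.
    final double factor = (double) acceptableBacklog / backlog;
    return Math.max(minimumWriteRate, (long) (exportingRate * factor));
  }
}
```

The idea is simply that writes may not outpace exporting for long once the backlog crosses a threshold; the real mechanism is configured in the broker settings rather than implemented by users.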

Using flow control to handle uncontrolled process loops

· 6 min read
Rodrigo Lopes
Associate Software Engineer @ Zeebe

Zeebe 8.6 introduces a new unified flow control mechanism that can limit user commands (by default it aims for 200 ms response times) and rate limit writes of new records in general (disabled by default).

Limiting the write rate is a new feature that can be used to prevent building up an excessive exporting backlog.

In these experiments, we test what happens when we deploy endless loops that result in high processing load, and how we can use the new flow control to keep the cluster stable.

TL;DR;

Enabling write rate limiting can help mitigate the effects of process instances that contain uncontrolled loops by preventing the build-up of an excessive exporting backlog.

Reducing the job activation delay

· 12 min read
Nicolas Pepin-Perreault
Senior Software Engineer @ Zeebe

With the addition of end-to-end job streaming capabilities in Zeebe, we wanted to measure the improvements in job activation latency:

  • How much is a single job activation latency reduced?
  • How much is the activation latency reduced between each task of the same process instance?
  • How much is the activation latency reduced on large clusters with a high broker and partition count?

Additionally, we wanted to guarantee that every component involved in streaming, including clients, would remain resilient in the face of load surges.

TL;DR; Job activation latency is greatly reduced, with task-based workloads seeing up to 50% lower overall execution latency. Completing a task now immediately triggers pushing out the next one, meaning the latency to activate the next task in a sequence is bounded by how much time it takes Zeebe to process its completion. Activation latency is unaffected by how many partitions or brokers there are in a cluster, as opposed to job polling, thus ensuring scalability of the system. Finally, reusing gRPC's flow control mechanism ensures clients cannot be overloaded even in the face of load surges, without impacting other workloads in the cluster.

Head over to the documentation to learn how to start using job push!
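As a reference for trying job push yourself, the sketch below opens a worker with streaming enabled via the Zeebe Java client. The gateway address and job type are made-up values; check your client version for the exact builder methods.

```java
import io.camunda.zeebe.client.ZeebeClient;

public final class PushWorkerExample {

  public static void main(final String[] args) throws InterruptedException {
    // Assumes a gateway reachable at localhost:26500 (illustrative address).
    try (final ZeebeClient client =
        ZeebeClient.newClientBuilder()
            .gatewayAddress("localhost:26500")
            .usePlaintext()
            .build()) {

      client
          .newWorker()
          .jobType("payment") // hypothetical job type
          .handler((jobClient, job) ->
              jobClient.newCompleteCommand(job.getKey()).send().join())
          .streamEnabled(true) // opt into job push instead of polling
          .maxJobsActive(32)   // client-side capacity, honored via gRPC flow control
          .open();

      // Keep the worker alive; a real service would manage its lifecycle explicitly.
      Thread.currentThread().join();
    }
  }
}
```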

Broker Scaling and Performance

· 6 min read
Lena Schönburg
Senior Software Engineer @ Zeebe
Deepthi Akkoorath
Senior Software Engineer @ Zeebe

With Zeebe now supporting the addition and removal of brokers to a running cluster, we wanted to test three things:

  1. Is there an impact on processing performance while scaling?
  2. Is scaling resilient to high processing load?
  3. Can scaling up improve processing performance?

TL;DR; Scaling up works even under high load and has low impact on processing performance. After scaling is complete, processing performance improves in both throughput and latency.

Dynamic Scaling with Dataloss

· 5 min read
Lena Schönburg
Senior Software Engineer @ Zeebe

We continue our previous experiments with dynamic scaling by now also testing whether the cluster survives data loss during the process.

One goal is to verify that we haven't accidentally introduced a single point of failure in the cluster. Another is to ensure that data loss does not corrupt the cluster topology.

TL;DR; Even with data loss, the scaling completes successfully and with the expected results. We found that during scaling, a single broker of the previous cluster configuration can become a single point of failure by preventing a partition from electing a leader. This is not exactly a bug, but something that we want to improve.

Dynamically scaling brokers

· 7 min read
Lena Schönburg
Senior Software Engineer @ Zeebe

We experimented with the first version of dynamic scaling in Zeebe, adding or removing brokers in a running cluster.

Scaling up and down is a high-level operation that consists of many steps which need to be carried out cooperatively by all brokers in the cluster. For example, adding new brokers first adds them to the replication groups of the assigned partitions and then removes some of the older brokers from those replication groups. Additionally, priorities need to be reconfigured to ensure that the cluster eventually approaches balanced leadership.

This orchestration over multiple steps ensures that all partitions are replicated by at least as many brokers as configured with the replicationFactor. As always, when it comes to orchestrating distributed systems, there are many edge cases and failure modes to consider.

The goal of this experiment was to verify that the operation is resilient to broker restarts. We can accept that the operation takes longer than usual to complete, but we need to make sure that it eventually succeeds with the expected cluster topology as the result.

TL;DR; Both scaling up and scaling down are resilient to broker restarts; the only effect is that the operation takes longer than usual to complete.
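For orientation, the scaling operation in these experiments is requested through the gateway's cluster management endpoint. The sketch below shows how such a request could be issued from Java; the management port and the /actuator/cluster/brokers path are assumptions based on the dynamic scaling API, so check the documentation for your version before relying on them.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public final class ScaleBrokersExample {

  public static void main(final String[] args) throws Exception {
    final HttpClient http = HttpClient.newHttpClient();

    // Assumption: the gateway's management port (9600) exposes the dynamic
    // scaling API under /actuator/cluster/brokers. The body lists the broker
    // ids the cluster should consist of after scaling, here brokers 0..5.
    final HttpRequest request =
        HttpRequest.newBuilder()
            .uri(URI.create("http://localhost:9600/actuator/cluster/brokers"))
            .header("Content-Type", "application/json")
            .POST(HttpRequest.BodyPublishers.ofString("[\"0\",\"1\",\"2\",\"3\",\"4\",\"5\"]"))
            .build();

    final HttpResponse<String> response =
        http.send(request, HttpResponse.BodyHandlers.ofString());

    // The response describes the planned change; progress can be followed by
    // polling GET /actuator/cluster until no pending change remains.
    System.out.println(response.statusCode() + ": " + response.body());
  }
}
```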

Job push resiliency

· 7 min read
Christopher Kujawa
Chaos Engineer @ Zeebe
Nicolas Pepin-Perreault
Senior Software Engineer @ Zeebe

In today's chaos day we experimented with job push resiliency.

We ran the following experiments today:

  1. Job streams should be resilient to gateway restarts/crash
  2. Job streams should be resilient to leadership changes/leader restarts
  3. Job streams should be resilient to cluster restarts

TL;DR; All experiments succeeded and showcased the resilience of job push, even across component restarts. 🚀

Job push overloading

· 6 min read
Christopher Kujawa
Chaos Engineer @ Zeebe
Nicolas Pepin-Perreault
Senior Software Engineer @ Zeebe

In today's chaos day, we (Nicolas and I) want to verify how job push, and the Zeebe system in general, behaves when we have slow workers.

TL;DR; Right now it seems that even a slow worker doesn't impact the general system; it only affects the corresponding process instance, not other instances. We found no unexpected issues, and everything performed well.

Hot backups impact on processing

· 4 min read
Christopher Kujawa
Chaos Engineer @ Zeebe

Today, we want to experiment with hot backups in SaaS and a larger runtime state in Zeebe, and how (or whether) they impact ongoing processing. This is part of the investigation into a recently created bug issue we wanted to verify/reproduce: #14696.

TL;DR; We were able to show that hot backups indeed do not impact overall processing throughput in Zeebe. We found that a full Elasticsearch disk might impact or even fail your backups, which is not transparent to the user.

Using Large Multi-Instance

· 6 min read
Christopher Kujawa
Chaos Engineer @ Zeebe

New day, new chaos. 💀 In today's chaos day, I want to pick up a topic that has bothered people for a long time. I ran a chaos day around this topic three years ago as well.

Today, we experiment with large multi-instances again. In the recent patch release 8.2.5, we fixed an issue with spawning large multi-instances. Previously, creating a process instance with a large multi-instance was likely to get the process instance blacklisted, because spawning the multi-instance ran into maxMessageSize limitations.

This meant the process instance was stuck and no longer executable. In Operate this was not visible, which caused a lot of friction and confusion for users. With the recent fix, Zeebe should chunk even large collections into smaller batches to spawn and execute the multi-instance without any issues.

TL;DR; We were able to see that even large multi-instances can be executed now. At some point we experienced performance regressions (when creating new multi-instance elements), but the execution of the process instance no longer fails. One problem at a time; we will likely investigate further to improve the performance of such a use case.

When we reached the maxMessageSize, we got a rejection; if the input collection is too large, we see some weird, unexpected errors from NGINX.
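To make the scenario concrete, the sketch below creates a process instance with a large input collection for a multi-instance activity, using the Zeebe Java client. The process id, variable name, and collection size are made up for illustration.

```java
import io.camunda.zeebe.client.ZeebeClient;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

public final class LargeMultiInstanceExample {

  public static void main(final String[] args) {
    // Assumes a deployed process "multi-instance-process" whose multi-instance
    // activity iterates over the "items" variable; the address is illustrative.
    try (final ZeebeClient client =
        ZeebeClient.newClientBuilder()
            .gatewayAddress("localhost:26500")
            .usePlaintext()
            .build()) {

      // A large input collection; with the 8.2.5 fix the element instances are
      // spawned in smaller batches instead of one write exceeding maxMessageSize.
      final List<String> items =
          IntStream.range(0, 100_000).mapToObj(i -> "item-" + i).collect(Collectors.toList());

      client
          .newCreateInstanceCommand()
          .bpmnProcessId("multi-instance-process")
          .latestVersion()
          .variables(Map.of("items", items))
          .send()
          .join();
    }
  }
}
```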