2 posts tagged with "resiliency"

Job push resiliency

December 6, 2023 · 7 min read

Christopher Kujawa

Chaos Engineer @ Zeebe

Nicolas Pepin-Perreault

Senior Software Engineer @ Zeebe

In today's chaos day we experimented with job push resiliency.

We ran the following experiments today:

Job streams should be resilient to gateway restarts/crashes
Job streams should be resilient to leadership changes/leader restarts
Job streams should be resilient to cluster restarts

TL;DR; All experiments succeeded and showed that job streams remain resilient even when components restart. 🚀

Gateway Termination

April 6, 2023 · 8 min read

Christopher Kujawa

Chaos Engineer @ Zeebe

In today's chaos day, we wanted to experiment with the gateway and the resiliency of workers.

In recent weeks, we have seen some issues in our benchmarks when gateways were restarted (see zeebe#11975).

We did a similar experiment in the past; today we want to focus on self-managed setups (benchmarks with our Helm charts). Ideally, we can automate this soon as well.

Today Nicolas joined me on the chaos day 🎉

TL;DR; We were able to show that the workers (clients) can reconnect after a gateway is shut down ✅ Furthermore, we discovered a potential performance issue under lower load, which impacts process execution latency (zeebe#12311).