
Slow Network

· 6 min read
Christopher Kujawa
Chaos Engineer @ Zeebe

On a previous Chaos Day we played around with ToxiProxy, which allows injecting failures on the network level, for example dropping packets or adding latency.

Last week @Deepthi mentioned to me that we can do similar things with tc, a built-in Linux command. Today I wanted to experiment with latency between leader and followers using tc.

TL;DR: The experiment failed; adding a 100 ms network delay to the leader broke the complete processing throughput. 💥

Chaos Experiment

We want to experiment with network latency and see what effect a slow network has on the cluster.

Hypothesis

We expect that we can handle a certain amount of network latency, due to our heartbeat and election timeouts. After reaching these deadlines we expect failovers and followers which are lagging behind.

This means that below a certain threshold we should still be able to process user commands, with slight delays but without real issues. After reaching the deadlines and the resulting failovers, it should be possible to continue processing, since we add the delay to only one node.

Experiment

TC

tc is a built-in Linux command; the manpage summarizes it as tc - show / manipulate traffic control settings.

In order to add delay to a network interface, we can run the following:

tc qdisc add dev eth0 root netem delay 200ms

More details (taken from the blog post):

qdisc: modify the scheduler (aka queuing discipline)
add: add a new rule
dev eth0: rules will be applied on device eth0
root: modify the outbound traffic scheduler (also known as the egress qdisc)
netem: use the network emulator to emulate a WAN property
delay: the network property that is modified
200ms: introduce delay of 200 ms

Adding this kind of rule means that we delay all outgoing connections which go over this network interface.
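
To verify which rules are active on the interface, and to remove the delay again after the experiment, the following tc commands can be used (a minimal sketch for the same interface as above):

# show the qdiscs (and their statistics) currently configured on eth0
tc -s qdisc show dev eth0

# remove the netem delay rule from the root of eth0 again
tc qdisc del dev eth0 root netem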

Actual

In order to reduce the blast radius, we run the experiment with one partition on a three-broker cluster (replication factor 3).
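
For reference, such a topology could be configured via the standard Zeebe broker settings; a sketch of the assumed environment variables (not necessarily the exact setup of this benchmark):

# assumed configuration for a three-broker cluster with one partition, replicated three times
ZEEBE_BROKER_CLUSTER_CLUSTERSIZE=3
ZEEBE_BROKER_CLUSTER_PARTITIONSCOUNT=1
ZEEBE_BROKER_CLUSTER_REPLICATIONFACTOR=3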

Steady State

As we can see in the benchmark, we reach on average ~77 process instance creations and completions per second.


The process instance execution time (from start to end) is under 1 second.

The commit latency is about 100 ms.

We can see that one of the followers is lagging a bit behind, but not too far.


100 ms

In the first iteration of the experiment we added a 100 ms delay to the leader's outgoing traffic.

[zell zell-chaos-day/ cluster: zeebe-cluster ns:zell-chaos-day]$ k exec -it zell-chaos-day-zeebe-gateway-579c76978f-npz2k -- zbctl status --insecure
Cluster size: 3
Partitions count: 1
Replication factor: 3
Gateway version: 1.1.0-SNAPSHOT
Brokers:
Broker 0 - zell-chaos-day-zeebe-0.zell-chaos-day-zeebe.zell-chaos-day.svc.cluster.local:26501
Version: 1.1.0-SNAPSHOT
Partition 1 : Follower, Healthy Broker 1 - zell-chaos-day-zeebe-1.zell-chaos-day-zeebe.zell-chaos-day.svc.cluster.local:26501
Version: 1.1.0-SNAPSHOT
Partition 1 : Follower, Healthy Broker 2 - zell-chaos-day-zeebe-2.zell-chaos-day-zeebe.zell-chaos-day.svc.cluster.local:26501
Version: 1.1.0-SNAPSHOT
Partition 1 : Leader, Healthy

Based on the zbctl output or the Grafana dashboard we can find out who the leader is. Note that the output of zbctl still looks a bit broken, see the related issue #6692.

With the following commands we can add the delay of 100 ms to the leader's outgoing traffic.

[zell zell-chaos-day/ cluster: zeebe-cluster ns:zell-chaos-day]$ k exec -it zell-chaos-day-zeebe-2 -- bash
root@zell-chaos-day-zeebe-2:/usr/local/zeebe# tc qdisc add dev eth0 root netem delay 100ms
root@zell-chaos-day-zeebe-2:/usr/local/zeebe# tc -s qdisc
qdisc noqueue 0: dev lo root refcnt 2
Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
backlog 0b 0p requeues 0
qdisc netem 8001: dev eth0 root refcnt 2 limit 1000 delay 100.0ms
Sent 11635496 bytes 10086 pkt (dropped 0, overlimits 0 requeues 0)
backlog 339773b 77p requeues 0
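
To double check that the delay is really in place, we could measure the round-trip time from one of the other brokers to the delayed leader, for example with ping (assuming it is available in the container image; the hostname is taken from the topology above). Since the echo replies are outgoing traffic on broker 2, they should arrive roughly 100 ms late:

# measure the RTT from broker 0 to the delayed leader (broker 2)
k exec -it zell-chaos-day-zeebe-0 -- ping -c 3 zell-chaos-day-zeebe-2.zell-chaos-day-zeebe.zell-chaos-day.svc.cluster.local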

Almost immediately we see a drop in the general section of the Grafana dashboard.


The backpressure increased significantly. As expected, the commit latency increased.


The processing latency increased as well.


It was unexpected that the throughput would break down this much. We can see in the send request metrics that a lot of requests end with timeouts or with resource exhausted errors.


Interestingly, no worker is able to activate or complete any job. This causes the number of running process instances to increase, so the state is growing.


Taking a look at the Raft metrics, we see that this already caused some heartbeat misses, but no leader change.


The follower is now lagging much further behind. We can see in the logs that the leader tries to send InstallRequests, but these are also timing out.

RaftServer{raft-partition-partition-1} - InstallRequest{currentTerm=1, leader=2, index=1173627, term=1, version=1, chunkId=HeapByteBuffer{position=0, remaining=10, limit=10, capacity=10, mark=java.nio.HeapByteBuffer[pos=0 lim=10 cap=10], hash=1744147670}, nextChunkId=HeapByteBuffer{position=0, remaining=7, limit=7, capacity=7, mark=java.nio.HeapByteBuffer[pos=0 lim=7 cap=7], hash=1283029304}, data=HeapByteBuffer{position=0, remaining=10626817, limit=10626817, capacity=10626817, mark=java.nio.HeapByteBuffer[pos=0 lim=10626817 cap=10626817], hash=1083445787}, initial=false, complete=false} to 1 failed: java.util.concurrent.CompletionException: java.util.concurrent.TimeoutException: Request ProtocolRequest{id=489616, subject=raft-partition-partition-1-install, sender=zell-chaos-day-zeebe-2.zell-chaos-day-zeebe.zell-chaos-day.svc.cluster.local:26502, payload=byte[]{length=10647731, hash=1655826849}} to zell-chaos-day-zeebe-1.zell-chaos-day-zeebe.zell-chaos-day.svc.cluster.local:26502 timed out in PT5S"

The follower regularly starts elections, but is not able to replace the existing leader.

In general, we can see the start of the delay injection quite clearly in the gateway logs.

D 2021-07-06T10:05:51.263195Z Received REACHABILITY_CHANGED for broker 2, do nothing.
D 2021-07-06T10:05:53.264480Z Received REACHABILITY_CHANGED for broker 1, do nothing.
D 2021-07-06T10:05:54.184981Z Received REACHABILITY_CHANGED for broker 2, do nothing.
D 2021-07-06T10:05:54.213904Z Received REACHABILITY_CHANGED for broker 1, do nothing.
D 2021-07-06T10:05:54.361408Z Received REACHABILITY_CHANGED for broker 0, do nothing.
D 2021-07-06T10:05:54.986270Z Received REACHABILITY_CHANGED for broker 0, do nothing.
D 2021-07-06T10:05:56.617134Z Received REACHABILITY_CHANGED for broker 2, do nothing.
D 2021-07-06T10:05:57.072707Z Expected to handle gRPC request, but request timed out between gateway and broker
D 2021-07-06T10:05:57.126207Z Expected to handle gRPC request, but request timed out between gateway and broker
D 2021-07-06T10:05:57.157683Z Expected to handle gRPC request, but request timed out between gateway and broker

We see similar logs on the broker side.

In general, we can say the experiment failed. The cluster was not able to run our normal workload. It behaved quite badly, but there was no leader change at all.

250 ms

In order to verify whether 250 ms would cause a leader election, we reconfigured the delay (see the sketch below). Performance-wise it looked quite similar, but the heartbeat misses for one of the followers increased. Still, it was not enough to cause a leader change.
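
Reconfiguring an existing netem rule can be done in place, without deleting and re-adding it; roughly like this on the leader's interface:

# change the already installed netem rule on eth0 to a 250 ms delay
tc qdisc change dev eth0 root netem delay 250ms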

After running this configuration for several minutes (~30), we observed that one of the other followers missed heartbeats as well. This finally caused a leader change.


The throughput came back; it was similar to what it was before.


The processing execution latency was higher than usual.


The same is true for the commit latency; it is interesting to see such an effect caused by a follower with network issues.


Result

As written before, the experiment failed; the hypothesis was not met. We were not able to add network latency that is well below our deadlines (the heartbeat timeout is 2.5 seconds) without harming our processing throughput and latency.
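
For reference, these deadlines are configurable; a hedged sketch, assuming the heartbeat and election timeouts are exposed via the broker's cluster configuration (the exact keys and defaults should be checked against the broker configuration template):

# assumed settings; the election timeout corresponds to the 2.5 s deadline mentioned above
ZEEBE_BROKER_CLUSTER_HEARTBEATINTERVAL=250ms
ZEEBE_BROKER_CLUSTER_ELECTIONTIMEOUT=2500ms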

Still, we made some interesting observations which we can use for our next experiments, and gained insights which we need to consider when setting up Zeebe in environments where the network might be unreliable or slow.