
Play around with ToxiProxy

4 min read
Christopher Kujawa

First chaos day since my parental leave 🎉.

Today I wanted to play a bit with Toxiproxy. Toxiproxy is a framework for simulating network conditions, which makes it ideal for running some chaos experiments on the network.

Run ToxiProxy

Download the latest version (server and CLI) from the release page.
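
For example (the version tag below is just an assumption, pick whatever the latest release is):

wget https://github.com/Shopify/toxiproxy/releases/download/v2.1.4/toxiproxy-server-linux-amd64
wget https://github.com/Shopify/toxiproxy/releases/download/v2.1.4/toxiproxy-cli-linux-amd64
chmod +x toxiproxy-server-linux-amd64 toxiproxy-cli-linux-amd64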

Start a broker via Docker.

docker pull camunda/zeebe:SNAPSHOT
docker run -p 26500:26500 camunda/zeebe:SNAPSHOT

Start the Toxiproxy server.

./toxiproxy-server-linux-amd64 start

Create a proxy for Zeebe:

./toxiproxy-cli-linux-amd64 create zeebe-proxy -l localhost:26379 -u localhost:26500
Created new proxy zeebe-proxy

You should see something like this in the Toxiproxy server log:

INFO[0031] Started proxy                                 name=zeebe-proxy proxy=127.0.0.1:26379 upstream=localhost:26500

Try zbctl to request the topology.

./zbctl --address localhost:26379 status --insecure

Cluster size: 1
Partitions count: 1
Replication factor: 1
Gateway version: 0.25.0-SNAPSHOT
Brokers:
Broker 0 - 172.17.0.2:26501
Version: 0.25.0-SNAPSHOT
Partition 1 : Leader

In the Toxiproxy server log it shows up as:

INFO[0149] Accepted client                               client=127.0.0.1:41510 name=zeebe-proxy proxy=127.0.0.1:26379 upstream=localhost:26500
WARN[0149] Source terminated bytes=245 err=read tcp 127.0.0.1:56178->127.0.0.1:26500: use of closed network connection name=zeebe-proxy

Add latency to requests:

$ ./toxiproxy-cli-linux-amd64 toxic add -t latency -a latency=5000 zeebe-proxy
Added downstream latency toxic 'latency_downstream' on proxy 'zeebe-proxy'

Running zbctl again:

 ./zbctl --address localhost:26379 status --insecure
Error: rpc error: code = DeadlineExceeded desc = context deadline exceeded

Updating the existing toxic:

./toxiproxy-cli-linux-amd64 toxic update -n latency_downstream -t latency -a latency=500 zeebe-proxy

Running zbctl again:

time ./zbctl --address localhost:26379 status --insecure
Cluster size: 1
Partitions count: 1
Replication factor: 1
Gateway version: 0.25.0-SNAPSHOT
Brokers:
Broker 0 - 172.17.0.2:26501
Version: 0.25.0-SNAPSHOT
Partition 1 : Leader

real 0m1.045s
user 0m0.012s
sys 0m0.021s

Inspect existing toxics:

$ ./toxiproxy-cli-linux-amd64 inspect zeebe-proxy
Name: zeebe-proxy Listen: 127.0.0.1:26379 Upstream: localhost:26500
======================================================================
Upstream toxics:
Proxy has no Upstream toxics enabled.

Downstream toxics:
latency_downstream: type=latency stream=downstream toxicity=1.00 attributes=[ jitter=0 latency=500 ]

With the toxicity attribute we can control whether a toxic is applied to all connections or only to a percentage of them. It is also possible to add the latency toxic to the upstream instead of the downstream. There are other things we can inject as well, like slicing and delaying packets, dropping packets and limiting the bandwidth.
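
As a sketch, assuming the CLI's --upstream and --toxicity flags and the bandwidth toxic's rate attribute (in KB/s) behave as documented, an upstream bandwidth limit that only hits half of the connections would look roughly like this:

# Sketch: limit the upstream bandwidth to 100 KB/s, applied to only 50% of the connections.
./toxiproxy-cli-linux-amd64 toxic add zeebe-proxy -t bandwidth --upstream --toxicity 0.5 -a rate=100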

Possible new experiments:

  • introduce latency between one follower and the leader - if only one follower experiences delays, we expect that no election is started (a rough sketch follows after this list)
  • introduce latency between the gateway and the broker - see whether commands time out
  • slice packets - drop some packets, but not all of them - expect that commands are still sent correctly, since requests are retried
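
A rough sketch for the first experiment, assuming a local two-broker setup where the second broker's contact point can be pointed at a proxy instead of directly at broker 0's internal cluster port (26502 is the default; the exact contact point wiring is left out here):

# Proxy broker 0's internal cluster port and add latency only on this path.
./toxiproxy-cli-linux-amd64 create broker-0-internal -l localhost:36502 -u localhost:26502
./toxiproxy-cli-linux-amd64 toxic add broker-0-internal -t latency -a latency=1000 -a jitter=100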

Slice packets

Slice the TCP stream into packets of around 128 bytes (on average):

./toxiproxy-cli-linux-amd64 toxic add zeebe-proxy -t slicer -a average_size=128

Publishing a message seems to work:

$ time ./zbctl --address localhost:26379 publish message failing --insecure --correlationKey key --variables "{}"
{
"key": 2251799813685253
}

After limiting it to 32 bytes:

$ ./toxiproxy-cli-linux-amd64 toxic update -n slicer_downstream -a average_size=32 zeebe-proxy
Updated toxic 'slicer_downstream' on proxy 'zeebe-proxy'

Publishing a message does not seem to work as expected anymore.

$ time ./zbctl --address localhost:26379 publish message failing --insecure --correlationKey key --variables "{}"
null

real 0m0.039s
user 0m0.007s
sys 0m0.023s

Actually, I would expect an error here instead of just getting null returned.
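
To double check that the slicer really is the cause, the toxic can be removed again (assuming toxic remove works analogously to the add and update commands above) and the publish should then succeed:

./toxiproxy-cli-linux-amd64 toxic remove -n slicer_downstream zeebe-proxy
./zbctl --address localhost:26379 publish message failing --insecure --correlationKey key --variables "{}"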

Chaos Experiment

No Leader change on high load

Peter volunteered to automate a new chaos experiment, where we put high load on a broker and expect no leader change. This was previously an issue, since the leaders were not able to send their heartbeats in time. Related issue #7.
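
A minimal manual sketch of that experiment (the process model name and the instance count are placeholders; the real version will be automated):

# Generate load by creating many instances, then verify the leader has not changed.
./zbctl --address localhost:26500 deploy benchmark.bpmn --insecure
for i in $(seq 1 1000); do
  ./zbctl --address localhost:26500 create instance benchmark --insecure > /dev/null
done
./zbctl --address localhost:26500 status --insecure   # expect: same broker still reported as Leader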

Time reset

I wanted to work on the clock reset #3. This does not seem to be easily possible in Kubernetes, or at least not with our current images, since we need root privileges for that.

root@zell-time-reset-zeebe-0:/usr/local/zeebe# date -s $(date +%Y-%m-%dT%H:%M:%S)
date: cannot set date: Operation not permitted
Tue Oct 6 11:51:19 UTC 2020
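
For reference, outside of Kubernetes this should work as soon as the container gets the SYS_TIME capability (a quick local sketch with a plain ubuntu image; note that this changes the shared kernel clock, i.e. the host time as well):

docker run --cap-add SYS_TIME --rm ubuntu date -s "2020-10-06 12:00:00"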

It seems that Chaos Mesh supports something like that for Kubernetes, so it might be worth taking a look at https://pingcap.com/blog/simulating-clock-skew-in-k8s-without-affecting-other-containers-on-node

Participants

  • @pihme
  • @zelldon