
News from Camunda Exporter project

· 4 min read
Christopher Kujawa
Chaos Engineer @ Zeebe

In this Chaos day, we want to verify the current state of the Camunda Exporter project and run benchmarks with it. Comparing against a previous version (v8.6.6) should give us a good indication of where we stand and of potential improvements.

TL;DR; The latency of user data availability has improved due to our architecture change, but we still need to fix some bugs before our planned release of the Camunda Exporter. This experiment allowed us to detect three new bugs; fixing them should make the system more stable.

Chaos Experiment

Benchmarks

We have seen in previous experiments and benchmarks that the realistic benchmarks are not yet fully reliable, as they seem to overload the system at some point. This can happen if there is a hiccup and jobs take longer to process: jobs in the queue get delayed and time out, and are handed out to different workers. At some point we reach those jobs again and publish a message for them as well. In general this increases the load on the system, as we have to time out jobs, handle additional message publishes, etc.

Additionally, a message publish can be rejected; when this happens we wait for another timeout, which adds further load to the system, causes more and more retries, and so on. This breaks the benchmark performance.
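
To make this cascade more concrete, here is a minimal sketch of a benchmark-style worker that completes jobs and publishes a message per job, using the Zeebe Java client. The gateway address, job type, message name, and timeout are assumptions for illustration, not the actual benchmark configuration:

```java
import io.camunda.zeebe.client.ZeebeClient;
import java.time.Duration;

public class WorkerSketch {
  public static void main(String[] args) throws InterruptedException {
    try (ZeebeClient client = ZeebeClient.newClientBuilder()
        .gatewayAddress("localhost:26500")
        .usePlaintext()
        .build()) {

      client.newWorker()
          .jobType("benchmark-task") // assumed job type
          .handler((jobClient, job) -> {
            // Complete the job ...
            jobClient.newCompleteCommand(job.getKey()).send().join();
            // ... and publish a message for it. Under overload, completes and
            // publishes can be rejected or time out and will be retried,
            // which adds even more load.
            client.newPublishMessageCommand()
                .messageName("benchmark-message") // assumed message name
                .correlationKey(String.valueOf(job.getProcessInstanceKey()))
                .send()
                .join();
          })
          // If a job is not completed within this timeout, the broker hands it
          // out to another worker again, which is the cascade described above.
          .timeout(Duration.ofSeconds(30))
          .open();

      Thread.sleep(60_000); // keep the worker running for a while
    }
  }
}
```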

To avoid this, we reduced the benchmark payload for now, which is in charge of creating multiple instances via call activities, etc. To be specific, we reduced the items from 50 to 5 but scaled the starter to start more instances. With this payload we can scale at a finer granularity: each instance can create 5 sub-instances, so when creating three process instances we effectively create 15 instances/tokens.
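
As an illustration of that arithmetic, the following is a hypothetical sketch of the adjusted starter using the Zeebe Java client; the gateway address, process id, and variable name are assumptions, not the real benchmark payload:

```java
import io.camunda.zeebe.client.ZeebeClient;
import java.util.List;
import java.util.Map;

public class StarterSketch {
  public static void main(String[] args) {
    try (ZeebeClient client = ZeebeClient.newClientBuilder()
        .gatewayAddress("localhost:26500")
        .usePlaintext()
        .build()) {

      // Reduced payload: 5 items per instance instead of 50 (variable name is assumed).
      Map<String, Object> variables = Map.of("items", List.of(1, 2, 3, 4, 5));

      // Three root instances, each fanning out into 5 sub-instances via
      // call activities: 3 x 5 = 15 instances/tokens in total.
      for (int i = 0; i < 3; i++) {
        client.newCreateInstanceCommand()
            .bpmnProcessId("benchmark-process") // assumed process id
            .latestVersion()
            .variables(variables)
            .send()
            .join();
      }
    }
  }
}
```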

As this benchmark runs quite stably, it allows us to better compare the latency between base and main.

Experiment Details

We will run two benchmarks: one against 8.6.6, called base, and one against the current main branch (commit a1609130).

Expected

When running base and main and comparing them against each other, we expect the general throughput to be similar. Furthermore, we expect that the latency until the user sees data (i.e., until data is written into ES and searchable) should be lower on the main branch than on base.

Note: Right now we don't have a good metric to measure when data is available for the user; we plan to implement this in the starter benchmark application at some point by querying the REST API. For now, we add up different average latencies and assume a constant Elasticsearch flush interval of 2 seconds.
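
As a rough sketch of that approximation (the split into processing and export/import latency is an assumption based on the dashboard panels):

$$
t_{\text{user}} \approx \bar{t}_{\text{processing}} + \bar{t}_{\text{export/import}} + t_{\text{ES flush}},
\qquad t_{\text{ES flush}} = 2\,\text{s}
$$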

We expect a reduction in latency, as we remove one additional hop: the use of ES as intermediate storage before aggregation.

Base

[Figure: current-8.6]

Main

[Figure: main-target]

Actual

We have set up both benchmarks, running as described above with changed payloads.

General Performance

The general throughput performance looks similar. The resource consumption looks similar as well, but we didn't investigate this more deeply; this will be done separately.

Base general

[Figure: base-general]

Main general

[Figure: main-general]

Latency

This experiment aims to show the difference in the data availability for the user.

To visualize this better, the dashboard has been adjusted for this experiment.

Base latency

[Figure: base-latency]

Main latency

As expected, we were able to reduce the latency until data is available to the user by removing the additional ES flush, which cuts roughly 2 seconds.

[Figure: main-latency]

Result

We were able to show that the latency has been reduced under normal load.

Note: Be aware that this experiment only runs benchmarks with low-to-normal load; under higher load this might change and needs to be tested separately.

Found Bugs

During the experiment, we ran into several other issues. Especially after running for a while, when pods got restarted and the importer was enabled, the Camunda Exporter broke.

[Figure: exporting-fail]

This caused the latency to increase.

[Figure: exporting-fail-latency]

The exporter was no longer able to correctly detect that importing was done, and it was still flushing periodically (which is also wrong).

See related GitHub issue(s)

Furthermore, based on the logs we saw that the treePath had not been published correctly by the Exporter.