
Riak tuning 1


June 16, 2016 - Conclusions disputed

The performance tests mentioned below were duplicated in 2016 and produced the same throughput numbers. However, thread names now visible via "top -H -p $(pgrep beam)" led to different conclusions ... which in turn led to new, slightly different performance tests and different recommendations. See this link for details:

+S to counter Erlang busy wait

The short answer is that NUMA has very little to do with the performance variances below; the culprit is Erlang's busy wait logic. A "+S" adjustment is still the best generic solution ("+sbwt" only works here and there). The recommendation is to reduce "+S x:x" by one for every six logical CPUs in the server, e.g. 24 logical CPUs use "+S 20:20", 12 logical CPUs use "+S 10:10", etc.
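As a rough illustration of that rule, the scheduler count can be derived from the logical CPU count (a sketch only; it assumes a Linux host where "nproc" reports the logical CPU count, and the variable names are purely for illustration):

# drop one scheduler for every six logical CPUs (integer division)
CPUS=$(nproc)
SCHEDULERS=$((CPUS - CPUS / 6))
echo "+S ${SCHEDULERS}:${SCHEDULERS}"    # 24 CPUs -> +S 20:20, 12 CPUs -> +S 10:10

The resulting "+S" line goes into vm.args, as in the examples later on this page.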

The original conclusions are disputed because, on a machine with 24 logical cores (2 physical NUMA processor chips):

  • yes "+S 12:12" is faster than Erlang's default "+S 24:24"
  • but "+S 20:20" is even faster than "+S 12:12"

NUMA may have an impact, but the Erlang busy wait appears to be the critical factor in these tests.

Summary:

leveldb achieves higher read and write throughput in Riak if the Erlang scheduler count is limited to half the number of CPU cores. Tests have demonstrated throughput improvements of 15% to 80%.

The scheduler limit is set in the vm.args file:

+S x:x

where "x" is the number of schedulers Erlang may use. Erlang's default value of "x" is the total number of CPUs in the system. For Riak installations using leveldb, the recommendation is to set "x" to half the number of CPUs. Virtual environments are not yet tested.

Example: for a 24 CPU system

+S 12:12
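After a node restarts with the new setting, the scheduler count it actually booted with can be checked from the node's console (a sketch; it assumes the stock "riak attach" command and the standard erlang:system_info/1 call):

riak attach
%% at the attached Erlang prompt:
erlang:system_info(schedulers).          %% schedulers configured at boot, 12 with the setting above
erlang:system_info(schedulers_online).   %% schedulers currently online

Detach afterwards according to the riak attach documentation for your version rather than interrupting the node.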

April 6, 2014 - Update for Riak 2.0

Riak 2.0 includes a newer revision of Erlang (R16B02). It also includes the recent mv-hot-threads branch for leveldb. These two upgrades have changed the performance characteristics relative to the +S option … and have led to a completely different recommendation than the original recommendation for Riak 1.x detailed below.

The Riak 2.0 recommendation for +S is to use the option only if your server is both NUMA capable AND contains two or more NUMA memory zones. All test loads were consistent with this recommendation: the loads were applied with and without +S on a non-NUMA server, a single-zone (single physical processor) NUMA server, and a multi-zone (two physical processors) NUMA server. Only the multi-zone server benefited from +S as described below; the other two servers performed worse with +S.
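One quick way to see whether a server actually has more than one NUMA memory zone (a sketch; it assumes lscpu and/or numactl are installed, which is not true on every distribution):

lscpu | grep -i "numa node(s)"   # e.g. "NUMA node(s): 2" means two memory zones
numactl --hardware               # "available: 2 nodes (0-1)" likewise indicates two zones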

Discussion:

We have tested a limited number of CPU configurations and customer loads. In all cases, there is a performance increase when the +S option is added to the vm.args file to reduce the number of Erlang schedulers. The working hypothesis is that the Erlang schedulers perform enough "busy wait" work that they regularly force context switches away from leveldb even when leveldb is the only system task with real work to do.

The tests included 8 CPU (no hyperthreading, physical cores only) and 24 CPU (12 physical cores with hyperthreading) systems. All were 64-bit Intel platforms. Generalized findings:

  • servers running a higher number of vnodes (64) had larger performance gains than those with fewer (8)
  • servers running SSD arrays had larger performance gains than those running SATA arrays
  • Get and Write operations showed performance gains; 2i query operations (leveldb iterators) were unchanged
  • not recommended for servers with fewer than 8 CPUs (go no lower than +S 4:4)

Performance improvements were as high as 80% over extended, heavily loaded intervals on servers with SSD arrays and 64 vnodes. No test resulted in worse performance due to the addition of +S x:x.

The +S x:x configuration change does not have to be applied to an entire Riak cluster simultaneously. The change may be applied to a single server for verification. Steps: update the vm.args file, then restart the Riak node. Changing the scheduler count via the Erlang command line was ineffective.
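As a sketch of applying the change to one node (assuming a packaged Riak install where the riak script manages the node and vm.args lives in the platform's etc directory; exact paths vary by packaging):

# edit vm.args on the node under test, adding or changing the line:
#   +S 12:12
riak stop
riak start
riak ping    # the node should respond once it is back up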

This configuration change has been running in at least one large, multi-datacenter production environment for several months.
