Develop 3.0 ensemble #1331

Merged
merged 3 commits into from
Oct 31, 2019


@martinsumner martinsumner commented Oct 28, 2019

These changes primarily correct the test conditions so that target_n_val is set correctly, and so that the number of nodes and the n-vals used in the tests work consistently, with no chance that a minority of nodes might sometimes contain a majority of partitions.

There is still an outstanding issue with ensemble_remove_node2 - basho/riak_core#943

There is still a very intermittent failure in ensemble_sync. Very rarely one of the PUTs will time out: https://github.com/basho/riak_test/blob/develop-2.9/tests/ensemble_sync.erl#L109-L110. It is not the first PUT. On one occasion it was associated with a vnode unexpectedly trying to hand off via raw_put, and the bang returning a badarg.

Note also - basho/riak#994

See - https://docs.riak.com/riak/kv/latest/configuring/strong-consistency/index.html#setting-the-target-n-val-parameter

Previously ensemble_util did not set target_n_val, so it defaulted to 4 even though nval is 5 in many riak_ensemble tests.

With this setting, a non-quorum minority partition (2 nodes) could contain a majority of peers (3 peers), and so it would not be possible for every ensemble to prove quorum during such a partition.
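The arithmetic behind this can be checked with a small sketch. It is illustrative only, not code from riak_test or riak_core: the function name, the node names and the two example placements are assumptions. It takes the list of nodes owning the peers of one ensemble and asks whether any strict minority of the cluster's nodes owns a majority of those peers.

```python
from collections import Counter
from itertools import combinations


def minority_can_hold_majority(peer_owners, cluster_nodes):
    """Return True if some strict minority of nodes owns a majority of peers.

    peer_owners   -- owning node of each peer in one ensemble (hypothetical input)
    cluster_nodes -- all nodes in the cluster
    """
    peers_per_node = Counter(peer_owners)
    majority_of_peers = len(peer_owners) // 2 + 1
    # Largest group of nodes that is still a strict minority of the cluster.
    largest_minority = (len(cluster_nodes) - 1) // 2
    # Checking only the largest minority is enough: adding nodes never
    # reduces the number of peers a group owns.
    for group in combinations(cluster_nodes, largest_minority):
        if sum(peers_per_node[node] for node in group) >= majority_of_peers:
            return True
    return False


nodes = ["n1", "n2", "n3", "n4", "n5"]

# Assumed placement when only 4 consecutive partitions are guaranteed
# distinct owners (target_n_val = 4) but the ensemble has 5 peers (nval = 5).
print(minority_can_hold_majority(["n1", "n2", "n3", "n4", "n1"], nodes))  # True

# Assumed placement with all 5 peers on distinct nodes (target_n_val = 5).
print(minority_can_hold_majority(["n1", "n2", "n3", "n4", "n5"], nodes))  # False
```

In the first placement n1 owns two peers, so the pair n1 and n2 is a 2-node minority holding 3 of the 5 peers, which is the case described above.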

Where tests are going to partition the cluster they need to have at least nval + 1 nodes - as without this the target_n_val will not be achievable.

Also stops overriding the backend. This runs the risk that the tests assume the backend has the features of the in-memory backend (e.g. non-persisting), but that isn't documented. The tests appear to be more stable with the bitcask backend. All tests still pass at least intermittently with the in-memory backend.

The test also uses more realistic defaults. It increases the super-fast anti-entropy tick, and normalises the vnode_management_timer. This appears to increase stability in tests (at the cost that more sleeps are required to give time for completion).
Currently riak develop-3.0 does not support basho-patches in the code path. Bodge this so we can still add and save a patch (maybe; perhaps just adding the path this way won't let the intercept survive the reboot?).

basho/riak#994

@ThomasArts ThomasArts left a comment

I am not an ensemble expert and cannot really judge the impact, but it seems sound to make these changes. They basically make the tests more stable and don't change the code. Moreover, test behaviour has changed a bit, but not in essence.

Wait for the replace and the leave to complete - test now fails 80% of the time.
@martinsumner martinsumner merged commit d89420d into develop-3.0 Oct 31, 2019
@martinsumner martinsumner deleted the develop-3.0-ensemble branch October 8, 2021 19:40