This repository has been archived by the owner on Aug 23, 2023. It is now read-only.

Membersforquery fixes #808

Merged
merged 4 commits into clear_cache_api from membersforquery-fixes on Jan 10, 2018

Conversation

@Dieterbe Dieterbe (Contributor) commented Jan 3, 2018

TestPeersForQueryMulti occasionally fails in CI. I've seen this in various non-cluster-related PRs and, I think, also in master. I believe it has become more frequent recently, but I'm not sure.
When I run for i in $(seq 1 200); do go test -run=TestPeersForQueryMulti >> master.txt; done, I observe 2 failures and 198 passes.

See the individual commit messages for how I've gone about addressing this.

A node's priority may change in between the two checks, which with the previous code could cause (see the sketch below):
* a node not being taken into account at all
* the list of nodes for a given prio being replaced by a single node with the same prio
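As an aside, the read-twice pattern described above can be sketched like this. It is a minimal hypothetical sketch, not the actual metrictank code: Node, membersMap and GetPriority are stand-in names, and the point is only to show how reading a mutable priority twice can clobber an existing group, and how reading it once avoids that.

// Illustrative sketch only; Node, membersMap and GetPriority are
// hypothetical stand-ins for the real cluster types.
type Node interface {
	GetPriority() int
}

// racyAdd reads the priority twice: if another goroutine changes it between
// the two reads, the peer can end up under the wrong key, or a single-node
// slice can overwrite the list that already existed for that priority.
func racyAdd(membersMap map[int][]Node, peer Node) {
	if _, ok := membersMap[peer.GetPriority()]; !ok { // first read
		membersMap[peer.GetPriority()] = []Node{peer} // second read may differ
		return
	}
	membersMap[peer.GetPriority()] = append(membersMap[peer.GetPriority()], peer)
}

// safeAdd reads the priority once and uses the local copy for every lookup.
func safeAdd(membersMap map[int][]Node, peer Node) {
	prio := peer.GetPriority()
	membersMap[prio] = append(membersMap[prio], peer)
}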
Note that in practice this doesn't work well. Out of 200 runs:

$ ./analyze.sh
failures
197
Mentions of 'Line 85:' specifically
197
passes:
3

sed -n "s/.*Expected '\(.*\)' to almost equal '500'.*/\1/p" out.txt | goplot hist                                                                                    ⏎
462.00 -> 469.90: ▇▇▇  2
          477.80: ▇▇▇▇▇▇▇▇▇  6
          485.70: ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇  29
          493.60: ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇  30
          501.50: ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇  34
          509.40: ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇  38
          517.30: ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇  30
          525.20: ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇  23
          533.10: ▇▇▇▇  3
          541.00: ▇▇▇  2

So our distribution is off by up to 41/500 = 0.082 = 8.2%.
$ ./analyze.sh
failures
198
Mentions of 'Line 85:' specifically
198
passes:
2
$ sed -n "s/.*Expected '\(.*\)' to almost equal '5000'.*/\1/p" out.txt2 | goplot hist
4867.00 -> 4893.90: ▇  1
           4920.80: ▇▇▇▇▇▇▇▇▇▇▇▇▇  9
           4947.70: ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇  19
           4974.60: ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇  41
           5001.50: ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇  35
           5028.40: ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇  40
           5055.30: ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇  27
           5082.20: ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇  16
           5109.10: ▇▇▇▇▇▇▇▇▇▇▇  8
           5136.00: ▇▇  2

Now the worst delta is 136/5000 = 0.0272 = 2.7%.
I suspect that if we crank up the number of runs significantly we should get better convergence: a smaller delta that can ultimately pass the "almost equal" check consistently.
However, I would argue that these bounds are too loose. I don't want convergence to only appear after thousands of requests. See the next commit.
@@ -142,7 +145,8 @@ LOOP:
}
// if no nodes have been selected yet then grab a
// random node from the set of available nodes
selected := candidates.nodes[rand.Intn(len(candidates.nodes))]
pos := int(atomic.AddUint64(&counter, 1)) % len(candidates.nodes)
@replay replay (Contributor) commented Jan 3, 2018
Should that counter maybe get reset to 0 at some point? I know a uint64 is big, but being optimistic we should assume MT can have a long lifetime without restarting :)
Or do we just rely on it going back to 0 once it is full?

@Dieterbe Dieterbe (Contributor, Author) replied:
When it wraps around, all the logic keeps functioning fine as far as I can see.
In fact, I might as well have used a uint32 for that reason.

@replay replay (Contributor) replied:

Yeah, true.
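For what it's worth, here is a small standalone sketch of the wraparound behaviour discussed above. It is hypothetical and slightly different from the PR's code (the modulo is taken in uint64 before converting to int): it just shows that when the counter wraps, the computed index stays within range and the rotation simply continues.

package main

import (
	"fmt"
	"math"
	"sync/atomic"
)

func main() {
	nodes := []string{"a", "b", "c"}

	// Start just below the uint64 maximum so the loop crosses the wraparound.
	counter := uint64(math.MaxUint64 - 2)

	for i := 0; i < 6; i++ {
		// The modulo is taken in uint64, so the index is always in [0, len(nodes)).
		pos := int(atomic.AddUint64(&counter, 1) % uint64(len(nodes)))
		fmt.Println(nodes[pos])
	}
}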

@Dieterbe Dieterbe (Contributor, Author) commented Jan 3, 2018

I just realized the suggested counter-remainder approach may result in a skewed distribution if the increments align with the number of membersMap iterations (e.g. based on the cluster topology), so I need to fine-tune it a bit more.
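A hypothetical illustration of that alignment problem (the numbers and names below are made up for the example, not taken from the actual code): if every MembersForQuery call bumps the shared counter a fixed number of times, one per membersMap entry, and that number is a multiple of a group's node count, then that group sees the same remainder on every call and keeps picking the same node.

package main

import "fmt"

func main() {
	nodes := []string{"a", "b"} // candidate nodes of one hypothetical priority group
	picks := map[string]int{}
	counter := 0

	for call := 0; call < 100; call++ {
		// Assume each MembersForQuery call iterates over two membersMap
		// entries and increments the shared counter once per entry.
		for group := 0; group < 2; group++ {
			counter++
			if group == 1 { // the entry that owns our two candidate nodes
				picks[nodes[counter%len(nodes)]]++
			}
		}
	}
	fmt.Println(picks) // map[a:100] -- node "b" is never selected
}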

The randomness, both in memberlist order and in the selection of peers,
was creating too much noise.

$ ./analyze.sh
failures
0
Mentions of 'Line 85:' specifically
0
passes:
200
@replay replay previously requested changes Jan 10, 2018
// if no nodes have been selected yet then grab a node from
// the set of available nodes in such a way that nodes are
// weighted fairly across MembersForQuery calls
selected := candidates.nodes[count%len(candidates.nodes)]
@replay replay (Contributor) commented Jan 10, 2018
Is reading the count like this atomic?

@replay replay (Contributor) replied:
Ah, sorry, got it now. The count is the return value of the atomic operation above.

@replay replay dismissed their stale review January 10, 2018 13:03

Never mind, I understood how it works.

@woodsaj woodsaj (Member) left a comment

LGTM

@Dieterbe Dieterbe merged commit c0f9312 into clear_cache_api Jan 10, 2018
@Dieterbe Dieterbe mentioned this pull request Apr 11, 2018
@Dieterbe Dieterbe deleted the membersforquery-fixes branch September 18, 2018 08:59