Conversation
a node's priority may be changed in between the two checks, which with the previous code could cause:
- a node not being taken into account at all
- the list of nodes for a given prio being replaced by a single node with the same prio
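To make that failure mode concrete, here is a purely hypothetical sketch (none of these names come from cluster.go) of how reading a mutable priority more than once per node can produce both symptoms:

```go
package main

import "sync/atomic"

// node is a hypothetical stand-in for a cluster member whose priority can
// be updated concurrently (e.g. via gossip metadata); it is not the actual
// type used in cluster.go.
type node struct {
	name string
	prio int64
}

func (n *node) priority() int64 { return atomic.LoadInt64(&n.prio) }

// lowestPrio mimics the problematic shape: the priority is read several
// times per node, so a concurrent change between the reads can make a node
// miss every branch (not taken into account at all), or make it reset the
// list collected for a priority it still shares with the other nodes.
func lowestPrio(nodes []*node) []*node {
	var out []*node
	best := int64(1) << 62
	for _, n := range nodes {
		if n.priority() < best { // read #1
			best = n.priority() // read #2 may already see a different value
			out = []*node{n}    // drops everything collected so far
		} else if n.priority() == best { // read #3
			out = append(out, n)
		}
	}
	return out
}

func main() {
	a := &node{name: "a", prio: 1}
	b := &node{name: "b", prio: 1}
	_ = lowestPrio([]*node{a, b})
}
```

In this sketch, reading the priority once per node and reusing that value avoids both symptoms.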
note that in practice this doesn't work well. out of 200 runs:

```
$ ./analyze.sh
failures 197
Mentions of 'Line 85:' specifically 197
passes: out.txt:3

$ sed -n "s/.*Expected '\(.*\)' to almost equal '500'.*/\1/p" out.txt | goplot hist
462.00 -> 469.90: ▇▇▇ 2
477.80: ▇▇▇▇▇▇▇▇▇ 6
485.70: ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 29
493.60: ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 30
501.50: ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 34
509.40: ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 38
517.30: ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 30
525.20: ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 23
533.10: ▇▇▇▇ 3
541.00: ▇▇▇ 2
```

so our distribution is off by up to 41/500 = .082 = 8.2%
```
$ ./analyze.sh
failures 198
Mentions of 'Line 85:' specifically 198
passes: 2

$ sed -n "s/.*Expected '\(.*\)' to almost equal '5000'.*/\1/p" out.txt2 | goplot hist
4867.00 -> 4893.90: ▇ 1
4920.80: ▇▇▇▇▇▇▇▇▇▇▇▇▇ 9
4947.70: ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 19
4974.60: ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 41
5001.50: ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 35
5028.40: ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 40
5055.30: ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 27
5082.20: ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 16
5109.10: ▇▇▇▇▇▇▇▇▇▇▇ 8
5136.00: ▇▇ 2
```

now the worst delta is 136/5000 = .0272 = 2.7%. I suspect that if we crank up the number of runs significantly we should get better convergence: a smaller delta that can ultimately pass the "almost equal" check consistently. However, I would argue that these bounds are too loose. I don't want convergence to only appear after thousands of requests. See the next commit.
cluster/cluster.go
Outdated
```diff
@@ -142,7 +145,8 @@ LOOP:
 	}
 	// if no nodes have been selected yet then grab a
 	// random node from the set of available nodes
-	selected := candidates.nodes[rand.Intn(len(candidates.nodes))]
+	pos := int(atomic.AddUint64(&counter, 1)) % len(candidates.nodes)
```
should that counter maybe get reset to 0 at some point? I know uint64 is big, but being optimistic we should assume MT can have a long lifetime without restarting :) or do we just rely on it going back to 0 once it is full?
when it wraps around, all the logic keeps on functioning fine as far as i can see.
in fact i might as well have used a uint32 for that reason.
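For what it's worth, a tiny sketch (the variable names are assumed, not the actual cluster.go code) of what happens at the wrap point: the index stays in bounds, and the only effect is a one-off jump in the round-robin sequence:

```go
package main

import (
	"fmt"
	"math"
	"sync/atomic"
)

func main() {
	// start the counter just below the wrap point to show the transition
	var counter uint64 = math.MaxUint64 - 2
	nodes := []string{"a", "b", "c", "d", "e"}

	for i := 0; i < 6; i++ {
		count := atomic.AddUint64(&counter, 1)
		// the modulo keeps the index within bounds even when count wraps
		// to 0; the selection order just jumps once at the wrap
		fmt.Println(count, nodes[count%uint64(len(nodes))])
	}
}
```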
yeah true
i just realized the suggested counter-remainder approach may result in a skewed distribution if the increments align with the number of membersMap iterations (e.g. based on the cluster topology), need to fine-tune it a bit more
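To illustrate the alignment issue with made-up numbers (the partition and node counts below are assumptions, not taken from this PR): if the counter is bumped once per membersMap iteration and the number of iterations per call is a multiple of the candidate count, the same position wins every time:

```go
package main

import (
	"fmt"
	"sync/atomic"
)

func main() {
	// assumed numbers purely for illustration: each MembersForQuery-style
	// call iterates over 3 partitions, each with 3 candidate nodes
	const partitions = 3
	const numNodes = 3

	var counter uint64
	hits := make([]int, numNodes)

	for q := 0; q < 9000; q++ {
		for p := 0; p < partitions; p++ {
			count := atomic.AddUint64(&counter, 1)
			idx := count % numNodes
			if p == 0 { // track only the node picked for the first partition
				hits[idx]++
			}
		}
	}
	// since partitions % numNodes == 0, the first partition lands on the
	// same node every single time, e.g. [0 9000 0]
	fmt.Println(hits)
}
```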
the randomness, both in memberlist order as well as in selection of peers, was creating too much noise:

```
./analyze.sh
failures 0
Mentions of 'Line 85:' specifically 0
passes: 200
```
```go
// if no nodes have been selected yet then grab a node from
// the set of available nodes in such a way that nodes are
// weighted fairly across MembersForQuery calls
selected := candidates.nodes[count%len(candidates.nodes)]
```
is reading the count like this atomic?
ah, sorry, got it now. the count is the return value of the above atomic operation
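For reference, a minimal sketch of the pattern being discussed (the names are stand-ins, not the exact code in cluster.go):

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// counter is a hypothetical package-level counter, analogous to the one
// bumped per selection in MembersForQuery.
var counter uint32

// pickNode shows why no separate atomic read is needed: AddUint32 both
// increments the shared counter and returns the new value in one atomic
// operation.
func pickNode(candidates []string) string {
	count := atomic.AddUint32(&counter, 1)
	return candidates[count%uint32(len(candidates))]
}

func main() {
	nodes := []string{"node-a", "node-b", "node-c"}
	for i := 0; i < 6; i++ {
		fmt.Println(pickNode(nodes)) // cycles node-b, node-c, node-a, ...
	}
}
```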
LGTM
TestPeersForQueryMulti occasionally fails in CI. I've seen this in various non-cluster-related PRs and I think also in master. I think this has become more frequent recently, but I'm not sure.
when I run

```
for i in $(seq 1 200); do go test -run=TestPeersForQueryMulti >> master.txt; done
```

I observed 2 failures and 198 passes. See the individual commit messages for how I've gone about addressing this.