node_heap.go (Outdated)
```diff
-for i := 0; i < n && i < len(nh.Nodes); i++ {
-	node := nh.Nodes[i]
-	m = append(m, node)
+for i := 0; i < n; i++ {
```
@willscott noticed a slight issue with the use of heaps here.
The heap keeps itself sorted upon insertion and removal of nodes. However, the existing nodes in the heap have dynamically changing values, and the heap does not re-sort itself every time one of those values changes.
The strategy I've implemented here is to select the top N nodes from the heap by calling .Pop() n times to retrieve the top nodes, then adding them back using .Push(). This is a similar flow to the .Fix() method that exists on the heap, which is meant to reposition an item in the heap when its value changes. Not the most efficient, but we are dealing with fairly small heap sizes here.
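For concreteness, here is a minimal, self-contained sketch of that Pop/Push selection. The Node and NodeHeap shapes, field names, and locking are assumptions for illustration, not the actual caboose types:

```go
package main

import (
	"container/heap"
	"fmt"
	"sync"
)

// Node and NodeHeap are hypothetical stand-ins for the real types.
type Node struct {
	URL      string
	Priority float64
}

type NodeHeap struct {
	lk    sync.RWMutex
	Nodes []*Node
}

// heap.Interface implementation; higher Priority sits at the top.
func (nh *NodeHeap) Len() int           { return len(nh.Nodes) }
func (nh *NodeHeap) Less(i, j int) bool { return nh.Nodes[i].Priority > nh.Nodes[j].Priority }
func (nh *NodeHeap) Swap(i, j int)      { nh.Nodes[i], nh.Nodes[j] = nh.Nodes[j], nh.Nodes[i] }
func (nh *NodeHeap) Push(x any)         { nh.Nodes = append(nh.Nodes, x.(*Node)) }
func (nh *NodeHeap) Pop() any {
	old := nh.Nodes
	nd := old[len(old)-1]
	nh.Nodes = old[:len(old)-1]
	return nd
}

// TopN pops up to n nodes (each Pop re-sifts the heap) and then pushes
// them back, similar in flow to heap.Fix. Note the write lock: Pop and
// Push both mutate the underlying slice.
func (nh *NodeHeap) TopN(n int) []*Node {
	nh.lk.Lock()
	defer nh.lk.Unlock()
	m := make([]*Node, 0, n)
	for i := 0; i < n && nh.Len() > 0; i++ {
		m = append(m, heap.Pop(nh).(*Node))
	}
	for _, nd := range m {
		heap.Push(nh, nd)
	}
	return m
}

func main() {
	nh := &NodeHeap{Nodes: []*Node{{"a", 1}, {"b", 5}, {"c", 3}}}
	heap.Init(nh)
	for _, nd := range nh.TopN(2) {
		fmt.Println(nd.URL, nd.Priority) // prints "b 5" then "c 3"
	}
}
```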
Also wondering if we still think that a priority queue is the best data structure to store the AllNodes tier? My understanding of the code / implementation is that ActiveNodes is always a subset of AllNodes, and they are not mutually exclusive sets like the previous "Main" and "Unknown" tier design.
@willscott One issue I noticed with my changes is that since the nodes are temporarily removed from the heap, if there are any stats being written to them concurrently, that will fail.
what do you mean by 'fail'? a temporary dropped metric isn't the end of the world, potentially.
If not a priority queue for all nodes, how would you imagine keeping track of the whole set?
> what do you mean by 'fail'? a temporary dropped metric isn't the end of the world, potentially.

What I mean by "fail" is that any concurrent operation that accesses the heap while the top-N selection is occurring might get an incorrect list. I do agree with you though, it doesn't seem that damaging, especially since the "AllNodes" set is just used for testing nodes.
> If not a priority queue for all nodes, how would you imagine keeping track of the whole set?

Wondering if a regular array structure would work better here. It's simpler and gives us the functionality we need (top n nodes, random nodes, node removal). We're not really using the properties of a priority queue, because the values of each node in the queue are changing constantly and we basically have to re-sort the nodes every time we need to make a selection.
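A hypothetical sketch of that slice-based alternative, reusing the illustrative Node type from the sketch above: sort a copy on demand rather than maintaining heap invariants.

```go
import "sort"

// topN is a hypothetical slice-based selection: O(n log n) per call,
// which is fine at the small pool sizes discussed here.
func topN(nodes []*Node, n int) []*Node {
	sorted := append([]*Node(nil), nodes...) // sort a copy, not the original
	sort.Slice(sorted, func(i, j int) bool {
		return sorted[i].Priority > sorted[j].Priority // best nodes first
	})
	if n > len(sorted) {
		n = len(sorted)
	}
	return sorted[:n]
}
```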
- We can add add/remove methods that are a simple enough API.
- I think keeping the nodes in a roughly sorted order for consideration isn't a bad idea, as we'll want to access the better ones.
- If we run into performance issues with using the push/pop semantics on updates, we can get lazy about it and then use the heap fix when we need to get the top n.
One note on golang semantics here: you have the nh.lk.RLock() above - that indicates you're taking a "read-only lock" over the data structure - but the pop and push you're proposing in your change will do a read-write operation.
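To illustrate the distinction on the hypothetical NodeHeap from the earlier sketch (method names here are assumptions):

```go
// Read-only access: an RLock allows concurrent readers.
func (nh *NodeHeap) Peek() *Node {
	nh.lk.RLock()
	defer nh.lk.RUnlock()
	if len(nh.Nodes) == 0 {
		return nil
	}
	return nh.Nodes[0] // inspect only; no mutation under RLock
}

// Mutation: heap.Push modifies nh.Nodes, so a full write Lock is required.
func (nh *NodeHeap) Add(nd *Node) {
	nh.lk.Lock()
	defer nh.lk.Unlock()
	heap.Push(nh, nd)
}
```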
Thanks for the tip here Will, always good to learn :)
Hey @willscott, noticing that the …
you believe this is a bug in the considerupdate? do you have a failing test against it that you can point me to so that i can help debug it?
I think it would be helpful to just get an idea of the expected behavior here to understand what I'm testing against. The example I'm going off is: let's say I have a new node that is not currently in the hash ring and I run …
I would expect … One way that confidence gets better is if mirroring has already sent some test requests to the new node, so we have some sense that it could be good. The other question is how to handle initial startup - we probably need to enforce some sort of minimum reasonable size, where we're okay with negative updates if we fall below it.
Removed unused metrics
@willscott the reason why the tests are passing is because the …
@willscott a few things I found:
The …
node_heap.go (Outdated)
```diff
-}
-for _, n := range m {
-	heap.Push(nh, n)
+for i := 0; i < n && i < len(nh.Nodes); i++ {
```
I thought this returns an unsorted list of candidates?
I think if we always use heap.Push when adding new elements, then nh.Nodes remains in sorted order?
My train of thought is that if the node(s) stats change after they have been pushed, then nh.Nodes doesn't remain sorted, because heap.Push only sorts on insertion.
Also, if the stats of already-ordered nodes change, then I think heap.Push would probably never keep a sorted list, because heap.Push only works properly if the heap is correctly ordered when the insertion occurs.
Not sure if I'm missing something here maybe?
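A tiny demo of that concern, reusing the illustrative NodeHeap from the earlier sketch: mutating a priority in place silently breaks the heap invariant, so a later Pop returns the wrong node.

```go
func main() {
	nh := &NodeHeap{Nodes: []*Node{{"a", 5}, {"b", 3}, {"c", 1}}}
	heap.Init(nh)             // valid max-heap, "a" on top
	nh.Nodes[2].Priority = 10 // "c" is now the best node...
	top := heap.Pop(nh).(*Node)
	fmt.Println(top.URL) // ...but this prints "a": the heap never re-sorted.
	// heap.Fix(nh, 2) right after the change would have repaired it.
}
```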
not sure how much churn we expect - hopefully in the steady state not too much.
The answer to changing Priority values that leave the heap unsorted is to use heap.Init at the beginning of the call to TopN - or to use heap.Fix whenever a specific node's priority changes.
Since we only call TopN in the periodic pool refresh (so once every 5 minutes), it's probably no problem to re-establish the heap ordering each time.
Integrate new orchestrator endpoint
@aarshkshah1992 test flakiness addressed. cc @willscott was able to reproduce flakiness locally and made changes to the topN selection. I believe just using …
@willscott added a test that targets cache affinity. Relevant commit is here: 1975f49. The test tries to ensure that if we have a set of good-performing nodes that we have previously made cid requests to, we still hit those nodes for those cids even if other "similar"-performing nodes join the pool. Currently that test is failing, and we are seeing new nodes that join the pool show up as first candidate nodes for those cids instead.
```diff
@@ -15,8 +17,9 @@ import (
 )

 const (
-	nodesSize = 200
+	nodesSize = 6
 )
```
Why do we need to do this?
This is used to change the desired pool size for testing. I reduced the pool size for the cache-affinity testing to make it easier to debug and to get the tests passing. Once we get that passing, we can set it back to a high / realistic number.
Okay, yeah that makes sense.
This has already been structured so that the lower pool target is only used in the test context. A larger pool size target is used when running normally.
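A hypothetical illustration of that structure (the real wiring in caboose differs; this just shows the shape of the idea):

```go
const (
	prodPoolTarget = 200 // realistic pool size for normal operation
	testPoolTarget = 6   // small pool: easier to debug cache affinity
)

// poolTarget is a hypothetical helper selecting the pool size target.
func poolTarget(inTest bool) int {
	if inTest {
		return testPoolTarget
	}
	return prodPoolTarget
}
```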
Test cache affinity
feat: port compliance cids
Port over caboose main
I'm going to merge this into adaptive so we have one PR tracking all of this and we can get to final merge.