Balancing #107
Conversation
balancing/balance_breaker.go
Outdated
}
}

type lenLimitCounter struct {
lenDelimitedCounter
return tracker.state, true
}

changed := false
We don't need this, we can just return false/true everywhere below.
balancing/balance_breaker.go
Outdated
return exceeded
}

func (breaker *NodeBreaker) checkStateHalfOpen(exceeded bool) bool {
isHalfOpen
}
if !meter.IsActive() && active {
meter.histogram.shiftData(meter.now().Sub(meter.inActiveSince))
meter.inActiveSince = time.Time{}
Maybe nil instead of time.Time{}, when switching from inactive to active?
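If it went that way, a minimal sketch (assuming inActiveSince is changed to a *time.Time; the surrounding names are illustrative, not akubra's actual API) could look like:

package sketch

import "time"

// nil marks the active state instead of the time.Time{} zero value.
type meterState struct {
	inActiveSince *time.Time
}

func (m *meterState) markInactive(now time.Time) { m.inActiveSince = &now }

func (m *meterState) markActive() { m.inActiveSince = nil }

func (m *meterState) isActive() bool { return m.inActiveSince == nil }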
balancing/balance_breaker.go
Outdated
state *openStateTracker
}

// Record collects call data and returns bool if breaker should be open
open -> opened
balancing/balance_breaker.go
Outdated
return breaker.ShouldOpen()
}

// ShouldOpen checks if breaker should be open
as above
}

func (tracker *openStateTracker) currentDelay() time.Duration {
multiplier := int(math.Pow(2, tracker.closeIteration))
closeIteration is never decremented/reset, even if a backend is healthy for a long time. An interesting case popped up in my head:
- initially the backend is not doing so well and its closeIteration gets incremented until it reaches some high value
- the backend becomes healthy and stays healthy for a long time, but its closeIteration is already big
- the backend misbehaves for a brief second and gets taken out of the pool of healthy nodes for a long time, due to its closeIteration
- we wait a long time until we plug the backend back in, even though it's already healthy and might provide us with the best timings
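One way to soften that scenario, sketched below under the assumption that the tracker records health transitions (only closeIteration and currentDelay come from the diff; everything else is made up for illustration): reset closeIteration after the backend has stayed healthy for a configurable period, so a brief hiccup no longer triggers the maximum backoff.

package sketch

import (
	"math"
	"time"
)

type openStateTracker struct {
	closeIteration float64
	healthySince   time.Time
	resetAfter     time.Duration
}

// recordHealthy clears the accumulated backoff once the backend has been
// healthy for resetAfter without interruption.
func (t *openStateTracker) recordHealthy(now time.Time) {
	if t.healthySince.IsZero() {
		t.healthySince = now
	}
	if now.Sub(t.healthySince) >= t.resetAfter {
		t.closeIteration = 0
	}
}

// recordUnhealthy restarts the healthy streak and bumps the backoff exponent.
func (t *openStateTracker) recordUnhealthy() {
	t.healthySince = time.Time{}
	t.closeIteration++
}

// currentDelay keeps the existing exponential backoff formula.
func (t *openStateTracker) currentDelay(base time.Duration) time.Duration {
	return time.Duration(math.Pow(2, t.closeIteration)) * base
}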
func (h *histogram) cellsCount() int {
return int(math.Ceil(float64(h.retention)/float64(h.resolution))) + 1
}
Some of the methods of histogram are protected by a mutex, but some are not. shiftData mutates the state heavily, yet it doesn't lock the data it mutates. If (or when) shiftData and unshiftData run concurrently, bad things may happen.
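A sketch of the locking being asked for, with the struct layout assumed from the diff context (the real histogram has more fields and methods): give histogram a mutex and hold it for the whole mutation in shiftData, as well as in every other method that touches the series data.

package sketch

import (
	"sync"
	"time"
)

type timeValue struct {
	date time.Time
}

type dataSeries struct {
	data []*timeValue
}

type histogram struct {
	mutex sync.Mutex
	t0    time.Time
	data  []*dataSeries
}

// shiftData holds the mutex for the entire mutation, so a concurrent
// unshiftData (or any reader) cannot observe a half-shifted histogram.
func (h *histogram) shiftData(delta time.Duration) {
	h.mutex.Lock()
	defer h.mutex.Unlock()
	h.t0 = h.t0.Add(delta)
	for _, series := range h.data {
		for _, value := range series.data {
			value.date = value.date.Add(delta)
		}
	}
}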
balancing/balance_breaker.go
Outdated
}

func (series *dataSeries) ValueRangeFun(timeStart, timeEnd time.Time, fun func(*timeValue)) []float64 {
dataRange := []float64{}
dataRange is unused
}

// TimeSpent returns float64 repesentation of time spent in execution
func (meter *CallMeter) TimeSpent() float64 {
Won't this be too expensive to calculate? It happens every time we want to know the node's weight, which is basically every call. If there are a lot of rps, then on each call we have to traverse each series in a histogram for all of the active nodes to elect a node. We could simply aggregate this instead of calculating it each time.
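A sketch of that aggregation idea (the fields below are assumptions, not the real CallMeter): keep a running total that is updated when samples enter and leave the retention window, so TimeSpent becomes an O(1) read instead of a per-request traversal of every series.

package sketch

import (
	"sync"
	"time"
)

type CallMeter struct {
	mutex     sync.Mutex
	totalTime float64 // seconds spent in calls within the retention window
}

// record adds a finished call's duration to the running total.
func (m *CallMeter) record(duration time.Duration) {
	m.mutex.Lock()
	defer m.mutex.Unlock()
	m.totalTime += duration.Seconds()
}

// expire subtracts durations that fell out of the retention window.
func (m *CallMeter) expire(duration time.Duration) {
	m.mutex.Lock()
	defer m.mutex.Unlock()
	m.totalTime -= duration.Seconds()
}

// TimeSpent no longer walks the histogram on every weight calculation.
func (m *CallMeter) TimeSpent() float64 {
	m.mutex.Lock()
	defer m.mutex.Unlock()
	return m.totalTime
}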
balancing/balance_breaker.go
Outdated
return ms.Node.IsActive()
}

func raportMetrics(rt http.RoundTripper, since time.Time, open bool) {
report
}

// IsActive checks Breaker status propagates it to Node compound
func (ms *MeasuredStorage) IsActive() bool {
I don't like the fact that the method's name hides what it's actually doing and you have to take a look at the doc to know the whole truth.
@@ -16,7 +16,7 @@ type AccessMessageData struct {
Path string `json:"path"`
UserAgent string `json:"useragent"`
StatusCode int `json:"status"`
- Duration float64 `json:"duration"`
+ Duration float64 `json:"duration_ms"`
Isn't the JSON camelCase? (duration_ms)
@@ -25,7 +25,7 @@ func (lrt *loggingRoundTripper) RoundTrip(req *http.Request) (resp *http.Respons
timeStart := time.Now()
resp, err = lrt.roundTripper.RoundTrip(req)

- duration := time.Since(timeStart).Seconds()
+ duration := time.Since(timeStart).Seconds() * 1000
Why multiply by 1000?
}

func (c *ShardClient) balancerRoundTrip(req *http.Request) (resp *http.Response, err error) {
notFoundNodes := []balancing.Node{}
Maybe it's worth it to add entries to the sync log for each of the nodes in notFoundNodes after we've found a node that was able to process the request?
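A rough, entirely hypothetical sketch of that suggestion (logMissingObjects and the node type are made up for illustration; the real sync-log and balancing APIs differ):

package sketch

import "log"

type node struct {
	name string
}

// logMissingObjects records, for every node that returned "not found", that the
// object should later be synced there from the node that served the request.
func logMissingObjects(servedBy node, notFoundNodes []node, objectKey string) {
	for _, missing := range notFoundNodes {
		log.Printf("synclog: %s missing on %s, available on %s", objectKey, missing.name, servedBy.name)
	}
}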
@@ -37,6 +37,11 @@ It also backtracks to older cluster when requested for not existing object on
target cluster. This kind of events are logged, so it's possible to rebalance
clusters in background.

### Multi cloud cost optimization
While all objects has to be written in every storage from shard, not all storages
has to be read. With load balancing and storage prioritization akubra will peak
pick the
h.t0 = newT0
for _, series := range h.data {
for _, value := range series.data {
value.date = value.date.Add(delta)
This loop seems costly. We could instead accumulate the delta and later, when someone calls ValueRangeFun, just apply the delta to timeStart and timeEnd once at the beginning of the method.
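A sketch of that optimization, with the series layout assumed from the diff context: shiftData only accumulates the delta, and ValueRangeFun translates the query bounds once up front.

package sketch

import "time"

type timeValue struct {
	date  time.Time
	value float64
}

type dataSeries struct {
	data  []*timeValue
	shift time.Duration // accumulated delta instead of rewriting every sample's date
}

// shiftData becomes O(1): it no longer touches the stored timestamps.
func (s *dataSeries) shiftData(delta time.Duration) {
	s.shift += delta
}

// ValueRangeFun translates the query bounds by the accumulated shift once,
// which is equivalent to having shifted every stored timestamp forward.
func (s *dataSeries) ValueRangeFun(timeStart, timeEnd time.Time, fun func(*timeValue)) {
	timeStart = timeStart.Add(-s.shift)
	timeEnd = timeEnd.Add(-s.shift)
	for _, v := range s.data {
		if v.date.After(timeStart) && v.date.Before(timeEnd) {
			fun(v)
		}
	}
}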
This reverts commit 73be064.
open := ms.Breaker.Record(duration, success)
- log.Debugf("MeasuredStorage %s: Request %s took %s was successful: %t, opened breaker %t\n", ms.Name, reqID, duration, success, open)
+ log.Debugf("s %s: Request %s took %s was successful: %t, opened breaker %t\n", ms.Name, reqID, duration, success, open)
You replaced MeasuredStorage with s in the log message. Is this intentional?