enhancement(sinks): Add distributed service #13918
Conversation
Soak Test Results

Baseline: 447fdc9

Explanation: A soak test is an integrated performance test for vector in a repeatable rig, with varying configuration for vector. What follows is a statistical summary of a brief vector run for each configuration across the SHAs given above. The goal of these tests is to determine, quickly, whether vector performance is changed by a pull request, and to what degree. Where appropriate, units are scaled per-core.

The table below, if present, lists those experiments that have experienced a statistically significant change in their throughput performance between the baseline and comparison SHAs, with 90.0% confidence, OR have been detected as newly erratic. Negative values mean that the baseline is faster, positive that the comparison is. Results that do not exhibit more than a ±8.87% change in mean throughput are discarded. An experiment is erratic if its coefficient of variation is greater than 0.3. The abbreviated table will be omitted if no interesting changes are observed.

Changes in throughput with confidence ≥ 90.00% and absolute Δ mean ≥ ±8.87%:

Fine details of change detection per experiment.
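Roughly, the "erratic" and "interesting change" filters described above amount to the following (illustrative code only, not the soak tooling's actual implementation; the 90% confidence test on throughput is not shown):

```rust
/// Sample mean of per-run throughput measurements.
fn mean(xs: &[f64]) -> f64 {
    xs.iter().sum::<f64>() / xs.len() as f64
}

/// Coefficient of variation: standard deviation divided by the mean.
fn coefficient_of_variation(xs: &[f64]) -> f64 {
    let m = mean(xs);
    let variance = xs.iter().map(|x| (x - m).powi(2)).sum::<f64>() / xs.len() as f64;
    variance.sqrt() / m
}

/// An experiment is "erratic" if its coefficient of variation exceeds 0.3.
fn is_erratic(samples: &[f64]) -> bool {
    coefficient_of_variation(samples) > 0.3
}

/// A change is "interesting" if mean throughput moved by at least ±8.87%.
/// Negative values mean the baseline is faster, positive the comparison.
fn is_interesting_change(baseline_mean: f64, comparison_mean: f64) -> bool {
    let delta_pct = (comparison_mean - baseline_mean) / baseline_mean * 100.0;
    delta_pct.abs() >= 8.87
}
```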
@tobz this is ready for review.
I've started reviewing this, and left a few cursory comments.
Overall, the state transition logic (healthy to unhealthy, unhealthy to healthy, etc.) is what I'm primarily concerned with. I'm still going through it because I think there might be a simpler approach, but I'm also not sure whether the current approach is going to handle all scenarios.
This PR is definitely getting closer, but there are still a few hangups for me.
Primarily, I don't believe using the sink's normal healthcheck -- or anything we'd normally consider a healthcheck -- is a valid measure of the downstream system's health in this case.
As mentioned previously, the healthcheck of a system might represent only a cursory "service is running" check, whereas once we actually send traffic to the system, the request hits different code paths, queues, and so on, that lead to actual errors.
Instead, we need to be observing the actual responses when they come back from the service, using those to determine whether the system is healthy again or still unhealthy. This is the sort of design we should be following: https://martinfowler.com/bliki/CircuitBreaker.html
@tobz I'm not sure if we are having a misunderstanding somewhere, but the current implementation does behave like a circuit breaker. It monitors responses to requests and triggers when a sufficient number of unhealthy responses are observed with no healthy ones in between. Once triggered, it goes to sleep and reports the service as unavailable. Once it wakes up, it checks the service with a healthcheck. What that does depends on the specific sink in question: for some, a nop that is always true could make sense; for some, it's worth checking whether the service is at least reachable; etc. If it passes that, it enters a probation period. If a healthy response is observed, it exits probation, moves to a healthy state, and the timeout is reset. If not, and a sufficient number of unhealthy responses is observed, it goes to sleep yet again, this time for longer, and so on. How many unhealthy responses are needed to trigger it, and how many of them are acceptable during the probation period, are configurable.
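A rough sketch of the logic being described, with illustrative names rather than the PR's actual types (and without modeling the concurrent counter updates discussed below):

```rust
use std::time::Duration;

// Illustrative states only, not the PR's actual types.
enum State {
    Healthy,     // traffic flows, responses are watched
    Unavailable, // asleep for `backoff`, then run the sink's wake-up healthcheck
    Probation,   // passed the wake-up check, watching real responses again
}

struct HealthLogic {
    state: State,
    unhealthy_count: u32,
    unhealthy_threshold: u32, // responses needed to trigger / fail probation
    backoff: Duration,        // grows on every trigger, resets on recovery
}

impl HealthLogic {
    fn on_response(&mut self, healthy: bool) {
        if matches!(self.state, State::Unavailable) {
            return; // asleep: the service is being reported as unavailable
        }
        if healthy {
            // A healthy data response ends probation and resets the timeout.
            self.state = State::Healthy;
            self.unhealthy_count = 0;
            self.backoff = Duration::from_secs(1);
        } else {
            self.unhealthy_count += 1;
            if self.unhealthy_count >= self.unhealthy_threshold {
                // Triggered: sleep for `backoff`, then run the sink-provided
                // healthcheck; if it passes, move to Probation.
                self.state = State::Unavailable;
                self.backoff *= 2; // each subsequent trigger sleeps longer
                self.unhealthy_count = 0;
            }
        }
    }
}
```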
@ktff I see what you are saying, and to be clear, I'm not saying this design isn't like a circuit breaker. What I'm trying to convey is that I want it to be more like a classic circuit breaker design, where the traffic itself is what we use to reset the circuit rather than the healthcheck. Testing the actual traffic that caused it to trip in the first place is the only surefire way to know if it's healthy again. Beyond that, I'm still reviewing the logic around how it trips/resets, because I want to make sure we have it right considering that updates to the healthy/unhealthy counters are happening concurrently.
@tobz so you have an issue with using healthchecks at all, even though they don't reset the circuit breaker but only check whether it's even worth trying? And even though their implementation is left to the sinks, so it could be a nop? Only traffic can reset the circuit; the "healthchecks" are here to test whether it's even worth it, so maybe "healthcheck" is a confusing name. Perhaps "unhealthcheck" is a better name, since it doesn't say that a service is healthy, just that it can say for sure it's unhealthy.
I do, yes. Let's call the "normal" sink traffic a "data" request, and a typical healthcheck request a "control" request, in the spirit of data plane vs. control plane. It is very common for the control plane to report that things are healthy while actually sending traffic to the data plane would show an elevated/persistent rate of errors. As well, I'm going to use circuit breaker terminology from this point on: closed (healthy, traffic flowing), open (no traffic flowing), and half open (only probe/canary traffic flowing).

In my mind, if we're observing data requests to figure out whether or not the circuit breaker should trip into the open/half-open state, we should likewise be observing data requests to figure out when it can be closed. At best, a healthcheck could provide a response that is 100% indicative of whether or not data requests should succeed, and we've simply made an extra HTTP control request when we could have skipped straight to sending a data request. At worst, we're putting the service back in the closed state prematurely, which could allow a flood of requests to be sent through it before the first one returns with an error that triggers it to trip into the open state again.

This is why I'm arguing so strongly for allowing a data request through that gets used as the canary in the half-open state. At the risk of talking a little too much about the implementation details of …

In this way, we only use data requests to figure out if the endpoint is healthy and the circuit breaker can be closed. Additionally, we don't run the risk of letting multiple requests through immediately until we know that the endpoint is healthy, and we're more confident that the endpoint is actually healthy because we're using real data requests.
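As a minimal sketch of that single-canary idea (hypothetical names, not code from this PR), the half-open gate might look something like this:

```rust
use std::sync::atomic::{AtomicBool, Ordering};

/// At most one data request ("canary") is admitted while half open, and its
/// outcome alone decides whether the breaker closes or re-opens.
struct HalfOpenGate {
    canary_in_flight: AtomicBool,
}

enum Transition {
    Close,  // healthy again: resume normal traffic
    ReOpen, // still failing: back off again before the next canary
}

impl HalfOpenGate {
    /// Returns true if the caller may send the single canary request.
    fn try_acquire(&self) -> bool {
        self.canary_in_flight
            .compare_exchange(false, true, Ordering::AcqRel, Ordering::Acquire)
            .is_ok()
    }

    /// Called with the canary's outcome: success closes the breaker,
    /// failure re-opens it (typically with a longer backoff).
    fn on_canary_result(&self, ok: bool) -> Transition {
        self.canary_in_flight.store(false, Ordering::Release);
        if ok { Transition::Close } else { Transition::ReOpen }
    }
}
```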
Regarding letting only one request pass during probation: I think it's the job of the concurrency layers to throttle the requests so that short temporary downtimes don't inhibit throughput much, although I don't have an issue with limiting this to one request in the Health service. Regarding not taking the control plane into consideration, I don't agree. For some sinks that can make sense, and they can leave the healthcheck a nop, but for some it doesn't. For example, the Elasticsearch sink, when communicating with the cluster, should try not to pass requests if the cluster is not in a healthy state, regardless of the data plane, or at least leave it to users to enable/disable that. I really don't see an issue in having the healthcheck here. It makes it possible to add things when needed and to be a nop when it isn't.
I definitely agree that letting through a single request is starting to border on this layer also acting as a rate limit/concurrency layer... but in practice, I'm not sure how much this would change the runtime behavior? The difference would be the latency of the healthcheck endpoint vs. the latency of the data plane endpoint. So when the layer first trips open, and the backoff period is, say, 100ms or whatever... yeah, maybe the healthcheck request takes 10ms and the data plane request takes 1000ms, so now we're open/half-open for 1100ms instead of 110ms... but if we've exceeded our error budget, does only waiting 100ms have a chance to meaningfully improve the health of the downstream service? Is allowing all traffic to immediately start flowing again more likely than not to hurt the service? Additionally, ARC may not have meaningfully scaled itself back if it's only seen a few errors so far. This is really the sort of thing that I'd prefer to empirically model with some sort of queueing simulation system based on these types, because it's a bit hard to talk about the behavioral specifics without a true model... but that's far out of the scope of this PR.
I personally don't want to have to figure out, for every component, what the right health check logic to use would be.

We can always expand on the design in the future, to incorporate the notion of checking some other source of health information... but for this PR to proceed, it needs to observe the data plane traffic.
Then I'll add a limit of one request in the half-open state and drop the healthcheck, since that's necessary to proceed.
@tobz this is ready for review.
Nice, this is a solid base to continue building from. 👍🏻
The failing Soak analysis checks don't seem to be related to this PR, since they fail at the start while reading some files.
Yeah, I'll take a look. They don't handle action runner failure gracefully, so sometimes it's as simple as just rerunning them.
@ktff Can you merge in the latest changes from …?
Soak Test Results

Baseline: 8e2201b

(Methodology as in the earlier soak report above.)

No interesting changes in throughput with confidence ≥ 90.00% and absolute Δ mean ≥ ±8.87%.

Fine details of change detection per experiment.
Issue #3649
PR #13236 got quite large, so it made sense to split it into two PRs. This is the first one.
Adds a service for distributing events to multiple endpoints (a rough sketch of the idea follows below).
cc. @tobz
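As a rough illustration of the shape of such a service (hypothetical names, using tower's `Service` trait; the actual implementation in this PR, with per-endpoint health tracking, is more involved):

```rust
use std::task::{Context, Poll};
use tower::Service;

/// Wraps several inner services (one per endpoint) and hands each call to a
/// ready one, rotating between them. Assumes `endpoints` is non-empty.
struct Distributed<S> {
    endpoints: Vec<S>,
    ready_idx: Option<usize>, // endpoint picked by the last successful poll_ready
    next: usize,              // where to start looking, for fairness
}

impl<S, Req> Service<Req> for Distributed<S>
where
    S: Service<Req>,
{
    type Response = S::Response;
    type Error = S::Error;
    type Future = S::Future;

    fn poll_ready(&mut self, cx: &mut Context<'_>) -> Poll<Result<(), Self::Error>> {
        let len = self.endpoints.len();
        // Find a ready endpoint, starting after the previously used one.
        for offset in 0..len {
            let idx = (self.next + offset) % len;
            match self.endpoints[idx].poll_ready(cx) {
                Poll::Ready(Ok(())) => {
                    self.ready_idx = Some(idx);
                    return Poll::Ready(Ok(()));
                }
                // Surface errors here; a fuller implementation might instead
                // mark the endpoint unhealthy and keep looking.
                Poll::Ready(Err(e)) => return Poll::Ready(Err(e)),
                Poll::Pending => {}
            }
        }
        Poll::Pending // every endpoint is busy; their wakers were registered above
    }

    fn call(&mut self, req: Req) -> Self::Future {
        let idx = self.ready_idx.take().expect("poll_ready must succeed first");
        self.next = idx + 1;
        self.endpoints[idx].call(req)
    }
}
```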