-
Notifications
You must be signed in to change notification settings - Fork 16
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
PinUntilErrorChannel doesn't switch on 429 #661
Conversation
Generate changelog in
|
@@ -17,7 +17,7 @@ | |||
live_reloading[CONCURRENCY_LIMITER_ROUND_ROBIN].txt: success=89.9% client_mean=PT3.8310808S server_cpu=PT1H55M21.66S client_received=2500/2500 server_resps=2500 codes={200=2248, 500=252} | |||
live_reloading[UNLIMITED_ROUND_ROBIN].txt: success=60.2% client_mean=PT2.84698S server_cpu=PT1H58M37.45S client_received=2500/2500 server_resps=2500 codes={200=1504, 500=996} | |||
one_big_spike[CONCURRENCY_LIMITER_BLACKLIST_ROUND_ROBIN].txt: success=79.0% client_mean=PT1.478050977S server_cpu=PT1M59.71393673S client_received=1000/1000 server_resps=790 codes={200=790, Failed to make a request=210} | |||
one_big_spike[CONCURRENCY_LIMITER_PIN_UNTIL_ERROR].txt: success=100.0% client_mean=PT1.286733552S server_cpu=PT2M48.75S client_received=1000/1000 server_resps=1125 codes={200=1000} | |||
one_big_spike[CONCURRENCY_LIMITER_PIN_UNTIL_ERROR].txt: success=100.0% client_mean=PT1.135007332S server_cpu=PT2M49.65S client_received=1000/1000 server_resps=1131 codes={200=1000} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As expected, the graph shows one big spike of requests (i.e. this exact use case) remains pinned to one uri. https://github.com/palantir/dialogue/blob/dfox/pin-until-error-fix/simulation/src/test/resources/report.md#one_big_spikeconcurrency_limiter_pin_until_error
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We may want to update this simulation to respond 429 instead of 503
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's intended to be representative of this exact workflow, so it responds 429 above some threshold:
public void one_big_spike() {
int capacity = 100;
servers = servers(
SimulationServer.builder()
.serverName("node1")
.simulation(simulation)
.handler(h -> h.respond200UntilCapacity(429, capacity).responseTime(Duration.ofMillis(150)))
.build(),
SimulationServer.builder()
.serverName("node2")
.simulation(simulation)
.handler(h -> h.respond200UntilCapacity(429, capacity).responseTime(Duration.ofMillis(150)))
.build());
dialogue-core/src/main/java/com/palantir/dialogue/core/PinUntilErrorChannel.java
Show resolved
Hide resolved
dialogue-core/src/main/java/com/palantir/dialogue/core/PinUntilErrorChannel.java
Outdated
Show resolved
Hide resolved
dialogue-core/src/test/java/com/palantir/dialogue/core/PinUntilErrorChannelTest.java
Outdated
Show resolved
Hide resolved
…dialogue into dfox/pin-until-error-fix
Released 1.23.1 |
Before this PR
In PDS-117063, a user of our internal atlas-replacement switched from c-j-r -> dialogue and saw server errors.
Looking at the
pinuntilerror.nextNode
metric, it seemed we switched channels 5 times during a supposedly transactional workflow. This meant that some requests landed on one node and others landed on a different node, which caused the second node to return a hard error.cc @LucasIME and @jkozlowski
After this PR
==COMMIT_MSG==
PinUntilErrorChannel doesn't switch on 429, to unblock transactional workflows
==COMMIT_MSG==
Possible downsides?