light client : failsafe crate (circuit breaker) #9790
Conversation
Looks good in general 👍 , I've left some questions/comments.
parity/cli/mod.rs
Outdated
ARG arg_on_demand_max_backoff_rounds: (Option<usize>) = None, or |c: &Config| c.light.as_ref()?.on_demand_max_backoff_rounds,
"--on-demand-max-backoff-rounds=[TIMES]",
"Specify light client maximum number of backoff iterations",
Not sure a user wants control over all these knobs; maybe it is fine to just ship good (empirically researched?) defaults, but I don't have a strong opinion on that.
I think it is good to expose the primitives to give full control, but an alternate preset (a basic configuration) could also be a nice thing.
(At this point it probably lacks documentation of the impact of every option, but maybe the wiki or another source is better for that: it seems difficult to explain quickly.)
There are a few things I would need in order to review this PR properly (I left a few review comments in the code, but they may not be pertinent).
A general explanation of how the circuit behaves; more precisely, how many circuit states are managed and what they track (global, per request, per user, ...). At this point it only makes sense to me as a global on_demand circuit state, but at the same time I feel that does not totally match the implementation.
It would also be good to have a short word on the advantages of the approach (but I think that was already discussed in related issues or PRs).
A last thing is the additional dependency on the 'spin' crate: we use parking_lot and had deadlock issues in the past, so I am a bit worried about the impact of the circuit breaker. I did not analyse it further, because I first want to be sure of our strategy when using the circuit breaker (i.e. is it global or local to each request...).
Please rebase
I refactored this, and the major thing to sort out is whether […]. However, the response guard doesn't need to use the failsafe mechanism, as you pointed out earlier, which I can remove if I should continue with this. Dump of the response:
curl --data '{"method":"eth_getTransactionReceipt","params":["0x444172bef57ad978655171a8af2cfd89baa02a97fcb773067aef7794d6913fff"],"id":1,"jsonrpc":"2.0"}' -H "Content-Type: application/json" -X POST localhost:8555
{"jsonrpc":"2.0","error":{"code":-32042,"message":"Bad response on request: [ TransactionIndex ]. Error cause was EmptyResponse, (majority count: 91 / total: 91)"},"id":1}
Mostly LGTM. A few small grumbles.
It indeed seems that the use of failsafe on `ResponseGuard` is optional (and one can also argue that `NoBackoff` doesn't look really nice). But I think that's not a huge issue -- being wrapped in `ResponseGuard` means it can easily be changed to not use failsafe, or to use other circuit breaker policies.
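As an aside, to make the "other circuit breaker policies" point concrete, here is a rough sketch of how the two policies discussed in this thread would be built with failsafe-rs. It is written from memory of the 0.3-era API (`backoff::exponential`, `failure_policy::consecutive_failures`, `failure_policy::success_rate_over_time_window`); exact parameter order and names should be checked against the crate docs, and none of it is code from this PR.

```rust
// Illustrative sketch only, not code from this PR: configuring a failsafe
// circuit breaker with either of the two failure policies discussed in this
// review (consecutive failures vs. success rate over a time window), both
// combined with an exponential backoff.
use std::time::Duration;
use failsafe::{backoff, failure_policy, CircuitBreaker, Config};

fn main() {
    // Policy A: trip the circuit after 3 consecutive failures, backing off
    // exponentially from 1s up to a 100s cap.
    let consecutive = failure_policy::consecutive_failures(
        3,
        backoff::exponential(Duration::from_secs(1), Duration::from_secs(100)),
    );

    // Policy B: trip the circuit when the success rate over a 10s window
    // drops below 50% (with at least 5 recorded calls).
    let success_rate = failure_policy::success_rate_over_time_window(
        0.5,
        5,
        Duration::from_secs(10),
        backoff::exponential(Duration::from_secs(1), Duration::from_secs(100)),
    );

    // Either policy plugs into the same breaker; a wrapper such as the PR's
    // ResponseGuard / RequestGuard hides which one is actually in use.
    let breaker = Config::new().failure_policy(consecutive).build();
    assert!(breaker.is_call_permitted());
    let _ = success_rate; // built only to show the alternative
}
```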
trace!(target: "circuit_breaker", "ResponseGuard: {:?}, responses: {:?}", self.state, self.responses);
let (&err, &max_count) = self.responses.iter().max_by_key(|(_k, v)| *v).expect("got at least one element; qed");
let majority = self.responses.values().filter(|v| **v == max_count).count() == 1;
// FIXME: more efficient with a separate counter
Would be nice if you could create a new issue and include the issue number there!
I will fix this if/when @cheme signs off on this :P
Sorry to ask, did I lock something?
Haha, I was referring to the functionality that this PR breaks, i.e. replacing the response limit per peer with a simple timeout instead. If that is not sorted out, there is no point in implementing other things IMHO!
I must admit those points are a bit old in my mind. The main thing is that LES query results should always be considered unsafe (queries go to random full nodes) and must be checked.
Checking a result for LES is easy (and done before caching, up to the CHT block header reference), but checking the absence of a result is impossible (or at least quite a difficult question).
Therefore, when you query information that does not exist, there are not many ways to accept a 'no result' reply other than saying either 1: 'I queried my entire node pool and did not get new nodes for a certain time', or 2: 'I made N random queries and consider that enough to think that at least one of those replies is accurate'. Those were the two use cases in place before.
So this PR does things differently. At first I felt this was additional complexity and did not really see its immediate usefulness, but on the other hand trying new/different things is often what enables great things in the long term.
So, trying to explain the new 'no reply' conditions, what are the correct formulations? At this point what I read is:
- 'SuccessRateOverTimeWindows' fails the response because there is no first success within the time window (0% success), which amounts to a 'simple timeout'
- request_limit_reached: same as my second condition. It is managed outside the failsafe crate (reading the failsafe crate, I just observed that there is a hard limit of 30 attempts for exponential backoff); here I think using a consecutive-failure strategy could make sense.
So it does not change that much (just a hard timeout on responses instead of my first condition: maybe simpler for the user), so I would not say it 'breaks' things.
Still, I do not really get why we use success rate over consecutive failures: the success rate can only be 0% or 100% (a request ends on its first success, and we use these constructions on a per-request basis), whereas consecutive failures have many possible states and are somewhat managed like the error counter in the request guards (I may be wrong, I did not read the crate implementation).
Edit: thinking twice, a 'no result' in LES could be seen as invalid if the query is valid for most queries. So the previous statement is far from being correct; I should check whether LES queries are validated.
Ok, thanks for taking the time, but to verify that I understood your concerns:
- This PR might still be useful (I want to add that it gives the user some information about why the request to the full node failed, or at least what the majority response was).
- The circuit breaker for responses is useless and we should replace it with a simple timeout, as you pointed out earlier, thanks for that!
- Exponential backoff for requests is useful, but because we only register bad responses here, it is better to use consecutive failures (also easier to understand).
I just observed that there is a hard limit of 30 attempts for exponential backoff; here I think using a consecutive-failure strategy could make sense.
I think you are wrong there: as you can see at https://github.com/dmexe/failsafe-rs/blob/master/src/backoff.rs#L136, the backoff is always regenerated. I think MAX_RETRIES is used to calculate the "binary exponential backoff" by shifting the value 1 << MAX_RETRIES, and a value bigger than 63 would overflow. That's my best guess.
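A small standalone illustration of the shift argument above (not the failsafe crate's implementation, only the arithmetic being discussed): with a 64-bit integer, 1 << n is only defined for n < 64, so the retry exponent has to be capped somewhere.

```rust
// Standalone illustration of the "1 << MAX_RETRIES" point above; this is not
// the failsafe crate's implementation, only the arithmetic being discussed.
fn binary_exponential_delay_ms(base_ms: u64, retries: u32) -> u64 {
    // Cap the exponent: shifting a u64 by 64 or more overflows (and panics in
    // debug builds), so the retry count must stay well below 64.
    let capped = retries.min(30);
    base_ms.saturating_mul(1u64 << capped)
}

fn main() {
    for retries in [0, 1, 5, 10, 30, 100] {
        println!("retries = {:>3} -> delay = {} ms", retries, binary_exponential_delay_ms(10, retries));
    }
}
```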
Still, I do not really get why we use success rate over consecutive failures: the success rate can only be 0% or 100% (a request ends on its first success, and we use these constructions on a per-request basis), whereas consecutive failures have many possible states and are somewhat managed like the error counter in the request guards (I may be wrong, I did not read the crate implementation).
Basically, the request will run forever in the circuit breaker until it gets a successful response. That is why we use an upper bound, max_failures, in the RequestGuard to decide when to drop the request.
It is okay for me to try consecutive failures instead, but then I suggest setting the value to 1 and then backing off and repeating until we either get a successful response or have backed off max_number times, in order to avoid doing too many network requests.
I think you are wrong there: as you can see at https://github.com/dmexe/failsafe-rs/blob/master/src/backoff.rs#L136, the backoff is always regenerated. I think MAX_RETRIES is used to calculate the "binary exponential backoff" by shifting the value 1 << MAX_RETRIES, and a value bigger than 63 would overflow.
Oops, sorry, I should have checked more carefully, thanks for looking at it.
Basically, the request will run forever in the circuit breaker until it gets a successful response. That is why we use an upper bound, max_failures, in the RequestGuard to decide when to drop the request.
That is what I understood :-)
I feel like the number of consecutive failures before backing off could make sense as a percentage of the current number of peers, but I do not really have anything to back that up (except that when testing a few months back I usually got a lot of peers returning invalid values, whereas when we do not have many peers we want to wait). Isn't that a kind of error-rate strategy? I start to wonder whether it may make sense to write our own tailored 'failure policy', like an 'ErrorRateOverTimeWindow' (just thinking out loud).
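To make the 'ErrorRateOverTimeWindow' idea slightly more concrete, here is a minimal standalone sketch of such a counter. It is deliberately not written against the failsafe crate's FailurePolicy trait (whose exact shape is not shown in this thread); it only illustrates the bookkeeping such a policy would need, with hypothetical names.

```rust
// Minimal sketch of an "error rate over a time window" counter, independent of
// the failsafe crate. `record` would be driven by on_demand response
// validation; `should_back_off` is what a failure policy would consult before
// letting another request through. All names are hypothetical.
use std::collections::VecDeque;
use std::time::{Duration, Instant};

struct ErrorRateOverTimeWindow {
    window: Duration,
    max_error_rate: f64,
    // (timestamp, was_error) for every recorded outcome inside the window.
    outcomes: VecDeque<(Instant, bool)>,
}

impl ErrorRateOverTimeWindow {
    fn new(window: Duration, max_error_rate: f64) -> Self {
        Self { window, max_error_rate, outcomes: VecDeque::new() }
    }

    fn record(&mut self, was_error: bool) {
        let now = Instant::now();
        self.outcomes.push_back((now, was_error));
        // Drop outcomes that fell out of the sliding window.
        while let Some(&(t, _)) = self.outcomes.front() {
            if now.duration_since(t) > self.window {
                self.outcomes.pop_front();
            } else {
                break;
            }
        }
    }

    fn should_back_off(&self) -> bool {
        if self.outcomes.is_empty() {
            return false;
        }
        let errors = self.outcomes.iter().filter(|&&(_, e)| e).count();
        (errors as f64 / self.outcomes.len() as f64) > self.max_error_rate
    }
}

fn main() {
    let mut policy = ErrorRateOverTimeWindow::new(Duration::from_secs(10), 0.5);
    for _ in 0..3 { policy.record(true); }  // three bad responses
    policy.record(false);                   // one good one
    println!("back off? {}", policy.should_back_off()); // 3/4 > 0.5 -> true
}
```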
I do not know about dropping response_guard; am I correct in saying that 'request_guard' and 'response_guard' are both in the 'Pending' struct? (Just wondering if they can be merged; it probably only makes sense with a custom 'failure policy'.)
I do not know about dropping response_guard; am I correct in saying that 'request_guard' and 'response_guard' are both in the 'Pending' struct? (Just wondering if they can be merged; it probably only makes sense with a custom 'failure policy'.)
Yes, they are, but they are slightly different: if we get a successful request we insert the same Pending again and wait for responses, and when we get a successful response or a failure we drop the Pending completely. The idea from the beginning was to have two failure policies in the circuit breaker and combine them, but because the state can't be accessed and it would never terminate, I created two wrappers instead. The problem is that we can only query whether a request is permitted or not, so we can't tell whether the circuit will be closed/half-open after backoff or stay open forever. So we can either add a timeout or keep some state ourselves.
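A rough sketch of that last point (why an external bound is needed on top of the breaker), assuming the failsafe-rs CircuitBreaker trait with `call` and `is_call_permitted`; `send_request` is a hypothetical stand-in, and none of this is the PR's actual RequestGuard code:

```rust
// Illustrative only: the breaker alone never says "give up", it only says
// "not now". Termination has to come from an external bound, similar to the
// PR's `max_failures` in RequestGuard.
use std::{thread, time::Duration};
use failsafe::{CircuitBreaker, Config, Error};

// Hypothetical request dispatch; always fails for the sake of the example.
fn send_request() -> Result<(), ()> {
    Err(())
}

fn main() {
    let breaker = Config::new().build();
    let max_failures = 5;
    let mut failures = 0;

    while failures < max_failures {
        if !breaker.is_call_permitted() {
            // Circuit is open: back off for a moment and ask again later.
            thread::sleep(Duration::from_millis(100));
            continue;
        }
        match breaker.call(send_request) {
            Ok(()) => return,                       // successful response: done
            Err(Error::Rejected) => continue,       // raced with the circuit opening
            Err(Error::Inner(())) => failures += 1, // bad response: counted by us
        }
    }
    // Only this external counter lets us drop the request for good.
}
```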
I don't really understand what you are suggesting, or the relationship between requests and responses, when you talk about our own failure policy.
What I can pick out is this:
- We should start with consecutive failures based on the number of peers
- When the number of peers is under some threshold we should back off?
- ?
Well, I think the circuit breaker API doesn't provide the functionality to change the number of peers after the circuit breaker has been created, for example, which makes that solution hard to get right (unless we want to write a circuit breaker ourselves).
Personally, I don't have any intuition about what these values should be and suggest just keeping it simple (e.g. consecutive failures, then backoff, continue backing off, and so on). The number of consecutive failures shall be configurable by the user.
👍 for keeping it simple. I don't have strong intuition on values either, but I identified two cases where my previous configuration was not good:
- start of a light client: there are too few nodes connected and we probably want to wait
- a well-established light client connection: there will be no new nodes, so waiting does not make a lot of sense.
About 'our own failure policy', I was mainly extrapolating that what we need may be some 'RateOverTimeWindow'; since we count errors and not successes, our own implementation of failsafe's FailurePolicy trait could be an idea. Thinking twice, going for the crate's success-rate counting could be another target, but it would involve managing the circuit breaker at a per-peer level instead of a per-request level. So I see two scenarios (I do not say they should be included in this PR):
- we keep per-request management, so we count errors only -> go with an error-count policy or a custom error-rate one.
- we plan to switch to a per-peer-id circuit, so we can count multiple successes (having a mutex in the circuit makes sense in this scenario), and keeping a success rate is explained by this future move. The issue here is that our request handlers are not thread-safe (not sure they need to be; maybe they can stay per-request-specific but use per-user-id circuit breakers). A rough sketch of this per-peer variant follows below.
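For the second scenario, here is a very rough sketch of what per-peer-id circuit breakers could look like, using the failsafe-rs default config, a hypothetical PeerId type, and a plain Mutex around the map; nothing here is from this PR.

```rust
// Rough sketch of per-peer circuit state: one breaker per peer id, shared
// behind a lock so several handlers can consult/record state for the same
// peer. PeerId and all names here are hypothetical, not from this PR.
use std::collections::HashMap;
use std::sync::Mutex;
use failsafe::{CircuitBreaker, Config};

type PeerId = u64;

struct PerPeerBreakers<C, F> {
    breakers: Mutex<HashMap<PeerId, C>>,
    make: F,
}

impl<C: CircuitBreaker, F: Fn() -> C> PerPeerBreakers<C, F> {
    fn new(make: F) -> Self {
        Self { breakers: Mutex::new(HashMap::new()), make }
    }

    // Is this peer currently allowed to receive a request?
    // A breaker is created lazily the first time a peer is seen.
    fn is_peer_permitted(&self, peer: PeerId) -> bool {
        let mut map = self.breakers.lock().expect("not poisoned");
        map.entry(peer).or_insert_with(&self.make).is_call_permitted()
    }
}

fn main() {
    let breakers = PerPeerBreakers::new(|| Config::new().build());
    println!("peer 7 permitted? {}", breakers.is_peer_permitted(7));
}
```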
All right, yeah I think it is out-of-scope for this PR but we can create an issue and investigate it further!
Thanks, Afri, it's far from being merged :P
Bump failsafe to v0.3.0 to enable `parking_lot::Mutex` instead `spin::Mutex`
When a reponse `times-out` provide the actual request or requests that failed
Co-Authored-By: niklasad1 <niklasadolfsson1@gmail.com>
Co-Authored-By: niklasad1 <niklasadolfsson1@gmail.com>
* Use second resolution on CLI args
* Use `consecutive failure policy` instead of `timeOverWindow`
* Add a couple of tests for `request_guard`
/cc @amaurymartiny any thoughts on the format of the error response?
The error format is good for me
I'm in favor of merging this PR. It adds certain benefits outlined in the PR description.
I think using guards at peer level (instead of request level) is about peer ranking and is slightly orthogonal to what we're trying to achieve here.
/// The maximum request interval for OnDemand queries
pub const DEFAULT_REQUEST_MAX_BACKOFF_DURATION: Duration = Duration::from_secs(100);
/// The default window length a response is evaluated
pub const DEFAULT_RESPONSE_TIME_TO_LIVE: Duration = Duration::from_secs(60); |
Why was it changed from 10 secs to 1min, isn't 1min too long for a user to wait?
I don't know, will revert sorry.
I think we still need to revert this line (but haven't yet), right?
Yeah, I should have commented instead of approving the PR to avoid the merge before fixing the comment, sorry.
@@ -1820,8 +1835,11 @@ mod tests {
arg_snapshot_threads: None,

// -- Light options.
arg_on_demand_retry_count: Some(15),
arg_on_demand_inactive_time_limit: Some(15000),
arg_on_demand_response_time_window: Some(2000), |
Could you update these numbers to seconds (/ 1000)?
* refactor(light) : N/W calls with `circuit breaker`
* fix(nits) : forgot to commit new files
* Add tests and change CLI args
* Address grumbles
* fix(failsafe-rs) : Santize input to prevent panics
* chore(failsafe) : bump failsafe, (parking_lot) Bump failsafe to v0.3.0 to enable `parking_lot::Mutex` instead `spin::Mutex`
* Remove `success_rate`
* feat(circuit_breaker logger)
* feat(CLI): separate CLI args request and response
* Fix tests
* Error response provide request kind: When a reponse `times-out` provide the actual request or requests that failed
* Update ethcore/light/src/on_demand/mod.rs Co-Authored-By: niklasad1 <niklasadolfsson1@gmail.com>
* Update ethcore/light/src/on_demand/mod.rs Co-Authored-By: niklasad1 <niklasadolfsson1@gmail.com>
* fix(grumbles): formatting nit
* fix(grumbles)
* Use second resolution on CLI args
* Use `consecutive failure policy` instead of `timeOverWindow`
* Add a couple of tests for `request_guard`
* fix(request_guard): off-by-one error, update tests
Attempt to close #9536
Major changes in this PR:
- An exponentially moving average (provided by failsafe-rs) over a fixed time period determines whether a request is evaluated as faulty or not (but only failures are recorded), plus a simple max_timeout_time for the maximum time that responses are evaluated.
Benefits
- Backs off exponentially upon failures and retries less frequently, to avoid spamming the network.
Circuit breaker
Example