disputes: implement validator disabling (Stale) #784
Also disable validators who repeatedly vote against valid candidates. In general, disabling means that we should not accept any votes/statements from that validator for some time. Those include:
In addition, depending on how quickly we disable a validator, it might already have raised thousands of disputes (if it disputes every single candidate for a few blocks), we should also consider deleting already existing disputes (at the dispute-coordinator) in case one side of the dispute consists only and exclusively of disabled validators - so we apply disabling to already pending participations, not just new ones. This might be tricky to get right (sounds like it could be racy). The reason we should at least think about this a bit, is that so many disputes will delay finality for a significant amount of time resulting in DoS. Things to consider:
|
That's tracked in #785 and is purely runtime changes.
We can disable as soon as a dispute (reaching threshold) concludes.
Indeed. I'd be in favor of not complicating this unnecessarily. |
Just had a discussion with @ordian. So what is the point of disabling in the first place? It is mostly about avoiding service degradation due to a low number of misbehaving nodes (e.g. just one). There are other mechanisms in place which provide soundness guarantees even with such misbehaving nodes, but service quality (liveness) might suffer for everybody.
On the flip side, with disabling, malicious actors could take advantage of bugs/subtle issues to get honest validators slashed and thus disabled. Therefore disablement, if done wrong, could actually lead to security/soundness issues.
With these two requirements together, we can conclude that we don't need perfect disablement; an effective rate limit for misbehaving nodes is enough to maintain service quality. Hence we should be able to limit the number of nodes being disabled at any point in time to something like 10%, maybe 20%, in any case to something less than 1/3 of the nodes. If this threshold is reached, we can re-enable some nodes, either by random choice or based on the amount of accumulated slashes (or both). This way we get the desired rate limiting characteristics, but at the same time make it unlikely that an attacker can get a significant advantage via targeted disabling.
Furthermore, as this is about limiting the amount of service degradation a small number of nodes (willing to get slashed) can cause, it makes sense to only start disabling once a certain threshold in accumulated slashes is reached.
For the time being, we have no reason to believe that these requirements are any different for disabling in other parts of the system, like BABE. We should therefore double check that and, if it holds true, strive for a unified slashing/disabling system that is used consistently everywhere through the stack. |
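To make the cap idea above concrete, here is a minimal sketch in Rust. It is not taken from any implementation: the `Slashed` struct, the `select_disabled` function, the parts-per-billion unit and all parameter names are assumptions made for illustration. It covers only the deterministic variant (keep the most slashed validators disabled), applies a minimum accumulated slash before disabling starts, and never lets the disabled set reach 1/3 of the validators.

```rust
/// Hypothetical record of a slashed validator; names and units are illustrative only.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Slashed {
    pub validator_index: u32,
    /// Accumulated slash, in parts per billion of stake (assumed unit).
    pub slash_ppb: u64,
}

/// Returns the validator indices that stay disabled, in ascending order.
///
/// * Validators below `min_slash_ppb` are never disabled.
/// * At most `max_disabled_percent` of all validators are disabled,
///   and always strictly fewer than one third of them.
/// * If more validators qualify than the cap allows, the ones with the
///   largest accumulated slash stay disabled; the rest are enabled again.
pub fn select_disabled(
    slashed: &[Slashed],
    n_validators: usize,
    min_slash_ppb: u64,
    max_disabled_percent: usize,
) -> Vec<u32> {
    // Largest k with 3 * k < n_validators.
    let below_third = n_validators.saturating_sub(1) / 3;
    let cap = (n_validators * max_disabled_percent / 100).min(below_third);

    let mut candidates: Vec<&Slashed> = slashed
        .iter()
        .filter(|s| s.slash_ppb >= min_slash_ppb)
        .collect();
    // Largest slash first; ties broken by index so the result is deterministic.
    candidates.sort_by(|a, b| {
        b.slash_ppb
            .cmp(&a.slash_ppb)
            .then(a.validator_index.cmp(&b.validator_index))
    });

    let mut disabled: Vec<u32> = candidates
        .iter()
        .take(cap)
        .map(|s| s.validator_index)
        .collect();
    disabled.sort_unstable();
    disabled
}
```

With 10 validators and a 20% limit, at most two validators stay disabled regardless of how many were slashed. The random re-enabling variant mentioned above would replace the sort with a draw from whatever randomness source is chosen.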
|
I'll leave my thoughts on a strategy for validator disabling here so that we can discuss it and improve it further (unless it's a total crap 💩). When a validator gets slashed it's disabled following these rules:
Open questions:
|
Reiterating Requirements:
2 is conflicting with 1, as a small slash would result in barely any rate limiting. On the flip side, if a node is misbehaving it is definitely better to have it disabled and protect the network this way, than keep slashing the node over and over again for the same flaw. Luckily there is a solution to these conflicting requirements: Having the disabling strictly proportional to the slash is only necessary once a significant number of nodes would get disabled, hence we can introduce another (lower) threshold on number of slashed nodes, if it is below that threshold we just disable all of them, regardless of the amount.
Meaning of Disabling
Disabled nodes will always be determined in the runtime, so we do have consensus. There should be an API for the node to retrieve the list of currently disabled nodes as per a given block. The effect will be that no data from a validator disabled in a block X, should ever end up in block X+1. For simplicity and performance we will ignore things like relay parents of candidates, all that is relevant is the block being built. On the node side, we do have forks, therefore we will ignore data from validators as long as a disabling block is in our view.
Runtime
Node
For all nodes being disabled in at least one head in our current view:
Affected subsystems:
If we wanted to go fully minimal on node side changes, it should be enough to honor disabled state in the dispute coordinator. Degradation in backing performance should be harmless, approval subsystems are also robust against malicious actors, and filtering in the provisioner is strictly speaking redundant, as the filtering will also be performed in the runtime.
Disabling Strategy
We will keep a list of validators that have been slashed, sorted by slash amount. To determine which validators are going to be disabled for the current block, we do the following:
I would suggest to ignore slash amount in 3 for simplicity, because:
Rule 1 protects the network from a single (or a low number of) rogue validators and also protects those validators from themselves: Instead of getting slashed over and over again, they will end up being disabled for the whole session, giving operators time to react and fix their nodes. (See point 2 in requirements.) This means we will have two thresholds: one below which we always disable 100%, and one above which we start to randomly enable validators again.
Disabling, eras, sessions, epochs
Information about slashes should be preserved until a new validator set is elected. With a newly elected validator set, we can drop information about slashed validators and start anew with no validators disabled.
If we settle on this approach, then this would be obsoleted by the proposed threshold system. |
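One possible reading of the two-threshold strategy, sketched in Rust. The numbered steps of the strategy are not reproduced in this thread, so everything here is an interpretation: `disabled_for_block`, its parameters and the injected `draw` callback are hypothetical names. Below the lower threshold every slashed validator is disabled; otherwise each validator is disabled with a probability equal to its slash percentage, most slashed first, and the set never grows past the upper threshold.

```rust
/// Picks the disabled set for one block. Illustrative sketch only.
///
/// * `slashed`: (validator index, slash in percent, 0..=100).
/// * `lower_threshold`: with fewer slashed validators than this, disable all of them.
/// * `upper_threshold`: hard limit on the size of the disabled set otherwise.
/// * `draw`: returns a value in 0..100 for a validator; injected by the caller
///   because the source of randomness is an open question.
pub fn disabled_for_block(
    slashed: &[(u32, u8)],
    lower_threshold: usize,
    upper_threshold: usize,
    mut draw: impl FnMut(u32) -> u8,
) -> Vec<u32> {
    // Most slashed first; ties broken by index for determinism.
    let mut sorted: Vec<(u32, u8)> = slashed.to_vec();
    sorted.sort_by(|a, b| b.1.cmp(&a.1).then(a.0.cmp(&b.0)));

    let mut disabled: Vec<u32> = if sorted.len() < lower_threshold {
        // Only a few offenders: disable all of them, regardless of slash amount.
        sorted.iter().map(|&(index, _)| index).collect()
    } else {
        let mut picked = Vec::new();
        for &(index, percent) in &sorted {
            if picked.len() >= upper_threshold {
                break;
            }
            // A 100% slash always passes (draw < 100); a 10% slash passes
            // for roughly one draw in ten.
            if draw(index) < percent {
                picked.push(index);
            }
        }
        picked
    };
    disabled.sort_unstable();
    disabled
}
```

Passing the draw in as a callback keeps the sketch testable and leaves the choice of randomness (or a deterministic replacement) to the caller.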
Two questions/comments:
Why head in current view instead of 'slashed in finalized block'? To be proactive in case of finality stall?
And the second related to disabling strategy:
I think we should do this in two steps:
|
Yes, we could do that, but I argued above that we should be able to keep it simple without any harm done.
Yes. Given that attacks on disputes can trigger a finality stall, it would be really bad if attackers could avoid getting disabled by their very attack. At the same time, honest but malfunctioning nodes might already accumulate a significant amount of slash before getting disabled. |
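A minimal sketch of "disabled in at least one head in our current view", in Rust. The type aliases and the `disabled_in_view` function are assumptions for illustration; in particular, the per-block lists would come from the runtime, while here they are simply passed in as a map.

```rust
use std::collections::{BTreeSet, HashMap};

/// Illustrative aliases; not the real node types.
pub type BlockHash = [u8; 32];
pub type ValidatorIndex = u32;

/// Union of the disabled sets of every head in the current view.
///
/// `disabled_at` stands in for per-block disabled lists fetched from the
/// runtime. Heads without an entry contribute nothing.
pub fn disabled_in_view(
    view: &[BlockHash],
    disabled_at: &HashMap<BlockHash, Vec<ValidatorIndex>>,
) -> BTreeSet<ValidatorIndex> {
    view.iter()
        .filter_map(|head| disabled_at.get(head))
        .flatten()
        .copied()
        .collect()
}
```

Because the union is taken over unfinalized heads, a validator is ignored as soon as any fork in view disables it, which is the point made above about not waiting for finality.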
What are the other repercussions of forcing a new era? This sounds like a good idea, but I'm guessing it could break a lot of unrelated things. We should consider tooling as well.
I think we can just use the slashes as @eskimor suggested. But, yes, if a validator is disabled then reactivated then slashed again we need to recalculate the disabled list. I am still a little uncomfortable with the notion of disabling validators who haven't been 100% slashed in order to protect them from bugs when they can always ask to have the slashes reversed by governance. My bias is towards handling it economically and increasing the slashing amount if we think repeated misbehavior would bring too much load on the network before a bad actor loses all their stake. However, this probably isn't compatible with the solution we came up with for time overruns (since we have to balance the overrun charge with the collective amount slashed from potentially as much as a byzantine threshold of approval checkers). I'll probably just have to accept this. |
We discussed it yesterday. It's not a good idea. Starting a new era takes time and it's not safe to force it if we have got too many misbehaving validators. We won't do this. |
About the rate limiting, considering that we have that upper limit on disabled nodes: I think having a rate limiting disabling strategy for lesser slashes makes sense and adds little to no complexity. It only makes sense with accumulating slashes though, or alternatively if we considered the slashes being accumulative at least from the disabling strategy perspective. Consider nodes that are not behaving equally badly, some nodes being more annoying than others: we would disable them more and more until they are eventually silenced, having the network resume normal operation, while other nodes, having only minor occasional hiccups or even only one, would continue operating normally. This also has the nice property that the growth of the disabling ratio for an individual node will automatically slow down, as there are fewer opportunities for the node to commit any offenses. So to get disabled 100%, you really have to be particularly annoying. About accumulating slashes: We would like to protect the network from a low number of nodes going rogue, but once disputes are raised by more than just a couple of nodes it is not an isolated issue, but either an attack or more likely a network wide issue. In case of an attack, it would then be good to have accumulating slashes, in case of a network wide issue - accumulating slashes would still be no real harm, if we can easily refund them - can we? For isolated issues, nodes are protected from excessive slashing via disabling. |
A priori, we should avoid randomness here since on-chain randomness is biasable. It makes analyzing this annoying and appears non-essential. I've not thought much about it though, so if it's easy then explain. We can disable the most slashed nodes of course, which also remains biasable, but not for quite so long in theory. Ideally, we should redo the slashing for the whole system, aka removing slashing spans ala https://github.com/w3f/research/blob/master/docs/Polkadot/security/slashing/npos.md, but that's a larger undertaking. We'd likely plan for subsystem elsewhere bugs too, which inherently links this to the subsystem. |
We want slashes to be minimal while still accomplishing their protocol goals. It avoids bad press, community drama, etc. We do not know exactly what governance considers bugs, like what if the validator violates some obscure node spec rule. It's maybe even political, like based upon who requests a refund, who their ISP is, etc. In fact, there exist stakers like parity and w3f who'd feel reluctant to request refunds for some borderline bugs. |
We are disabling only slashed validators? We won't disable anyone disputing a valid block or voting for invalid block (unless being a backer)? |
Yes we only ever disable slashed validators. We do disable on disputing valid block though and we will also slash and disable for approving an invalid block, see #635 .. but a suitable disabling strategy as discussed here is a prerequisite for the latter. |
And one more question regarding:
If there is space for all 100% slash and all 10% slash (in this case) - should we (a) add all 10% slashed validators to the set or (b) still add them with 10% probability (and potentially skip some validators)? I think you meant (a) otherwise there is contradiction with:
|
No it is (b) - point 1 was under the prerequisite that we are below the lower threshold. For point 2 and on-wards this is not the case. Idea being: If there are only a few rogue validators having problems - just disable them and don't bother. It is not a security threat and keeping them silent is better for everybody. |
Yes, my bad. There is no contradiction. If we are at point 2, we are already above the limit. |
I like thinking of this as rate limiting instead of disabling. Something at least like And then if we reach a concerning threshold of active validators, even just on average, we can slow the rate limiting. A special case is when it's so bad we need to reactivate validators that have been slashed 100%: they still shouldn't be allowed to back candidates and maybe not produce relay chain blocks either. We could generally have the slower rate limiting apply only to finality and not backing and block production. The upside of this is it doesn't require randomness. However, the problem is we'd need to think about whether nodes are synced up in how they're rate limited. For example, if you have 10% of the network 50% rate limited that would be fine if the rate limiting is staggered, which is less likely in practice if we don't intentionally design it that way. |
I think we can't do this. If we disable more than f validators, we'll break the security assumptions of the protocols. Isn't not allowing them to back candidates more or less equal to disabling them?
Can you elaborate on this? How will we pass by without randomness? |
This would just be choosing safety over liveness, no?
We can rate limit deterministically, like in my example. Regardless of whether we do it deterministically or try to do it randomly, we do still probably need to assume all rate limited validators are sometimes inactive in the same slot -- or likely because maybe they were slashed for the same reason. Unless we try to intentionally stagger them and do some complicated bookkeeping around it. So if we want no more than 10% inactive then we'd probably have to back down on rate limiting when 10% are rate limited at all. Maybe that's not a problem. |
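The example referred to above is not reproduced in this thread, so the following Rust sketch is only a guess at what deterministic rate limiting could look like. The functions `is_active` and `inactive_count`, the slot-based period, and the index-based stagger are all assumptions. It shows why staggering matters: without it, every rate limited validator is inactive in the same slot.

```rust
/// Whether a rate limited validator may act in `slot`. Illustrative only.
///
/// The validator is inactive for `inactive_slots` out of every `period`
/// slots. With `stagger`, the inactive window is shifted by the validator
/// index so that validators are not all silent in the same slot.
pub fn is_active(
    validator_index: u32,
    slot: u64,
    period: u64,
    inactive_slots: u64,
    stagger: bool,
) -> bool {
    if period == 0 {
        // No rate limit configured.
        return true;
    }
    let offset = if stagger {
        u64::from(validator_index) % period
    } else {
        0
    };
    (slot % period + offset) % period >= inactive_slots
}

/// Number of rate limited validators that are inactive in `slot`.
pub fn inactive_count(
    limited: &[u32],
    slot: u64,
    period: u64,
    inactive_slots: u64,
    stagger: bool,
) -> usize {
    limited
        .iter()
        .filter(|&&v| !is_active(v, slot, period, inactive_slots, stagger))
        .count()
}
```

For four validators limited to 50% (`period = 2`, `inactive_slots = 1`), the unstaggered variant silences all four in even slots and none in odd slots, while the staggered variant silences two in every slot. That is the worst case one would have to assume when deciding how many validators can be rate limited at once.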
We can put it this way. My main concern was that we were trying to handle a case when there are more than f byzantine nodes but this is not entirely correct. f is related to all validators, not just the ones in the active set right? My concern with disabling too many validators is killing the network in case of a bug which is not an attack. If we sacrifice liveness aren't we killing any chances of governance to recover the network?
Yes I understand your idea for the disabling now. Thanks! |
Think we want to push this even further, I don't think we ever need to re-enable a validator within an era. Enabling them after a session will lead to them committing the offence again if they are buggy or malicious. The scope of disabling should be eras as the validator sets change in the scope of eras. This might slightly alter the storage considerations. We can follow-up on the disputes call. |
I am not sure how this resolves the issue with era changes I described. In particular I am not sure which problem you want to solve with this at all. 🤔 .. Do you want to change the runtime API to only get us newly disabled nodes? I might be missing something obvious ... it is quite late already. 😪 |
The implementation simplicity, ensuring we use only one strategy consistently across backing, statement-distribution and disputes on the node side. This API is meant to be stored and used on the node side only.
It doesn't. Persisting disabled state for the next era should help though. |
Mini-guide for the current version of the disabling strategy:
|
Whether or not we re-enable disabled validators, I'd argue we need special handling for disputes and not just rely on on-chain state.
For dispute disabling it would be easier to use relay_parent state (if possible) + For statement-distribution, as it is only an optimization, and the main filtering will be done in the runtime, by using |
First of all attackers need to pull off a successful time dispute attack and get 1/3 of the network slashed (probably a minuscule amount if any at all based on if we have time disputes countermeasures). If they succeed they can:
I assume what you mean is they vote
Is this an issue tho? Disabling is there to reduce damage done in the current era so if they are already gone there's not much more damage they can cause. Generally the main bulk of the punishment comes from the slash which should still be applied even if they are no longer in the active validator set. |
And pulling off a time dispute attack where you are not even slashed means that it must be the collator attack variant (we have 3 main flavours of time dispute attacks: malicious collators, backers or checkers). Time dispute attacks organised by malicious collators are hard to pull off. They would need to construct a block that takes less than 2s on at least 2 out of 5 backers (one of the reasons why I'm strongly opposed to lowering the backing requirement) and then the same block would need to miraculously take more than 12s on many (1/3 in fact) of approval checkers. While not statistically impossible this is the least probable time dispute attack. |
I think we slash even in time disputes - or at least we can now. (Slashes are deferred, validators don't lose nominators, no chilling, ...)
Yep. For the node side disabling data structures, I don't think the suggested one cuts it. I would propose the following (pseudo code):
(lru size is session window) We need this map for two reasons:
Now, on receiving a dispute message, what we would be doing is the following:
On concluding disputes, we add losing validators to With this strategy we are covering both (1) and (2), without risking any consensus issues. TL;DR: Only use the node side set, if disabled set in the runtime is not too large already. |
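The pseudo code from this comment is not reproduced here, so the following Rust sketch is not a reconstruction of it. It only illustrates the TL;DR under stated assumptions: a per-session node side set bounded by the session window, which is consulted only while the runtime's disabled set is still small. `OffchainDisabled`, `note_lost_dispute`, `should_ignore` and `max_runtime_disabled` are hypothetical names, and a `BTreeMap` pruned by oldest session stands in for the LRU.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Illustrative aliases; not the real node types.
pub type SessionIndex = u32;
pub type ValidatorIndex = u32;

/// Node side record of validators that lost a dispute, per session.
pub struct OffchainDisabled {
    /// Maximum number of sessions kept (the session window).
    window: usize,
    per_session: BTreeMap<SessionIndex, BTreeSet<ValidatorIndex>>,
}

impl OffchainDisabled {
    pub fn new(window: usize) -> Self {
        Self { window, per_session: BTreeMap::new() }
    }

    /// Records the losing side of a concluded dispute and evicts the
    /// oldest sessions once the window is exceeded.
    pub fn note_lost_dispute(
        &mut self,
        session: SessionIndex,
        losers: impl IntoIterator<Item = ValidatorIndex>,
    ) {
        self.per_session.entry(session).or_default().extend(losers);
        while self.per_session.len() > self.window {
            self.per_session.pop_first();
        }
    }

    /// Whether a dispute message from `validator` should be ignored.
    ///
    /// The runtime's disabled set is always honored. The node side set is
    /// only used while the runtime set has fewer than
    /// `max_runtime_disabled` entries (an assumed limit).
    pub fn should_ignore(
        &self,
        session: SessionIndex,
        validator: ValidatorIndex,
        runtime_disabled: &BTreeSet<ValidatorIndex>,
        max_runtime_disabled: usize,
    ) -> bool {
        if runtime_disabled.contains(&validator) {
            return true;
        }
        if runtime_disabled.len() >= max_runtime_disabled {
            return false;
        }
        self.per_session
            .get(&session)
            .map_or(false, |set| set.contains(&validator))
    }
}
```

Keeping the node side set purely advisory, and switching it off once the runtime set is large, is what avoids the consensus risk: nodes may disagree on the off-chain set, but it can never push the total number of ignored validators past what the runtime already allows.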
Yes. There will always be at least 10 blocks (with forks and lacking finality maybe even significantly more) full with candidates that can be disputed of the previous session(s). With 100 cores, we are talking about > 1000 candidates. Now this validator can still dispute those candidates, even if no longer live in the current session. This can be quite a significant number of disputes and with our current 0% slash it would go completely unpunished:
Now with the above algorithm, the guy would still go unpunished, but at least the harm to the network would be minimized. This would actually be an argument for >0% slashes. |
This is the most current design of the disabling strategy: #2955 Overall state done, only missing validator re-enabling. Can deploy with it missing but awaiting audit before deployment. |
# Summary This PR aims to remove `im-online` pallet, its session keys, and its on-chain storage from both Kusama and Polkadot relay chain runtimes, thus giving up liveness slashing. # Motivation * Missing out on rewards because of being offline is enough disincentive for validators. Slashing them for being offline is redundant. * Disabling liveness slashing is a prerequisite for validator disabling. # See also paritytech/polkadot-sdk#1964 paritytech/polkadot-sdk#784
Once a dispute is concluded and an offence is submitted with `DisableStrategy::Always`, a validator will be added to the `DisabledValidators` list.
Implement on-chain and off-chain logic to ignore dispute votes for X sessions. Optionally, we can ignore backing and approval votes and remove from the reserved validator set on the network level.
`statement-distribution` subsystem #1591
statement-distribution: validator disabling #1841
`BackedCandidates` in process_inherent_data #1863
`DisabledValidators` runtime api call is released #1940
Possibly related paper here.
Goals for new validator disabling/Definition of Done
Timeline
As quickly as possible, definitely by the end of the year.