Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Stop accepting bundle produced by operator who has previously produced invalid bundle #2316

Open
NingLin-P opened this issue Dec 12, 2023 · 18 comments

Comments

@NingLin-P
Copy link
Member

Currently, the honest operator only submits fraud poof if it finds a mismatch between the ER stored onchain and its local ER. For malicious operator who produce an invalid bundle, it will be marked on the ER and slashed once the ER is confirmed thus the slashing is deferred.

This can be an issue because the malicious operator can still produce bundles before it get slash, if it continue to produce more invalid bundle and it may waste more block space than it can afford since the MinOperatorStake is set to 2*Block while it can waste up to challenge_period * Block.

This can be fixed by introducing a new status PendingSlash(domain_block_number) to the operator:

  • The operator's status should set to PendingSlash once an ER is submitted and it claims that the operator has produces an invalid bundle
    • The domain_block_number inside PendingSlash(domain_block_number) is the domain block number of the ER
    • The operator's stake should be removed from the domain's total_stake while everything else should still keep until the ER is confirmed
  • The farmer should stop accepting bundles produced by PendingSlash operator
  • When a fraud proof is submitted and a bad ER is pruned, the invalid bundle claimed by the bad ER may be false
    • If the bad_receipt.domain_block_number <= PendingSlash.domain_block_number the PendingSlash status should be reverted and the operator's stake should be added to the domain's total_stake

cc @dariolina @vedhavyas

@dariolina
Copy link
Member

I think the root of this issue stems from the initial divergence of mental models here. In the current model, fraud proofs are viewed as a way to signal bad ERs. However, the way Chen was initially proposing was for fraud proofs to signal invalid bundles, and a bundle in turn could be invalid because of bad extrinsics or because of bad ER fields. The set of fraud proofs we have for InherentTx, OutOfRangeTx, etc. was supposed to be able to target a bundle author and slash them immediately.
Since we already have all that in place of ER::bundles field can we reuse it for fraud proofs of an invalid bundle (targeting a bad_bundle_hash instead of bad_receipt_hash)?

@vedhavyas
Copy link
Contributor

Since we already have all that in place of ER::bundles field can we reuse it for fraud proofs of an invalid bundle (targeting a bad_bundle_hash instead of bad_receipt_hash)?

I agree with the divergence and we have outdated TODOs for this but even if the fraud proof is submitted, there could be multiple bundles coming from the operator before fraud proof is submitted.

But on the other hand, not letting the bundle author submit more bundles on runtime seems more straight forward.

@dariolina
Copy link
Member

But we can't have consensus on whether PendingSlash is just until the accusing ER is confirmed, right? So the accusation may be unjust?

@NingLin-P
Copy link
Member Author

But we can't have consensus on whether PendingSlash is just until the accusing ER is confirmed, right? So the accusation may be unjust?

Yeah, it is, but it won't affect the honest operator much. When a bad ER with fraudulent inboxed_bundles is submitted to the consensus chain, even without introducing PendingSlash, the honest operator won't and can't submit bundle before the bad ER is pruned by fraud proof. So either way, the honest operator needs to submit a fraud proof first.

@dariolina
Copy link
Member

I see, so the expected scenarios are:

  • accusation is legit -> accused malicious operator can't dispute it and can't submit anymore bundles
  • accusation is wrong -> accused honest operator disputes it with fraud proof -> fraud proof is accepted -> they can produce more bundles

@vedhavyas
Copy link
Contributor

Yeap. That looks right to me

@vedhavyas
Copy link
Contributor

@dariolina @NingLin-P do we have a quorum here on the approach of not letting operators produce bundles once they produce invalid bundle. If accusation was wrong, they should be able to produce bundles once FruadProof is in for such bad accusation

@dariolina
Copy link
Member

dariolina commented Dec 14, 2023

Looks good to me in general.
One nit

The operator's stake should be removed from the domain's total_stake while everything else should still keep until the ER is confirmed

Is this at the end of epoch?
If epoch changed and accused operator stake was removed from VRF, and then pending slash is lifted in the epoch, we will either have more actual stake than domain_total_stake counts in VRF, which means more bundles, or have them wait until next epoch which isn't fair.

@NingLin-P
Copy link
Member Author

Is this at the end of epoch?

Yeah, so in most cases where the honest operator is online and fraud proof will be submitted within a few blocks, the false accusation will be removed before this happen.

or have them wait until next epoch which isn't fair.

In this case, we can force epoch transition when fraud proof is submitted, so the honest operator won't need to wait and it doesn't affect at all from the bundle production perspective.

@vedhavyas
Copy link
Contributor

This is how I'm thinking implementation would look like

  • Operator submits a bundle
  • Next operator marks bundle as invalid. So runtime, would tag that operator to not produce any further bundles
  • If the accuation was wrong, a fraud proof is submitted. If valid, we will remove the tag and operator can submit bundles again
  • If the accusation was indeed correct, then they will slashed once the Er is confirmed.

I dont think we should remove the operator from the staking since fraud proof may be delayed before epoch is finished. If we remove them before fraud proof comes in, then, like you said, they would need to wait until next epoch is complete which is not fair yes and I can see malicious operators marking bundles invalid right before epoch completion to kick honest operators out of the set.

If the operator was indeed malicious, then they wont produce any bundles. Even they try to deregister, they will be slashed once ER is confirmed.

Does this make sense @dariolina
Is there anything missing from this idea ?

@NingLin-P
Copy link
Member Author

I dont think we should remove the operator from the staking since fraud proof may be delayed before epoch is finished.

If they are not removed from staking, it will lower the bundle probability for other honest operators.

If we remove them before fraud proof comes in, then, like you said, they would need to wait until next epoch is complete which is not fair yes and I can see malicious operators marking bundles invalid right before epoch completion to kick honest operators out of the set.

We can detect such situations in consensus runtime and force epoch transition immediately once fraud proof is submitted.

@vedhavyas
Copy link
Contributor

If they are not removed from staking, it will lower the bundle probability for other honest operators.

Sure and bundle production will slow down but this is not permanent and they will be removed once the ER is confirmed. I fail to see why this is an issue here

We can detect such situations in consensus runtime and force epoch transition immediately once fraud proof is submitted.

I would prefer not doing such forced transitions by runtime itself. Forced epoch transitions are meant for the Sudo or Governance when there is no way to recover such domain. This case is not warranted for such forced epoch transition.
If we start forcing epoch, not only will runtime do a lot of computation on epoch completion but since this could be triggered runtime, we can estimate how many time we would need to do this just because malicious operators are marking some bundles invalid

@NingLin-P
Copy link
Member Author

Sure and bundle production will slow down but this is not permanent and they will be removed once the ER is confirmed. I fail to see why this is an issue here

Depending on the percentage of stake of the operator who produces invalid bundle, if it is 50% the bundle production is slow by half, if it is 90% production slow by 10 times. While the honest operator can detect that within a few blocks we need to wait a whole challenge period pass (i.e. 1d)

Forced epoch transitions are meant for the Sudo or Governance when there is no way to recover such domain.

That is the current usage not the fundamental limit of force epoch transition

since this could be triggered runtime, we can estimate how many time we would need to do this just because malicious operators are marking some bundles invalid

Indeed, this will increase the weight of the submit_fraud_proof call:

  • in general, we should decrease the cost of epoch transition, this also benefited by the submit_bundle call
  • maybe we can find a way to only count the epoch transition weight if the runtime detected such situation

@vedhavyas
Copy link
Contributor

Depending on the percentage of stake of the operator who produces invalid bundle, if it is 50% the bundle production is slow by half, if it is 90% production slow by 10 times. While the honest operator can detect that within a few blocks we need to wait a whole challenge period pass (i.e. 1d)

Sure but this will limit the slowness to just that particular domain and consensus does not need to do any extra compute. We generally prefer not making the consensus do extra work.

if it is 90% production slow by 10 times.

This would be a scenario that might not be practical for all the domains. If there are some domains that has the limited operator set, then be it that that particular domain will have slow block production because there is a malicious operator with huge stake.
We do not need to necessarily solve fastness for malicious operated domains and their operators but if the accusation was incorrect, then after fraud proof honest operator will continue to produce the bundle without any wait.

That is the current usage not the fundamental limit of force epoch transition

I would have to disagree with you here. Though it may not be a fundamental limit, this will be an abuse of doing such epoch transitions when there is no such reason to do so.

Indeed, this will increase the weight of the submit_fraud_proof call:
in general, we should decrease the cost of epoch transition, this also benefited by the submit_bundle call
maybe we can find a way to only count the epoch transition weight if the runtime detected such situation

Overall, this seems more like an optimisation that can come in if we have a efficient way to do such epoch transitions. Though I would be hesistant to do such things by runtime and should only come from Sudo or Governance.
We can further think of such optimisations for Domains with malicious operator with huge stake but primary focus should be for the domains with honest operators IMHO

@NingLin-P
Copy link
Member Author

Sure but this will limit the slowness to just that particular domain

Wasn't it already an issue?

and consensus does not need to do any extra compute. We generally prefer not making the consensus do extra work.
I would have to disagree with you here. Though it may not be a fundamental limit, this will be an abuse of doing such epoch transitions when there is no such reason to do so.

What we do just to repair the damage made by the malicious operator, through both prune bad ER and epoch transition, I don't think we should count that as "extra" and "no reason".

@vedhavyas
Copy link
Contributor

Wasn't it already an issue?

Not ideally. We are talking about specific scenario where a malicious operator had a huge amount of stake for a given domain. In practical scenraio, maybe I'm not seeing everything here, from the example taken from the Gemini-3g, this should not lead to slow bundle production IMHO given a sufficient number of operators for that domain.

What we do just to repair the damage made by the malicious operator, through both prune bad ER and epoch transition, I don't think we should count that as "extra" and "no reason".

This should not be the task of Consensus chain, IMHO. Though I dont have strong reason against what you are proposing, there are many ways to make consensus do extra work to clear up such damage made by bad actors.
Maybe @dariolina @nazar-pc has some thoughts here ?

@dariolina
Copy link
Member

I would avoid using force epoch transition, please.
I still think this should be handled by the accuser producing also the InvalidBundle fraud proof, apart from marking it invalid in their ER. Then if it is accepted, the accused is slashed and if not, the accuser will be slashed for invalid ER.
In this approach, any operator is innocent until proven guilty, which is the def of optimistic approach.
In PendingSlash status, the accused is guilty until proven innocent, which feels fundamentally wrong.

@NingLin-P
Copy link
Member Author

I'm okay with avoiding epoch transition since it makes implementation easier and won't make things worse.

I think the root of this issue stems from the initial divergence of mental models here. In the current model, fraud proofs are viewed as a way to signal bad ERs. However, the way Chen was initially proposing was for fraud proofs to signal invalid bundles, and a bundle in turn could be invalid because of bad extrinsics or because of bad ER fields.

The divergence happens because making fraud proof signal bundle (and covering ER through bundle) will bring much more significant change in many aspects of domain (rather than just fraud proof) that are not mentioned in the design doc.

And IIRC from the original design doc, the reason for signal bundle is something like "bundle submitted with bad ER is produced by the malicious operator so should not be handled" which is not a strong statement and not really correct since the honest operator can validate every tx in the bundle before further handling.

Thus we chose the more straightforward and intuitive approach to signal bad ER, but now with all the changes/infra made, I do see feasibility in that approach, which may also help #1673, I will open another issue with more detail for discussion.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants