
Fix kusama validators getting 0 backing rewards the first session they enter the active set #3722

Merged
merged 5 commits into master on Mar 18, 2024

Conversation

@alexggh (Contributor) commented Mar 17, 2024

There is a problem in the way we update the `authority-discovery` next keys, and because of that, nodes that enter the active set are noticed only at the start of the session in which they become active, instead of at the start of the previous session as intended. This is problematic because:

  1. The node itself advertises its addresses on the DHT only once it notices it should become active, in a loop that runs roughly every 10 minutes, so in this case it would notice only after it already became active.
  2. The other nodes won't be able to detect the new node's addresses at the beginning of the session, so they won't add it to their reserved set.

With 1 + 2, we end up in a situation where the new node won't be able to properly connect to its peers because it won't be in its peers' reserved set. Now, nodes accept by default `MIN_GOSSIP_PEERS: usize = 25` connections to nodes that are not in the reserved set, but given Kusama's size (> 1000 nodes) you could easily have more than `25` new nodes entering the active set, or the nodes may simply have no slots left because they already have connections to peers not in the active set.

In the end, what the node notices is 0 backing rewards, because it wasn't directly connected to the peers in its backing group.

Root-cause

The flow is like this:

  1. At BAD_SESSION - 1, in `rotate_session`, new nodes are added to `QueuedKeys` (https://github.com/paritytech/polkadot-sdk/blob/02e1a7f476d7d7c67153e975ab9a1bdc02ffea12/substrate/frame/session/src/lib.rs#L609):

     ```
     <QueuedKeys<T>>::put(queued_amalgamated.clone());
     <QueuedChanged<T>>::put(next_changed);
     ```

  2. `AuthorityDiscovery::on_new_session` is called with `changed` being the value of `<QueuedChanged<T>>` at BAD_SESSION - 2, because it was saved before being updated (https://github.com/paritytech/polkadot-sdk/blob/02e1a7f476d7d7c67153e975ab9a1bdc02ffea12/substrate/frame/session/src/lib.rs#L613):

     ```
     let changed = <QueuedChanged<T>>::get();
     ```

  3. At BAD_SESSION - 1, `AuthorityDiscovery::on_new_session` doesn't update its next_keys because `changed` was false.
  4. For the entire duration of BAD_SESSION - 1, everyone calling the runtime API `authorities` (which should return past, present and future authorities) won't discover the nodes that are about to become active.
  5. At the beginning of BAD_SESSION, all nodes discover that the new nodes are authorities, but by then it is already too late, because reserved_nodes are updated only at the beginning of the session by `gossip-support`. See above for why this is bad.

(A toy model reproducing this off-by-one follows below.)
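The sketch is self-contained Rust and is a toy model, not the actual pallet code: the struct, the method signatures and the validator ids are invented for illustration; only the `if changed` guard mirrors the pre-fix pallet logic.

```rust
// Toy model (not the real pallet): `on_new_session` receives a `changed`
// flag that lags one session behind, because the session pallet writes
// `QueuedChanged` only after the handlers have read the old value.
#[derive(Default)]
struct AuthorityDiscovery {
    keys: Vec<u32>,      // authorities active in the current session
    next_keys: Vec<u32>, // authorities expected in the next session
}

impl AuthorityDiscovery {
    // Pre-fix behaviour: next_keys is refreshed only when `changed` is true.
    fn on_new_session(&mut self, changed: bool, validators: Vec<u32>, queued: Vec<u32>) {
        self.keys = validators;
        if changed {
            self.next_keys = queued;
        }
    }

    // Models the `authorities` runtime API: current plus future authorities.
    fn authorities(&self) -> Vec<u32> {
        let mut all = self.keys.clone();
        all.extend(self.next_keys.iter().copied());
        all.sort_unstable();
        all.dedup();
        all
    }
}

fn main() {
    let mut ad = AuthorityDiscovery::default();
    // Steady state: validator set {1, 2}, nothing queued to change.
    ad.on_new_session(true, vec![1, 2], vec![1, 2]);
    // BAD_SESSION - 1: validator 3 is queued for the next session, but
    // `changed` still carries the stale value computed one session earlier.
    ad.on_new_session(false, vec![1, 2], vec![1, 2, 3]);
    // For the whole of BAD_SESSION - 1, discovery never mentions validator 3,
    // so it neither advertises itself on the DHT nor gets reserved slots.
    assert!(!ad.authorities().contains(&3));
    // BAD_SESSION: validator 3 finally shows up, but only once it is already
    // active, which is too late for gossip-support's reserved set.
    ad.on_new_session(true, vec![1, 2, 3], vec![1, 2, 3]);
    assert!(ad.authorities().contains(&3));
}
```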

Fix

Update next keys with the queued_validators at every session, no matter the value of `changed`. This is the same way the babe pallet correctly does it (https://github.com/paritytech/polkadot-sdk/blob/02e1a7f476d7d7c67153e975ab9a1bdc02ffea12/substrate/frame/babe/src/lib.rs#L655):

```
NextAuthorities::<T>::put(&next_authorities);
```
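In terms of the toy model above, the fix amounts to dropping the `changed` guard, so `next_keys` always tracks the queued validators (again a sketch; the real patch changes the authority-discovery pallet, not this toy type):

```rust
impl AuthorityDiscovery {
    // Post-fix behaviour: next_keys always tracks the queued validators,
    // mirroring how the babe pallet unconditionally writes NextAuthorities.
    fn on_new_session_fixed(&mut self, _changed: bool, validators: Vec<u32>, queued: Vec<u32>) {
        self.keys = validators;
        self.next_keys = queued; // always refresh, regardless of `changed`
    }
}
```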

Notes

  • The issue doesn't reproduce on proof-of-authority chains like `versi`, because there `changed` would always be true and `AuthorityDiscovery` correctly updates its next_keys every time.
  • Confirmed at session `37651` on Kusama that this is exactly what happens, by looking at blocks with polkadot.js.

TODO

  • Move versi to proof of stake and properly test before and after the fix to confirm there is no other issue.

…y enter active set

Signed-off-by: Alexandru Gheorghe <alexandru.gheorghe@parity.io>
@alexggh alexggh requested a review from a team as a code owner March 17, 2024 12:39
@alexggh alexggh requested review from eskimor and ordian March 17, 2024 12:49
@bkchr (Member) left a comment

Nice find 👍

Not sure you need to test this on versi. However, the tests should be updated, especially as there is already one: authorities_returns_current_and_next_authority_set ;)

Review thread on substrate/frame/authority-discovery/src/lib.rs (outdated, resolved)
alexggh and others added 3 commits March 18, 2024 08:43
Co-authored-by: Bastian Köcher <git@kchr.de>
Signed-off-by: Alexandru Gheorghe <alexandru.gheorghe@parity.io>
Signed-off-by: Alexandru Gheorghe <alexandru.gheorghe@parity.io>
@alexggh (Contributor, Author) commented Mar 18, 2024

> Not sure you need to test this on versi.

I agree the fix is pretty clear; I just want to make sure there aren't other places that I missed. We don't need to hold off merging this PR until I get that setup working.

> However, the tests should be updated, especially as there is already one: authorities_returns_current_and_next_authority_set ;)

Done.

> IMO it would make sense to make the docs of on_new_session more clear.

Did my best, hopefully it is clearer now.

@alexggh alexggh added the R0-silent Changes should not be mentioned in any release notes label Mar 18, 2024
@s0me0ne-unkn0wn (Contributor) left a comment

Looks good! I think that although it would be good to test on Versi, it's not strictly required. By the nature of the changes, the worst possible scenario is that we're updating something we shouldn't update with data that is still correct, so it doesn't make anything worse anyway.

@ordian (Member) left a comment

👍

@bkchr (Member) commented Mar 18, 2024

> I think, although it would be good to test on Versi, it's not strictly required.

If you want to test this, run a test network locally. Set MIN_GOSSIP_PEERS to 0 and then reproducing this should be fairly simple.

@bkchr bkchr added T1-FRAME This PR/Issue is related to core FRAME, the framework. and removed R0-silent Changes should not be mentioned in any release notes labels Mar 18, 2024
@bkchr (Member) commented Mar 18, 2024

@alexggh please add some small prdoc explaining the change.

@ordian (Member) commented Mar 18, 2024

Does this also fix #1819?

@alexggh (Contributor, Author) commented Mar 18, 2024

> Does this also fix #1819?

It does help: statement-v1 does not require direct connections to your peers in the validator group, so you might get lucky with the random gossiping when you are sparsely connected, but not fully lucky, so I guess that's why we would see just some missed statements instead of missing all the statements, as happens with statement-v2.

@alexggh (Contributor, Author) commented Mar 18, 2024

> If you want to test this, run a test network locally. Set MIN_GOSSIP_PEERS to 0 and then reproducing this should be fairly simple.

Yeah, that's what I ended up doing. It is not that easy, because ValidatorManager changes the set every session, which was masking the problem; you also need longer waiting times and more sessions because the DHT is updated only every 10 minutes.

Managed to reproduce it with zombienet and a modified ValidatorManager, and confirmed this patch fixes the issue.

Signed-off-by: Alexandru Gheorghe <alexandru.gheorghe@parity.io>
@alexggh (Contributor, Author) commented Mar 18, 2024

> @alexggh please add some small prdoc explaining the change.

Done.

@alexggh alexggh enabled auto-merge March 18, 2024 11:32
@alexggh alexggh added this pull request to the merge queue Mar 18, 2024
Merged via the queue into master with commit 8d0cd4f Mar 18, 2024
131 of 132 checks passed
@alexggh alexggh deleted the alexaggh/fix_kusama_validators_first_session branch March 18, 2024 12:15
alexggh added a commit that referenced this pull request Mar 18, 2024
…y enter the active set (#3722)

(cherry picked from commit 8d0cd4f)
alexggh added a commit that referenced this pull request Mar 18, 2024
…y enter the active set (#3722)

(cherry picked from commit 8d0cd4f)
alexggh added a commit that referenced this pull request Mar 18, 2024
…y enter the active set (#3722)

(cherry picked from commit 8d0cd4f)
alexggh added a commit that referenced this pull request Mar 18, 2024
…y enter the active set (#3722)

(cherry picked from commit 8d0cd4f)
alexggh added a commit that referenced this pull request Mar 18, 2024
…y enter the active set (#3722)

(cherry picked from commit 8d0cd4f)
bkchr added a commit that referenced this pull request Mar 18, 2024
… enter the active set (#3735)

This is a cherry-pick from master of
#3722

Expected patches for (1.7.0):
 `pallet-authority-discovery`

Signed-off-by: Alexandru Gheorghe <alexandru.gheorghe@parity.io>
Co-authored-by: Bastian Köcher <git@kchr.de>
bkchr added a commit that referenced this pull request Mar 18, 2024
… enter the active set (#3733)

This is a cherry-pick from master of
#3722

Expected patches for (1.7.0):
 `pallet-authority-discovery`

Signed-off-by: Alexandru Gheorghe <alexandru.gheorghe@parity.io>
Co-authored-by: Bastian Köcher <git@kchr.de>
Morganamilo pushed a commit that referenced this pull request Mar 20, 2024
… enter the active set (#3735)

dharjeezy pushed a commit to dharjeezy/polkadot-sdk that referenced this pull request Mar 24, 2024
…y enter the active set (paritytech#3722)

Labels
T1-FRAME This PR/Issue is related to core FRAME, the framework.