(fix): Making stress.py with node_restart mode pass, and fixing #2916 #3036

Merged

nearprotocol-bulldozer bot merged 9 commits into master from stress_node_restart_and_2916 on Jul 31, 2020

Conversation

SkidanovAlex
Collaborator

After this change stress.py node_restart passes relatively consistently, and is reintroduced to nightly.

Nearcore fixes:

  • We had a bug in the syncing logic (with a low chance of being
    triggered in the wild): if a block is produced and between 1/3 and 2/3
    of block producers received it while the rest have not, the system
    stalls, because no 2/3 of block producers share the same head, but also
    nobody is two blocks behind the highest peer, so nobody starts syncing.
    Fixing it by forcing sync if we've been 1 block behind for too long
    (see the sketch after this list). stress.py was reproducing this issue
    in every run.

  • (After a restart, a node doesn't respond to chunk requests that have a non-empty list of requested receipts #2916) we had an issue where, if a node produced a chunk and then
    crashed, on recovery it was not able to serve it because it didn't have
    all the parts and receipts stored in the storage from which we recover
    cache entries in the shards manager. Fixing it by always storing all the
    parts and receipts (redundantly) for chunks in the shards we care about.
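
For intuition, below is a minimal Python sketch of the forced-sync heuristic described in the first bullet. It is illustrative only: the actual logic lives in nearcore's Rust client, and the class name, threshold value, and structure here are assumptions.

import time

# Assumed threshold, purely for illustration; the real value lives in the client.
SYNC_IF_ONE_BLOCK_BEHIND_FOR_SEC = 10

class SyncDecider:
    """Sketch of the "force sync if we've been 1 block behind for too long" rule."""

    def __init__(self):
        # Wall-clock time at which we first observed a 1-block gap, or None.
        self.one_block_behind_since = None

    def should_sync(self, our_height, highest_peer_height, now=None):
        now = time.time() if now is None else now
        gap = highest_peer_height - our_height
        if gap >= 2:
            # Pre-existing rule: two or more blocks behind the best peer -> sync.
            self.one_block_behind_since = None
            return True
        if gap == 1:
            # New rule: a 1-block gap that persists long enough also triggers sync,
            # which unblocks the "between 1/3 and 2/3 of producers got the block" stall.
            if self.one_block_behind_since is None:
                self.one_block_behind_since = now
            return now - self.one_block_behind_since > SYNC_IF_ONE_BLOCK_BEHIND_FOR_SEC
        # Caught up (or ahead): reset the timer.
        self.one_block_behind_since = None
        return False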

Test fixes

[v] Fixing a scenario in which a failure to send a transaction to all
validators resulted in recording an incorrect tx hash alongside the tx.
Later, checking balances using the incorrect hash returned an incorrect
success value, and thus incorrect corrections were applied to the
expected balances (see the sketch after this list);

[v] Changing the order of magnitude of staking transactions, so that the
validator set actually changes.
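
As a rough illustration of the first test fix, here is a hedged sketch of recording the tx hash only once the send actually succeeds; the helper names (send_tx, tx_log) and the response shape are hypothetical, and the real fix is in stress.py.

def send_tx_and_record(nodes, signed_tx, tx_log):
    # Try each validator in turn; only record a hash once a node has actually
    # accepted the transaction, so a failed broadcast can never associate a
    # stale or incorrect hash with the tx (which later corrupts balance checks).
    for node in nodes:
        try:
            result = node.send_tx(signed_tx)  # hypothetical RPC helper
        except Exception:
            continue  # this validator was unreachable; try the next one
        tx_hash = result['result']  # assumed response shape, illustration only
        tx_log.append((signed_tx, tx_hash))
        return tx_hash
    return None  # nobody accepted the tx; record nothing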

Other issues discovered while fixing stress.py:
- #2906

SkidanovAlex requested a review from mfornet on July 24, 2020, 23:01
@SkidanovAlex
Collaborator Author

Inviting @mfornet to review the changes to the network tests infra.

await conn0.send(peer_message)

received_response = False
# make several attempts, in case the node sends us a block or some other message before sending the response
Collaborator

is this robust? should we wait until a PartialEncodedChunkResponse is sent?

SkidanovAlex
Collaborator Author
Jul 26, 2020

The goal of this loop is to see whether it is ever sent. I can have a wall-clock timeout instead of a number of messages, but generally with the current nearcore it is very robust (the PECR message is either the first message, or the second if a block was produced in the meantime).

mfornet
Member
left a comment

In general, the changes to the Python infrastructure look good; however, I wrote some suggestions about how to improve robustness for the repro_2916.py test.

Comment on lines 34 to 50
async def recv(self, expected=None):
    while True:
        response_raw = await self.recv_raw()
        if response_raw is None:
            return None

        response = BinarySerializer(schema).deserialize(
            response_raw, PeerMessage)

        if expected is None or response.enum == expected:
            return response
Member


Change expected (an enum name) to expected_fn, a function that determines whether this is the message we are waiting for.

Suggested change

async def recv(self, expected=None):
    while True:
        response_raw = await self.recv_raw()
        if response_raw is None:
            return None
        response = BinarySerializer(schema).deserialize(
            response_raw, PeerMessage)
        if expected is None or response.enum == expected:
            return response

async def recv(self, expected_fn=None):
    while True:
        response_raw = await self.recv_raw()
        if response_raw is None:
            return None
        response = BinarySerializer(schema).deserialize(
            response_raw, PeerMessage)
        if expected_fn is None or expected_fn(response):
            return response

Collaborator Author


I don't want to break existing users, so I made it both compare and try to call.
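
A minimal sketch of what that could look like, reusing the names from the snippet above (the actual implementation is in pytest/lib/peer.py, so the details may differ):

async def recv(self, expected=None):
    while True:
        response_raw = await self.recv_raw()
        if response_raw is None:
            return None

        response = BinarySerializer(schema).deserialize(response_raw, PeerMessage)

        if expected is None:
            return response
        if callable(expected):
            # New-style callers pass a predicate over the deserialized message.
            if expected(response):
                return response
        elif response.enum == expected:
            # Old-style callers keep passing the enum name as a string.
            return response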

Comment on lines 77 to 89
for i in range(5):
    response = await conn0.recv()
    if response is None:  # None means timeout
        break
    elif response.enum == 'Routed' and response.Routed.body.enum == 'PartialEncodedChunkResponse':
        print("Received response for shard %s" % shard_ord)
        received_response = True
        break
    else:
        print(response.enum)
else:
    print("Didn't receive response for shard %s" % shard_ord)
Member


As mentioned by @bowenwang1996, this might not be very reliable. Instead, we can pass a filter to recv so that it waits until the expected message arrives. As per the suggestion above, we can use expected_fn as in the following snippet:

Suggested change

for i in range(5):
    response = await conn0.recv()
    if response is None:  # None means timeout
        break
    elif response.enum == 'Routed' and response.Routed.body.enum == 'PartialEncodedChunkResponse':
        print("Received response for shard %s" % shard_ord)
        received_response = True
        break
    else:
        print(response.enum)
else:
    print("Didn't receive response for shard %s" % shard_ord)

try:
    expected_fn = lambda response: response.enum == 'Routed' and response.Routed.body.enum == 'PartialEncodedChunkResponse'
    _response = await asyncio.wait_for(conn0.recv(expected_fn), 5)
    received_response = True
    break
except concurrent.futures.TimeoutError:
    print("Didn't receive response for shard %s" % shard_ord)

This way we have a timeout in only one place and wait exactly for the message we are expecting; this works even if we receive several irrelevant messages in between, as long as the expected message arrives before the timeout.

SkidanovAlex and others added 7 commits July 30, 2020 06:42
After this change stress.py passes consistently, and is reintroduced to nightly.

Nearcore fixes:

[v] We had a bug in the syncing logic (with a low chance of being
triggered in the wild): if a block is produced, and between 1/3 and 2/3
of block producers received it, and the rest have not, the system
stalls, because no 2/3 of block producers have the same head, but also
nobody is two blocks behind the highest peer to start syncing. Fixing it
by forcing sync if we've been 1 block behind for too long.
stress.py was reproducing this issue in every run

Test fixes

[v] Fixing a scenario in which a failure to send a transaction to all
validators resulted in recording an incorrect tx hash alongside the tx.
Later, checking balances using the incorrect hash returned an incorrect
success value, and thus incorrect corrections were applied to the
expected balances;

[ ] Removing the old infrastructure for network interference, which relied on
a particular node setup, and instead using the new network proxy infrastructure.

[ ] Adding a new argument that controls the percentage of dropped messages between nodes (see the sketch after this commit message).

[v] Changing the order of magnitude of staking transactions, so that the
validator set actually changes.

[ ] Altering the `node_restart` process to occasionally wipe out the data folder of
the node, so that we stress state sync (and syncing in general) more.

Other issues discovered while fixing stress.py:
- #2906
Co-authored-by: Bowen Wang <bowenwang1996@users.noreply.github.com>
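
Regarding the new drop-percentage argument mentioned in the commit message above, a rough sketch of what a message-dropping proxy handler could look like (the handler interface and names here are assumptions; the real code is in the pytest network proxy infrastructure):

import random

class RandomDropHandler:
    """Drops a configurable fraction of messages forwarded between nodes."""

    def __init__(self, drop_probability):
        assert 0.0 <= drop_probability <= 1.0
        self.drop_probability = drop_probability

    async def handle(self, msg, fr, to):
        # Returning False asks the proxy to drop the message; True forwards it.
        return random.random() >= self.drop_probability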
SkidanovAlex force-pushed the stress_node_restart_and_2916 branch from 3b63b2e to b17ccc3 on July 30, 2020, 13:58
@codecov

codecov bot commented Jul 30, 2020

Codecov Report

Merging #3036 into master will not change coverage.
The diff coverage is n/a.


@@           Coverage Diff           @@
##           master    #3036   +/-   ##
=======================================
  Coverage   87.64%   87.64%           
=======================================
  Files         212      212           
  Lines       41861    41861           
=======================================
  Hits        36691    36691           
  Misses       5170     5170           


nearprotocol-bulldozer bot merged commit 3530e7d into master on Jul 31, 2020
nearprotocol-bulldozer bot deleted the stress_node_restart_and_2916 branch on July 31, 2020, 04:50
bowenwang1996 pushed a commit that referenced this pull request Aug 14, 2020
…#3036)

bowenwang1996 added a commit that referenced this pull request Sep 21, 2020
Since #3036, nodes will persist all the partial encoded chunk parts for the shard they track. However, on mainnet there are quite a few blocks produced before the PR landed and therefore we need to do a migration for archival nodes that contain all the historical data.

Test plan
----------
Tested the migration on a mainnet archival node.
chefsale pushed a commit that referenced this pull request Sep 21, 2020
Since #3036, nodes will persist all the partial encoded chunk parts for the shard they track. However, on mainnet there are quite a few blocks produced before the PR landed and therefore we need to do a migration for archival nodes that contain all the historical data.

Test plan
----------
Tested the migration on a mainnet archival node.