(fix): Making stress.py with node_restart mode pass, and fixing #2916 #3036

Merged

nearprotocol-bulldozer bot merged 9 commits into master from stress_node_restart_and_2916 on Jul 31, 2020

Conversation

SkidanovAlex
Collaborator

After this change stress.py node_restart passes relatively consistently, and is reintroduced to nightly.

Nearcore fixes:

  • We had a bug in the syncing logic (with a low chance of being
    triggered in the wild): if a block is produced and between 1/3 and 2/3
    of block producers received it while the rest have not, the system
    stalls, because no 2/3 of block producers share the same head, but also
    nobody is two blocks behind the highest peer, so nobody starts syncing.
    Fixing it by forcing sync if we've been 1 block behind for too long
    (see the sketch after this list). stress.py was reproducing this issue
    in every run.

  • (After a restart, a node doesn't respond to chunk requests that have a non-empty list of requested receipts #2916) we had an issue where, if a node produced a chunk and then
    crashed, on recovery it was not able to serve it because it didn't have
    all the parts and receipts stored in the storage from which we recover
    cache entries in the shards manager. Fixing it by always storing all the
    parts and receipts (redundantly) for chunks in the shards we care about.
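
For intuition, below is a minimal Python sketch of the forced-sync heuristic described in the first bullet. It is illustrative only: the actual logic lives in nearcore's Rust client, and the class name, threshold value, and structure here are assumptions.

import time

# Assumed threshold, purely for illustration; the real value lives in the client.
SYNC_IF_ONE_BLOCK_BEHIND_FOR_SEC = 10

class SyncDecider:
    """Sketch of the "force sync if we've been 1 block behind for too long" rule."""

    def __init__(self):
        # Wall-clock time at which we first observed a 1-block gap, or None.
        self.one_block_behind_since = None

    def should_sync(self, our_height, highest_peer_height, now=None):
        now = time.time() if now is None else now
        gap = highest_peer_height - our_height
        if gap >= 2:
            # Pre-existing rule: two or more blocks behind the best peer -> sync.
            self.one_block_behind_since = None
            return True
        if gap == 1:
            # New rule: a 1-block gap that persists long enough also triggers sync,
            # which unblocks the "between 1/3 and 2/3 of producers got the block" stall.
            if self.one_block_behind_since is None:
                self.one_block_behind_since = now
            return now - self.one_block_behind_since > SYNC_IF_ONE_BLOCK_BEHIND_FOR_SEC
        # Caught up (or ahead): reset the timer.
        self.one_block_behind_since = None
        return False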

Test fixes

[v] Fixing a scenario in which a failure to send a transaction to all
validators resulted in recording an incorrect tx hash alongside the tx.
Later, checking balances using the incorrect hash returned an incorrect
success value, and thus incorrect corrections were applied to the
expected balances (see the sketch after this list);

[v] Changing the order of magnitude of staking transactions, so that the
validator set actually changes.
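
As a rough illustration of the first test fix, here is a hedged sketch of recording the tx hash only once the send actually succeeds; the helper names (send_tx, tx_log) and the response shape are hypothetical, and the real fix is in stress.py.

def send_tx_and_record(nodes, signed_tx, tx_log):
    # Try each validator in turn; only record a hash once a node has actually
    # accepted the transaction, so a failed broadcast can never associate a
    # stale or incorrect hash with the tx (which later corrupts balance checks).
    for node in nodes:
        try:
            result = node.send_tx(signed_tx)  # hypothetical RPC helper
        except Exception:
            continue  # this validator was unreachable; try the next one
        tx_hash = result['result']  # assumed response shape, illustration only
        tx_log.append((signed_tx, tx_hash))
        return tx_hash
    return None  # nobody accepted the tx; record nothing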

Other issues discovered while fixing stress.py:
- #2906

SkidanovAlex requested a review from mfornet on July 24, 2020, 23:01
@SkidanovAlex
Collaborator Author

Inviting @mfornet to review the changes to the network tests infra.

await conn0.send(peer_message)

received_response = False
# make several attempts, in case the node sends us a block or some other message before sending the response
Collaborator

is this robust? should we wait until a PartialEncodedChunkResponse is sent?

SkidanovAlex
Collaborator Author
Jul 26, 2020

The goal of this loop is to see whether it is ever sent. I can have a wall-clock timeout instead of a number of messages, but generally with the current nearcore it is very robust (the PECR message is either the first message, or the second if a block was produced in the meantime).

mfornet
Member
left a comment

In general, the changes to the Python infrastructure look good; however, I wrote some suggestions about how to improve robustness for the repro_2916.py test.

Comment on lines 34 to 50
async def recv(self, expected=None):
    while True:
        response_raw = await self.recv_raw()
        if response_raw is None:
            return None

        response = BinarySerializer(schema).deserialize(
            response_raw, PeerMessage)

        if expected is None or response.enum == expected:
            return response
Member


Change expected (an enum name) to expected_fn, a function that determines whether this is the message we are waiting for.

Suggested change

async def recv(self, expected=None):
    while True:
        response_raw = await self.recv_raw()
        if response_raw is None:
            return None
        response = BinarySerializer(schema).deserialize(
            response_raw, PeerMessage)
        if expected is None or response.enum == expected:
            return response

async def recv(self, expected_fn=None):
    while True:
        response_raw = await self.recv_raw()
        if response_raw is None:
            return None
        response = BinarySerializer(schema).deserialize(
            response_raw, PeerMessage)
        if expected_fn is None or expected_fn(response):
            return response

Collaborator Author


I don't want to break existing users, so I made it both compare and try to call.
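
A minimal sketch of what that could look like, reusing the names from the snippet above (the actual implementation is in pytest/lib/peer.py, so the details may differ):

async def recv(self, expected=None):
    while True:
        response_raw = await self.recv_raw()
        if response_raw is None:
            return None

        response = BinarySerializer(schema).deserialize(response_raw, PeerMessage)

        if expected is None:
            return response
        if callable(expected):
            # New-style callers pass a predicate over the deserialized message.
            if expected(response):
                return response
        elif response.enum == expected:
            # Old-style callers keep passing the enum name as a string.
            return response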

Comment on lines 77 to 89
for i in range(5):
    response = await conn0.recv()
    if response is None:  # None means timeout
        break
    elif response.enum == 'Routed' and response.Routed.body.enum == 'PartialEncodedChunkResponse':
        print("Received response for shard %s" % shard_ord)
        received_response = True
        break
    else:
        print(response.enum)
else:
    print("Didn't receive response for shard %s" % shard_ord)
Member


As mentioned by @bowenwang1996, this might not be very reliable. Instead, we can pass a filter to recv so that it waits until the expected message arrives. As per the suggestion above, we can use expected_fn as in the following snippet:

Suggested change

for i in range(5):
    response = await conn0.recv()
    if response is None:  # None means timeout
        break
    elif response.enum == 'Routed' and response.Routed.body.enum == 'PartialEncodedChunkResponse':
        print("Received response for shard %s" % shard_ord)
        received_response = True
        break
    else:
        print(response.enum)
else:
    print("Didn't receive response for shard %s" % shard_ord)

try:
    expected_fn = lambda response: response.enum == 'Routed' and response.Routed.body.enum == 'PartialEncodedChunkResponse'
    _response = await asyncio.wait_for(conn0.recv(expected_fn), 5)
    received_response = True
    break
except concurrent.futures.TimeoutError:
    print("Didn't receive response for shard %s" % shard_ord)

This way we have a timeout in only one place and wait exactly for the message we are expecting; this works even if we receive several irrelevant messages in between, as long as the expected message arrives before the timeout.

SkidanovAlex and others added 7 commits July 30, 2020 06:42
After this change stress.py passes consistently, and is reintroduced to nightly.

Nearcore fixes:

[v] We had a bug in the syncing logic (with a low chance of being
triggered in the wild): if a block is produced, and between 1/3 and 2/3
of block producers received it, and the rest have not, the system
stalls, because no 2/3 of block producers have the same head, but also
nobody is two blocks behind the highest peer to start syncing. Fixing it
by forcing sync if we've been 1 block behind for too long.
stress.py was reproducing this issue in every run

Test fixes

[v] Fixing a scenario in which a failure to send a transaction to all
validators resulted in recording an incorrect tx hash alongside the tx.
Later, checking balances using the incorrect hash returned an incorrect
success value, and thus incorrect corrections were applied to the
expected balances;

[ ] Removing the old infrastructure for network interference, which relied on
a particular node setup, and instead using the new network proxy infrastructure.

[ ] Adding a new argument that controls the percentage of dropped messages between nodes (see the sketch after this commit message).

[v] Changing the order of magnitude of staking transactions, so that the
validator set actually changes.

[ ] Altering the `node_restart` process to occasionally wipe out the data folder of
the node, so that we stress state sync (and syncing in general) more.

Other issues discovered while fixing stress.py:
- #2906
Co-authored-by: Bowen Wang <bowenwang1996@users.noreply.github.com>
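
Regarding the new drop-percentage argument mentioned in the commit message above, a rough sketch of what a message-dropping proxy handler could look like (the handler interface and names here are assumptions; the real code is in the pytest network proxy infrastructure):

import random

class RandomDropHandler:
    """Drops a configurable fraction of messages forwarded between nodes."""

    def __init__(self, drop_probability):
        assert 0.0 <= drop_probability <= 1.0
        self.drop_probability = drop_probability

    async def handle(self, msg, fr, to):
        # Returning False asks the proxy to drop the message; True forwards it.
        return random.random() >= self.drop_probability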
SkidanovAlex force-pushed the stress_node_restart_and_2916 branch from 3b63b2e to b17ccc3 on July 30, 2020, 13:58
@codecov

codecov bot commented Jul 30, 2020

Codecov Report

Merging #3036 into master will not change coverage.
The diff coverage is n/a.


@@           Coverage Diff           @@
##           master    #3036   +/-   ##
=======================================
  Coverage   87.64%   87.64%           
=======================================
  Files         212      212           
  Lines       41861    41861           
=======================================
  Hits        36691    36691           
  Misses       5170     5170           


nearprotocol-bulldozer bot merged commit 3530e7d into master on Jul 31, 2020
nearprotocol-bulldozer bot deleted the stress_node_restart_and_2916 branch on July 31, 2020, 04:50
bowenwang1996 pushed a commit that referenced this pull request Aug 14, 2020
…#3036)

bowenwang1996 added a commit that referenced this pull request Sep 21, 2020
Since #3036, nodes will persist all the partial encoded chunk parts for the shard they track. However, on mainnet there are quite a few blocks produced before the PR landed and therefore we need to do a migration for archival nodes that contain all the historical data.

Test plan
----------
Tested the migration on a mainnet archival node.
chefsale pushed a commit that referenced this pull request Sep 21, 2020
Since #3036, nodes will persist all the partial encoded chunk parts for the shard they track. However, on mainnet there are quite a few blocks produced before the PR landed and therefore we need to do a migration for archival nodes that contain all the historical data.

Test plan
----------
Tested the migration on a mainnet archival node.