fast_forward sometimes randomly hangs forever #266

Open
andrew-sol opened this issue Jan 20, 2023 · 13 comments
Labels
bug Something isn't working

Comments

@andrew-sol

There is a chance the next fast_forward call will hang forever. It's not reproducible every time; it appears to be random.
I call fast_forward(100) in a loop and it happens quite often. Also, I was able to make it hang every single time after 15 iterations by just adding a few "view" calls to a contract far before the fast_forward loop starts executing, which is weird.

Reproduces on both Mac (M1) and Linux (Intel).
Workspaces version: 0.7.0

@ChaoticTempest
Member

That's quite weird indeed. There is a slowdown that can happen when it tries to cross an epoch boundary, but it shouldn't hang.

I played with it, but I wasn't able to reproduce this when I changed the following fast_forward.rs example to loop 100/1000 times with fast_forward(100): https://github.com/near/workspaces-rs/blob/967d93e1dddaacb8cd92f6a3fe19b9d652f96c73/examples/src/fast_forward.rs#L24-L30

That's a pretty simple view call, though, so your view calls might be more complicated. Is there a repo or example you can provide so I can take a look?

Was this only hanging with a single test/node running? Or was this in combination with a bunch of other tests being run as well?

@andrew-sol
Author

andrew-sol commented Feb 1, 2023

Please, check out this repo https://github.com/andrew-sol/workspaces-reproduction

It hangs pretty often, though it may take several minutes until the test reaches the fast_forward loop.
You should see the following output:

...
Passed ✅ test_deposit_stake_unstake
Start: test_withdraw
Fast-forwarding 5 epochs...
Fast-forwarded 500 blocks.
Fast-forwarded 500 blocks.
Fast-forwarded 500 blocks.
<=== HERE IT MAY HANG ===>

If it does not hang at this point (after skipping 3 epochs), it will continue executing, and you need to restart the test to try again.
But when it hangs, it always happens after skipping 3 epochs (the number of blocks in each epoch may vary).

It hung 4 times out of 10 when I tested it today.

@frol added the bug label on Jun 4, 2023
@Catmanpooh

I have been able to reproduce the problem using the code from this repo https://github.com/andrew-sol/workspaces-reproduction. It seems that the program hangs while near-jsonrpc-client sends the request.

https://github.com/near/near-jsonrpc-client-rs/blob/517ba064ceb41e0b9b49f406f18da63194e90f2d/src/lib.rs#L227-L232

A possible solution I would like to ask about is setting NEAR_RPC_TIMEOUT_SECS, which should at least produce an error instead of hanging.

https://github.com/near/workspaces-rs/blob/main/README.md?plain=1#L340
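
To illustrate the idea of turning a silent hang into an error, here is a minimal standard-library sketch. It is not how workspaces-rs or near-jsonrpc-client implement timeouts; `call_with_timeout` and `TimedOut` are hypothetical names, and the closure stands in for the real RPC request.

```rust
use std::sync::mpsc;
use std::thread;
use std::time::Duration;

/// Error returned when the wrapped call does not deliver a result in time.
#[derive(Debug, PartialEq)]
pub struct TimedOut;

/// Runs `call` on a helper thread and waits at most `limit` for its result.
/// On timeout the helper thread is not cancelled; it is simply abandoned.
/// A panic inside `call` also surfaces as `TimedOut` in this sketch.
pub fn call_with_timeout<T, F>(limit: Duration, call: F) -> Result<T, TimedOut>
where
    T: Send + 'static,
    F: FnOnce() -> T + Send + 'static,
{
    let (tx, rx) = mpsc::channel();
    thread::spawn(move || {
        // The receiver may already be gone if we timed out; ignore that error.
        let _ = tx.send(call());
    });
    rx.recv_timeout(limit).map_err(|_| TimedOut)
}

fn main() {
    // A call that answers quickly succeeds.
    let fast = call_with_timeout(Duration::from_millis(500), || 42);
    println!("fast call: {:?}", fast);

    // A call that does not answer in time becomes an error instead of a hang.
    let slow = call_with_timeout(Duration::from_millis(50), || {
        thread::sleep(Duration::from_secs(2));
        42
    });
    println!("slow call: {:?}", slow);
}
```

A timeout like this only makes the failure visible; it does not explain why the server stopped responding.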

@frol
Collaborator

frol commented Jun 13, 2023

@Catmanpooh Have you tried to set that timeout? The client hanging on waiting for the response should definitely have a timeout, but also we need to investigate why the server does not respond and whether we should just re-try sending a request.

Please, identify which call the client hangs on, and what is the state of the server (sandbox neard process)
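
As a rough sketch of what "just retry sending a request" could look like, the helper below retries a fallible operation a bounded number of times. It uses only the standard library; `retry` and the fake request in `main` are hypothetical and not part of workspaces-rs.

```rust
use std::thread;
use std::time::Duration;

/// Calls `attempt` up to `max_attempts` times, sleeping `pause` between failures.
/// Returns the first success, or the last error if every attempt failed.
/// The closure receives the 1-based attempt number.
pub fn retry<T, E, F>(max_attempts: u32, pause: Duration, mut attempt: F) -> Result<T, E>
where
    F: FnMut(u32) -> Result<T, E>,
{
    assert!(max_attempts > 0, "max_attempts must be at least 1");
    let mut n = 1;
    loop {
        match attempt(n) {
            Ok(value) => return Ok(value),
            // Out of attempts: hand the last error back to the caller.
            Err(err) if n >= max_attempts => return Err(err),
            Err(_) => {
                thread::sleep(pause);
                n += 1;
            }
        }
    }
}

fn main() {
    // Hypothetical request that fails twice before succeeding.
    let result: Result<&str, String> = retry(5, Duration::from_millis(10), |n| {
        if n < 3 {
            Err(format!("attempt {n} timed out"))
        } else {
            Ok("response")
        }
    });
    println!("{:?}", result);
}
```

Retrying only helps if each attempt can fail on its own, so it would need to be combined with a per-request timeout.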

@Catmanpooh

@frol how do I check the sandbox neard process, and how would I add the NEAR_RPC_TIMEOUT_SECS variable? I am assuming that these lines are the retry logic for the call: https://github.com/near/workspaces-rs/blob/ea65434c1e8b5424acda4643539b158c1a149aab/workspaces/src/rpc/client.rs#LL402C1-L412C2
I also tried the timeout locally in the file using request.timeout(std::time::Duration::from_secs(5)).send().
https://github.com/near/near-jsonrpc-client-rs/blob/517ba064ceb41e0b9b49f406f18da63194e90f2d/src/lib.rs#L227-L232

@frol
Collaborator

frol commented Jun 15, 2023

@Catmanpooh workspaces-rs spawns a neard process when you use the sandbox() function, so you should be able to find out which rpc_url is used and check whether the node is alive by fetching its /status page (curl http://127.0.0.1:3030/status; the port might/should be different).

If it is not alive, check how the sandbox is spawned to see whether several processes collide on the same homedir, the same RPC port, or anything else (try getting access to the stdout/stderr messages of the sandbox).

On the other hand, if it is alive and responds to the status page just fine, let me know which RPC call workspaces-rs gets stuck on when you reproduce the issue.
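
The same liveness check can be sketched in Rust with only the standard library. This is an illustration, not workspaces-rs code: `fetch_status` and `latest_block_height` are hypothetical helpers, the port 3030 is taken from the curl example and will likely differ, and the parser assumes the response contains a `"latest_block_height"` field.

```rust
use std::io::{Read, Write};
use std::net::{SocketAddr, TcpStream};
use std::time::Duration;

/// Sends a plain HTTP GET for `/status` and returns the raw response text.
/// Every step has a timeout so that a dead sandbox yields an error, not a hang.
pub fn fetch_status(addr: SocketAddr, limit: Duration) -> std::io::Result<String> {
    let mut stream = TcpStream::connect_timeout(&addr, limit)?;
    stream.set_read_timeout(Some(limit))?;
    stream.set_write_timeout(Some(limit))?;
    let request = format!("GET /status HTTP/1.0\r\nHost: {addr}\r\nConnection: close\r\n\r\n");
    stream.write_all(request.as_bytes())?;
    let mut response = String::new();
    stream.read_to_string(&mut response)?;
    Ok(response)
}

/// Pulls the number that follows `"latest_block_height":` out of the response.
/// Assumption: the status JSON contains that key; returns None otherwise.
pub fn latest_block_height(response: &str) -> Option<u64> {
    let key = "\"latest_block_height\":";
    let rest = &response[response.find(key)? + key.len()..];
    let digits: String = rest
        .trim_start()
        .chars()
        .take_while(|c| c.is_ascii_digit())
        .collect();
    digits.parse().ok()
}

fn main() {
    // 3030 is the port from the curl example; the sandbox usually picks another one.
    let addr: SocketAddr = "127.0.0.1:3030".parse().expect("valid socket address");
    match fetch_status(addr, Duration::from_secs(5)) {
        Ok(response) => println!("alive, height = {:?}", latest_block_height(&response)),
        Err(err) => println!("sandbox did not answer: {err}"),
    }
}
```

Running it twice a few seconds apart shows whether the node is alive and whether its block height still advances.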

If you are still stuck, please, schedule a call at https://calendly.com/frol-at-near/30min

@thaodt

thaodt commented Jun 26, 2023

Hi @frol, as discussed on Reddit, please assign this to me.

@frol
Collaborator

frol commented Jun 29, 2023

@andrew-sol Thanks for the reproducible example! It hangs indeed, and inspecting the /status page of the near-sandbox (neard / nearcore) while the workspaces-rs test hangs I see that the block height gets stuck (see the block height is 1998) 🤔

[screenshot of the /status page]

I am suspicious that it is an issue on the nearcore side.
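
For anyone who wants to detect this state automatically rather than by watching the /status page, a small stall detector could look like the sketch below. `StallDetector` is a hypothetical helper, and the heights in `main` are made-up sample values ending at the 1998 seen above.

```rust
/// Tracks successive block heights and reports when they stop advancing.
pub struct StallDetector {
    last_height: Option<u64>,
    unchanged_polls: u32,
    threshold: u32,
}

impl StallDetector {
    /// `threshold` is how many consecutive unchanged polls count as a stall.
    pub fn new(threshold: u32) -> Self {
        Self { last_height: None, unchanged_polls: 0, threshold }
    }

    /// Records one observation; returns true once the height looks stuck.
    pub fn observe(&mut self, height: u64) -> bool {
        if self.last_height == Some(height) {
            self.unchanged_polls += 1;
        } else {
            // Any change in height resets the counter.
            self.last_height = Some(height);
            self.unchanged_polls = 0;
        }
        self.unchanged_polls >= self.threshold
    }
}

fn main() {
    let mut detector = StallDetector::new(3);
    // Hypothetical sequence of polled heights ending at the value from the report.
    for height in [1996, 1997, 1998, 1998, 1998, 1998] {
        if detector.observe(height) {
            println!("block height stuck at {height}");
        }
    }
}
```

The threshold and polling interval are judgment calls, since block production can legitimately slow down around an epoch boundary.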

@thaodt

thaodt commented Jun 29, 2023

@frol The first time I ran the repro example from @andrew-sol, it did not hang; see my screenshot:
[screenshot]

But after multiple attempts, it did hang. In my case, it also hangs at block height 1998.
[screenshot]

It's quite weird.

@frol
Collaborator

frol commented Jun 29, 2023

I feel this might be related to near/nearcore#8328 (mentioned in #253).

The reproduction script hits the issue consistently on my M1 MacBook, but on an Ubuntu x86 server (and on @thaodt's Arch Linux machine) it hangs only occasionally.

@ghost

ghost commented Oct 14, 2023

The occasional hang in the example happens when DoomslugBlockProductionReadiness is stuck at NotReady. The underlying issue occurs somewhere lower in the call stack; this is the line where it manifests: https://github.com/near/nearcore/blob/7e28c38dacb9edbf6d888d8796f8d2e5e2c05d71/chain/chain/src/doomslug.rs#L647

Edit: added issue to nearcore

@nujabes403

Hi, when will this issue be resolved?

@nujabes403

nujabes403 commented Jun 20, 2024

Someone told me that passing two weeks' worth of blocks as the argument to worker.fast_forward() makes it stop proceeding.
For example, passing (86400 * 14) to worker.fast_forward halts the program.
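
To put that number in perspective, the sketch below works out the arithmetic and how many fast_forward(100)-sized steps it corresponds to. It assumes 86400 is meant as one block per second for a day, which is my reading and not stated above; `plan_steps` is a hypothetical helper, and splitting the jump is not known to avoid the hang.

```rust
/// Seconds in a day; the figure assumes roughly one block per second.
const BLOCKS_PER_DAY: u64 = 86_400;

/// Splits `total` blocks into steps of at most `step` blocks each.
pub fn plan_steps(total: u64, step: u64) -> Vec<u64> {
    assert!(step > 0, "step must be positive");
    // Full-size steps first, then one smaller step for any remainder.
    let mut steps = vec![step; (total / step) as usize];
    if total % step != 0 {
        steps.push(total % step);
    }
    steps
}

fn main() {
    let two_weeks = BLOCKS_PER_DAY * 14;
    let steps = plan_steps(two_weeks, 100);
    println!("{two_weeks} blocks = {} calls of at most 100 blocks", steps.len());
}
```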

@frol @thaodt
