flaky TestSharesAvailable_Full #787
We need to understand how to extract the data from a random test run that already failed so that we can reproduce the failure, as it seems like it may be something regarding rsmt2d correctness. |
A simple while loop that dumps the shares when the test fails (e.g. by temporarily introducing a global var for that) could get the job done (in a hacky way). I'm currently running |
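A minimal sketch of that loop-and-dump idea, under stated assumptions: `runUntilFailure`, `randomShares`, `dumpShares`, and `errSimulated` are hypothetical names, and the `check` callback only stands in for the real retrieval test, which is not reproduced here.

```go
package main

import (
	"crypto/rand"
	"encoding/hex"
	"errors"
	"fmt"
)

// errSimulated stands in for the real, rarely occurring test failure (hypothetical).
var errSimulated = errors.New("simulated flaky failure")

// randomShares returns n random shares of the given byte size.
func randomShares(n, size int) ([][]byte, error) {
	shares := make([][]byte, n)
	for i := range shares {
		shares[i] = make([]byte, size)
		if _, err := rand.Read(shares[i]); err != nil {
			return nil, err
		}
	}
	return shares, nil
}

// runUntilFailure calls check with fresh random shares until check fails or
// maxRuns is reached. On failure it returns the exact shares that failed.
func runUntilFailure(maxRuns, n, size int, check func([][]byte) error) ([][]byte, error) {
	for i := 0; i < maxRuns; i++ {
		shares, err := randomShares(n, size)
		if err != nil {
			return nil, fmt.Errorf("generating shares: %w", err)
		}
		if err := check(shares); err != nil {
			return shares, err
		}
	}
	return nil, nil
}

// dumpShares hex-encodes the shares so they can be pasted into a test vector.
func dumpShares(shares [][]byte) []string {
	out := make([]string, len(shares))
	for i, s := range shares {
		out[i] = hex.EncodeToString(s)
	}
	return out
}

func main() {
	// Placeholder check: fails on the third call to imitate a flaky test.
	calls := 0
	check := func(_ [][]byte) error {
		calls++
		if calls == 3 {
			return errSimulated
		}
		return nil
	}

	shares, err := runUntilFailure(1000, 4, 8, check)
	if err != nil {
		fmt.Println("failed with:", err)
		for i, s := range dumpShares(shares) {
			fmt.Printf("share %d: %s\n", i, s)
		}
	}
}
```

Returning the failing shares from the loop avoids the temporary global var, at the cost of having to pass the check in as a function.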
These shares were used in one failing test: https://gist.github.com/liamsi/34bb1269e311392184581c0191c1cfab |
I've used that as a test-vector above instead of |
Yeah, there's no way to pinpoint any bugs without shares to repro |
Yeah, the problem is that even with the shares above (with which the test failed), I cannot repro the failure. So the bug must either really be in the decoding of rsmt2d (unlikely) or in the Retrieve method itself (more likely, because goroutines are almost the only source of non-determinism here). |
So if I run my modified test with the same test vector, it also passes 99%+ of the time, but it rarely fails with:
So it is not really reproducible via the input. The only interesting thing about this is that it is also the same index, namely 16, as in Hlib's screenshot. Additionally, a stack trace whenever we encounter an unexpected byzantine error might help here. |
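One way to capture such a stack trace is to wrap the error at the point where it is created, using `runtime/debug.Stack` from the standard library. This is only a sketch: `errByzantine`, `tracedError`, `withStack`, and `repairAxis` are hypothetical names and not the identifiers used in celestia-node or rsmt2d.

```go
package main

import (
	"errors"
	"fmt"
	"runtime/debug"
)

// errByzantine is a hypothetical sentinel standing in for the real byzantine error.
var errByzantine = errors.New("byzantine error")

// tracedError carries the stack of the goroutine that produced the error.
type tracedError struct {
	err   error
	stack []byte
}

func (e *tracedError) Error() string { return fmt.Sprintf("%v\n%s", e.err, e.stack) }

// Unwrap keeps errors.Is and errors.As working on the wrapped error.
func (e *tracedError) Unwrap() error { return e.err }

// withStack attaches the current goroutine's stack trace to err.
func withStack(err error) error {
	if err == nil {
		return nil
	}
	return &tracedError{err: err, stack: debug.Stack()}
}

// repairAxis is a placeholder for the code path that detects the error.
func repairAxis(index int) error {
	return withStack(fmt.Errorf("axis %d: %w", index, errByzantine))
}

func main() {
	err := repairAxis(16)
	if errors.Is(err, errByzantine) {
		// Prints the message followed by the stack captured in repairAxis.
		fmt.Println(err)
	}
}
```

Because the trace is taken where the error originates, it shows which goroutine and call path hit the error, which is the part that the input shares alone cannot reveal.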
Another hint: if I try with 32 instead of 16, the error happens much more often. Again, the index is always (or at least every time I tried) the full original square size (or ODS.width), here 32, and it is the index of a column. So it looks like the first parity share in a column (in the last quadrant) causes the problem. |
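To make the index arithmetic concrete: in an extended square of width 2 × ODS.width, indices 0 to ODS.width−1 on an axis hold original data, and ODS.width is the first parity index. The sketch below assumes a quadrant numbering of 0 = top-left, 1 = top-right, 2 = bottom-left, 3 = bottom-right; `quadrant` and `isFirstParityIndex` are hypothetical helpers, not functions from celestia-node or rsmt2d.

```go
package main

import "fmt"

// quadrant returns which quadrant of the extended square the cell (row, col)
// falls into, assuming 0 = top-left (original data), 1 = top-right,
// 2 = bottom-left, 3 = bottom-right.
func quadrant(row, col, odsWidth int) int {
	q := 0
	if col >= odsWidth {
		q++
	}
	if row >= odsWidth {
		q += 2
	}
	return q
}

// isFirstParityIndex reports whether idx is the first index past the
// original data on an axis, i.e. idx == ODS.width.
func isFirstParityIndex(idx, odsWidth int) bool {
	return idx == odsWidth
}

func main() {
	for _, w := range []int{16, 32} {
		fmt.Printf("ODS width %d: index %d is first parity: %v, cell (%d,%d) is in quadrant %d\n",
			w, w, isFirstParityIndex(w, w), w, w, quadrant(w, w, w))
	}
}
```

Under these assumptions, the failing indices 16 and 32 are exactly the boundary between original and parity data for their respective square sizes.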
@Wondertan how do I print out the debug logs during tests? Specifically this one: celestia-node/ipld/retriever.go Line 151 in 1c45a66
|
My hypothesis is that the bug is somewhere in celestia-node/ipld/retriever.go Line 60 in 1c45a66
It could also simply have something to do with the last quadrant, but then I would expect we would see this more often. |
Also, when I always fetch from the 1st quadrant by commenting out this line: celestia-node/ipld/retriever_quadrant.go Lines 74 to 75 in e8751aa
it also works more reliably. If I use the last quadrant (by not shuffling and explicitly using it) with width 32, the test reliably fails. |
@liamsi, I also believe this is something in the retrieval logic and not in rsmt2d. I am now able to reproduce the issue in my reconstruction tests in #702. Moreover, they always fail with the same error, which was not the case before. Now I am trying to understand which PR caused the regression. My first suspect was #738, but it was merged after the issue appeared, so that is not it. Continuing the investigation and now checking #730. |
Some alpha: celestiaorg/rsmt2d#83 fixes the bug completely. |
The current workaround is to reimport the square on each reconstruction attempt. |
Mainly we need the share to be set like this, so that it's set into the axis slice. |
I was able to deflake the test and reproduce the issue 100% of the time, and then to see how reimporting fixes it. |
What is the bug specifically? |