DotNext.Net.Cluster crash in production since I think version 5.4.0 #242
Comments
Have you set |
Yes, it works when I set it back to the old mode. I always erase the previous data storage when I test. |
Do you mean that it crashes on an empty WAL with the new format? |
It sometimes happens with an empty WAL and sometimes after a while with an existing WAL, @sakno. |
Do you have a stable repro? I see that the second stack trace is from the tests in your repository. |
It is a kind of random behavior. The first logs come from our production. |
It could happen if you are trying to open a WAL produced by version |
Yes, I am sure. My store for testing was completely erased. |
SlimFaas is compiled in AOT. |
The second stack trace indicates that WAL is trying to read existing files:
Here is the code: dotNext/src/cluster/DotNext.Net.Cluster/Net/Cluster/Consensus/Raft/PersistentState.Partition.cs Lines 507 to 552 in cacf3e5
To get an exception like the one in your stack trace, the program needs to go to the second or third |
I forgot the latest logs, @sakno. I may have made a mistake in our dev Kubernetes environment. Here are the logs my colleagues sent me from the crash in production. It occurs with the new protocol at random intervals, roughly every 48 hours, and does not happen with the old one. I think it handles around 400,000 write operations per day. I do not know where the negative number could come from. |
How is the WAL configured? How many records per partition, parallel I/O, etc.? What's the target architecture, x86_64? |
Target architecture is x86_64. Thank you @sakno for your help |
It's hard to say what the root cause of the problem is because there is no stable repro. I can only guess. Possibly it happens because of network timeouts leading to cancellation of the token used by the WAL internally to perform I/O. Some I/O operations were done in a way that is not safe for cancellation; I've prepared a potential fix. I can't release it right now. |
Did you have a chance to check the fix? |
Hi @sakno, do you have a way to publish an alpha? |
My level in C# is not the best 😜 |
You can reference a project explicitly from your csproj file without a published alpha. |
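A minimal sketch of such a reference, assuming the dotNext repository has been cloned next to your own repository. The relative path and the `DotNext.Net.Cluster.csproj` file name are assumptions based on the source path quoted earlier in this thread, so adjust them to your checkout. You will likely also need to remove any existing `PackageReference` to DotNext.Net.Cluster so the two do not conflict.

```xml
<!-- Hypothetical fragment for your own .csproj file. -->
<ItemGroup>
  <!-- Assumed location of a local dotNext clone; the path is illustrative only. -->
  <ProjectReference Include="..\dotNext\src\cluster\DotNext.Net.Cluster\DotNext.Net.Cluster.csproj" />
</ItemGroup>
```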
Release 5.7.0 has been published. |
Thank you @sakno, I will test it today and tell you whether it fixes the problem |
@guillaume-chervet , please use 5.7.3 release |
I will test it tomorrow morning, @sakno. |
Thank you so much for your help and your work, @sakno |
Hi @sakno,
We have had crashes in production that lock nodes.
The SlimFaas code did not change anything related to this part; we only updated libraries: AxaFrance/SlimFaas@b26e3bd
I'm not sure, but I think it is linked to these changes:
DotNext.Net.Cluster 5.4.0
Changed binary file format for WAL for more efficient I/O. A new format is incompatible with all previous versions. To enable legacy format, set PersistentState.Options.UseLegacyBinaryFormat property to true
Introduced a new experimental binary format for WAL based on sparse files. Can be enabled with PersistentState.Options.MaxLogEntrySize property
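As a rough illustration of the first note, opting back into the legacy format might look like the sketch below. Only the `UseLegacyBinaryFormat` property name comes from the release notes. The namespace is inferred from the source path, and it is an assumption that `PersistentState.Options` can be constructed directly like this. How the options object is passed on depends on your own state machine class, which is not shown here.

```csharp
using DotNext.Net.Cluster.Consensus.Raft;

// Sketch only: assumes PersistentState.Options has a public parameterless constructor.
var options = new PersistentState.Options
{
    // Keep reading and writing the WAL in the binary format used before 5.4.0.
    UseLegacyBinaryFormat = true,
};

// Pass `options` to the constructor of your PersistentState-derived class (not shown).
```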
We switched to the new default format, and we now get this new error, which happens sometimes and crashes the node:
or like this: