BEAM freeze due to persistent_term usage #8917
Comments
backtrace of one coredump
Three threads in
This looks like it could be the same as #8613 fixed by #8627 released in OTP-26.2.5.2 (e176896).
It should be possible to confirm. Inlined function Print
Hmm, the three threads are at different positions: disassemble
Seeing as there's three different fragments with a backoff loop in there, and there's a thread stuck in all three versions, I think it's safe to say that this is the same bug as #8613. Let's see if an update fixes it :)
Got it, we are rolling out 27.1 and will see if it is fixed, thanks!
Describe the bug
We sometimes see the BEAM become almost non-responsive (it cannot make new dist connections, it writes no logs, and many gen_servers do not reply, although already-connected sockets can still make progress).
We captured a crashdump and a coredump.
In the crashdump, we see only 3 processes running, two of them in prim_socket:nif_recv, and hundreds of processes scheduled, including many tls_sender processes waiting to send data. It is a 72-core machine, so most schedulers are idle (even though most have a non-zero run-queue length).
We also observe hundreds of suspended processes. Almost all of them are in persistent_term:put/get, and most of those calls come from the OTP logger (get in logger_config:get_module_level(ssl), or put in logger_olp:set_mode).
I checked the persistent_term table: there are tens of {logger_config, MODULE} keys, but not that many.
One suspicious thing is that we also see a suspended code_server stuck at code_server:try_finish_module_2/6 + 244, where the next message is {get_object_code_for_loading,sasl_report}.
In the coredump I did a scan and saw 757 processes in the suspended state. I picked a random suspended process, traced back to find what suspended it, and found that it is again a registered logger process. The etp command is too old to print the current stacktrace, but I believe it is in persistent_term:put.
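For anyone who wants to repeat that check on a node that still responds, a minimal sketch in the Erlang shell could look like the following. It assumes the logger keys have the {logger_config, Module} shape described above; the variable names are only illustrative.

```erlang
%% Overall size of the persistent_term table.
#{count := Count, memory := Memory} = persistent_term:info().

%% persistent_term:get/0 returns every {Key, Value} pair and is meant for
%% debugging only. Keep the modules whose key looks like {logger_config, M}.
LoggerModules = [M || {{logger_config, M}, _Value} <- persistent_term:get()].

io:format("terms=~p memory=~p bytes logger_config_keys=~p~n",
          [Count, Memory, length(LoggerModules)]).
```

On a frozen node this will most likely hang as well, so it is mainly useful for comparing against a healthy instance.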
I have a few guesses:
It would be great if I could get some help. If there is any other information that can be extracted from the coredump, I can try. Thanks!
To Reproduce
It is really hard to reproduce because it happens rarely. We will wait and see if we can get more information from new frozen instances.
Affected versions
The node was on OTP 26.2.1 (ca8b893)