
stuck in futex #11749

Closed · justinpryzby opened this issue Mar 15, 2021 · 8 comments
Labels
Status: Triage Needed (new issue which needs to be triaged) · Type: Defect (incorrect behavior, e.g. crash, hang)

Comments

justinpryzby commented Mar 15, 2021

I haven't heard back about how to debug the previous issue (11641), so here goes anyway.

Postgres autovacuum worker process stuck in futex. I have seen this two or three times before on another server.
The process does not die with SIGINT, and I imagine if I SIGKILL it, it won't die, and I'll need to reboot the server.

[pryzbyj@ts-db-new ~]$ ps -O lstart,wchan=wwwwwwwwwwwwwwwwwwww 12583
PID STARTED wwwwwwwwwwwwwwwwwwww S TTY TIME COMMAND
12583 Wed Mar 3 03:41:23 2021 futex_wait_queue_me S ? 00:00:59 postgres: autovacuum worker ....
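(For reference, the kernel-side call chain behind that wait channel can usually be read straight from procfs; a sketch, assuming root access and the PID above:)

# Dump the stuck task's kernel stack; futex_wait_queue_me should show up near the top
sudo cat /proc/12583/stack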

Distribution Name | CentOS
Distribution Version | 7.8
Linux Kernel | kernel-3.10.0-1127.18.2.el7 and 3.10.0-1160.15.2.el7
Architecture | x86_64
ZFS Version | 2.0.1-1 and 2.0.3-1
SPL Version | 2.0.1-1

justinpryzby added the Status: Triage Needed and Type: Defect labels on Mar 15, 2021
@rincebrain (Contributor)

2.0.1 had #11463 but that was fixed (according to all the reporters...) in 2.0.2 and newer, so if you're sure you're still reproducing on 2.0.3, that means it's not that...probably.

You could follow what was done in #11463 (comment) and see what it says (assuming it says anything).

An alternate test you could run to exclude an issue like #11463 would be to try running a newer kernel (e.g. the kernel-ml package from EPEL would work), since that issue was specific to older versions of the kernel.
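A rough sketch of that kernel test (note: kernel-ml is usually shipped via ELRepo's elrepo-kernel repository rather than EPEL, so adjust to whichever repo actually provides it on your mirrors; this assumes the ELRepo release package is already installed):

uname -r                                                # confirm the currently running kernel
sudo yum --enablerepo=elrepo-kernel install kernel-ml   # pull in a newer mainline kernel
sudo reboot                                             # boot into it and try to reproduce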

justinpryzby commented Mar 16, 2021 via email

@rincebrain (Contributor)

> It's occurring right now for a customer running zfs 2.0.3 under linux RHEL 3.10.0-1160.15.2.el7.x86_64. Note, #11463 talks about 100% sys CPU, and that's not the case here. In the cases I've seen, postgres autovacuum/autoanalyze is stuck in futex, and fails to stop, but I am able to kill -9 it. The stuck process is like this:
>
> @.*** ~]$ sudo strace -p 17087
> strace: Process 17087 attached
> futex(0x7fdbf9e42760, FUTEX_WAIT_PRIVATE, 2, NULL

Okay, so that's one idea eliminated, now for...everything else.

You might find strace -f more informative, since one thread waiting on a futex() call basically means "I'm waiting for another thread to finish", and I don't believe strace will show you the other threads of a process without -f. So it's probably another thread, not the "main" thread of 17087, that's stuck somewhere interesting. (Which one? Dunno. Possibly whichever one is also waiting on some call. Possibly one that's stuck in an endless loop and not making any progress. Story unclear.)
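A rough sketch of that kind of check, using the PID from the quoted strace (adjust as needed):

# Show every thread of the process with its kernel wait channel
ps -L -o pid,lwp,wchan:25,stat,comm -p 17087
# Attach to all threads, not just the main one
sudo strace -f -p 17087
# Kernel-side stack of each thread (needs root)
for t in /proc/17087/task/*; do echo "== $t"; sudo cat "$t/stack"; done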

In particular, I'd really like to find a thread that's "stuck" on something in OpenZFS, otherwise it's something of a hard sell that OpenZFS is to blame for the problem. (Other evidence like "we're running this setup without OpenZFS on a number of systems and this never happens" would also be reasonably compelling.)
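One way to hunt for that (a sketch; needs root and the sysrq interface enabled) is to dump the kernel stacks of every blocked task and look for zfs/spl frames:

# Write the stacks of all uninterruptible (D-state) tasks to the kernel log
echo w | sudo tee /proc/sysrq-trigger
# Then scan the log for OpenZFS symbols (e.g. cv_wait, txg_wait_synced, zil_commit)
dmesg | grep -iE 'zfs|spl_|txg_|zil_'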

You suggested in your comment that at least one of the systems reproducing this right now is a production system - do you have any non-production systems that you can reproduce this on, that we might be able to try more exciting things with down the line?

Were these systems running fine for a long time and then this problem suddenly cropped up, or is this the initial setup period of the systems and you have no long-running history to judge whether this would have happened on earlier software versions?

Do other reads/writes to the relevant zpool still work normally while this is going on?
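A quick smoke test along those lines might look like this (pool and path names are placeholders):

zpool status -v        # any I/O errors or a hung scrub would show up here
# Time-bounded synchronous write to the affected dataset
timeout 30 dd if=/dev/zero of=/tank/pgdata/.iotest bs=1M count=16 oflag=sync && rm /tank/pgdata/.iotest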

justinpryzby commented Mar 16, 2021 via email

@rincebrain (Contributor)

> postgres doesn't use threads.

Okay, great, then it's waiting on another process. Time to go look at the different Postgres processes and see whether any of them are stuck in an exciting way.
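For example (a sketch; the PID is the one from the earlier strace, and gdb with postgres debuginfo installed gives the most readable result):

# Every postgres process with its start time and kernel wait channel
ps -C postgres -O lstart,wchan=wwwwwwwwwwwwwwwwwwww
# Userspace backtrace of the stuck backend, to see which lock the futex belongs to
sudo gdb -p 17087 -batch -ex bt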

>> In particular, I'd really like to find a thread that's "stuck" on something in OpenZFS, otherwise it's something of a hard sell that OpenZFS is to blame for the problem. (Other evidence like "we're running this setup without OpenZFS on a number of systems and this never happens" would also be reasonably compelling.)
>>
>> You suggested in your comment that at least one of the systems reproducing this right now is a production system - do you have any non-production systems that you can reproduce this on, that we might be able to try more exciting things with down the line?
>
> Yes, I'm seeing this several times on production systems since upgrading to openzfs-2 beginning in January. I don't have a reproduction recipe, otherwise I could use a test VM. Depending on how exciting, I can try some things on the production system, in particular if it's already stuck (like now).

Yeah, but one of my questions was going to be "does this happen with, say, MD+XFS", which is understandably not something you can try on a production system (though it sounds like it didn't happen pre-OpenZFS 2.0, so it's probably reasonable to assume for now it wouldn't).

If you need these systems working Now(tm), it might be reasonable to downgrade them to 0.8.6 and try to reproduce this on a testbed even if you don't have a perfect reproduction recipe premade. (I believe it's still the case that the main "zfs" CentOS/RHEL repo ships 0.8.X RPMs and you need "zfs-testing" to get 2.0.X, so this should be pretty straightforward as long as you didn't already run "zpool upgrade".)
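Roughly what checking that looks like (pool name is a placeholder; the repo IDs are the ones the zfs-release package normally installs, so verify with yum repolist):

# Features listed as active or enabled that 0.8.x does not know about will prevent it from importing the pool
zpool get all tank | grep 'feature@'
zpool upgrade          # with no arguments, lists pools that do not have every supported feature enabled
# Switch back to the stable repo before downgrading the packages
sudo yum-config-manager --disable zfs-testing --enable zfs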

justinpryzby commented Mar 16, 2021 via email

@rincebrain (Contributor)

> On Mon, Mar 15, 2021 at 10:37:18PM -0700, Rich Ercolani wrote:
>> postgres doesn't use threads.
>> Okay, great, then it's waiting on another process. Time to go look at the different Postgres processes and see whether any of them are stuck in an exciting way.
>
> I don't think so. The stuck autoanalyze is the earliest-started process that's still running, except for postgres' own processes - if they were stuck, I think the whole system would be unusable.

Did you look, or just conclude that it's not possible?

I'm not particularly convinced it's a ZFS issue either, since nothing has pointed to ZFS.

Could you share some of those bug report links, just for reference by anyone reading this thread in the future?

justinpryzby commented Apr 1, 2021 via email
