dnode_hold_impl() deadlock with Lustre #8994
I will work on getting a perf top dump as well.
@jasimmons1973 thanks for the heads up. I took a look at the Lustre issue and agree with Alex. I'm happy to take a look at the full set of stack traces when you have them.
Would this stack trace be holding a lock that would block a transaction group?
Here's the source for the 2 inner frames
As James mentioned in #8433, we are still seeing
f2-mds4_lustre_unhealthy_20190707.tar.gz More logs that are too large to post here are at http://www.infradead.org:~/jsimmons/LU12510.tar.gz
@jasimmons1973 @mattaezell Have you considered using eBPF to generate a chain graph from the scheduler? It would show who is yielding to whom and why. Might help find the ultimate blocker.
We are running a prehistoric RHEL7 kernel so eBPF is not of much use. Now if we could reproduce this problem on a system using a newer kernel in the test bed we could collect eBPF data.
@cbrumgard do you have an eBPF script ready to do that?
@degremont Take a look at http://www.brendangregg.com/blog/2016-02-05/ebpf-chaingraph-prototype.html and http://www.brendangregg.com/FlameGraphs/offcpuflamegraphs.html#Chain. I'm currently writing a similar eBPF script for profiling ZFS.
I have more logs at http://www.infradead.org:~/jsimmons/f2-mds4_20190709.tar.gz
So @jasimmons1973 and I are going to try to replicate the problem and test with eBPF.
Yes, it definitely would. From the stack traces it appears that for some reason it's unable to find a free dnode which can be allocated. If you could post the contents of
It may be that it's not blocked in
While we investigate, you may be able to prevent the issue by setting the zfs
perf showed:
I had the chance to run a live crash session on one server having a similar issue; here is what I found. The stack trace of the Lustre I/O threads:
and I managed to find the dnode_handle:
Turns out zr_refcount is 1.
I suspect what happened here is this: thread 1 takes the read zrl at 1297 and releases it at 1313, then tries to take the write zrl at 1314; meanwhile thread 2 wins the race, takes the write zrl, creates the dnode, drops the write zrl and returns. Thread 3 then comes in, takes the read zrl at 1297, and since dnh_dnode is now 0xffff88103a8289c0 it continues to 1333, where it tries to take dn_mtx while still holding the read zrl. If dn_mtx is contended, or never released (as the stack traces showing threads waiting in dmu_tx_wait suggest), thread 1 will spin in the while loop, because the read zrl held by thread 3 prevents it from ever getting the write zrl. It looks like a cascading effect of the dn_mtx contention, but it also makes me wonder what the point of the zrl in dnode_hold_impl() is; nothing other than the dnode code uses it.
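To make the interleaving above easier to follow, here is a minimal user-space model of the zrlock behaviour being described. This is not the ZFS source: the zrl_model_* names, the writer_retry_loop() helper and the ZRL_LOCKED sentinel are illustrative assumptions. The shape is what matters: a writer can only take the slot lock when the reader count is exactly zero, so it busy-waits for as long as thread 3 sits on a read hold while blocked on dn_mtx.

```c
/*
 * Hypothetical user-space model of the spin described above -- NOT the ZFS
 * source.  A zrlock-style lock keeps a reader count; a negative sentinel
 * means "exclusively (write) held".
 */
#include <stdatomic.h>
#include <stdbool.h>

#define ZRL_LOCKED	(-1)	/* sentinel: exclusively held */

typedef struct { atomic_int zr_refcount; } zrl_model_t;

static void
zrl_model_add(zrl_model_t *z)		/* take a read hold */
{
	int n;
	do {
		n = atomic_load(&z->zr_refcount);
		/* wait out an exclusive holder, then bump the reader count */
	} while (n == ZRL_LOCKED ||
	    !atomic_compare_exchange_weak(&z->zr_refcount, &n, n + 1));
}

static void
zrl_model_remove(zrl_model_t *z)	/* drop a read hold */
{
	atomic_fetch_sub(&z->zr_refcount, 1);
}

static bool
zrl_model_tryenter(zrl_model_t *z)	/* try to take the write hold */
{
	int zero = 0;
	/* only succeeds when no readers are present */
	return (atomic_compare_exchange_strong(&z->zr_refcount, &zero,
	    ZRL_LOCKED));
}

static void
zrl_model_exit(zrl_model_t *z)		/* drop the write hold */
{
	atomic_store(&z->zr_refcount, 0);
}

/*
 * The pattern the analysis points at: thread 1 drops its read hold, then
 * spins trying to get the write hold.  It can never succeed while thread 3
 * keeps a read hold (zr_refcount == 1) and is itself blocked on dn_mtx.
 */
static void
writer_retry_loop(zrl_model_t *slot)
{
	while (!zrl_model_tryenter(slot))
		;	/* busy-wait -- the "stuck in a while loop" symptom */
	/* ... allocate and attach the dnode here ... */
	zrl_model_exit(slot);
}
```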
@lidongyang that scenario definitely looks like it would be possible. But it depends on there being a process which is holding the
One of the original concerns here is that we expect the common case to be taking a hold on an allocated dnode with a valid
This is mostly historical as I understand it. I believe it should be possible to convert this to an rwlock, or an rrwlock, which I agree would be much simpler and avoid this potential spinning.
@behlendorf I mean dmu_tx_wait from the stack trace is probably also waiting on dn_mtx. I think a plain rwlock is better in this case; acquiring an rrwlock itself needs to take rrl->rr_lock in rrw_enter_write/read().
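For contrast, here is a hypothetical sketch of the alternative being discussed, again as a user-space model rather than a patch: if each slot were protected by an ordinary rwlock, a would-be writer would sleep until the readers drain instead of spinning, at the cost of the lock's internal bookkeeping (for an rrwlock that means taking rr_lock on every enter, as noted above). The dnode_slot_model_t type and the slot_* helpers are made-up names for illustration.

```c
/* Hypothetical model of a per-slot rwlock -- not a proposed ZFS patch. */
#include <pthread.h>

typedef struct {
	pthread_rwlock_t dnh_lock;	/* stand-in for the per-slot zrlock */
	void *dnh_dnode;		/* stand-in for the dnode pointer */
} dnode_slot_model_t;

static void
slot_hold_read(dnode_slot_model_t *s)
{
	pthread_rwlock_rdlock(&s->dnh_lock);	/* shared hold, may block */
}

static void
slot_hold_write(dnode_slot_model_t *s)
{
	/* sleeps until all readers drop out -- no busy-wait loop */
	pthread_rwlock_wrlock(&s->dnh_lock);
}

static void
slot_rele(dnode_slot_model_t *s)
{
	pthread_rwlock_unlock(&s->dnh_lock);
}
```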
We hit this issue again and I was able to gather the dnodestats from the unhealthy host and one from a healthy host to compare. Unhealthy host:
Healthy host:
We have since set dnodesize=legacy on the MDSs where we have been seeing this issue most often. For reference, it has looked promising so far as a mitigation, but I will report back if we continue to see the issue.
Based on the above, it seems that we are stuck in this while loop:
It looks like we hit it again on one of our MDS hosts where I had set dnodesize=legacy. The dnode stats looked like what I posted above. One thought I had: is dnodesize something that can be set on the fly, or would I need to re-import the pool for it to take effect? I did not do that last night when I set it.
Thanks for the dnode stats. @curtispb the change should have taken effect immediately; since it didn't help I'd suggest returning to the defaults. That jibes with @lidongyang's stack analysis above in #8994 (comment); my hope was that it might reduce the frequency.
Oh, I see.
@lidongyang thanks for the clue. I believe I see why this is the case; I've posted an analysis to the Lustre issue.
Thank you Brian!!! BTW internally we got the latest Lustre running with Neil's 5.2.0-rc2+ tree with ZFS 0.8.1. We plan to keep this up so we can profile issues like this.
External consumers such as Lustre require access to the dnode interfaces in order to correctly manipulate dnodes. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue openzfs#8994
I'm installing zfs 0.8.1 patches with #9027, so I'm ready to test once Alex pushes a patch.
@jasimmons1973 sounds good! We'll go ahead and get #9027 merged once we're sure it has everything Alex needs.
External consumers such as Lustre require access to the dnode interfaces in order to correctly manipulate dnodes. Reviewed-by: James Simmons <uja.ornl@yahoo.com> Reviewed-by: Olaf Faaland <faaland1@llnl.gov> Reviewed-by: George Melikov <mail@gmelikov.ru> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #8994 Closes #9027
This issue has been automatically marked as "stale" because it has not had any activity for a while. It will be closed in 90 days if no further activity occurs. Thank you for your contributions.
We recently moved a Lustre file system to 2.12 LTS using a ZFS 0.8.1 back end. All servers are running the RHEL7.6 kernel 3.10.0-957.5.1.el7.x86_64. After the move we are seeing the same failure about once a day on our file system. We have filed a ticket at https://jira.whamcloud.com/browse/LU-12510 as well. The back trace is as follows:
2019-07-05T02:16:36.434714-04:00 f2-mds2.ncrc.gov kernel: Lustre: 42858:0:(client.c:2134:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1562307384/real 1562307384] req@ffff99acbaf11f80 x1638034607021648/t0(0) o6->f2-OST0035-osc-MDT0001@10.10.33.50@o2ib2:28/4 lens 544/432 e 0 to 1 dl 1562307396 ref 1 fl Rpc:X/0/ffffffff rc 0/-1
2019-07-05T02:16:36.434763-04:00 f2-mds2.ncrc.gov kernel: Lustre: 42858:0:(client.c:2134:ptlrpc_expire_one_request()) Skipped 7 previous similar messages
2019-07-05T02:16:36.434782-04:00 f2-mds2.ncrc.gov kernel: Lustre: f2-OST0035-osc-MDT0001: Connection to f2-OST0035 (at 10.10.33.50@o2ib2) was lost; in progress operations using this service will wait for recovery to complete
2019-07-05T02:17:26.605024-04:00 f2-mds2.ncrc.gov kernel: Lustre: f2-OST0035-osc-MDT0001: Connection restored to 10.10.33.50@o2ib2 (at 10.10.33.50@o2ib2)
2019-07-05T02:28:56.360456-04:00 f2-mds2.ncrc.gov kernel: INFO: task txg_quiesce:40218 blocked for more than 120 seconds.
2019-07-05T02:28:56.360500-04:00 f2-mds2.ncrc.gov kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
2019-07-05T02:28:56.371558-04:00 f2-mds2.ncrc.gov kernel: txg_quiesce D ffff99aec932a080 0 40218 2 0x00000000
2019-07-05T02:28:56.371580-04:00 f2-mds2.ncrc.gov kernel: Call Trace:
2019-07-05T02:28:56.371599-04:00 f2-mds2.ncrc.gov kernel: [] schedule+0x29/0x70
2019-07-05T02:28:56.384261-04:00 f2-mds2.ncrc.gov kernel: [] cv_wait_common+0x125/0x150 [spl]
2019-07-05T02:28:56.384280-04:00 f2-mds2.ncrc.gov kernel: [] ? wake_up_atomic_t+0x30/0x30
2019-07-05T02:28:56.397218-04:00 f2-mds2.ncrc.gov kernel: [] __cv_wait+0x15/0x20 [spl]
2019-07-05T02:28:56.397238-04:00 f2-mds2.ncrc.gov kernel: [] txg_quiesce_thread+0x2cb/0x3c0 [zfs]
2019-07-05T02:28:56.411080-04:00 f2-mds2.ncrc.gov kernel: [] ? txg_init+0x2b0/0x2b0 [zfs]
2019-07-05T02:28:56.411101-04:00 f2-mds2.ncrc.gov kernel: [] thread_generic_wrapper+0x73/0x80 [spl]
2019-07-05T02:28:56.425331-04:00 f2-mds2.ncrc.gov kernel: [] ? __thread_exit+0x20/0x20 [spl]
2019-07-05T02:28:56.425350-04:00 f2-mds2.ncrc.gov kernel: [] kthread+0xd1/0xe0
2019-07-05T02:28:56.430937-04:00 f2-mds2.ncrc.gov kernel: [] ? insert_kthread_work+0x40/0x40
2019-07-05T02:28:56.437770-04:00 f2-mds2.ncrc.gov kernel: [] ret_from_fork_nospec_begin+0x7/0x21
2019-07-05T02:28:56.444951-04:00 f2-mds2.ncrc.gov kernel: [] ? insert_kthread_work+0x40/0x40
2019-07-05T02:28:56.459337-04:00 f2-mds2.ncrc.gov kernel: INFO: task mdt04_000:42947 blocked for more than 120 seconds.
2019-07-05T02:28:56.459357-04:00 f2-mds2.ncrc.gov kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
2019-07-05T02:28:56.478994-04:00 f2-mds2.ncrc.gov kernel: mdt04_000 D ffff99aecc7ae180 0 42947 2 0x00000000
2019-07-05T02:28:56.479015-04:00 f2-mds2.ncrc.gov kernel: Call Trace:
2019-07-05T02:28:56.479037-04:00 f2-mds2.ncrc.gov kernel: [] schedule+0x29/0x70
2019-07-05T02:28:56.491665-04:00 f2-mds2.ncrc.gov kernel: [] cv_wait_common+0x125/0x150 [spl]
2019-07-05T02:28:56.491685-04:00 f2-mds2.ncrc.gov kernel: [] ? wake_up_atomic_t+0x30/0x30
2019-07-05T02:28:56.504560-04:00 f2-mds2.ncrc.gov kernel: [] __cv_wait+0x15/0x20 [spl]
2019-07-05T02:28:56.504580-04:00 f2-mds2.ncrc.gov kernel: [] dmu_tx_wait+0x20b/0x3b0 [zfs]
2019-07-05T02:28:56.517986-04:00 f2-mds2.ncrc.gov kernel: [] dmu_tx_assign+0x91/0x490 [zfs]
2019-07-05T02:28:56.518007-04:00 f2-mds2.ncrc.gov kernel: [] osd_trans_start+0x199/0x440 [osd_zfs]
2019-07-05T02:28:56.532397-04:00 f2-mds2.ncrc.gov kernel: [] mdt_empty_transno+0xf7/0x850 [mdt]
2019-07-05T02:28:56.532416-04:00 f2-mds2.ncrc.gov kernel: [] mdt_mfd_open+0x8de/0xe70 [mdt]
2019-07-05T02:28:56.546398-04:00 f2-mds2.ncrc.gov kernel: [] ? mdt_pack_acl2body+0x1c2/0x9f0 [mdt]
2019-07-05T02:28:56.546419-04:00 f2-mds2.ncrc.gov kernel: [] mdt_finish_open+0x64b/0x760 [mdt]
2019-07-05T02:28:56.553339-04:00 f2-mds2.ncrc.gov kernel: [] mdt_open_by_fid_lock+0x672/0x9b0 [mdt]
2019-07-05T02:28:56.567652-04:00 f2-mds2.ncrc.gov kernel: [] mdt_reint_open+0x760/0x27d0 [mdt]
2019-07-05T02:28:56.567671-04:00 f2-mds2.ncrc.gov kernel: [] ? upcall_cache_get_entry+0x218/0x8b0 [obdclass]
2019-07-05T02:28:56.582532-04:00 f2-mds2.ncrc.gov kernel: [] ? lu_ucred+0x1e/0x30 [obdclass]
2019-07-05T02:28:56.582553-04:00 f2-mds2.ncrc.gov kernel: [] ? mdt_ucred+0x15/0x20 [mdt]
2019-07-05T02:28:56.588903-04:00 f2-mds2.ncrc.gov kernel: [] ? mdt_root_squash+0x21/0x430 [mdt]