dnode_hold_impl() deadlock with Lustre #8994

Closed
jasimmons1973 opened this issue Jul 5, 2019 · 24 comments
Labels
Status: Stale (no recent activity), Type: Defect (incorrect behavior, e.g. crash, hang)

Comments

@jasimmons1973

We recently moved a Lustre file system to 2.12 LTS using a ZFS 0.8.1 back end. All servers run the RHEL 7.6 kernel 3.10.0-957.5.1.el7.x86_64. Since the move we have been seeing the same failure about once a day on our file system. We have also filed a ticket at https://jira.whamcloud.com/browse/LU-12510. The back trace is as follows:

2019-07-05T02:16:36.434714-04:00 f2-mds2.ncrc.gov kernel: Lustre: 42858:0:(client.c:2134:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1562307384/real 1562307384] req@ffff99acbaf11f80 x1638034607021648/t0(0) o6->f2-OST0035-osc-MDT0001@10.10.33.50@o2ib2:28/4 lens 544/432 e 0 to 1 dl 1562307396 ref 1 fl Rpc:X/0/ffffffff rc 0/-1
2019-07-05T02:16:36.434763-04:00 f2-mds2.ncrc.gov kernel: Lustre: 42858:0:(client.c:2134:ptlrpc_expire_one_request()) Skipped 7 previous similar messages
2019-07-05T02:16:36.434782-04:00 f2-mds2.ncrc.gov kernel: Lustre: f2-OST0035-osc-MDT0001: Connection to f2-OST0035 (at 10.10.33.50@o2ib2) was lost; in progress operations using this service will wait for recovery to complete
2019-07-05T02:17:26.605024-04:00 f2-mds2.ncrc.gov kernel: Lustre: f2-OST0035-osc-MDT0001: Connection restored to 10.10.33.50@o2ib2 (at 10.10.33.50@o2ib2)
2019-07-05T02:28:56.360456-04:00 f2-mds2.ncrc.gov kernel: INFO: task txg_quiesce:40218 blocked for more than 120 seconds.
2019-07-05T02:28:56.360500-04:00 f2-mds2.ncrc.gov kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
2019-07-05T02:28:56.371558-04:00 f2-mds2.ncrc.gov kernel: txg_quiesce D ffff99aec932a080 0 40218 2 0x00000000
2019-07-05T02:28:56.371580-04:00 f2-mds2.ncrc.gov kernel: Call Trace:
2019-07-05T02:28:56.371599-04:00 f2-mds2.ncrc.gov kernel: [] schedule+0x29/0x70
2019-07-05T02:28:56.384261-04:00 f2-mds2.ncrc.gov kernel: [] cv_wait_common+0x125/0x150 [spl]
2019-07-05T02:28:56.384280-04:00 f2-mds2.ncrc.gov kernel: [] ? wake_up_atomic_t+0x30/0x30
2019-07-05T02:28:56.397218-04:00 f2-mds2.ncrc.gov kernel: [] __cv_wait+0x15/0x20 [spl]
2019-07-05T02:28:56.397238-04:00 f2-mds2.ncrc.gov kernel: [] txg_quiesce_thread+0x2cb/0x3c0 [zfs]
2019-07-05T02:28:56.411080-04:00 f2-mds2.ncrc.gov kernel: [] ? txg_init+0x2b0/0x2b0 [zfs]
2019-07-05T02:28:56.411101-04:00 f2-mds2.ncrc.gov kernel: [] thread_generic_wrapper+0x73/0x80 [spl]
2019-07-05T02:28:56.425331-04:00 f2-mds2.ncrc.gov kernel: [] ? __thread_exit+0x20/0x20 [spl]
2019-07-05T02:28:56.425350-04:00 f2-mds2.ncrc.gov kernel: [] kthread+0xd1/0xe0
2019-07-05T02:28:56.430937-04:00 f2-mds2.ncrc.gov kernel: [] ? insert_kthread_work+0x40/0x40
2019-07-05T02:28:56.437770-04:00 f2-mds2.ncrc.gov kernel: [] ret_from_fork_nospec_begin+0x7/0x21
2019-07-05T02:28:56.444951-04:00 f2-mds2.ncrc.gov kernel: [] ? insert_kthread_work+0x40/0x40
2019-07-05T02:28:56.459337-04:00 f2-mds2.ncrc.gov kernel: INFO: task mdt04_000:42947 blocked for more than 120 seconds.
2019-07-05T02:28:56.459357-04:00 f2-mds2.ncrc.gov kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
2019-07-05T02:28:56.478994-04:00 f2-mds2.ncrc.gov kernel: mdt04_000 D ffff99aecc7ae180 0 42947 2 0x00000000
2019-07-05T02:28:56.479015-04:00 f2-mds2.ncrc.gov kernel: Call Trace:
2019-07-05T02:28:56.479037-04:00 f2-mds2.ncrc.gov kernel: [] schedule+0x29/0x70
2019-07-05T02:28:56.491665-04:00 f2-mds2.ncrc.gov kernel: [] cv_wait_common+0x125/0x150 [spl]
2019-07-05T02:28:56.491685-04:00 f2-mds2.ncrc.gov kernel: [] ? wake_up_atomic_t+0x30/0x30
2019-07-05T02:28:56.504560-04:00 f2-mds2.ncrc.gov kernel: [] __cv_wait+0x15/0x20 [spl]
2019-07-05T02:28:56.504580-04:00 f2-mds2.ncrc.gov kernel: [] dmu_tx_wait+0x20b/0x3b0 [zfs]
2019-07-05T02:28:56.517986-04:00 f2-mds2.ncrc.gov kernel: [] dmu_tx_assign+0x91/0x490 [zfs]
2019-07-05T02:28:56.518007-04:00 f2-mds2.ncrc.gov kernel: [] osd_trans_start+0x199/0x440 [osd_zfs]
2019-07-05T02:28:56.532397-04:00 f2-mds2.ncrc.gov kernel: [] mdt_empty_transno+0xf7/0x850 [mdt]
2019-07-05T02:28:56.532416-04:00 f2-mds2.ncrc.gov kernel: [] mdt_mfd_open+0x8de/0xe70 [mdt]
2019-07-05T02:28:56.546398-04:00 f2-mds2.ncrc.gov kernel: [] ? mdt_pack_acl2body+0x1c2/0x9f0 [mdt]
2019-07-05T02:28:56.546419-04:00 f2-mds2.ncrc.gov kernel: [] mdt_finish_open+0x64b/0x760 [mdt]
2019-07-05T02:28:56.553339-04:00 f2-mds2.ncrc.gov kernel: [] mdt_open_by_fid_lock+0x672/0x9b0 [mdt]
2019-07-05T02:28:56.567652-04:00 f2-mds2.ncrc.gov kernel: [] mdt_reint_open+0x760/0x27d0 [mdt]
2019-07-05T02:28:56.567671-04:00 f2-mds2.ncrc.gov kernel: [] ? upcall_cache_get_entry+0x218/0x8b0 [obdclass]
2019-07-05T02:28:56.582532-04:00 f2-mds2.ncrc.gov kernel: [] ? lu_ucred+0x1e/0x30 [obdclass]
2019-07-05T02:28:56.582553-04:00 f2-mds2.ncrc.gov kernel: [] ? mdt_ucred+0x15/0x20 [mdt]
2019-07-05T02:28:56.588903-04:00 f2-mds2.ncrc.gov kernel: [] ? mdt_root_squash+0x21/0x430 [mdt]

@jasimmons1973
Author

I will work on getting a perf top dump as well.

behlendorf added the Type: Defect (Incorrect behavior, e.g. crash, hang) label on Jul 5, 2019
@behlendorf
Contributor

@jasimmons1973 thanks for the heads up. I took a look at the Lustre issue and agree with Alex. I'm happy to take a look at the full set of stack traces when you have them.

@mattaezell

Would the thread in this stack trace be holding a lock that would block the transaction group?

2019-07-07T18:32:24.379155-04:00 f2-mds4.ncrc.gov kernel: mdt00_016       R  running task        0 44820      2 0x00000000
2019-07-07T18:32:24.379219-04:00 f2-mds4.ncrc.gov kernel: Call Trace:
2019-07-07T18:32:24.379280-04:00 f2-mds4.ncrc.gov kernel: [<ffffffffc0deea58>] ? dnode_slots_tryenter+0x48/0xb0 [zfs]
2019-07-07T18:32:24.386380-04:00 f2-mds4.ncrc.gov kernel: [<ffffffffc0df07a3>] ? dnode_hold_impl+0x873/0xcb0 [zfs]
2019-07-07T18:32:24.400291-04:00 f2-mds4.ncrc.gov kernel: [<ffffffffc0dd9ff6>] ? dmu_object_next+0x106/0x150 [zfs]
2019-07-07T18:32:24.400355-04:00 f2-mds4.ncrc.gov kernel: [<ffffffffc0dda256>] ? dmu_object_alloc_impl+0x216/0x400 [zfs]
2019-07-07T18:32:24.407736-04:00 f2-mds4.ncrc.gov kernel: [<ffffffff90165e92>] ? mutex_lock+0x12/0x2f
2019-07-07T18:32:24.413555-04:00 f2-mds4.ncrc.gov kernel: [<ffffffffc0dda4f9>] ? dmu_object_alloc_dnsize+0x39/0x40 [zfs]
2019-07-07T18:32:24.428558-04:00 f2-mds4.ncrc.gov kernel: [<ffffffffc1691662>] ? __osd_object_create+0x82/0x170 [osd_zfs]
2019-07-07T18:32:24.428622-04:00 f2-mds4.ncrc.gov kernel: [<ffffffffc16919cd>] ? osd_mkreg+0x7d/0x210 [osd_zfs]
2019-07-07T18:32:24.435234-04:00 f2-mds4.ncrc.gov kernel: [<ffffffffc168dcd6>] ? osd_create+0x346/0xb20 [osd_zfs]

Here's the source for the two inner frames:

(gdb) list *(dnode_slots_tryenter+0x48)
0x2efe8 is in dnode_slots_tryenter (/usr/src/debug/zfs-0.7.13/module/zfs/dnode.c:1049).
1044            ASSERT3S(idx + slots, <=, DNODES_PER_BLOCK);
1045
1046            for (int i = idx; i < idx + slots; i++) {
1047                    dnode_handle_t *dnh = &children->dnc_children[i];
1048
1049                    if (!zrl_tryenter(&dnh->dnh_zrlock)) {
1050                            for (int j = idx; j < i; j++) {
1051                                    dnh = &children->dnc_children[j];
1052                                    zrl_exit(&dnh->dnh_zrlock);
1053                            }
(gdb) list *(dnode_hold_impl+0x873)
0x30873 is in dnode_hold_impl (/usr/src/debug/zfs-0.7.13/module/zfs/dnode.c:1409).
1404                    if (dn->dn_type == DMU_OT_NONE) {
1405                            DNODE_STAT_BUMP(dnode_hold_alloc_type_none);
1406                            mutex_exit(&dn->dn_mtx);
1407                            dnode_slots_rele(dnc, idx, slots);
1408                            dbuf_rele(db, FTAG);
1409                            return (SET_ERROR(ENOENT));
1410                    }
1411
1412                    DNODE_STAT_BUMP(dnode_hold_alloc_hits);
1413            } else if (flag & DNODE_MUST_BE_FREE) {

As James mentioned in #8433, we are still seeing dnode_hold_impl() hang threads for extended periods of time. Most commonly we see this on OSSes, but it might be behind our MDS hangs as well.

@jasimmons1973
Author

f2-mds4_lustre_unhealthy_20190707.tar.gz

More logs that are too large to post here are at http://www.infradead.org:~/jsimmons/LU12510.tar.gz

@cbrumgard

@jasimmons1973 @mattaezell Have you considered using eBPF to generate a chain graph from the scheduler? It would show who is yielding to whom and why. Might help find the ultimate blocker.

@jasimmons1973
Author

We are running a prehistoric RHEL7 kernel, so eBPF is not of much use. If we could reproduce this problem on a test-bed system running a newer kernel, we could collect eBPF data.

@degremont

@cbrumgard do you have an eBPF script ready to do that?

@cbrumgard

@degremont Take a look at this http://www.brendangregg.com/blog/2016-02-05/ebpf-chaingraph-prototype.html and http://www.brendangregg.com/FlameGraphs/offcpuflamegraphs.html#Chain. I'm currently writing a similar ebpf script for profiling zfs.

@jasimmons1973
Author

@cbrumgard

So @jasimmons1973 and I are going to try to replicate the problem and test with eBPF.

@behlendorf
Contributor

Would the thread in this stack trace be holding a lock that would block the transaction group?

Yes, it definitely would. From the stack traces it appears that for some reason it's unable to find a free dnode which can be allocated. If you could post the contents of /proc/spl/kstat/zfs/dnodestats from a node which has hit this issue that should give us some insight. It'd also be useful to see these stats from a comparable node which hasn't hit the issue.

As James mentioned in #8433, we are still seeing dnode_hold_impl() hang threads for extended periods of time. Most commonly we see this on OSSes, but it might be behind our MDS hangs as well.

It may be that it's not blocked in dnode_hold_impl(), but is instead repeatedly calling dmu_object_alloc_impl()->dmu_object_next()->dnode_hold_impl() to find an available free dnode. We should be able to see this in the dnode stats to some degree, and it should also show up in a flame graph or the perf profiling data.

While we investigate, you may be able to prevent the issue by setting the zfs dnodesize=legacy property on the dataset. This could result in increased filesystem fragmentation, but it should reduce lock contention when allocating new dnodes. It can always be restored to the increased 1k value after this issue has been resolved.
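For reference, this is a one-line property change on the existing dataset, something along the lines of zfs set dnodesize=legacy <pool>/<dataset> (dataset name is a placeholder), with zfs get dnodesize <pool>/<dataset> to confirm the current value; the property only applies to dnodes allocated after it is set, existing dnodes keep their size.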

@mattaezell

It may be that it's not blocked in dnode_hold_impl(), but is instead repeatedly calling dmu_object_alloc_impl()->dmu_object_next()->dnode_hold_impl() to find an available free dnode. We should be able to see this in the dnode stats to some degree, and it should also show up in a flame graph or the perf profiling data.

perf showed:

  Children      Self  Command          Shared Object                 Symbol                              
+   96.46%     0.00%  mdt05_006        [kernel.kallsyms]             [k] ret_from_fork_nospec_begin      
+   96.46%     0.00%  mdt05_006        [kernel.kallsyms]             [k] kthread                         
+   96.46%     0.00%  mdt05_006        [ptlrpc]                      [k] ptlrpc_main                     
+   96.46%     0.00%  mdt05_006        [ptlrpc]                      [k] ptlrpc_server_handle_request    
+   96.46%     0.00%  mdt05_006        [ptlrpc]                      [k] tgt_request_handle              
+   96.46%     0.00%  mdt05_006        [ptlrpc]                      [k] tgt_enqueue                     
+   96.46%     0.00%  mdt05_006        [ptlrpc]                      [k] ldlm_handle_enqueue0            
+   96.46%     0.00%  mdt05_006        [ptlrpc]                      [k] ldlm_lock_enqueue               
+   96.46%     0.00%  mdt05_006        [mdt]                         [k] mdt_intent_policy               
+   96.46%     0.00%  mdt05_006        [mdt]                         [k] mdt_intent_open                 
+   96.46%     0.00%  mdt05_006        [mdt]                         [k] mdt_reint_internal              
+   96.46%     0.00%  mdt05_006        [mdt]                         [k] mdt_reint_rec                   
+   96.46%     0.00%  mdt05_006        [mdt]                         [k] mdt_reint_open                  
+   96.46%     0.00%  mdt05_006        [mdd]                         [k] mdd_create                      
+   96.46%     0.00%  mdt05_006        [mdd]                         [k] mdd_create_object               
+   96.46%     0.00%  mdt05_006        [mdd]                         [k] mdd_create_object_internal      
+   96.46%     0.00%  mdt05_006        [lod]                         [k] lod_create                      
+   96.46%     0.00%  mdt05_006        [lod]                         [k] lod_sub_create                  
+   96.46%     0.00%  mdt05_006        [osd_zfs]                     [k] osd_create                      
+   96.46%     0.00%  mdt05_006        [osd_zfs]                     [k] osd_mkreg                       
+   96.46%     0.00%  mdt05_006        [osd_zfs]                     [k] __osd_object_create             
+   96.46%     0.00%  mdt05_006        [zfs]                         [k] dmu_object_alloc_dnsize         
+   96.46%     0.00%  mdt05_006        [zfs]                         [k] dmu_object_alloc_impl           
+   86.86%    51.32%  mdt05_006        [zfs]                         [k] dnode_hold_impl                 
+   31.88%    29.87%  mdt05_006        [zfs]                         [k] dnode_slots_tryenter            
+    9.02%     9.02%  mdt05_006        [zfs]                         [k] zrl_tryenter                    
+    6.11%     6.11%  mdt05_006        [kernel.kallsyms]             [k] _cond_resched                   

@lidongyang
Contributor

I had the chance to run a live crash session on a server hitting a similar issue; here is what I found.

The stack trace of the Lustre I/O threads:

[<ffffffff810c4bd6>] __cond_resched+0x26/0x30
[<ffffffffc19cf22e>] dnode_hold_impl+0x64e/0xbf0 [zfs]
[<ffffffffc19cf7eb>] dnode_hold+0x1b/0x20 [zfs]
[<ffffffffc19b8642>] dmu_bonus_hold+0x32/0x1d0 [zfs]
[<ffffffffc0f909b4>] __osd_obj2dnode+0x34/0xc0 [osd_zfs]
[<ffffffffc0f90b5e>] osd_object_init+0x11e/0x850 [osd_zfs]
[<ffffffffc0ccc945>] lu_object_alloc+0xe5/0x320 [obdclass]
[<ffffffffc0cccd60>] lu_object_find_at+0x180/0x2b0 [obdclass]
[<ffffffffc0cccea6>] lu_object_find+0x16/0x20 [obdclass]
[<ffffffffc0b43645>] ofd_object_find+0x35/0x100 [ofd]
[<ffffffffc0b4c5d0>] ofd_preprw+0x490/0x1170 [ofd]
[<ffffffffc16ff3d5>] tgt_brw_read+0x975/0x1860 [ptlrpc]
[<ffffffffc16fd2ba>] tgt_request_handle+0x92a/0x1370 [ptlrpc]
[<ffffffffc16a5e2b>] ptlrpc_server_handle_request+0x23b/0xaa0 [ptlrpc]
[<ffffffffc16a9572>] ptlrpc_main+0xa92/0x1e40 [ptlrpc]
[<ffffffff810b4031>] kthread+0xd1/0xe0
[<ffffffff816c0577>] ret_from_fork+0x77/0xb0
[<ffffffffffffffff>] 0xffffffffffffffff
(gdb) l *(dnode_hold_impl+0x64e)
0x3025e is in dnode_hold_impl (/usr/src/debug/zfs-0.7.9/module/zfs/dnode.c:1314).
1309				dnode_slots_rele(dnc, idx, slots);
1310				dbuf_rele(db, FTAG);
1311				return (SET_ERROR(ENOENT));
1312			} else {
1313				dnode_slots_rele(dnc, idx, slots);
1314				while (!dnode_slots_tryenter(dnc, idx, slots)) {
1315					DNODE_STAT_BUMP(dnode_hold_alloc_lock_retry);
1316					cond_resched();
1317				}
1318	

I also managed to find the dnode_handle_t:

crash> struct dnode_handle_t ffff882a5a220588
struct dnode_handle_t {
  dnh_zrlock = {
    zr_mtx = {
      m_mutex = {
        count = {
          counter = 1
        }, 
        wait_lock = {
          {
            rlock = {
              raw_lock = {
                val = {
                  counter = 0
                }
              }
            }
          }
        }, 
        wait_list = {
          next = 0xffff882a5a220590, 
          prev = 0xffff882a5a220590
        }, 
        owner = 0x0, 
        {
          osq = {
            tail = {
              counter = 0
            }
          }, 
          __UNIQUE_ID_rh_kabi_hide3 = {
            spin_mlock = 0x0
          }, 
          {<No data fields>}
        }
      }, 
      m_lock = {
        {
          rlock = {
            raw_lock = {
              val = {
                counter = 0
              }
            }
          }
        }
      }, 
      m_owner = 0x0
    }, 
    zr_refcount = 1, 
    zr_cv = {
      cv_magic = 879052276, 
      cv_event = {
        lock = {
          {
            rlock = {
              raw_lock = {
                val = {
                  counter = 0
                }
              }
            }
          }
        }, 
        task_list = {
          next = 0xffff882a5a2205d8, 
          prev = 0xffff882a5a2205d8
        }
      }, 
      cv_destroy = {
        lock = {
          {
            rlock = {
              raw_lock = {
                val = {
                  counter = 0
                }
              }
            }
          }
        }, 
        task_list = {
          next = 0xffff882a5a2205f0, 
          prev = 0xffff882a5a2205f0
        }
      }, 
      cv_refs = {
        counter = 1
      }, 
      cv_waiters = {
        counter = 0
      }, 
      cv_mutex = 0x0
    }, 
    zr_pad = 0
  }, 
  dnh_dnode = 0xffff88103a8289c0
}

It turns out zr_refcount is 1:

1294		if (flag & DNODE_MUST_BE_ALLOCATED) {
1295			slots = 1;
1296	
1297			dnode_slots_hold(dnc, idx, slots);
1298			dnh = &dnc->dnc_children[idx];
1299	
1300			if (DN_SLOT_IS_PTR(dnh->dnh_dnode)) {
1301				dn = dnh->dnh_dnode;
1302			} else if (dnh->dnh_dnode == DN_SLOT_INTERIOR) {
1303				DNODE_STAT_BUMP(dnode_hold_alloc_interior); 
1304				dnode_slots_rele(dnc, idx, slots);
1305				dbuf_rele(db, FTAG);
1306				return (SET_ERROR(EEXIST));
1307			} else if (dnh->dnh_dnode != DN_SLOT_ALLOCATED) {
1308				DNODE_STAT_BUMP(dnode_hold_alloc_misses);
1309				dnode_slots_rele(dnc, idx, slots);
1310				dbuf_rele(db, FTAG);
1311				return (SET_ERROR(ENOENT));
1312			} else {
1313				dnode_slots_rele(dnc, idx, slots); 
1314				while (!dnode_slots_tryenter(dnc, idx, slots)) {
1315					DNODE_STAT_BUMP(dnode_hold_alloc_lock_retry);
1316					cond_resched();
1317				}
1318	
1319				/*
1320				 * Someone else won the race and called dnode_create()
1321				 * after we checked DN_SLOT_IS_PTR() above but before
1322				 * we acquired the lock.
1323				 */
1324				if (DN_SLOT_IS_PTR(dnh->dnh_dnode)) {
1325					DNODE_STAT_BUMP(dnode_hold_alloc_lock_misses);
1326					dn = dnh->dnh_dnode;
1327				} else {
1328					dn = dnode_create(os, dn_block + idx, db,
1329					    object, dnh);
1330				}
1331			}
1332	
1333			mutex_enter(&dn->dn_mtx);
1334			if (dn->dn_type == DMU_OT_NONE) {
1335				DNODE_STAT_BUMP(dnode_hold_alloc_type_none);
1336				mutex_exit(&dn->dn_mtx);
1337				dnode_slots_rele(dnc, idx, slots);
1338				dbuf_rele(db, FTAG);
1339				return (SET_ERROR(ENOENT));
1340			}
1341	
1342			DNODE_STAT_BUMP(dnode_hold_alloc_hits);

I suspect what happened here is: thread 1 takes the read zrl at line 1297 and releases it at 1313, then tries to take the write zrl at 1314. Meanwhile thread 2 wins the race, takes the write zrl, creates the dnode, drops the write zrl and returns. Thread 3 then comes in, takes the read zrl at 1297, and since dnh_dnode is now 0xffff88103a8289c0 it continues to 1333 and tries to take dn_mtx while still holding the read zrl.

Now, if dn_mtx is contended or not released (as in the stack traces showing threads waiting in dmu_tx_wait()), thread 1 will spin in the while loop, because the read zrl held by thread 3 prevents it from getting the write zrl.

It looks like a cascading effect of the dn_mtx contention, but in dnode_hold_impl() we might need to change the logic to make it similar to the DNODE_MUST_BE_FREE case: take the read zrl, do the sanity checks for DN_SLOT_INTERIOR and DN_SLOT_ALLOCATED, release the read zrl, then take the write zrl before re-checking dnh_dnode, creating the dnode if necessary, and taking dn_mtx. That would make other threads block in dnode_slots_hold() rather than spin in the loop.

I also started to wonder about the point of zrl; there are no users other than the dnode code, and the fact that the write zrl is non-blocking, so we have to loop, is not great. Shouldn't we just use an rwlock here?
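
To make the spin-versus-block distinction concrete, here is a minimal userspace analogue (illustrative only, not ZFS code): a pthreads rwlock stands in for the slot zrl, and a sleeping reader models thread 3 above, which holds the read lock while stuck on dn_mtx. The first phase mirrors the current tryenter/cond_resched() retry loop, the second shows what a blocking write lock would do instead. Build with cc -pthread.

/* zrl_spin_demo.c -- illustrative analogue, not ZFS code. */
#include <pthread.h>
#include <sched.h>
#include <stdio.h>
#include <unistd.h>

static pthread_rwlock_t slot_lock = PTHREAD_RWLOCK_INITIALIZER;

/* Models thread 3: holds a read reference on the slot while it is blocked
 * on something else (the sleep stands in for waiting on dn_mtx). */
static void *reader(void *arg)
{
	(void)arg;
	pthread_rwlock_rdlock(&slot_lock);
	sleep(1);
	pthread_rwlock_unlock(&slot_lock);
	return NULL;
}

int main(void)
{
	pthread_t t1, t2;
	unsigned long spins = 0;

	/* 1. Current pattern: spin on a try-lock until the reader goes away.
	 *    This is what the dnode_hold_*_lock_retry loop does. */
	pthread_create(&t1, NULL, reader, NULL);
	usleep(50 * 1000);
	while (pthread_rwlock_trywrlock(&slot_lock) != 0) {
		spins++;
		sched_yield();		/* cond_resched() */
	}
	pthread_rwlock_unlock(&slot_lock);
	pthread_join(t1, NULL);
	printf("try-lock loop spun %lu times\n", spins);

	/* 2. Blocking alternative: a plain write lock sleeps until the reader
	 *    drains, with no spinning and no retry counter to bump. */
	pthread_create(&t2, NULL, reader, NULL);
	usleep(50 * 1000);
	pthread_rwlock_wrlock(&slot_lock);
	pthread_rwlock_unlock(&slot_lock);
	pthread_join(t2, NULL);
	printf("blocking write lock acquired without spinning\n");
	return 0;
}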

@behlendorf
Contributor

@lidongyang that scenario definitely looks like it would be possible. But it depends on there being a process which is holding the dn_mtx while being blocked on something else. Looking through the stack traces I don't see any suspect stacks. You mentioned the dmu_tx_wait() stacks, but they shouldn't be holding the dn_mtx while waiting. Did I miss it?

we might need to change the logic to make it similar to the DNODE_MUST_BE_FREE case,

One of the original concerns here is that we expect the common case to be taking a hold on an allocated dnode with a valid DN_SLOT_IS_PTR. For performance reasons we'd like to do this without needing to take a write lock. Though, since we're already serializing on the dn_mtx this might not really be an issue in practice.

I also started to wonder the point of zrl,

This is mostly historical as I understand it. I believe it should be possible to convert this to an rwlock, or an rrwlock, which I agree would be much simpler and would avoid this potential spinning.

@lidongyang
Contributor

@behlendorf I mean that dmu_tx_wait() in the stack trace is probably also waiting on dn_mtx.

I think an rwlock is better in this case; acquiring an rrwlock itself needs to take rrl->rr_lock in rrw_enter_write()/rrw_enter_read().

@curtispb

We hit this issue again, and I was able to gather the dnodestats from the unhealthy host and from a healthy host for comparison.

Unhealthy host:

9 1 0x01 29 7888 221852720787 55905079801298
name                            type data
dnode_hold_dbuf_hold            4    0
dnode_hold_dbuf_read            4    0
dnode_hold_alloc_hits           4    67013868
dnode_hold_alloc_misses         4    26386
dnode_hold_alloc_interior       4    57977
dnode_hold_alloc_lock_retry     4    1038
dnode_hold_alloc_lock_misses    4    18
dnode_hold_alloc_type_none      4    8355
dnode_hold_free_hits            4    1113232
dnode_hold_free_misses          4    65031
dnode_hold_free_lock_misses     4    1
dnode_hold_free_lock_retry      4    944026670940
dnode_hold_free_overflow        4    0
dnode_hold_free_refcount        4    6103
dnode_hold_free_txg             4    0
dnode_free_interior_lock_retry  4    0
dnode_allocate                  4    1113232
dnode_reallocate                4    0
dnode_buf_evict                 4    356532
dnode_alloc_next_chunk          4    22842
dnode_alloc_race                4    0
dnode_alloc_next_block          4    14244
dnode_move_invalid              4    0
dnode_move_recheck1             4    0
dnode_move_recheck2             4    0
dnode_move_special              4    0
dnode_move_handle               4    0
dnode_move_rwlock               4    0
dnode_move_active               4    0

Healthy host:

9 1 0x01 29 7888 191118803056 90770643461520
name                            type data
dnode_hold_dbuf_hold            4    0
dnode_hold_dbuf_read            4    0
dnode_hold_alloc_hits           4    127263318
dnode_hold_alloc_misses         4    59711
dnode_hold_alloc_interior       4    149146
dnode_hold_alloc_lock_retry     4    131
dnode_hold_alloc_lock_misses    4    2
dnode_hold_alloc_type_none      4    68410
dnode_hold_free_hits            4    2772614
dnode_hold_free_misses          4    181243
dnode_hold_free_lock_misses     4    5
dnode_hold_free_lock_retry      4    154
dnode_hold_free_overflow        4    0
dnode_hold_free_refcount        4    15747
dnode_hold_free_txg             4    0
dnode_free_interior_lock_retry  4    0
dnode_allocate                  4    2772614
dnode_reallocate                4    0
dnode_buf_evict                 4    825513
dnode_alloc_next_chunk          4    47381
dnode_alloc_race                4    0
dnode_alloc_next_block          4    8933
dnode_move_invalid              4    0
dnode_move_recheck1             4    0
dnode_move_recheck2             4    0
dnode_move_special              4    0
dnode_move_handle               4    0
dnode_move_rwlock               4    0
dnode_move_active               4    0

We have since set dnodesize=legacy on the MDSs where we have been seeing this issue most often. So far it looks promising as a mitigation, but I will report back if we continue to see the issue.

@mattaezell

Based on the above, it seems that we are stuck in this while loop:

   1477                 while (!dnode_slots_tryenter(dnc, idx, slots)) {
   1478                         DNODE_STAT_BUMP(dnode_hold_free_lock_retry);
   1479                         cond_resched();
   1480                 }

@curtispb

curtispb commented Jul 11, 2019

It looks like we hit it again on one of our MDS hosts where I had set dnodesize=legacy. The dnode stats looked like what I posted above.

One thought I had: is dnodesize something that can be set on the fly, or would I need to re-import the pool for it to take effect? I did not do that last night when I set it.

@behlendorf
Contributor

Thanks for the dnode stats. @curtispb the change should have taken effect immediately; since it didn't help, I'd suggest returning to the defaults. That jibes with @lidongyang's stack analysis above in #8994 (comment); my hope was that it might reduce the frequency.

I mean dmu_tx_wait from the stack trace is probably also waiting on dn_mtx.

Oh, I see.

@behlendorf
Contributor

turns out zr_refcount is 1

@lidongyang thanks for the clue. I believe I see why this is the case; I've posted an analysis to the Lustre issue.

behlendorf changed the title from "kernel: INFO: task txg_quiesce:40218 blocked for more than 120 seconds." to "dnode_hold_impl() deadlock with Lustre" on Jul 11, 2019
@jasimmons1973
Author

Thank you, Brian! BTW, internally we have the latest Lustre running on Neil's 5.2.0-rc2+ tree with ZFS 0.8.1. We plan to keep this setup so we can profile issues like this.

behlendorf added a commit to behlendorf/zfs that referenced this issue Jul 12, 2019
External consumers such as Lustre require access to the dnode
interfaces in order to correctly manipulate dnodes.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue openzfs#8994
behlendorf mentioned this issue on Jul 12, 2019
@jasimmons1973
Author

I'm installing ZFS 0.8.1 patched with #9027, so I'm ready to test once Alex pushes a patch.

@behlendorf
Contributor

@jasimmons1973 sounds good! We'll go ahead and get #9027 merged once we're sure it has everything Alex needs.

behlendorf added a commit that referenced this issue Jul 15, 2019
External consumers such as Lustre require access to the dnode
interfaces in order to correctly manipulate dnodes.

Reviewed-by: James Simmons <uja.ornl@yahoo.com>
Reviewed-by: Olaf Faaland <faaland1@llnl.gov>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #8994
Closes #9027
tonyhutter pushed a commit that referenced this issue Sep 26, 2019
External consumers such as Lustre require access to the dnode
interfaces in order to correctly manipulate dnodes.

Reviewed-by: James Simmons <uja.ornl@yahoo.com>
Reviewed-by: Olaf Faaland <faaland1@llnl.gov>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #8994
Closes #9027
@stale

stale bot commented Aug 24, 2020

This issue has been automatically marked as "stale" because it has not had any activity for a while. It will be closed in 90 days if no further activity occurs. Thank you for your contributions.

The stale bot added the Status: Stale label on Aug 24, 2020