
ztest: segfault when hitting race in metaslab_enable #8602

Closed
shartse opened this issue Apr 9, 2019 · 7 comments · Fixed by #9751
Labels
Component: Test Suite (Indicates an issue with the test framework or a test case)
Type: Defect (Incorrect behavior, e.g. crash, hang)

Comments

shartse (Contributor) commented Apr 9, 2019

System information

Type                  Version/Name
Distribution Name     Ubuntu
Distribution Version  18.04
Linux Kernel          4.15.0-1035-aws
Architecture          x86_64
ZFS Version           0.8.0-rc3
SPL Version

Describe the problem you're observing

After running ztest over the weekend, I hit a segfault:

Program terminated with signal SIGSEGV, Segmentation fault.
#0  metaslab_enable (msp=msp@entry=0x55d7cab608d0, sync=sync@entry=B_TRUE) at ../../module/zfs/metaslab.c:4735
4735		spa_t *spa = mg->mg_vd->vdev_spa;
[Current thread is 1 (Thread 0x7f0bd638f700 (LWP 30904))]
(gdb) bt
#0  metaslab_enable (msp=msp@entry=0x55d7cab608d0, sync=sync@entry=B_TRUE) at ../../module/zfs/metaslab.c:4735
#1  0x00007f0bfe2f24b0 in vdev_initialize_thread (arg=0x55d7cacc0050) at ../../module/zfs/vdev_initialize.c:506
#2  0x00007f0bfdc116db in start_thread (arg=0x7f0bd638f700) at pthread_create.c:463
#3  0x00007f0bfd93a88f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

I was able to verify that the segfault occurred because the metaslab's ms_group was NULL:

(gdb) print ((metaslab_t *)0x55d7cab608d0)->ms_group
$1 = (metaslab_group_t *) 0x0
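
For context, the faulting line at metaslab.c:4735 dereferences the metaslab's group pointer. Below is a minimal, standalone sketch of that dereference chain, reconstructed from the backtrace rather than copied from the source; the struct definitions are stubs containing only the fields needed for the illustration:

/*
 * Stub types mirroring the ZFS names seen in the backtrace; only the
 * fields needed to show the faulting dereference are included.
 */
typedef struct spa spa_t;
typedef struct vdev { spa_t *vdev_spa; } vdev_t;
typedef struct metaslab_group { vdev_t *mg_vd; } metaslab_group_t;
typedef struct metaslab { metaslab_group_t *ms_group; } metaslab_t;

static spa_t *
metaslab_enable_crash_site(metaslab_t *msp)
{
	metaslab_group_t *mg = msp->ms_group;  /* NULL once the metaslab is unloaded */
	return (mg->mg_vd->vdev_spa);          /* metaslab.c:4735 -> SIGSEGV */
}

With ms_group == NULL, the mg->mg_vd load dereferences a NULL pointer, which matches the segfault above.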

This implies that we were trying to initialize a vdev that had an unpopulated metaslab. In other cases, we've seen issues with races between initialize and removal, so I looked at the other stacks and saw that a removal was in progress:

#4  0x00007f0bfe2e0073 in txg_wait_synced (dp=0x55d7cab14310, txg=txg@entry=395) at ../../module/zfs/txg.c:692
#5  0x00007f0bfe2d90af in spa_vdev_config_exit (spa=spa@entry=0x55d7caa681c0, vd=vd@entry=0x0, txg=395, error=error@entry=0, tag=tag@entry=0x7f0bfe4211a0 <__func__.18211> "spa_vdev_remove_log")
    at ../../module/zfs/spa_misc.c:1202
#6  0x00007f0bfe35d615 in spa_vdev_remove_log (vd=vd@entry=0x55d7cab7d000, txg=txg@entry=0x7f0bd6037db0) at ../../module/zfs/vdev_removal.c:1873
#7  0x00007f0bfe361220 in spa_vdev_remove (spa=0x55d7caa681c0, guid=4280776305795769058, unspare=<optimized out>) at ../../module/zfs/vdev_removal.c:2178
#8  0x000055d7c9cd34c9 in ztest_vdev_add_remove (zd=<optimized out>, id=<optimized out>) at ztest.c:3023

The pointer to the vdev being initialized is 0x55d7cacc0050 and the pointer to the one being removed is 0x55d7cab7d000. Looking at the vdev configuration from the pool, we can see that the one being initialized belongs to the second mirror (which is the one we're trying to remove):

0x55d7cac1d000  root
 0x55d7caabc610    mirror
  0x55d7caba5000      /var/tmp/zloop-run/ztest.0a
 0x55d7cab7d000    mirror
  0x55d7cacc0050      /var/tmp/zloop-run/ztest.1a
  0x55d7cab3e410      /var/tmp/zloop-run/ztest.1b

So it seems we're hitting a race where we initialize a vdev while we're removing its parent.
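
To make the suspected ordering explicit, here is the interleaving as read from the two stacks; this is an inference from the backtraces, not a trace of the actual execution:

/*
 * Suspected interleaving (inferred, not observed directly):
 *
 *   removal thread (ztest_vdev_add_remove)    initialize thread
 *   ---------------------------------------   -------------------------------
 *   spa_vdev_remove()                         vdev_initialize_thread()
 *     spa_vdev_remove_log()                     (still initializing a child
 *       unloads the log mirror's metaslabs       of the log mirror)
 *         -> ms->ms_group = NULL
 *       txg_wait_synced()  <blocked>           metaslab_enable(msp, B_TRUE)
 *                                                msp->ms_group is now NULL
 *                                                -> SIGSEGV at metaslab.c:4735
 */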

@shartse added the Component: Test Suite label Apr 9, 2019
@ahrens added the Type: Defect label Apr 9, 2019
PrivatePuffin (Contributor) commented:

This might still be the case with the new B-tree, but it seems somewhat obfuscated:

Program terminated with signal SIGSEGV, Segmentation fault.
#0  zfs_btree_first (tree=tree@entry=0x0, where=where@entry=0x7fd516230c90) at ../../module/zfs/btree.c:1074
1074	../../module/zfs/btree.c: No such file or directory.
[Current thread is 1 (Thread 0x7fd516232700 (LWP 19931))]
Backtrace:
#0  zfs_btree_first (tree=tree@entry=0x0, where=where@entry=0x7fd516230c90) at ../../module/zfs/btree.c:1074
#1  0x00007fd55e217b66 in range_tree_walk (rt=0x0, func=func@entry=0x7fd55e217530 <range_tree_remove>, arg=0x7fd5240a3d10) at ../../module/zfs/range_tree.c:696
#2  0x00007fd55e20a427 in metaslab_load_impl (msp=msp@entry=0x7fd524020310) at ../../module/zfs/metaslab.c:2334
#3  0x00007fd55e20aab5 in metaslab_load (msp=msp@entry=0x7fd524020310) at ../../module/zfs/metaslab.c:2483
#4  0x00007fd55e258c78 in vdev_initialize_calculate_progress (vd=vd@entry=0x1eb9650) at ../../module/zfs/vdev_initialize.c:369
#5  0x00007fd55e25971e in vdev_initialize_thread (arg=0x1eb9650) at ../../module/zfs/vdev_initialize.c:504
#6  0x00007fd55dbe16ba in start_thread (arg=0x7fd516232700) at pthread_create.c:333
#7  0x00007fd55d91741d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109

ikozhukhov (Contributor) commented:

I see a similar issue on DilOS from time to time too:

12/13 12:10:51 ztest -VVVVV -m 2 -r 0 -R 1 -v 2 -a 9 -T 79 -P 25 -s 128m -f /var/tmp
*** ztest crash found - moving logs to /var/tmp/zloop/zloop-191213-121255-305393552
*** core @ /var/tmp/zloop/zloop-191213-121255-305393552/core:
debugging core file of ztest (64-bit) from zt3-zloop
file: /usr/bin/ztest
initial argv: /usr/bin/ztest
threading model: native threads
status: process terminated by SIGSEGV (Segmentation Fault), addr=8
libzpool.so.1`zfs_btree_first+0xa(0, fffffc7fb2f51ce0)
libzpool.so.1`range_tree_walk+0x2e(0, fffffc7fe94f4e30, d0eb6c0)
libzpool.so.1`metaslab_load_impl+0x184(f3fbac0)
libzpool.so.1`metaslab_load+0xa9(f3fbac0)
libzpool.so.1`vdev_initialize_calculate_progress+0x14f(d08d000)
libzpool.so.1`vdev_initialize_thread+0x165(d08d000)
libc.so.1`_thrp_setup+0x6c(fffffc7feddeea40)
libc.so.1`_lwp_start()

continuing...
12/13 12:13:18 ztest -VVVVV -m 2 -r 0 -R 1 -v 2 -a 12 -T 71 -P 15 -s 128m -f /var/tmp

behlendorf added a commit to behlendorf/zfs that referenced this issue Dec 19, 2019
Any running 'zpool initialize' or TRIM must be cancelled prior
to the vdev_metaslab_fini() call in spa_vdev_remove_log() which
will unload the metaslabs and set ms->ms_group == NULL.

TEST_ZTEST_TIMEOUT=7200
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue openzfs#8602
behlendorf added a commit that referenced this issue Dec 26, 2019
Any running 'zpool initialize' or TRIM must be cancelled prior
to the vdev_metaslab_fini() call in spa_vdev_remove_log() which
will unload the metaslabs and set ms->ms_group == NULL.

Reviewed-by: Igor Kozhukhov <igor@dilos.org>
Reviewed-by: Kjeld Schouten <kjeld@schouten-lebbing.nl>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #8602
Closes #9751
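
In other words, the fix described in the commit message presumably boils down to stopping any in-flight initialize/TRIM on the log vdev before its metaslabs are unloaded. A heavily abbreviated sketch, assuming the OpenZFS in-tree symbols vdev_initialize_stop_all(), vdev_trim_stop_all(), vdev_autotrim_stop_wait(), and vdev_metaslab_fini(), and not the exact diff:

/*
 * Sketch only: cancel initialize/TRIM before vdev_metaslab_fini()
 * clears ms_group, so vdev_initialize_thread() can no longer race
 * against the teardown.  See the merged commit for the real change.
 */
static void
spa_vdev_remove_log_teardown_sketch(vdev_t *vd)
{
	/* Stop any initialize/TRIM threads running on this top-level vdev. */
	vdev_initialize_stop_all(vd, VDEV_INITIALIZE_CANCELED);
	vdev_trim_stop_all(vd, VDEV_TRIM_CANCELED);
	vdev_autotrim_stop_wait(vd);

	/* Only now is it safe to unload the metaslabs (sets ms->ms_group = NULL). */
	vdev_metaslab_fini(vd);
}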
allanjude pushed a commit to allanjude/zfs that referenced this issue Dec 28, 2019
Any running 'zpool initialize' or TRIM must be cancelled prior
to the vdev_metaslab_fini() call in spa_vdev_remove_log() which
will unload the metaslabs and set ms->ms_group == NULL.

Reviewed-by: Igor Kozhukhov <igor@dilos.org>
Reviewed-by: Kjeld Schouten <kjeld@schouten-lebbing.nl>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes openzfs#8602
Closes openzfs#9751
tonyhutter pushed a commit to tonyhutter/zfs that referenced this issue Jan 2, 2020
Any running 'zpool initialize' or TRIM must be cancelled prior
to the vdev_metaslab_fini() call in spa_vdev_remove_log() which
will unload the metaslabs and set ms->ms_group == NULL.

Reviewed-by: Igor Kozhukhov <igor@dilos.org>
Reviewed-by: Kjeld Schouten <kjeld@schouten-lebbing.nl>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes openzfs#8602
Closes openzfs#9751
tonyhutter pushed a commit that referenced this issue Jan 23, 2020
Any running 'zpool initialize' or TRIM must be cancelled prior
to the vdev_metaslab_fini() call in spa_vdev_remove_log() which
will unload the metaslabs and set ms->ms_group == NULL.

Reviewed-by: Igor Kozhukhov <igor@dilos.org>
Reviewed-by: Kjeld Schouten <kjeld@schouten-lebbing.nl>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #8602
Closes #9751
behlendorf (Contributor) commented:

Reopening; this issue has not been entirely resolved and has been observed in the latest code from June.

@behlendorf reopened this Jul 6, 2020

stale bot commented Jul 6, 2021

This issue has been automatically marked as "stale" because it has not had any activity for a while. It will be closed in 90 days if no further activity occurs. Thank you for your contributions.

The stale bot added the Status: Stale label Jul 6, 2021
PrivatePuffin (Contributor) commented:

Let's not stale this one...

The stale bot removed the Status: Stale label Jul 6, 2021

stale bot commented Jul 7, 2022

This issue has been automatically marked as "stale" because it has not had any activity for a while. It will be closed in 90 days if no further activity occurs. Thank you for your contributions.

The stale bot added the Status: Stale label Jul 7, 2022
behlendorf (Contributor) commented:

Closing. This was resolved by 793c958.

@behlendorf removed the Status: Stale label Jul 7, 2022