
btree: Reduce opportunities for branch mispredictions in binary search #14866

Merged: 1 commit merged into openzfs:master on May 26, 2023

Conversation

@ryao commented May 13, 2023

Motivation and Context

A conversation in IRC inspired me to read about different strategies for searching within the leaves of a B-Tree. We currently do a binary search. However, linear search is a popular option (used by Rust, for example), since it can be faster than binary search on small arrays due to cache effects. Various sources online suggest that this holds for arrays of roughly 50 to 150 elements. Our B-Tree leaves typically store 170 or more elements, so I assume linear search would be worse than binary search. However, while reading about this, I learned about a "branchless" binary search algorithm published by Knuth as Shar's algorithm (not to be confused with Shor's algorithm), which others have found to be 2 to 3 times faster than regular binary search.
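
For a feel of the difference, here is a minimal sketch (my illustration, not this PR's code) of a classic binary search next to the branchless form. In the branchless loop, the only branch is the loop condition, which runs a fixed number of iterations for a given array size and therefore predicts perfectly:

#include <stddef.h>
#include <stdint.h>

/* Classic binary search: each comparison drives a hard-to-predict branch. */
static int *
classic_search(int *a, uint32_t n, int key)
{
	uint32_t lo = 0, hi = n;

	while (lo < hi) {
		uint32_t mid = lo + (hi - lo) / 2;

		if (a[mid] < key)
			lo = mid + 1;
		else if (a[mid] > key)
			hi = mid;
		else
			return (&a[mid]);
	}
	return (NULL);
}

/*
 * Branchless binary search: the pointer update compiles to a conditional
 * move, so the comparison result never feeds the branch predictor.
 */
static int *
branchless_search(int *a, uint32_t n, int key)
{
	int *base = a;

	if (n == 0)
		return (NULL);
	while (n > 1) {
		uint32_t half = n / 2;

		n -= half;
		base += (base[half - 1] < key) * half;
	}
	return (*base == key ? base : NULL);
}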

Description

This implements Shar's algorithm for binary search with comparator inlining. Consumers must opt into using the faster algorithm. At present, only B-Trees used inside kernel code have been modified to use the faster algorithm.

How Has This Been Tested?

I wrote a small test program to convince myself that the proposed implementation is correct. It compares the output of the new implementation against the existing implementation while searching for every possible location relative to a 1024-value array:

#include <stdint.h>
#include <stdlib.h>
#include <stdio.h>

#define	BTREE_CORE_ELEMS	126
#define	BTREE_LEAF_SIZE		4096

typedef int boolean_t;

#define B_TRUE 1
#define B_FALSE 0

typedef struct zfs_btree_hdr {
	struct zfs_btree_core	*bth_parent;
	/*
	 * Set to -1 to indicate core nodes. Other values represent first
	 * valid element offset for leaf nodes.
	 */
	uint32_t		bth_first;
	/*
	 * For both leaf and core nodes, represents the number of elements in
	 * the node. For core nodes, they will have bth_count + 1 children.
	 */
	uint32_t		bth_count;
} zfs_btree_hdr_t;

typedef struct zfs_btree_core {
	zfs_btree_hdr_t	btc_hdr;
	zfs_btree_hdr_t	*btc_children[BTREE_CORE_ELEMS + 1];
	uint8_t		btc_elems[];
} zfs_btree_core_t;

typedef struct zfs_btree_leaf {
	zfs_btree_hdr_t	btl_hdr;
	uint8_t		btl_elems[];
} zfs_btree_leaf_t;

typedef struct zfs_btree_index {
	zfs_btree_hdr_t	*bti_node;
	uint32_t	bti_offset;
	/*
	 * True if the location is before the list offset, false if it's at
	 * the listed offset.
	 */
	boolean_t	bti_before;
} zfs_btree_index_t;

typedef struct btree {
	int (*bt_compar) (const void *, const void *);
	size_t			bt_elem_size;
	size_t			bt_leaf_size;
	uint32_t		bt_leaf_cap;
	int32_t			bt_height;
	uint64_t		bt_num_elems;
	uint64_t		bt_num_nodes;
	zfs_btree_hdr_t		*bt_root;
	zfs_btree_leaf_t	*bt_bulk; // non-null if bulk loading
} zfs_btree_t;

#define TREE_CMP(a, b) (((a) > (b)) - ((a) < (b)))

int comparator (const void *ap, const void *bp) {
	int a = *(int*)ap;
	int b = *(int*)bp;

	return (TREE_CMP(a, b));
}

/*
 * Find value in the array of elements provided. Uses a simple binary search.
 */
static void *
zfs_btree_find_in_buf(zfs_btree_t *tree, uint8_t *buf, uint32_t nelems,
    const void *value, zfs_btree_index_t *where)
{
	uint32_t max = nelems;
	uint32_t min = 0;
	while (max > min) {
		uint32_t idx = (min + max) / 2;
		uint8_t *cur = buf + idx * tree->bt_elem_size;
		int comp = tree->bt_compar(cur, value);
		if (comp < 0) {
			min = idx + 1;
		} else if (comp > 0) {
			max = idx;
		} else {
			where->bti_offset = idx;
			where->bti_before = B_FALSE;
			return (cur);
		}
	}

	where->bti_offset = max;
	where->bti_before = B_TRUE;
	return (NULL);
}

#define	ZFS_BTREE_FIND_IN_BUF_FUNC(NAME, T, COMP)			\
_Pragma("GCC diagnostic push")						\
_Pragma("GCC diagnostic ignored \"-Wunknown-pragmas\"")			\
static void *								\
NAME(zfs_btree_t *tree, uint8_t *buf, uint32_t nelems,			\
    const void *value, zfs_btree_index_t *where)			\
{									\
	T *i = (T *)buf;						\
	(void) tree;							\
	_Pragma("GCC unroll 10")					\
	while (nelems > 1) {						\
		uint32_t half = nelems / 2;				\
		nelems -= half;						\
		i += (COMP(&i[half - 1], value) < 0) * half;		\
	}								\
									\
	int comp = COMP(i, value);					\
	where->bti_offset = (size_t)(i - (T *)buf) + (comp < 0);	\
	where->bti_before = (comp != 0);				\
									\
	if (comp == 0) {						\
		return (i);						\
	}								\
									\
	return (NULL);							\
}									\
_Pragma("GCC diagnostic pop")
ZFS_BTREE_FIND_IN_BUF_FUNC(zfs_btree_find_in_buf_new, uint32_t, tree->bt_compar);

int main (void)
{
	union {
		uint8_t buf[BTREE_LEAF_SIZE];
		int arr[BTREE_LEAF_SIZE / sizeof (int)];
	} u;
	zfs_btree_t tree = {
		.bt_elem_size = sizeof (u.arr[0]),
		.bt_compar = comparator,
	};

	int size = 1024;

	for (int i = 0; i < size; i++)
		u.arr[i] = 2*i+1;

	for (int i = 0; i <= size*2+1; i++)
	{
		zfs_btree_index_t a, b;
		uint8_t *x, *y;
		x = zfs_btree_find_in_buf(&tree, u.buf, size, &i, &a);
		y = zfs_btree_find_in_buf_new(&tree, u.buf, size, &i, &b);
		if (a.bti_offset != b.bti_offset) {
			printf("Offsets do not match\n");
			return (1);
		}

		if (a.bti_before != b.bti_before) {
			printf("before does not match\n");
			return (1);
		}

		if (x != y) {
			printf("Return pointers do not match\n");
			return (1);
		}
	}

	printf("The two match\n");

	return (0);
}

On my machine, this reports that the two functions' outputs match, and changing the array size from 1024 to 768 does not change that. This is consistent with the new function returning the same results as the existing one.

Some micro-benchmarks that I did on uncached arrays sized to match our B-Tree leaves suggest that this can improve binary search performance by up to 3.5 times when compiling with Clang 16 and up to 1.9 times when compiling with GCC 12.2:

#14866 (comment)

Note that this description has been updated to reflect the current state of the PR. An earlier version discussed an alternative binary search implementation that micro-benchmarks later revealed to be slower than the current code.

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Performance enhancement (non-breaking change which improves efficiency)
  • Code cleanup (non-breaking change which makes code smaller or more readable)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Library ABI change (libzfs, libzfs_core, libnvpair, libuutil and libzfsbootenv)
  • Documentation (a change to man pages or other documentation)


@ryao force-pushed the btree branch 3 times, most recently from a00a492 to 9d203da on May 13, 2023 20:04
@KungFuJesus commented May 13, 2023

Hmm, this smells like it might be a compiler bug. The most recent one I found, in gcc 11+, I tracked down by turning off O2-specific optimizations one by one.

For me the bug was the store merging optimization. This has been fixed in all newer minor versions of the affected gccs.

@ryao commented May 13, 2023

Hmm, this smells like it might be a compiler bug. The most recent one I found, in gcc 11+, I tracked down by turning off O2-specific optimizations one by one.

For me the bug was the store merging optimization. This has been fixed in all newer minor versions of the affected gccs.

It smells like a compiler bug to me too. Interestingly, compiling with gcc -O2 -fsanitize=undefined triggers it as well, while plain gcc -O2 does not have a problem. This does not bode well for the buildbot's zloop, which should be running a ztest compiled with those options. Even if this code is right, I will need to figure out a workaround for the compiler bug before it can be merged.

@ryao marked this pull request as draft May 13, 2023 20:14
@ryao commented May 13, 2023

=================================================================
==8243==ERROR: AddressSanitizer: stack-buffer-overflow on address 0x7fffa953af40 at pc 0x5617ca7a280b bp 0x7fffa9539d90 sp 0x7fffa9539d80
READ of size 4 at 0x7fffa953af40 thread T0
    #0 0x5617ca7a280a in comparator (/tmp/a.out+0x180a)
    #1 0x5617ca7a29fe in zfs_btree_find_in_buf_new.constprop.0 (/tmp/a.out+0x19fe)
    #2 0x5617ca7a2489 in main (/tmp/a.out+0x1489)
    #3 0x7f6d7eb828c9 in __libc_start_call_main ../sysdeps/nptl/libc_start_call_main.h:58
    #4 0x7f6d7eb82984 in __libc_start_main_impl ../csu/libc-start.c:360
    #5 0x5617ca7a26a0 in _start (/tmp/a.out+0x16a0)

Address 0x7fffa953af40 is located in stack of thread T0 at offset 4320 in frame
    #0 0x5617ca7a213f in main (/tmp/a.out+0x113f)

  This frame has 5 object(s):
    [48, 52) 'i' (line 156)
    [64, 80) 'a' (line 158)
    [96, 112) 'b' (line 158)
    [128, 192) 'tree' (line 146)
    [224, 4320) 'u' (line 145) <== Memory access at offset 4320 overflows this variable
HINT: this may be a false positive if your program uses some custom stack unwind mechanism, swapcontext or vfork
      (longjmp and C++ exceptions *are* supported)
SUMMARY: AddressSanitizer: stack-buffer-overflow (/tmp/a.out+0x180a) in comparator
Shadow bytes around the buggy address:
  0x10007529f590: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  0x10007529f5a0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  0x10007529f5b0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  0x10007529f5c0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  0x10007529f5d0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
=>0x10007529f5e0: 00 00 00 00 00 00 00 00[f3]f3 f3 f3 f3 f3 f3 f3
  0x10007529f5f0: f3 f3 f3 f3 f3 f3 f3 f3 00 00 00 00 00 00 00 00
  0x10007529f600: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  0x10007529f610: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  0x10007529f620: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  0x10007529f630: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
Shadow byte legend (one shadow byte represents 8 application bytes):
  Addressable:           00
  Partially addressable: 01 02 03 04 05 06 07 
  Heap left redzone:       fa
  Freed heap region:       fd
  Stack left redzone:      f1
  Stack mid redzone:       f2
  Stack right redzone:     f3
  Stack after return:      f5
  Stack use after scope:   f8
  Global redzone:          f9
  Global init order:       f6
  Poisoned by user:        f7
  Container overflow:      fc
  Array cookie:            ac
  Intra object redzone:    bb
  ASan internal:           fe
  Left alloca redzone:     ca
  Right alloca redzone:    cb
==8243==ABORTING

ASAN caught the problem. I will need to rework this.

@amotin (Member) left a comment

Interesting game. I looked at the assembly generated by Clang, and it indeed has only one branch per loop iteration, plus two more outside the loop.

The extra comparisons needed reminded me that we had a pretty bad comparison function in scrub, using two divisions per comparison. I've fixed that one, but I wonder if we may have more of those; I don't remember if I checked.

I guess that without branching the CPU should not be able to predict, and so prefetch, at all? It may indeed reduce cache pollution in case of a mis-prediction, but in the other (lucky) case wouldn't we get the data somewhat earlier? Does the mis-prediction penalty outweigh that by much?

@ryao commented May 13, 2023

@KungFuJesus @amotin I have identified and fixed the problem. I just repushed. I am going to edit the original PR message to reflect the changes.

@ryao marked this pull request as ready for review May 13, 2023 20:34
@ryao commented May 13, 2023

Interesting game. I looked at the assembly generated by Clang, and it indeed has only one branch per loop iteration, plus two more outside the loop.

I built my test program with clang -O2 -fno-inline ....

Clang's inner loop for the original:

.LBB2_2:                                #   in Loop: Header=BB2_1 Depth=1
        incl    %ebp
        movl    %ebp, %ebx
        cmpl    %ebx, %r14d
        jbe     .LBB2_6
.LBB2_1:                                # =>This Inner Loop Header: Depth=1
        leal    (%r14,%rbx), %ebp
        shrl    %ebp
        movq    8(%r12), %r13
        imulq   %rbp, %r13
        addq    %r15, %r13
        movq    %r13, %rdi
        movq    16(%rsp), %rsi                  # 8-byte Reload
        callq   *(%r12)
        testl   %eax, %eax
        js      .LBB2_2
# %bb.3:                                #   in Loop: Header=BB2_1 Depth=1
        je      .LBB2_8
# %bb.4:                                #   in Loop: Header=BB2_1 Depth=1
        movl    %ebp, %r14d
        cmpl    %ebx, %r14d
        ja      .LBB2_1

Clang's inner loop for my version in the test program:

.LBB3_1:                                # =>This Inner Loop Header: Depth=1
        leal    (%r14,%rbx), %ecx
        shrl    %ecx
        testl   %eax, %eax
        cmovnel %r12d, %ebp
        addl    %ecx, %ebp
        movq    16(%rsp), %rax                  # 8-byte Reload
        movq    8(%rax), %r13
        imulq   %rbp, %r13
        addq    24(%rsp), %r13                  # 8-byte Folded Reload
        movq    %r13, %rdi
        movq    32(%rsp), %rsi                  # 8-byte Reload
        callq   *(%rax)
        xorl    %ecx, %ecx
        cmpl    $1023, %r14d                    # imm = 0x3FF
        setne   %cl
        addl    %ebp, %ecx
        movl    %eax, %edx
        sarl    $31, %edx
        andl    %edx, %ecx
        testl   %eax, %eax
        cmovlel %r12d, %r14d
        movl    $0, %esi
        cmovgl  %ebp, %esi
        addl    %ecx, %r14d
        andl    %ebx, %edx
        movl    %edx, %ebx
        addl    %esi, %ebx
        incl    %r15d
        jne     .LBB3_1

Clang was too clever here. It noticed that the test program always calls the function with the same array size, so it generated code based on that, which is why we see the constant 1023. Interestingly, it becomes even more clever at -O3, where it will unroll the loop using this information. Unfortunately, this kind of cleverness keeps us from seeing what the assembly would be inside the kernel module, so I inserted the following inline assembly right after the size variable is declared so that Clang has no idea what the value is:

asm ("movl %1, %0;"
    :"=r"(size)
    :"r"(1024)
    :);

Then we get this:

.LBB3_2:                                # =>This Inner Loop Header: Depth=1
        leal    (%r12,%r15), %ecx
        shrl    %ecx
        testl   %eax, %eax
        cmovnel %r13d, %r14d
        addl    %ecx, %r14d
        movq    16(%rsp), %rax                  # 8-byte Reload
        movq    8(%rax), %rbx
        imulq   %r14, %rbx
        addq    32(%rsp), %rbx                  # 8-byte Folded Reload
        movq    %rbx, %rdi
        movq    24(%rsp), %rsi                  # 8-byte Reload
        callq   *(%rax)
        xorl    %ecx, %ecx
        cmpl    4(%rsp), %r12d                  # 4-byte Folded Reload
        setne   %cl
        addl    %r14d, %ecx
        movl    %eax, %edx
        sarl    $31, %edx
        andl    %edx, %ecx
        testl   %eax, %eax
        cmovlel %r13d, %r12d
        movl    $0, %esi
        cmovgl  %r14d, %esi
        addl    %ecx, %r12d
        andl    %r15d, %edx
        movl    %edx, %r15d
        addl    %esi, %r15d
        decl    %ebp
        jne     .LBB3_2

I count 4 branches for the original version and 1 for my version as far as the loop is concerned. I am not that concerned about the post-loop check for whether we had a match in my version, although I imagine it is an opportunity for misprediction too. It is possible to remove it via the trick I used to remove the other branches, but that would make this code incompatible with CheriBSD, so I left it alone.

Perhaps if we mark the callback with the leaf function attribute, we can make this more succinct by avoiding register spilling, keeping values in registers that leaf functions typically do not touch.
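
For reference, a hypothetical sketch of that idea (my illustration; whether it actually reduces spilling here is untested). GCC's leaf attribute goes on the declaration of an external function and tells the compiler that the call cannot re-enter this translation unit except by returning:

/*
 * Hypothetical: with this declaration, the compiler knows my_comparator
 * cannot call back into this file, which may let callers keep more
 * values live across the call.
 */
extern int my_comparator(const void *, const void *)
    __attribute__((leaf));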

I guess that without branching the CPU should not be able to predict, and so prefetch, at all? It may indeed reduce cache pollution in case of a mis-prediction, but in the other (lucky) case wouldn't we get the data somewhat earlier? Does the mis-prediction penalty outweigh that by much?

The CPU should be doing prediction either way. However, with the current version, it has to guess whether the callback returns < 0, == 0 or > 0. There is a greater than 50% chance of it guessing wrong. With the proposed version, the CPU does not need to guess, so its prediction should be right every time.

As for your other questions, I was surprised that others have found a branchless version to be better, so it would seem that the misprediction penalties really are that high. Mispredictions would cause us to get data later, because we would only start the fetch for the correct data after the speculative fetch has been found to be incorrect, which might explain it. Not mispredicting avoids delays in getting data rather than making the data arrive earlier.

That said, the performance of this version still needs to be evaluated. If it is indistinguishable from the current version in benchmarks / profiling, I would consider this to be better for the reduced cache pollution. However, I am hopeful that this will be at least a slight improvement over the current version.

@ryao commented May 13, 2023

The buildbot is indicating that there are problems. I do not have time to examine those now, so I am going to withdraw this until I have time to revisit it.

@ryao closed this May 13, 2023
@ryao commented May 14, 2023

An idea for fixing this occurred to me, so I am reopening it to see what the buildbot says.

@ryao reopened this May 14, 2023
@ryao force-pushed the btree branch 3 times, most recently from 3363c10 to feeb3f2 on May 14, 2023 15:22
@ryao commented May 14, 2023

It looks like the latest change, which ensures the output is the same when the loop hits a min >= max condition, worked. I can see a few tweaks I could still make, but the current version should be good enough to have its performance evaluated under a B-Tree-heavy workload, should anyone be willing to volunteer.

@ryao commented May 15, 2023

This site has a good description of what causes binary search to be slow:

https://en.algorithmica.org/hpc/data-structures/binary-search/

Our current binary search attempts an early exit from the loop, which adds a branch and is probably slower than a version without the early exit. Doing additional branching to understand the result of the comparator is probably suboptimal too. My version “fixed” those things, but it uses excessive instructions and misses the improvement others have found to make things even faster: the comparison branch inside the comparator can be replaced with predication on amd64 to make the loop body truly branchless.

The cited performance boost is big enough that I am thinking of refactoring the code to support search functions that inline the comparator, just so that we can get a boost from predication where it is possible.

@ryao commented May 15, 2023

Here is an experimental binary search function for use in my test program; it assumes that the B-Tree leaf is full of 32-bit keys. A realistic implementation would potentially have 4 versions of this function, based on whether the comparison is u32, s32, u64 or s64, and it would read a value from zfs_btree_t that indicates the offset of the key inside the array object, in addition to ->bt_elem_size. However, the assembly output illustrates roughly what we can get from using predication:

static void *
zfs_btree_find_in_buf_new(zfs_btree_t *tree, uint8_t *buf, uint32_t nelems,
    const void *value, zfs_btree_index_t *where)
{
        uint32_t x = *(int *)value;
        uint32_t *i = (uint32_t *)buf;
        while (nelems > 1) {
                uint32_t half = nelems / 2;
                nelems -= half;
                i += (i[half - 1] < x) * half;
        }

        where->bti_offset = ((unsigned char *)i - buf) / 4;
        int comp = *i - x;
        where->bti_offset += (comp < 0);

        if (comp == 0) {
                where->bti_before = B_FALSE;
                return (i);
        }

        where->bti_before = B_TRUE;
        return (NULL);
}

Here is what clang -O2 -fno-inline ... outputs:

zfs_btree_find_in_buf_new:              # @zfs_btree_find_in_buf_new
        .cfi_startproc
# %bb.0:
        movl    (%rdx), %edx
        movq    %rdi, %r8
        cmpl    $2, %esi
        jb      .LBB3_3
# %bb.1:
        xorl    %eax, %eax
        movq    %rdi, %r8
        .p2align        4, 0x90
.LBB3_2:                                # =>This Inner Loop Header: Depth=1
        movl    %esi, %r9d
        shrl    %r9d
        subl    %r9d, %esi
        leal    -1(%r9), %r10d
        cmpl    %edx, (%r8,%r10,4)
        cmovael %eax, %r9d
        leaq    (%r8,%r9,4), %r8
        cmpl    $1, %esi
        ja      .LBB3_2
.LBB3_3:
        movq    %r8, %rax
        subq    %rdi, %rax
        leaq    3(%rax), %rsi
        testq   %rax, %rax
        cmovnsq %rax, %rsi
        shrq    $2, %rsi
        movl    %esi, 8(%rcx)
        movl    (%r8), %edi
        xorl    %eax, %eax
        xorl    %r9d, %r9d
        subl    %edx, %edi
        setne   %r9b
        cmoveq  %r8, %rax
        shrl    $31, %edi
        addl    %esi, %edi
        movl    %edi, 8(%rcx)
        movl    %r9d, 12(%rcx)
        retq

That is precisely two branches. One to jump over the loop in the unlikely event that we are given an array size of 1 and another to actually iterate the loop. The inner loop is fairly tight at just 9 instructions.

In hindsight, it was somewhat premature to post this PR, but posting it gave me feedback from the buildbot on the first pass that convinced me that a branchless binary search is doable in a compatible way.

@ryao marked this pull request as draft May 15, 2023 03:00
@ryao commented May 15, 2023

I am abusing this PR to keep links to references and notes on my thoughts as I think about this over time, so my apologies to readers for the verbosity.

This site has a good description of what causes binary search to be slow:

https://en.algorithmica.org/hpc/data-structures/binary-search/

Another good resource is this:

https://github.com/scandum/binary_search

The algorithmica.org lower_bound() is basically a pointer-optimized version of the monobound binary search listed there. The data presented shows the monobound binary search to be the fastest binary search for small arrays.

Looking at the code makes it clear why. We have a predictable loop condition without any other branching, so a modern superscalar processor should never mispredict. There is no data dependency between the next iteration’s index calculation and the current iteration’s array access, so the next array access can begin immediately after the current iteration has finished executing. There are also few instructions, so another thread sharing the CPU core will run faster simply because there is little contention for execution resources.

I was surprised to see the compiler use predication to avoid a branch after the loop.

I considered whether it is possible to do even better than the pointer-optimized monobound binary search and had three ideas:

  1. Due to the size of our leaves, faster searching could be achieved via a B-Tree-in-B-Tree-leaf strategy, where we store the nodes in an array. This would let us combine linear search’s fast performance on small arrays with the reduced lookups of a branching factor bigger than 2. However, under the constraint of no additional memory, just reasoning about performance when the branching factor is 2 shows that we would get O(n log n) insertion and deletion times, which, due to n being bounded, is technically O(1), but is bigger than our current O(n) time, which is technically also O(1). There is also a branch misprediction issue from the tree not being complete, unless predication is used to fix it. Unless we have an extreme bias toward search operations, I doubt this would be worthwhile.

  2. A hybrid search that does a binary search with early termination and a linear search to finish. This might work, although there is very little literature on the idea, and what literature does exist did not compare it to the monobound binary search. The linear search would suffer a branch misprediction when an item is found, unless it is made to use predication so that it can continue iterating until the last element even after finding the element we want. That trick is also what allows the monobound version to avoid branch misprediction (in theory).

  3. The monobound approach has the trade-off that we are technically polluting the cache whenever we have found the solution before the last iteration, but the cache pollution is bounded. On misses, we have no opportunities for avoidable cache pollution, and on hits, there should be no cache pollution half the time (when the hit is in the last iteration). We can use predication to avoid changing the pointer once the solution has been found (a sketch follows below). This would impose a penalty of a few cycles on each memory access, but it would cause us to read from L1 cache whenever a match is found early, making the loop exit faster nearly half the time on matches. This would only be beneficial when the hit rate is over a certain threshold. The benefit from avoiding cache pollution would lower that threshold, depending on how likely the evicted cache lines were to be cache hits and how big the eviction penalty was. The threshold would likely only be lowered by a small amount due to the L2 cache, so I suspect we can disregard that effect.

The first is likely impractical while the other two might be tiny improvements.
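
For what it is worth, here is a hypothetical sketch of the third idea (my illustration, assuming 32-bit keys like the earlier test function, and nelems >= 1). A match is remembered with predication and the base pointer stops advancing, so the remaining iterations probe memory that is already in L1:

#include <stddef.h>
#include <stdint.h>

static uint32_t *
monobound_park_on_match(uint32_t *i, uint32_t nelems, uint32_t key)
{
	uint32_t *match = NULL;

	while (nelems > 1) {
		uint32_t half = nelems / 2;
		uint32_t *probe = &i[half - 1];

		nelems -= half;
		/* Remember a hit without leaving the loop early. */
		match = (*probe == key) ? probe : match;
		/* Once a hit exists, stop advancing the base pointer. */
		i += (match == NULL && *probe < key) ? half : 0;
	}
	if (match == NULL && *i == key)
		match = i;
	return (match);
}

When no match exists, this behaves exactly like the plain monobound loop; the extra cost is the predicated bookkeeping on every iteration, which is why it would only pay off above some hit-rate threshold.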

This blog entry comparing binary search to linear search references an Intel tool that will output information on how Intel processors execute code:

https://dirtyhandscoding.wordpress.com/2017/08/25/performance-comparison-linear-search-vs-binary-search/

It might be useful for future analysis. Intel has since discontinued the tool in favor of llvm-mca.

@ryao commented May 17, 2023

A prototype using the ideas from 3 days ago has been pushed for feedback from the buildbot.

@ryao force-pushed the btree branch 4 times, most recently from cf3489d to 630be9b on May 17, 2023 21:02
@ryao commented May 20, 2023

Extra iterations in the case of an early match still make me shiver, but logically I can understand that the match probability should be low.

That bothered me too, but benchmarking my previous idea for avoiding that showed this is better.

I just think I would add another argument to zfs_btree_create_custom() rather than patching the structure after creation. In the proposed version of metaslab_rt_create(), the size and compare variables do not make much sense.

I will do that as cleanup.
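
To illustrate the suggestion, the opt-in might look roughly like this (a hypothetical sketch reusing the ZFS_BTREE_FIND_IN_BUF_FUNC macro from the test program above; my_compare and my_find_in_buf are illustrative names, and the final zfs_btree_create_custom() argument list is whatever the cleanup settles on):

/* Generate a search function specialized for this consumer's comparator. */
ZFS_BTREE_FIND_IN_BUF_FUNC(my_find_in_buf, uint32_t, my_compare)

static void
my_consumer_init(zfs_btree_t *tree)
{
	/*
	 * Pass the specialized search function at creation time instead
	 * of patching the zfs_btree_t structure after creation.
	 */
	zfs_btree_create_custom(tree, my_compare, my_find_in_buf,
	    sizeof (uint32_t), BTREE_LEAF_SIZE);
}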

@ryao commented May 25, 2023

I have added some documentation, squashed the commits, rebased on master and updated the top comment to reflect the current status of this PR. Excluding less than a half dozen lines of documentation, the code changes are identical to what @amotin approved.

@behlendorf This is ready to merge. It should make our in-kernel B-Tree searches faster. :)

@ryao commented May 25, 2023

@amotin It turns out that I missed a B-Tree in module/zfs/zap_micro.c. I just repushed to include that.

Note that micro-benchmarks suggest there is only a small improvement (e.g. 1.4x) for the smaller leaf size used in that B-Tree, and I only saw it with Clang. With GCC, the two performed closely enough that they might as well have performed the same. There are some limitations to the way the micro-benchmark was designed (such as using int elements instead of actual elements, using a simple TREE_CMP() for comparisons, and assuming that the entire element is the key). Having looked at the assembly, I do not expect a more accurate micro-benchmark to change my conclusion that the new search function should be faster (or at the very least, not worse in the case of this particular B-Tree). So instead of spending time making the micro-benchmark more accurate for more certainty, I opted that B-Tree into the new way of doing searches too.

I should add that the two ways of doing binary search are like different sorting algorithms in that there is no always-fastest algorithm. In sorting, quicksort is faster in general, but in the worst case it can be slower. In unrealistic situations where the binary search becomes predictable (e.g. we search for 0 and all keys are bigger than 0), the new code is slower, because accurate branch prediction makes the current code faster by reducing the total number of instructions executed. However, as you increase the unpredictability of the binary search to a more realistic level, the new code becomes faster. If we modified the micro-benchmark to be even less predictable, we would see an even bigger improvement from the new code, which makes me feel that analyzing this in more depth has reached the point of diminishing returns. Any further performance analysis should be done through macro-benchmarks of actual workloads.

@amotin (Member) left a comment

Looks good to me. It is not too invasive. The only difference with zap_micro I can think of is the smaller leaf size -- only 512 bytes (up to 62 entries) -- which reduces the number of iterations and cache misses (the last 3 iterations out of 5-6 likely hit the same cache line). In my commit message 9dcdee7 I described how I benchmarked it, but I mostly optimized for memory moves; I don't remember now whether the search time was significant there.

(inline review comment on include/sys/btree.h; resolved)
@ryao commented May 25, 2023

Looks good to me. It is not too invasive. The only difference with zap_micro I can think of is the smaller leaf size -- only 512 bytes (up to 62 entries) -- which reduces the number of iterations and cache misses (the last 3 iterations out of 5-6 likely hit the same cache line). In my commit message 9dcdee7 I described how I benchmarked it, but I mostly optimized for memory moves; I don't remember now whether the search time was significant there.

I just tried unrolling the loop via a pragma. Surprisingly, this is faster:

Benchmark: array size: 1024, runs: 1000, repetitions: 10000, seed: 1685033425, density: 10

Even distribution with 1024 32 bit integers, random access

|                                               Name |      Items |       Hits |     Misses |       Time |
|                                         ---------- | ---------- | ---------- | ---------- | ---------- |
|                              current_binary_search |       1024 |        998 |       9002 |   0.000695 |
|           current_binary_search_inlined_comparator |       1024 |        998 |       9002 |   0.000581 |
|                               custom_binary_search |       1024 |        998 |       9002 |   0.000261 |
|                     custom_binary_search_no_inline |       1024 |        998 |       9002 |   0.000461 |
|                        custom_binary_search_unroll |       1024 |        998 |       9002 |   0.000196 |


Uneven distribution with 1024 32 bit integers, random access

|                                               Name |      Items |       Hits |     Misses |       Time |
|                                         ---------- | ---------- | ---------- | ---------- | ---------- |
|                              current_binary_search |       1024 |        998 |       9002 |   0.000696 |
|           current_binary_search_inlined_comparator |       1024 |        998 |       9002 |   0.000555 |
|                               custom_binary_search |       1024 |        998 |       9002 |   0.000261 |
|                     custom_binary_search_no_inline |       1024 |        998 |       9002 |   0.000460 |
|                        custom_binary_search_unroll |       1024 |        998 |       9002 |   0.000196 |

Both GCC and Clang support #pragma GCC unroll 9 (which is the highest unroll value that makes sense in the current codebase). I will push another revision to incorporate this after I have finished some local checks.
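
For reference, in its free-standing form the unrolling is just a pragma in front of the loop (inside the macro it is spelled with the C99 _Pragma operator so that it can live in a macro body):

	/*
	 * Ask the compiler to fully unroll the search loop; per the
	 * discussion above, 9 is the highest unroll value that makes
	 * sense in the current codebase.
	 */
	#pragma GCC unroll 9
	while (nelems > 1) {
		uint32_t half = nelems / 2;
		nelems -= half;
		i += (COMP(&i[half - 1], value) < 0) * half;
	}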

@ryao force-pushed the btree branch 4 times, most recently from 4bc7921 to d4cde97 on May 25, 2023 19:23
@behlendorf added the Status: Accepted (ready to integrate: reviewed, tested) label May 25, 2023
@ryao commented May 25, 2023

An obsolete comment has been fixed, unrolling has been added through a C99 pragma, an unknown-pragma warning on older compilers has been suppressed through C99 pragmas, and a cstyle complaint about the C99 pragmas has been fixed. Hopefully, this will be the last time I need to touch this before it is merged.

@ryao commented May 25, 2023

I spoke too soon. The most recent push makes minor changes to two comments to make them more accurate. There are no code changes in this push.

@ryao left a comment

The commit message should say “inline” not “inlines”.

There are also some other trivial issues for which I have added comments. I will correct these in the next push.

(inline review comments on module/Makefile.bsd, module/Kbuild.in, and include/sys/btree.h; resolved)
This implements a binary search algorithm for B-Trees that reduces
branching to the absolute minimum necessary for a binary search
algorithm. It also enables the compiler to inline the comparator to
ensure that the only slowdown when doing binary search is from waiting
for memory accesses. Additionally, it instructs the compiler to unroll
the loop, which gives an additional 40% improvement with Clang and an 8%
improvement with GCC.

Consumers must opt into using the faster algorithm. At present, only
B-Trees used inside kernel code have been modified to use the faster
algorithm.

Micro-benchmarks suggest that this can improve binary search performance
by up to 3.5 times when compiling with Clang 16 and up to 1.9 times when
compiling with GCC 12.2.

Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
@ryao commented May 25, 2023

I have fixed the last minute nits that I found, rebased on master and repushed.

@ryao commented May 26, 2023

@behlendorf This comment just records some further R&D that yielded interesting results. It will not result in any further changes to this PR unless others give feedback that prompts it (or really want to leverage this work). Feel free to disregard the following.

I was able to adapt Bentley's version of Shar's algorithm to work for variable-sized arrays, and the result is slightly better than the existing code in this PR:

https://muscar.eu/shar-binary-search-meta.html

Inspired by Duff's device, I unrolled the loop into a switch statement by calculating the bit_floor() using highbit(). I also peeled the first iteration of the loop out of the switch statement so that I could modify the first comparison to resize the array to a power-of-2 size. The code generated for the D version by Alex Muscar uses only 3 instructions per unrolled loop iteration, which is amazing compared to our current code, so I restructured the C to resemble the D code. This caused GCC 12.2 to begin emitting the 3-instruction sequence (although this regressed to 4 instructions with GCC 13.1), while LLVM emits a 4-instruction sequence.

Unfortunately, generalizing this requires implementing a second comparator function with the semantics that it returns 1 when the first argument is less than or equal to the second, and 0 otherwise. Without it, we lose the nice 3-or-4-instructions-per-unrolled-iteration output. This would be an ugly hack around the limitations of present optimizing compiler technology.

I also noticed that GCC inserted a branch at the end. GCC does not have Clang's __builtin_unpredictable(), which coincidentally does not actually do anything when used with Clang due to a Clang bug, so I abused __builtin_expect_with_probability() to get GCC to use predication for that branch. Also, to save a few instructions, I disregarded the 0 case and inlined highbit into the function.

That yielded the following implementation, which follows the semantics of libc's bsearch() because the micro-benchmark was designed to use it (and refactoring did not make sense, as the results should be similar either way):

#define TREE_CMP(a, b) (((a) > (b)) - ((a) < (b))) 
__attribute__((always_inline)) inline
int inline_comparator (const void *ap, const void *bp) {
        int a = *(int*)ap;
        int b = *(int*)bp;
    
        return (TREE_CMP(a, b));
}

__attribute__((always_inline)) inline
int le_comparator (const void *ap, const void *bp) {
        int a = *(int*)ap;
        int b = *(int*)bp;

        return (a <= b);
}

int custom_binary_search_unroll_switch_v3(int *array, unsigned int array_size, int key)
{
    int *base;

    unsigned int u = sizeof (array_size) * NBBY - __builtin_clz(2U*array_size-1U);
    unsigned int p = 1U << (u - 2U);
    unsigned int i = le_comparator(&array[p], &key) * (array_size - p);

    switch (u - 1) {
    case 11:
        if (le_comparator(&array[i + 512], &key))
                i += 512;
    case 10:
        if (le_comparator(&array[i + 256], &key))
                i += 256;
    case 9:
        if (le_comparator(&array[i + 128], &key))
                i += 128;
    case 8:
        if (le_comparator(&array[i + 64], &key))
                i += 64;
    case 7:
        if (le_comparator(&array[i + 32], &key))
                i += 32;
    case 6:
        if (le_comparator(&array[i + 16], &key))
                i += 16;
    case 5:
        if (le_comparator(&array[i + 8], &key))
                i += 8;
    case 4:
        if (le_comparator(&array[i + 4], &key))
                i += 4;
    case 3:
        if (le_comparator(&array[i + 2], &key))
                i += 2;
    case 2:
        if (le_comparator(&array[i + 1], &key))
                i += 1;
    case 1:
        break;
    case 0:
        break;
    default:
            base = array;
            while (array_size > 1) {
                int half = array_size / 2;
                array_size -= half;
                base += (inline_comparator(&base[half - 1], &key) < 0) * half;
            }
            int comp = inline_comparator(base, &key);
            return (comp == 0) ? (base - array) + (comp < 0) : -1;
    }

    int comp = inline_comparator(&array[i], &key);
    return (__builtin_expect_with_probability(comp == 0, 1, 0.5)) ? i + (comp < 0) : -1;
}

The second loop iteration (for 1024 elements) when compiled with GCC 12.2 is:

.L56:
        leal    256(%rdx), %ecx
        cmpl    %esi, (%rdi,%rcx,4)
        cmovle  %ecx, %edx

Unfortunately, the final iteration has a branch, yet somehow GCC's output on this is the best that I have tested so far:

.L64:
        leal    1(%rdx), %ecx
        movq    %rcx, %rax
        movl    (%rdi,%rcx,4), %ecx
        cmpl    %esi, %ecx
        jle     .L66
.L76:
        movl    %edx, %eax
        movl    (%rdi,%rax,4), %ecx
        movl    %edx, %eax

For comparison, here is the second iteration from LLVM/Clang's output:

.LBB5_7:
        leal    256(%rcx), %eax
        cmpl    %edx, (%rdi,%rax,4)
        cmovgl  %ecx, %eax
        movl    %eax, %ecx

Here are some micro-benchmark data comparing GCC 12.2's output for the current code, the code in this PR and the experimental version:

Benchmark: array size: 1024, runs: 1000, repetitions: 10000, seed: 1685119292, density: 10

Even distribution with 1024 32 bit integers, random access

|                                               Name |      Items |       Hits |     Misses |       Time |
|                                         ---------- | ---------- | ---------- | ---------- | ---------- |
|                              current_binary_search |       1024 |        980 |       9020 |   0.000701 |
|                        custom_binary_search_unroll |       1024 |        980 |       9020 |   0.000385 |
|              custom_binary_search_unroll_switch_v3 |       1024 |        980 |       9020 |   0.000181 |

Here is data from Clang 16:

Benchmark: array size: 1024, runs: 1000, repetitions: 10000, seed: 1685119349, density: 10

Even distribution with 1024 32 bit integers, random access

|                                               Name |      Items |       Hits |     Misses |       Time |
|                                         ---------- | ---------- | ---------- | ---------- | ---------- |
|                              current_binary_search |       1024 |        986 |       9014 |   0.000728 |
|                        custom_binary_search_unroll |       1024 |        986 |       9014 |   0.000209 |
|              custom_binary_search_unroll_switch_v3 |       1024 |        986 |       9014 |   0.000570 |

That does not look right. After debugging it, I discovered that LLVM emitted movb+subb instead of movl+subl before the switch statement to save 3 bytes on the mov instruction, but movb is extremely expensive on Zen 3 (and, as it turns out, on other recent amd64 processors too). I manually patched the assembly output from LLVM to undo the space-saving optimization:

diff --git a/out.s b/out.s
index f42f44c..24d6940 100644
--- a/out.s
+++ b/out.s
@@ -481,8 +481,8 @@ custom_binary_search_unroll_switch_v3:  # @custom_binary_search_unroll_switch_v3
        bsrl    %eax, %eax
        movl    %eax, %r8d
        xorl    $31, %r8d
-       movb    $30, %cl
-       subb    %r8b, %cl
+       movl    $30, %ecx
+       subl    %r8d, %ecx
        movl    $1, %r8d
        shll    %cl, %r8d
        movl    %esi, %r9d

Now performance is close to what we had from GCC, despite the extra instruction for each case statement:

Benchmark: array size: 1024, runs: 1000, repetitions: 10000, seed: 1685119464, density: 10

Even distribution with 1024 32 bit integers, random access

|                                               Name |      Items |       Hits |     Misses |       Time |
|                                         ---------- | ---------- | ---------- | ---------- | ---------- |
|                              current_binary_search |       1024 |       1037 |       8963 |   0.000660 |
|                        custom_binary_search_unroll |       1024 |       1037 |       8963 |   0.000194 |
|              custom_binary_search_unroll_switch_v3 |       1024 |       1037 |       8963 |   0.000187 |

Additional testing confirmed that loading a single intermediate with movb instead of movl tripled the runtime of the entire binary search function. I could not find any references to this in literature. llvm-mca did not show any issue and Agner Fog's tables did not suggest a problem. I filed llvm/llvm-project#62948 regarding the performance hit.

The first reply to that issue explains that recent processors do not do register renaming for movb: since movb preserves the upper bits of the destination register, it creates a dependency on the register's previous value. Register renaming is an extremely important technique for making AMD/Intel's CISC-frontend/RISC-backend designs performant, so it makes sense that movb is slow because of the lack of register renaming on modern x86 processors.

@behlendorf merged commit 677c6f8 into openzfs:master May 26, 2023
andrewc12 pushed a commit to andrewc12/openzfs that referenced this pull request Jun 27, 2023
This implements a binary search algorithm for B-Trees that reduces
branching to the absolute minimum necessary for a binary search
algorithm. It also enables the compiler to inline the comparator to
ensure that the only slowdown when doing binary search is from waiting
for memory accesses. Additionally, it instructs the compiler to unroll
the loop, which gives an additional 40% improvement with Clang and an 8%
improvement with GCC.

Consumers must opt into using the faster algorithm. At present, only
B-Trees used inside kernel code have been modified to use the faster
algorithm.

Micro-benchmarks suggest that this can improve binary search performance
by up to 3.5 times when compiling with Clang 16 and up to 1.9 times when
compiling with GCC 12.2.

Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Closes openzfs#14866
andrewc12 added a commit to andrewc12/openzfs that referenced this pull request Jun 28, 2023
1
Squashed commit of the following:

commit 1e255365c9bf0e7858561d527c0ebdf8f90bc925
Author: Alexander Motin <mav@FreeBSD.org>
Date:   Tue Jun 27 20:03:37 2023 -0400

    ZIL: Fix another use-after-free.

    lwb->lwb_issued_txg can not be accessed after lwb_state is set to
    LWB_STATE_FLUSH_DONE and zl_lock is dropped, since the lwb may be
    freed by zil_sync().  We must save the txg number before that.

    This is similar to the 55b1842f92, but as I see the bug is not new.
    It existed for quite a while, just was not triggered due to smaller
    race window.

    Reviewed-by: Allan Jude <allan@klarasystems.com>
    Reviewed-by: Brian Atkinson <batkinson@lanl.gov>
    Signed-off-by:	Alexander Motin <mav@FreeBSD.org>
    Sponsored by:	iXsystems, Inc.
    Closes #14988
    Closes #14999

commit 233893e7cb7a98895061100ef8363f0ac30204b5
Author: Alexander Motin <mav@FreeBSD.org>
Date:   Tue Jun 27 20:00:30 2023 -0400

    Use big transactions for small recordsize writes.

    When ZFS appends files in chunks bigger than recordsize, it borrows a
    buffer from ARC and fills it before opening a transaction.  This is
    supposed to help in case of page faults, to not hold a transaction open
    indefinitely.  The problem appears when recordsize is set lower than the
    default 128KB. Since each block is committed in a separate transaction,
    per-transaction overhead becomes significant, and what is even worse,
    active use of per-dataset and per-pool locks to protect space use
    accounting for each transaction badly hurts the code's SMP scalability.
    The same transaction size limitation applies in case of file rewrite,
    but without even excuse of buffer borrowing.

    To address the issue, disable the borrowing mechanism if recordsize
    is smaller than default and the write request is 4x bigger than it.
    In such case writes up to 32MB are executed in single transaction,
    that dramatically reduces overhead and lock contention.  Since the
    borrowing mechanism is not used for file rewrites, and it was never
    used by zvols, which seem to work fine, I don't think this change
    should create significant problems, partially because in addition to
    the borrowing mechanism there are also used pre-faults.

    My tests with 4/8 threads writing several files same time on datasets
    with 32KB recordsize in 1MB requests show reduction of CPU usage by
    the user threads by 25-35%.  I would measure it in GB/s, but at that
    block size we are now limited by the lock contention of single write
    issue taskqueue, which is a separate problem we are going to work on.

    Reviewed-by: Brian Atkinson <batkinson@lanl.gov>
    Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
    Signed-off-by:	Alexander Motin <mav@FreeBSD.org>
    Sponsored by:	iXsystems, Inc.
    Closes #14964

commit aea27422747921798a9b9e1b8e0f6230d5672ba5
Author: Laevos <5572812+Laevos@users.noreply.github.com>
Date:   Tue Jun 27 16:58:32 2023 -0700

    Remove unnecessary commas in zpool-create.8

    Reviewed-by: Brian Atkinson <batkinson@lanl.gov>
    Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
    Signed-off-by: Laevos <5572812+Laevos@users.noreply.github.com>
    Closes #15011

commit 38a821c0d8f6bb51a866354e76078abf6a6ba1fc
Author: Alexander Motin <mav@FreeBSD.org>
Date:   Tue Jun 27 12:09:48 2023 -0400

    Another set of vdev queue optimizations.

    Switch FIFO queues (SYNC/TRIM) and active queue of vdev queue from
    time-sorted AVL-trees to simple lists.  AVL-trees are too expensive
    for such a simple task.  To change I/O priority without searching
    through the trees, add io_queue_state field to struct zio.

    To not check number of queued I/Os for each priority add vq_cqueued
    bitmap to struct vdev_queue.  Update it when adding/removing I/Os.
    Make vq_cactive a separate array instead of struct vdev_queue_class
    member.  Together those allow to avoid lots of cache misses when
    looking for work in vdev_queue_class_to_issue().

    Introduce deadline of ~0.5s for LBA-sorted queues.  Before this I
    saw some I/Os waiting in a queue for up to 8 seconds and possibly
    more due to starvation.  With this change I no longer see it.  I
    had to slightly more complicate the comparison function, but since
    it uses all the same cache lines the difference is minimal.  For a
    sequential I/Os the new code in vdev_queue_io_to_issue() actually
    often uses more simple avl_first(), falling back to avl_find() and
    avl_nearest() only when needed.

    Arrange members in struct zio to access only one cache line when
    searching through vdev queues.  While there, remove io_alloc_node,
    reusing the io_queue_node instead.  Those two are never used same
    time.

    Remove zfs_vdev_aggregate_trim parameter.  It was disabled for 4
    years since implemented, while still wasted time maintaining the
    offset-sorted tree of TRIM requests.  Just remove the tree.

    Remove locking from txg_all_lists_empty().  It is racy by design,
    while 2 pair of locks/unlocks take noticeable time under the vdev
    queue lock.

    With these changes in my tests with volblocksize=4KB I measure vdev
    queue lock spin time reduction by 50% on read and 75% on write.

    Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
    Signed-off-by:	Alexander Motin <mav@FreeBSD.org>
    Sponsored by:	iXsystems, Inc.
    Closes #14925

commit 1737e75ab4e09a2d20e7cc64fa83dae047a302e9
Author: Rich Ercolani <214141+rincebrain@users.noreply.github.com>
Date:   Mon Jun 26 16:57:12 2023 -0400

    Add a delay to tearing down threads.

    It's been observed that in certain workloads (zvol-related being a
    big one), ZFS will end up spending a large amount of time spinning
    up taskqs only to tear them down again almost immediately, then
    spin them up again...

    I noticed this when I looked at what my mostly-idle system was doing
    and wondered how on earth taskq creation/destruction was taking a bunch of time...

    So I added a configurable delay to avoid it tearing down tasks the
    first time it notices them idle, and the total number of threads at
    steady state went up, but the amount of time being burned just
    tearing down/turning up new ones almost vanished.

    Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
    Signed-off-by: Rich Ercolani <rincebrain@gmail.com>
    Closes #14938

commit 68b8e2ffab23cba6ae87f18c59b044c833934f2f
Author: Alexander Motin <mav@FreeBSD.org>
Date:   Sat Jun 17 22:51:37 2023 -0400

    Fix memory leak in zil_parse().

    482da24e2 missed arc_buf_destroy() calls on log parse errors, possibly
    leaking up to 128KB of memory per dataset during ZIL replay.

    Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
    Reviewed-by: Paul Dagnelie <pcd@delphix.com>
    Signed-off-by:	Alexander Motin <mav@FreeBSD.org>
    Sponsored by:	iXsystems, Inc.
    Closes #14987

commit ea0d03a8bd040e438bcaa43b8e449cbf717e14f3
Author: George Amanakis <gamanakis@gmail.com>
Date:   Thu Jun 15 21:45:36 2023 +0200

    Shorten arcstat_quiescence sleep time

    With the latest L2ARC fixes, 2 seconds is too long to wait for
    quiescence of arcstats like l2_size. Shorten this interval to avoid
    having the persistent L2ARC tests in ZTS prematurely terminated.

    Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
    Signed-off-by: George Amanakis <gamanakis@gmail.com>
    Closes #14981

commit 3fa141285b8105b3cc11c1296b77ad6d24250f2c
Author: Alexander Motin <mav@FreeBSD.org>
Date:   Thu Jun 15 13:49:03 2023 -0400

    Remove ARC/ZIO physdone callbacks.

    Those callbacks were introduced many years ago as part of a bigger
    patch to smoothen the write throttling within a txg. They allow to
    account completion of individual physical writes within a logical
    one, improving cases when some of physical writes complete much
    sooner than others, gradually opening the write throttle.

    Few years after that ZFS got allocation throttling, working on a
    level of logical writes and limiting number of writes queued to
    vdevs at any point, and so limiting latency distribution between
    the physical writes and especially writes of multiple copies.
    The addition of scheduling deadline I proposed in #14925 should
    further reduce the latency distribution.  Grown memory sizes over
    the past 10 years should also reduce importance of the smoothing.

    While the use of physdone callback may still in theory provide
    some smoother throttling, there are cases where we simply can not
    afford it.  Since dirty data accounting is protected by pool-wide
    lock, in case of 6-wide RAIDZ, for example, it requires us to take
    it 8 times per logical block write, creating huge lock contention.

    My tests of this patch show radical reduction of the lock spinning
    time on workloads when smaller blocks are written to RAIDZ pools,
    when each of the disks receives 8-16KB chunks, but the total rate
    reaching 100K+ blocks per second.  Same time attempts to measure
    any write time fluctuations didn't show anything noticeable.

    While there, remove also io_child_count/io_parent_count counters.
    They are used only for couple assertions that can be avoided.

    Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
    Signed-off-by: Alexander Motin <mav@FreeBSD.org>
    Sponsored by:	iXsystems, Inc.
    Closes #14948

commit 9efc735904d194987f06870f355e08d94e39ab81
Author: Brian Behlendorf <behlendorf1@llnl.gov>
Date:   Wed Jun 14 10:04:05 2023 -0500

    ZTS: Skip send_raw_ashift on FreeBSD

    On FreeBSD 14 this test runs slowly in the CI environment
    and is killed by the 10 minute timeout.  Skip the test on
    FreeBSD until the slow down is resolved.

    Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
    Issue #14961

commit 9c54894bfc77f585806984f44c70a839543e6715
Author: Alexander Motin <mav@FreeBSD.org>
Date:   Wed Jun 14 11:02:27 2023 -0400

    Switch refcount tracking from lists to AVL-trees.

    With large number of tracked references list searches under the lock
    become too expensive, creating enormous lock contention.

    On my tests with ZFS_DEBUG enabled this increases write throughput
    with 32KB blocks from ~1.2GB/s to ~7.5GB/s.

    Reviewed-by: Brian Atkinson <batkinson@lanl.gov>
    Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
    Signed-off-by: Alexander Motin <mav@FreeBSD.org>
    Sponsored by:	iXsystems, Inc.
    Closes #14970

commit 4e62540827a6ed15e08b2a627896d24bc661fa38
Author: George Amanakis <gamanakis@gmail.com>
Date:   Wed Jun 14 17:01:17 2023 +0200

    Store the L2ARC device ashift in the vdev label

    If this is not done, and the pool has an ashift other than the
    default (at the moment 9), then the following happens:

    1) vdev_alloc() assigns the ashift of the pool to the L2ARC device,
       but upon export it is not stored anywhere
    2) at the first import, vdev_open() sees a vdev_ashift() of 0 and
       assigns the logical_ashift, which is 9
    3) reading the contents of the L2ARC, including the header, fails
    4) L2ARC buffers are not restored in the ARC.

    Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
    Signed-off-by: George Amanakis <gamanakis@gmail.com>
    Closes #14313
    Closes #14963

commit adaa3e64ea46f21cc5f544228c48363977b7733e
Author: George Amanakis <gamanakis@gmail.com>
Date:   Sat Jun 10 02:05:47 2023 +0200

    Fix the L2ARC write size calculating logic (2)

    While commit bcd5321 adjusts the write size based on the size of the log
    block, this happens after comparing the unadjusted write size to the
    evicted (target) size.

    In this case l2ad_hand will exceed l2ad_evict and violate an assertion
    at the end of l2arc_write_buffers().

    Fix this by adding the max log block size to the allocated size of the
    buffer to be committed before comparing the result to the target
    size.

    Also reset the l2arc_trim_ahead ZFS module variable when the adjusted
    write size exceeds the size of the L2ARC device.

    Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
    Signed-off-by: George Amanakis <gamanakis@gmail.com>
    Closes #14936
    Closes #14954

commit 67118a7d6e74a6e818127096162478017610d13e
Author: Andrew Innes <andrew.c12@gmail.com>
Date:   Wed Jun 28 12:31:10 2023 +0800

    Windows: Finally drop long disabled vdev cache.

    Signed-off-by: Andrew Innes <andrew.c12@gmail.com>

commit 5d80c98c28c931339138753a4e4c1156dbf951f4
Author: Alexander Motin <mav@FreeBSD.org>
Date:   Fri Jun 9 15:40:55 2023 -0400

    Finally drop long disabled vdev cache.

    It was a vdev-level read cache, designed to aggregate many small
    reads by speculatively issuing bigger reads instead and caching
    the result.  But since it has almost no idea about what is going
    on, with the exception of the ZIO_FLAG_DONT_CACHE flag set by
    higher layers, it was found to do more harm than good, for which
    reason it was disabled for the past 12 years.  These days we have
    much better instruments to enlarge the I/Os, such as speculative
    and prescient prefetches, the I/O scheduler, I/O aggregation, etc.

    Besides the dead code removal, this removes one extra mutex
    lock/unlock per write inside vdev_cache_write(), which was not
    otherwise disabled and was still trying to do some work.

    Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
    Signed-off-by:	Alexander Motin <mav@FreeBSD.org>
    Sponsored by:	iXsystems, Inc.
    Closes #14953

commit 1f1ab33781b5736654b988e2e618ea79788fa1f7
Author: Brian Behlendorf <behlendorf1@llnl.gov>
Date:   Fri Jun 9 11:10:01 2023 -0700

    ZTS: Skip checkpoint_discard_busy

    Until the ASSERT which is occasionally hit while running
    checkpoint_discard_busy is resolved skip this test case.

    Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
    Issue #12053
    Closes #14952

commit b94049c2cbedbbe2af8e629bf974a6ed93f11acb
Author: Alexander Motin <mav@FreeBSD.org>
Date:   Fri Jun 9 13:14:05 2023 -0400

    Improve l2arc reporting in arc_summary.

    - Do not report L2ARC as FAULTED in the presence of in-flight writes.
    - Report read and write I/Os, bytes and errors.
    - Remove a few numbers not important to the average user.

    Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
    Signed-off-by:	Alexander Motin <mav@FreeBSD.org>
    Sponsored by:	iXsystems, Inc.
    Closes #12304
    Closes #14946

commit 31044b5cfb6f91d376034c4d6374f61baaf03232
Author: Andrew Innes <andrew.c12@gmail.com>
Date:   Wed Jun 28 12:00:39 2023 +0800

    Windows: Use list_remove_head() where possible.

    Signed-off-by: Andrew Innes <andrew.c12@gmail.com>

commit 32eda54d0d75a94b6aa71dc80aa958095feb8011
Author: Alexander Motin <mav@FreeBSD.org>
Date:   Fri Jun 9 13:12:52 2023 -0400

    Use list_remove_head() where possible.

    ... instead of list_head() + list_remove().  On FreeBSD the list
    functions are not inlined, so in addition to more compact code
    this also saves another function call.

    Reviewed-by: Brian Atkinson <batkinson@lanl.gov>
    Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
    Signed-off-by:	Alexander Motin <mav@FreeBSD.org>
    Sponsored by:	iXsystems, Inc.
    Closes #14955
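
For illustration, here is a minimal sketch of the before/after pattern this commit describes, assuming the list_t API from the ZFS/SPL <sys/list.h> header; the element type and the drain functions are hypothetical, not code from the commit.

    #include <sys/list.h>
    #include <stdlib.h>

    typedef struct item {
            list_node_t i_node;     /* linkage used by list_t */
            int i_value;
    } item_t;

    static void
    drain_old(list_t *l)    /* hypothetical example */
    {
            item_t *it;

            /* Before: two list calls per element. */
            while ((it = list_head(l)) != NULL) {
                    list_remove(l, it);
                    free(it);
            }
    }

    static void
    drain_new(list_t *l)    /* hypothetical example */
    {
            item_t *it;

            /* After: list_remove_head() removes and returns the head. */
            while ((it = list_remove_head(l)) != NULL)
                    free(it);
    }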

commit fe7693a3f87229d1ae93b5ce2bb84d8bb86a9f5c
Author: Alexander Motin <mav@FreeBSD.org>
Date:   Fri Jun 9 13:08:05 2023 -0400

    ZIL: Fix race introduced by f63811f0721.

    We are not allowed to access lwb after setting LWB_STATE_FLUSH_DONE
    state and dropping zl_lock, since it may be freed by zil_sync().
    To free itxs and waiters after dropping the lock we need to move
    lwb_itxs and lwb_waiters lists elements to local storage.

    Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
    Signed-off-by:	Alexander Motin <mav@FreeBSD.org>
    Sponsored by:	iXsystems, Inc.
    Closes #14957
    Closes #14959

commit 44c5a0c92f98e8c21221bd7051729d1947a10736
Author: Rich Ercolani <214141+rincebrain@users.noreply.github.com>
Date:   Wed Jun 7 14:14:05 2023 -0400

    Revert "systemd: Use non-absolute paths in Exec* lines"

    This reverts commit 79b20949b25c8db4d379f6486b0835a6613b480c since it
    doesn't work with the systemd version shipped with RHEL7-based systems.

    Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
    Signed-off-by: Rich Ercolani <rincebrain@gmail.com>
    Closes #14943
    Closes #14945

commit ba5af00257eb4eb3363f297819a21c4da811392f
Author: Brian Behlendorf <behlendorf1@llnl.gov>
Date:   Wed Jun 7 10:43:43 2023 -0700

    Linux: Never sleep in kmem_cache_alloc(..., KM_NOSLEEP) (#14926)

    When a kmem cache is exhausted and needs to be expanded a new
    slab is allocated.  KM_SLEEP callers can block and wait for the
    allocation, but KM_NOSLEEP callers were incorrectly allowed to
    block as well.

    Resolve this by attempting an emergency allocation as a best
    effort.  This may fail but that's fine since any KM_NOSLEEP
    consumer is required to handle an allocation failure.

    Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
    Reviewed-by: Adam Moss <c@yotes.com>
    Reviewed-by: Brian Atkinson <batkinson@lanl.gov>
    Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
    Reviewed-by: Tony Hutter <hutter2@llnl.gov>

commit d4ecd4efde1692641d1d0b89851e7a15e90632f8
Author: George Amanakis <gamanakis@gmail.com>
Date:   Tue Jun 6 21:32:37 2023 +0200

    Fix the L2ARC write size calculating logic

    l2arc_write_size() should return the write size after adjusting for trim
    and overhead of the L2ARC log blocks. Also take into account the
    allocated size of log blocks when deciding when to stop writing buffers
    to L2ARC.

    Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
    Signed-off-by: George Amanakis <gamanakis@gmail.com>
    Closes #14939

commit 8692ab174e18faf444681d67d7ea4418600553cc
Author: Rob Norris <rob.norris@klarasystems.com>
Date:   Wed Mar 15 18:18:10 2023 +1100

    zdb: add -B option to generate backup stream

    This is more-or-less like `zfs send`, but specifying the snapshot by its
    objset id for situations where it can't be referenced any other way.

    Sponsored-By: Klara, Inc.
    Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de>
    Reviewed-by: WHR <msl0000023508@gmail.com>
    Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
    Closes #14642

commit df84ca3f3bf9f265ebc76de17394df529fd07af6
Author: Andrew Innes <andrew.c12@gmail.com>
Date:   Wed Jun 28 11:05:55 2023 +0800

    Windows: znode: expose zfs_get_zplprop to libzpool

    Signed-off-by: Andrew Innes <andrew.c12@gmail.com>

commit 944c58247a13a92c9e4ffb2c0a9e6b6293dca37e
Author: Rob Norris <rob.norris@klarasystems.com>
Date:   Sun Jun 4 11:14:20 2023 +1000

    znode: expose zfs_get_zplprop to libzpool

    There's no particular reason this function should be kernel-only, and I
    want to use it (indirectly) from zdb. I've moved it to zfs_znode.c
    because libzpool does not compile in zfs_vfsops.c, and this at least
    matches the header it's imported from.

    Sponsored-By: Klara, Inc.
    Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de>
    Reviewed-by: WHR <msl0000023508@gmail.com>
    Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
    Closes #14642

commit 429f58cdbb195c8d50ed95c7309ee54d37526b70
Author: Alexander Motin <mav@FreeBSD.org>
Date:   Mon Jun 5 14:51:44 2023 -0400

    Introduce zfs_refcount_(add|remove)_few().

    There are two places where we need to add/remove several references
    with the semantics of zfs_refcount_(add|remove).  But when
    debug/tracing is disabled, it is a crime to run multiple
    atomic_inc() calls in a loop, especially under a congested
    pool-wide allocator lock.

    The newly introduced functions implement the same semantics as the
    loop, but without the overhead in production builds.

    Reviewed-by: Rich Ercolani <rincebrain@gmail.com>
    Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
    Signed-off-by:	Alexander Motin <mav@FreeBSD.org>
    Sponsored by:	iXsystems, Inc.
    Closes #14934
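
A minimal sketch of what such helpers can look like when tracking is compiled out; the names and the struct are illustrative only, not the in-tree definitions (the real ones live behind the ZFS_DEBUG machinery in zfs_refcount.h), and compiler atomics stand in for the SPL ones.

    #include <stdint.h>

    typedef struct refcount_sketch {
            volatile uint64_t rc_count;
    } refcount_sketch_t;

    /* With tracking disabled, N references collapse to one atomic op. */
    static inline void
    refcount_add_few_sketch(refcount_sketch_t *rc, uint64_t n)
    {
            __atomic_add_fetch(&rc->rc_count, n, __ATOMIC_SEQ_CST);
    }

    static inline void
    refcount_remove_few_sketch(refcount_sketch_t *rc, uint64_t n)
    {
            __atomic_sub_fetch(&rc->rc_count, n, __ATOMIC_SEQ_CST);
    }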

commit 077c2f359feb69a13bee37ac4220d271d1c7bf27
Author: Brian Behlendorf <behlendorf1@llnl.gov>
Date:   Mon Jun 5 11:08:24 2023 -0700

    Linux 6.3 compat: META (#14930)

    Update the META file to reflect compatibility with the 6.3 kernel.

    Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
    Reviewed-by: Tony Hutter <hutter2@llnl.gov>

commit c2fcd6e484107fc7435087771757e88ba84f6093
Author: Graham Perrin <grahamperrin@gmail.com>
Date:   Fri Jun 2 19:25:13 2023 +0100

    zfs-create(8): ZFS for swap: caution, clarity

    Make the section heading more generic (the section relates to ZFS files
    as well as ZFS volumes).

    Swapping to a ZFS volume is prone to deadlock. Remove the related
    instruction, direct readers to OpenZFS FAQ. Related, but not linked
    from within the manual page:

    <https://openzfs.github.io/openzfs-docs/Project%20and%20Community/FAQ.html#using-a-zvol-for-a-swap-device-on-linux>
    (Using a zvol for a swap device on Linux).

    Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
    Signed-off-by: Graham Perrin <grahamperrin@freebsd.org>
    Issue #7734
    Closes #14756

commit 251dbe83e14085a26100aa894d79772cbb69dcda
Author: Alexander Motin <mav@FreeBSD.org>
Date:   Fri Jun 2 14:01:58 2023 -0400

    ZIL: Allow to replay blocks of any size.

    There seems to be no reason for ZIL blocks to be limited to 128KB
    other than that the replay code is written in such a way.  This
    change does not increase the limit yet, just removes the artificial
    limitation.

    The avoided extra memcpy() may save us a second during replay.

    Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
    Reviewed-by: Prakash Surya <prakash.surya@delphix.com>
    Signed-off-by:	Alexander Motin <mav@FreeBSD.org>
    Sponsored by:	iXsystems, Inc.
    Closes #14910

commit 76170249d538965655dbd3206cd59566b1d3944b
Author: Val Packett <val@packett.cool>
Date:   Thu May 11 18:16:57 2023 -0300

    PAM: enable testing on FreeBSD

    Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
    Reviewed-by: Felix Dörre <felix@dogcraft.de>
    Signed-off-by: Val Packett <val@packett.cool>
    Closes #14834

commit d1b68a45441cae8c399a8a3ed60b29726ed031ff
Author: Val Packett <val@packett.cool>
Date:   Fri May 5 22:17:12 2023 -0300

    PAM: support password changes even when not mounted

    There's usually no requirement that a user be logged in for changing
    their password, so let's not be surprising here.

    We need to use the fetch_lazy mechanism for the old password to avoid
    a double prompt for it, so that mechanism is now generalized a bit.

    Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
    Reviewed-by: Felix Dörre <felix@dogcraft.de>
    Signed-off-by: Val Packett <val@packett.cool>
    Closes #14834

commit 7424feff72f1e17ea27bcfe0d36cabce7c732eea
Author: Val Packett <val@packett.cool>
Date:   Fri May 5 22:34:58 2023 -0300

    PAM: add 'uid_min' and 'uid_max' options for changing the uid range

    Instead of a fixed >=1000 check, allow the configuration to override
    the minimum UID and add a maximum one as well. While here, add the
    uid range check to the authenticate method as well, and fix the return
    in the chauthtok method (seems very wrong to report success when we've
    done absolutely nothing).

    Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
    Reviewed-by: Felix Dörre <felix@dogcraft.de>
    Signed-off-by: Val Packett <val@packett.cool>
    Closes #14834

commit fc9e012f5fc7e7997acee2b6d8d759622b319f0e
Author: Val Packett <val@packett.cool>
Date:   Fri May 5 22:02:13 2023 -0300

    PAM: add 'forceunmount' flag

    Probably not always a good idea, but it's nice to have the option.
    It is a workaround for FreeBSD calling the PAM session end earlier
    than the last process is actually done touching the mount, for
    example.

    Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
    Reviewed-by: Felix Dörre <felix@dogcraft.de>
    Signed-off-by: Val Packett <val@packett.cool>
    Closes #14834

commit a39ed83bd31cc0c8c98dc3c4cc3d11b03d9af620
Author: Val Packett <val@packett.cool>
Date:   Fri May 5 19:35:57 2023 -0300

    PAM: add 'recursive_homes' flag to use with 'prop_mountpoint'

    It's not always desirable to have a fixed flat homes directory.
    With the 'recursive_homes' flag, 'prop_mountpoint' search would
    traverse the whole tree starting at 'homes' (which can now be '*'
    to mean all pools) to find a dataset with a mountpoint matching
    the home directory.

    Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
    Reviewed-by: Felix Dörre <felix@dogcraft.de>
    Signed-off-by: Val Packett <val@packett.cool>
    Closes #14834

commit 7f8d5ef815b7559fcc671ff2add33ba9c2a74867
Author: Val Packett <val@packett.cool>
Date:   Fri May 5 21:56:39 2023 -0300

    PAM: use boolean_t for config flags

    Since we already use boolean_t in the file, we can use it here.

    Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
    Reviewed-by: Felix Dörre <felix@dogcraft.de>
    Signed-off-by: Val Packett <val@packett.cool>
    Closes #14834

commit e2872932c85189f06a68f0ad10bd8eb6895d79c2
Author: Val Packett <val@packett.cool>
Date:   Fri May 5 20:00:48 2023 -0300

    PAM: do not fail to mount if the key's already loaded

    If we're expecting a working home directory on login, it would be
    rather frustrating to not have it mounted just because it e.g. failed to
    unmount once on logout.

    Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
    Reviewed-by: Felix Dörre <felix@dogcraft.de>
    Signed-off-by: Val Packett <val@packett.cool>
    Closes #14834

commit b897137e2044c3ef6120820f753d940b7dfb58be
Author: Rich Ercolani <214141+rincebrain@users.noreply.github.com>
Date:   Wed May 31 19:58:41 2023 -0400

    Revert "initramfs: use `mount.zfs` instead of `mount`"

    This broke mounting of snapshots on / for users.

    See https://github.com/openzfs/zfs/issues/9461#issuecomment-1376162949 for more context.

    Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
    Signed-off-by: Rich Ercolani <rincebrain@gmail.com>
    Closes #14908

commit 10cde4f8f60d4d55887d7122a5742e6e4f90280c
Author: Luís Henriques <73643340+lumigch@users.noreply.github.com>
Date:   Tue May 30 23:15:24 2023 +0100

    Fix NULL pointer dereference when doing concurrent 'send' operations

    A NULL pointer dereference will occur when doing a 'zfs send -S' on
    a dataset that is still being received.  The problem is that the new
    'send' will rightfully fail to own the datasets (i.e.
    dsl_dataset_own_force() will fail), but then dmu_send() will still
    do the dsl_dataset_disown().

    Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
    Signed-off-by: Luís Henriques <henrix@camandro.org>
    Closes #14903
    Closes #14890

commit 12452d79a3fd29af1dc0b95f3e367e3ce339702b
Author: Brian Behlendorf <behlendorf1@llnl.gov>
Date:   Mon May 29 12:55:35 2023 -0700

    ZTS: zvol_misc_trim disable blk mq

    Disable the zvol_misc_fua.ksh and zvol_misc_trim.ksh test cases on
    impacted kernels.  This issue is being actively worked on in #14872,
    and as part of that fix this commit will be reverted.

        VERIFY(zh->zh_claim_txg == 0) failed
        PANIC at zil.c:904:zil_create()

    Reviewed-by: Tony Hutter <hutter2@llnl.gov>
    Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
    Issue #14872
    Closes #14870

commit 803c04f233e60a2d23f0463f299eba96c0968602
Author: Richard Yao <richard.yao@alumni.stonybrook.edu>
Date:   Fri May 26 18:47:52 2023 -0400

    Use __attribute__((malloc)) on memory allocation functions

    This informs the C compiler that pointers returned from these
    functions do not alias any other pointer, which allows it to do
    better code optimization and should make the compiled code smaller.

    References:
    https://stackoverflow.com/a/53654773
    https://gcc.gnu.org/onlinedocs/gcc/Common-Function-Attributes.html#index-malloc-function-attribute
    https://clang.llvm.org/docs/AttributeReference.html#malloc

    Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
    Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
    Closes #14827

commit 64d8bbe15f77876ae9639b9971a743776a41bf9a
Author: Brian Behlendorf <behlendorf1@llnl.gov>
Date:   Fri May 26 15:39:23 2023 -0700

    ZTS: Add zpool_resilver_concurrent exception

    The zpool_resilver_concurrent test case requires the ZED, which is
    not used on FreeBSD.  Add this test to the known list of skipped
    tests for FreeBSD.

    Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
    Closes #14904

commit e396d30d29ed131194605222e6ba1fec1ef8b2ca
Author: Mike Swanson <mikeonthecomputer@gmail.com>
Date:   Fri May 26 15:37:15 2023 -0700

    Add compatibility symlinks for FreeBSD 12.{3,4} and 13.{0,1,2}

    Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
    Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
    Signed-off-by: Mike Swanson <mikeonthecomputer@gmail.com>
    Closes #14902

commit f6dd0b8c1cc41707d299b7123f80912f43d03340
Author: Colm <colm@tuatha.org>
Date:   Fri May 26 10:04:19 2023 -0700

    Adding new read-only compatible zpool features to compatibility.d/grub2

    GRUB2 is compatible with all "read-only compatible" features,
    so it is safe to add new features of this type to the grub2
    compatibility list. We generally want to include all compatible
    features, to minimize the differences between grub2-compatible
    pools and no-compatibility pools.

    Adding the new features `livelist` and `zpool_checkpoint` accordingly.

    Also adding them to the man page which references this file as an
    example, for consistency.

    Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
    Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
    Signed-off-by: Colm Buckley <colm@tuatha.org>
    Closes #14893

commit 013d3a1e0e00d83dabe70837b23dab48c1bac592
Author: Richard Yao <richard.yao@alumni.stonybrook.edu>
Date:   Fri May 26 13:03:12 2023 -0400

    btree: Implement faster binary search algorithm

    This implements a binary search algorithm for B-Trees that reduces
    branching to the absolute minimum necessary for a binary search
    algorithm. It also enables the compiler to inline the comparator to
    ensure that the only slowdown when doing binary search is from waiting
    for memory accesses. Additionally, it instructs the compiler to unroll
    the loop, which gives an additional 40% improvement with Clang and 8%
    improvement with GCC.

    Consumers must opt into using the faster algorithm. At present, only
    B-Trees used inside kernel code have been modified to use the faster
    algorithm.

    Micro-benchmarks suggest that this can improve binary search performance
    by up to 3.5 times when compiling with Clang 16 and up to 1.9 times when
    compiling with GCC 12.2.

    Reviewed-by: Alexander Motin <mav@FreeBSD.org>
    Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
    Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
    Closes #14866
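
To make the idea concrete, here is a small self-contained sketch of a branchless binary search (lower bound) over a sorted int array. This is not the in-tree code, which generates a search function per comparator via a macro so the comparator inlines, but it shows the core trick: each step conditionally moves a base pointer, which compilers lower to a conditional move instead of a hard-to-predict branch.

    #include <stddef.h>

    /* Return the index of the first element >= key, or n if none. */
    static size_t
    branchless_lower_bound(const int *a, size_t n, int key)
    {
            const int *base = a;

            if (n == 0)
                    return (0);
            while (n > 1) {
                    size_t half = n / 2;
                    /* Conditional move, not a data-dependent branch. */
                    base = (base[half - 1] < key) ? base + half : base;
                    n -= half;
            }
            return ((size_t)(base - a) + (*base < key));
    }

In the tree the iteration count is bounded by the fixed leaf capacity, which is presumably what makes the unrolling the commit message mentions effective.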

commit 1854df330aa57cda39f076e8ab11e17ca3697bb8
Author: George Amanakis <gamanakis@gmail.com>
Date:   Fri May 26 18:53:00 2023 +0200

    Fix inconsistent definition of zfs_scrub_error_blocks_per_txg

    Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
    Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
    Signed-off-by: George Amanakis <gamanakis@gmail.com>
    Closes #14894

commit 8735e6ac03742fcf43adde3ce127af698a32c53a
Author: Damiano Albani <damiano.albani@gmail.com>
Date:   Fri May 26 01:10:54 2023 +0200

    Add missing files to Debian DKMS package

    Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de>
    Reviewed-by: Umer Saleem <usaleem@ixsystems.com>
    Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
    Signed-off-by: Damiano Albani <damiano.albani@gmail.com>
    Closes #14887
    Closes #14889

commit d439021bd05a5cc0bb271a5470abb67af2f7bcda
Author: Brian Behlendorf <behlendorf1@llnl.gov>
Date:   Thu May 25 13:53:08 2023 -0700

    Update compatibility.d files

    Add an openzfs-2.2 compatibility file for the next release.

    Edon-R support has been enabled for FreeBSD, removing the need
    for different FreeBSD and Linux files.  Symlinks for the -linux
    and -freebsd names are created for any scripts expecting that
    convention.

    Additionally, a symlink for ubuntu-22.04 was added.

    Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
    Closes #14833

commit da54d5f3f9576b958e3eadf4f4d8f68c91b3d6e4
Author: Alexander Motin <mav@FreeBSD.org>
Date:   Thu May 25 16:51:53 2023 -0400

    zil: Add some more statistics.

    In addition to the number of actual log bytes written, also account
    for the total bytes written including padding, and the total bytes
    allocated (bytes <= write <= alloc).  This should allow monitoring
    of ZIL traffic and space efficiency.

    Add dtrace probe for zil block size selection.

    Make zilstat report more information and fit it into less width.

    Reviewed-by: Ameer Hamza <ahamza@ixsystems.com>
    Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
    Signed-off-by:  Alexander Motin <mav@FreeBSD.org>
    Sponsored by:   iXsystems, Inc.
    Closes #14863

commit faa4955023d089668bd6c564c195a933d1eac455
Author: Alexander Motin <mav@FreeBSD.org>
Date:   Thu May 25 12:48:43 2023 -0400

    ZIL: Reduce scope of per-dataset zl_issuer_lock.

    Before this change the ZIL copied all log data while holding the
    lock.  It caused huge lock contention on workloads with many big
    parallel writes.  This change splits the process into two parts:
    first, zil_lwb_assign() estimates the log space needed for all
    transactions, and zil_lwb_write_close() allocates blocks and zios
    while holding the lock; then, after the lock is dropped,
    zil_lwb_commit() copies the data, and zil_lwb_write_issue() issues
    the I/Os.

    Also, while there, slightly reduce the scope of zl_lock.

    Reviewed-by: Paul Dagnelie <pcd@delphix.com>
    Reviewed-by: Prakash Surya <prakash.surya@delphix.com>
    Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
    Signed-off-by:  Alexander Motin <mav@FreeBSD.org>
    Sponsored by:   iXsystems, Inc.
    Closes #14841
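
Schematically, the change narrows the lock to the bookkeeping phases and moves the bulk copy and I/O issue outside it. The outline below is a hedged sketch under invented stand-in types and stub functions; the real functions operate on zilog_t/lwb_t and handle many more states.

    #include <pthread.h>

    typedef struct zilog_like {
            pthread_mutex_t zl_issuer_lock;
    } zilog_like_t;

    /* Stubs standing in for the four phases named in the commit. */
    static void lwb_assign_like(zilog_like_t *zl) { (void) zl; }
    static void lwb_write_close_like(zilog_like_t *zl) { (void) zl; }
    static void lwb_commit_like(zilog_like_t *zl) { (void) zl; }
    static void lwb_write_issue_like(zilog_like_t *zl) { (void) zl; }

    static void
    zil_commit_outline(zilog_like_t *zl)
    {
            /* Under the lock: estimate space, allocate blocks and zios. */
            pthread_mutex_lock(&zl->zl_issuer_lock);
            lwb_assign_like(zl);
            lwb_write_close_like(zl);
            pthread_mutex_unlock(&zl->zl_issuer_lock);

            /* Outside the lock: copy the log data and issue the I/Os. */
            lwb_commit_like(zl);
            lwb_write_issue_like(zl);
    }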

commit f77b9f7ae83834ade1da21cfc16b8a273df3acfc
Author: Dimitri John Ledkov <19779+xnox@users.noreply.github.com>
Date:   Wed May 24 20:31:28 2023 +0100

    systemd: Use non-absolute paths in Exec* lines

    Since systemd v239, Exec* binaries are resolved from PATH when they
    are not absolute.  Switch to this by default for ease of downstream
    maintenance.  Many downstream distributions move individual binaries
    to locations that existing compile-time configurations cannot
    accommodate.

    Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
    Signed-off-by: Dimitri John Ledkov <dimitri.ledkov@canonical.com>
    Closes #14880

commit 4bfb9d28cffd4dfeb4b91359b497d100f668bb34
Author: Akash B <akash-b@hpe.com>
Date:   Thu May 25 00:58:09 2023 +0530

    Fix concurrent resilvers initiated at same time

    For draid vdevs it was possible to initiate both the
    sequential and healing resilver at the same time.

    This fixes the following two scenarios.

    1) There's a window where a sequential rebuild can
       be started via ZED even if a healing resilver has been
       scheduled.
       - This is fixed by adding an additional check in
         spa_vdev_attach() for any scheduled resilver and returning
         the appropriate error code when a resilver is already in
         progress.

    2) It was possible for zpool clear to start a healing
       resilver when it wasn't needed at all. This occurs because
       during a vdev_open() the device is presumed to be healthy
       until it is validated by vdev_validate(), at which point it
       may be set unavailable. However, by this point an async
       resilver will have already been requested if the DTL isn't
       empty.
       - This is fixed by cancelling the SPA_ASYNC_RESILVER
         request immediately at the end of vdev_reopen() when a
         resilver is unneeded.

    Finally, added a test case in ZTS for verification.

    Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
    Reviewed-by: Dipak Ghosh <dipak.ghosh@hpe.com>
    Signed-off-by: Akash B <akash-b@hpe.com>
    Closes #14881
    Closes #14892

commit c9bb406d177a00aa1f0058d29aeb29e478223273
Author: youzhongyang <youzhong@gmail.com>
Date:   Wed May 24 15:23:42 2023 -0400

    Linux 6.4 compat: reclaimed_slab renamed to reclaimed

    Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
    Reviewed-by: Brian Atkinson <batkinson@lanl.gov>
    Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
    Signed-off-by: Youzhong Yang <yyang@mathworks.com>
    Closes #14891

commit 79e61a873b136f13fcf140beb925ceddc1f94767
Author: Brian Atkinson <batkinson@lanl.gov>
Date:   Fri May 19 16:05:53 2023 -0400

    Hold db_mtx when updating db_state

    Commit 555ef90 did some general code refactoring for
    dmu_buf_will_not_fill() and dmu_buf_will_fill(). However, the db_mtx
    was not held when updating db->db_state in those code blocks. The
    rest of the dbuf code always holds the db_mtx when updating
    db_state. This is important because cv_wait() on db_changed is used
    to check for db_state changes.

    Update dmu_buf_will_not_fill() and dmu_buf_will_fill() to hold the
    db_mtx when updating db_state.

    Reviewed-by: Alexander Motin <mav@FreeBSD.org>
    Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
    Signed-off-by: Brian Atkinson <batkinson@lanl.gov>
    Closes #14875

commit d7be0cdf93a568b6c9b4a4e15a88a5d88ebbb764
Author: Brian Behlendorf <behlendorf1@llnl.gov>
Date:   Fri May 19 13:05:09 2023 -0700

    Probe vdevs before marking removed

    Before allowing the ZED to mark a vdev as REMOVED due to a
    hotplug event, confirm that it is non-responsive with a probe.
    Any device which can be successfully probed should be left
    ONLINE to prevent a healthy pool from being incorrectly
    SUSPENDED.  This may occur for at least the following two
    scenarios.

    1) Drive expansion (zpool online -e) in VMware environments.
       If, during the partition resize operation, a partition is
       removed and re-created then udev will send a removed event.

    2) Re-scanning the namespaces of an NVMe device (nvme ns-rescan)
       may result in a udev remove and add event being delivered.

    Finally, update the ZED to only kick in a spare when the
    removal was successful.

    Reviewed-by: Ameer Hamza <ahamza@ixsystems.com>
    Reviewed-by: Tony Hutter <hutter2@llnl.gov>
    Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
    Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
    Issue #14859
    Closes #14861

commit 054bb22686045ea1499065a4456568f0c21d939b
Author: Andrew Innes <andrew.c12@gmail.com>
Date:   Tue Jun 27 09:20:56 2023 +0800

    Windows: Teach zpool scrub to scrub only blocks in error log

    Signed-off-by: Andrew Innes <andrew.c12@gmail.com>

commit b61e89a3e68ae19819493183ff3d1fe7bf4ffe2b
Author: George Amanakis <gamanakis@gmail.com>
Date:   Fri Dec 17 21:35:28 2021 +0100

    Teach zpool scrub to scrub only blocks in error log

    Added a flag '-e' to zpool scrub to scrub only blocks in the error
    log. A user can pause, resume and cancel the error scrub by passing
    the additional command line arguments -p and -s, just like a regular
    scrub. This involves adding a new flag, creating new libzfs
    interfaces, a new ioctl, and the actual iteration and read-issuing
    logic. Error scrubbing is executed in multiple txgs to make sure
    pool performance is not affected.

    Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
    Reviewed-by: Tony Hutter <hutter2@llnl.gov>
    Co-authored-by: TulsiJain tulsi.jain@delphix.com
    Signed-off-by: George Amanakis <gamanakis@gmail.com>
    Closes #8995
    Closes #12355

commit 61bfb3cb5dd792ec7ca0fbfca59b165f3ddbe1f5
Author: Brian Behlendorf <behlendorf1@llnl.gov>
Date:   Thu May 18 10:02:20 2023 -0700

    Add the ability to uninitialize

    zpool initialize functions well for touching every free byte...once.
    But if we want to do it again, we're currently out of luck.

    So let's add zpool initialize -u to clear it.

    Co-authored-by: Rich Ercolani <rincebrain@gmail.com>
    Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
    Signed-off-by: Rich Ercolani <rincebrain@gmail.com>
    Closes #12451
    Closes #14873

commit 855b62942d4ca5dab3d65b7000f9d284fd1560bb
Author: Antonio Russo <aerusso@aerusso.net>
Date:   Mon May 15 17:11:33 2023 -0600

    test-runner: pass kmemleak and kmsg to Cmd.run

    test-runner.py orchestrates all of the ZTS executions. The `Cmd`
    object manages these processes, and its `run` method specifically
    invokes these possibly long-running processes, possibly retrying in
    the event of a timeout. Since its inception, memory leak detection
    using the kmemleak infrastructure [1] and kernel logging [2] have
    been added to this run mechanism.

    However, the callback to cull a process beyond its timeout
    threshold, `kill_cmd`, has evaded modernization by both of these
    changes. As a result, this function fails to properly invoke `run`,
    leading to an untrapped exception and an unreported test failure.

    This patch extends `kill_cmd` to receive these kernel devices through
    the `options` parameter, and regularizes all the `.run` calls from
    `Cmd`, and its subclasses, to accept that parameter.

    [1] Commit a69765ea5b563e0cd4d15fac4b1ac08c6ccf12d1
    [2] Commit fc2c0256c55a2859d1988671b0896d22b75c8aba

    Reviewed-by: John Wren Kennedy <john.kennedy@delphix.com>
    Signed-off-by: Antonio Russo <aerusso@aerusso.net>
    Closes #14849

commit 537939565123fd2afa097e9a56ee3efd28779e5f
Author: Richard Yao <richard.yao@alumni.stonybrook.edu>
Date:   Fri May 12 17:10:14 2023 -0400

    Fix undefined behavior in spa_sync_props()

    8eae2d214cfa53862833eeeda9a5c1e9d5ded47d caused Coverity to begin
    complaining about "Improper use of negative value" in two places in
    spa_sync_props() because Coverity correctly inferred from `prop ==
    ZPOOL_PROP_INVAL` that prop could be -1 while both
    zpool_prop_to_name() and zpool_prop_get_type() use it as an array
    index, which is undefined behavior.

    Assuming that the system does not panic from an attempt to read invalid
    memory, the case statement for ZPOOL_PROP_INVAL will ensure that only
    user properties will reach this code when prop is ZPOOL_PROP_INVAL, such
    that execution will continue safely. However, if we are unlucky enough
    to read invalid memory, then the system will panic.

    This issue predates the patch that caused Coverity to begin complaining.
    Thankfully, our userland tools do not pass nonsense to us, so this bug
    should not be triggered unless a future userland tool attempts to set a
    property that we do not understand.

    Reported-by: Coverity (CID-1561129)
    Reported-by: Coverity (CID-1561130)
    Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
    Reviewed-by: George Amanakis <gamanakis@gmail.com>
    Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
    Closes #14860
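
The underlying hazard is generic C: a sentinel such as -1 must be screened out before it is used as a table index. Here is a small sketch with hypothetical names (not the actual zpool property tables):

    #include <stddef.h>

    #define PROP_SENTINEL   (-1)    /* stands in for ZPOOL_PROP_INVAL */

    static const char *const prop_names[] = { "alpha", "beta" };

    static const char *
    prop_to_name_checked(int prop)
    {
            /* Guard first: indexing with -1 is undefined behavior. */
            if (prop < 0 ||
                (size_t)prop >= sizeof (prop_names) / sizeof (prop_names[0]))
                    return (NULL);
            return (prop_names[prop]);
    }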

commit 02351b380f0430980bfb92e83d0800df104bd06a
Author: Richard Yao <richard.yao@alumni.stonybrook.edu>
Date:   Fri May 12 16:47:56 2023 -0400

    Fix use after free regression in spa_remove_healed_errors()

    6839ec6f1098c28ff7b772f1b31b832d05e6b567 placed code in
    spa_remove_healed_errors() that uses a pointer after the kmem_free()
    call that frees it.

    Reported-by: Coverity (CID-1562375)
    Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
    Reviewed-by: George Amanakis <gamanakis@gmail.com>
    Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
    Closes #14860

commit e9b315ffb79ff6419694a2713fcd5fd448317904
Author: Andrew Innes <andrew.c12@gmail.com>
Date:   Mon May 15 13:52:35 2023 +0800

    Use python3 on windows

commit 3346a5b78c2db15801ce54a70a323952fdf67fa5
Author: Jorgen Lundman <lundman@lundman.net>
Date:   Thu Jun 22 08:56:38 2023 +0900

    zfs_write() ignores errors

    If files were advanced by zfs_freesp(), we ignored
    any errors returned by it.

    Signed-off-by: Jorgen Lundman <lundman@lundman.net>
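
A sketch of the fix pattern, with a hypothetical stand-in for zfs_freesp() (the real call takes a znode and range arguments):

    #include <errno.h>

    /* Hypothetical stand-in for zfs_freesp(): 0 or an errno. */
    static int
    freesp_like(long long off)
    {
            return (off < 0 ? EINVAL : 0);
    }

    static int
    write_like(long long woff)
    {
            int error;

            /* Previously the return value was discarded. */
            error = freesp_like(woff);
            if (error != 0)
                    return (error); /* now propagated to the caller */
            /* ... proceed with the write ... */
            return (0);
    }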

commit cce49c08316bc6a5dff287f4fa15856e26d5b18a
Author: Jorgen Lundman <lundman@lundman.net>
Date:   Thu Jun 22 08:55:55 2023 +0900

    Correct Stream event path

    The stream path events used the incorrect name
    "stream"; they now use "file.txt:stream", as per NTFS.

    Signed-off-by: Jorgen Lundman <lundman@lundman.net>

commit 0f83d31e288d789fb4e10a7e4b12e27887820498
Author: Jorgen Lundman <lundman@lundman.net>
Date:   Wed Jun 21 14:30:13 2023 +0900

    Add stub for file_hard_link_information()

    Signed-off-by: Jorgen Lundman <lundman@lundman.net>

commit 8d6db9490364e4d281546445571d2ca9d5abda22
Author: Jorgen Lundman <lundman@lundman.net>
Date:   Wed Jun 21 14:29:43 2023 +0900

    Return correct FileID in dirlist

    Signed-off-by: Jorgen Lundman <lundman@lundman.net>

commit 4c011397229e3c38259d6956458a4fd287dca72d
Author: Andrew Innes <andrew.c12@gmail.com>
Date:   Wed Jun 21 10:17:30 2023 +0800

    Fix logic (#232)

    Signed-off-by: Andrew Innes <andrew.c12@gmail.com>

commit 467436b676ad897025b7ed90d8f033969da441cc
Author: Andrew Innes <andrew.c12@gmail.com>
Date:   Wed Jun 21 09:47:38 2023 +0800

    Run winbtrfs tests by default (#231)

    Signed-off-by: Andrew Innes <andrew.c12@gmail.com>

commit 56eca2a5d116c66b10579f9cf6d5f271991c7e2e
Author: Jorgen Lundman <lundman@lundman.net>
Date:   Wed Jun 21 09:54:00 2023 +0900

    SetFilePositionInformation SetFileValidDataLengthInformation

    Signed-off-by: Jorgen Lundman <lundman@lundman.net>

commit b4fbbda470f27aee565dfa9bc0d68217b969339c
Author: Andrew Innes <andrew.c12@gmail.com>
Date:   Tue Jun 20 16:33:12 2023 +0800

    Add sleep to tests (#230)

    Signed-off-by: Andrew Innes <andrew.c12@gmail.com>

commit 94f1f52807d1f8c0c2931e9e52b91f0ce5e488f4
Author: Jorgen Lundman <lundman@lundman.net>
Date:   Tue Jun 20 16:53:50 2023 +0900

    CreateFile of newfile:newstream should create both

    In addition, many more stream fixes, covering illegal chars and names

    Signed-off-by: Jorgen Lundman <lundman@lundman.net>

commit 894d512880d39ecf40e841c6d7b73157dfe397e0
Author: Jorgen Lundman <lundman@lundman.net>
Date:   Tue Jun 20 08:41:37 2023 +0900

    Windows streams should return parent file ID

    When asked for the File ID of a stream, it should return
    the File ID of the parent file, which is two levels up.

    Signed-off-by: Jorgen Lundman <lundman@lundman.net>

commit 0cc45d2154a2866b2f494c3790a57555c29e60c3
Author: Jorgen Lundman <lundman@lundman.net>
Date:   Tue Jun 20 08:32:44 2023 +0900

    Support FILE_STANDARD_INFORMATION_EX

    Signed-off-by: Jorgen Lundman <lundman@lundman.net>

commit a6edd02999d581db56f4a53567f4c5db11778f64
Author: Jorgen Lundman <lundman@lundman.net>
Date:   Mon Jun 19 10:36:13 2023 +0900

    Add xattr compat code from upstream

    and adjust calls to the new APIs.
    This adds xattr=sa support to Windows.

    Signed-off-by: Jorgen Lundman <lundman@lundman.net>

commit 0e1476a3942990385d32c02403ebe2c815d567db
Author: Jorgen Lundman <lundman@lundman.net>
Date:   Wed Jun 14 11:56:09 2023 +0900

    Set EA can panic

    Signed-off-by: Jorgen Lundman <lundman@lundman.net>

commit 4a1adef6b8c2851195d692a42d5718c9a1b03490
Author: Jorgen Lundman <lundman@lundman.net>
Date:   Wed Jun 14 09:49:57 2023 +0900

    Incorrect MAXPATH used in delete entry

    Signed-off-by: Jorgen Lundman <lundman@lundman.net>

commit 2c0d119e37cb3eed1acac90efa9fe0f8c173e0f0
Author: Jorgen Lundman <lundman@lundman.net>
Date:   Tue Jun 13 16:19:42 2023 +0900

    Large changes fixing FS notify events

    Some incorrect behavior remains: querying the name of
    a stream is wrong.

    Signed-off-by: Jorgen Lundman <lundman@lundman.net>

commit 5b2b2b0550a493497a0b460206079fd57c639543
Author: Jorgen Lundman <lundman@lundman.net>
Date:   Tue May 16 14:42:52 2023 +0900

    file name and file full information buffer overrun

    When a buffer is not big enough, we would still
    null-terminate the full string, beyond the supplied
    buffer.

    Signed-off-by: Jorgen Lundman <lundman@lundman.net>

commit 94bfb92951a5ccdef7b2a1fb818fafdafbc4fff0
Author: Jorgen Lundman <lundman@lundman.net>
Date:   Tue May 16 11:48:12 2023 +0900

    Correct Query EA and Query Streams

    Which includes:

    * NextEntryOffset is not an offset from Buffer, but from one struct
      to the next struct.
    * Pack only complete EAs, and return Overflow if one does not fit.
    * Querying file EA information now returns Information=size.
    * Call cleareaszie on VP when EAs have changed.

    Signed-off-by: Jorgen Lundman <lundman@lundman.net>

commit 9c7a4071fcfc99c3308620fc1943355f9ade34b3
Author: Alexander Motin <mav@FreeBSD.org>
Date:   Fri May 12 12:49:26 2023 -0400

    zil: Free lwb_buf after write completion.

    There is no sense in keeping that memory allocated during the flush.

    Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
    Reviewed-by: Prakash Surya <prakash.surya@delphix.com>
    Signed-off-by:  Alexander Motin <mav@FreeBSD.org>
    Sponsored by:   iXsystems, Inc.
    Closes #14855

commit 7e91b3222ddaadc10c92d1065529886dd3806acc
Author: Alexander Motin <mav@FreeBSD.org>
Date:   Fri May 12 12:14:29 2023 -0400

    zil: Some micro-optimizations.

    Should not cause functional changes.

    Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
    Signed-off-by:  Alexander Motin <mav@FreeBSD.org>
    Sponsored by:   iXsystems, Inc.
    Closes #14854

commit 6b62c3b0e10de782c3aef0e1206aa48875519c4e
Author: Don Brady <dev.fs.zfs@gmail.com>
Date:   Fri May 12 10:12:28 2023 -0600

    Refine special_small_blocks property validation

    When the special_small_blocks property is being set during a pool
    create, it enforces a limit of 128KiB even if the pool's record size
    is larger.

    If the recordsize property is being set during a pool create, then
    use that value instead of the default SPA_OLD_MAXBLOCKSIZE value.

    Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
    Signed-off-by: Don Brady <dev.fs.zfs@gmail.com>
    Closes #13815
    Closes #14811

commit d0ab2dddde618c394fa7fe88211276786ba8ca12
Author: Brian Behlendorf <behlendorf1@llnl.gov>
Date:   Fri May 12 09:07:58 2023 -0700

    ZTS: Add auto_replace_001_pos to exceptions

    The auto_replace_001_pos test case does not reliably pass on
    Fedora 37 and newer.  Until the test case can be updated to make
    it reliable add it to the list of "maybe" exceptions on Linux.

    Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
    Issue #14851
    Closes #14852

commit 1e3e7a103a5026e9a2005acec7017e4024d95115
Author: Pawel Jakub Dawidek <pawel@dawidek.net>
Date:   Tue May 9 22:32:30 2023 -0700

    Make sure we are not trying to clone a spill block.

    Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
    Signed-off-by: Pawel Jakub Dawidek <pawel@dawidek.net>
    Closes #14825

commit a22891c3272d8527d4c8cb7ff52a25ef396e7add
Author: Pawel Jakub Dawidek <pawel@dawidek.net>
Date:   Thu May 4 16:14:19 2023 -0700

    Correct comment.

    Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
    Signed-off-by: Pawel Jakub Dawidek <pawel@dawidek.net>
    Closes #14825

commit 9b016166dd5875db87963b5deeca8eeda094b571
Author: Pawel Jakub Dawidek <pawel@dawidek.net>
Date:   Wed May 3 23:25:22 2023 -0700

    Remove badly placed comment.

    Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
    Signed-off-by: Pawel Jakub Dawidek <pawel@dawidek.net>
    Closes #14825

commit 6bcd48e213a279781ecd6df22799532cbec353d6
Author: Pawel Jakub Dawidek <pawel@dawidek.net>
Date:   Wed May 3 00:24:47 2023 -0700

    Don't call zfs_exit_two() before zfs_enter_two().

    Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
    Signed-off-by: Pawel Jakub Dawidek <pawel@dawidek.net>
    Closes #14825

commit 0919c985e294a89169adacd5ed4a240945e5fbee
Author: Pawel Jakub Dawidek <pawel@dawidek.net>
Date:   Tue May 2 15:46:14 2023 -0700

    Don't use dmu_buf_is_dirty() for unassigned transaction.

    The dmu_buf_is_dirty() call doesn't make sense here for two reasons:
    1. txg is 0 for an unassigned tx, so it was a no-op.
    2. It is equivalent to checking if we have dirty records, and we are
       doing this a few lines earlier.

    Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
    Signed-off-by: Pawel Jakub Dawidek <pawel@dawidek.net>
    Closes #14825

commit 7f88494ac91c61aeffad810e7d167badb875166e
Author: Pawel Jakub Dawidek <pawel@dawidek.net>
Date:   Tue May 2 14:24:43 2023 -0700

    Deny block cloning if dbuf size doesn't match BP size.

    I don't know an easy way to shrink down the dbuf size, so just deny
    block cloning into dbufs that don't match our BP's size.

    This fixes the following situation:
    1. Create a small file, eg. 1kB of random bytes. Its dbuf will be 1kB.
    2. Create a larger file, eg. 2kB of random bytes. Its dbuf will be 2kB.
    3. Truncate the large file to 0. Its dbuf will remain 2kB.
    4. Clone the small file into the large file. Small file's BP lsize is
       1kB, but the large file's dbuf is 2kB.

    Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
    Signed-off-by: Pawel Jakub Dawidek <pawel@dawidek.net>
    Closes #14825
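
A hedged sketch of the described guard; the structs are illustrative stand-ins, not the actual dbuf or block-pointer layouts, and the error code is a placeholder.

    #include <stdint.h>
    #include <errno.h>

    typedef struct dbuf_like { uint64_t db_size; } dbuf_like_t;
    typedef struct bp_like { uint64_t bp_lsize; } bp_like_t;

    static int
    clone_into_dbuf_checked(dbuf_like_t *db, const bp_like_t *bp)
    {
            /*
             * Step 4 above: a 1kB BP aimed at a file whose dbuf is
             * still 2kB. Refuse rather than clone into a mismatched
             * buffer; the caller can fall back to an ordinary copy.
             */
            if (db->db_size != bp->bp_lsize)
                    return (EINVAL);
            /* ... proceed with block cloning ... */
            return (0);
    }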

commit 49657002f9cb57b9b4675100aaf58e1e93984bbf
Author: Pawel Jakub Dawidek <pawel@dawidek.net>
Date:   Sun Apr 30 02:47:09 2023 -0700

    Additional block cloning fixes.

    Reimplement some of the block cloning vs dbuf logic, mostly to fix
    a situation where we clone a block and in the same transaction group
    we want to partially overwrite the clone.

    Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
    Signed-off-by: Pawel Jakub Dawidek <pawel@dawidek.net>
    Closes #14825

commit 4d31369d3055bf0cf1d4f3e1e7d43d745f2fd05f
Author: Alexander Motin <mav@FreeBSD.org>
Date:   Thu May 11 17:27:12 2023 -0400

    zil: Don't expect zio_shrink() to succeed.

    At least for RAIDZ zio_shrink() does not reduce the zio size, but a
    reduced wsz in that case likely results in writing uninitialized
    memory.

    Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
    Signed-off-by:  Alexander Motin <mav@FreeBSD.org>
    Sponsored by:   iXsystems, Inc.
    Closes #14853

commit 663dc5f616e6d0427207ffcf7a83dd02fe06a707
Author: Ameer Hamza <ahamza@ixsystems.com>
Date:   Wed May 10 05:56:35 2023 +0500

    Prevent panic during concurrent snapshot rollback and zvol read

    Protect zvol_cdev_read with zv_suspend_lock to prevent concurrent
    release of the dnode, avoiding a panic when a snapshot is rolled back
    in parallel with an ongoing zvol read operation.

    Reviewed-by: Chunwei Chen <tuxoko@gmail.com>
    Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
    Reviewed-by: Alexander Motin <mav@FreeBSD.org>
    Signed-off-by: Ameer Hamza <ahamza@ixsystems.com>
    Closes #14839

commit 7375f4f61ca587f893435184f398a767ae52fbea
Author: Tony Hutter <hutter2@llnl.gov>
Date:   Tue May 9 17:55:19 2023 -0700

    pam: Fix "buffer overflow" in pam ZTS tests on F38

    The pam ZTS tests were reporting a buffer overflow on F38, possibly
    due to F38 now setting _FORTIFY_SOURCE=3 by default.  gdb and
    valgrind narrowed this down to a snprintf() buffer overflow in
    zfs_key_config_modify_session_counter().  I'm not clear why this
    particular snprintf() was being flagged as an overflow, but when
    I replaced it with an asprintf(), the test passed reliably.

    Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
    Signed-off-by: Tony Hutter <hutter2@llnl.gov>
    Closes #14802
    Closes #14842
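
A sketch of the snprintf-to-asprintf change in spirit (simplified; the real function builds a session-counter path inside the PAM module, and this helper name is invented):

    #define _GNU_SOURCE     /* asprintf() is a GNU/BSD extension */
    #include <stdio.h>
    #include <stdlib.h>

    static char *
    counter_path(const char *runtime_dir, unsigned int uid)
    {
            char *path = NULL;

            /*
             * asprintf() allocates a buffer of exactly the right size,
             * sidestepping the fixed-size buffer that _FORTIFY_SOURCE=3
             * flagged with snprintf().
             */
            if (asprintf(&path, "%s/%u", runtime_dir, uid) == -1)
                    return (NULL);
            return (path);  /* caller frees */
    }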

commit 9d3ed831f309e28a9cad56c8b1520292dbad0d7b
Author: Brian Behlendorf <behlendorf1@llnl.gov>
Date:   Tue May 9 09:03:10 2023 -0700

    Add dmu_tx_hold_append() interface

    Provides an interface which callers can use to declare a write when
    the exact starting offset is not yet known.  Since the full range
    being updated is not available, only the first L0 block at the
    provided offset will be prefetched.

    Reviewed-by: Olaf Faaland <faaland1@llnl.gov>
    Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
    Closes #14819

commit 2b6033d71da38015c885297d1ee6577871099744
Author: Brian Behlendorf <behlendorf1@llnl.gov>
Date:   Tue May 9 08:57:02 2023 -0700

    Debug auto_replace_001_pos failures

    Reduced the timeout to 60 seconds, which should be more than
    sufficient and allow the test to be marked as FAILED rather
    than KILLED.  Also dump the pool status on cleanup.

    Reviewed-by: Brian Atkinson <batkinson@lanl.gov>
    Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
    Closes #14829

commit f4adc2882fb162c82e9738c5d2d30e3ba8a66367
Author: George Amanakis <gamanakis@gmail.com>
Date:   Tue May 9 17:54:41 2023 +0200

    Remove duplicate code in l2arc_evict()

    l2arc_evict() unnecessarily adjusts the size of buffers to be
    written to L2ARC. l2arc_write_size() is called right before
    l2arc_evict() and already performs those adjustments.

    Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
    Reviewed-by: Brian Atkinson <batkinson@lanl.gov>
    Signed-off-by: George Amanakis <gamanakis@gmail.com>
    Closes #14828

commit 9b2c182d291bbb3ece9ceb1c72800d238d19b2e7
Author: Alexander Motin <mav@FreeBSD.org>
Date:   Tue May 9 11:54:01 2023 -0400

    Remove single parent assertion from zio_nowait().

    We only need to know if the ZIO has any parent there.  We do not
    care if it has more than one, but the use of zio_unique_parent() ==
    NULL asserts that.

    Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
    Signed-off-by:	Alexander Motin <mav@FreeBSD.org>
    Sponsored by:	iXsystems, Inc.
    Closes #14823

commit 4def61804c052a1235179e3a7c98305d8075e0e9
Author: George Amanakis <gamanakis@gmail.com>
Date:   Tue May 9 17:53:27 2023 +0200

    Enable the head_errlog feature to remove errors

    In case check_filesystem() does not error out and does not report
    an error, remove that error block from error lists and logs
    without requiring a scrub. This can happen when the original file and
    all snapshots/clones referencing it have been removed.

    Otherwise zpool status will still report that "Permanent errors have
    been detected..." without actually reporting any of them.

    To implement this change the functions introduced in corrective
    receive were modified to take into account the head_errlog feature.

    Before this change:
    =============================
    pool: test
     state: ONLINE
    status: One or more devices has experienced an error resulting in data
            corruption.  Applications may be affected.
    action: Restore the file in question if possible.  Otherwise restore the
            entire pool from backup.
       see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-8A
    config:

            NAME                   STATE     READ WRITE CKSUM
            test                   ONLINE       0     0     0
              /home/user/vdev_a    ONLINE       0     0     2

    errors: Permanent errors have been detected in the following files:

    =============================

    After this change:
    =============================
      pool: test
     state: ONLINE
    status: One or more devices has experienced an unrecoverable error.  An
            attempt was made to correct the error.  Applications are
    unaffected.
    action: Determine if the device needs to be replaced, and clear the
    errors
            using 'zpool clear' or replace the device with 'zpool replace'.
       see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-9P
    config:

            NAME                   STATE     READ WRITE CKSUM
            test                   ONLINE       0     0     0
              /home/user/vdev_a    ONLINE       0     0     2

    errors: No known data errors
    =============================

    Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
    Reviewed-by: Brian Atkinson <batkinson@lanl.gov>
    Signed-off-by: George Amanakis <gamanakis@gmail.com>
    Closes #14813

commit 3f2f9533ca8512ef515a73ac5661598a65b896b6
Author: George Amanakis <gamanakis@gmail.com>
Date:   Mon May 8 22:35:03 2023 +0200

    Fixes in head_errlog feature with encryption

    For the head_errlog feature use dsl_dataset_hold_obj_flags() instead of
    dsl_dataset_hold_obj() in order to enable access to the encryption keys
    (if loaded). This enables reporting of errors in encrypted filesystems
    which are not mounted but have their keys loaded.

    Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
    Signed-off-by: George Amanakis <gamanakis@gmail.com>
    Closes #14837

commit 288ea63effae3ba24fcb6dc412a3125b9f3e1da9
Author: Matthew Ahrens <mahrens@delphix.com>
Date:   Mon May 8 11:20:23 2023 -0700

    Verify block pointers before writing them out

    If a block pointer is corrupted (but the block containing it checksums
    correctly, e.g. due to a bug that overwrites random memory), we can
    often detect it before the block is read, with the `zfs_blkptr_verify()`
    function, which is used in `arc_read()`, `zio_free()`, etc.

    However, such corruption is not typically recoverable.  To recover from
    it we would need to detect the memory error before the block pointer is
    written to disk.

    This PR verifies BPs that are contained in indirect blocks and dnodes
    before they are written to disk, in `dbuf_write_ready()`. This way,
    we'll get a panic before the on-disk data is corrupted. This will help
    us to diagnose what's causing the corruption, as well as being much
    easier to recover from.

    To minimize performance impact, only checks that can be done without
    holding the spa_config_lock are performed.

    Additionally, when corruption is detected, the raw words of the block
    pointer are logged.  (Note that `dprintf_bp()` is a no-op by default,
    but if enabled it is not safe to use with invalid block pointers.)

    Reviewed-by: Rich Ercolani <rincebrain@gmail.com>
    Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
    Reviewed-by: Paul Zuchowski <pzuchowski@datto.com>
    Reviewed-by: Alexander Motin <mav@FreeBSD.org>
    Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
    Closes #14817

commit 23132688b9d54ef11413925f88c02d83d607ec2b
Author: Brian Behlendorf <behlendorf1@llnl.gov>
Date:   Mon May 8 11:17:41 2023 -0700

    zdb: consistent xattr output

    When using zdb to output the value of an xattr, only interpret it
    as printable characters if the entire byte array is printable.
    Additionally, if the --parseable option is set always output the
    buffer contents as octal for easy parsing.

    Reviewed-by: Olaf Faaland <faaland1@llnl.gov>
    Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
    Closes #14830

commit 6deb342248e10af92e2d3fbb4e4b1221812188ff
Author: Brian Behlendorf <behlendorf1@llnl.gov>
Date:   Mon May 8 10:09:30 2023 -0700

    ZTS: add snapshot/snapshot_002_pos exception

    Add snapshot_002_pos to the known list of occasional failures
    for FreeBSD until it can be made entirely reliable.

    Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de>
    Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
    Issue #14831
    Closes #14832

commit a0a125bab291fe005d29be5375a5bb2a1c8261c7
Author: Alexander Motin <mav@FreeBSD.org>
Date:   Fri May 5 12:17:55 2023 -0400

    Fix two abd_gang_add_gang() issues.

    - There is no reason to assert that the added gang is not empty.  It
    may be weird to add an empty gang, but it is legal.
    - When moving the chain list from the added gang, clear its size, or
    it will trigger an assertion in abd_verify() when that gang is freed.

    Reviewed-by: Brian Atkinson <batkinson@lanl.gov>
    Signed-off-by:	Alexander Motin <mav@FreeBSD.org>
    Sponsored by:	iXsystems, Inc.
    Closes #14816

commit aefb80389458dcccdcb9659914714264248b8e52
Author: Pawel Jakub Dawidek <pawel@dawidek.net>
Date:   Sat May 6 01:09:12 2023 +0900

    Simplify and optimize random_int_between().

    Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
    Signed-off-by: Pawel Jakub Dawidek <pawel@dawidek.net>
    Closes #14805

commit cf53b4376d902baecc04e450038d49c84c848e56
Author: Pawel Jakub Dawidek <pawel@dawidek.net>
Date:   Sat May 6 00:51:41 2023 +0900

    Plug memory leak in zfsdev_state.

    On kernel module unload, free all zfsdev state structures, except for
    zfsdev_state_listhead, which is statically allocated.

    Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
    Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
    Signed-off-by: Pawel Jakub Dawidek <pawel@dawidek.net>
    Closes #14824

commit 409f6b6fa0caba14be1995bbe28ca70e55ab7666
Author: Ameer Hamza <ahamza@ixsystems.com>
Date:   Thu May 4 03:10:32 2023 +0500

    zpool import -m also removing spare and cache when log device is missing

    spa_import() relies on a pool config fetched by spa_tryimport() for
    spare/cache devices. Import flags are not passed to spa_tryimport(),
    which makes it return early due to a missing log device and missing
    retrieving the cache dev…
commit 9cde9c07739f76a37d729d3a323f49f5d4bc100f
Author: Andrew Innes <andrew.c12@gmail.com>
Date:   Wed Jun 28 19:27:10 2023 +0800

    Revert various glitches

    Signed-off-by: Andrew Innes <andrew.c12@gmail.com>

commit d0c8c0fb05088bb016bc208d5f8cb709195cff87
Author: Andrew Innes <andrew.c12@gmail.com>
Date:   Thu Jun 29 08:24:13 2023 +0800

    Windows: znode: expose zfs_get_zplprop to libzpool

    Signed-off-by: Andrew Innes <andrew.c12@gmail.com>

commit 3d747f29b2864b661223d09bc8375d34e2105825
Author: Richard Yao <richard.yao@alumni.stonybrook.edu>
Date:   Sun Dec 4 17:42:43 2022 -0500

    Fix TOCTOU race in zpool_do_labelclear()

    Coverity reported a TOCTOU race in `zpool_do_labelclear()`. This is not
    believed to be a real security issue, but fixing it reduces the number
    of syscalls we do and will prevent other static analyzers from
    complaining about this.

    The code is expected to be equivalent. However, under rare
    circumstances, such as ELOOP, ENAMETOOLONG, ENOMEM, ENOTDIR and
    EOVERFLOW, we will display the error message that we currently display
    for the `open()` syscall rather than the one that we currently display
    for the `stat()` syscall. This is considered to be an improvement.

    Reported-by: Coverity (CID-1524188)
    Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
    Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
    Closes #14575

commit 1e255365c9bf0e7858561d527c0ebdf8f90bc925
Author: Alexander Motin <mav@FreeBSD.org>
Date:   Tue Jun 27 20:03:37 2023 -0400

    ZIL: Fix another use-after-free.

    lwb->lwb_issued_txg cannot be accessed after lwb_state is set to
    LWB_STATE_FLUSH_DONE and zl_lock is dropped, since the lwb may be
    freed by zil_sync().  We must save the txg number before that.

    This is similar to 55b1842f92, but as far as I can see the bug is
    not new.  It existed for quite a while; it just was not triggered
    due to a smaller race window.

    Reviewed-by: Allan Jude <allan@klarasystems.com>
    Reviewed-by: Brian Atkinson <batkinson@lanl.gov>
    Signed-off-by:	Alexander Motin <mav@FreeBSD.org>
    Sponsored by:	iXsystems, Inc.
    Closes #14988
    Closes #14999

commit 233893e7cb7a98895061100ef8363f0ac30204b5
Author: Alexander Motin <mav@FreeBSD.org>
Date:   Tue Jun 27 20:00:30 2023 -0400

    Use big transactions for small recordsize writes.

    When ZFS appends files in chunks bigger than recordsize, it borrows
    a buffer from the ARC and fills it before opening a transaction.
    This is supposed to help in the case of page faults, so the
    transaction is not held open indefinitely.  The problem appears
    when recordsize is set lower than the default 128KB. Since each
    block is committed in a separate transaction, per-transaction
    overhead becomes significant, and what is even worse, the active
    use of per-dataset and per-pool locks to protect space use
    accounting for each transaction badly hurts the code's SMP
    scalability.  The same transaction size limitation applies in the
    case of file rewrite, but without even the excuse of buffer
    borrowing.

    To address the issue, disable the borrowing mechanism if recordsize
    is smaller than the default and the write request is 4x bigger than
    it.  In such cases writes of up to 32MB are executed in a single
    transaction, which dramatically reduces overhead and lock
    contention.  Since the borrowing mechanism is not used for file
    rewrites, and it was never used by zvols, which seem to work fine,
    I don't think this change should create significant problems,
    partially because in addition to the borrowing mechanism pre-faults
    are also used.

    My tests with 4/8 threads writing several files at the same time on
    datasets with 32KB recordsize in 1MB requests show a reduction of
    CPU usage by the user threads of 25-35%.  I would measure it in
    GB/s, but at that block size we are now limited by the lock
    contention of the single write issue taskqueue, which is a separate
    problem we are going to work on.

    Reviewed-by: Brian Atkinson <batkinson@lanl.gov>
    Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
    Signed-off-by:	Alexander Motin <mav@FreeBSD.org>
    Sponsored by:	iXsystems, Inc.
    Closes #14964

commit aea27422747921798a9b9e1b8e0f6230d5672ba5
Author: Laevos <5572812+Laevos@users.noreply.github.com>
Date:   Tue Jun 27 16:58:32 2023 -0700

    Remove unnecessary commas in zpool-create.8

    Reviewed-by: Brian Atkinson <batkinson@lanl.gov>
    Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
    Signed-off-by: Laevos <5572812+Laevos@users.noreply.github.com>
    Closes #15011

commit 38a821c0d8f6bb51a866354e76078abf6a6ba1fc
Author: Alexander Motin <mav@FreeBSD.org>
Date:   Tue Jun 27 12:09:48 2023 -0400

    Another set of vdev queue optimizations.

    Switch the FIFO queues (SYNC/TRIM) and the active queue of the vdev
    queue from time-sorted AVL-trees to simple lists.  AVL-trees are
    too expensive for such a simple task.  To change I/O priority
    without searching through the trees, add an io_queue_state field to
    struct zio.

    To avoid checking the number of queued I/Os for each priority, add
    a vq_cqueued bitmap to struct vdev_queue and update it when
    adding/removing I/Os.  Make vq_cactive a separate array instead of
    a struct vdev_queue_class member.  Together these avoid lots of
    cache misses when looking for work in vdev_queue_class_to_issue().

    Introduce a deadline of ~0.5s for LBA-sorted queues.  Before this I
    saw some I/Os waiting in a queue for up to 8 seconds and possibly
    more due to starvation.  With this change I no longer see it.  I
    had to make the comparison function slightly more complicated, but
    since it uses all the same cache lines the difference is minimal.
    For sequential I/Os the new code in vdev_queue_io_to_issue()
    actually often uses the simpler avl_first(), falling back to
    avl_find() and avl_nearest() only when needed.

    Arrange members in struct zio to access only one cache line when
    searching through the vdev queues.  While there, remove
    io_alloc_node, reusing io_queue_node instead.  The two are never
    used at the same time.

    Remove the zfs_vdev_aggregate_trim parameter.  It had been disabled
    for 4 years since being implemented, while still wasting time
    maintaining the offset-sorted tree of TRIM requests.  Just remove
    the tree.

    Remove locking from txg_all_lists_empty().  It is racy by design,
    while two pairs of locks/unlocks take noticeable time under the
    vdev queue lock.

    With these changes, in my tests with volblocksize=4KB I measure a
    vdev queue lock spin time reduction of 50% on read and 75% on
    write.

    Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
    Signed-off-by:	Alexander Motin <mav@FreeBSD.org>
    Sponsored by:	iXsystems, Inc.
    Closes #14925
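
A self-contained illustration of the bitmap idea (the types and names here are invented, not the actual vdev queue code): one bit per priority class, maintained on enqueue/dequeue, lets the issue path find a non-empty class with a single find-first-set instead of touching every per-class counter:

#include <stdint.h>
#include <strings.h>            /* ffs() */

#define NCLASSES 8

typedef struct queue_set {
        uint32_t cqueued;           /* bit p set iff class p is non-empty */
        uint32_t cactive[NCLASSES]; /* per-class queued I/O counts */
} queue_set_t;

static void
qs_enqueue(queue_set_t *qs, int p)
{
        qs->cactive[p]++;
        qs->cqueued |= 1U << p;
}

static void
qs_dequeue(queue_set_t *qs, int p)
{
        if (--qs->cactive[p] == 0)
                qs->cqueued &= ~(1U << p);
}

/* Lowest non-empty class, or -1: one ffs() instead of scanning
 * NCLASSES counters and their cache lines. */
static int
qs_pick(const queue_set_t *qs)
{
        return (ffs((int)qs->cqueued) - 1);
}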

commit 1737e75ab4e09a2d20e7cc64fa83dae047a302e9
Author: Rich Ercolani <214141+rincebrain@users.noreply.github.com>
Date:   Mon Jun 26 16:57:12 2023 -0400

    Add a delay to tearing down threads.

    It's been observed that in certain workloads (zvol-related being a
    big one), ZFS will end up spending a large amount of time spinning
    up taskqs only to tear them down again almost immediately, then
    spin them up again...

    I noticed this when I looked at what my mostly-idle system was doing
    and wondered how on earth taskq creation/destruction was taking up
    so much time...

    So I added a configurable delay to avoid tearing down taskqs the
    first time they are noticed to be idle; the total number of threads
    at steady state went up, but the amount of time burned just tearing
    down/spinning up new ones almost vanished.

    Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
    Signed-off-by: Rich Ercolani <rincebrain@gmail.com>
    Closes #14938

commit 68b8e2ffab23cba6ae87f18c59b044c833934f2f
Author: Alexander Motin <mav@FreeBSD.org>
Date:   Sat Jun 17 22:51:37 2023 -0400

    Fix memory leak in zil_parse().

    482da24e2 missed arc_buf_destroy() calls on log parse errors, possibly
    leaking up to 128KB of memory per dataset during ZIL replay.

    Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
    Reviewed-by: Paul Dagnelie <pcd@delphix.com>
    Signed-off-by:	Alexander Motin <mav@FreeBSD.org>
    Sponsored by:	iXsystems, Inc.
    Closes #14987

commit ea0d03a8bd040e438bcaa43b8e449cbf717e14f3
Author: George Amanakis <gamanakis@gmail.com>
Date:   Thu Jun 15 21:45:36 2023 +0200

    Shorten arcstat_quiescence sleep time

    With the latest L2ARC fixes, 2 seconds is too long to wait for
    quiescence of arcstats like l2_size. Shorten this interval to avoid
    having the persistent L2ARC tests in ZTS prematurely terminated.

    Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
    Signed-off-by: George Amanakis <gamanakis@gmail.com>
    Closes #14981

commit 3fa141285b8105b3cc11c1296b77ad6d24250f2c
Author: Alexander Motin <mav@FreeBSD.org>
Date:   Thu Jun 15 13:49:03 2023 -0400

    Remove ARC/ZIO physdone callbacks.

    Those callbacks were introduced many years ago as part of a bigger
    patch to smooth the write throttling within a txg. They allow
    accounting for the completion of individual physical writes within
    a logical one, improving cases when some of the physical writes
    complete much sooner than others, gradually opening the write
    throttle.

    A few years after that, ZFS got allocation throttling, working at
    the level of logical writes and limiting the number of writes
    queued to vdevs at any point, and so limiting the latency
    distribution between the physical writes and especially writes of
    multiple copies.  The addition of the scheduling deadline I
    proposed in #14925 should further reduce the latency distribution.
    Memory sizes grown over the past 10 years should also reduce the
    importance of the smoothing.

    While the use of the physdone callback may still in theory provide
    somewhat smoother throttling, there are cases where we simply can
    not afford it.  Since dirty data accounting is protected by a
    pool-wide lock, in the case of 6-wide RAIDZ, for example, it
    requires us to take it 8 times per logical block write, creating
    huge lock contention.

    My tests of this patch show a radical reduction of the lock
    spinning time on workloads where smaller blocks are written to
    RAIDZ pools, with each of the disks receiving 8-16KB chunks but the
    total rate reaching 100K+ blocks per second.  At the same time,
    attempts to measure any write time fluctuations didn't show
    anything noticeable.

    While there, also remove the io_child_count/io_parent_count
    counters.  They are used only for a couple of assertions that can
    be avoided.

    Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
    Signed-off-by: Alexander Motin <mav@FreeBSD.org>
    Sponsored by:	iXsystems, Inc.
    Closes #14948

commit 9efc735904d194987f06870f355e08d94e39ab81
Author: Brian Behlendorf <behlendorf1@llnl.gov>
Date:   Wed Jun 14 10:04:05 2023 -0500

    ZTS: Skip send_raw_ashift on FreeBSD

    On FreeBSD 14 this test runs slowly in the CI environment
    and is killed by the 10 minute timeout.  Skip the test on
    FreeBSD until the slowdown is resolved.

    Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
    Issue #14961

commit 9c54894bfc77f585806984f44c70a839543e6715
Author: Alexander Motin <mav@FreeBSD.org>
Date:   Wed Jun 14 11:02:27 2023 -0400

    Switch refcount tracking from lists to AVL-trees.

    With a large number of tracked references, list searches under the
    lock become too expensive, creating enormous lock contention.

    On my tests with ZFS_DEBUG enabled this increases write throughput
    with 32KB blocks from ~1.2GB/s to ~7.5GB/s.

    Reviewed-by: Brian Atkinson <batkinson@lanl.gov>
    Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
    Signed-off-by: Alexander Motin <mav@FreeBSD.org>
    Sponsored by:	iXsystems, Inc.
    Closes #14970

commit 4e62540827a6ed15e08b2a627896d24bc661fa38
Author: George Amanakis <gamanakis@gmail.com>
Date:   Wed Jun 14 17:01:17 2023 +0200

    Store the L2ARC device ashift in the vdev label

    If this is not done, and the pool has an ashift other than the default
    (at the moment 9) then the following happens:

    1) vdev_alloc() assigns the ashift of the pool to the L2ARC device,
       but upon export it is not stored anywhere
    2) at the first import, vdev_open() sees a vdev_ashift() of 0 and
       assigns the logical_ashift, which is 9
    3) reading the contents of the L2ARC, including the header, fails
    4) L2ARC buffers are not restored in the ARC.

    Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
    Signed-off-by: George Amanakis <gamanakis@gmail.com>
    Closes #14313
    Closes #14963

commit adaa3e64ea46f21cc5f544228c48363977b7733e
Author: George Amanakis <gamanakis@gmail.com>
Date:   Sat Jun 10 02:05:47 2023 +0200

    Fix the L2ARC write size calculating logic (2)

    While commit bcd5321 adjusts the write size based on the size of the log
    block, this happens after comparing the unadjusted write size to the
    evicted (target) size.

    In this case l2ad_hand will exceed l2ad_evict and violate an assertion
    at the end of l2arc_write_buffers().

    Fix this by adding the max log block size to the allocated size of the
    buffer to be committed before comparing the result to the target
    size.

    Also reset the l2arc_trim_ahead ZFS module variable when the adjusted
    write size exceeds the size of the L2ARC device.

    Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
    Signed-off-by: George Amanakis <gamanakis@gmail.com>
    Closes #14936
    Closes #14954

commit 67118a7d6e74a6e818127096162478017610d13e
Author: Andrew Innes <andrew.c12@gmail.com>
Date:   Wed Jun 28 12:31:10 2023 +0800

    Windows: Finally drop long disabled vdev cache.

    Signed-off-by: Andrew Innes <andrew.c12@gmail.com>

commit 5d80c98c28c931339138753a4e4c1156dbf951f4
Author: Alexander Motin <mav@FreeBSD.org>
Date:   Fri Jun 9 15:40:55 2023 -0400

    Finally drop long disabled vdev cache.

    It was a vdev-level read cache, designed to aggregate many small
    reads by speculatively issuing bigger reads instead and caching
    the result.  But since it has almost no idea about what is going
    on, with the exception of the ZIO_FLAG_DONT_CACHE flag set by
    higher layers, it was found to do more harm than good, for which
    reason it has been disabled for the past 12 years.  These days we
    have much better instruments to enlarge the I/Os, such as
    speculative and prescient prefetches, the I/O scheduler, I/O
    aggregation etc.

    Besides the dead code removal, this removes one extra mutex
    lock/unlock per write inside vdev_cache_write(), which was not
    otherwise disabled and was still trying to do some work.

    Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
    Signed-off-by:	Alexander Motin <mav@FreeBSD.org>
    Sponsored by:	iXsystems, Inc.
    Closes #14953

commit 1f1ab33781b5736654b988e2e618ea79788fa1f7
Author: Brian Behlendorf <behlendorf1@llnl.gov>
Date:   Fri Jun 9 11:10:01 2023 -0700

    ZTS: Skip checkpoint_discard_busy

    Until the ASSERT which is occasionally hit while running
    checkpoint_discard_busy is resolved skip this test case.

    Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
    Issue #12053
    Closes #14952

commit b94049c2cbedbbe2af8e629bf974a6ed93f11acb
Author: Alexander Motin <mav@FreeBSD.org>
Date:   Fri Jun 9 13:14:05 2023 -0400

    Improve l2arc reporting in arc_summary.

    - Do not report L2ARC as FAULTED in the presence of in-flight writes.
    - Report read and write I/Os, bytes and errors.
    - Remove a few numbers not important to the average user.

    Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
    Signed-off-by:	Alexander Motin <mav@FreeBSD.org>
    Sponsored by:	iXsystems, Inc.
    Closes #12304
    Closes #14946

commit 31044b5cfb6f91d376034c4d6374f61baaf03232
Author: Andrew Innes <andrew.c12@gmail.com>
Date:   Wed Jun 28 12:00:39 2023 +0800

    Windows: Use list_remove_head() where possible.

    Signed-off-by: Andrew Innes <andrew.c12@gmail.com>

commit 32eda54d0d75a94b6aa71dc80aa958095feb8011
Author: Alexander Motin <mav@FreeBSD.org>
Date:   Fri Jun 9 13:12:52 2023 -0400

    Use list_remove_head() where possible.

    ... instead of list_head() + list_remove().  On FreeBSD the list
    functions are not inlined, so in addition to more compact code
    this also saves another function call.

    Reviewed-by: Brian Atkinson <batkinson@lanl.gov>
    Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
    Signed-off-by:	Alexander Motin <mav@FreeBSD.org>
    Sponsored by:	iXsystems, Inc.
    Closes #14955
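
The mechanical change is tiny; as a fragment with illustrative variable names (not compilable on its own):

/* before: two calls into the list code */
elem = list_head(&lst);
list_remove(&lst, elem);

/* after: one call, and one fewer function call on FreeBSD */
elem = list_remove_head(&lst);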

commit fe7693a3f87229d1ae93b5ce2bb84d8bb86a9f5c
Author: Alexander Motin <mav@FreeBSD.org>
Date:   Fri Jun 9 13:08:05 2023 -0400

    ZIL: Fix race introduced by f63811f0721.

    We are not allowed to access lwb after setting LWB_STATE_FLUSH_DONE
    state and dropping zl_lock, since it may be freed by zil_sync().
    To free itxs and waiters after dropping the lock, we need to move
    the lwb_itxs and lwb_waiters list elements to local storage.

    Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
    Signed-off-by:	Alexander Motin <mav@FreeBSD.org>
    Sponsored by:	iXsystems, Inc.
    Closes #14957
    Closes #14959
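
In outline, the fix drains the lists into local storage while the lock is still held, then drops the lock and works only on the local copies. A fragment using names from the commit message (the itx_node offset and the surrounding steps are assumptions, not the verbatim code):

list_t itxs;
list_create(&itxs, sizeof (itx_t), offsetof(itx_t, itx_node));
/* still under zl_lock: */
list_move_tail(&itxs, &lwb->lwb_itxs);
lwb->lwb_state = LWB_STATE_FLUSH_DONE;
mutex_exit(&zilog->zl_lock);
/* lwb may be freed by zil_sync() from here on; free the itxs
 * using only the local list. */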

commit 44c5a0c92f98e8c21221bd7051729d1947a10736
Author: Rich Ercolani <214141+rincebrain@users.noreply.github.com>
Date:   Wed Jun 7 14:14:05 2023 -0400

    Revert "systemd: Use non-absolute paths in Exec* lines"

    This reverts commit 79b20949b25c8db4d379f6486b0835a6613b480c since it
    doesn't work with the systemd version shipped with RHEL7-based systems.

    Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
    Signed-off-by: Rich Ercolani <rincebrain@gmail.com>
    Closes #14943
    Closes #14945

commit ba5af00257eb4eb3363f297819a21c4da811392f
Author: Brian Behlendorf <behlendorf1@llnl.gov>
Date:   Wed Jun 7 10:43:43 2023 -0700

    Linux: Never sleep in kmem_cache_alloc(..., KM_NOSLEEP) (#14926)

    When a kmem cache is exhausted and needs to be expanded a new
    slab is allocated.  KM_SLEEP callers can block and wait for the
    allocation, but KM_NOSLEEP callers were incorrectly allowed to
    block as well.

    Resolve this by attempting an emergency allocation as a best
    effort.  This may fail but that's fine since any KM_NOSLEEP
    consumer is required to handle an allocation failure.

    Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
    Reviewed-by: Adam Moss <c@yotes.com>
    Reviewed-by: Brian Atkinson <batkinson@lanl.gov>
    Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
    Reviewed-by: Tony Hutter <hutter2@llnl.gov>

commit d4ecd4efde1692641d1d0b89851e7a15e90632f8
Author: George Amanakis <gamanakis@gmail.com>
Date:   Tue Jun 6 21:32:37 2023 +0200

    Fix the L2ARC write size calculating logic

    l2arc_write_size() should return the write size after adjusting for trim
    and overhead of the L2ARC log blocks. Also take into account the
    allocated size of log blocks when deciding when to stop writing buffers
    to L2ARC.

    Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
    Signed-off-by: George Amanakis <gamanakis@gmail.com>
    Closes #14939

commit 8692ab174e18faf444681d67d7ea4418600553cc
Author: Rob Norris <rob.norris@klarasystems.com>
Date:   Wed Mar 15 18:18:10 2023 +1100

    zdb: add -B option to generate backup stream

    This is more-or-less like `zfs send`, but specifying the snapshot by its
    objset id for situations where it can't be referenced any other way.

    Sponsored-By: Klara, Inc.
    Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de>
    Reviewed-by: WHR <msl0000023508@gmail.com>
    Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
    Closes #14642
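
An invocation might look like the following ("tank" and the objset id are placeholders, and the exact argument syntax is the one documented in zdb(8)):

# generate a backup stream for objset 54 of pool "tank"
zdb -B tank/54 > backup.zstream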

commit df84ca3f3bf9f265ebc76de17394df529fd07af6
Author: Andrew Innes <andrew.c12@gmail.com>
Date:   Wed Jun 28 11:05:55 2023 +0800

    Windows: znode: expose zfs_get_zplprop to libzpool

    Signed-off-by: Andrew Innes <andrew.c12@gmail.com>

commit 944c58247a13a92c9e4ffb2c0a9e6b6293dca37e
Author: Rob Norris <rob.norris@klarasystems.com>
Date:   Sun Jun 4 11:14:20 2023 +1000

    znode: expose zfs_get_zplprop to libzpool

    There's no particular reason this function should be kernel-only, and I
    want to use it (indirectly) from zdb. I've moved it to zfs_znode.c
    because libzpool does not compile in zfs_vfsops.c, and this at least
    matches the header it's imported from.

    Sponsored-By: Klara, Inc.
    Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de>
    Reviewed-by: WHR <msl0000023508@gmail.com>
    Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
    Closes #14642

commit 429f58cdbb195c8d50ed95c7309ee54d37526b70
Author: Alexander Motin <mav@FreeBSD.org>
Date:   Mon Jun 5 14:51:44 2023 -0400

    Introduce zfs_refcount_(add|remove)_few().

    There are two places where we need to add/remove several references
    with the semantics of zfs_refcount_(add|remove). But when
    debug/tracing is disabled, it is wasteful to run multiple
    atomic_inc() calls in a loop, especially under a congested
    pool-wide allocator lock.

    The newly introduced functions implement the same semantics as the
    loop, but without the overhead in production builds.

    Reviewed-by: Rich Ercolani <rincebrain@gmail.com>
    Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
    Signed-off-by:	Alexander Motin <mav@FreeBSD.org>
    Sponsored by:	iXsystems, Inc.
    Closes #14934
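
A self-contained analogue of the idea in C11 atomics (the type, field, and macro names here are made up; this is not the ZFS implementation): when per-reference tracking is compiled out, adding n references collapses into a single atomic add:

#include <stdatomic.h>
#include <stdint.h>

typedef struct refcount {
        _Atomic uint64_t rc_count;
} refcount_t;

static inline void
refcount_add_few(refcount_t *rc, uint64_t n)
{
#ifdef REFCOUNT_TRACKING
        /* debug builds: keep the per-reference semantics */
        for (uint64_t i = 0; i < n; i++)
                atomic_fetch_add(&rc->rc_count, 1);
#else
        /* production builds: one atomic op instead of n */
        atomic_fetch_add(&rc->rc_count, n);
#endif
}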

commit 077c2f359feb69a13bee37ac4220d271d1c7bf27
Author: Brian Behlendorf <behlendorf1@llnl.gov>
Date:   Mon Jun 5 11:08:24 2023 -0700

    Linux 6.3 compat: META (#14930)

    Update the META file to reflect compatibility with the 6.3 kernel.

    Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
    Reviewed-by: Tony Hutter <hutter2@llnl.gov>

commit c2fcd6e484107fc7435087771757e88ba84f6093
Author: Graham Perrin <grahamperrin@gmail.com>
Date:   Fri Jun 2 19:25:13 2023 +0100

    zfs-create(8): ZFS for swap: caution, clarity

    Make the section heading more generic (the section relates to ZFS files
    as well as ZFS volumes).

    Swapping to a ZFS volume is prone to deadlock. Remove the related
    instruction, direct readers to OpenZFS FAQ. Related, but not linked
    from within the manual page:

    <https://openzfs.github.io/openzfs-docs/Project%20and%20Community/FAQ.html#using-a-zvol-for-a-swap-device-on-linux>
    (Using a zvol for a swap device on Linux).

    Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
    Signed-off-by: Graham Perrin <grahamperrin@freebsd.org>
    Issue #7734
    Closes #14756

commit 251dbe83e14085a26100aa894d79772cbb69dcda
Author: Alexander Motin <mav@FreeBSD.org>
Date:   Fri Jun 2 14:01:58 2023 -0400

    ZIL: Allow to replay blocks of any size.

    There seems to be no reason for ZIL blocks to be limited to 128KB
    other than that the replay code is written in such a way.  This
    change does not increase the limit yet, just removes the artificial
    limitation.

    The avoided extra memcpy() may save us a second during replay.

    Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
    Reviewed-by: Prakash Surya <prakash.surya@delphix.com>
    Signed-off-by:	Alexander Motin <mav@FreeBSD.org>
    Sponsored by:	iXsystems, Inc.
    Closes #14910

commit 76170249d538965655dbd3206cd59566b1d3944b
Author: Val Packett <val@packett.cool>
Date:   Thu May 11 18:16:57 2023 -0300

    PAM: enable testing on FreeBSD

    Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
    Reviewed-by: Felix Dörre <felix@dogcraft.de>
    Signed-off-by: Val Packett <val@packett.cool>
    Closes #14834

commit d1b68a45441cae8c399a8a3ed60b29726ed031ff
Author: Val Packett <val@packett.cool>
Date:   Fri May 5 22:17:12 2023 -0300

    PAM: support password changes even when not mounted

    There's usually no requirement that a user be logged in for changing
    their password, so let's not be surprising here.

    We need to use the fetch_lazy mechanism for the old password to avoid
    a double prompt for it, so that mechanism is now generalized a bit.

    Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
    Reviewed-by: Felix Dörre <felix@dogcraft.de>
    Signed-off-by: Val Packett <val@packett.cool>
    Closes #14834

commit 7424feff72f1e17ea27bcfe0d36cabce7c732eea
Author: Val Packett <val@packett.cool>
Date:   Fri May 5 22:34:58 2023 -0300

    PAM: add 'uid_min' and 'uid_max' options for changing the uid range

    Instead of a fixed >=1000 check, allow the configuration to override
    the minimum UID and add a maximum one as well. While here, add the
    uid range check to the authenticate method as well, and fix the return
    in the chauthtok method (seems very wrong to report success when we've
    done absolutely nothing).

    Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
    Reviewed-by: Felix Dörre <felix@dogcraft.de>
    Signed-off-by: Val Packett <val@packett.cool>
    Closes #14834

commit fc9e012f5fc7e7997acee2b6d8d759622b319f0e
Author: Val Packett <val@packett.cool>
Date:   Fri May 5 22:02:13 2023 -0300

    PAM: add 'forceunmount' flag

    Probably not always a good idea, but it's nice to have the option.
    It is a workaround for FreeBSD calling the PAM session end earlier
    than the last process is actually done touching the mount, for
    example.

    Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
    Reviewed-by: Felix Dörre <felix@dogcraft.de>
    Signed-off-by: Val Packett <val@packett.cool>
    Closes #14834

commit a39ed83bd31cc0c8c98dc3c4cc3d11b03d9af620
Author: Val Packett <val@packett.cool>
Date:   Fri May 5 19:35:57 2023 -0300

    PAM: add 'recursive_homes' flag to use with 'prop_mountpoint'

    It's not always desirable to have a fixed flat homes directory.
    With the 'recursive_homes' flag, 'prop_mountpoint' search would
    traverse the whole tree starting at 'homes' (which can now be '*'
    to mean all pools) to find a dataset with a mountpoint matching
    the home directory.

    Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
    Reviewed-by: Felix Dörre <felix@dogcraft.de>
    Signed-off-by: Val Packett <val@packett.cool>
    Closes #14834

commit 7f8d5ef815b7559fcc671ff2add33ba9c2a74867
Author: Val Packett <val@packett.cool>
Date:   Fri May 5 21:56:39 2023 -0300

    PAM: use boolean_t for config flags

    Since we already use boolean_t in the file, we can use it here.

    Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
    Reviewed-by: Felix Dörre <felix@dogcraft.de>
    Signed-off-by: Val Packett <val@packett.cool>
    Closes #14834

commit e2872932c85189f06a68f0ad10bd8eb6895d79c2
Author: Val Packett <val@packett.cool>
Date:   Fri May 5 20:00:48 2023 -0300

    PAM: do not fail to mount if the key's already loaded

    If we're expecting a working home directory on login, it would be
    rather frustrating to not have it mounted just because it e.g. failed to
    unmount once on logout.

    Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
    Reviewed-by: Felix Dörre <felix@dogcraft.de>
    Signed-off-by: Val Packett <val@packett.cool>
    Closes #14834

commit b897137e2044c3ef6120820f753d940b7dfb58be
Author: Rich Ercolani <214141+rincebrain@users.noreply.github.com>
Date:   Wed May 31 19:58:41 2023 -0400

    Revert "initramfs: use `mount.zfs` instead of `mount`"

    This broke mounting of snapshots on / for users.

    See https://github.com/openzfs/zfs/issues/9461#issuecomment-1376162949 for more context.

    Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
    Signed-off-by: Rich Ercolani <rincebrain@gmail.com>
    Closes #14908

commit 10cde4f8f60d4d55887d7122a5742e6e4f90280c
Author: Luís Henriques <73643340+lumigch@users.noreply.github.com>
Date:   Tue May 30 23:15:24 2023 +0100

    Fix NULL pointer dereference when doing concurrent 'send' operations

    A NULL pointer dereference will occur when doing a 'zfs send -S' on
    a dataset that is still being received.  The problem is that the
    new 'send' will rightfully fail to own the datasets (i.e.
    dsl_dataset_own_force() will fail), but then dmu_send() will still
    do the dsl_dataset_disown().

    Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
    Signed-off-by: Luís Henriques <henrix@camandro.org>
    Closes #14903
    Closes #14890

commit 12452d79a3fd29af1dc0b95f3e367e3ce339702b
Author: Brian Behlendorf <behlendorf1@llnl.gov>
Date:   Mon May 29 12:55:35 2023 -0700

    ZTS: zvol_misc_trim disable blk mq

    Disable the zvol_misc_fua.ksh and zvol_misc_trim.ksh test cases on
    impacted kernels.  This issue is being actively worked on in #14872,
    and as part of that fix this commit will be reverted.

        VERIFY(zh->zh_claim_txg == 0) failed
        PANIC at zil.c:904:zil_create()

    Reviewed-by: Tony Hutter <hutter2@llnl.gov>
    Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
    Issue #14872
    Closes #14870

commit 803c04f233e60a2d23f0463f299eba96c0968602
Author: Richard Yao <richard.yao@alumni.stonybrook.edu>
Date:   Fri May 26 18:47:52 2023 -0400

    Use __attribute__((malloc)) on memory allocation functions

    This informs the C compiler that pointers returned from these
    functions do not alias any other pointers, which allows it to do
    better code optimization and should make the compiled code smaller.

    References:
    https://stackoverflow.com/a/53654773
    https://gcc.gnu.org/onlinedocs/gcc/Common-Function-Attributes.html#index-malloc-function-attribute
    https://clang.llvm.org/docs/AttributeReference.html#malloc

    Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
    Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
    Closes #14827
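
For illustration, the attribute is applied like this (my_alloc is a made-up example, not a ZFS function):

#include <stdlib.h>

/* Tells the compiler the returned pointer aliases no existing
 * object, enabling better alias analysis and smaller code. */
__attribute__((malloc))
static void *
my_alloc(size_t n)
{
        return (malloc(n));
}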

commit 64d8bbe15f77876ae9639b9971a743776a41bf9a
Author: Brian Behlendorf <behlendorf1@llnl.gov>
Date:   Fri May 26 15:39:23 2023 -0700

    ZTS: Add zpool_resilver_concurrent exception

    The zpool_resilver_concurrent test case requires the ZED, which is
    not used on FreeBSD.  Add this test to the known list of skipped
    tests for FreeBSD.

    Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
    Closes #14904

commit e396d30d29ed131194605222e6ba1fec1ef8b2ca
Author: Mike Swanson <mikeonthecomputer@gmail.com>
Date:   Fri May 26 15:37:15 2023 -0700

    Add compatibility symlinks for FreeBSD 12.{3,4} and 13.{0,1,2}

    Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
    Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
    Signed-off-by: Mike Swanson <mikeonthecomputer@gmail.com>
    Closes #14902

commit f6dd0b8c1cc41707d299b7123f80912f43d03340
Author: Colm <colm@tuatha.org>
Date:   Fri May 26 10:04:19 2023 -0700

    Adding new read-only compatible zpool features to compatibility.d/grub2

    GRUB2 is compatible with all "read-only compatible" features,
    so it is safe to add new features of this type to the grub2
    compatibility list. We generally want to include all compatible
    features, to minimize the differences between grub2-compatible
    pools and no-compatibility pools.

    Adding new properties `livelist` and `zpool_checkpoint` accordingly.

    Also adding them to the man page which references this file as an
    example, for consistency.

    Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
    Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
    Signed-off-by: Colm Buckley <colm@tuatha.org>
    Closes #14893

commit 013d3a1e0e00d83dabe70837b23dab48c1bac592
Author: Richard Yao <richard.yao@alumni.stonybrook.edu>
Date:   Fri May 26 13:03:12 2023 -0400

    btree: Implement faster binary search algorithm

    This implements a binary search algorithm for B-Trees that reduces
    branching to the absolute minimum necessary for a binary search
    algorithm. It also enables the compiler to inline the comparator to
    ensure that the only slowdown when doing binary search is from waiting
    for memory accesses. Additionally, it instructs the compiler to unroll
    the loop, which gives an additional 40% improvement with Clang and
    8% improvement with GCC.

    Consumers must opt into using the faster algorithm. At present, only
    B-Trees used inside kernel code have been modified to use the faster
    algorithm.

    Micro-benchmarks suggest that this can improve binary search performance
    by up to 3.5 times when compiling with Clang 16 and up to 1.9 times when
    compiling with GCC 12.2.

    Reviewed-by: Alexander Motin <mav@FreeBSD.org>
    Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
    Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
    Closes #14866
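
As a self-contained sketch of the technique, simplified to an int array with the `<` comparison inlined (the actual ZFS code is generic over element size and comparator, and is not reproduced here): each step replaces the taken/not-taken branch of a classic binary search with a conditional move, and the fixed halving pattern is what lets the compiler unroll the loop.

#include <stdint.h>

/* Branchless lower bound over a sorted array (assumes n >= 1):
 * returns the first index whose element is >= key. */
static uint32_t
branchless_search(const int *a, uint32_t n, int key)
{
        const int *base = a;
        while (n > 1) {
                uint32_t half = n / 2;
                /* compiles to a conditional move, not a branch */
                base = (base[half - 1] < key) ? base + half : base;
                n -= half;
        }
        return ((uint32_t)(base - a) + (uint32_t)(*base < key));
}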

commit 1854df330aa57cda39f076e8ab11e17ca3697bb8
Author: George Amanakis <gamanakis@gmail.com>
Date:   Fri May 26 18:53:00 2023 +0200

    Fix inconsistent definition of zfs_scrub_error_blocks_per_txg

    Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
    Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
    Signed-off-by: George Amanakis <gamanakis@gmail.com>
    Closes #14894

commit 8735e6ac03742fcf43adde3ce127af698a32c53a
Author: Damiano Albani <damiano.albani@gmail.com>
Date:   Fri May 26 01:10:54 2023 +0200

    Add missing files to Debian DKMS package

    Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de>
    Reviewed-by: Umer Saleem <usaleem@ixsystems.com>
    Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
    Signed-off-by: Damiano Albani <damiano.albani@gmail.com>
    Closes #14887
    Closes #14889

commit d439021bd05a5cc0bb271a5470abb67af2f7bcda
Author: Brian Behlendorf <behlendorf1@llnl.gov>
Date:   Thu May 25 13:53:08 2023 -0700

    Update compatibility.d files

    Add an openzfs-2.2 compatibility file for the next release.

    Edon-R support has been enabled for FreeBSD, removing the need
    for different FreeBSD and Linux files.  Symlinks for the -linux
    and -freebsd names are created for any scripts expecting that
    convention.

    Additionally, a symlink for ubuntu-22.04 was added.

    Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
    Closes #14833

commit da54d5f3f9576b958e3eadf4f4d8f68c91b3d6e4
Author: Alexander Motin <mav@FreeBSD.org>
Date:   Thu May 25 16:51:53 2023 -0400

    zil: Add some more statistics.

    In addition to the number of actual log bytes written, also account
    the total bytes written including padding and the total bytes
    allocated (bytes <= write <= alloc).  This should allow monitoring
    of ZIL traffic and space efficiency.

    Add dtrace probe for zil block size selection.

    Make zilstat report more information and fit it into less width.

    Reviewed-by: Ameer Hamza <ahamza@ixsystems.com>
    Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
    Signed-off-by:  Alexander Motin <mav@FreeBSD.org>
    Sponsored by:   iXsystems, Inc.
    Closes #14863

commit faa4955023d089668bd6c564c195a933d1eac455
Author: Alexander Motin <mav@FreeBSD.org>
Date:   Thu May 25 12:48:43 2023 -0400

    ZIL: Reduce scope of per-dataset zl_issuer_lock.

    Before this change the ZIL copied all log data while holding the
    lock.  This caused huge lock contention on workloads with many big
    parallel writes.  This change splits the process into two parts:
    first, zil_lwb_assign() estimates the log space needed for all
    transactions, and zil_lwb_write_close() allocates blocks and zios
    while holding the lock; then, after the lock is dropped,
    zil_lwb_commit() copies the data, and zil_lwb_write_issue() issues
    the I/Os.

    While here, also slightly reduce the scope of zl_lock.

    Reviewed-by: Paul Dagnelie <pcd@delphix.com>
    Reviewed-by: Prakash Surya <prakash.surya@delphix.com>
    Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
    Signed-off-by:  Alexander Motin <mav@FreeBSD.org>
    Sponsored by:   iXsystems, Inc.
    Closes #14841

commit f77b9f7ae83834ade1da21cfc16b8a273df3acfc
Author: Dimitri John Ledkov <19779+xnox@users.noreply.github.com>
Date:   Wed May 24 20:31:28 2023 +0100

    systemd: Use non-absolute paths in Exec* lines

    Since systemd v239, Exec* binaries are resolved from PATH when they
    are not absolute. Switch to this by default for ease of downstream
    maintenance. Many downstream distributions move individual binaries
    to locations that existing compile-time configurations cannot
    accommodate.

    Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
    Signed-off-by: Dimitri John Ledkov <dimitri.ledkov@canonical.com>
    Closes #14880

commit 4bfb9d28cffd4dfeb4b91359b497d100f668bb34
Author: Akash B <akash-b@hpe.com>
Date:   Thu May 25 00:58:09 2023 +0530

    Fix concurrent resilvers initiated at same time

    For draid vdevs it was possible to initiate both the
    sequential and healing resilver at same time.

    This fixes the following two scenarios.

    1) There's a window where a sequential rebuild can be started via
       ZED even if a healing resilver has been scheduled.
       - This is fixed by adding an additional check in
         spa_vdev_attach() for any scheduled resilver and returning an
         appropriate error code when a resilver is already in progress.

    2) It was possible for zpool clear to start a healing resilver when
       it wasn't needed at all. This occurs because during vdev_open()
       the device is presumed to be healthy until it is validated by
       vdev_validate() and set unavailable. However, by that point an
       async resilver will have already been requested if the DTL isn't
       empty.
       - This is fixed by cancelling the SPA_ASYNC_RESILVER request
         immediately at the end of vdev_reopen() when a resilver is
         unneeded.

    Finally, a testcase was added in ZTS for verification.

    Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
    Reviewed-by: Dipak Ghosh <dipak.ghosh@hpe.com>
    Signed-off-by: Akash B <akash-b@hpe.com>
    Closes #14881
    Closes #14892

commit c9bb406d177a00aa1f0058d29aeb29e478223273
Author: youzhongyang <youzhong@gmail.com>
Date:   Wed May 24 15:23:42 2023 -0400

    Linux 6.4 compat: reclaimed_slab renamed to reclaimed

    Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
    Reviewed-by: Brian Atkinson <batkinson@lanl.gov>
    Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
    Signed-off-by: Youzhong Yang <yyang@mathworks.com>
    Closes #14891

commit 79e61a873b136f13fcf140beb925ceddc1f94767
Author: Brian Atkinson <batkinson@lanl.gov>
Date:   Fri May 19 16:05:53 2023 -0400

    Hold db_mtx when updating db_state

    Commit 555ef90 did some general code refactoring for
    dmu_buf_will_not_fill() and dmu_buf_will_fill(). However, the db_mtx
    was not held when updating db->db_state in those code blocks. The
    rest of the dbuf code always holds the db_mtx when updating
    db_state. This is important because cv_wait() on db_changed is used
    to check for db_state changes.

    Update dmu_buf_will_not_fill() and dmu_buf_will_fill() to hold the
    db_mtx when updating db_state.

    Reviewed-by: Alexander Motin <mav@FreeBSD.org>
    Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
    Signed-off-by: Brian Atkinson <batkinson@lanl.gov>
    Closes #14875
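
The locking pattern being restored, as a fragment (DB_NOFILL stands in here for whatever state a given call site sets):

mutex_enter(&db->db_mtx);
db->db_state = DB_NOFILL;       /* db_state changes only under db_mtx */
cv_broadcast(&db->db_changed);  /* wake cv_wait()ers watching db_state */
mutex_exit(&db->db_mtx);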

commit d7be0cdf93a568b6c9b4a4e15a88a5d88ebbb764
Author: Brian Behlendorf <behlendorf1@llnl.gov>
Date:   Fri May 19 13:05:09 2023 -0700

    Probe vdevs before marking removed

    Before allowing the ZED to mark a vdev as REMOVED due to a
    hotplug event, confirm that it is non-responsive with a probe.
    Any device which can be successfully probed should be left
    ONLINE to prevent a healthy pool from being incorrectly
    SUSPENDED.  This may occur for at least the following two
    scenarios.

    1) Drive expansion (zpool online -e) in VMware environments.
       If, during the partition resize operation, a partition is
       removed and re-created then udev will send a removed event.

    2) Re-scanning the namespaces of an NVMe device (nvme ns-rescan)
       may result in a udev remove and add event being delivered.

    Finally, update the ZED to only kick in a spare when the
    removal was successful.

    Reviewed-by: Ameer Hamza <ahamza@ixsystems.com>
    Reviewed-by: Tony Hutter <hutter2@llnl.gov>
    Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
    Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
    Issue #14859
    Closes #14861

commit 054bb22686045ea1499065a4456568f0c21d939b
Author: Andrew Innes <andrew.c12@gmail.com>
Date:   Tue Jun 27 09:20:56 2023 +0800

    Windows: Teach zpool scrub to scrub only blocks in error log

    Signed-off-by: Andrew Innes <andrew.c12@gmail.com>

commit b61e89a3e68ae19819493183ff3d1fe7bf4ffe2b
Author: George Amanakis <gamanakis@gmail.com>
Date:   Fri Dec 17 21:35:28 2021 +0100

    Teach zpool scrub to scrub only blocks in error log

    Added a flag '-e' in zpool scrub to scrub only blocks in error log. A
    user can pause, resume and cancel the error scrub by passing additional
    command line arguments -p -s just like a regular scrub. This involves
    adding a new flag, creating new libzfs interfaces, a new ioctl, and the
    actual iteration and read-issuing logic. Error scrubbing is executed in
    multiple txgs to make sure pool performance is not affected.

    Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
    Reviewed-by: Tony Hutter <hutter2@llnl.gov>
    Co-authored-by: TulsiJain tulsi.jain@delphix.com
    Signed-off-by: George Amanakis <gamanakis@gmail.com>
    Closes #8995
    Closes #12355
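
Example usage (the pool name is a placeholder):

# scrub only the blocks recorded in the pool's error log
zpool scrub -e tank

# pause or stop the error scrub, as with a regular scrub
zpool scrub -p tank
zpool scrub -s tank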

commit 61bfb3cb5dd792ec7ca0fbfca59b165f3ddbe1f5
Author: Brian Behlendorf <behlendorf1@llnl.gov>
Date:   Thu May 18 10:02:20 2023 -0700

    Add the ability to uninitialize

    zpool initialize functions well for touching every free byte...once.
    But if we want to do it again, we're currently out of luck.

    So let's add zpool initialize -u to clear it.

    Co-authored-by: Rich Ercolani <rincebrain@gmail.com>
    Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
    Signed-off-by: Rich Ercolani <rincebrain@gmail.com>
    Closes #12451
    Closes #14873
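
Example usage (the pool name is a placeholder):

# clear the previous initialization state, then start over
zpool initialize -u tank
zpool initialize tank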

commit 855b62942d4ca5dab3d65b7000f9d284fd1560bb
Author: Antonio Russo <aerusso@aerusso.net>
Date:   Mon May 15 17:11:33 2023 -0600

    test-runner: pass kmemleak and kmsg to Cmd.run

    test-runner.py orchestrates all of the ZTS executions. The `Cmd` object
    manages these processes, and its `run` method specifically invokes these
    possibly long-running processes, possibly retrying in the event of a
    timeout. Since its inception, memory leak detection using the kmemleak
    infrastructure [1], and kernel logging [2] have been added to this run
    mechanism.

    However, the callback to cull a process beyond its timeout threshold,
    `kill_cmd`, has evaded modernization by both of these changes. As a
    result, this function fails to properly invoke `run`, leading to an
    untrapped exception and unreported test failure.

    This patch extends `kill_cmd` to receive these kernel devices through
    the `options` parameter, and regularizes all the `.run` calls from
    `Cmd`, and its subclasses, to accept that parameter.

    [1] Commit a69765ea5b563e0cd4d15fac4b1ac08c6ccf12d1
    [2] Commit fc2c0256c55a2859d1988671b0896d22b75c8aba

    Reviewed-by: John Wren Kennedy <john.kennedy@delphix.com>
    Signed-off-by: Antonio Russo <aerusso@aerusso.net>
    Closes #14849

commit 537939565123fd2afa097e9a56ee3efd28779e5f
Author: Richard Yao <richard.yao@alumni.stonybrook.edu>
Date:   Fri May 12 17:10:14 2023 -0400

    Fix undefined behavior in spa_sync_props()

    8eae2d214cfa53862833eeeda9a5c1e9d5ded47d caused Coverity to begin
    complaining about "Improper use of negative value" in two places in
    spa_sync_props() because Coverity correctly inferred from `prop ==
    ZPOOL_PROP_INVAL` that prop could be -1 while both zpool_prop_to_name()
    and zpool_prop_get_type() use it as an array index, which is undefined
    behavior.

    Assuming that the system does not panic from an attempt to read invalid
    memory, the case statement for ZPOOL_PROP_INVAL will ensure that only
    user properties will reach this code when prop is ZPOOL_PROP_INVAL, such
    that execution will continue safely. However, if we are unlucky enough
    to read invalid memory, then the system will panic.

    This issue predates the patch that caused Coverity to begin complaining.
    Thankfully, our userland tools do not pass nonsense to us, so this bug
    should not be triggered unless a future userland tool attempts to set a
    property that we do not understand.

    Reported-by: Coverity (CID-1561129)
    Reported-by: Coverity (CID-1561130)
    Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
    Reviewed-by: George Amanakis <gamanakis@gmail.com>
    Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
    Closes #14860

commit 02351b380f0430980bfb92e83d0800df104bd06a
Author: Richard Yao <richard.yao@alumni.stonybrook.edu>
Date:   Fri May 12 16:47:56 2023 -0400

    Fix use after free regression in spa_remove_healed_errors()

    6839ec6f1098c28ff7b772f1b31b832d05e6b567 placed code in
    spa_remove_healed_errors() that uses a pointer after the kmem_free()
    call that frees it.

    Reported-by: Coverity (CID-1562375)
    Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
    Reviewed-by: George Amanakis <gamanakis@gmail.com>
    Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
    Closes #14860
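
The general shape of such a fix, as a self-contained illustration (not the ZFS code): copy whatever is still needed out of the structure before freeing it.

#include <stdint.h>
#include <stdlib.h>

struct entry {
        uint64_t obj;
};

static uint64_t
consume(struct entry *e)
{
        uint64_t obj = e->obj;  /* capture first ... */
        free(e);                /* ... then free ... */
        return (obj);           /* ... and never touch e again */
}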

commit e9b315ffb79ff6419694a2713fcd5fd448317904
Author: Andrew Innes <andrew.c12@gmail.com>
Date:   Mon May 15 13:52:35 2023 +0800

    Use python3 on windows

commit 3346a5b78c2db15801ce54a70a323952fdf67fa5
Author: Jorgen Lundman <lundman@lundman.net>
Date:   Thu Jun 22 08:56:38 2023 +0900

    zfs_write() ignores errors

    If files were advanced by zfs_freesp(), we ignored
    any errors returned by it.

    Signed-off-by: Jorgen Lundman <lundman@lundman.net>

commit cce49c08316bc6a5dff287f4fa15856e26d5b18a
Author: Jorgen Lundman <lundman@lundman.net>
Date:   Thu Jun 22 08:55:55 2023 +0900

    Correct Stream event path

    The Stream path events used the incorrect name
    "stream"; they now use "file.txt:stream", as per NTFS.

    Signed-off-by: Jorgen Lundman <lundman@lundman.net>

commit 0f83d31e288d789fb4e10a7e4b12e27887820498
Author: Jorgen Lundman <lundman@lundman.net>
Date:   Wed Jun 21 14:30:13 2023 +0900

    Add stub for file_hard_link_information()

    Signed-off-by: Jorgen Lundman <lundman@lundman.net>

commit 8d6db9490364e4d281546445571d2ca9d5abda22
Author: Jorgen Lundman <lundman@lundman.net>
Date:   Wed Jun 21 14:29:43 2023 +0900

    Return correct FileID in dirlist

    Signed-off-by: Jorgen Lundman <lundman@lundman.net>

commit 4c011397229e3c38259d6956458a4fd287dca72d
Author: Andrew Innes <andrew.c12@gmail.com>
Date:   Wed Jun 21 10:17:30 2023 +0800

    Fix logic (#232)

    Signed-off-by: Andrew Innes <andrew.c12@gmail.com>

commit 467436b676ad897025b7ed90d8f033969da441cc
Author: Andrew Innes <andrew.c12@gmail.com>
Date:   Wed Jun 21 09:47:38 2023 +0800

    Run winbtrfs tests by default (#231)

    Signed-off-by: Andrew Innes <andrew.c12@gmail.com>

commit 56eca2a5d116c66b10579f9cf6d5f271991c7e2e
Author: Jorgen Lundman <lundman@lundman.net>
Date:   Wed Jun 21 09:54:00 2023 +0900

    SetFilePositionInformation SetFileValidDataLengthInformation

    Signed-off-by: Jorgen Lundman <lundman@lundman.net>

commit b4fbbda470f27aee565dfa9bc0d68217b969339c
Author: Andrew Innes <andrew.c12@gmail.com>
Date:   Tue Jun 20 16:33:12 2023 +0800

    Add sleep to tests (#230)

    Signed-off-by: Andrew Innes <andrew.c12@gmail.com>

commit 94f1f52807d1f8c0c2931e9e52b91f0ce5e488f4
Author: Jorgen Lundman <lundman@lundman.net>
Date:   Tue Jun 20 16:53:50 2023 +0900

    CreateFile of newfile:newstream should create both

    In addition, many more stream fixes, illegal chars, and names

    Signed-off-by: Jorgen Lundman <lundman@lundman.net>

commit 894d512880d39ecf40e841c6d7b73157dfe397e0
Author: Jorgen Lundman <lundman@lundman.net>
Date:   Tue Jun 20 08:41:37 2023 +0900

    Windows streams should return parent file ID

    When asked for the File ID of a stream, it should return
    the File ID of the parent file, which is two levels up.

    Signed-off-by: Jorgen Lundman <lundman@lundman.net>

commit 0cc45d2154a2866b2f494c3790a57555c29e60c3
Author: Jorgen Lundman <lundman@lundman.net>
Date:   Tue Jun 20 08:32:44 2023 +0900

    Support FILE_STANDARD_INFORMATION_EX

    Signed-off-by: Jorgen Lundman <lundman@lundman.net>

commit a6edd02999d581db56f4a53567f4c5db11778f64
Author: Jorgen Lundman <lundman@lundman.net>
Date:   Mon Jun 19 10:36:13 2023 +0900

    Add xattr compat code from upstream

    and adjust the calls to the new API.
    This adds xattr=sa support to Windows.

    Signed-off-by: Jorgen Lundman <lundman@lundman.net>

commit 0e1476a3942990385d32c02403ebe2c815d567db
Author: Jorgen Lundman <lundman@lundman.net>
Date:   Wed Jun 14 11:56:09 2023 +0900

    Set EA can panic

    Signed-off-by: Jorgen Lundman <lundman@lundman.net>

commit 4a1adef6b8c2851195d692a42d5718c9a1b03490
Author: Jorgen Lundman <lundman@lundman.net>
Date:   Wed Jun 14 09:49:57 2023 +0900

    Incorrect MAXPATH used in delete entry

    Signed-off-by: Jorgen Lundman <lundman@lundman.net>

commit 2c0d119e37cb3eed1acac90efa9fe0f8c173e0f0
Author: Jorgen Lundman <lundman@lundman.net>
Date:   Tue Jun 13 16:19:42 2023 +0900

    Large changes fixing FS notify events

    Some incorrect behavior remains; querying the name of
    a stream is wrong.

    Signed-off-by: Jorgen Lundman <lundman@lundman.net>

commit 5b2b2b0550a493497a0b460206079fd57c639543
Author: Jorgen Lundman <lundman@lundman.net>
Date:   Tue May 16 14:42:52 2023 +0900

    file name and file full information buffer overrun

    When a buffer is not big enough, we would still
    null-terminate at the end of the full string, beyond the supplied
    buffer.

    Signed-off-by: Jorgen Lundman <lundman@lundman.net>
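
A self-contained sketch of the corrected pattern (the names are illustrative): the terminator must land inside the caller's buffer even when the source string is longer.

#include <stddef.h>
#include <string.h>

/* Copies as much of src as fits and terminates inside buf, never
 * at the end of the full (possibly longer) source string. */
static void
copy_bounded(char *buf, size_t buflen, const char *src)
{
        if (buflen == 0)
                return;
        size_t n = strnlen(src, buflen - 1);
        memcpy(buf, src, n);
        buf[n] = '\0';          /* always within the supplied buffer */
}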

commit 94bfb92951a5ccdef7b2a1fb818fafdafbc4fff0
Author: Jorgen Lundman <lundman@lundman.net>
Date:   Tue May 16 11:48:12 2023 +0900

    Correct Query EA and Query Streams

    Which includes:

    * NextEntryOffset is not an offset from Buffer, but from one struct
      to the next struct.
    * Pack only complete EAs, and return Overflow if one does not fit
    * Query file EA information would return Information=size
    * Call cleareaszie on VP when EAs have changed

    Signed-off-by: Jorgen Lundman <lundman@lundman.net>

commit 9c7a4071fcfc99c3308620fc1943355f9ade34b3
Author: Alexander Motin <mav@FreeBSD.org>
Date:   Fri May 12 12:49:26 2023 -0400

    zil: Free lwb_buf after write completion.

    There is no sense in keeping that memory allocated during the flush.

    Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
    Reviewed-by: Prakash Surya <prakash.surya@delphix.com>
    Signed-off-by:  Alexander Motin <mav@FreeBSD.org>
    Sponsored by:   iXsystems, Inc.
    Closes #14855

commit 7e91b3222ddaadc10c92d1065529886dd3806acc
Author: Alexander Motin <mav@FreeBSD.org>
Date:   Fri May 12 12:14:29 2023 -0400

    zil: Some micro-optimizations.

    Should not cause functional changes.

    Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
    Signed-off-by:  Alexander Motin <mav@FreeBSD.org>
    Sponsored by:   iXsystems, Inc.
    Closes #14854

commit 6b62c3b0e10de782c3aef0e1206aa48875519c4e
Author: Don Brady <dev.fs.zfs@gmail.com>
Date:   Fri May 12 10:12:28 2023 -0600

    Refine special_small_blocks property validation

    When the special_small_blocks property is being set during a pool
    create, it enforces a limit of 128KiB even if the pool's record size
    is larger.

    If the recordsize property is being set during a pool create, then
    use that value instead of the default SPA_OLD_MAXBLOCKSIZE value.

    Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
    Signed-off-by: Don Brady <dev.fs.zfs@gmail.com>
    Closes #13815
    Closes #14811

commit d0ab2dddde618c394fa7fe88211276786ba8ca12
Author: Brian Behlendorf <behlendorf1@llnl.gov>
Date:   Fri May 12 09:07:58 2023 -0700

    ZTS: Add auto_replace_001_pos to exceptions

    The auto_replace_001_pos test case does not reliably pass on
    Fedora 37 and newer.  Until the test case can be updated to make
    it reliable add it to the list of "maybe" exceptions on Linux.

    Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
    Issue #14851
    Closes #14852

commit 1e3e7a103a5026e9a2005acec7017e4024d95115
Author: Pawel Jakub Dawidek <pawel@dawidek.net>
Date:   Tue May 9 22:32:30 2023 -0700

    Make sure we are not trying to clone a spill block.

    Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
    Signed-off-by: Pawel Jakub Dawidek <pawel@dawidek.net>
    Closes #14825

commit a22891c3272d8527d4c8cb7ff52a25ef396e7add
Author: Pawel Jakub Dawidek <pawel@dawidek.net>
Date:   Thu May 4 16:14:19 2023 -0700

    Correct comment.

    Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
    Signed-off-by: Pawel Jakub Dawidek <pawel@dawidek.net>
    Closes #14825

commit 9b016166dd5875db87963b5deeca8eeda094b571
Author: Pawel Jakub Dawidek <pawel@dawidek.net>
Date:   Wed May 3 23:25:22 2023 -0700

    Remove badly placed comment.

    Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
    Signed-off-by: Pawel Jakub Dawidek <pawel@dawidek.net>
    Closes #14825

commit 6bcd48e213a279781ecd6df22799532cbec353d6
Author: Pawel Jakub Dawidek <pawel@dawidek.net>
Date:   Wed May 3 00:24:47 2023 -0700

    Don't call zfs_exit_two() before zfs_enter_two().

    Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
    Signed-off-by: Pawel Jakub Dawidek <pawel@dawidek.net>
    Closes #14825

commit 0919c985e294a89169adacd5ed4a240945e5fbee
Author: Pawel Jakub Dawidek <pawel@dawidek.net>
Date:   Tue May 2 15:46:14 2023 -0700

    Don't use dmu_buf_is_dirty() for unassigned transaction.

    The dmu_buf_is_dirty() call doesn't make sense here for two reasons:
    1. txg is 0 for an unassigned tx, so it was a no-op.
    2. It is equivalent to checking if we have dirty records, and we are
       doing this a few lines earlier.

    Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
    Signed-off-by: Pawel Jakub Dawidek <pawel@dawidek.net>
    Closes #14825

commit 7f88494ac91c61aeffad810e7d167badb875166e
Author: Pawel Jakub Dawidek <pawel@dawidek.net>
Date:   Tue May 2 14:24:43 2023 -0700

    Deny block cloning if dbuf size doesn't match BP size.

    I don't know an easy way to shrink down dbuf size, so just deny block cloning
    into dbufs that don't match our BP's size.

    This fixes the following situation:
    1. Create a small file, eg. 1kB of random bytes. Its dbuf will be 1kB.
    2. Create a larger file, eg. 2kB of random bytes. Its dbuf will be 2kB.
    3. Truncate the large file to 0. Its dbuf will remain 2kB.
    4. Clone the small file into the large file. Small file's BP lsize is
       1kB, but the large file's dbuf is 2kB.

    Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
    Signed-off-by: Pawel Jakub Dawidek <pawel@dawidek.net>
    Closes #14825

commit 49657002f9cb57b9b4675100aaf58e1e93984bbf
Author: Pawel Jakub Dawidek <pawel@dawidek.net>
Date:   Sun Apr 30 02:47:09 2023 -0700

    Additional block cloning fixes.

    Reimplement some of the block cloning vs dbuf logic, mostly to fix
    situation where we clone a block and in the same transaction group
    we want to partially overwrite the clone.

    Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
    Signed-off-by: Pawel Jakub Dawidek <pawel@dawidek.net>
    Closes #14825

commit 4d31369d3055bf0cf1d4f3e1e7d43d745f2fd05f
Author: Alexander Motin <mav@FreeBSD.org>
Date:   Thu May 11 17:27:12 2023 -0400

    zil: Don't expect zio_shrink() to succeed.

    At least for RAIDZ zio_shrink() does not reduce the zio size, but a
    reduced wsz in that case likely results in writing uninitialized
    memory.

    Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
    Signed-off-by:  Alexander Motin <mav@FreeBSD.org>
    Sponsored by:   iXsystems, Inc.
    Closes #14853

commit 663dc5f616e6d0427207ffcf7a83dd02fe06a707
Author: Ameer Hamza <ahamza@ixsystems.com>
Date:   Wed May 10 05:56:35 2023 +0500

    Prevent panic during concurrent snapshot rollback and zvol read

    Protect zvol_cdev_read with zv_suspend_lock to prevent concurrent
    release of the dnode, avoiding panic when a snapshot is rolled back
    in parallel during ongoing zvol read operation.

    Reviewed-by: Chunwei Chen <tuxoko@gmail.com>
    Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
    Reviewed-by: Alexander Motin <mav@FreeBSD.org>
    Signed-off-by: Ameer Hamza <ahamza@ixsystems.com>
    Closes #14839

commit 7375f4f61ca587f893435184f398a767ae52fbea
Author: Tony Hutter <hutter2@llnl.gov>
Date:   Tue May 9 17:55:19 2023 -0700

    pam: Fix "buffer overflow" in pam ZTS tests on F38

    The pam ZTS tests were reporting a buffer overflow on F38, possibly
    due to F38 now setting _FORTIFY_SOURCE=3 by default.  gdb and
    valgrind narrowed this down to a snprintf() buffer overflow in
    zfs_key_config_modify_session_counter().  I'm not clear why this
    particular snprintf() was being flagged as an overflow, but when
    I replaced it with an asprintf(), the test passed reliably.

    Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
    Signed-off-by: Tony Hutter <hutter2@llnl.gov>
    Closes #14802
    Closes #14842

commit 9d3ed831f309e28a9cad56c8b1520292dbad0d7b
Author: Brian Behlendorf <behlendorf1@llnl.gov>
Date:   Tue May 9 09:03:10 2023 -0700

    Add dmu_tx_hold_append() interface

    Provides an interface which callers can use to declare a write when
    the exact starting offset is not yet known.  Since the full range
    being updated is not available, only the first L0 block at the
    provided offset will be prefetched.

    Reviewed-by: Olaf Faaland <faaland1@llnl.gov>
    Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
    Closes #14819

commit 2b6033d71da38015c885297d1ee6577871099744
Author: Brian Behlendorf <behlendorf1@llnl.gov>
Date:   Tue May 9 08:57:02 2023 -0700

    Debug auto_replace_001_pos failures

    Reduced the timeout to 60 seconds which should be more than
    sufficient and allow the test to be marked as FAILED rather
    than KILLED.  Also dump the pool status on cleanup.

    Reviewed-by: Brian Atkinson <batkinson@lanl.gov>
    Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
    Closes #14829

commit f4adc2882fb162c82e9738c5d2d30e3ba8a66367
Author: George Amanakis <gamanakis@gmail.com>
Date:   Tue May 9 17:54:41 2023 +0200

    Remove duplicate code in l2arc_evict()

    l2arc_evict() performs the adjustment of the size of buffers to be
    written on L2ARC unnecessarily. l2arc_write_size() is called right
    before l2arc_evict() and performs those adjustments.

    Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
    Reviewed-by: Brian Atkinson <batkinson@lanl.gov>
    Signed-off-by: George Amanakis <gamanakis@gmail.com>
    Closes #14828

commit 9b2c182d291bbb3ece9ceb1c72800d238d19b2e7
Author: Alexander Motin <mav@FreeBSD.org>
Date:   Tue May 9 11:54:01 2023 -0400

    Remove single parent assertion from zio_nowait().

    We only need to know whether the ZIO has any parent there.  We do
    not care if it has more than one, but using zio_unique_parent() ==
    NULL asserts exactly that.

    Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
    Signed-off-by:	Alexander Motin <mav@FreeBSD.org>
    Sponsored by:	iXsystems, Inc.
    Closes #14823
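
The distinction, roughly (the exact replacement check in zio_nowait()
is an assumption here):

/* Too strict: zio_unique_parent() itself asserts there is at most
 * one parent, a property zio_nowait() does not care about. */
ASSERT3P(zio_unique_parent(zio), ==, NULL);

/* Sufficient: only assert that no parent exists at all. */
ASSERT(list_is_empty(&zio->io_parent_list));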

commit 4def61804c052a1235179e3a7c98305d8075e0e9
Author: George Amanakis <gamanakis@gmail.com>
Date:   Tue May 9 17:53:27 2023 +0200

    Enable the head_errlog feature to remove errors

    In case check_filesystem() does not error out and does not report
    an error, remove that error block from error lists and logs
    without requiring a scrub. This can happen when the original file and
    all snapshots/clones referencing it have been removed.

    Otherwise zpool status will still report that "Permanent errors have
    been detected..." without actually reporting any of them.

    To implement this change the functions introduced in corrective
    receive were modified to take into account the head_errlog feature.

    Before this change:
    =============================
      pool: test
     state: ONLINE
    status: One or more devices has experienced an error resulting in data
            corruption.  Applications may be affected.
    action: Restore the file in question if possible.  Otherwise restore the
            entire pool from backup.
       see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-8A
    config:

            NAME                   STATE     READ WRITE CKSUM
            test                   ONLINE       0     0     0
              /home/user/vdev_a    ONLINE       0     0     2

    errors: Permanent errors have been detected in the following files:

    =============================

    After this change:
    =============================
      pool: test
     state: ONLINE
    status: One or more devices has experienced an unrecoverable error.  An
            attempt was made to correct the error.  Applications are
            unaffected.
    action: Determine if the device needs to be replaced, and clear the errors
            using 'zpool clear' or replace the device with 'zpool replace'.
       see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-9P
    config:

            NAME                   STATE     READ WRITE CKSUM
            test                   ONLINE       0     0     0
              /home/user/vdev_a    ONLINE       0     0     2

    errors: No known data errors
    =============================

    Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
    Reviewed-by: Brian Atkinson <batkinson@lanl.gov>
    Signed-off-by: George Amanakis <gamanakis@gmail.com>
    Closes #14813

commit 3f2f9533ca8512ef515a73ac5661598a65b896b6
Author: George Amanakis <gamanakis@gmail.com>
Date:   Mon May 8 22:35:03 2023 +0200

    Fixes in head_errlog feature with encryption

    For the head_errlog feature use dsl_dataset_hold_obj_flags() instead of
    dsl_dataset_hold_obj() in order to enable access to the encryption keys
    (if loaded). This enables reporting of errors in encrypted filesystems
    which are not mounted but have their keys loaded.

    Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
    Signed-off-by: George Amanakis <gamanakis@gmail.com>
    Closes #14837
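
The hold-with-decryption pattern, sketched with the flag and tag
conventions from dsl_dataset.h:

dsl_dataset_t *ds;
int error = dsl_dataset_hold_obj_flags(dp, obj,
    DS_HOLD_FLAG_DECRYPT, FTAG, &ds);
if (error == 0) {
	/* Error reporting can now reach encrypted datasets whose
	 * keys are loaded even though they are not mounted. */
	dsl_dataset_rele_flags(ds, DS_HOLD_FLAG_DECRYPT, FTAG);
}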

commit 288ea63effae3ba24fcb6dc412a3125b9f3e1da9
Author: Matthew Ahrens <mahrens@delphix.com>
Date:   Mon May 8 11:20:23 2023 -0700

    Verify block pointers before writing them out

    If a block pointer is corrupted (but the block containing it checksums
    correctly, e.g. due to a bug that overwrites random memory), we can
    often detect it before the block is read, with the `zfs_blkptr_verify()`
    function, which is used in `arc_read()`, `zio_free()`, etc.

    However, such corruption is not typically recoverable.  To recover from
    it we would need to detect the memory error before the block pointer is
    written to disk.

    This PR verifies BPs that are contained in indirect blocks and dnodes
    before they are written to disk, in `dbuf_write_ready()`.  This way,
    we'll get a panic before the on-disk data is corrupted, which will
    help us diagnose what is causing the corruption and make it much
    easier to recover from.

    To minimize performance impact, only checks that can be done without
    holding the spa_config_lock are performed.

    Additionally, when corruption is detected, the raw words of the block
    pointer are logged.  (Note that `dprintf_bp()` is a no-op by default,
    but if enabled it is not safe to use with invalid block pointers.)

    Reviewed-by: Rich Ercolani <rincebrain@gmail.com>
    Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
    Reviewed-by: Paul Zuchowski <pzuchowski@datto.com>
    Reviewed-by: Alexander Motin <mav@FreeBSD.org>
    Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
    Closes #14817
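
Illustrative only, not the actual zfs_blkptr_verify() logic: the kind
of structural check that is cheap precisely because it needs no locks.

/* Field-range checks on a block pointer require no spa_config_lock. */
if (BP_GET_TYPE(bp) >= DMU_OT_NUMTYPES ||
    BP_GET_CHECKSUM(bp) >= ZIO_CHECKSUM_FUNCTIONS ||
    BP_GET_COMPRESS(bp) >= ZIO_COMPRESS_FUNCTIONS) {
	/* Log the raw words of the BP, then panic before the
	 * corrupt pointer reaches disk. */
}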

commit 23132688b9d54ef11413925f88c02d83d607ec2b
Author: Brian Behlendorf <behlendorf1@llnl.gov>
Date:   Mon May 8 11:17:41 2023 -0700

    zdb: consistent xattr output

    When using zdb to output the value of an xattr, interpret it as
    printable characters only if the entire byte array is printable.
    Additionally, if the --parseable option is set, always output the
    buffer contents as octal for easy parsing.

    Reviewed-by: Olaf Faaland <faaland1@llnl.gov>
    Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
    Closes #14830
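
A minimal sketch of the two rules (helper names are hypothetical):

#include <ctype.h>
#include <stdint.h>
#include <stdio.h>

/* Interpret as text only if every byte is printable. */
static boolean_t
all_printable(const uint8_t *buf, size_t len)
{
	for (size_t i = 0; i < len; i++) {
		if (!isprint(buf[i]))
			return (B_FALSE);
	}
	return (B_TRUE);
}

/* Fallback, and the --parseable mode: one octal escape per byte. */
static void
dump_octal(const uint8_t *buf, size_t len)
{
	for (size_t i = 0; i < len; i++)
		printf("\\%03o", buf[i]);
}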

commit 6deb342248e10af92e2d3fbb4e4b1221812188ff
Author: Brian Behlendorf <behlendorf1@llnl.gov>
Date:   Mon May 8 10:09:30 2023 -0700

    ZTS: add snapshot/snapshot_002_pos exception

    Add snapshot_002_pos to the known list of occasional failures
    for FreeBSD until it can be made entirely reliable.

    Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de>
    Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
    Issue #14831
    Closes #14832

commit a0a125bab291fe005d29be5375a5bb2a1c8261c7
Author: Alexander Motin <mav@FreeBSD.org>
Date:   Fri May 5 12:17:55 2023 -0400

    Fix two abd_gang_add_gang() issues.

    - There is no reason to assert that the added gang is not empty.  It
      may be weird to add an empty gang, but it is legal.
    - When moving the chain list from the added gang, clear its size, or
      it will trigger an assertion in abd_verify() when that gang is
      freed.

    Revie…