dsl_dataset: put IO-inducing frees on the pool deadlist #16722
Conversation
I wonder if this might have any impact on the OOMs I was seeing when building and tearing down several large Docker containers in sequence, as mentioned in #10255 (comment).
Thanks. This looks like a good workaround for cases where we are unable to chunk a large delete across several TXGs. But I've got a couple of thoughts about it:
- We usually consider deletes netfree transactions, since they free more space than they allocate. This, though, makes no promise about when exactly the space is freed, which I guess in theory may allow pool overflow. Sure, things are already far from predictable with dedup or cloning in play, but as I said, just a thought.
- As I can see, zfs_max_async_dedup_frees does not apply to cloned blocks, since cloning is not easily visible from the block pointer. So I wonder if the issue may still be reproducible when the sync thread tries to free all the cloned blocks at once, or at least as many as it can within the few seconds of the TXG timeout. I wonder if we could somehow add a limit for cloned blocks there, or whether we should re-enable zfs_async_block_max_blocks.
@satmandu ZFS ZIOs, the massive allocation of which is addressed here, are not accounted towards the ARC, so they may increase memory pressure on the system and make the kernel request ARC eviction, which, being ignored due to evil …
@amotin I do have block cloning enabled:

zpool get feature@block_cloning rpool
NAME   PROPERTY               VALUE   SOURCE
rpool  feature@block_cloning  active  local
enabled would not be a sign, but active may be. But you may also look at how much you have cloned via …
I also have it active, I believe, as per:

cat /sys/module/zfs/parameters/zfs_bclone_enabled
1

Looking further:

zpool get all rpool | grep bclone
rpool  bcloneused   1.60G  -
rpool  bclonesaved  2.24G  -
rpool  bcloneratio  2.39x  -

(Maybe that's just not enough to warrant this PR having any effect on my system.) Also, as an aside, looking at https://github.com/moby/moby/blob/master/vendor/golang.org/x/sys/unix/ioctl_linux.go, I see the use of both …
dsl_free() calls zio_free() to free the block. For most blocks, this simply calls metaslab_free() without doing any IO or putting anything on the IO pipeline.

Some blocks however require additional IO to free. This at least includes gang, dedup and cloned blocks. For those, zio_free() will issue a ZIO_TYPE_FREE IO and return.

If a huge number of blocks are being freed all at once, it's possible for dsl_dataset_block_kill() to be called millions of times on a single transaction (eg a 2T object of 128K blocks is 16M blocks). If those are all IO-inducing frees, that then becomes 16M FREE IOs placed on the pipeline. At time of writing, a zio_t is 1280 bytes, so for just one 2T object that requires a 20G allocation of resident memory from the zio_cache. If that can't be satisfied by the kernel, an out-of-memory condition is raised.

This would be better handled by improving the cases that the dmu_tx_assign() throttle will handle, or by reducing the overheads required by the IO pipeline, or with a better central facility for freeing blocks.

For now, we simply check for the cases that would cause zio_free() to create a FREE IO, and instead put the block on the pool's freelist. This is the same place that blocks from destroyed datasets go, and the async destroy machinery will automatically see them and trickle them out as normal.

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Yeah, it's a good thought. I agree it's not for this PR (at least, I think "just a thought" means that, heh). I wonder if the answer will be something like: if an allocation (or transaction?) comes in that we can't fulfill, we check the freeing list, and if it's running, we wait, or we force it to free enough right now, or... something. I was also thinking today that maybe … I think that's an interesting idea in its own right, but maybe it could be wired in even higher up: past a certain size it's not marked netfree, and we use the netfree mark as a "now or later" flag. I guess then it might not be assignable, so I don't know, maybe we have to break it up instead of nuking the entire range in one go? Maybe the whole answer will be in the ZIO layer though. Maybe we want a single "batch free" op that can take a list of BPs instead. Maybe that allows some batching down at the BRT and DDT layers too. Hmm hmm. Ehh, this is all complicated; it's just deciding where you put the complexity.
Untested, but the clone block check might be just:

diff --git module/zfs/dsl_scan.c module/zfs/dsl_scan.c
index 6cd0dbdea..0de1df948 100644
--- module/zfs/dsl_scan.c
+++ module/zfs/dsl_scan.c
@@ -3578,7 +3578,7 @@ dsl_scan_free_block_cb(void *arg, const blkptr_t *bp, dmu_tx_t *tx)
 	    -bp_get_dsize_sync(scn->scn_dp->dp_spa, bp),
 	    -BP_GET_PSIZE(bp), -BP_GET_UCSIZE(bp), tx);
 	scn->scn_visited_this_txg++;
-	if (BP_GET_DEDUP(bp))
+	if (BP_GET_DEDUP(bp) || brt_maybe_exists(scn->scn_dp->dp_spa, bp))
 		scn->scn_dedup_frees_this_txg++;
 	return (0);
 }

It's a bit tedious, since we did that test to get into this path, and again moments before in …
Looking at the CI results, it looks like with this change we're running afoul of some of the accounting: https://github.com/openzfs/zfs/actions/runs/11714317314/job/32635388639?pr=16722#step:11:2332
Hrm. Definitely transient, can't reproduce locally and most of the test runners don't show it. I'll have to do a code read.
It looks like two of the test runners hit it. Here's the other one, from the ubuntu20 runner.
I could use a hand with this one. @amotin you're probably my best hope here. AIUI, it's trying to credit …
By this call stack, this free is not a deferred free; we go straight into …
Now I don't really understand the … By the conditional, there's no L1 header or references (fine). Then in …

There are three failing tests above. The logs make it hard to see what, if anything, is triggering it. One time it's running the …

I don't know enough about the ARC, and especially the L2ARC, to know what to do here. My high-level guesses are one or more of: …
@robn L2ARC is quite a special beast. It is written solely by …
@robn After a second look, the ordering between …
Yeah, it is not exactly free, despite checking only the number of entries in the AVL tree without traversing it, and I think it may give too many false positives; I don't like it. But the worst I saw was lock contention, which should be reduced after #16740, though it is unlikely to go away completely. I don't think we can avoid the second check before and after queuing, since the list is still used for snapshot deletion and we have no way to pass that knowledge down there; but if we checked it just above in the stack...
@behlendorf thanks for setting that up; that was my plan too, I just couldn't get to it until today. Much appreciated! I'll wait for your review on #16743 and here. Let me know if you want me to rebase this on top of #16743 if/when you merge it.

Merged. No rebase needed. We'll want to keep an eye on subsequent CI runs to make sure that accounting bug is really resolved.

Thanks all! @behlendorf yep, will do. Ping me if you see anything!
dsl_free() calls zio_free() to free the block. For most blocks, this simply calls metaslab_free() without doing any IO or putting anything on the IO pipeline.

Some blocks however require additional IO to free. This at least includes gang, dedup and cloned blocks. For those, zio_free() will issue a ZIO_TYPE_FREE IO and return.

If a huge number of blocks are being freed all at once, it's possible for dsl_dataset_block_kill() to be called millions of times on a single transaction (eg a 2T object of 128K blocks is 16M blocks). If those are all IO-inducing frees, that then becomes 16M FREE IOs placed on the pipeline. At time of writing, a zio_t is 1280 bytes, so for just one 2T object that requires a 20G allocation of resident memory from the zio_cache. If that can't be satisfied by the kernel, an out-of-memory condition is raised.

This would be better handled by improving the cases that the dmu_tx_assign() throttle will handle, or by reducing the overheads required by the IO pipeline, or with a better central facility for freeing blocks.

For now, we simply check for the cases that would cause zio_free() to create a FREE IO, and instead put the block on the pool's freelist. This is the same place that blocks from destroyed datasets go, and the async destroy machinery will automatically see them and trickle them out as normal.

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Closes openzfs#6783
Closes openzfs#16708
Closes openzfs#16722
Closes openzfs#16697
[Sponsors: Klara, Inc., Wasabi Technology, Inc.]
Motivation and Context
Closes: #16697
Closes: #16708
Closes: #6783
Description
dsl_free() calls zio_free() to free the block. For most blocks, this simply calls metaslab_free() without doing any IO or putting anything on the IO pipeline.

Some blocks however require additional IO to free. This at least includes gang, dedup and cloned blocks. For those, zio_free() will issue a ZIO_TYPE_FREE IO and return.

If a huge number of blocks are being freed all at once, it's possible for dsl_dataset_block_kill() to be called millions of times on a single transaction (eg a 2T object of 128K blocks is 16M blocks). If those are all IO-inducing frees, that then becomes 16M FREE IOs placed on the pipeline. At time of writing, a zio_t is 1280 bytes, so for just one 2T object that requires a 20G allocation of resident memory from the zio_cache. If that can't be satisfied by the kernel, an out-of-memory condition is raised.
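For concreteness, the back-of-the-envelope arithmetic behind those numbers (my expansion of the figures quoted above):

2 TiB / 128 KiB per block = 2^41 / 2^17 = 2^24 ≈ 16.8M blocks
16.8M zio_t × 1280 bytes ≈ 20 GiB resident in zio_cache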
This would be better handled by improving the cases that the dmu_tx_assign() throttle will handle, or by reducing the overheads required by the IO pipeline, or with a better central facility for freeing blocks.

For now, we simply check for the cases that would cause zio_free() to create a FREE IO, and instead put the block on the pool's freelist. This is the same place that blocks from destroyed datasets go, and the async destroy machinery will automatically see them and trickle them out as normal.
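As a rough illustration of the approach (a minimal sketch, not the actual patch: free_induces_io() is a hypothetical name, and the exact enqueue path in the merged change may differ), the check amounts to something like:

/*
 * Sketch: route frees that would issue a ZIO_TYPE_FREE IO (gang,
 * dedup, cloned blocks) onto the pool's free list, where the async
 * destroy machinery will trickle them out across TXGs.
 */
static boolean_t
free_induces_io(spa_t *spa, const blkptr_t *bp)
{
	return (BP_IS_GANG(bp) || BP_GET_DEDUP(bp) ||
	    brt_maybe_exists(spa, bp));
}

/* ... then in dsl_dataset_block_kill(), instead of calling dsl_free()
 * unconditionally: */
if (free_induces_io(spa, bp)) {
	/* same place destroyed datasets' blocks go */
	bpobj_enqueue(&dp->dp_free_bpobj, bp, B_FALSE, tx);
} else {
	dsl_free(dp, tx->tx_txg, bp);	/* plain metaslab_free() path */
}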
Implementation notes

I thought about a separate task queue or throttle for this, but using the pool deadlist and the async destroy machinery was much nicer. It's already there, and quiet most of the time. We have talked in the past about using it for unlink() for performance, and this does actually help with that: deleting enormous test files returned much faster, as you'd expect. However, performance is not the purpose of this PR, which is why I've kept it narrow. I'd like to think about it more before we do it for everything, not least how it interacts with a real dataset destroy, and also scrubs etc.

Still, it's cool that you can watch this with zpool wait -t free.

Discussion
There are a few different problems that lead to this happening. This patch isn't really addressing any of them, but rather detecting a known cause and routing around it. I'll write down all the trouble spots that lead to this so we at least have them.
zio_t is too big

1280 bytes is a lot for something we create by the thousands. I have a heap of ideas for how to reduce it and/or change the overheads of the IO pipeline and queues as a whole. They're all hugely invasive and not something I want to do in a hurry.
Bonus: there are many similarly shaped bugs in the issue tracker that I believe boil down to the same kind of thing: it's easy to load up "too much" IO on the IO taskqs, which then sits there, pinning memory, until the workers at the other end can clear it, which can be an age if the thing at the end is slow. Unfortunately it's not all just zio_t size.

"free" is not really an IO
I suppose it was probably useful to think about it that way because allocations are requested from within the IO pipeline. Most of the time, though, it's just metaslab_free(). Obviously the code has to go somewhere, so this is mostly felt in the API: zio_free() doesn't even take a pio, and zio_free_sync() does but might not issue IO, and also has other rules like "same txg". So I think the idea that free is "magic IO" is too magic, and its unpredictability is what gets us here.

I did try to rework just these functions to be "free this sometime, don't care how" or "free it now or error" and then use the error to decide what to do. It didn't work out and I abandoned it.
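For reference, the two entry points being contrasted look roughly like this (signatures as I read them from current source; the flag type has changed name across versions):

/* fire-and-forget: no parent zio, the caller never sees the IO (if any) */
void zio_free(spa_t *spa, uint64_t txg, const blkptr_t *bp);

/*
 * takes a parent pio and returns a zio_t, but may still complete the
 * free inline via metaslab_free() without issuing real IO; only valid
 * for frees in the current txg
 */
zio_t *zio_free_sync(zio_t *pio, spa_t *spa, uint64_t txg,
    const blkptr_t *bp, zio_flag_t flags);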
"frees that cause IO" isn't obvious
The old gang and dedup tests are fine, for now. brt_maybe_exists() is a little troubling, as it's a feature in use everywhere now, and it might mean that a lot more frees are going on the pipeline than used to.

But I also suspect it's not the full story. #16037 shows similar behaviour but doesn't appear to involve gang, dedup or cloned blocks. I couldn't find anything else that obviously might generate a ton of IO from a free, but there's still lots of ZFS I'm not familiar with, and it also doesn't cover the future.
dmu_tx assignment isn't granular enough
Deleting a file gets a single call to dmu_tx_hold_free(tx, obj, 0, DMU_OBJECT_END). In that respect it's doing nothing wrong, but that's clearly too much all at once, and is ultimately what led to this, as the write throttle can only push back on single holds.
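As a thought experiment only (none of this is in the patch; the 1G chunk size and the surrounding variables are assumptions for illustration), a more granular delete might hold and free one bounded range per transaction, in the spirit of how dmu_free_long_range() chunks its work:

/*
 * Hypothetical sketch: free an object one bounded range per
 * transaction, giving the dmu_tx_assign() throttle a chance to push
 * back between chunks, rather than one hold covering DMU_OBJECT_END.
 * os, obj and object_size are assumed context.
 */
uint64_t off = 0;
const uint64_t chunk = 1ULL << 30;	/* arbitrary 1G chunks */

while (off < object_size) {
	uint64_t len = MIN(chunk, object_size - off);
	dmu_tx_t *tx = dmu_tx_create(os);
	dmu_tx_hold_free(tx, obj, off, len);
	int err = dmu_tx_assign(tx, TXG_WAIT);
	if (err != 0) {
		dmu_tx_abort(tx);
		break;
	}
	(void) dmu_free_range(os, obj, off, len, tx);
	dmu_tx_commit(tx);
	off += len;
}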
write throttle doesn't know about resource overheads

Similar vibes to #15511. Most things don't account for CPU/memory/IO overheads, and so can't push back when those are low. There's no single answer of course, but it seems we're going to run into problems like this more and more as everything gets bigger and bigger.
How Has This Been Tested?
The goal is to create a single file with millions of L0 blocks, and then drop them all at once. This worked well enough for me. Needs a spare 2T+ disk.
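For the record, the shape of the test, reconstructed (pool name, dataset and path are placeholders; not the exact commands used):

# 128k records, compression off so every block really allocates
zfs create -o recordsize=128k -o compression=off tank/big
# ~2T of 128K blocks = 16M L0 blocks
dd if=/dev/zero of=/tank/big/file bs=128k count=16777216
rm /tank/big/file
# watch the async free drain, and zio_t slab usage, while it runs
zpool wait -t free tank
grep zio_cache /proc/slabinfo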
Watching the zio_cache field through the process is instructive. Without this patch, even if you haven't gone large enough to trigger an OOM, you will see the count blow out a long way. You may see substantial pauses while reclaim kicks in.

With this patch, it runs at a much more leisurely rate. Still gets big, but never ridiculous.
A ZTS run is going locally, though I'm not expecting to see anything much. At least, it should make sure I didn't screw up the accounting.
Update: @serjponomarev has tested this, see #16708 (comment). Seems plausible?