-
Notifications
You must be signed in to change notification settings - Fork 1.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Gang ABD Type #10069
Gang ABD Type #10069
Conversation
cc87b62
to
97671fb
Compare
Codecov Report
@@ Coverage Diff @@
## master #10069 +/- ##
==========================================
- Coverage 79.52% 79.32% -0.20%
==========================================
Files 389 390 +1
Lines 123120 123349 +229
==========================================
- Hits 97906 97842 -64
- Misses 25214 25507 +293
Continue to review full report at Codecov.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is really cool! What performance improvement did you measure, and on what workload?
module/os/linux/zfs/abd.c
Outdated
typedef struct abd_link { | ||
abd_t *link_abd; | ||
list_node_t link_node; | ||
} abd_link_t; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see how this could be needed if one normal abd could be part of several "multi" abd's. Do we take advantage of that functionality? If not, the implementation could be faster/simpler by having the list_node_t in the abd_t.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I definitely agree that putting the list_node_t inside of the abd_t struct is a much simpler approach. However, as you stated, I think leaving it as is would allow the same abd to be in multiple “multi” abd’s. I am fairly certain this can in fact happen in the Direct IO code path that is in our other PR.
We construct “multi” abd’s when the request is not aligned perfectly. In turn that same abd can be in the vdev aggregate function as well. Just in a forward thinking approach, with the Direct IO work, I think we would want to leave the list_node_t in a separate struct.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would like to investigate if we really need the unaligned handling in the other PR. I would guess that we don't really care about performance of unaligned directios, so perhaps we can fall back on a simpler implementation (even if it is less efficient).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should do the investigation, but my guess is that there are compelling use cases for supporting "unaligned" directIO. Note that the alignment constraint is against the block size. E.g., if we don't have a full block read, we construct a full block by using a multi-abd. So a use case example would be a database that is construct using an 8K ZFS block size is then read using 4K directIO random reads. I think we will see a significant performance difference if we fall back to doing a data copy.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In our performance investigations with recordsize=8k, I haven't seen the per-byte costs like bcopy() be significant (even with compression=lz4 and checksum=edonr, both more expensive than the defaults). CPU time is almost always dominated by the per-block costs (e.g. creating dbufs, arc bufs, etc, and tearing them down). One of my concerns is that allocating/freeing the abd_link_t would add to the per-block costs.
That said, I agree that in some cases, partial-block directio reads could benefit from the directio semantics. Bringing us back to this PR, could you help me understand how that leads to an abd_t being part of more than one multi-abd? I get that for the partial-block directio read, we create a multi-abd where the children are the user's buffer plus a dummy buffer that we will later throw away. Does the user's abd (as opposed to the first multi-abd) then become part of a different multi-abd?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When adding an abd to a multi-abd, you should be able to assert that it isn't already in a multi-abd. Something like ASSERT(!list_link_active(&abd->abd_multi_link))
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ahrens, I added an ASSERT for !list_link_active() to abd_add_child() and that is now where the failure is happening. Thank you for pointing this out! It does seem that an abd_t can be present in multiple multi-abd's.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In what scenario do we add one abd_t to multiple multi-abd's?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@mmaybee and I are in the process of hunting this down at the moment.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So we did wind up moving the list link inside of the ABD struct itself and addressed the situations where a ABD can be in multiple Gang ABDs.
module/os/linux/zfs/abd.c
Outdated
abd_zero_buf = zio_buf_alloc(SPA_MAXBLOCKSIZE); | ||
(void) memset(abd_zero_buf, 0, SPA_MAXBLOCKSIZE); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We're OK with taking consuming an 16MB of memory in all cases? Even in e.g. the crash kernel? (cc @sdimitro)
It looks like in at least some cases we try to handle abd_zero_buf being NULL, which I'm not sure ever happens given this code, but maybe we could not allocate this, depending on the amount of memory (e.g. <1GB, don't allocate abd_zero_buf).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To avoid consuming this memory, what if we instead allocated a single zero'd page and then created a 16M scatter abd solely consisting of that page.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Interesting idea, although it would limit the use of the abd to contexts that do not require a linear allocation (perhaps not a significant issue).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wound up going with Brian's suggestion of using a single zero'd page and making a scattered ABD of SPA_MAXBLOCKSIZE, which uses the single zero'd page as its contents.
dfd871f
to
2ac5c16
Compare
module/os/linux/zfs/abd.c
Outdated
abd_zero_buf = zio_buf_alloc(SPA_MAXBLOCKSIZE); | ||
(void) memset(abd_zero_buf, 0, SPA_MAXBLOCKSIZE); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To avoid consuming this memory, what if we instead allocated a single zero'd page and then created a 16M scatter abd solely consisting of that page.
f5bd5af
to
3c2f2bd
Compare
module/os/linux/zfs/abd.c
Outdated
/* | ||
* Create an ABD that will be the head of a list of ABD's. This is used | ||
* to "chain" scatter/gather lists together when constructing aggregated | ||
* IO's. To free this abd, abd_put() must be called. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why do we want to use abd_put() to free this, vs abd_free()? Is the idea that freeing the scatterlist does not free the underlying abd's, like abd_put() does not free the underlying buffer or abd? But with multi-abd, we chose for each underlying abd whether it will be freed or not.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am perfectly happy changing this to be abd_free(). It would be no problem at all to change this and it makes more sense with the naming scheme of alloc.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I did wind up changing this. abd_free() should be called with the returned ABD after abd_alloc_gang_abd() is called.
module/os/linux/zfs/abd.c
Outdated
* Add a child ABD to a chained list of ABD's. | ||
*/ | ||
void | ||
abd_add_child(abd_t *pabd, abd_t *cabd, boolean_t multi_mem_manage) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
abd_add_child(abd_t *pabd, abd_t *cabd, boolean_t multi_mem_manage) | |
abd_add_child(abd_t *pabd, abd_t *cabd, boolean_t free_on_put) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since I will be changing the multilist and to use abd_free() I can change the variable name to maybe “call_free_on_free” or “free_on_free”. Thoughts?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wound up going with "free_on_free" for this function parameter name.
module/os/linux/zfs/abd.c
Outdated
abd_t *child_abd = NULL; | ||
|
||
mutex_enter(&cabd->abd_mtx); | ||
if (list_link_active(&cabd->multi_link)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If multi_mem_manage
is TRUE, and list_link_active()
is also TRUE, I don't think we implement the caller's request to free the passed-in ABD when the multi-abd is freed. I'm not sure how we could implement it.
Also, if the cabd's link is active, and ABD_FLAG_MULTI_FREE is already set on it, I'm not sure how that would work, depending on which order the two multi-abd's are freed.
In general, the multi_mem_manage=TRUE
case does not seem like it can be very general purpose. I think we need to at least ASSERT that you can't combine multi_mem_manage=TRUE
with having the child be part of multiple multi-abd's. It seems a bit fragile, so maybe we should explain the multi_mem_manage=TRUE
use case a bit more, and the restrictions on it.
Or, I wonder if we should get rid of multi_mem_manage=TRUE
(for callers outside the abd code), and make them free all the child abd's themself. E.g. in vdev_queue_agg_io_done()
, like we're already doing for the "write gap" abd's.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree that an ASSERT is absolutely necessary here. In general, if someone passes TRUE here they are explicitly saying they expect the multilist abd to take care of the memory management on the call to abd_free(). So by placing the ASSERT it will definitely signal an issue when someone tries to use the same abd in multiple multilist abd’s abs still expects all the memory management to be taken care of in the call to abd_free() on the multilist abd. I will also explain this more in the comments as well.
I kinda like flagging if the multilist abd should free the underlying abd’s. It allows for possibly easier use of the multilist abd by just being able to call abd_free() if it is known that freeing all the underlying abd’s can be handled there.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I still kept the parameter for "free_on_free" here. I did however add an ASSERT with a comment explaining "free_on_free" must be B_FALSE if an ABD is already part of another Gang ABD.
module/zfs/vdev_queue.c
Outdated
abd_copy_off(pio->io_abd, aio->io_abd, | ||
0, pio->io_offset - aio->io_offset, pio->io_size); | ||
if (pio->io_flags & ZIO_FLAG_NODATA) { | ||
abd_put(pio->io_abd); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We got this abd from abd_get_zeros()
, which gives us our own abd, which is why we need to abd_put() it. Why can't we instead use abd_add_child(manage_mem=B_TRUE)
to have the multi-abd infrastructure free it for us?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I actually was looking at this earlier today and was thinking I can clean all this up. I need to clean up the code in vdev_queue_aggregate() for adding abd_get_zeros() to the multilist abd. Then in the vdev_queue_agg_io_done() function we just need to call abd_free().
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wound up cleaning up the code in vdev_queue_aggregate() and now vdev_queue_agg_io_done() just consists of a call to abd_free().
e7c2d15
to
b7c0932
Compare
15b4828
to
427d454
Compare
On FreeBSD this panics on load in various different locations. Most likely an indicator of memory corruption. |
c7535b1
to
671755c
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good. This turned out nicely!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's a little hard to see what was changed with abd.c, since the code has been rearranged and split out into multiple abd_os.c files, which git history doesn't handle well. So it would be really nice if you could separate out the rearranging of code into its own commit. And then this PR could add the multi-abd stuff on top of that.
include/sys/abd.h
Outdated
/* | ||
* Linux ABD bio functions | ||
*/ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should these be "ifdef LINUX"?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This can be changed to just be #ifdef LINUX. Simple change and would isolate this just to Linux builds.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wound up changing this to be #if defined(linux) with the _KERNEL check as well. I also wound creating a separate commit to just contain moving the ABD code into separate OS files for Linux and FreeBSD. This should make seeing only the Gang ABD changes much easier.
module/zfs/vdev_queue.c
Outdated
@@ -556,6 +547,14 @@ vdev_queue_agg_io_done(zio_t *aio) | |||
#define IO_SPAN(fio, lio) ((lio)->io_offset + (lio)->io_size - (fio)->io_offset) | |||
#define IO_GAP(fio, lio) (-IO_SPAN(lio, fio)) | |||
|
|||
/* | |||
* Sufficiently adjacent io_offset's in ZIOs will be aggregated. We do this | |||
* by creating a multilist ABD from the adjacent ZIOs io_abd's. By using |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's find another name for "multilist ABD", since this is not related to multilist_t
/ multilist.c
. Perhaps "aggregate ABD" or "gang ABD" ("gang block" is something else but "gang" means the same thing here: "Arrange (electrical devices or machines) together to work in coordination." (see Verb definition 2))
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think gang ABD would be good.
d154bdb
to
610988c
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Now that you've split the refactoring from the gang ABD changes I think it would be a good to open a new PR with just the refactoring. This way we can test those changes independantly.
include/sys/abd_impl.h
Outdated
|
||
#define ABD_SCATTER(abd) (abd->abd_u.abd_scatter) | ||
#define ABD_LINEAR_BUF(abd) (abd->abd_u.abd_linear.abd_buf) | ||
#define ABD_MULTI(abd) (abd->abd_u.abd_multi) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like this line should be moved to the Gang ABD commit.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I forgot to remove that line.
Now that you've split the refactoring from the gang ABD changes I think it would be a good to open a new PR with just the refactoring. This way we can test those changes independantly.
I went ahead and opened a PR containing the refactoring only:
#10293
include/sys/abd_impl.h
Outdated
/* | ||
* This is used to get an ABD from an Gang ABD's list based on | ||
* the provided offset. This should only be called from the | ||
* ABD source code. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should any of the functions in abd_impl.h be called from outside the ABD code? If not, maybe remove this sentence, and add a comment at the beginning of this file.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wound up just removing this comment completely.
module/os/linux/zfs/abd_os.c
Outdated
abd_alloc_struct(size_t chunkcnt) | ||
{ | ||
/* | ||
* In Linux we do not use the size passed in during ABD | ||
* In Linux we do not use the chunkcnt passed in during ABD |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the argument is still conceptually the size
, not chunkcnt
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I changed this to be size
instead of chunkcnt
.
module/zfs/abd.c
Outdated
@@ -114,18 +118,32 @@ abd_is_linear_page(abd_t *abd) | |||
B_TRUE : B_FALSE); | |||
} | |||
|
|||
boolean_t | |||
abd_is_gang_abd(abd_t *abd) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
abd_is_gang_abd(abd_t *abd) | |
abd_is_gang(abd_t *abd) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I updated this to be abd_is_gang()
.
include/sys/abd_impl.h
Outdated
@@ -84,6 +92,14 @@ struct abd_iter { | |||
struct scatterlist *iter_sg; /* current sg */ | |||
}; | |||
|
|||
/* | |||
* This is used to get an ABD from an Gang ABD's list based on | |||
* the provided offset. This should only be called from the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I noticed that the other function declarations in this header don't have comments describing them. This comment is also a incomplete compared to the one in the .c file (e.g. describing the in-out parameter). So I'd recommend removing this comment and relying on the one in the .c file.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I removed this comment.
module/zfs/abd.c
Outdated
} | ||
ASSERT3P(child_abd, !=, NULL); | ||
|
||
mutex_enter(&pabd->abd_mtx); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The chained abd's are inherently ordered (i.e. abd_gang_add()
adds the child's data to the end of the parent). So I think that you could not correctly call abd_gang_add() on the same parent concurrently, since you wouldn't know what order the children would be added. Therefore I don't think you need to get pabd->abd_mtx
here, which removes the lock ordering requirement.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree. I wound up removing the locking of pad's abd_mtx.
module/os/linux/zfs/abd_os.c
Outdated
@@ -426,8 +437,57 @@ abd_free_chunks(abd_t *abd) | |||
abd_free_sg_table(abd); | |||
} | |||
|
|||
#define ABD_ZERO_PAGE (ZERO_PAGE(0)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[nit] I think the code would be easier to understand without this. It's only used in a few places, and it has a different definition for userland vs kernel.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I removed the macro.
module/os/linux/zfs/abd_os.c
Outdated
* On Illumos this is linear ABDs, however if ldi_strategy() can ever issue I/Os | ||
* using a scatter/gather list we should switch to that and replace this call | ||
* with vanilla abd_alloc(). | ||
* |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think we want to add this back.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ughh... I didn't mean to add this back in. I have removed it again.
for (abd_t *cabd = abd_gang_get_offset(abd, &off); | ||
cabd != NULL; | ||
cabd = list_next(&ABD_GANG(abd).abd_gang_chain, cabd)) { | ||
int size = MIN(io_size, cabd->abd_size - off); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
since we used abd_gang_get_offset
, can we now assert that off < cabd->abd_size
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added this ASSERT.
module/zfs/abd.c
Outdated
* data over into the newly allocated ABD. | ||
* | ||
* Cases where an ABD may be part of multiple | ||
* Gang ABD's are ditto blocks and when |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
how about something like: An ABD may becompe part of multiple gang ABD's. For example, when writing ditto blocks, the same abd is used to write to 2 or 3 locations with 2 or 3 zio_t's. Each of the zio's may be aggregated with different adjacent zio's. zio aggregation uses gang zio's, so the single abd can become part of multiple gang zio's.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I update this comment.
module/zfs/abd.c
Outdated
ASSERT3U(*off, <, abd->abd_size); | ||
for (cabd = list_head(&ABD_GANG(abd).abd_gang_chain); cabd != NULL; | ||
cabd = list_next(&ABD_GANG(abd).abd_gang_chain, cabd)) { | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[nit] extra blank line
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have removed the extra blank line.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This turned out nicely. Looks good, just a couple trivial nits.
Adding the gang ABD type, which allows for linear and scatter ABDs to be chained together into a single ABD. This can be used to avoid doing memory copies to/from ABDs. An example of this can be found in vdev_queue.c in the vdev_queue_aggregate() function. Signed-off-by: Brian Atkinson <batkinson@lanl.gov> Co-authored-by: Mark Maybee <mmaybee@cray.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good work Brian.
Adding the gang ABD type, which allows for linear and scatter ABDs to be chained together into a single ABD. This can be used to avoid doing memory copies to/from ABDs. An example of this can be found in vdev_queue.c in the vdev_queue_aggregate() function. Reviewed-by: Matthew Ahrens <mahrens@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Co-authored-by: Brian <bwa@clemson.edu> Co-authored-by: Mark Maybee <mmaybee@cray.com> Signed-off-by: Brian Atkinson <batkinson@lanl.gov> Closes openzfs#10069
Adding the Gang ABD type, which allows for
linear and scatter abd's to be chained together
into a single abd.
Signed-off-by: Brian batkinson@lanl.gov
Authored-by: Mark Maybee mmaybee@cray.com
Added the Gang ABD type to allow for chaining together both linear and
scatter ABDs into a single ABD.
Motivation and Context
The Gang ABD type can be used to avoid unnecessary memory copies between
ABD's. An example of this is the updated vdev_queue_aggregate() function.
Description
Adding the Gang ABD type was just a matter of extending the functionality in
abd.c.
How Has This Been Tested?
Tested using --enable-debug, --enable-debuginfo and running zpool.sh.
Tested on CentOS using kernel 4.18.5.
Types of changes
Checklist:
Signed-off-by
.