Large synchronous writes are slow when a slog is present #1012
@dechamps I was investigating this issue yesterday which was easy to reproduce given your excellent summary of the problem. Unfortunately, I don't think it's going to be quite as trivial to fix as we'd hoped. Initially, I tried your suggestion of tweaking Upon further reflection it was also clear to me that the log size isn't really what we want to be using here. What would be far better is to remove the assumption from the existing code that the slog is always going to be the fastest storage. As you point out above this is almost certainly true for small I/Os, but large streaming I/O would be far better handled by the primary pool. Ideally we want a way to determine which set of vdevs is going to stream the fastest on your system to minimize the latency. For unrelated reasons I've already been looking at tracking additional per-vdev performance data such as IOPS and throughput. Once those enhancements get merged it would be relatively straightforward for the |
The code doesn't always assume the slog is faster than everything: when the log size exceeds The core issue is that both decisions (indirect/immediate and slog/main) are taken by different modules at different times, so we end up with an absurd end result. |
Sure, and I think it is pretty easy to fix the worst case behavior you described. The initial patch I put together basically added a call to USE_SLOG() when setting the
slogging = spa_has_slogs(zilog->zl_spa) && USE_SLOG(zilog) && (zilog->zl_logbias == ZFS_LOGBIAS_LATENCY);
For testing it just happened that my pool's primary storage was about 3x faster at streaming than my slog. So even when it correctly used the slog in immediate mode there was still a 3x performance penalty compared to using indirect mode to the primary pool. This second issue got me thinking about how to just do the right thing. |
The code lines in your last comment are basically what I had in mind for fixing this issue.
Well, that's a trade-off. Keep in mind that for the ZIL, latency is the main performance metric, not throughput. What counts is the time it takes for Basically, the primary pool should be used if (initial latency + commit size / vdev throughput) is smaller for the primary pool than for the slog. For example, for a pool with 1 SSD slog (0.1 ms latency, 100 MB/s) and 3 spindles (3 ms, 300 MB/s total), any ZIL commit larger than 0.435 MB will take less time to complete on the main pool. Which means Note that my previous demonstration is only valid when there is no congestion, i.e. the disks are idle when the commit occurs. If the disks aren't idle, then other factors come into play, and then the SSD will often win because it is likely to be less loaded than the main disks, unless all the load is on the ZIL. In addition, if disks are busy, then writing in immediate mode (i.e. twice) on the main pool halves performance, which brings us to the issue described in my original description. |
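To make the arithmetic in the comment above concrete, here is a small C sketch of the same model (commit time = initial latency + commit size / throughput). The function names are made up for illustration, and the model assumes idle disks, exactly as the comment does; it says nothing about behavior under congestion.

```c
#include <stdio.h>

/* Time in milliseconds to commit size_mb megabytes to a device with the
 * given initial latency (ms) and sequential throughput (MB/s). */
double commit_time_ms(double latency_ms, double throughput_mb_s, double size_mb)
{
    return latency_ms + (size_mb / throughput_mb_s) * 1000.0;
}

/* Commit size (MB) above which the main pool finishes sooner than the slog.
 * Returns -1.0 if the main pool is not faster per megabyte (it never wins
 * on large commits), and 0.0 if it wins at every size. */
double break_even_mb(double slog_lat_ms, double slog_mb_s,
                     double main_lat_ms, double main_mb_s)
{
    double slog_ms_per_mb = 1000.0 / slog_mb_s;
    double main_ms_per_mb = 1000.0 / main_mb_s;

    if (slog_ms_per_mb <= main_ms_per_mb)
        return -1.0;

    double size = (main_lat_ms - slog_lat_ms) / (slog_ms_per_mb - main_ms_per_mb);
    return size < 0.0 ? 0.0 : size;
}
```

With the figures from the example (0.1 ms and 100 MB/s for the slog, 3 ms and 300 MB/s for the spindles), break_even_mb(0.1, 100.0, 3.0, 300.0) comes out at roughly 0.435.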
Is this issue already open? I guess it is solved with #1013 ? |
As I said in the comments of #1013, it is not. |
Ok, so currently it makes no sense to add a zil to a ZOL pool , if there are large |
I assume you sidestep the issue if you set logbias=throughput? On Mar 26, 2013, at 10:35, P.SCH wrote:
|
Basically, yes. If you care about large synchronous writes, then adding a slog might be counter-productive. Note that this is true for all ZFS implementations, including FreeBSD and Illumos.
Yes, but if you set |
Thank you for this clarification. As there are so many discussions around SSDs for ZIL/cache, this is basically frustrating. I hope you find some time to work on this issue. |
Hi all, I just read about this old, still open ticket and wondered if the problem can be somewhat sidestepped by using a quite large |
@evujumenuk I wouldn't know. It definitely looks interesting, but the last time I looked into this was literally 5 years ago and I don't have any context around this anymore. |
Maybe @dinatale2 can shed some light on this. The question (as I understand it): is it still possible for large sync writes to be written in immediate mode to data vdevs if |
@evujumenuk PR #6191 does not directly address this issue. When a slog device is part of the pool the assumption is still that it offers the absolute lowest latency and is preferred when With #6191 and OpenZFS 8585 (#6566) it might be easier to implement the original suggestion. Since OpenZFS 8585 does away with the batching of blocks we may have a better idea about when the log device is being overwhelmed and should transition to indirect writes. We still want the slog to soak up bursty synchronous writes. It's also worth mentioning that if your target workload is large synchronous writes you can set |
After re-reading my original description, I don't think that's what this issue is about. The issue is that, when faced with large synchronous writes, a slog is present, and What makes sense is either writing the data in immediate mode to the slog, or in indirect mode to the main disks. Writing large blocks in immediate mode to the main disks just results in the data getting written twice (both times to the main disks) for no reason. Or at least that's what I can piece back together after re-reading my report from 5 years ago. |
Then we should close this issue because that is the current behavior. Here's the relevant block of code which decides how the log record should be written. Large blocks will always be written indirectly when a pool lacks a slog device.
long zfs_immediate_write_sz = 32768;
if (zilog->zl_logbias == ZFS_LOGBIAS_THROUGHPUT)
write_state = WR_INDIRECT;
>>> else if (!spa_has_slogs(zilog->zl_spa) &&
resid >= zfs_immediate_write_sz)
write_state = WR_INDIRECT;
else if (ioflag & (FSYNC | FDSYNC))
write_state = WR_COPIED;
else
write_state = WR_NEED_COPY; |
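For anyone who wants to poke at that branch outside the kernel, here is a standalone sketch of the same decision as a plain C function. The enum and parameter names are stand-ins invented for this example (they are not the real ZFS types), and the sync_io flag stands for the (ioflag & (FSYNC | FDSYNC)) test in the quoted code.

```c
#include <stdbool.h>
#include <stddef.h>

/* Stand-ins for the identifiers quoted above; not the real ZFS definitions. */
enum logbias { LOGBIAS_LATENCY, LOGBIAS_THROUGHPUT };
enum write_state { WR_INDIRECT, WR_COPIED, WR_NEED_COPY };

/* Same default as zfs_immediate_write_sz in the quoted block. */
static const size_t immediate_write_sz = 32768;

/* Mirrors the if/else chain quoted above, one branch per return. */
enum write_state
choose_write_state(enum logbias bias, bool has_slog, size_t resid, bool sync_io)
{
    if (bias == LOGBIAS_THROUGHPUT)
        return WR_INDIRECT;
    if (!has_slog && resid >= immediate_write_sz)
        return WR_INDIRECT;   /* large write, no slog: log a pointer only */
    if (sync_io)
        return WR_COPIED;     /* data is copied into the log record */
    return WR_NEED_COPY;      /* data is copied in at commit time */
}
```

Note what happens with has_slog set to true and logbias=latency: the size check is skipped entirely, so a write of any size ends up in one of the two copy states and is never logged indirectly at this stage.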
Yes. If there is no slog, then there is no problem. I agree. Again, that's not what this issue is about. Everything I said in this thread assumes there is a slog attached to the pool. Again:
Examples of behaviors that would make sense, but that I did not observe when I filed this issue, include:
In any case, according to my original description the issue is quite straightforward to reproduce, so it should just be a matter of trying to reproduce it with the current code to confirm that the issue is still there. |
I think that much of the disagreement is related to the ambiguity -- and sometimes counterintuitive appearance -- of the parameter naming AND setting terminology with their at-times lack of clearly-differentiated definitions, coupled with the overriding behavior of other less-documented, less-accessible parameters' settings and/or decision logic. I believe that this is the wall that continues to be hit by many of us looking to adopt this beautiful beast with its capable promises of being the last word. For example, we have logbias=latency. "Bias" means "a predisposition towards" ... what?? We should find out by the setting: "latency". "Latency" is also an ambiguous word meaning "the time delay", having no clear positive or negative value towards which part of the overall file commit process, unfortunately for us (figuratively) less RAM-equipped thinkers. I know that I am not about to change ZFS, but when I learned the two opposite settings for logbias, being =latency or =throughput, it appeared (and continues to try to assert itself with me) that, according to ZFS' image of high standards of performance (ZFS: the LAST word in file systems), logbias=latency meant the log's predisposition is toward latency, a generally slower commit; as opposed to throughput appearing to mean (in opposition) get it done, now. "Fast". This of course is not what these parameters mean, which I believe many in the community are continually finding out. We are continually learning that this particular setting is referring to "part" of a "part" of "part" of the entire file commit decision process. In other words, for the logbias parameter's scope (which can unfortunately be overridden by other, less-accessible but equally powerful variables' settings, as well as even more powerful, all-but-undocumented logic), latency means a target of "low perceived latency" and probably worse, is supposed to mean "longer delay in having the file commit fully committed to permanent storage". 
Whereas throughput means "bypass the SLOG and go straight to disk", which, although it is the shortest by-wire path, appears to slow down the overall throughput of the sum of the file commit processes under many workloads. Pretzel-Minding... And often seemingly unfair with the other overriding parameters being less accessible, or less documented.
Another example of the knotted verbiage is found in the likes of "writes being logged in indirect mode or in immediate mode." To the uninitiated, these types of statements, by way of "common" reasoning, seem to imply that "indirect mode" means that the writes are "not" directly written/logged to disk, and conversely, "direct mode" would imply that the writes "are" written/logged directly. But of course, the "opposite" is true, and we are dealing with a several step write process that involves logging on one or more levels, at potentially 3 locations, RAM, ZIL or SLOG, and Final Resting Place. The OP does a fair job of defining the true behavior of "indirect" and "direct mode" in this context. I mean, come on. A "write being logged in indirect mode" really means that the write IS directly written to its final location, bypassing a log??? Why is the verb "logged" even used??? Why did the creators take such pains to write such convoluted, counterintuitive phrasing??? "Why God, Why???" This type of verbiage is much like saying, "It is certainly not bright outside" in describing the outside lighting at noonday, but with reference to the full-moon's illumination! A full moon is always on the opposite side of the world at noon in anyone's timezone, and therefore it is really "moon-dark", but "sun-bright" outside. It completely defies the childlike desire to want to understand "now". They just wanted to know if they could go out to play!
If us kids miss just one apparently benign but critical word, with its lack of clearly differentiated meaning and context, the concept is either misunderstood, or perhaps understood as the world exists on "Opposite Day". I fear that for many "would-be" ZFS-ists, the "last word" is unfortunately the one they never get to because it wasn't a part of an 1) official (or not), clearly-differentiated documentation, 2) in an all-encompassing, easy-to-reference manner, 3) from a single standards-oriented, open source (or not!) organization or web location. |
Do not assume that low-latency devices can deliver high throughput. Thus the logbias property attempts to allow some control over those conditions. |
I'm not quite sure if your rant is meant to be taken seriously (loved your second-to-last paragraph though). In any case, the following should clarify the terminology for those unfamiliar with the internals of ZIL implementation:
Every time ZFS commits the ZIL, it has to make a decision about two things:
Assuming nothing has changed since my original post, ZFS makes these decisions based on:
Following the decision tree that I described in my original post, there are four possible outcomes:
The reason why I filed this bug is because the above is inefficient, since it can lead to large amounts of data being written in immediate mode to the main disks even though a SLOG is present, which is a very dumb thing to do (it is strictly worse behavior than if you do not have any SLOG at all!). I believe a more efficient decision logic would be:
Such a change would simultaneously improve latency, throughput, and efficiency in the case where large synchronous writes are happening in a SLOG-enabled pool. |
Sorry, I posted the last post under the wrong GitHub account (I used my work account). I have deleted it and reposted under my personal account. I'm not sure I understand the proposed cases 3 and 4, and I wonder if you are building your cases incorrectly?
Assumption: SLOG present, and the SLOG can write faster than the data disks for small writes.
Case 1: Write it to the data disks in indirect mode, since that doesn't need to write twice.
Case 2: Write to the SLOG and the data disks (indirect mode, to maintain throughput) simultaneously.
Case 3: The lowest latency operation here (for a single large write) is to write it to the data disks in immediate mode (remember, you can't write it to the SLOG). Edit: If there are multiple large writes, the latency of subsequent writes will increase. If we write in indirect mode, then we pay a small latency penalty for the first large write.
Case 4: The lowest latency operation here would then be to write it to the SLOG while simultaneously writing it to the data device in indirect mode (maintain as much throughput as possible, latency already taken care of by the SLOG). |
Sure. This is what ZFS already does.
No. There is also the problem that such a strategy could exacerbate the write endurance issues of SSD technology.
I don't think that would be smart. What would be smart would be to store the data in indirect mode on the main disks, and store the ZIL indirect blocks (i.e. the metadata) on the SLOG. (Your proposed solution is what ZFS already does, which is why I've opened this issue in the first place. There is also the problem that the strategy is weirdly inconsistent between the "SLOG present" and "SLOG non present" cases: in the first case the data is written in immediate mode to the main disks, while in the second case the data is written in indirect mode to the main disks. That discrepancy makes no sense, and I doubt it was intended.)
It is technically true that, assuming everything is idle, writing in immediate mode to the main disks gets you the lowest possible latency (assuming you don't want to write the full data to the SLOG). However, writing in indirect mode to the main disks, while storing the ZIL indirect blocks on the SLOG, would probably make zero difference to latency (the SLOG will write the metadata faster than the main disks will write the data blocks, so you won't have to wait for the SLOG), but it would make a very large difference in terms of efficiency (you only write the data once instead of twice). Basically my point is: your proposed solution is latency-optimal, but it's inefficient. The solution I'm proposing is both latency-optimal (as long as your SLOG is not abnormally slow) and efficient. There is no need to make a tradeoff here. The only case where my solution is not latency optimal is if, for some reason, the SLOG takes longer to acknowledge small (metadata) writes than it takes for your main disk to acknowledge large (data) writes. Which would be weird and probably means you need to buy a SLOG that doesn't suck :)
I agree for this case, and this is what ZFS already does. |
You make good points, although I think today, even for homelab type use cases, if you want to use/can afford an SLOG, you can easily pick up something NVMe and faster than your spindles while remaining reasonable on the budget. For example, I have a pair of ~$350 Optane 900p 280 GB - a consumer drive - each of which has rated endurance 5 PB, write long-term sustained ~2GB/s even when drive is full. I really wish that sync writes used it effectively turning the SLOG into a writeback cache. In the enterprise space where ZFS really shines, I'd imagine the ruler SSD form factor favoring the speed and endurance of PCIe SSDs even more.
Ah. That makes sense now. Your explanation of immediate mode confused me a little bit, as it claimed it is the lowest latency operation (but I guess the context is without the help of the SLOG).
Just to be clear, you had proposed this case under SLOG, immediate. |
You should be able to do that today by setting a very high value on the (Note that, strictly speaking, the SLOG can't be used as a writeback cache, because the ZIL is never read from unless you're recovering from a crash. The data stored in the ZIL is a copy of pending writes that are in RAM, and will stay in RAM until the next TXG commit, at which point the ZIL is discarded.) |
@mtippmann Good catch. Yes, this commit by @dinatale2 appears to solve most of the problem described here. (Actually @evujumenuk had already mentioned that commit months ago but I declined to take a look at the time.) It's not really the solution I would have hoped, because it chooses to simply log in immediate mode to the SLOG all the time (i.e. it's equivalent to setting While I'm not entirely convinced that's the best approach, I do concede that it's a reasonable one and it's still way better than the old behaviour (which truly made no sense). So I think we can finally put this to rest, with credits for the fix going to @dinatale2. |
Note that this issue seems to impact all ZFS implementations, not just ZFS On Linux.
ZFS uses a complicated process when it comes to deciding whether a write should be logged in indirect mode (written once by the DMU, the log records store a pointer) or in immediate mode (written in the log record, rewritten later by the DMU). Basically, it goes like this:
[...] logbias=throughput, or [...] zfs_immediate_write_sz.
[...] logbias=latency and: [...] zfs_immediate_write_sz, or [...] zil_slog_limit.
[...] logbias=latency, there is a slog, and the total commit size is smaller than zil_slog_limit.
The decision to use indirect or immediate mode is implemented in zfs_log_write() and zvol_log_write(). The decision to use the slog or the normal vdevs is implemented in the USE_SLOG() macro used by zil_lwb_write_start.
.The issue is, this decision process makes sense except for one particularly painful edge case, when these conditions are all true:
[...] logbias=latency, and [...]
In this situation, the optimal choice would be to write to the normal pool in indirect mode, which should give us the minimum latency considering this is a large sequential write. Indeed, for very large writes, you don't want to use immediate mode because it means writing the data twice. Even writing the log records to the slog will be slower on most pool configurations (e.g. lots of spindles and one SSD slog), because the aggregate sequential write throughput of all the spindles is usually greater than the SSD's.
Instead, the algorithm makes the worst decision possible: it writes the data in immediate mode to the main data disks. This means that all the (large) data will be committed as ZIL log records on the data disks first, then immediately afterwards it will get written again by the DMU. The overall throughput is therefore halved, and if this is a sustained load, the ZIL commit latency will be doubled compared to indirect mode.
It is shockingly easy to reproduce this issue. In pseudo-code:
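One possible C rendering of such a test, as a sketch only: it writes a buffer to a file in chunks and then times a single fsync() at the end. The function name, the path in the usage note, and the chunk size are placeholders chosen for this example, and it assumes a dataset with sync=standard and compression off so that the bytes actually have to reach the disks.

```c
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>
#include <unistd.h>

/* Write total_bytes to path in chunk_bytes pieces, then call fsync() once.
 * Returns the time spent in fsync() in seconds, or -1.0 on any error. */
double timed_fsync_write(const char *path, size_t total_bytes, size_t chunk_bytes)
{
    struct timespec t0, t1;
    double elapsed = -1.0;
    size_t left = total_bytes;
    char *buf;
    int fd;

    if (chunk_bytes == 0)
        return -1.0;
    buf = malloc(chunk_bytes);
    if (buf == NULL)
        return -1.0;
    memset(buf, 0xA5, chunk_bytes);   /* constant fill: assumes compression=off */

    fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0) {
        free(buf);
        return -1.0;
    }

    /* Buffered (asynchronous) writes first; nothing is forced to disk yet. */
    while (left > 0) {
        size_t want = left < chunk_bytes ? left : chunk_bytes;
        ssize_t n = write(fd, buf, want);
        if (n <= 0)
            goto out;
        left -= (size_t)n;
    }

    /* The single large synchronous commit that the issue is about. */
    clock_gettime(CLOCK_MONOTONIC, &t0);
    if (fsync(fd) == 0) {
        clock_gettime(CLOCK_MONOTONIC, &t1);
        elapsed = (double)(t1.tv_sec - t0.tv_sec) +
                  (double)(t1.tv_nsec - t0.tv_nsec) / 1e9;
    }
out:
    close(fd);
    free(buf);
    return elapsed;
}
```

For the 2 GB case discussed here, a call such as timed_fsync_write("/tank/testfile", (size_t)2 << 30, 1 << 20) would do, where /tank/testfile is a made-up path on the pool under test.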
Watch the zil_stats kstat page when that runs.
If you don't have a slog in your pool, then the fsync() call will complete in roughly the time it takes to write 2 GB sequentially to your main disks. This is optimal.
If you have a slog in your pool, then the fsync() call will generate twice as much write activity, and will write up to 4 GB to your main disks. Ironically, the slog won't be used at all when that happens.
The solution would be to modify the algorithm in zfs_log_write() and zvol_log_write() so that, in the conditions mentioned above, it switches to indirect writes when the commit size reaches a certain threshold (e.g. 32 MB).
I would gladly write a patch, but I won't have the time to do it, so I'm just leaving the result of my research here in case anyone's interested. If anyone wants to write the patch, it should be very simple to implement it.