
Large synchronous writes are slow when a slog is present #1012

Closed
dechamps opened this issue Oct 4, 2012 · 27 comments
Labels
Type: Performance (Performance improvement or performance problem)

Comments

@dechamps
Contributor

dechamps commented Oct 4, 2012

Note that this issue seems to impact all ZFS implementations, not just ZFS On Linux.

ZFS uses a complicated process when it comes to deciding whether a write should be logged in indirect mode (written once by the DMU, the log records store a pointer) or in immediate mode (written in the log record, rewritten later by the DMU). Basically, it goes like this:

  • Write in indirect mode to the data vdevs if:
    • logbias=throughput, or
    • There is no slog and the write is larger than zfs_immediate_write_sz.
  • Write in immediate mode to the data vdevs if logbias=latency and:
    • There is no slog and the write is smaller than zfs_immediate_write_sz, or
  • There is a slog and the total commit size is larger than zil_slog_limit.
  • Write in immediate mode to the slog vdevs if logbias=latency, there is a slog, and the total commit size is smaller than zil_slog_limit.

The decision to use indirect or immediate mode is implemented in zfs_log_write() and zvol_log_write(). The decision to use the slog or the normal vdevs is implemented in the USE_SLOG() macro used by zil_lwb_write_start().
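
To make that decision tree concrete, here is a purely illustrative sketch. It is not the actual ZFS code (the real logic is split across the functions named above); the helper name and the enum are made up, and while zfs_immediate_write_sz and zil_slog_limit are real tunables, the zil_slog_limit value shown is only illustrative:

#include <stdbool.h>
#include <stdint.h>

/* Real tunables; the zil_slog_limit value here is only illustrative. */
static uint64_t zfs_immediate_write_sz = 32768;
static uint64_t zil_slog_limit = 1024 * 1024;

enum zil_dest { INDIRECT_TO_DATA, IMMEDIATE_TO_DATA, IMMEDIATE_TO_SLOG };

/* Hypothetical helper combining both decisions described above. */
static enum zil_dest
zil_write_decision(bool latency_bias, bool has_slog,
    uint64_t write_sz, uint64_t commit_sz)
{
        if (!latency_bias)                      /* logbias=throughput */
                return (INDIRECT_TO_DATA);
        if (!has_slog)
                return (write_sz >= zfs_immediate_write_sz ?
                    INDIRECT_TO_DATA : IMMEDIATE_TO_DATA);
        /* logbias=latency and a slog is present */
        if (commit_sz > zil_slog_limit)
                return (IMMEDIATE_TO_DATA);     /* the edge case discussed below */
        return (IMMEDIATE_TO_SLOG);
}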

The issue is that this decision process makes sense except for one particularly painful edge case, when all of the following conditions are true:

  • logbias=latency, and
  • There is a slog, and
  • There are large writes in the ZIL to be committed (e.g. > 100 MB).

In this situation, the optimal choice would be to write to the normal pool in indirect mode, which should give us the minimum latency considering this is a large sequential write. Indeed, for very large writes, you don't want to use immediate mode because it means writing the data twice. Even if you write the log records to the slog, this will be slower with most pool configurations (e.g. lots of spindles and one SSD slog), because the aggregate sequential write throughput of all the spindles is usually greater than the SSD's.

Instead, the algorithm makes the worst decision possible: it writes the data in immediate mode to the main data disks. This means that all the (large) data will be committed as ZIL log records on the data disks first, then immediately after, it will get written again by the DMU. This means the overall throughput is halved, and if this is a sustained load, the ZIL commit latency will be doubled compared to indirect mode.

It is shockingly easy to reproduce this issue. In pseudo-code:

open(file)
write(file, lots of data) // e.g. 2 GB
fsync(file)

Watch the zil_stats kstat page when that runs.

If you don't have a slog in your pool, then the fsync() call will complete in roughly the time it takes to write 2 GB sequentially to your main disks. This is optimal.

If you have a slog in your pool, then the fsync() call will generate twice as much write activity, and will write up to 4 GB to your main disks. Ironically, the slog won't be used at all when that happens.
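
For completeness, here is a minimal reproducer along the lines of the pseudo-code above (the file path, chunk size, and total size are arbitrary; run it against a dataset on the pool under test and watch zil_stats while it runs):

#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int
main(void)
{
        const size_t chunk = 1 << 20;           /* 1 MB per write() */
        const uint64_t total = 2ULL << 30;      /* ~2 GB in total */
        char *buf = malloc(chunk);

        if (buf == NULL)
                return (1);
        memset(buf, 0xab, chunk);

        /* /tank/testfile is just an example path on the pool under test. */
        int fd = open("/tank/testfile", O_WRONLY | O_CREAT | O_TRUNC, 0644);
        if (fd < 0) {
                perror("open");
                return (1);
        }
        for (uint64_t written = 0; written < total; written += chunk) {
                if (write(fd, buf, chunk) != (ssize_t)chunk) {
                        perror("write");
                        return (1);
                }
        }
        if (fsync(fd) != 0)                     /* this is the call that is slow */
                perror("fsync");
        close(fd);
        free(buf);
        return (0);
}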

The solution would be to modify zfs_log_write() and zvol_log_write() so that, under the conditions mentioned above, they switch to indirect writes when the commit size reaches a certain threshold (e.g. 32 MB).
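
For illustration, this is roughly the kind of change being suggested, sketched against the write_state selection in zfs_log_write()/zvol_log_write(). The threshold ZIL_INDIRECT_FALLBACK_SZ and the pending_commit_sz variable are hypothetical names, not actual ZFS identifiers:

/* Hypothetical sketch only: fall back to indirect logging once the pending
 * commit grows past some threshold, even when a slog is present. */
if (zilog->zl_logbias == ZFS_LOGBIAS_THROUGHPUT)
        write_state = WR_INDIRECT;
else if (!spa_has_slogs(zilog->zl_spa) &&
    resid >= zfs_immediate_write_sz)
        write_state = WR_INDIRECT;
else if (pending_commit_sz >= ZIL_INDIRECT_FALLBACK_SZ)         /* e.g. 32 MB */
        write_state = WR_INDIRECT;      /* large commit: avoid the double write */
else if (ioflag & (FSYNC | FDSYNC))
        write_state = WR_COPIED;
else
        write_state = WR_NEED_COPY;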

I would gladly write a patch, but I won't have the time to do it, so I'm just leaving the result of my research here in case anyone's interested. If anyone wants to write the patch, it should be very simple to implement it.

@behlendorf
Contributor

@dechamps I was investigating this issue yesterday, which was easy to reproduce given your excellent summary of the problem. Unfortunately, I don't think it's going to be quite as trivial to fix as we'd hoped.

Initially, I tried your suggestion of tweaking zfs_log_write() and zvol_log_write() to switch to indirect mode when exceeding a commit size threshold. In practice that proved problematic, since on my test system the log never grew large enough for me to settle on a reasonable default threshold.

Upon further reflection it was also clear to me that the log size isn't really what we want to be using here. What would be far better is to remove the assumption in the existing code that the slog is always going to be the fastest storage. As you point out above, this is almost certainly true for small I/Os, but large streaming I/O would be far better handled by the primary pool.

Ideally we want a way to determine which set of vdevs is going to stream the fastest on your system to minimize the latency. For unrelated reasons I've already been looking at tracking additional per-vdev performance data such as IOPS and throughput. Once those enhancements get merged it would be relatively straightforward for zfs_log_write() and zvol_log_write() to take device performance into account and do the right thing.

@dechamps
Contributor Author

What would be far better is to remove the assumption in the existing code that the slog is always going to be the fastest storage.

The code doesn't always assume the slog is faster than everything: when the log size exceeds zil_slog_limit, it switches to the main pool. The issue is that by the time it makes this decision, it has already decided that the write will be in immediate mode, so it ends up writing in immediate mode to the main pool, which is not what we want.

The core issue is that both decisions (indirect/immediate and slog/main) are taken by different modules at different times, so we end up with an absurd end result.

@behlendorf
Copy link
Contributor

Sure, and I think it is pretty easy to fix the worst-case behavior you described. The initial patch I put together basically added a call to USE_SLOG() when setting the slogging variable in zfs_log_write() and zvol_log_write(). That allowed it to change to indirect mode at roughly the right time. Perhaps that's still worth doing in the short term.

        slogging = spa_has_slogs(zilog->zl_spa) && USE_SLOG(zilog) &&
            (zilog->zl_logbias == ZFS_LOGBIAS_LATENCY);

For testing, it just happened that my pool's primary storage was about 3x faster at streaming than my slog. So even when it correctly used the slog in immediate mode, there was still a 3x performance penalty compared to using indirect mode to the primary pool. This second issue got me thinking about how to just do the right thing.

@dechamps
Contributor Author

The code lines in your last comment are basically what I had in mind for fixing this issue.

For testing, it just happened that my pool's primary storage was about 3x faster at streaming than my slog. So even when it correctly used the slog in immediate mode, there was still a 3x performance penalty compared to using indirect mode to the primary pool. This second issue got me thinking about how to just do the right thing.

Well, that's a trade-off. Keep in mind that for the ZIL, latency is the main performance metric, not throughput. What counts is the time it takes for zil_commit() to complete and nothing else. In your case, with primary storage 3x faster (streaming) than the slog, small commits should still go to the slog, because they will complete much faster (~0.1 ms versus ~3 ms, assuming it's an SSD). Large commits, however, should go to the primary pool, because the actual write time (i.e. disk throughput) dominates the initial seek latency.

Basically, the primary pool should be used if (initial latency + commit size / vdev throughput) is smaller for the primary pool than for the slog. For example, with a pool that has one SSD slog (0.1 ms latency, 100 MB/s) and 3 spindles (3 ms, 300 MB/s total), any ZIL commit larger than 0.435 MB will take less time to complete on the main pool. Which means zil_slog_limit should be set to roughly 512 KB.
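
As a quick sanity check of that break-even figure, a standalone sketch using only the numbers assumed in this example:

#include <stdio.h>

int
main(void)
{
        /* Numbers from the example above. */
        double slog_lat = 0.0001, slog_tput = 100e6;    /* 0.1 ms, 100 MB/s */
        double main_lat = 0.003,  main_tput = 300e6;    /* 3 ms, 300 MB/s total */

        /* Solve slog_lat + x / slog_tput == main_lat + x / main_tput for x. */
        double breakeven = (main_lat - slog_lat) /
            (1.0 / slog_tput - 1.0 / main_tput);

        printf("break-even commit size: %.3f MB\n", breakeven / 1e6);
        /* Prints ~0.435 MB, i.e. a zil_slog_limit of roughly 512 KB. */
        return (0);
}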

Note that my previous demonstration is only valid when there is no congestion, i.e. the disks are idle when the commit occurs. If the disks aren't idle, then other factors come into play, and the SSD will often win because it is likely to be less loaded than the main disks, unless all the load is on the ZIL. In addition, if the disks are busy, then writing in immediate mode (i.e. twice) on the main pool halves performance, which brings us back to the issue described in my original description.

@pyavdr
Contributor

pyavdr commented Mar 26, 2013

Is this issue still open? I guess it is solved with #1013?

@dechamps
Contributor Author

As I said in the comments of #1013, it is not.

@pyavdr
Contributor

pyavdr commented Mar 26, 2013

Ok, so currently it makes no sense to add a ZIL to a ZoL pool if there are large synchronous writes like iSCSI or NFS to zvols?

@ColdCanuck
Contributor

I assume you sidestep the issue if you set the logbias=throughput ???

On Mar 26, 2013, at 10:35, P.SCH wrote:

Ok, so currently it makes no sense to add a ZIL to a ZoL pool if there are large synchronous writes like iSCSI or NFS to zvols?



@dechamps
Contributor Author

Ok, so currently it makes no sense to add a ZIL to a ZoL pool if there are large synchronous writes like iSCSI or NFS to zvols?

Basically, yes. If you care about large synchronous writes, then adding a slog might be counter-productive. Note that this is true for all ZFS implementations, including FreeBSD and Illumos.

I assume you sidestep the issue if you set the logbias=throughput ???

Yes, but if you set logbias=throughput then the slog is never used for synchronous writes, even small ones, so that makes the slog useless (unless you have some other dataset that uses it).

@pyavdr
Contributor

pyavdr commented Mar 26, 2013

Thank you for this clarification. As there is so much discussion around SSDs for ZIL/cache, this is rather frustrating. I hope you find some time to work on this issue.

@shodanshok
Contributor

Hi all, I just read this old, still-open ticket and wondered if the problem can be somewhat sidestepped by using a quite large zil_slog_limit. Sure, with fast main pools this still impairs performance, as large synchronous writes will be logged to the ZIL and to the main pool, but it should avoid the problem of 2X writes to the main pool. Am I right, or am I missing something?

@evujumenuk

PR #6191 looks like it inadvertently fixes this problem. @dechamps, can you confirm (or refute)?

@dechamps
Contributor Author

@evujumenuk I wouldn't know. It definitely looks interesting, but the last time I looked into this was literally 5 years ago and I don't have any context around this anymore.

@evujumenuk

Maybe @dinatale2 can shed some light on this. The question (as I understand it): is it still possible for large sync writes to be written in immediate mode to data vdevs if logbias=latency and a SLOG exists?

@behlendorf
Contributor

@evujumenuk PR #6191 does not directly address this issue. When a slog device is part of the pool the assumption is still that it offers the absolute lowest latency and is preferred when logbias=latency.

With #6191 and OpenZFS 8585 (#6566) it might be easier to implement the original suggestion. Since OpenZFS 8585 does away with the batching of blocks we may have a better idea about when the log device is being overwhelmed and should transition to indirect writes. We still want the slog to soak up bursty synchronous writes.

It's also worth mentioning that if your target workload is large synchronous writes you can set logbias=throughput on the dataset today and prevent this double writing.

@dechamps
Contributor Author

@behlendorf

When a slog device is part of the pool the assumption is still that it offers the absolute lowest latency and is preferred when logbias=latency.

After re-reading my original description, I don't think that's what this issue is about. The issue is that, when faced with large synchronous writes, a slog is present, and logbias=latency, ZFS will decide to write the data in immediate mode to the main disks (not the slog!), which makes absolutely no sense under any scenario, even if you "assume that slog devices offer the absolute lowest latency".

What makes sense is either writing the data in immediate mode to the slog, or in indirect mode to the main disks. Writing large blocks in immediate mode to the main disks just results in the data getting written twice (both times to the main disks) for no reason.

Or at least that's what I can piece back together after re-reading my report from 5 years ago.

@behlendorf
Contributor

What makes sense is either writing the data in immediate mode to the slog, or in indirect mode to the main disks.

Then we should close this issue because that is the current behavior. Here's the relevant block of code which decides how the log record should be written. Large blocks will always be written indirectly when a pool lacks a slog device.

long zfs_immediate_write_sz = 32768;
        if (zilog->zl_logbias == ZFS_LOGBIAS_THROUGHPUT)
                write_state = WR_INDIRECT;
>>>     else if (!spa_has_slogs(zilog->zl_spa) &&
            resid >= zfs_immediate_write_sz)
                write_state = WR_INDIRECT;
        else if (ioflag & (FSYNC | FDSYNC))
                write_state = WR_COPIED;
        else
                write_state = WR_NEED_COPY;

@behlendorf behlendorf removed this from the 1.0.0 milestone Oct 26, 2017
@dechamps
Contributor Author

Large blocks will always be written indirectly when a pool lacks a slog device.

Yes. If there is no slog, then there is no problem. I agree. Again, that's not what this issue is about. Everything I said in this thread assumes there is a slog attached to the pool.

Again:

  • If there is no slog, large synchronous writes are going to be written in indirect mode to the main disks → optimal behavior
  • If there is a slog, large synchronous writes are going to be written in immediate mode twice to the main disks (and the slog stays idle!) → makes no sense

Examples of behaviors that would make sense, but that I did not observe when I filed this issue, include:

  • If there is a slog, large synchronous writes are going to be written to the slog in immediate mode and then permanently to the main disks
  • If there is a slog, large synchronous writes are going to be written in indirect mode to the main disks

In any case, according to my original description the issue is quite straightforward to reproduce, so it should just be a matter of trying to reproduce it with the current code to confirm that the issue is still there.

@Hypocritus

Hypocritus commented Dec 8, 2017

I think that much of the disagreement is related to the ambiguity -- and sometimes counterintuitive appearance -- of the parameter naming AND setting terminology with their at-times lack of clearly-differentiated definitions, coupled with the overriding behavior of other less-documented, less-accessible parameters' settings and/or decision logic. I believe that this is the wall that continues to be hit by many of us looking to adopt this beautiful beast with its capable promises of being the last word.

For example, we have logbias=latency. "Bias" means "a predisposition towards" ... what?? We should find out by the setting: "latency". "Latency" is also an ambiguous word meaning "the time delay", having no clear positive or negative value towards which part of the overall file commit process, unfortunately for us (figuratively) less RAM-equipped thinkers.

I know that I am not about to change ZFS, but when I learned the two opposite settings for logbias, being =latency or =throughput, it appeared (and continues to try to assert itself with me) that, according to ZFS' image of high standards of performance (ZFS: the LAST word in file systems), logbias=latency meant the log's predisposition is toward latency, a generally slower commit; as opposed to throughput appearing to mean (in opposition) get it done, now. "Fast".

This of course is not what these parameters mean, which I believe many in the community are continually finding out. We are continually learning that this particular setting is referring to "part" of a "part" of "part" of the entire file commit decision process.

In other words, for the logbias parameter's scope (which can unfortunately be overridden by other, less-accessible but equally powerful variables' settings, as well as even more powerful, all-but-undocumented logic), latency means a target of "low perceived latency" and probably worse, is supposed to mean "longer delay in having the file commit fully committed to permanent storage". Whereas throughput means "bypass the SLOG and go straight to disk", which, although is the shortest by-wire path, appears to slow down the overall throughput of the sum of the file commit processes under many workloads.

Pretzel-Minding... And often seemingly unfair with the other overriding parameters being less accessible, or less documented.

Another example of the knotted verbiage is found in the likes of "writes being logged in indirect mode or in immediate mode." To the uninitiated, these types of statements, by way of "common" reasoning, seem to imply that "indirect mode" means that the writes are "not" directly written/logged to disk, and conversely, "direct mode" would imply that the writes "are" written/logged directly. But of course, the "opposite" is true, and we are dealing with a several-step write process that involves logging on one or more levels, at potentially 3 locations: RAM, ZIL or SLOG, and Final Resting Place. The OP does a fair job of defining the true behavior of "indirect" and "direct mode" in this context.

I mean, come on. A "write being logged in indirect mode" really means that the write IS directly written to its final location, bypassing a log??? Why is the verb "logged" even used??? Why did the creators take such pains to write such convoluted, counterintuitive phrasing??? "Why God, Why???" This type of verbiage is much like saying, "It is certainly not bright outside" in describing the outside lighting at noonday, but with reference to the full-moon's illumination! A full moon is always on the opposite side of the world at noon in anyone's timezone, and therefore it is really "moon-dark", but "sun-bright" outside. It completely defies the childlike desire to want to understand "now". They just wanted to know if they could go out to play! If us kids miss just one apparently benign but critical word, with its lack of clearly differentiated meaning and context, the concept is either misunderstood, or perhaps understood as the world exists on "Opposite Day".

I fear that for many "would-be" ZFS-ists, the "last word" is unfortunately the one they never get to because it wasn't a part of an 1) official (or not), clearly-differentiated documentation, 2) in an all-encompassing, easy-to-reference manner, 3) from a single standards-oriented, open source (or not!) organization or web location.

@richardelling
Contributor

Do not assume that low-latency devices can deliver high throughput. Thus the logbias property attempts to allow some control over those conditions.

@dechamps
Contributor Author

dechamps commented Dec 12, 2017

@Hypocritus

I'm not quite sure if your rant is meant to be taken seriously (I loved your second-to-last paragraph, though). In any case, the following should clarify the terminology for those unfamiliar with the internals of the ZIL implementation:

  • logbias=latency means "optimize ZIL behavior for minimum latency". The goal is to make individual sync operations complete as quickly as possible, sacrificing overall efficiency and throughput in the process. This mode is meant to be used in applications that don't write a lot of data, but want this data to be committed to disk as quickly as possible. Which is typically what one wants for sync operations, hence it is the default.

  • logbias=throughput means "optimize ZIL behavior for maximum throughput". The goal is to make it possible to efficiently write large amounts of data in a synchronous manner, where the time it takes for a sync() call to complete doesn't matter much but the total write throughput (and overall I/O load) does. This mode sacrifices sync operation latency for a more efficient use of resources.

  • "Immediate mode" means that the data itself is written directly inside ZIL blocks when the ZIL is committed to disk. This approach provides the lowest latency and allows the ZIL commit (and thus the sync operation) to complete as quickly as possible, but it is inefficient because the data is rewritten again at TXG commit time (ZIL blocks are ephemeral).

  • "Indirect mode" means the data is written to its "final resting place" at ZIL commit time (by going directly to the DMU), and the ZIL only contains pointers to the data. It is called "indirect" precisely because the ZIL blocks contain pointers: it is literally a layer of indirection. This doesn't require rewriting the data because it's already in the right place, so it's a more efficient use of resources. However it might also take longer to commit the ZIL (i.e. sync operation latency is increased), because there are now two blocks (the ZIL block and the "final" block) that need to be written in two potentially separate places, possibly incurring the cost of a seek.

Every time ZFS commits the ZIL, it has to make a decision about two things:

  • Where the ZIL itself should be stored (SLOG - if there is one - or main disks)
  • Whether individual blocks should be written in immediate mode or indirect mode

Assuming nothing has changed since my original post, ZFS makes these decisions based on:

  • The value of the logbias property
  • The size of the block to be written (related to the zfs_immediate_write_sz tunable)
  • The total size of the pending writes to be committed (related to the zil_slog_limit tunable)

Following the decision tree that I described in my original post, there are four possible outcomes:

  1. Main disks, indirect: if logbias=throughput (which overrides everything else), or there is no slog and the write is larger than zfs_immediate_write_sz
  2. Main disks, immediate: if there is no SLOG and the write is smaller than zfs_immediate_write_sz, or there is a SLOG and the total size of the writes to be committed is larger than zil_slog_limit.
  3. SLOG, indirect: never happens
  4. SLOG, immediate: if there is a SLOG (duh) and the total size of the writes to be committed is smaller than zil_slog_limit.

The reason I filed this bug is that the above is inefficient, since it can lead to large amounts of data being written in immediate mode to the main disks even though a SLOG is present, which is a very dumb thing to do (it is strictly worse behavior than if you do not have any SLOG at all!). I believe a more efficient decision logic would be:

  1. Main disks, indirect: (same as above)
  2. Main disks, immediate: if there is no SLOG and the write is smaller than zfs_immediate_write_sz.
  3. SLOG, indirect: if there is a SLOG and the total size of the writes to be committed is larger than zil_slog_limit.
  4. SLOG, immediate: if there is a SLOG and the total size of the writes to be committed is smaller than zil_slog_limit.

Such a change would simultaneously improve latency, throughput, and efficiency in the case where large synchronous writes are happening in a SLOG-enabled pool.
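
To make the proposed table concrete, here is an illustrative sketch (again a hypothetical helper, not ZFS code; the enum and struct names are made up) that returns both halves of the decision at once:

#include <stdbool.h>
#include <stdint.h>

enum zil_dev  { ZIL_ON_MAIN, ZIL_ON_SLOG };
enum zil_mode { MODE_INDIRECT, MODE_IMMEDIATE };

struct zil_choice { enum zil_dev dev; enum zil_mode mode; };

/* Proposed logic: where the ZIL blocks go, and how the data is logged. */
static struct zil_choice
zil_decide_proposed(bool throughput_bias, bool has_slog,
    uint64_t write_sz, uint64_t commit_sz,
    uint64_t immediate_write_sz, uint64_t slog_limit)
{
        struct zil_choice c;

        if (throughput_bias ||
            (!has_slog && write_sz >= immediate_write_sz)) {
                c.dev = ZIL_ON_MAIN; c.mode = MODE_INDIRECT;    /* case 1 */
        } else if (!has_slog) {
                c.dev = ZIL_ON_MAIN; c.mode = MODE_IMMEDIATE;   /* case 2 */
        } else if (commit_sz > slog_limit) {
                c.dev = ZIL_ON_SLOG; c.mode = MODE_INDIRECT;    /* case 3 */
        } else {
                c.dev = ZIL_ON_SLOG; c.mode = MODE_IMMEDIATE;   /* case 4 */
        }
        return (c);
}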

@hocheung20

hocheung20 commented Mar 15, 2018

Sorry, I posted the last post under the wrong GitHub account (I used my work account). I have deleted it and reposted under my personal account.

I'm not sure if I understand the proposed cases 3 and 4, and I wonder if you are building your cases incorrectly.

Assumption: SLOG present and SLOG can write faster than data disks for small writes.

Case 1:
logbias = throughput
Large Write (>zil_slog_limit, so you can't write it to the slog)

Write it to the data disks in indirect mode since that doesn't need to write twice.

Case 2:
logbias = throughput
Small Write

Write to the SLOG and data disks (indirect mode - maintain throughput) simultaneously.
Acknowledge sync when SLOG completes.

Case 3:
logbias = latency
Large write. (>zil_slog_limit, so you can't write it to the slog)

The lowest latency operation here (for a single large write) is to write it to the data disks in immediate mode (remember, you can't write it to the SLOG).

Edit: If there are multiple large writes, the latency of subsequent writes will be increased. If we write in indirect mode, then we pay a small latency penalty for the first large write.

Case 4:
logbias = latency
Small write.

The lowest latency operation here would then be to write it to the SLOG while simultaneously writing it to the data device in indirect mode (maintain as much throughput as possible, latency already taken care of by SLOG).
Acknowledging the sync when the SLOG has finished writing it.

@dechamps
Contributor Author

@hocheung20

Case 1:
logbias = throughput
Large Write (>zil_slog_limit, so you can't write it to the slog)

Write it to the data disks in indirect mode since that doesn't need to write twice.

Sure. This is what ZFS already does.

Case 2:
logbias = throughput
Small Write

Write to the SLOG and data disks (indirect mode - maintain throughput) simultaneously.
Acknowledge sync when SLOG completes.

No. logbias=throughput means "maximize throughput, don't care about latency". The best way to maximize throughput is to write in indirect mode to the main disks - you're only writing the data once. If you write it to the SLOG too then you run the risk of the SLOG becoming a throughput bottleneck (remember that many SSDs are not much better than spinning rust when it comes to sequential write throughput, especially if a single SSD is competing against multiple spindles). You could also create a bottleneck in the bus itself (e.g. SATA controller) because you're sending an additional copy of the data down the wire.

There is also the problem that such a strategy could exacerbate the write endurance issues of SSD technology.

Case 3:
logbias = latency
Large write. (>zil_slog_limit, so you can't write it to the slog)

The lowest latency operation here (for a single large write) is to write it to the data disks in immediate mode (remember, you can't write it to the SLOG).

I don't think that would be smart. What would be smart would be to store the data in indirect mode on the main disks, and store the ZIL indirect blocks (i.e. the metadata) on the SLOG.

(Your proposed solution is what ZFS already does, which is why I opened this issue in the first place. There is also the problem that the strategy is weirdly inconsistent between the "SLOG present" and "SLOG not present" cases: in the first case the data is written in immediate mode to the main disks, while in the second case the data is written in indirect mode to the main disks. That discrepancy makes no sense, and I doubt it was intended.)

Edit: If there are multiple large writes, the latency of subsequent writes will be increased. If we write in indirect mode, then we pay a small latency penalty for the first large write.

It is technically true that, assuming everything is idle, writing in immediate mode to the main disks gets you the lowest possible latency (assuming you don't want to write the full data to the SLOG). However, writing in indirect mode to the main disks, while storing the ZIL indirect blocks on the SLOG, would probably make zero difference to latency (the SLOG will write the metadata faster than the main disks will write the data blocks, so you won't have to wait for the SLOG), but it would make a very large difference in terms of efficiency (you only write the data once instead of twice).

Basically my point is: your proposed solution is latency-optimal, but it's inefficient. The solution I'm proposing is both latency-optimal (as long as your SLOG is not abnormally slow) and efficient. There is no need to make a tradeoff here.

The only case where my solution is not latency optimal is if, for some reason, the SLOG takes longer to acknowledge small (metadata) writes than it takes for your main disk to acknowledge large (data) writes. Which would be weird and probably means you need to buy a SLOG that doesn't suck :)

Case 4:
logbias = latency
Small write.

The lowest latency operation here would then be to write it to the SLOG while simultaneously writing it to the data device in indirect mode (maintain as much throughput as possible, latency already taken care of by SLOG).
Acknowledging the sync when the SLOG has finished writing it.

I agree for this case, and this is what ZFS already does.

@hocheung20

hocheung20 commented Mar 17, 2018

No. logbias=throughput means "maximize throughput, don't care about latency". The best way to maximize throughput is to write in indirect mode to the main disks - you're only writing the data once. If you write it to the SLOG too then you run the risk of the SLOG becoming a throughput bottleneck (remember that many SSDs are not much better than spinning rust when it comes to sequential write throughput, especially if a single SSD is competing against multiple spindles). You could also create a bottleneck in the bus itself (e.g. SATA controller) because you're sending an additional copy of the data down the wire.

There is also the problem that such a strategy could exacerbate the write endurance issues of SSD technology.

You make good points, although I think that today, even for homelab-type use cases, if you want to use (and can afford) an SLOG, you can easily pick up something NVMe-based that is faster than your spindles while remaining reasonable on budget.

For example, I have a pair of ~$350 Optane 900p 280 GB drives (a consumer product), each of which has a rated endurance of 5 PB and long-term sustained writes of ~2 GB/s even when the drive is full. I really wish that sync writes used it, effectively turning the SLOG into a writeback cache.

In the enterprise space where ZFS really shines, I'd imagine the ruler SSD form factor favoring the speed and endurance of PCIe SSDs even more.

Case 3:
logbias = latency
Large write. (>zil_slog_limit, so you can't write it to the slog)

The lowest latency operation here (for a single large write) is to write it to the data disks in immediate mode (remember, you can't write it to the SLOG).

I don't think that would be smart. What would be smart would be to store the data in indirect mode on the main disks, and store the ZIL indirect blocks (i.e. the metadata) on the SLOG.

Ah. That makes sense now. Your explanation of immediate mode confused me a little bit, as it claimed it is the lowest-latency operation (but I guess the context is without the help of the SLOG).

Case 4:
logbias = latency
Small write.

The lowest latency operation here would then be to write it to the SLOG while simultaneously writing it to the data device in indirect mode (maintain as much throughput as possible, latency already taken care of by SLOG).
Acknowledging the sync when the SLOG has finished writing it.

I agree for this case, and this is what ZFS already does.

Just to be clear, you had proposed this case under SLOG, immediate.

@dechamps
Contributor Author

I really wish that sync writes used it, effectively turning the SLOG into a writeback cache.

You should be able to do that today by setting a very high value on the zil_slog_limit tunable, which will have the effect of making all sync writes go to the SLOG in immediate mode unconditionally. On current ZFS I think that's a good idea as long as your SLOG write throughput is more than half the write throughput of your combined spindles, and you don't care about longevity issues due to write-induced wear. It will reduce the load on your main disks because they will only have to write the data once and only as part of the normal "batch" TXG commit, not as an "urgent" ZIL write.

(Note that, strictly speaking, the SLOG can't be used as a writeback cache, because the ZIL is never read from unless you're recovering from a crash. The data stored in the ZIL is a copy of pending writes that are in RAM, and will stay in RAM until the next TXG commit, at which point the ZIL is discarded.)

@mtippmann

@dechamps zil_slog_limit is history as of 1b7c1e5 - any idea how to get the desired behavior when using ZoL 0.7.7? However, the commit reads like ZoL is already doing the right thing.

@dechamps
Contributor Author

dechamps commented Mar 30, 2018

@mtippmann Good catch. Yes, this commit by @dinatale2 appears to solve most of the problem described here. (Actually @evujumenuk had already mentioned that commit months ago but I declined to take a look at the time.)

It's not really the solution I would have hoped for, because it chooses to simply log in immediate mode to the SLOG all the time (i.e. it's equivalent to setting zil_slog_limit to infinity), while I would have preferred to fall back to indirect mode (but with the ZIL still going to the SLOG) under heavy logging. However, there are people who disagree with me on this and have expressed a preference for the approach that this commit implements (such as @hocheung20, above).

While I'm not entirely convinced that's the best approach, I do concede that it's a reasonable one and it's still way better than the old behaviour (which truly made no sense). So I think we can finally put this to rest, with credits for the fix going to @dinatale2.
