ZFS fix to make xattr=sa logging to ZIL on create/remove/update. #9078

jsai20 · 2019-07-25T13:52:57Z

No specific synchronous semantics are defined for xattrs. With
xattr=on, however, xattr updates are logged to the ZIL, and zil_commit()
is called if sync=always is set on the dataset. This provides sync
semantics for xattr=on when sync=always is set on the dataset.
The xattr=sa implementation does not log to the ZIL, so even with
sync=always, xattrs are not guaranteed to be synced before the xattr call
returns to the caller. An xattr can therefore be lost if the system
crashes before the txg carrying the xattr transaction is synced.

This change makes xattr=sa log to the ZIL on xattr create/remove/update,
and xattrs are synced to the ZIL (zil_commit() is called) for sync=always.
This makes xattr=sa behave like xattr=on.

This could also provide a basic framework for implementing sync
semantics at the file level for xattr=sa.

Signed-off-by: Jitendra Patidar <jitendra.patidar@nutanix.com>
Closes #8768

codecov · 2019-07-25T22:14:18Z

Codecov Report

Merging #9078 (31abaff) into master (161ed82) will increase coverage by 3.70%.
The diff coverage is 75.00%.

❗ Current head 31abaff differs from pull request most recent head ad5cb1c. Consider uploading reports for the commit ad5cb1c to get more accurate results

@@            Coverage Diff             @@
##           master    #9078      +/-   ##
==========================================
+ Coverage   75.17%   78.88%   +3.70%     
==========================================
  Files         402      400       -2     
  Lines      128071   121758    -6313     
==========================================
- Hits        96283    96050     -233     
+ Misses      31788    25708    -6080

Flag	Coverage Δ
kernel	`79.35% <91.07%> (+0.59%)`	⬆️
user	`66.85% <7.69%> (+19.42%)`	⬆️


Impacted Files	Coverage Δ
cmd/zdb/zdb_il.c	`33.52% <0.00%> (-20.18%)`	⬇️
cmd/ztest/ztest.c	`83.75% <ø> (+76.49%)`	⬆️
include/sys/zfs_znode.h	`100.00% <ø> (ø)`
module/zfs/zfs_replay.c	`37.74% <87.50%> (+37.74%)`	⬆️
module/zfs/zfs_sa.c	`62.13% <87.50%> (+1.63%)`	⬆️
module/zcommon/zfeature_common.c	`92.92% <100.00%> (-0.25%)`	⬇️
module/zfs/zfs_log.c	`68.77% <100.00%> (+0.33%)`	⬆️
module/zfs/zpl_xattr.c	`83.44% <100.00%> (ø)`
module/unicode/u8_textprep.c	`10.47% <0.00%> (-23.10%)`	⬇️
... and 454 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update e410130...ad5cb1c.

ahrens · 2019-07-29T18:18:13Z

Can you clarify whether we are talking about when synchronous semantics are requested vs not? How do we request those semantics? I think that we only log these operations to disk if sync=always or when we later do another (unrelated) sync operation, is that right?

It looks like with the default of xattr=on, these would be logged via zpl_xattr_set_dir()'s calls to zfs_create(), zpl_write_common(), etc. These would do zfs_log_create() but not zil_commit(). Are those semantics correct?

I think that with xattr=sa, we are not creating the itx at all (i.e. not calling zfs_log_create()). So there is no way for these operations to be logged. That does indeed seem like a bug.

behlendorf · 2019-07-29T22:34:42Z

> Can you clarify whether we are talking about when synchronous semantics are requested vs not? How do we request those semantics?

Good question. Unlike illumos and FreeBSD, the setxattr(2) system call is used to atomically manipulate xattrs on Linux. The expected behavior is described in the attr(5) man page and it differs slightly between filesystems. But it does not explicitly state if setxattr(2) is expected to behave synchronously. However, since they're implemented as a system call and one of the primary uses for xattrs is to store ACLs and selinux contexts I think making them synchronous would be prudent.

> I think that with xattr=sa, we are not creating the itx at all (i.e. not calling zfs_log_create()). So there is no way for these operations to be logged. That does indeed seem like a bug.

Adding support to log these wasn't included in the original implementation, and we never managed to revisit this and add the missing support. Which is definitely a bug.

A new ZIL log record type does seem like the correct way to handle this, but we'll need to register a new feature flag for this. Otherwise, replaying a log with these entries on another OpenZFS platform will result in a crash.

jsai20 · 2019-07-30T10:51:07Z

> > Can you clarify whether we are talking about when synchronous semantics are requested vs not? How do we request those semantics?

> Good question. Unlike illumos and FreeBSD, the setxattr(2) system call is used to atomically manipulate xattrs on Linux. The expected behavior is described in the attr(5) man page and it differs slightly between filesystems. But it does not explicitly state if setxattr(2) is expected to behave synchronously. However, since they're implemented as a system call and one of the primary uses for xattrs is to store ACLs and selinux contexts I think making them synchronous would be prudent.

> > I think that with xattr=sa, we are not creating the itx at all (i.e. not calling zfs_log_create()). So there is no way for these operations to be logged. That does indeed seem like a bug.

> Adding support to log these wasn't included in the original implementation, and we never managed to revisit this and add the missing support. Which is definitely a bug.

> A new ZIL log record type does seem like the correct way to handle this, but we'll need to register a new feature flag for this. Otherwise, replaying a log with these entries on another OpenZFS platform will result in a crash.

Thanks Matthew and Brian for the review comments. Yes, the exact semantics (sync or async) required for the operation are not very clear. For xattr=on, the xattr write/update operation is logged via the zpl_write_common() path, as Matthew mentioned, and with sync=always the log is committed to disk. For xattr=sa, the operation is not logged. So for xattr=sa we can similarly log the record and commit it (zil_commit()) to disk as part of the operation if sync=always.

Thanks Brian, I will look into the feature flag part. I am new to ZFS and still learning it.

ahrens · 2019-07-30T16:52:49Z

> since they're implemented as a system call and one of the primary uses for xattrs is to store ACLs and selinux contexts I think making them synchronous would be prudent.

Do you think that we should also change the xattr=on behavior to be always sync?

In either case, I think that like all other ZFS operations (AFAIK), xattr operations (whether xattr=on or xattr=sa) should be async by default (i.e. not persisted to disk immediately via zil_commit()).

jsai20 · 2019-08-06T13:13:20Z

> > since they're implemented as a system call and one of the primary uses for xattrs is to store ACLs and selinux contexts I think making them synchronous would be prudent.

> Do you think that we should also change the xattr=on behavior to be always sync?

> In either case, I think that like all other ZFS operations (AFAIK), xattr operations (whether xattr=on or xattr=sa) should be async by default (i.e. not persisted to disk immediately via zil_commit()).

For xattr=on, xattrs are synced to the ZIL via zpl_xattr_set_dir()->zpl_write_common()->zpl_write_common_iovec()->zfs_write() {
	....
	if (ioflag & (FSYNC | FDSYNC) ||
	    zfsvfs->z_os->os_sync == ZFS_SYNC_ALWAYS)
		zil_commit(zilog, zp->z_id);
	...
}
ioflag seems to always be passed as 0 for xattr writes, so zil_commit() is only done if the dataset property "sync" is set to "always".
xattr=on should always do sync to disk, which means, zil_commit() should always be done, without any checks. So, I am not really sure about it.

In my diff, which enables xattr=sa logging to the ZIL, I have for now kept the xattr=sa behavior the same as xattr=on, meaning zil_commit() is done only if the dataset property "sync" is set to "always". So, as with xattr=on, xattr=sa attributes are not lost on a node crash, but this is guaranteed only when sync=always is set.

I think, decision about "xattr=on" and "xattr=sa" should always do zil_commit() needs more involved discussion and not much related to diff here. A separate diff/pull request can be opened for that once it has been discussed and decided.

I have updated the diff with the feature flag change. Please have a look and provide feedback.
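
The semantics kept in this diff (always create a log record, but call zil_commit() right away only for sync=always) can be illustrated with a toy model. This is a minimal sketch, not ZFS code: every `toy_*` name and the `TOY_SYNC_*` values are hypothetical stand-ins, and the "crash" simply discards whatever was logged in memory but not yet committed.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for two values of the dataset "sync" property. */
typedef enum { TOY_SYNC_STANDARD, TOY_SYNC_ALWAYS } toy_sync_t;

/* Toy dataset: counts xattr updates at each durability stage. */
typedef struct {
	toy_sync_t sync;
	size_t logged;		/* records queued in memory only */
	size_t committed;	/* records written to the on-disk log */
} toy_dataset_t;

/* Models zil_commit(): flush every queued record to stable storage. */
static void
toy_zil_commit(toy_dataset_t *ds)
{
	assert(ds != NULL);
	ds->committed += ds->logged;
	ds->logged = 0;
}

/*
 * Models an xattr=sa update after this change: a log record is always
 * created, but it is committed immediately only when sync=always.
 */
static void
toy_set_sa_xattr(toy_dataset_t *ds)
{
	assert(ds != NULL);
	ds->logged++;
	if (ds->sync == TOY_SYNC_ALWAYS)
		toy_zil_commit(ds);
}

/*
 * Models a crash before the txg syncs: records that were only queued
 * are lost, and replay recovers just the committed ones.
 */
static size_t
toy_crash_and_replay(toy_dataset_t *ds)
{
	assert(ds != NULL);
	ds->logged = 0;
	return (ds->committed);
}
```

In this model, an update on a sync=always dataset survives the simulated crash, while an update on a sync=standard dataset survives only if some later (possibly unrelated) commit flushed it first.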

ahrens · 2019-08-06T17:06:03Z

> xattr=on should always do sync to disk, which means, zil_commit() should always be done, without any checks

Why should xattr=on always sync to disk immediately?

> I think, decision about "xattr=on" and "xattr=sa" should always do zil_commit() needs more involved discussion and not much related to diff here.

I agree that it isn't related to this PR, so we can proceed with xattr=sa having the same semantics as xattr=on currently does.

jsai20 · 2019-08-07T05:10:37Z

> > xattr=on should always do sync to disk, which means, zil_commit() should always be done, without any checks

> Why should xattr=on always sync to disk immediately?

I meant that I am not sure whether it should always be sync or not. I can't find any specific reference for it.

> > I think, decision about "xattr=on" and "xattr=sa" should always do zil_commit() needs more involved discussion and not much related to diff here.

> I agree that it isn't related to this PR, so we can proceed with xattr=sa having the same semantics as xattr=on currently does.

Thanks.

snajpa · 2019-11-13T21:52:01Z

When talking about new feature flags, there's an ultimate trade-off between these two options with the current design:

- making the pools used on Linux r/o on other platforms permanently by increasing the feature count on first use
- increment/decrement the feature on every use of the tx types - but the functions doing that are actually supposed to be called in a transaction group syncing context.

It is explained a bit better in the comment in the OverlayFS PR #9414
https://github.com/zfsonlinux/zfs/pull/9414/files#r340373675

I don't understand the need for those functions to be called in the syncing context fully, but I suppose it is to be sure that the incr/decr actually lands on disk by the end of the txg.

So, might I suggest adding new txtypes for incr/decr of ZFS pool features? That way, we could incr/decr the features no matter the context; the only trade off here is that on import, to check for the features used by the pool, we'd have to walk the IL. I'm not sure whether this can be done in a flat way for the whole pool, or whether we have to walk the dataset structure and then walk the associated IL records...

@behlendorf @ahrens perhaps you would know more and could advise on the matter?

I wouldn't mind doing the work, but I'd be glad for some guidance :)

We would probably want to use a new feature flag for feature incr/decr's in the IL, but that is something all platforms could potentially benefit from right away (unlike the two txtypes needed for renameat2, for example).

ahrens · 2019-11-14T05:39:12Z

@snajpa What you are suggesting would most likely be implemented something like this:

When we do a read-write import of the pool, when we walk all the ZIL's to claim their blocks, we also check if there are any unrecognized record types. If so, we fail the import. One downside with this is that the record types are just an enum, so all implementations of ZFS must agree about what each value means.

Alternatively, we could add a single feature flag now which is always activated (refcount=1), and add a single new ZIL record type. This new record type would specify which other feature flag must be supported in order to import the pool read-write. The other feature flag would be specified by fully-qualified name, so it wouldn't have the downside mentioned above. But it doesn't solve the backwards-compatibility for the first thing that needs it.

Personally, I think we should err on the side of simplicity. We already have the feature flag infrastructure. We shouldn't shy away from using it to introduce new feature flags and accept the fact that older/different implementations won't be able to access the pool (at least not writeable). But I could be convinced otherwise depending on the specifics. IMO, the problem this PR is solving is not that acute, so folks desiring compatibility can leave the feature disabled with little impact.
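
The first option above (walking the log records at read-write import and failing on any unrecognized record type) can be sketched roughly as follows. This is only an illustration under stated assumptions: the `TOY_TX_*` numbering and the `toy_log_importable()` helper are hypothetical, and a real implementation would walk on-disk log blocks rather than a flat array. It also shows the downside mentioned above: the check is only meaningful if every implementation agrees on what each enum value means.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical record-type numbering; real ZIL values may differ. */
enum {
	TOY_TX_CREATE = 1,
	TOY_TX_WRITE = 2,
	TOY_TX_SETSAXATTR = 3	/* the newly added type */
};

/*
 * Return true if every record type in the log is one this
 * implementation understands, i.e. lies in [1, max_known].
 * An importer would refuse a read-write import when this is false.
 */
static bool
toy_log_importable(const uint64_t *types, size_t n, uint64_t max_known)
{
	assert(types != NULL || n == 0);
	for (size_t i = 0; i < n; i++) {
		if (types[i] == 0 || types[i] > max_known)
			return (false);
	}
	return (true);
}
```

An older implementation would call this with `max_known` set to its own highest type (here `TOY_TX_WRITE`) and reject a log containing `TOY_TX_SETSAXATTR`, while a newer one would accept it.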

ahrens · 2021-06-10T05:38:37Z

It looks like maybe we decided on adding a new on-disk feature flag for this, but the code review process stalled out. @jsai20 if you were to rebase this, would it be ready for final code review and integration, as far as you know?

jsai20 · 2021-06-10T10:48:24Z

> It looks like maybe we decided on adding a new on-disk feature flag for this, but the code review process stalled out. @jsai20 if you were to rebase this, would it be ready for final code review and integration, as far as you know?

Thanks Matt. Yes, the code review stalled around the conversation on the new on-disk feature flag.
I will refresh my memory, rebase the change, and update the diff.

ghost · 2021-06-10T15:44:41Z

This will need to be applied on the FreeBSD side as well, and the test updated to use helpers from libtest for working with tunables and xattrs.

jsai20 · 2021-06-11T11:25:25Z

> This will need to be applied on the FreeBSD side as well, and the test updated to use helpers from libtest for working with tunables and xattrs.

Done.

behlendorf · 2022-02-15T17:03:21Z

@jsai20 thanks for your patience, if you can rebase this one last time it should be good to go. I'd like to try and get this merged by the end of the week.

@ahrens @nabijaczleweli @problame @freqlabs last call, if you have any remaining concerns please post a comment.

behlendorf

As of today, we enabled some additional compiler warnings. If you could rebase this on the latest master when addressing the last few comments I posted, we can verify it still builds cleanly.

behlendorf · 2022-02-18T18:10:44Z

cmd/zdb/zdb_il.c

+	char	*val;
+	int	i;
+
+	name = (char *)(lr + 1);


Suggested change

-	name = (char *)(lr + 1);
+	char *name = (char *)(lr + 1);

As such, there are no specific synchronous semantics defined for
the xattrs. But for xattr=on, it does log to ZIL and zil_commit() is
done, if sync=always is set on dataset. This provides sync semantics
for xattr=on with sync=always set on dataset.

For the xattr=sa implementation, it doesn't log to ZIL, so, even with
sync=always, xattrs are not guaranteed to be synced before xattr call
returns to caller. So, xattr can be lost if system crash happens, before
txg carrying xattr transaction is synced.

This change adds xattr=sa logging to ZIL on xattr create/remove/update
and xattrs are synced to ZIL (zil_commit() done) for sync=always.
This makes xattr=sa behavior similar to xattr=on.

Implementation notes:

The actual logging is fairly straight-forward and does not warrant
additional explanation.
However, it has been 14 years since we last added new TX types
to the ZIL [1], hence this is the first time we do it after the
introduction of zpool features. Therefore, here is an overview of the
feature activation and deactivation workflow:

1. The feature must be enabled. Otherwise, we don't log the new
   record type. This ensures compatibility with older software.
2. The feature is activated per-dataset, since the ZIL is per-dataset.
3. If the feature is enabled and dataset is not for zvol, any append to
   the ZIL chain will activate the feature for the dataset. Likewise
   for starting a new ZIL chain.
4. A dataset that doesn't have a ZIL chain has the feature deactivated.

We ensure (3) by activating on the first zil_commit() after the feature
was enabled. Since activating the features requires waiting for txg
sync, the first zil_commit() after enabling the feature will be slower
than usual. The downside is that this is really a conservative
approximation: even if we never append a 'TX_SETSAXATTR' to the ZIL
chain, we pay the penalty for feature activation. The upside is that the
user is in control of when we pay the penalty, i.e., upon enabling the
feature.

We ensure (4) by hooking into zil_sync(), where ZIL destroy actually
happens.

One more piece on feature activation, since it's spread across
multiple functions:

zil_commit()
  zil_process_commit_list()
    if lwb == NULL // first zil_commit since zil_open
      zil_create()
        if no log block pointer in ZIL header:
          if feature enabled and not active:
            // CASE 1
            enable, COALESCE txg wait with dmu_tx that allocated the
            log block
        else // log block was allocated earlier than this zil_open
          if feature enabled and not active:
            // CASE 2
            enable, EXPLICIT txg wait
    else
      // already have an in-DRAM LWB
      if feature enabled and not active:
        // this happens when we enable the feature after zil_create
        // CASE 3
        enable, EXPLICIT txg wait

[1] illumos/illumos-gate@da6c28a

Signed-off-by: Jitendra Patidar <jitendra.patidar@nutanix.com>
Closes openzfs#8768

behlendorf · 2022-02-22T21:07:14Z

@jsai20 thanks for iterating with us on this, merged!

jsai20 · 2022-02-23T04:39:38Z

> @jsai20 thanks for iterating with us on this, merged!

Thanks @behlendorf @ahrens @freqlabs @problame @nabijaczleweli @scineram and other reviewers. Thanks much for refining it!

As such, there are no specific synchronous semantics defined for
the xattrs. But for xattr=on, it does log to ZIL and zil_commit() is
done, if sync=always is set on dataset. This provides sync semantics
for xattr=on with sync=always set on dataset.

For the xattr=sa implementation, it doesn't log to ZIL, so, even with
sync=always, xattrs are not guaranteed to be synced before xattr call
returns to caller. So, xattr can be lost if system crash happens, before
txg carrying xattr transaction is synced.

This change adds xattr=sa logging to ZIL on xattr create/remove/update
and xattrs are synced to ZIL (zil_commit() done) for sync=always.
This makes xattr=sa behavior similar to xattr=on.

Implementation notes:

The actual logging is fairly straight-forward and does not warrant
additional explanation.
However, it has been 14 years since we last added new TX types
to the ZIL [1], hence this is the first time we do it after the
introduction of zpool features. Therefore, here is an overview of the
feature activation and deactivation workflow:

1. The feature must be enabled. Otherwise, we don't log the new
   record type. This ensures compatibility with older software.
2. The feature is activated per-dataset, since the ZIL is per-dataset.
3. If the feature is enabled and dataset is not for zvol, any append to
   the ZIL chain will activate the feature for the dataset. Likewise
   for starting a new ZIL chain.
4. A dataset that doesn't have a ZIL chain has the feature deactivated.

We ensure (3) by activating on the first zil_commit() after the feature
was enabled. Since activating the features requires waiting for txg
sync, the first zil_commit() after enabling the feature will be slower
than usual. The downside is that this is really a conservative
approximation: even if we never append a 'TX_SETSAXATTR' to the ZIL
chain, we pay the penalty for feature activation. The upside is that the
user is in control of when we pay the penalty, i.e., upon enabling the
feature.

We ensure (4) by hooking into zil_sync(), where ZIL destroy actually
happens.

One more piece on feature activation, since it's spread across
multiple functions:

zil_commit()
  zil_process_commit_list()
    if lwb == NULL // first zil_commit since zil_open
      zil_create()
        if no log block pointer in ZIL header:
          if feature enabled and not active:
            // CASE 1
            enable, COALESCE txg wait with dmu_tx that allocated the
            log block
        else // log block was allocated earlier than this zil_open
          if feature enabled and not active:
            // CASE 2
            enable, EXPLICIT txg wait
    else
      // already have an in-DRAM LWB
      if feature enabled and not active:
        // this happens when we enable the feature after zil_create
        // CASE 3
        enable, EXPLICIT txg wait

[1] illumos/illumos-gate@da6c28a

Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Christian Schwarz <christian.schwarz@nutanix.com>
Reviewed-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz>
Reviewed-by: Ryan Moeller <freqlabs@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Jitendra Patidar <jitendra.patidar@nutanix.com>
Closes openzfs#8768
Closes openzfs#9078
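
The three activation cases in the commit message above can be condensed into a small decision function. This is a minimal sketch for illustration, not the code that was merged: all `toy_*` and `TOY_ACT_*` names are hypothetical, and the boolean inputs are assumed to correspond to the conditions named in the pseudocode (whether an in-DRAM LWB exists, whether the ZIL header already has a log block pointer, and the feature's enabled/active state). The zvol exclusion from point (3) is modeled as an extra flag.

```c
#include <assert.h>
#include <stdbool.h>

/* Which activation path the first zil_commit() would take. */
typedef enum {
	TOY_ACT_NONE,		/* nothing to do */
	TOY_ACT_CASE1,		/* activate, txg wait coalesced with log block allocation */
	TOY_ACT_CASE2,		/* activate, explicit txg wait */
	TOY_ACT_CASE3		/* activate, explicit txg wait (feature enabled after zil_create) */
} toy_activation_t;

static toy_activation_t
toy_activation_case(bool is_zvol, bool feature_enabled, bool feature_active,
    bool have_lwb, bool header_has_log_block)
{
	/* Activation only applies to non-zvol datasets with the feature
	 * enabled but not yet active. */
	if (is_zvol || !feature_enabled || feature_active)
		return (TOY_ACT_NONE);

	/* Already have an in-DRAM LWB: the feature was enabled later. */
	if (have_lwb)
		return (TOY_ACT_CASE3);

	/* First zil_commit since zil_open: look at the ZIL header. */
	if (!header_has_log_block)
		return (TOY_ACT_CASE1);
	return (TOY_ACT_CASE2);
}

/* True when the commit pays for a separate (explicit) txg wait. */
static bool
toy_needs_explicit_txg_wait(toy_activation_t act)
{
	return (act == TOY_ACT_CASE2 || act == TOY_ACT_CASE3);
}
```

Under these assumptions, only CASE 1 avoids a separate wait, because the activation rides along with the dmu_tx that allocates the first log block.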

jsai20 force-pushed the issue-8768 branch from 46725c6 to 83bcb78 Compare July 25, 2019 14:49

ahrens added the Status: Design Review Needed Architecture or design is under discussion label Jul 29, 2019

jsai20 force-pushed the issue-8768 branch 4 times, most recently from a7f8e9a to 12fac85 Compare August 6, 2019 12:38

jsai20 force-pushed the issue-8768 branch from 12fac85 to 5423ad1 Compare August 7, 2019 04:51

ahrens reviewed Nov 14, 2019

jsai20 force-pushed the issue-8768 branch from 5423ad1 to 31abaff Compare March 5, 2020 09:40

ahrens assigned mmaybee Jun 10, 2021

jsai20 force-pushed the issue-8768 branch 2 times, most recently from 2e3d681 to 00688c9 Compare June 10, 2021 14:09

ghost suggested changes Jun 10, 2021

jsai20 force-pushed the issue-8768 branch 3 times, most recently from 2d79d97 to fbc2d93 Compare June 11, 2021 12:21

jsai20 force-pushed the issue-8768 branch 4 times, most recently from 71b0bd0 to c5085ab Compare January 23, 2022 18:45

behlendorf added Status: Accepted Ready to integrate (reviewed, tested) and removed Status: Code Review Needed Ready for review and testing labels Feb 15, 2022

nabijaczleweli approved these changes Feb 15, 2022

jsai20 force-pushed the issue-8768 branch from c5085ab to 2b67048 Compare February 16, 2022 04:57

ghost reviewed Feb 16, 2022

ghost approved these changes Feb 16, 2022

scineram reviewed Feb 17, 2022

jsai20 force-pushed the issue-8768 branch from 2b67048 to 85f772c Compare February 17, 2022 11:42

behlendorf approved these changes Feb 18, 2022

jsai20 force-pushed the issue-8768 branch from 85f772c to ad5cb1c Compare February 19, 2022 06:29

behlendorf merged commit 361a7e8 into openzfs:master Feb 22, 2022

jsai20 deleted the issue-8768 branch August 1, 2024 06:49
