
Add a diagnostic kstat for obtaining pool status #16484

Closed
wants to merge 2 commits

Conversation

usaleem-ix
Contributor

Motivation and Context

This PR is an updated version of the previous #16026.

In the original PR, JSON was written into the buffer directly and nvlists were also converted to JSON, which was redundant.

Description

This PR creates an output nvlist, which is later printed in JSON format to the provided buffer. Spares and l2cache devices were also not showing up with the previous #16026; this is fixed now as well.

How Has This Been Tested?

Manually tested in different pool configurations.

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Performance enhancement (non-breaking change which improves efficiency)
  • Code cleanup (non-breaking change which makes code smaller or more readable)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Library ABI change (libzfs, libzfs_core, libnvpair, libuutil and libzfsbootenv)
  • Documentation (a change to man pages or other documentation)

Checklist:

.gitignore (outdated review comment, resolved)
include/sys/spa_impl.h (outdated review comment, resolved)
#define JPRINTF(start, end, ...) \
do { \
        if (start < end) \
                start += snprintf(start, end - start, __VA_ARGS__); \
} while (0)
Contributor


To be safe we need to check that the return value is not negative here.
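A minimal sketch of how that check could look, assuming the macro keeps its current shape (illustrative only, not necessarily the code that was ultimately committed):

/*
 * Hypothetical variant: only advance the cursor when snprintf()
 * succeeds, so a negative return (output error) does not move
 * `start` at all.
 */
#define JPRINTF(start, end, ...) \
do { \
        if (start < end) { \
                int _len = snprintf(start, end - start, __VA_ARGS__); \
                if (_len > 0) \
                        start += _len; \
        } \
} while (0)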

Contributor Author


Thanks, updated.

module/zfs/spa.c (outdated review comment, resolved)
module/zfs/spa_json_stats.c (outdated review comment, resolved)
@behlendorf added the Status: Code Review Needed (Ready for review and testing) label on Aug 28, 2024
fredw and others added 2 commits August 29, 2024 10:48
This kstat output does not require taking the spa_namespace
lock, as in the case for 'zpool status'. It can be used for
investigations when pools are in a hung state while holding
global locks required for a traditional 'zpool status' to
proceed.

This kstat is not safe to use in conditions where pools are
in the process of configuration changes (i.e., adding/removing
devices).  Therefore, this kstat is not intended to be a general
replacement or alternative to using 'zpool status'.

Sponsored-by: Wasabi Technology, Inc.
Sponsored-By: Klara Inc.

Co-authored-by: Don Brady <don.brady@klarasystems.com>
Signed-off-by: Don Brady <don.brady@klarasystems.com>
This commit updates the kstat for pool status and simplifies it by
creating an nvlist that contains the pool status. This nvlist is then
printed to the provided buffer in JSON format. The redundant parts of
the code have also been removed.

Signed-off-by: Umer Saleem <usaleem@ixsystems.com>
@tonyhutter
Contributor

@usaleem-ix is the main goal of this to get the JSON into a kstat so it's lockless? Or was there another use case for wanting the JSON specifically in the kstats? I ask because I'm working on some prototype code that would remove spa_namespace_lock from the zpool status callpath. Early tests are promising but I'm not 100% sure it's going to work yet.

@usaleem-ix
Contributor Author

@tonyhutter yes, the main goal of this is to get zpool status JSON into a kstat so it is lockless.

Also, instead of executing the zpool status command in a Python subprocess, ease of access to the kstats from within our TrueNAS middleware is another motivation for this.

@yocalebo

yocalebo commented Sep 3, 2024

@usaleem-ix is the main goal of this to get the JSON into a kstat so it's lockless? Or was there another use case for wanting the JSON specifically in the kstats? I ask because I'm working on some prototype code that would remove spa_namespace_lock from the zpool status callpath. Early tests are promising but I'm not 100% sure it's going to work yet.

@tonyhutter For TrueNAS, we have a few primary reasons for wanting zpool status in procfs.

  1. lockless (already mentioned)
  2. preventing the necessity for fork+exec'ing (this can become expensive; see the sketch after this list)
  3. simplifying a large part of our code-base that uses libzfs (parsing structured output from procfs becomes "trivial" compared to using libzfs)

EDIT:
4. parts of libzfs aren't thread safe and so we have to use a process pool. A process pool adds a non-trivial amount of overhead to our application and so offloading certain information to procfs allows us to lessen the usage of said process pool.
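For illustration, here is a minimal userspace sketch of reason 2: reading such a kstat directly instead of fork+exec'ing zpool. The /proc path and the pool name "tank" follow the example shown later in this thread and are assumptions, not part of this PR's code:

#include <stdio.h>
#include <stdlib.h>

/*
 * Read the (proposed) per-pool status kstat and dump it to stdout.
 * The pool name "tank" is just an example.
 */
int
main(void)
{
        FILE *fp = fopen("/proc/spl/kstat/zfs/tank/status.json", "r");
        char buf[4096];
        size_t n;

        if (fp == NULL) {
                perror("fopen");
                return (EXIT_FAILURE);
        }
        while ((n = fread(buf, 1, sizeof (buf), fp)) > 0)
                (void) fwrite(buf, 1, n, stdout);
        fclose(fp);
        return (EXIT_SUCCESS);
}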

@tonyhutter
Contributor

tonyhutter commented Sep 4, 2024

@usaleem-ix @yocalebo thanks for the info.

  1. I just opened WIP: Remove spa_namespace_lock from zpool status #16507 for lockless zpool status.

  2. The kstat means we're doing JSON generation in kernel-space. That makes me worry more about JSON edge cases (special characters in a vdev name, overflowing the JSON buffer, etc.); a small illustration of the escaping concern appears at the end of this comment.

  3. This kstat JSON is just a dump of the pool's nvlist. That means we're exposing a bunch of internal variables that the user never needs to see:

$ sudo cat /proc/spl/kstat/zfs/tank/status.json | jq
{
  "status_json_version": 4,
  "scl_config_lock": true,
  "scan_error": 2,
  "scan_stats": {
    "func": "NONE",
    "state": "NONE"
  },
...

It also API-ifies the config nvlist, which is something I think we should avoid.

  4. The status.json kstat doesn't match the zpool status JSON output.

  5. This PR introduces a second JSON implementation in the same codebase.

  6. The zpool status -j route allows us to use our delegation support, whereas the kstat does not.

Overall, I think putting the JSON functionality in libzfs may be the better route, even if it also means fixing whatever thread safety issues we have in the library. It's just nice to have all the JSON generation done in userspace.

Alternatively, if you want to go the fork+exec zpool status -j route, it's possible pre-forking could speed things up a bit.
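To make the escaping concern in point 2 concrete, here is a self-contained userspace sketch (illustrative only, not the PR's kernel code) of the kind of handling a vdev name would need before being embedded in a JSON document:

#include <stdio.h>

/*
 * Emit `s` as a JSON string literal, escaping the characters that
 * would otherwise break the document: quotes, backslashes, and
 * control characters such as newlines.
 */
static void
json_print_escaped(FILE *out, const char *s)
{
        fputc('"', out);
        for (; *s != '\0'; s++) {
                switch (*s) {
                case '"':
                        fputs("\\\"", out);
                        break;
                case '\\':
                        fputs("\\\\", out);
                        break;
                case '\n':
                        fputs("\\n", out);
                        break;
                case '\t':
                        fputs("\\t", out);
                        break;
                default:
                        if ((unsigned char)*s < 0x20)
                                fprintf(out, "\\u%04x", *s);
                        else
                                fputc(*s, out);
                }
        }
        fputc('"', out);
}

int
main(void)
{
        /* A hypothetical vdev name containing characters that need escaping. */
        json_print_escaped(stdout, "weird \"vdev\"\nname");
        fputc('\n', stdout);
        return (0);
}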

@yocalebo

yocalebo commented Sep 6, 2024

1. I just opened [WIP: Remove spa_namespace_lock from zpool status #16507](https://github.com/openzfs/zfs/pull/16507) for lockless `zpool status`.

This is great. Removing that lock for zpool status will benefit everyone (not just selfish developers like myself 😄 )

2. The kstat means we're doing JSON generation in kernel-space.  That makes me worry more about JSON edge cases (special characters in a vdev name, overflowing the JSON buffer, etc...)

I tend to agree this isn't that big of a deal. The Linux kernel does YAML formatting for certain NFS statistics, and we actually found it wasn't escaping characters properly and was producing invalid YAML. Not quite the same argument you're making about special chars, over/underflowing, etc., but procfs is pretty resilient and has a ton of information from all kinds of esoteric subsystems these days.

3. This kstat JSON is just a dump of the pool's nvlist.  That means we're exposing a bunch of internal variables that the user never needs to see:

I agree with you on this point. I see no benefit in exposing this type of information. I was under the assumption this would essentially mirror the zpool status -j output (for the most part).

It also API-ifies the config nvlist, which is something I think we should avoid.

One of the biggest gripes I have is that libzfs isn't versioned and is hard to utilize 😄 I've got no strong opinion on this side of the argument though.

4. The `status.json` kstat doesn't match the `zpool status` JSON output.

Yeah, this is a problem and I agree with you 100% here. Some subtle differences are okay but not matching at a large percentage is bad.

5. This PR introduces a second JSON implementation in the same codebase.

Agree with you 100% on this too. No reason to add unnecessary complexity like this.

6. The `zpool status -j` route allows us to use our delegation support, whereas the kstat does not.

That's a valid point.

Overall, I think putting the JSON functionality in libzfs may be the better route, even if it also means fixing whatever thread safety issues we have in the library. It's just nice to have all the JSON generation done in userspace.

Yeah, I'm all for improving libzfs. We've had a couple of community members try to use our Python libzfs bindings, and they have run into issues with non-reentrant calls being made in libzfs (and IIRC, there were some global memory objects wreaking havoc at some point). Anyways, one of the community members actually tried to help improve thread-safety by opening a PR here. There were quite a few follow-up commits to make the lib thread-safe. Maybe it is thread-safe (for the most part) and we just need to test it.

Alternatively, if you want to go the fork+exec zpool status -j route, it's possible pre-forking could speed things up a bit.

We already do this with a process pool. However, eventually the child processes get reaped and new ones get forked. This all becomes moot, though, if the library is indeed thread-safe. More tests need to be done on our side I guess.

@amotin
Member

amotin commented Nov 7, 2024

@tonyhutter I don't see a big problem in points 2 and 5, since it would be code not requiring much maintenance once written and forgotten. The rest of your points are valid to me. The question is how much work it would be to handle those now and maintain it later. I see ~300 lines of spa_json_stats.c trying to do it now, and I wonder how much more it would take to fix all the JSON output divergence.

@tonyhutter
Contributor

@amotin another issue is that I'm not convinced the pool kstats are taking the proper locks to deal with device removal/export. When I tested this back in September I was able to panic the kernel by running these two commands in parallel:

while [ 1 ] ; do sudo cat /proc/spl/kstat/zfs/tank/status.json > /dev/null ; done
sudo ./zpool export tank
[19066.009010] BUG: kernel NULL pointer dereference, address: 00000000000009a8
[19066.009592] #PF: supervisor read access in kernel mode
[19066.010015] #PF: error_code(0x0000) - not-present page
[19066.010380] PGD 0 P4D 0 
[19066.010564] Oops: Oops: 0000 [#1] PREEMPT SMP NOPTI
[19066.010913] CPU: 7 PID: 390725 Comm: cat Tainted: P           OE      6.10.3-200.fc40.x86_64 #1
[19066.011514] Hardware name: Red Hat KVM/RHEL-AV, BIOS 1.16.0-4.module+el8.9.0+19570+14a90618 04/01/2014
[19066.012233] RIP: 0010:rrw_held+0x18/0x140 [zfs]
[19066.012692] Code: 40 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 66 0f 1f 00 0f 1f 44 00 00 41 54 41 89 f4 55 65 48 8b 2d 99 5a 0d 3f 53 <48> 8b 47 28 48 89 fb 48 39 e8 0f 84 de 00 00 00 48 89 df e8 30 54
[19066.013976] RSP: 0018:ffffb90c225936b0 EFLAGS: 00010246
[19066.014339] RAX: 0000000000000000 RBX: 0000000000000980 RCX: ffff9e2f9383f340
[19066.014836] RDX: 0000000000000000 RSI: 0000000000000002 RDI: 0000000000000980
[19066.015325] RBP: ffff9e2ee89e5200 R08: 0000000000000000 R09: 0000000000000000
[19066.015818] R10: ffff9e2f9383f340 R11: 0000000000000020 R12: 0000000000000002
[19066.016308] R13: ffffb90c22593a58 R14: 0000000000000000 R15: ffffb90c22593a50
[19066.016800] FS:  00007f22fa51b740(0000) GS:ffff9e2febd80000(0000) knlGS:0000000000000000
[19066.017353] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[19066.017751] CR2: 00000000000009a8 CR3: 000000013ab60000 CR4: 0000000000750ef0
[19066.018247] PKRU: 55555554
[19066.018444] Call Trace:
[19066.018621]  <TASK>
[19066.018777]  ? __die_body.cold+0x19/0x27
[19066.019057]  ? page_fault_oops+0x15a/0x2f0
[19066.019347]  ? exc_page_fault+0x7e/0x180
[19066.019622]  ? asm_exc_page_fault+0x26/0x30
[19066.019921]  ? rrw_held+0x18/0x140 [zfs]
[19066.020313]  dsl_pool_config_enter+0x22/0x60 [zfs]
[19066.020760]  spa_prop_get+0x82/0x1110 [zfs]
[19066.021202]  ? __kmalloc_node_noprof+0x21e/0x4b0
[19066.021530]  spa_props_json+0x47/0x7d0 [zfs]
[19066.021941]  spa_generate_json_stats+0x121/0x770 [zfs]
[19066.022397]  kstat_seq_show+0x259/0x4e0 [spl]
[19066.022707]  ? kstat_seq_start+0x9a/0x470 [spl]
[19066.023029]  seq_read_iter+0x11f/0x460
[19066.023292]  seq_read+0x12e/0x170
[19066.023526]  proc_reg_read+0x5a/0xa0
[19066.023778]  vfs_read+0xb8/0x370
[19066.024009]  ksys_read+0x6d/0xf0
[19066.024237]  do_syscall_64+0x82/0x160
[19066.024494]  ? __pte_offset_map+0x1b/0x180
[19066.024780]  ? __handle_mm_fault+0xc06/0x1070
[19066.025089]  ? syscall_exit_to_user_mode+0x72/0x220
[19066.025427]  ? __count_memcg_events+0x75/0x130
[19066.025738]  ? count_memcg_events.constprop.0+0x1a/0x30
[19066.026100]  ? handle_mm_fault+0x1f0/0x300
[19066.026385]  ? do_user_addr_fault+0x36c/0x620
[19066.026689]  ? clear_bhb_loop+0x25/0x80
[19066.026959]  ? clear_bhb_loop+0x25/0x80
[19066.027226]  ? clear_bhb_loop+0x25/0x80
[19066.027494]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
[19066.027844] RIP: 0033:0x7f22fa62be11

Maybe this has been fixed - I haven't re-tested it since then.

@amotin
Member

amotin commented Nov 7, 2024

I'm not convinced the pool kstats are taking the proper locks to deal with device removal/export. When I tested this back in September I was able to panic the kernel by running these two commands in parallel:

I wonder if this issue is not specific to the pool status kstat but is a general bug in the Linux kstat implementation: not waiting for ongoing calls to complete before returning from destruction.
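The kind of synchronization being described can be sketched generically with pthreads (an illustration of the pattern only, not the actual SPL kstat code): teardown must not return until every in-flight reader has finished.

#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Readers hold the rwlock for reading while they use `data`. */
struct shared {
        pthread_rwlock_t lock;
        char *data;
};

static void
reader(struct shared *s)
{
        pthread_rwlock_rdlock(&s->lock);
        if (s->data != NULL)
                printf("%s\n", s->data);
        pthread_rwlock_unlock(&s->lock);
}

static void
teardown(struct shared *s)
{
        /* Blocks until all current readers have dropped the lock. */
        pthread_rwlock_wrlock(&s->lock);
        free(s->data);
        s->data = NULL;
        pthread_rwlock_unlock(&s->lock);
}

int
main(void)
{
        struct shared s;

        pthread_rwlock_init(&s.lock, NULL);
        s.data = strdup("pool status");
        reader(&s);
        teardown(&s);
        reader(&s);     /* Safe: data is NULL after teardown. */
        pthread_rwlock_destroy(&s.lock);
        return (0);
}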

@usaleem-ix
Contributor Author

I had been working on addressing the review comments, mainly trying to make the kstat output look similar to zpool status -j. The output of #16026 diverged significantly from what zpool status -j outputs and did not include things like checkpoint stats, device removal stats, raidz expansion stats, dedup stats, and decoding of the error list. Moreover, implementing the status and action strings from the zpool status -j output would introduce a lot of code duplication.

When the kstat output gets large for a fairly large zpool, the number of times we construct the output and go back and forth allocating a buffer of suitable size to contain it actually makes it slower than zpool status -j.

While all these problems can be looked into and fixed, doing so increases the effort required here to address them along with the issues highlighted in earlier comments.

After evaluating the cost against the value this brings, we have decided not to pursue this further. Anybody who is interested in this work is welcome to continue it in the future.
