What caused so many failures in the 4k record size test on the London cluster #1334
A first pass search of all the logs: searching the upstairs logs for why clients disconnected, then opening the result in vim to chop off the front of the file, then summarizing, gives a breakdown of the disconnect reasons (a rough sketch of that kind of pipeline is below).
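As a rough sketch of that kind of search and summary (the `upstairs/` directory layout under the extracted logs and the "disconnect" search term are assumptions, not the exact commands used):

```
cd /staff/core/crucible-1334/4k-recordsize-debug

# Placeholder search term: pull out lines about client disconnects from the
# upstairs logs (directory name assumed), without file-name prefixes.
grep -rih "disconnect" upstairs/ > disconnects.txt

# (open disconnects.txt in vim and delete the uninteresting lines at the front)

# Strip the leading timestamp field, then count identical disconnect reasons.
cut -d' ' -f2- disconnects.txt | sort | uniq -c | sort -rn
```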
Logs are extracted at /staff/core/crucible-1334/4k-recordsize-debug
After doing some research (for RFD 490), here's what I found: when doing writes, illumos ZFS splits each write into one transaction per recordsize block. This means that going from a 128 KiB to a 4 KiB recordsize, we end up with 32× more transactions, which (presumably) adds a bunch of overhead. OpenZFS has a patch for this; we should consider backporting it and then testing small recordsizes again.
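As a rough illustration of the arithmetic (the dataset name below is a placeholder, not the actual region dataset):

```
# Placeholder dataset name; substitute the real crucible region dataset.
zfs get recordsize oxp_example/crucible/regions   # e.g. 128K by default
zfs set recordsize=4k oxp_example/crucible/regions

# With one transaction per record, a 1 MiB write splits into:
#   recordsize=128K -> 1 MiB / 128 KiB =   8 transactions
#   recordsize=4K   -> 1 MiB /   4 KiB = 256 transactions  (32x more)
```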
@faithanalog ran a test on the Madrid cluster, the details of which are recorded here:
https://github.com/oxidecomputer/artemiscellaneous/tree/main/cru-perf/2024-05-24-4krecordsize-omicron_c2f51e25-crucible_92fe269-propolis_6d7ed9a
There was some trouble getting all 64 VMs to start, so the logs may contain noise from those startup errors as well as errors that happened under IO load.
The logs from propolis and crucible downstairs are here: /staff/artemis/cru-perf/2024-05-24-4krecordsize-omicron_c2f51e25-crucible_92fe269-propolis_6d7ed9a/logs.tar.gz
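One way to start separating the two classes of errors might be something like the following (the extraction directory and the "error" grep pattern are assumptions, not taken from the test notes):

```
mkdir -p /tmp/madrid-logs && cd /tmp/madrid-logs
tar xzf /staff/artemis/cru-perf/2024-05-24-4krecordsize-omicron_c2f51e25-crucible_92fe269-propolis_6d7ed9a/logs.tar.gz

# Placeholder pattern: count error lines per extracted log file, so files full
# of VM-startup errors stand out from those with errors under IO load.
grep -ric "error" . | sort -t: -k2 -rn | head -20
```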