Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

What caused so many failures in the 4k record size test on the London cluster #1334

Open
leftwo opened this issue Jun 3, 2024 · 3 comments

Comments

@leftwo
Copy link
Contributor

leftwo commented Jun 3, 2024

@faithanalog ran a test on the Madrid cluster, the details of which are recorded here:

https://github.com/oxidecomputer/artemiscellaneous/tree/main/cru-perf/2024-05-24-4krecordsize-omicron_c2f51e25-crucible_92fe269-propolis_6d7ed9a

There was some trouble getting all 64 VMs to start, and there could be noise here in the logs from those errors
vs. errors that happened under IO load.

The logs from propolis and crucible downstairs are here: /staff/artemis/cru-perf/2024-05-24-4krecordsize-omicron_c2f51e25-crucible_92fe269-propolis_6d7ed9a/logs.tar.gz

@leftwo leftwo changed the title Analysis of 4k record size failures on Madrid cluster Analysis of 4k record size failures on London cluster Jun 3, 2024
@leftwo leftwo changed the title Analysis of 4k record size failures on London cluster What caused so many failures in the 4k record size test on the London cluster Jun 3, 2024
@leftwo
Copy link
Contributor Author

leftwo commented Jun 4, 2024

A first pass search of all the logs I find:
No panics.

Searching upstairs logs for why clients disconnected:

cat logs/*/*propolis* | grep "client task is sending Done" | looker |  grep "client task is sending" > client-done.txt

Then vim and chop off the front of the file, then:

cat client-done.txt | sort -n | uniq -c

Gives us this summary:

1325 Done(ConnectionTimeout)
  95 Done(Timeout)
  67 Done(RequestedStop(TooManyOutstandingJobs))
  58 Done(RequestedStop(IOError))
  40 Done(Finished)
  35 Done(ConnectionFailed(Os { code: 146, kind: ConnectionRefused, message: "Connection refused" }))
  12 Done(RequestedStop(FailedLiveRepair))
   7 Done(WriteFailed(IO Error: Os { code: 12, kind: OutOfMemory, message: "Not enough space" }))
   1 Done(WriteFailed(IO Error: Os { code: 32, kind: BrokenPipe, message: "Broken pipe" }))

1325 Done(ConnectionTimeout)
This is when the upstairs is trying to connect to a downstairs, but
is not able to make the connection, so those are either never being able to connect, or after a
connection has failed, trying to reconnect when the downstairs is not answering.

95 Done(Timeout)
Happens when the downstairs fails to answer any message from the upstairs for 45 seconds.

67 Done(RequestedStop(TooManyOutstandingJobs))
Is when the upstairs has built up enough IO that it has to give up holding it for the downstairs
as the downstairs is either not answering at all, or is not answering at a fast enough rate.

@leftwo
Copy link
Contributor Author

leftwo commented Jun 4, 2024

Logs are extracted at /staff/core/crucible-1334/4k-recordsize-debug

@mkeeter
Copy link
Contributor

mkeeter commented Jul 25, 2024

After doing some research (for RFD 490), here's what I found: when doing writes, illumos ZFS splits into one transaction per recordsize. This means that going from 128 KiB → 4 KiB recordsize, we end up with 32× more transactions, which (presumably) adds a bunch of overhead.

OpenZFS has a patch for this:
openzfs/zfs@b0cbc1aa9

We should consider backporting it and then testing small recordsizes again.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants