High io_wait CPU usage #247

Open
Rolly992 opened this issue Jun 13, 2024 · 7 comments

Comments

@Rolly992

When switching an application to use io_uring as a transport, we observed a huge increase in CPU io_wait time, going from 0.01% to ~95% for the same load. Overall CPU usage, however, stays low at around 4% per core, as you can see in the charts below.
The app was running on a relatively fresh OS and kernel version.

Is such an increase in io_wait expected? What is the best way to diagnose it further? Recording a JFR didn't give much; the Socket IO events are missing.

As for the side effects of switching to io_uring, the app started to see many more timeouts, both for incoming and outgoing HTTP requests. For example, timeouts are observed both when serving the usual requests and the metrics requests, even though the latter are supposed to be fast.
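For context, the transport switch itself is roughly the following (an illustrative sketch, not the actual application code; the port, handler wiring and event-loop count are placeholders, and the io_uring classes come from the netty-incubator-transport-io_uring artifact):

```java
import io.netty.bootstrap.ServerBootstrap;
import io.netty.channel.Channel;
import io.netty.channel.ChannelInitializer;
import io.netty.channel.EventLoopGroup;
import io.netty.channel.ServerChannel;
import io.netty.channel.nio.NioEventLoopGroup;
import io.netty.channel.socket.SocketChannel;
import io.netty.channel.socket.nio.NioServerSocketChannel;
import io.netty.handler.codec.http.HttpServerCodec;
import io.netty.incubator.channel.uring.IOUring;
import io.netty.incubator.channel.uring.IOUringEventLoopGroup;
import io.netty.incubator.channel.uring.IOUringServerSocketChannel;

public final class TransportSwitchSketch {
    public static void main(String[] args) throws Exception {
        // Pick io_uring when the native transport is usable, otherwise fall back to NIO.
        boolean useIoUring = IOUring.isAvailable();

        EventLoopGroup group = useIoUring
                ? new IOUringEventLoopGroup(8)   // 8 event loops, as in this report
                : new NioEventLoopGroup(8);
        Class<? extends ServerChannel> channelClass = useIoUring
                ? IOUringServerSocketChannel.class
                : NioServerSocketChannel.class;

        ServerBootstrap bootstrap = new ServerBootstrap()
                .group(group)
                .channel(channelClass)
                .childHandler(new ChannelInitializer<SocketChannel>() {
                    @Override
                    protected void initChannel(SocketChannel ch) {
                        ch.pipeline().addLast(new HttpServerCodec());
                        // ... application HTTP handlers go here ...
                    }
                });

        Channel server = bootstrap.bind(8080).sync().channel();
        try {
            server.closeFuture().sync();
        } finally {
            group.shutdownGracefully();
        }
    }
}
```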

IO_uring enabled: [chart: netty-io_uring]

IO_uring disabled: [chart: netty-std]

JRE vendor/version: java-21-amazon-corretto
Exact version: 21.0.3+9-LTS
IO_uring version: 0.0.25.Final, x86-64
Netty version: 4.1.107-109
OS version: Ubuntu 22.04.4 LTS, on AWS EC2
Kernel version: 6.5.0-1018-aws

@franz1981
Contributor

Probably async-profiler can help there, but first a quick check:

  • you are using HTTP 1.1, am I right?
  • is it using file I/O or anything like that? What is the test doing?
  • how many event loops vs physical CPUs?
  • how many connections are active concurrently?

This change b047cec#diff-e660183916ec399e605aac22a0864bd4b02ee986c85f5d96256872b5cf7f6783R49 should have exactly the opposite effect of what you're seeing, and it should be within 0.0.25.Final (x86-64) actually, so it surprises me a bit...

@Rolly992
Author

Rolly992 commented Jun 13, 2024

  • Yeah, we are using HTTP 1.1.
  • It is heavily using file I/O: it stores and serves files to/from the local disk via an HTTP interface.
  • There are 8 event loops vs 4 CPUs (2 physical CPUs).
  • Number of concurrent connections - roughly between 50 and 200.

I've also double-checked and it turned out that I had been using io_uring 0.0.24. I switched to 0.0.25 and it didn't improve the io_wait times.

Surprisingly, I see the same picture even when the service is absolutely idle, with no traffic routed to it and no background tasks running. Even in this case the io_wait time is above 90%.
This applies both for 0.0.24 and 0.0.25.

@franz1981
Contributor

Ok, so I found the reason: NIO and epoll can use zero-copy file transfer while io_uring cannot (search for the same issue in this repo).
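To make the difference concrete, here is a rough sketch of the two file-write paths (illustrative, not taken from this application): on NIO/epoll a file can be written as a DefaultFileRegion, which Netty turns into a sendfile-style zero-copy transfer, while on io_uring the content has to be pumped through user-space buffers, for example as a ChunkedNioFile behind a ChunkedWriteHandler.

```java
import java.io.File;
import java.io.RandomAccessFile;

import io.netty.channel.ChannelHandlerContext;
import io.netty.channel.DefaultFileRegion;
import io.netty.handler.stream.ChunkedNioFile;

public final class FileWriteSketch {

    // NIO/epoll path: DefaultFileRegion maps to a sendfile(2)-style zero-copy
    // transfer, so the file bytes never pass through JVM buffers.
    static void writeZeroCopy(ChannelHandlerContext ctx, File file) {
        ctx.writeAndFlush(new DefaultFileRegion(file, 0, file.length()));
    }

    // Fallback that also works on io_uring: stream the file in chunks through
    // user-space buffers (needs a ChunkedWriteHandler in the pipeline), which
    // costs extra reads and copies compared to the zero-copy path above.
    static void writeChunked(ChannelHandlerContext ctx, File file) throws Exception {
        RandomAccessFile raf = new RandomAccessFile(file, "r");
        ctx.writeAndFlush(new ChunkedNioFile(raf.getChannel()));
    }
}
```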

@franz1981
Contributor

> I see the same picture even when the service is absolutely idle

I would profile with async-profiler or some perf top -a.
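(If attaching the CLI is inconvenient, async-profiler can also be started from inside the JVM through its Java API; a rough sketch, assuming the async-profiler jar and its native library are available to the application and that a wall-clock profile is what you want:)

```java
import one.profiler.AsyncProfiler;

public final class ProfilerSketch {
    public static void main(String[] args) throws Exception {
        // Requires libasyncProfiler to be loadable by the JVM.
        AsyncProfiler profiler = AsyncProfiler.getInstance();

        // Wall-clock profiling is what exposes where the event loops are parked.
        profiler.execute("start,event=wall");
        Thread.sleep(30_000);  // let the (idle or loaded) service run for a while
        profiler.execute("stop,file=/tmp/profile-wall.html");
    }
}
```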

@Rolly992
Author

I've profiled using async profiler, profiling the CPU and wall-clock.
The wall clock looks almost identical, while the CPU has a bit more difference.

The wall-clock difference is, as expected, that the compute threads are busy (waiting?) in the corresponding transport implementations.

IO_uring: [flame graph: uring-wall2]
Epoll: [flame graph: epoll-wall]

I couldn't find any meaningful difference in the CPU profile though: one flame graph "tower" is rooted in EpollEventLoop.run and the other in IOUringEventLoop.run.

Is there any particular information that could help investigate this further?
Could it be that io_wait is an expected state for the call io_uring is making, and not an issue?

@franz1981
Contributor

Yep, it seems that both are waiting, meaning that there is not enough network activity to perform further I/O.

@Rolly992
Author

Rolly992 commented Sep 2, 2024

Thanks for all of the suggestions and answers!

> Ok, so I found the reason: NIO and epoll can use zero-copy file transfer while io_uring cannot (search for the same issue in this repo).

And for a load that epoll can normally handle, does the absence of zero-copy file transfer explain the timeouts, both in the service endpoints and even in the HTTP metrics endpoint? Or could it explain them, at least theoretically? Does it mean that io_uring's performance gains are not utilized when working with files (reading/writing)?
