High io_wait CPU usage #247

Open
Rolly992 opened this issue Jun 13, 2024 · 7 comments

Comments

@Rolly992

When switching an application to use io_uring as a transport, we observed a huge increase in CPU io_wait time, going from 0.01% to ~95% for the same load. Overall CPU usage, however, stays low at around 4% per core, as you can see in the charts below.
The app was running on a relatively fresh OS and kernel version.

Is such an increase in io_wait expected? What is the best way to diagnose it further? Recording a JFR didn't give much; the Socket IO events are missing.

As for the side effects of switching to io_uring, the app started to see many more timeouts, both for incoming and outgoing HTTP requests. For example, timeouts are observed both when serving the usual requests and the metrics requests, even though the latter are supposed to be fast.
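For context, the transport switch itself is roughly the following (an illustrative sketch, not the actual application code; the port, handler wiring and event-loop count are placeholders, and the io_uring classes come from the netty-incubator-transport-io_uring artifact):

```java
import io.netty.bootstrap.ServerBootstrap;
import io.netty.channel.Channel;
import io.netty.channel.ChannelInitializer;
import io.netty.channel.EventLoopGroup;
import io.netty.channel.ServerChannel;
import io.netty.channel.nio.NioEventLoopGroup;
import io.netty.channel.socket.SocketChannel;
import io.netty.channel.socket.nio.NioServerSocketChannel;
import io.netty.handler.codec.http.HttpServerCodec;
import io.netty.incubator.channel.uring.IOUring;
import io.netty.incubator.channel.uring.IOUringEventLoopGroup;
import io.netty.incubator.channel.uring.IOUringServerSocketChannel;

public final class TransportSwitchSketch {
    public static void main(String[] args) throws Exception {
        // Pick io_uring when the native transport is usable, otherwise fall back to NIO.
        boolean useIoUring = IOUring.isAvailable();

        EventLoopGroup group = useIoUring
                ? new IOUringEventLoopGroup(8)   // 8 event loops, as in this report
                : new NioEventLoopGroup(8);
        Class<? extends ServerChannel> channelClass = useIoUring
                ? IOUringServerSocketChannel.class
                : NioServerSocketChannel.class;

        ServerBootstrap bootstrap = new ServerBootstrap()
                .group(group)
                .channel(channelClass)
                .childHandler(new ChannelInitializer<SocketChannel>() {
                    @Override
                    protected void initChannel(SocketChannel ch) {
                        ch.pipeline().addLast(new HttpServerCodec());
                        // ... application HTTP handlers go here ...
                    }
                });

        Channel server = bootstrap.bind(8080).sync().channel();
        try {
            server.closeFuture().sync();
        } finally {
            group.shutdownGracefully();
        }
    }
}
```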

IO_uring enabled: [chart: netty-io_uring]

IO_uring disabled: [chart: netty-std]

JRE vendor/version: java-21-amazon-corretto
Exact version: 21.0.3+9-LTS
IO_uring version: 0.0.25.Final, x86-64
Netty version: 4.1.107-109
OS version: Ubuntu 22.04.4 LTS, on AWS EC2
Kernel version: 6.5.0-1018-aws

@franz1981
Contributor

Probably async-profiler can help there, but first a quick check:

  • you are using HTTP 1.1, am I right?
  • is it using file I/O or anything like that? What is the test doing?
  • how many event loops vs physical CPUs?
  • how many connections are active concurrently?

This change b047cec#diff-e660183916ec399e605aac22a0864bd4b02ee986c85f5d96256872b5cf7f6783R49 should have exactly the opposite effect of what you're seeing, and it should be within 0.0.25.Final (x86-64) actually, so it surprises me a bit...

@Rolly992
Author

Rolly992 commented Jun 13, 2024

  • Yeah, we are using HTTP 1.1.
  • It is heavily using file I/O: it stores and serves files to/from the local disk via an HTTP interface.
  • There are 8 event loops vs 4 CPUs (2 physical CPUs).
  • Number of concurrent connections - roughly between 50 and 200.

I've also double-checked and it turned out that I had been using io_uring 0.0.24. I switched to 0.0.25 and it didn't improve the io_wait times.

Surprisingly, I see the same picture even when the service is absolutely idle, with no traffic routed to it and no background tasks running. Even in this case the io_wait time is above 90%.
This applies both for 0.0.24 and 0.0.25.

@franz1981
Contributor

Ok, so I found the reason: NIO and epoll can use zero-copy file transfer while io_uring cannot (search for the same issue in this repo).
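To make the difference concrete, here is a rough sketch of the two file-write paths (illustrative, not taken from this application): on NIO/epoll a file can be written as a DefaultFileRegion, which Netty turns into a sendfile-style zero-copy transfer, while on io_uring the content has to be pumped through user-space buffers, for example as a ChunkedNioFile behind a ChunkedWriteHandler.

```java
import java.io.File;
import java.io.RandomAccessFile;

import io.netty.channel.ChannelHandlerContext;
import io.netty.channel.DefaultFileRegion;
import io.netty.handler.stream.ChunkedNioFile;

public final class FileWriteSketch {

    // NIO/epoll path: DefaultFileRegion maps to a sendfile(2)-style zero-copy
    // transfer, so the file bytes never pass through JVM buffers.
    static void writeZeroCopy(ChannelHandlerContext ctx, File file) {
        ctx.writeAndFlush(new DefaultFileRegion(file, 0, file.length()));
    }

    // Fallback that also works on io_uring: stream the file in chunks through
    // user-space buffers (needs a ChunkedWriteHandler in the pipeline), which
    // costs extra reads and copies compared to the zero-copy path above.
    static void writeChunked(ChannelHandlerContext ctx, File file) throws Exception {
        RandomAccessFile raf = new RandomAccessFile(file, "r");
        ctx.writeAndFlush(new ChunkedNioFile(raf.getChannel()));
    }
}
```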

@franz1981
Contributor

> I see the same picture even when the service is absolutely idle

I would profile with async-profiler or some perf top -a.
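(If attaching the CLI is inconvenient, async-profiler can also be started from inside the JVM through its Java API; a rough sketch, assuming the async-profiler jar and its native library are available to the application and that a wall-clock profile is what you want:)

```java
import one.profiler.AsyncProfiler;

public final class ProfilerSketch {
    public static void main(String[] args) throws Exception {
        // Requires libasyncProfiler to be loadable by the JVM.
        AsyncProfiler profiler = AsyncProfiler.getInstance();

        // Wall-clock profiling is what exposes where the event loops are parked.
        profiler.execute("start,event=wall");
        Thread.sleep(30_000);  // let the (idle or loaded) service run for a while
        profiler.execute("stop,file=/tmp/profile-wall.html");
    }
}
```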

@Rolly992
Author

I've profiled using async profiler, profiling the CPU and wall-clock.
The wall clock looks almost identical, while the CPU has a bit more difference.

The wall-clock difference is, as expected, that the compute threads are busy (waiting?) in the corresponding transport implementations.

IO_uring: [flame graph: uring-wall2]
Epoll: [flame graph: epoll-wall]

I couldn't find any meaningful difference in the CPU profile though: one flame graph "tower" is rooted in EpollEventLoop.run and the other in IOUringEventLoop.run.

Is there any particular information that could help investigate this further?
Could it be that io_wait is an expected state for the call io_uring is making, and not an issue?

@franz1981
Contributor

Yep, it seems that both are waiting, meaning that there is not enough network activity to perform further I/O.

@Rolly992
Author

Rolly992 commented Sep 2, 2024

Thanks for all of the suggestions and answers!

> Ok, so I found the reason: NIO and epoll can use zero-copy file transfer while io_uring cannot (search for the same issue in this repo).

And for a load that epoll can normally handle, does the absence of zero-copy file transfer explain the timeouts, both in the service endpoints and even in the HTTP metrics endpoint? Or could it explain them, at least theoretically? Does it mean that io_uring's performance gains are not utilized when working with files (reading/writing)?
