
[GIT PULL] liburing/io_uring_for_each_cqe: Pull load acquire out of for loop #1249

Merged 1 commit into axboe:master on Sep 29, 2024

Conversation

@CPestka (Contributor) commented Sep 29, 2024

I thought it was a bit odd that the macro repeatedly does the atomic load. While when using it one can amortize the atomic store following the loop, I don't really see why one wouldn't want to do the same for the load, like is already being done in io_uring_peek_batch_cqe(). Maybe I'm missing something.

This pulls the load out of the loop into a local. Not sure if this is acceptable. I gave it a long ugly name to avoid name collisions, but still. Maybe making a new macro would be better.


git request-pull output:

The following changes since commit 206650ff72b6ea4d76921f9c91ebfffd9902e6a0:

  test/fixed-hugepage: skip test on -ENOMEM (2024-09-27 10:27:10 -0600)

are available in the Git repository at:

  https://github.com/CPestka/liburing for_each_cqe

for you to fetch changes up to cb02d22dba0c6005c877c14a2cd2574a4b95462c:

  liburing: Pull load_acquire out of for loop to amortize cost (2024-09-29 14:43:43 +0200)

----------------------------------------------------------------
CPestka (1):
      liburing: Pull load_acquire out of for loop to amortize cost

 src/include/liburing.h | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)


Pull Request Guidelines

  1. To make it easy for everyone to filter pull requests from email
    notifications, use [GIT PULL] as a prefix in your PR title:
[GIT PULL] Your Pull Request Title
  2. Follow the commit message format rules below.
  3. Follow the Linux kernel coding style (see: https://github.com/torvalds/linux/blob/master/Documentation/process/coding-style.rst).

Commit message format rules:

  1. The first line is the title (no more than 72 chars if possible).
  2. Then an empty line.
  3. Then a description (may be omitted for truly trivial changes).
  4. Then an empty line again (if it has a description).
  5. Then a Signed-off-by tag with your real name and email. For example:
Signed-off-by: Foo Bar <foo.bar@gmail.com>

The description should be word-wrapped at 72 chars. Some things should
not be word-wrapped. They may be some kind of quoted text - long
compiler error messages, oops reports, Link, etc. (things that have a
certain specific format).

Note that all of this goes in the commit message, not in the pull
request text. The pull request text should introduce what this pull
request does, and each commit message should explain the rationale for
why that particular change was made. The git tree is the canonical
source of truth, not GitHub.

Each patch should do one thing, and one thing only. If you find yourself
writing an explanation for why a patch is fixing multiple issues, that's
a good indication that the change should be split into separate patches.

If the commit is a fix for an issue, add a Fixes tag with the issue
URL.

Don't use GitHub anonymous email like this as the commit author:

123456789+username@users.noreply.github.com

Use a real email address!

Commit message example:

src/queue: don't flush SQ ring for new wait interface

If we have IORING_FEAT_EXT_ARG, then timeouts are done through the
syscall instead of by posting an internal timeout. This was done
both to be more efficient and to enable multi-threaded use of
the wait side. If we touch the SQ state by flushing it, that isn't
safe without synchronization.

Fixes: https://github.com/axboe/liburing/issues/402
Signed-off-by: Jens Axboe <axboe@kernel.dk>

By submitting this pull request, I acknowledge that:

  1. I have followed the above pull request guidelines.
  2. I have the rights to submit this work under the same license.
  3. I agree to a Developer Certificate of Origin (see https://developercertificate.org for more information).

The io_uring_for_each_cqe macro is used to efficiently iterate over
all currently available CQEs. It allows amortizing the cost of the
store_release on the head across all CQEs processed via this macro.
This change also amortizes the load acquire on the tail, by pulling
the load out of the for loop.

Signed-off-by: Constantin Pestka <constantin.pestka@c-pestka.de>
@axboe (Owner) commented Sep 29, 2024

I think this is fine to do, it'll iterate anything that was already available by the time of starting the loop.

@axboe axboe merged commit 40df5e6 into axboe:master Sep 29, 2024
15 checks passed
@isilence (Collaborator) commented on the changed line in src/include/liburing.h:

 	for (head = *(ring)->cq.khead; \
-	     (cqe = (head != io_uring_smp_load_acquire((ring)->cq.ktail) ? \
+	     (cqe = (head != __liburing_internal_cached_tail ? \

if (false)
    io_uring_for_each_cqe() { ... }

Try it, the for-each will still get run.

@isilence (Collaborator) commented:

It's changing behaviour in a nasty way, we can't have this version.

@CPestka (Contributor, Author) commented Sep 29, 2024

Oh. Yeah, that is nasty. Having two initializations in the for loop instead does not work here either -,-
Guess this does not work with this macro. Do you think it is worth adding another macro that does this, or do you want to keep it to one?

@axboe (Owner) commented Sep 30, 2024

Ah indeed, thanks Pavel. I'll revert it.

axboe added a commit that referenced this pull request Sep 30, 2024
This reverts commit cb02d22.

This wasn't fully baked, see:

#1249 (comment)

for details.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
@axboe (Owner) commented Sep 30, 2024

> Do you think it is worth adding another macro that does this, or do you want to keep it to one?

One would be ideal; I don't think there's any issue with only loading it upfront rather than repeatedly in the loop. There was never any guarantee on how CQEs are found, it only really guarantees that CQEs that were available at the time of the call will be iterated. E.g. the app does a wait_nr(N) and then iterates. And depending on ring setup mode, there's no way that events will show up DURING the loop, unless an old-school setup is used (e.g. signal notifications for task_work).

@isilence (Collaborator) commented Sep 30, 2024

> One would be ideal; I don't think there's any issue with only loading it upfront rather than repeatedly in the loop. There was never any guarantee on how CQEs are found.

Agree, that should be fine; it depends on timings anyway, and it might happen that we only get a new batch after finishing the "for each".

> Oh. Yeah, that is nasty. Having two initializations in the for loop instead does not work here either -,-
> Guess this does not work with this macro. Do you think it is worth adding another macro that does this, or do you want to keep it to one?

Idea 1:

for (int cached_tail = ({ head = ring.head; load_acquire(ring.tail); }); ...)

Do we require compilers to support statement expressions in liburing?

Idea 2:

struct iter {
 int head, tail;
};

for (struct iter it = {..., ...}, ...)

In which case you don't even need to pass head, so it might even be a separate helper, or head could just be a dummy parameter.

@axboe (Owner) commented Sep 30, 2024

I like idea 1, as it doesn't require a new helper for this. In terms of compiler support, that kind of construct has been used in the kernel for 20+ years, so I don't think we have to worry about that.

@isilence (Collaborator) commented:

> I like idea 1, as it doesn't require a new helper for this.

Not particularly a problem, you can always gracefully ignore an argument.

> In terms of compiler support, that kind of construct has been used in the kernel for 20+ years, so I don't think we have to worry about that.

That's if we follow kernel requirements. We're not doing MSVC, but I was still thinking about cases where the user might want some niche compiler. Maybe some heterogeneous computing like CUDA, or, when BPF was in discussion, I was thinking of making the header shareable with a libbpf program (not sure if it supports it).

@axboe (Owner) commented Sep 30, 2024

> Not particularly a problem, you can always gracefully ignore an argument.

Sure, but then folks would need to opt-in.

I hear your point on more esoteric compilers. How about this, and please stay close to the eye wash station:

diff --git a/src/include/liburing.h b/src/include/liburing.h
index 3ad061d18dfb..512edd50d5cc 100644
--- a/src/include/liburing.h
+++ b/src/include/liburing.h
@@ -310,10 +310,11 @@ int __io_uring_get_cqe(struct io_uring *ring,
 	 * io_uring_smp_load_acquire() enforces the order of tail	\
 	 * and CQE reads.						\
 	 */								\
-	for (head = *(ring)->cq.khead;					\
-	     (cqe = (head != io_uring_smp_load_acquire((ring)->cq.ktail) ? \
-		&(ring)->cq.cqes[io_uring_cqe_index(ring, head, (ring)->cq.ring_mask)] : NULL)); \
-	     head++)							\
+	for (__u32 __HEAD__ = (head) = *(ring)->cq.khead,		\
+	     __tail__ = io_uring_smp_load_acquire((ring)->cq.ktail);	\
+	     (cqe = ((head) != __tail__ ?				\
+	     &(ring)->cq.cqes[io_uring_cqe_index(ring, __HEAD__, (ring)->cq.ring_mask)] : NULL)); \
+	     (head) = ++__HEAD__)
 
 /*
  * Must be called after io_uring_for_each_cqe()

which should result in identical code to just getting rid of passing in a head iterator to begin with.

@isilence (Collaborator) commented:

>> Not particularly a problem, you can always gracefully ignore an argument.
>
> Sure, but then folks would need to opt-in.

I don't understand; the user should never use the value of the head variable after the loop, so it's basically

#define for_each(ring, head, cqe)  for_each_no_head(ring, cqe)

> I hear your point on more esoteric compilers. How about this, and please stay close to the eye wash station:

Or we can do whatever option and see if anyone complains. Though it reminds me that someone tried to compile it as strict C.


Any would do IMHO. Nit: I don't think you need to assign head.

@axboe (Owner) commented Sep 30, 2024

Just trying to avoid needing to do another helper... We have to assign head, or some compilers will (rightfully) complain that head is only ever written, never read. I'd much rather just get rid of head entirely, but IMHO it's preferable to live with the existing version over having two versions of this.
