-
Notifications
You must be signed in to change notification settings - Fork 111
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
xsk: i40e: Tx performance improvements #320
Commits on Nov 11, 2020
-
Configuration menu - View commit details
-
Copy full SHA for 12422f9 - Browse repository at this point
Copy the full SHA 12422f9View commit details -
samples/bpf: increment Tx stats at sending
Increment the statistics over how many Tx packets have been sent at the time of sending instead of at the time of completion. This as a completion event means that the buffer has been sent AND returned to user space. The packet always gets sent shortly after sendto() is called. The kernel might, for performance reasons, decide to not return every single buffer to user space immediately after sending, for example, only after a batch of packets have been transmitted. Incrementing the number of packets sent at completion, will in that case be confusing as if you send a single packet, the counter might show zero for a while even though the packet has been transmitted. Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com> Acked-by: John Fastabend <john.fastabend@gmail.com>
Configuration menu - View commit details
-
Copy full SHA for c429d72 - Browse repository at this point
Copy the full SHA c429d72View commit details -
i40e: remove unnecessary sw_ring access from xsk Tx
Remove the unnecessary access to the software ring for the AF_XDP zero-copy driver. This was used to record the length of the packet so that the driver Tx completion code could sum this up to produce the total bytes sent. This is now performed during the transmission of the packet, so no need to record this in the software ring. Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com> Acked-by: John Fastabend <john.fastabend@gmail.com>
Configuration menu - View commit details
-
Copy full SHA for e5df0fd - Browse repository at this point
Copy the full SHA e5df0fdView commit details -
xsk: introduce padding between more ring pointers
Introduce one cache line worth of padding between the consumer pointer and the flags field as well as between the flags field and the start of the descriptors in all the lockless rings. This so that the x86 HW adjacency prefetcher will not prefetch the adjacent pointer/field when only one pointer/field is going to be used. This improves throughput performance for the l2fwd sample app with 1% on my machine with HW prefetching turned on in the BIOS. Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com> Acked-by: John Fastabend <john.fastabend@gmail.com>
Configuration menu - View commit details
-
Copy full SHA for 1b06b2f - Browse repository at this point
Copy the full SHA 1b06b2fView commit details -
xsk: introduce batched Tx descriptor interfaces
Introduce batched descriptor interfaces in the xsk core code for the Tx path to be used in the driver to write a code path with higher performance. This interface will be used by the i40e driver in the next patch. Though other drivers would likely benefit from this new interface too. Note that batching is only implemented for the common case when there is only one socket bound to the same device and queue id. When this is not the case, we fall back to the old non-batched version of the function. Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
Configuration menu - View commit details
-
Copy full SHA for f2586be - Browse repository at this point
Copy the full SHA f2586beView commit details -
i40e: use batched xsk Tx interfaces to increase performance
Use the new batched xsk interfaces for the Tx path in the i40e driver to improve performance. On my machine, this yields a throughput increase of 4% for the l2fwd sample app in xdpsock. If we instead just look at the Tx part, this patch set increases throughput with above 20% for Tx. Note that I had to explicitly loop unroll the inner loop to get to this performance level, by using a pragma. It is honored by both clang and gcc and should be ignored by versions that do not support it. Using the -funroll-loops compiler command line switch on the source file resulted in a loop unrolling on a higher level that lead to a performance decrease instead of an increase. Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com> Acked-by: John Fastabend <john.fastabend@gmail.com>
Configuration menu - View commit details
-
Copy full SHA for 8df8e06 - Browse repository at this point
Copy the full SHA 8df8e06View commit details