xsk: i40e: Tx performance improvements #320

Closed
wants to merge 6 commits into from

Commits on Nov 11, 2020

  1. adding ci files

    kernel-patches-bot committed Nov 11, 2020
    Commit 12422f9
  2. samples/bpf: increment Tx stats at sending

    Increment the count of sent Tx packets at the time of sending instead
    of at the time of completion. A completion event means that the
    buffer has been sent AND returned to user space. The packet always
    gets sent shortly after sendto() is called, but the kernel might, for
    performance reasons, decide not to return every buffer to user space
    immediately after sending, for example only after a batch of packets
    has been transmitted. Incrementing the counter at completion is
    confusing in that case: if you send a single packet, the counter
    might show zero for a while even though the packet has already been
    transmitted.
    
    Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
    Acked-by: John Fastabend <john.fastabend@gmail.com>
    magnus-karlsson authored and kernel-patches-bot committed Nov 11, 2020
    Commit c429d72
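
The difference between counting at send time and counting at completion time can be illustrated with a minimal sketch. The struct and function names below are hypothetical and are not taken from the sample application; they only model a packet counter and a count of buffers still owned by the kernel.

```c
#include <assert.h>

/* Hypothetical, simplified statistics for one socket. */
struct tx_stats {
    unsigned long tx_npkts;    /* packets handed over for transmission */
    unsigned int outstanding;  /* sent, but buffer not yet returned */
};

/* Count packets when they are submitted, as this commit describes. */
static void on_send(struct tx_stats *s, unsigned int batch)
{
    s->tx_npkts += batch;
    s->outstanding += batch;
}

/* A completion only returns buffers; it does not touch tx_npkts. */
static void on_complete(struct tx_stats *s, unsigned int completed)
{
    assert(completed <= s->outstanding);
    s->outstanding -= completed;
}
```

With this split, the packet counter reflects a single sent packet right after `on_send()`, even if the kernel delays the completion until a whole batch has been transmitted.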
  3. i40e: remove unnecessary sw_ring access from xsk Tx

    Remove the unnecessary access to the software ring in the AF_XDP
    zero-copy driver. The access was used to record the length of each
    packet so that the driver Tx completion code could sum the lengths
    to produce the total bytes sent. This sum is now computed during the
    transmission of the packet, so there is no need to record the length
    in the software ring.
    
    Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
    Acked-by: John Fastabend <john.fastabend@gmail.com>
    magnus-karlsson authored and kernel-patches-bot committed Nov 11, 2020
    Commit e5df0fd
  4. xsk: introduce padding between more ring pointers

    Introduce one cache line worth of padding between the consumer pointer
    and the flags field as well as between the flags field and the start
    of the descriptors in all the lockless rings. This prevents the x86 HW
    adjacency prefetcher from prefetching the adjacent pointer/field when
    only one pointer/field is going to be used. This improves throughput
    for the l2fwd sample app by 1% on my machine with HW prefetching
    turned on in the BIOS.
    
    Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
    Acked-by: John Fastabend <john.fastabend@gmail.com>
    magnus-karlsson authored and kernel-patches-bot committed Nov 11, 2020
    Commit 1b06b2f
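
The padding idea can be sketched in user-space C11 as follows. This is an illustration only, not the kernel's ring definition: the struct name, the field names, and the 64-byte cache line size are assumptions, and the kernel uses its own alignment macros rather than `_Alignas`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE 64  /* assumption: 64-byte cache lines */

/* Hypothetical ring header: each hot field sits on its own cache line
 * and is followed by one full cache line of padding, so an adjacent-line
 * prefetch triggered by one field pulls in padding, not another field. */
struct padded_ring {
    _Alignas(CACHE_LINE) uint32_t producer;
    _Alignas(CACHE_LINE) uint8_t pad1[CACHE_LINE];
    _Alignas(CACHE_LINE) uint32_t consumer;
    _Alignas(CACHE_LINE) uint8_t pad2[CACHE_LINE];
    _Alignas(CACHE_LINE) uint32_t flags;
    _Alignas(CACHE_LINE) uint8_t pad3[CACHE_LINE];
    uint64_t desc[];  /* descriptors start after the last padding line */
};

/* Compile-time checks that the fields are two cache lines apart. */
static_assert(offsetof(struct padded_ring, consumer) -
              offsetof(struct padded_ring, producer) == 2 * CACHE_LINE,
              "producer and consumer must be two cache lines apart");
static_assert(offsetof(struct padded_ring, flags) -
              offsetof(struct padded_ring, consumer) == 2 * CACHE_LINE,
              "consumer and flags must be two cache lines apart");
static_assert(offsetof(struct padded_ring, desc) -
              offsetof(struct padded_ring, flags) == 2 * CACHE_LINE,
              "flags and descriptors must be two cache lines apart");
```

The sketch assumes the ring itself starts on a 128-byte boundary (for example a page-aligned mapping), so that each field and its padding line form one adjacent pair.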
  5. xsk: introduce batched Tx descriptor interfaces

    Introduce batched descriptor interfaces in the xsk core code for the
    Tx path, so that drivers can use them to write a higher-performance
    code path. This interface will be used by the i40e driver in the
    next patch, though other drivers would likely benefit from it too.
    
    Note that batching is only implemented for the common case when
    there is only one socket bound to the same device and queue id. When
    this is not the case, we fall back to the old non-batched version of
    the function.
    
    Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
    magnus-karlsson authored and kernel-patches-bot committed Nov 11, 2020
    Commit f2586be
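
What batching buys can be shown with a small single-threaded model of a descriptor ring. All names here are hypothetical and this is not the xsk core interface; the sketch also omits the memory barriers a real lockless ring needs. The point is that the producer index is read once and the consumer index is written once per batch, instead of once per descriptor.

```c
#include <assert.h>
#include <stdint.h>

#define RING_SIZE 8u  /* power of two, so indices can be masked */

struct desc {
    uint64_t addr;
    uint32_t len;
};

/* Hypothetical Tx ring with free-running producer/consumer indices. */
struct ring {
    uint32_t producer;
    uint32_t consumer;
    struct desc descs[RING_SIZE];
};

/* Copy up to max descriptors into out and release them in one step.
 * Returns the number of descriptors copied. */
static uint32_t ring_consume_batch(struct ring *r, struct desc *out,
                                   uint32_t max)
{
    uint32_t avail = r->producer - r->consumer;  /* wrap-safe */
    uint32_t n = avail < max ? avail : max;
    uint32_t i;

    assert(avail <= RING_SIZE);
    for (i = 0; i < n; i++)
        out[i] = r->descs[(r->consumer + i) & (RING_SIZE - 1)];
    r->consumer += n;  /* single index update for the whole batch */
    return n;
}
```

A driver-style caller would request a fixed batch size on each call and fall back to a per-descriptor path when batching is not applicable, as the commit message describes for queues shared by several sockets.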
  6. i40e: use batched xsk Tx interfaces to increase performance

    Use the new batched xsk interfaces for the Tx path in the i40e driver
    to improve performance. On my machine, this yields a throughput
    increase of 4% for the l2fwd sample app in xdpsock. If we instead just
    look at the Tx part, this patch set increases throughput by more than
    20%.
    
    Note that I had to explicitly unroll the inner loop with a pragma to
    reach this performance level. The pragma is honored by both clang
    and gcc and should be ignored by versions that do not support
    it. Using the -funroll-loops compiler command line switch on the
    source file instead resulted in loop unrolling at a higher level,
    which led to a performance decrease instead of an increase.
    
    Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
    Acked-by: John Fastabend <john.fastabend@gmail.com>
    magnus-karlsson authored and kernel-patches-bot committed Nov 11, 2020
    Commit 8df8e06
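
A per-loop unroll request of the kind described above might look like the sketch below. The macro and function names are hypothetical and the batch size of 4 is an assumption for illustration; `#pragma clang loop unroll_count(N)` and `#pragma GCC unroll N` (GCC 8 and later) are the compiler-specific forms, and other compilers fall back to a plain loop.

```c
#include <assert.h>
#include <stdint.h>

#define BATCH 4  /* assumed batch size for this illustration */

/* Select a per-loop unroll hint depending on the compiler. */
#if defined(__clang__)
#define unrolled_for _Pragma("clang loop unroll_count(4)") for
#elif defined(__GNUC__) && __GNUC__ >= 8
#define unrolled_for _Pragma("GCC unroll 4") for
#else
#define unrolled_for for
#endif

/* Sum the lengths of one batch; only this inner loop is unrolled. */
static uint64_t sum_batch_len(const uint32_t *len)
{
    uint64_t total = 0;
    uint32_t i;

    assert(len != 0);
    unrolled_for (i = 0; i < BATCH; i++)
        total += len[i];
    return total;
}
```

Unlike `-funroll-loops`, which applies to every loop in the source file, the hint affects only the loop it is attached to.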