
Define Transmit interface #1

Open
lukego opened this issue May 24, 2018 · 4 comments

Comments

@lukego
Owner

lukego commented May 24, 2018

It has to be really easy to interface with from a driver, really easy to implement in silicon, and really efficient with PCIe bandwidth.

@mlilja01

The proposed "pcap"-like layout you have suggested, where EOF is indicated by length=0, is not that PCIe-efficient because the HW would need to read one packet at a time. It would be more efficient if the block size is known up front.
One very efficient way is to have multiple packets back to back in contiguous memory and DMA based on a start address and a block size. The HW would then need to de-block, though.
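
For concreteness, a minimal sketch of what such a packed block could look like, assuming C structs; the names and fields are illustrative, not from any EasyNIC spec:

```c
#include <stdint.h>

/* Hypothetical layout: packets packed back to back in one contiguous DMA
 * buffer. The NIC reads `block_len` bytes starting at `block_addr` in one
 * large burst and de-blocks the individual packets on chip. */
struct tx_packet {
    uint16_t length;      /* bytes of packet data that follow            */
    uint8_t  data[];      /* packet bytes, padded to a fixed alignment   */
};

struct tx_block_descriptor {
    uint64_t block_addr;  /* physical address of the packed packet block */
    uint32_t block_len;   /* total size of the block, known up front     */
    uint32_t npackets;    /* number of packets packed into the block     */
};
```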

@lukego
Owner Author

lukego commented May 25, 2018

@mlilja01 Good points!

Having the hardware de-block is interesting. I would really like to support operating at 100G line rate on a single queue. This would mean the hardware needs to extract a packet from the block at ~145 MHz, which is probably close to the clock speed of the circuit. Is this reasonable?
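
For reference, a back-of-the-envelope check of that figure, assuming minimum-size 64-byte Ethernet frames plus 8 bytes of preamble and a 12-byte inter-frame gap on the wire (roughly in line with the ~145 MHz quoted above):

```c
#include <stdio.h>

int main(void) {
    double line_rate_bps = 100e9;          /* 100 Gbps                     */
    double bytes_on_wire = 64 + 8 + 12;    /* min frame + preamble + IFG   */
    double pps = line_rate_bps / (bytes_on_wire * 8.0);
    printf("%.1f Mpps\n", pps / 1e6);      /* ~148.8 Mpps, i.e. ~150 MHz   */
    return 0;
}
```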

I want to avoid the situation that we saw on the ConnectX-4 (snabbco/snabb#1007 (comment)) where, even on a 100G NIC, the per-queue performance maxed out at around 15 Mpps (10% of line rate). If that is the situation, then the application needs to shard traffic across many queues, and it can become complicated to preserve ordering (reassemble based on timestamps???) and to shard the traffic in an application-appropriate way (need an eBPF VM to hash the headers???).

So presumably it is very important that the DMA layout does not constrain per-queue parallelism on the device and allows it to extract a packet on more-or-less every cycle. Yes?

EDIT: s/GHz/MHz/

@mlilja01

A single queue running 100G is possible; we do that today on our NICs. Actually, the NICs can handle 200G, but we don't have PCIe 4.0 in any x86 servers yet. The issue we mostly see is that the SW cannot keep up with a single queue.

The drawback of a block of packets is that it is not very protocol-stack friendly. Normal networking apps like to have a buffer per packet, which is very handy but very bad PCIe-wise.
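
For contrast, a minimal sketch of the conventional buffer-per-packet descriptor style (illustrative field names, not any vendor's actual format): each descriptor points at its own buffer, so the NIC ends up issuing a separate PCIe read per packet.

```c
#include <stdint.h>

/* Conventional per-packet descriptor: one descriptor, one buffer, one
 * packet. Handy for the software stack, but each packet costs the NIC a
 * descriptor fetch plus a separate buffer read over PCIe. */
struct per_packet_tx_descriptor {
    uint64_t buf_addr;    /* physical address of this packet's buffer     */
    uint16_t buf_len;     /* length of this one packet                    */
    uint16_t flags;       /* e.g. end-of-packet, request completion       */
};
```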

@lukego
Owner Author

lukego commented May 25, 2018

> The drawback of a block of packets is that it is not very protocol-stack friendly.

Yes. I see this as a "with great power comes great responsibility" situation. The EasyNIC design will concentrate all of the complexity in one place, i.e. on the host CPU. This is different from mainstream ASIC NICs, which seem eager to divide functionality between hardware and software using more elaborate interfaces (scatter-gather, offloads, multiqueue, etc.).
