Keep reading when there's space in the buffer for tokio::io::copy #3694
Comments
I would like to try this.
Go ahead!
@cssivision
It doesn't matter. Whatever.
Please be more careful when you take on PRs that someone has already claimed. If you are concerned that it has stalled, you should ask whether they are still working on it.
@cssivision You don't have to close your PR. I'm just saying you should have asked me in advance instead of just submitting a PR, since that risks wasting everyone's time. Please feel free to reopen your PR. (If you don't want to, that's fine.) Thanks!
Is this a real problem that needs to be handled? I find it hard to believe that, with modern drives and network speeds, the 2k buffer isn't filled all the time. I started a conversation that resulted in #3722 because I found reading and writing 2k at a time severely limited throughput. I've found that reducing the calls into the kernel can increase performance, since context switches are a fairly expensive operation, and checking whether a socket is readable or writable probably results in a kernel call as well.

Unless you're targeting tiny machines (I don't know if Rust compiles code for embedded micros), 2k is a tiny amount of RAM (barely bigger than an MTU and much smaller than a GigE jumbo packet!).

I'm asking that, if this is a feature that is to be added, please benchmark it to see how it performs in reading or writing to a gigabit Ethernet link.
That's good to know. However, I believe my other concerns are still valid.
Yeah, we should probably change the default buffer size too. The one in std uses 8k, so I guess that would be reasonable?
8k should definitely be the minimum, since jumbo frames may become more common as GigE becomes the "slowest" network interface. It'll cut the number of kernel calls to a quarter of what they are now, at least.

Another thing to consider is TCP's NO_DELAY option. Some network traffic is chit-chatty, and trying to fill the buffer defeats that option. So I'd avoid trying to fill the 8k buffer; just send what was received. This covers short exchanges that don't fill the buffer and large data transfers that completely fill it. (In other words, short messages won't see a benefit of 8k, but large data transfers will.)
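For context, NO_DELAY here refers to the TCP_NODELAY socket option, which the application sets on the stream itself rather than anything the copy helper controls. A minimal tokio sketch (the address and surrounding setup are purely illustrative):

```rust
use tokio::net::TcpStream;

#[tokio::main]
async fn main() -> std::io::Result<()> {
    // Illustrative address; NO_DELAY (disabling Nagle's algorithm) is
    // configured on the socket, not by tokio::io::copy.
    let stream = TcpStream::connect("127.0.0.1:8080").await?;
    stream.set_nodelay(true)?;
    // ... use the stream, e.g. as one side of tokio::io::copy ...
    Ok(())
}
```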
When it comes to NO_DELAY, that's not something this change would affect. As for filling the buffer, the design I had in mind would only try to read more data into the buffer if the write is not currently able to accept more data.

That is, the idea is that if we are not able to write anything right now, we might as well read some more data from the socket if we have space for it. It would not involve postponing writes in cases where we are able to write, just for the sake of batching many smaller reads together.
Sorry, I wasn't being clear. If I set NO_DELAY on the socket, but you're waiting to fill the 2k buffer before sending it on, then you've defeated the purpose of NO_DELAY. But your later response indicates you're only filling when you can't write, which I'm more in favor of. Especially if you increase the buffer size.
Right, I understand now what you meant with NO_DELAY. To confirm, I am not suggesting that we wait for the buffer to be filled before we try to send; rather, I am suggesting that if the send returns `Poll::Pending`, we use that opportunity to read more data into the buffer.
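A rough sketch of that control flow. The `CopyState` struct below is hypothetical: `cap` is the field name quoted later in the issue, and `pos` is assumed here to track how much of the buffered data has already been written. This is an illustration of the idea, not the actual implementation.

```rust
use std::pin::Pin;
use std::task::{Context, Poll};
use tokio::io::{AsyncRead, AsyncWrite, ReadBuf};

// Hypothetical bookkeeping: buf[pos..cap] has been read but not yet written.
struct CopyState {
    buf: Box<[u8]>,
    pos: usize,
    cap: usize,
}

impl CopyState {
    // One "step" of the copy: try to write buffered data, and if the writer
    // is not ready, use the wait to read more data (when there is room).
    fn poll_step<R, W>(
        &mut self,
        cx: &mut Context<'_>,
        mut reader: Pin<&mut R>,
        mut writer: Pin<&mut W>,
    ) -> Poll<std::io::Result<()>>
    where
        R: AsyncRead,
        W: AsyncWrite,
    {
        match writer.as_mut().poll_write(cx, &self.buf[self.pos..self.cap]) {
            Poll::Ready(Ok(n)) => {
                // Some bytes were accepted; advance the write cursor.
                self.pos += n;
                Poll::Ready(Ok(()))
            }
            Poll::Ready(Err(e)) => Poll::Ready(Err(e)),
            Poll::Pending => {
                // The writer can't accept data right now. Instead of going
                // idle, pull more data into the free part of the buffer.
                if self.cap < self.buf.len() {
                    let mut read_buf = ReadBuf::new(&mut self.buf[self.cap..]);
                    if let Poll::Ready(Ok(())) = reader.as_mut().poll_read(cx, &mut read_buf) {
                        let filled = read_buf.filled().len();
                        self.cap += filled;
                    }
                }
                Poll::Pending
            }
        }
    }
}
```

A real implementation would also need to handle EOF, read errors, flushing, and looping until the buffer drains; the point here is only that a pending write need not leave the reader idle.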
@Darksonn This sounds to me like the condition
Hey! So it does seem like the issue got a little stale. Do you think I could offer my take on this?
@CthulhuDen go ahead!
Is anyone currently actively working on this (@CthulhuDen perhaps)? If not, I'd like to tackle it.
@farnz Please go ahead! Sorry, I don't have numbers on hand to back me up, but I found this fix changed nothing: we're always bottlenecked on either input or output, and either way the change did not affect transfer throughput for me. But maybe you will arrive at positive results. Good luck!
Definitely do some benchmarks before and after to make sure your changes improve things. I know it's not easy writing test benchmarks, since they don't necessarily reflect the real world, but maximizing throughput can be affected in unexpected ways.

In a (C++) project, we wrapped a helper function around getting the system time. It was so easy to use that it got called in several places every time the program handled an I/O event. Getting the system time requires entering the kernel, which is pretty expensive. We changed the program to get the system time once when it woke up and save it; the helper function simply returned the cached value for the current I/O event. That easily doubled our throughput.

I know these approaches don't translate well to every situation, but the general point stands: reducing trips into the kernel can pay off in surprising ways.
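As a rough Rust analogue of that time-caching idea (the event type, loop structure, and function names are invented purely for illustration):

```rust
use std::time::Instant;

// Hypothetical event type, just for illustration.
struct Event;

fn handle_event(_ev: &Event, now: Instant) {
    // Handlers receive the timestamp captured at wake-up instead of
    // asking the clock again on every call.
    let _when = now;
}

fn process_wakeup(events: &[Event]) {
    // Read the clock once per wake-up...
    let now = Instant::now();
    for ev in events {
        // ...and reuse the cached value for every event in this batch.
        handle_event(ev, now);
    }
}

fn main() {
    process_wakeup(&[Event, Event, Event]);
}
```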
The changes to actually do extra reads while waiting for the write to become possible aren't that complex; I can put them up for review, but I'd prefer to finish the benchmarks to show that they're worthwhile first. I'm working on those benchmarks now.
So, benchmarks done, and I've been able to create one case in which the change speeds things up, and four in which it doesn't affect things. The one that benefits (significantly) is copying to an IOPS-throttled, unbuffered device.
However, I cannot see a case in which an IOPS-throttled, unbuffered device is a realistic target. If you're sending over a network, the network stack adds buffering; if it's local, the kernel will normally add buffering. The only time this could matter is if you've implemented an unbuffered device wrapper yourself.

My changes (with benchmark numbers on my aging Xeon E3-1245v2) are at master...farnz:tokio:copy_buffer - I'm happy to turn them into a proper pull request if there's still interest in this, because they do at least make it slightly harder to shoot yourself in the foot.
I've come up with a realistic benchmark that shows improvements, and I'm going to force-push my branch ready for a pull request. farnz@382c0bc has the old benchmarks from before the force push, for historic reference (my previous comment). The "trick" is to make the simulated HDD stall randomly as its buffer fills up. I believe this is realistic behaviour, since it's what a real NAS would do under an overload scenario, if you were unlucky enough to come in just after the NAS assigned buffer space to another writer.
And with a bit more thinking, I've been able to write a realistic benchmark that does show improvement. I'm going to force-push my branch and use it for the pull request, but farnz@382c0bc has the old benchmarks. The missing part in my thinking was that concurrent users introduce a random chance of blocking because all buffers are full; it's reasonable to model this as a chance of blocking that's proportional to how full your buffer is (since your buffer empties regularly, this represents a growing chance of blocking as the underlying HDD buffer fills up). With this model in the benchmark, I see wins copying from a slow source to an overloaded HDD simulation.
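A sketch of that stall model as a tokio AsyncWrite wrapper. The struct, field names, fullness-proportional stall probability, and the tiny xorshift RNG are all made up for illustration; a real benchmark would use a proper RNG and a more careful drain model.

```rust
use std::pin::Pin;
use std::task::{Context, Poll};
use tokio::io::AsyncWrite;

// Simulated overloaded device: the fuller its internal buffer, the more
// likely a write is to stall (return Pending), mimicking a busy NAS/HDD.
struct StallingWriter {
    buffered: usize, // bytes currently "in flight" in the device
    capacity: usize, // illustrative device buffer size
    rng_state: u64,  // crude xorshift state, to avoid extra dependencies
}

impl StallingWriter {
    fn next_rand(&mut self) -> f64 {
        // xorshift64: good enough for a benchmark sketch.
        self.rng_state ^= self.rng_state << 13;
        self.rng_state ^= self.rng_state >> 7;
        self.rng_state ^= self.rng_state << 17;
        (self.rng_state % 1000) as f64 / 1000.0
    }
}

impl AsyncWrite for StallingWriter {
    fn poll_write(
        mut self: Pin<&mut Self>,
        cx: &mut Context<'_>,
        buf: &[u8],
    ) -> Poll<std::io::Result<usize>> {
        let fullness = self.buffered as f64 / self.capacity as f64;
        // Chance of stalling grows as the device buffer fills up.
        if self.buffered == self.capacity || self.next_rand() < fullness {
            cx.waker().wake_by_ref(); // re-poll soon; fine for a benchmark
            return Poll::Pending;
        }
        let n = buf.len().min(self.capacity - self.buffered);
        self.buffered += n;
        Poll::Ready(Ok(n))
    }

    fn poll_flush(mut self: Pin<&mut Self>, _cx: &mut Context<'_>) -> Poll<std::io::Result<()>> {
        self.buffered = 0; // pretend the device drained everything
        Poll::Ready(Ok(()))
    }

    fn poll_shutdown(self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<std::io::Result<()>> {
        self.poll_flush(cx)
    }
}
```

Wrapping the copy's destination in something like this lets a benchmark exercise the "read while the writer is stalled" path without real hardware.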
This is the current poll method for `tokio::io::copy` and `tokio::io::copy_bidirectional`:

(embedded snippet: tokio/tokio/src/io/util/copy.rs, lines 28 to 76 at adad8fc)

However, as long as `self.cap < 2048`, it would be possible to attempt to read more data into the buffer. We should do this. One might also consider copying data from the end of the buffer to the start to make space for further reading.

See also #3572.
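A small sketch of that compaction idea, reusing the same hypothetical `pos`/`cap` bookkeeping as in the earlier sketch (illustrative only, not the actual copy.rs code):

```rust
// buf[pos..cap] holds bytes that have been read but not yet written;
// everything before pos has already been written out.
fn make_room(buf: &mut [u8], pos: &mut usize, cap: &mut usize) {
    if *pos > 0 {
        // Slide the unwritten tail to the front so buf[*cap..] becomes
        // free space that further reads can fill.
        buf.copy_within(*pos..*cap, 0);
        *cap -= *pos;
        *pos = 0;
    }
}
```

After compaction, a read into `&mut buf[cap..]` can use the reclaimed space even while the remaining bytes wait for the writer to become ready.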