writev – zero-copy gather output #291

aantron · 2016-11-19T22:32:32Z

Lwt_unix.writev takes a sequence of bytes and Bigarray buffers, and writes the data to a file descriptor in one system call. Bigarray buffers are always written without copying. bytes buffers are written without copying if the file descriptor is in non-blocking mode, which is typical for sockets and pipes in Lwt.

Example usage:

let () =
  let t =
    let hello : bytes       = Bytes.unsafe_of_string "hello " in
    let world : Lwt_bytes.t = Lwt_bytes.of_string "world!" in (* bigarray *)

    let%lwt bytes_written =
      let io_vectors = Lwt_unix.IO_vectors.create () in
      Lwt_unix.IO_vectors.append_bytes io_vectors hello 0 6;
      Lwt_unix.IO_vectors.append_bigarray io_vectors world 0 6;
      Lwt_unix.(writev stdout io_vectors)
    in

    assert (bytes_written = 12);

    Lwt.return_unit
  in

  Lwt_main.run t

There is also a drop function to trim io_vectors if not all bytes are written.

Performance

I did some rudimentary performance testing (writev.c, writev.ml), measuring throughput. The conclusions are:

When multiple buffers are available, Lwt_unix.writev should always be preferred over multiple calls to Lwt_unix.write (or Lwt_bytes.write, or stdlib's Unix.single_write).
However, if the buffers are sufficiently small (around 128 bytes on Linux, 512 bytes on OS X), it is faster (up to around 30% in C, 250% in OCaml) to copy the buffers to a single large one, and then do a single Lwt_unix.write on the coalesced buffer. This is presumably due to the overhead of dealing with the I/O vectors by the application and kernel.
As the buffers get larger, writev outperforms coalescing by up to 50%.
For large numbers of small buffers (e.g. 512 × 128B), both writev and coalescing dramatically outperform multiple calls to write. Both are about 7× faster on non-blocking Lwt file descriptors, and over 100× faster on blocking ones. The latter is probably due to Lwt synchronizing with worker threads for I/O on blocking descriptors.
Testing the write and writev system calls from C gives very similar ratios, and the order of magnitude of buffer size at which writev becomes faster than coalescing is the same.
Lwt_unix.writev is at least 90% as fast as the writev system call, except for very small buffer sizes mentioned above, where OCaml allocations become relatively significant, but where coalescing is faster in both OCaml and C.
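
The coalescing strategy compared above can be sketched in plain OCaml. This is only an illustration: the `slice` type, `coalesce`, `should_coalesce`, and the 128-byte cut-off are hypothetical names and values chosen for the example (the cut-off is the rough Linux figure from the measurements), not part of Lwt.

```ocaml
(* A slice is a buffer plus an offset and a length, mirroring the
   arguments taken by Lwt_unix.IO_vectors.append_bytes. Hypothetical type. *)
type slice = { buffer : bytes; offset : int; length : int }

(* Hypothetical cut-off, taken from the rough Linux measurement; it is
   platform-dependent and should be measured, not assumed. *)
let small_buffer_threshold = 128

(* Copy every slice into one freshly allocated buffer, in order. *)
let coalesce slices =
  let total = List.fold_left (fun sum s -> sum + s.length) 0 slices in
  let result = Bytes.create total in
  let (_ : int) =
    List.fold_left
      (fun position s ->
        Bytes.blit s.buffer s.offset result position s.length;
        position + s.length)
      0 slices
  in
  result

(* Heuristic: copying is only likely to win when every slice is small. *)
let should_coalesce slices =
  List.for_all (fun s -> s.length <= small_buffer_threshold) slices
```

When `should_coalesce` holds, the result of `coalesce` would be passed to a single Lwt_unix.write; otherwise the slices would be appended to an I/O vector sequence and passed to Lwt_unix.writev.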

The test machines were my OS X computer, and a Linux virtual machine running in Digital Ocean. These aren't very controlled environments. The point was to make sure there are no serious errors affecting performance in this initial implementation, rather than obtain really high-quality measurements. Perhaps I will do some more thorough testing later, and write a short article on the results.

Other notes

I made the I/O vector sequence an abstract type, so we can change its representation easily in the future.
Optimizations could include an unsafe version that does not perform bounds checks, and/or versions that represent Lwt_unix.IO_vectors.t using C struct iovecs directly. The latter may have to carry extra data or make assumptions about the user retaining references to buffers, and, for bytes buffers, about the GC not running between I/O vector assembly and the call to writev (this may be especially tough in future multicore).
Even though the new code deals with Bigarrays, I put it in Lwt_unix. Hopefully, one day we will have modular implicits, and Lwt_bytes can be folded into Lwt_unix.
Maybe, at some point in the distant future, send_msg and recv_msg can be switched to use the heterogeneous I/O vectors from this PR. I am not in a hurry to break compatibility, however. Also, modular implicits may help here.
It should be possible to implement writev on Windows for Windows sockets only, but I have left that to be done on demand, or as part of some future effort to port more functions to Windows.

Resolves #281.

cc @rgrinberg, @seliopou

avsm

This is a very useful PR. Overall I'm just wondering if it makes more sense to put a Bigarray-based version of this call in Lwt_bytes, and leave the Bytes version in Lwt_unix. Right now that's the split between Bigarray and Bytes functions in Lwt.

avsm · 2016-11-21T19:44:24Z

src/unix/lwt_unix.ml

+module IO_vectors =
+struct
+  type _bigarray =
+    (char, Bigarray.int8_unsigned_elt, Bigarray.c_layout) Bigarray.Array1.t


Could you use Lwt_bytes.t instead of a new type definition here? It is the same definition.

Lwt_bytes depends on Lwt_unix, otherwise yes.

avsm · 2016-11-21T19:45:05Z

src/unix/lwt_unix.ml

+
+  type _buffer =
+    | Bytes of Bytes.t
+    | Bigarray of _bigarray


_buffer is not exposed in the external signature, so it shouldn't need the _ prefix.

I typically use the _ to indicate at a glance that something isn't exposed (or in rare cases, like _bigarray, is exposed, but I wish it weren't).

@avsm you're right here, I had forgotten that the leading _ suppresses dead code warnings. I removed the underscores in e7a6755.

avsm · 2016-11-21T19:47:37Z

src/unix/lwt_unix.ml

+     mutable reversed_suffix : _io_vector list;
+     mutable count : int}
+
+  let create () = {prefix = []; reversed_suffix = []; count = 0}


I'm not sure it's worth the extra complexity in the interface, but an optional argument of ?bytes and ?bigarrays here could be used to initialise the iov in one call rather than multiple appends. I could use that in cstruct to perform the write with less overhead from the API.

What would be the ordering if both are provided? Perhaps it is better to have separate calls such as from_bigarray_list?
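
A separate constructor along the lines suggested here could be layered on the existing append functions. A minimal sketch, assuming the lwt.unix library is available; `from_bigarray_list` is a hypothetical helper written for illustration and is not provided by Lwt.

```ocaml
(* Requires the lwt.unix library. [from_bigarray_list] is a hypothetical
   helper, not part of Lwt_unix.IO_vectors. *)
let from_bigarray_list (buffers : Lwt_bytes.t list) : Lwt_unix.IO_vectors.t =
  let io_vectors = Lwt_unix.IO_vectors.create () in
  (* Append each buffer in full, preserving the order of the list. *)
  List.iter
    (fun buffer ->
      Lwt_unix.IO_vectors.append_bigarray
        io_vectors buffer 0 (Lwt_bytes.length buffer))
    buffers;
  io_vectors
```

Because the helper takes a single list, the ordering question does not arise; mixing in bytes buffers would still go through individual append calls.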

aantron · 2016-11-21T20:30:07Z

@avsm Thanks.

I chose to have one function handle mixtures of bytes and Bigarrays because, if there are two functions, and you have, say, data in Bigarrays interleaved with punctuation in strings, you would completely lose the benefit of writev. You would have to either make individual calls for each buffer, or copy everything into a big buffer.

With that and other considerations in mind, I have an ill-defined long-term goal of merging Lwt_bytes into Lwt_unix, if/when we get modular implicits. The distinction seems a bit artificial at present, and writev seems like a case where it is better not to have it.
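
The mixed case described in this comment, Bigarray data interleaved with punctuation held in strings, might look like the following. It is a sketch under assumptions: it needs the lwt.unix library, and `interleave` and its `~separator` parameter are hypothetical names introduced for the example.

```ocaml
(* Requires the lwt.unix library. [interleave] is a hypothetical helper. *)
let interleave ~separator (fields : Lwt_bytes.t list) =
  let io_vectors = Lwt_unix.IO_vectors.create () in
  (* The separator is a small bytes buffer; the fields are Bigarrays. *)
  let separator = Bytes.of_string separator in
  List.iteri
    (fun index field ->
      (* Put the separator between fields, not before the first one. *)
      if index > 0 then
        Lwt_unix.IO_vectors.append_bytes
          io_vectors separator 0 (Bytes.length separator);
      Lwt_unix.IO_vectors.append_bigarray
        io_vectors field 0 (Lwt_bytes.length field))
    fields;
  io_vectors
```

The resulting sequence can be handed to one Lwt_unix.writev call, so neither a per-buffer write nor a copy into a large buffer is needed.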

rgrinberg · 2016-11-27T05:37:00Z

src/unix/lwt_unix.mli

+      includes its first [n] bytes. *)
+
+  val count : t -> int
+  (** [count vs] is the number of I/O vectors in the sequence [vs]. *)


I'm a bit curious as to why you decided to make this function part of the interface. I can't think of any use cases for it off the top of my head.

I was thinking that somebody might want to compare count vs with system_limit, in case the latter matters to them. I suppose, in the vast majority of cases, such users could easily keep a counter while accumulating vs. If you agree, I'll hide this function. We can always expose it later, if there is a request.

I suppose it also gives an easy way to know when all the data in the vectors has been written. Here is some kind of writev loop skeleton:

let rec loop () =
  Lwt_unix.writev fd vecs >>= fun n ->
  Lwt_unix.IO_vectors.drop vecs n;
  if Lwt_unix.IO_vectors.count vecs = 0 then Lwt.return_unit
  else loop ()

Of course, the user could keep track of how much data is in the vectors, but having count for this seems like less of a pain.

This could be an argument to expose a size or byte_count instead, though.

In this particular case, it seems like it would be better to add a predicate that would check if the whole io vector has been written yet, rather than expose the raw count. Hopefully we can fulfill other cases in this direct way as well.

aantron · 2016-11-28T14:40:40Z

@rgrinberg I've replaced count by empty. Not fully sure about the new interface:

(** [empty vs] is [true] if and only if [vs] has no I/O vectors, or all I/O
    vectors in [vs] have zero bytes. *)

Note that writev with zero vectors results in EINVAL, whereas writev with at least one vector of length zero is a valid operation, and empty does not distinguish between these. However, I figured that the main use of empty is for drop loops. If the user wants to do a zero-byte write using writev, they would construct a special I/O vector and not need to call empty on it anyway.

aantron · 2016-11-28T15:19:37Z

writev with zero vectors results in EINVAL

Apparently, that's only the case on Mac (and I guess other BSDs). Automated testing FTW.

rgrinberg · 2016-11-28T15:30:36Z

On 11/28, Anton wrote:

> @rgrinberg I've replaced `count` by `empty`. Not fully sure about the new interface:
>
> (** [empty vs] is [true] if and only if [vs] has no I/O vectors, or all I/O
>     vectors in [vs] have zero bytes. *)

I'm not fully sure either. Also, shouldn't `empty` be used to construct an empty io_vector? Something like is_empty is more in line with conventions (such as core's). But I think something like empty/is_empty is pretty normal for most containers, so I don't see the harm in adding it.

aantron · 2016-11-28T15:39:27Z

It seems that the BSD behavior is more strict, while still being POSIX-compliant. There are two options. I favor the first one:

1. Fail with Invalid_argument if a zero-length vector list is passed, ensuring BSD-compatible behavior at the Lwt level. Lwt code written and tested on Linux will be portable to BSD without the risk of new unexpected failures.
2. Allow writev to do whatever the system's writev does.

After a brief search, I haven't found any mention that the Linux behavior is deliberate. Will probably look further later, however.

BSD writev:

[EINVAL]           The iovcnt argument was less than or equal to 0, or
                   greater than IOV_MAX.

POSIX writev:

...may fail...

EINVAL The iovcnt argument was less than or equal to 0, or greater
       than {IOV_MAX}.

Linux writev:

EINVAL The vector count, iovcnt, is less than zero or greater than
       the permitted maximum.

I'll change empty to is_empty.

aantron · 2016-11-28T23:28:59Z

Regarding zero-length I/O vector lists, I decided it's better to only document the difference and tell users not to rely on the behavior. Don't really have any argument for dealing with this corner case in any particular way.
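
With is_empty in place, a drop loop that never relies on what writev does with an empty sequence could be sketched as follows. This assumes the lwt.unix library; `write_all` is a hypothetical name, not a function provided by Lwt.

```ocaml
(* Requires the lwt.unix library. [write_all] is a hypothetical helper. *)
let rec write_all fd io_vectors =
  if Lwt_unix.IO_vectors.is_empty io_vectors then
    (* Nothing left to write. Checking first also avoids calling writev
       on an empty sequence, whose result differs between platforms. *)
    Lwt.return_unit
  else
    Lwt.bind (Lwt_unix.writev fd io_vectors) (fun bytes_written ->
      (* Trim what was written, then retry with the remainder. *)
      Lwt_unix.IO_vectors.drop io_vectors bytes_written;
      write_all fd io_vectors)
```

Note that this loop consumes `io_vectors`: once the returned promise resolves, the sequence is empty.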

aantron modified the milestone: 2.7.0 Nov 19, 2016

aantron force-pushed the writev branch from 15cd0d8 to 20e1858 Compare November 21, 2016 17:12

avsm reviewed Nov 21, 2016

rgrinberg reviewed Nov 27, 2016

aantron force-pushed the writev branch 2 times, most recently from 160aa90 to fcbc348 Compare November 28, 2016 22:41

aantron added 2 commits November 28, 2016 17:34

Add Lwt_unix.writev

f9690b0

Add Lwt_unix.IO_vectors.is_empty

b6d1f9f

aantron force-pushed the writev branch from fcbc348 to b6d1f9f Compare November 28, 2016 23:35

aantron merged commit b6d1f9f into master Nov 29, 2016

aantron mentioned this pull request Nov 29, 2016

Bind readv #297

Closed

aantron deleted the writev branch November 29, 2016 22:10

aantron mentioned this pull request Nov 29, 2016

readv – zero-copy scatter input #299

Merged

aantron mentioned this pull request Dec 22, 2016

Release 2.7.0 on 3 January #305

Closed

aantron mentioned this pull request Jan 15, 2018

Test the Unix binding #539

Open

35 tasks
