
Make it (somehow) possible to avoid needless allocations when serialising requests (needs GAT?) #711

Open · psychon opened this issue Jun 4, 2022 · 2 comments

psychon (Owner) commented Jun 4, 2022

I was just looking at some code and wondered why we really need vec! here:

impl GetInputFocusRequest {
    /// Serialize this request into bytes for the provided connection
    pub fn serialize(self) -> BufWithFds<PiecewiseBuf<'static>> {
        let length_so_far = 0;
        let mut request0 = vec![
            GET_INPUT_FOCUS_REQUEST,
            0,
            0,
            0,
        ];
        let length_so_far = length_so_far + request0.len();
        assert_eq!(length_so_far % 4, 0);
        let length = u16::try_from(length_so_far / 4).unwrap_or(0);
        request0[2..4].copy_from_slice(&length.to_ne_bytes());
        (vec![request0.into()], vec![])
    }
}

All of this is static. It would e.g. be possible to do something like this (however, I have no idea how to set the length field in a way that works for both big endian and little endian):

fn serialize(self) -> &'static [u8] {
    &[GET_INPUT_FOCUS_REQUEST, 0, 1, 0]
}

Of course, this return type is not compatible with everything else. Perhaps a Cow<'static, [u8]> could work?
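
For illustration, a rough sketch of what a Cow-based version might look like (hypothetical code, not the actual x11rb API). Note that u16::to_ne_bytes is a const fn, which would sidestep the endianness problem at compile time:

use std::borrow::Cow;

// Opcode for GetInputFocus (43 in the core protocol).
const GET_INPUT_FOCUS_REQUEST: u8 = 43;

fn serialize() -> Cow<'static, [u8]> {
    // Length field in 4-byte units, computed in native byte order at compile time.
    const LENGTH: [u8; 2] = 1u16.to_ne_bytes();
    const BYTES: [u8; 4] = [GET_INPUT_FOCUS_REQUEST, 0, LENGTH[0], LENGTH[1]];
    // Fixed requests borrow static bytes; dynamic requests would return Cow::Owned.
    Cow::Borrowed(&BYTES)
}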

Also: Of course, "just because" is not good enough, so: Would there be any actual benefit to doing this?

(The original code has two allocations. The one I am talking about above is for the "actual on-the-wire bytes". There are also the two vec!s in the last line. The first of those could also be replaced with something static, but Cow<'static, [Cow<'static, [u8]>]> seems a bit ugly. The second is actually free since there are no FDs and, AFAIK, an empty Vec does not allocate anything.)
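
(The last point is easy to double-check, by the way: Vec::new() is documented not to allocate until elements are pushed.)

// An empty Vec has capacity 0, i.e. no heap allocation has happened yet.
let fds: Vec<std::os::unix::io::RawFd> = Vec::new();
assert_eq!(fds.capacity(), 0);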

notgull (Collaborator) commented Jun 4, 2022

My solution to this (which I discussed earlier in #687) is to rewrite the Serialize trait so that, rather than returning a buffer of data, it fills a provided buffer with the data. The design I was experimenting with (which also takes advantage of vectored I/O) would look something like this:

use std::io::IoSlice;

pub trait Serialize {
    fn serialize<'this>(&'this self, target: &mut [IoSlice<'this>]);

    /// Get the number of slices that need to be available for `serialize`.
    fn num_slices(&self) -> usize;

    /// This function can be overridden by downstream implementors to use a
    /// fixed-size array instead of a vector.
    fn to_slices(&self, callback: impl FnOnce(&mut [IoSlice<'_>])) {
        // empty_io_slice() stands in for IoSlice::new(&[]).
        let mut target = vec![empty_io_slice(); self.num_slices()];
        self.serialize(&mut target);
        callback(&mut target);
    }
}

impl Serialize for GetInputFocusRequest {
    fn serialize<'this>(&'this self, target: &mut [IoSlice<'this>]) {
        // The header is all constants, so the array is promoted to 'static
        // (length field handling elided, as discussed above).
        target[0] = IoSlice::new(&[OPCODE, 0, 0, 0]);
    }

    fn num_slices(&self) -> usize {
        1
    }

    fn to_slices(&self, callback: impl FnOnce(&mut [IoSlice<'_>])) {
        // A single slice fits in a stack array; no heap allocation at all.
        let mut target = [empty_io_slice()];
        self.serialize(&mut target);
        callback(&mut target);
    }
}

With smaller requests, this keeps serialization off the heap entirely. With larger requests that do need the heap, it ensures that only a single level of allocation is needed.
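
For illustration, a hypothetical helper showing how a caller might consume this, assuming the trait sketched above; the slices go straight into a vectored write, so a small request is serialized and sent without touching the heap:

use std::io::{Result, Write};

fn send_request<W: Write>(conn: &mut W, request: &impl Serialize) -> Result<usize> {
    let mut written = Ok(0);
    request.to_slices(|slices| written = conn.write_vectored(slices));
    written
}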

Is it worth it? Maybe. The bottleneck here is definitely going to be server communication. It's worth a flamegraph. That being said, it should be worth it in certain cases, such as with larger requests (e.g. Trapezoids, which requires a copy under the current implementation).

notgull (Collaborator) commented Jun 5, 2022

Some things I noticed from profiling the simple_window example:

  • Just as expected, overhead from polling the server takes up most of the time, although not by as large a margin as I expected. When you factor out the time spent waiting for events to arrive, I'd say about twice as much time is spent on I/O as on allocations.
  • A surprising amount of time is spent in try_read_packets, more than pretty much anything else that isn't I/O. Most of it goes to allocations and memmove, so you could probably squeeze some important performance wins out of there.
  • PolyLineRequest::serialize() spends essentially all of its time growing and allocating the vectors it returns, so reducing the allocations here would help a lot (see the sketch after this list).
  • Other notable time sinks include Event::parse() and mutex locking/unlocking.
  • That being said, all of the above goes pretty fast in the grand scheme of things: about as much time is spent in the fmt module and writing to standard output as in everything listed above combined.
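
As a minimal illustration of the PolyLine point (hypothetical code, not the actual generated serializer): computing the exact size up front removes all of the regrowth, even without any API change.

// One exact-size allocation instead of repeated vector growth.
fn serialize_points(points: &[(i16, i16)]) -> Vec<u8> {
    let mut buf = Vec::with_capacity(4 + points.len() * 4);
    buf.extend_from_slice(&[0u8; 4]); // request header (contents elided)
    for &(x, y) in points {
        buf.extend_from_slice(&x.to_ne_bytes());
        buf.extend_from_slice(&y.to_ne_bytes());
    }
    buf
}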
