
Too many memcpys #64301

Closed
nnethercote opened this issue Sep 9, 2019 · 11 comments
Labels
I-compiletime (Issue: Problems and improvements with respect to compile times.), T-compiler (Relevant to the compiler team, which will review and decide on the PR/issue.)

Comments

@nnethercote (Contributor)

Cachegrind profiles indicate that the Rust compiler often spends 3-6% of its executed instructions within memcpy (specifically __memcpy_avx_unaligned_erms on my Linux box), which is pretty incredible.

I have modified DHAT to track memcpy/memmove calls and have discovered that a lot are caused by obligation types, such as PendingPredicateObligations and PendingObligations, which are quite large (160 bytes and 136 bytes respectively on my Linux64 machine).

For example, for the keccak benchmark, 33% of the copied bytes occur in the swap call in the compress function:

// Now move all popped nodes to the end. Try to keep the order.
//
// LOOP INVARIANT:
//     self.nodes[0..i - dead_nodes] are the first remaining nodes
//     self.nodes[i - dead_nodes..i] are all dead
//     self.nodes[i..] are unchanged
for i in 0..self.nodes.len() {
    match self.nodes[i].state.get() {
        NodeState::Pending | NodeState::Waiting => {
            if dead_nodes > 0 {
                self.nodes.swap(i, i - dead_nodes);
                node_rewrites[i] -= dead_nodes;
            }
        }
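For scale, here is a minimal standalone sketch (not rustc code) of why that swap is so costly: slice::swap on a large element type copies size_of::<T>() bytes in each direction, so the compaction loop moves a lot of data even though it only reorders entries. BigNode below is a hypothetical stand-in, sized like one of these obligation nodes:

use std::mem::size_of;

// Hypothetical stand-in roughly the size of PendingPredicateObligation.
struct BigNode {
    payload: [u64; 20], // 160 bytes
}

fn main() {
    let mut nodes: Vec<BigNode> =
        (0..4).map(|_| BigNode { payload: [0; 20] }).collect();
    assert_eq!(size_of::<BigNode>(), 160);
    // Each swap copies size_of::<BigNode>() bytes per element involved,
    // which is where the memcpy traffic in the compaction loop comes from.
    nodes.swap(3, 1);
}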

For serde, 11% of the copied bytes occur constructing this vector of obligations:

rust/src/librustc/ty/wf.rs, lines 150 to 157 at a6624ed:

self.out.iter()
    .inspect(|pred| assert!(!pred.has_escaping_bound_vars()))
    .flat_map(|pred| {
        let mut selcx = traits::SelectionContext::new(infcx);
        let pred = traits::normalize(&mut selcx, param_env, cause.clone(), pred);
        once(pred.value).chain(pred.obligations)
    })
    .collect()

and 5% occur appending to this vector of obligations:
obligations.push(get_paranoid_cache_value_obligation(infcx,
                                                     param_env,
                                                     projection_ty,
                                                     cause,
                                                     depth));

It also looks like some functions such as FulfillmentContext::register_predicate_obligation() might be passed a PredicateObligation by value (using a memcpy) rather than by reference, though I'm not sure about that.

I have some ideas to shrink these types a little, and improve how they're used, but these changes will be tinkering around the edges. It's possible that more fundamental changes to how the obligation system works could elicit bigger wins.
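To show the shape of one such shrinking idea, here is a hedged sketch with hypothetical types (Code, FatObligation, and SlimObligation are stand-ins, not rustc types): boxing a large, rarely-touched field means the value that actually gets moved around is much smaller.

use std::mem::size_of;

struct Code {
    details: [u64; 12], // 96 bytes of rarely-needed detail
}

struct FatObligation {
    code: Code, // moved by value everywhere the obligation moves
    recursion_depth: usize,
}

struct SlimObligation {
    code: Box<Code>, // moves as 8 bytes; the detail stays put on the heap
    recursion_depth: usize,
}

fn main() {
    println!(
        "fat = {} bytes, slim = {} bytes",
        size_of::<FatObligation>(), // 104
        size_of::<SlimObligation>() // 16
    );
    assert!(size_of::<SlimObligation>() < size_of::<FatObligation>());
}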

@nnethercote (Contributor, Author)

#64302 shrinks the size of some of these types by 24 bytes. For a clean check build of serde, it almost halves the number of instructions executed in memcpy. AFAICT, the effect is so big because some of the types (e.g. PredicateObligation) shrink enough that the compiler generates code to copy them instead of calling memcpy.

@Centril added the I-slow, T-compiler, and I-compiletime labels and removed I-slow on Sep 9, 2019.
@nnethercote changed the title from "Many memcpys of obligations, which are large" to "Too many memcpys" on Sep 11, 2019.
@nnethercote (Contributor, Author)

I'm going to modify this issue to be about all types that cause many calls to memcpy. These types are all > 128 bytes in size.

DiagnosticBuilder is another one (and consequently PResult). #64374 deals with that.

@nnethercote (Contributor, Author)

SubregionOrigin is another one, #64394 improves it.

@nnethercote (Contributor, Author)

nnethercote commented Sep 12, 2019

This is an interesting case:

let mut queue: SmallVec<[_; 4]> =
    smallvec![self.tcx.predicate_for_trait_def(self.fcx.param_env,
                                                cause,
                                                coerce_unsized_did,
                                                0,
                                                coerce_source,
                                                &[coerce_target.into()])];

DHAT tells me that each time this executes (which is often) we do a memcpy of 552 bytes. Why 552? Each element is a PredicateObligation, which is 136 bytes; 4 * 136 = 544. Add another 8 bytes for the SmallVec length (we use SmallVec's "union" feature, so the length is the only additional field) and you get 552 bytes. So a single memcpy is being used to initialize all 4 elements plus the length, even though only one element is actually initialized. This is surprising to me.

The code for smallvec! is here:
https://github.com/servo/rust-smallvec/blob/3ee6d1e3e5a8b5a6eafa5f396c397d5038a9c97e/lib.rs#L111-L128

It calls SmallVec::new() and then pushes the single element. Perhaps the memcpy occurs in the new call:
https://github.com/servo/rust-smallvec/blob/3ee6d1e3e5a8b5a6eafa5f396c397d5038a9c97e/lib.rs#L409

@SimonSapin: does this make sense to you? Is there a way to only initialize 136*1 + 8 = 144 bytes, instead of 136*4 + 8 = 552 bytes?

(BTW, #64302 shrinks PredicateObligation from 136 bytes to 112 bytes, so the memcpy size here will also shrink from 552 to 456. But 456 is still high.)
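The arithmetic can be reproduced outside rustc with this hedged sketch (it assumes the smallvec crate with its "union" feature enabled; Pred is a hypothetical 136-byte stand-in for PredicateObligation): the inline variant is [T; 4] plus a usize length, and the whole value is moved as one block.

use smallvec::{smallvec, SmallVec};
use std::mem::size_of;

// Hypothetical 136-byte stand-in for PredicateObligation.
struct Pred {
    bytes: [u8; 136],
}

fn main() {
    let queue: SmallVec<[Pred; 4]> = smallvec![Pred { bytes: [0; 136] }];
    // With the "union" layout this is 4 * 136 + 8 = 552; without that
    // feature (or in other smallvec versions) the exact number differs.
    println!(
        "len = {}, size_of::<SmallVec<[Pred; 4]>>() = {}",
        queue.len(),
        size_of::<SmallVec<[Pred; 4]>>()
    );
}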

@jamesmunns (Member)

jamesmunns commented Oct 12, 2019

@nnethercote does your profiling show "returning structs by value" to be a significant amount of time spent in memcpy? Since Rust doesn't have guaranteed RVO/NRVO (see #32966 for details), I've been trying to get a feel for what kind of performance impact implementing this in MIR would cause.

This could perhaps be the cause of your mystery memcpy in #64301 (comment): smallvec! could be returning the struct by value (at 500-ish bytes it may be too large for LLVM's RVO pass to catch reliably), but I have no data to back that up.
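For what it's worth, the concern looks like this as a hedged standalone sketch (Big and make_big are hypothetical): without guaranteed RVO/NRVO, the return move of a roughly 512-byte struct may be compiled as a memcpy from the callee's frame into the caller's.

struct Big {
    data: [u64; 64], // 512 bytes
}

fn make_big() -> Big {
    let mut b = Big { data: [0; 64] };
    b.data[0] = 1;
    // Whether this return is elided or copied depends on the optimizer;
    // when it is not elided, the move shows up as a memcpy in profiles.
    b
}

fn main() {
    let b = make_big();
    assert_eq!(b.data[0], 1);
}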

@nnethercote (Contributor, Author)

> @nnethercote does your profiling show "returning structs by value" to be a significant amount of time spent in memcpy?

#64374 was one such case. In general, though, it doesn't seem like a big fraction of the memcpy calls. I do wonder about return values that are less than 128 bytes.

@lnicola (Member)

lnicola commented Oct 12, 2019

Referencing https://github.com/kvark/copyless here, since it can be used to avoid some extra copies.
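The copy it targets is easy to see in isolation; here is a hedged sketch with a hypothetical Big type: Box::new(make_big()) may build the value on the stack and then memcpy it into the freshly allocated heap block instead of writing it in place, which is the pattern copyless tries to avoid.

struct Big {
    data: [u64; 64], // 512 bytes
}

fn make_big() -> Big {
    Big { data: [7; 64] }
}

fn main() {
    // The temporary returned by make_big() is moved into the box; when the
    // compiler does not construct it in place, that move is a memcpy.
    let boxed: Box<Big> = Box::new(make_big());
    assert_eq!(boxed.data[0], 7);
}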

@SimonSapin (Contributor)

A single 544-byte memcpy sounds to me like the SmallVec is being moved after it was initialized, and yes, missing RVO sounds like the most likely cause.

This is not the first time I've heard of accidentally moving a large buffer with SmallVec; I feel this is a design flaw of the library. I've been pondering an alternative design (perhaps for a separate crate) where the inline buffer is borrowed:

let mut buffer = std::mem::MaybeUninit::<[Foo; N]>::uninit();
let queue = SmallSmallVec::new(&mut buffer);

This gives a size_of similar to that of Cow<[T]>, so moving that value around is less of a problem. The downside is that the lifetime parameter makes this much less flexible to use: it’s mostly for cases where a SmallVec would stay on the stack.
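A minimal sketch of what such a borrowed-buffer type could look like (SmallSmallVec and its API are hypothetical, not an existing crate; a real version would also need heap spill-over and Drop handling):

use std::mem::MaybeUninit;

struct SmallSmallVec<'a, T> {
    buffer: &'a mut [MaybeUninit<T>],
    len: usize,
}

impl<'a, T> SmallSmallVec<'a, T> {
    fn new(buffer: &'a mut [MaybeUninit<T>]) -> Self {
        SmallSmallVec { buffer, len: 0 }
    }

    fn push(&mut self, value: T) {
        // A real design would spill to the heap here instead of panicking.
        assert!(self.len < self.buffer.len(), "inline capacity exhausted");
        self.buffer[self.len] = MaybeUninit::new(value);
        self.len += 1;
    }

    fn as_slice(&self) -> &[T] {
        // SAFETY: the first `len` slots were initialized by `push`.
        unsafe { std::slice::from_raw_parts(self.buffer.as_ptr() as *const T, self.len) }
    }
}

fn main() {
    // The caller owns the inline storage, so moving the handle only moves
    // a (pointer, length, len) triple, similar in size to Cow<[T]>.
    let mut buffer: [MaybeUninit<u64>; 4] = unsafe { MaybeUninit::uninit().assume_init() };
    let mut queue = SmallSmallVec::new(&mut buffer);
    queue.push(42);
    assert_eq!(queue.as_slice(), &[42]);
}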

@nnethercote (Contributor, Author)

#67250 is another tiny win from memcpy profiling.

@nnethercote (Contributor, Author)

#67340 is another small win from memcpy profiling.

@nnethercote (Contributor, Author)

At this point I have eliminated most of the worst cases, and even those produced only small wins. I will keep an eye out for additional cases in the future, but I don't think we need an issue open for it any more.
