Nested calls make vek unusable in debug builds for real-time applications. #53
Hi,

It's fine; when put this way, I must agree that it is quite surprising. It looks a bit like those bloated Java enterprise frameworks with crazy call stacks for the simplest tasks. It is also true that I didn't profile the library; I focused on features first, occasionally using the Godbolt compiler explorer on small snippets to confirm that the optimized release assembly was as expected.

First of all, I ought to get rid of ... Second, but this is the tricky part, there appears to be overhead due to the ... The ... It is definitely possible to avoid using the ...

Which optimization settings are you using for profiling? Not to doubt your benchmark, I am just asking because, as mentioned, the calls do not seem to be inlined as intended.
Sorry, I didn't see that you mentioned Debug builds in the issue's title. Unfortunately, from what I gather, a lot of Rust libraries have this problem. I do not remember where I saw this; it was some post by tomaka on the Rust forums, or on Twitter. The situation appears to be basically this:
I've never got to that stage, i.e. I've never encountered a situation where something in the Debug build was way too slow. But I'm not that surprised that it happens, given the state of the language. I figure that if we want to "fix" the dot product, then we might as well fix the whole language; like "stop using iterators and go back to plain old pointer offsets", or, if it is possible, tell Rust that these small calls should always be inlined, even in Debug.
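As an aside, a minimal sketch of that last idea (this is an illustration, not vek's actual code): `#[inline(always)]` maps to LLVM's `alwaysinline` attribute, which is generally honored even at opt-level 0, so the call overhead disappears in Debug too.

```rust
// Hypothetical illustration, not vek's implementation.
#[derive(Copy, Clone)]
pub struct Vec2 {
    pub x: f32,
    pub y: f32,
}

impl Vec2 {
    // `#[inline(always)]` asks LLVM to inline this even in unoptimized builds.
    #[inline(always)]
    pub fn dot(self, rhs: Vec2) -> f32 {
        // Written out directly, with no iterator or helper-call chain.
        self.x * rhs.x + self.y * rhs.y
    }
}

fn main() {
    let (a, b) = (Vec2 { x: 1.0, y: 2.0 }, Vec2 { x: 3.0, y: 4.0 });
    assert_eq!(a.dot(b), 11.0);
}
```

Whether it is desirable to force-inline every small function is another matter; this is only meant to show the kind of hint the compiler would need.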
I think this is the way one should solve this: just override the optimization level of vek in your own Cargo.toml. We can't (easily) get rid of the overhead incurred by nested calls in debug builds; I mean, ...

Please do let me know if this solves your problem.
I'll also mention this here: the repository includes an example to inspect the generated optimized assembly for a snippet (documented in the source).

EDIT: Results with:

```rust
#[no_mangle] pub fn v2_f32_dot_reprc (a: Vec2<f32>, b: Vec2<f32>) -> f32 { a.dot(b) }
#[no_mangle] pub fn v2_f32_dot_simd (a: repr_simd::Vec2<f32>, b: repr_simd::Vec2<f32>) -> f32 { a.dot(b) }
#[no_mangle] pub fn v2_f32_dot_reprc_handmade (a: Vec2<f32>, b: Vec2<f32>) -> f32 { a.x * b.x + a.y * b.y }
```

```asm
v2_f32_dot_reprc:
mulss xmm0, xmm2
mulss xmm1, xmm3
xorps xmm2, xmm2
addss xmm0, xmm2
addss xmm0, xmm1
ret
v2_f32_dot_simd:
movss xmm1, dword ptr [rcx]
movss xmm2, dword ptr [rcx + 4]
mulss xmm1, dword ptr [rdx]
mulss xmm2, dword ptr [rdx + 4]
xorps xmm0, xmm0
addss xmm0, xmm1
addss xmm0, xmm2
ret
v2_f32_dot_reprc_handmade:
mulss xmm0, xmm2
mulss xmm1, xmm3
addss xmm0, xmm1
ret
```

To be honest, it hurts to look at. Even in release mode, the "handmade" version beats the current implementation.

Re-EDIT (sorry, I'm re-discovering stuff as I go): it looks like the compiler is messing with calling conventions. If we enforce the "C" calling convention like so:

```rust
#[no_mangle] pub extern "C" fn v2_f32_dot_reprc (a: Vec2<f32>, b: Vec2<f32>) -> f32 { a.dot(b) }
#[no_mangle] pub extern "C" fn v2_f32_dot_simd (a: repr_simd::Vec2<f32>, b: repr_simd::Vec2<f32>) -> f32 { a.dot(b) }
#[no_mangle] pub extern "C" fn v2_f32_dot_reprc_handmade (a: Vec2<f32>, b: Vec2<f32>) -> f32 { a.x * b.x + a.y * b.y }
```

then we get these fellows:

```asm
v2_f32_dot_reprc:
movd xmm0, ecx
shr rcx, 32
movd xmm1, ecx
movd xmm2, edx
mulss xmm2, xmm0
shr rdx, 32
movd xmm3, edx
mulss xmm3, xmm1
xorps xmm0, xmm0
addss xmm0, xmm2
addss xmm0, xmm3
ret
v2_f32_dot_simd:
movaps xmm0, xmmword ptr [rcx]
mulps xmm0, xmmword ptr [rdx]
xorps xmm1, xmm1
addss xmm1, xmm0
shufps xmm0, xmm0, 229
addss xmm0, xmm1
ret
v2_f32_dot_reprc_handmade:
movd xmm0, ecx
shr rcx, 32
movd xmm1, ecx
movd xmm2, edx
mulss xmm2, xmm0
shr rdx, 32
movd xmm0, edx
mulss xmm0, xmm1
addss xmm0, xmm2
ret
```

What I make of it is that the generated assembly will vary greatly depending on what the compiler can assume, and I think we can trust its decisions. It remains true that I should put some effort into optimization.
I am sorry to cause so much concern; the issue was written out of frustration and intends no harm. I have now measured the exact time difference between different builds with my workload (if you are curious, it is this loop: https://github.com/koalefant/circle2d/blob/911399563bf1c893ff49c9d2a56e1865a927f1ea/examples/basic.rs#L537, which generates the background image in https://koalefant.github.io/circle2d/):

Debug - 4794.9 ms

Deps are optimized with these options:
I realize that it may be a bit too much to demand comparable performance in Debug, but for interactive applications a 100x difference pretty much robs me of Debug builds. If I target the game at 60 FPS and I am CPU-bound, a 100x slowdown brings me way below 1 FPS. It is no longer interactive and I can't use it for testing. You may argue that opt-level = 1 would improve the situation, but the longer compilation hurts iteration times. Currently I end up disabling debug symbols for the debug build just so my build times are lower. 😞 I realize that this situation is common for other libraries, but I hope we can do better than 100x.
Seeing the code in context does indeed help. I guess you have considered that option already, but I would try something like:

```toml
[profile.dev.package.vek] # Note, no wildcard
opt-level = 3
debug = false
# How that would interact with the existing wildcard, I'm not sure, but it can be figured out
```

Because the timings you obtained lead me to believe that "optimized deps" had little to no effect. Isn't there an option somewhere you could have missed? I suspect
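On the wildcard question: a more specific package entry takes precedence over `"*"`, so the two can coexist. A sketch (the wildcard values here are only placeholders):

```toml
# The named entry wins for vek; every other dependency keeps the wildcard settings.
# Values are illustrative, not a recommendation.
[profile.dev.package."*"]
opt-level = 1

[profile.dev.package.vek]
opt-level = 3
debug = false
```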
I believe you meant
I would suggest the following: In the meantime, I'll see what I can do. Since replacing ... I guess you could also clone the repository locally.
I have published version 0.11, which should fix your issue. This time, the dot product no longer goes through a chain of nested calls. A lot of similar functions (anything that looks like a reduce operation) have benefited from a similar optimization. As a result, they are definitely as fast as possible (at least for ...).

Please let me know if that helped.
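For illustration, the kind of rewrite described above looks roughly like this (a sketch, not vek's actual source):

```rust
// Hypothetical before/after, not vek's code.
#[derive(Copy, Clone)]
pub struct Vec2<T> { pub x: T, pub y: T }

impl Vec2<f32> {
    // Reduce-style: every helper in the chain is a separate, non-inlined call in Debug.
    pub fn dot_via_reduce(self, rhs: Self) -> f32 {
        [self.x * rhs.x, self.y * rhs.y].iter().sum()
    }

    // Written out directly: nothing left to inline, so Debug pays no call overhead.
    pub fn dot_direct(self, rhs: Self) -> f32 {
        self.x * rhs.x + self.y * rhs.y
    }
}

fn main() {
    let (a, b) = (Vec2 { x: 1.0, y: 2.0 }, Vec2 { x: 3.0, y: 4.0 });
    assert_eq!(a.dot_via_reduce(b), a.dot_direct(b));
}
```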
Hi, thanks for the quick update! I have tried implementing a manual dot/magnitude; here is what I got: a 3x time improvement in 0.11.0 for Debug! Not bad!
It does not seem to have a measurable effect, though. Perhaps the inlined code gets monomorphized in my crate? Not sure.
That would make sense... It's a bit of a pity in any case. One thing that could still be done, I guess (but is less than desirable), is to split your own package into two crates, one being the "math core" (or whatever) which does all the intensive operations and does not really need debug information (which you would build with optimizations). I could see that working, because the "math core" code is what instantiates vek's generics. Typically the whole ... It's definitely a possibility.

Where I work, the physics engine and similar math-intensive projects are always compiled in release mode; that means we can't debug them, but OTOH if we don't do that, the games are simply unplayable in debug mode. We've come to accept it as a fact of life, and it's not even Rust, it's C++.

In any case, I see no way to further optimize this on vek's side. Next you'll definitely have to consider other "standard" optimizations, like caching results (and not recomputing what hasn't changed), optimizing signed-distance queries with a BVH (for instance), parallelizing work across multiple threads, or even moving to fragment/compute shaders. I mean, of course it's better when the math library is fast to begin with :) but do be aware of these options.
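As a sketch of how that split could be wired up in Cargo (the crate names are hypothetical, and this assumes per-package profile overrides apply to your own workspace members):

```toml
# Hypothetical workspace: `game_core` holds the math-heavy code that
# instantiates vek's generics; `game` keeps the default dev settings
# and remains fully debuggable.
[profile.dev.package.game_core]
opt-level = 3
debug = false

[profile.dev.package.vek]
opt-level = 3
```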
I already do such splitting, but I always end up with code that uses a lot of math and still needs to be debuggable. The initialization code in itself is not a problem; it is more of an illustration of other performance degradation that is harder to demonstrate. Thank you for your time!
Anytime. I'll close the issue now, but feel free to add to the discussion as you see fit.
Hi, sorry for the overly negative tone, but I am profiling some code using vek and I can't believe my eyes:
Are there really 11 nested calls in a Vec2::dot product, which could be
```rust
a.x * b.x + a.y * b.y
```
?