A few optimizations for the gcinfodecoder construction #96150
On 64-bit, one native word can act as a "buffer" for quite a few reads when each read takes only a few bits. This change reduces the need for indirect reads from the bitstream and may allow the compiler to enregister the "buffer".
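A minimal sketch of the buffering idea, with hypothetical names (this is not the actual gcinfodecoder code): one native word is read indirectly once and then serves several small reads, so the compiler can keep the buffered word in a register. For brevity, any bits left over at a refill are discarded rather than spliced across the word boundary.

```cpp
#include <cstddef>

struct BitStreamSketch
{
    const size_t* m_pCurrent;  // next word in the encoded stream
    size_t        m_current;   // buffered word; the next unread bit is bit 0
    int           m_bitsLeft;  // unread bits remaining in m_current

    size_t ReadBits(int numBits) // precondition: 0 < numBits < bits-per-word
    {
        if (m_bitsLeft < numBits)
        {
            // refill: one indirect read pays for many small reads
            m_current = *m_pCurrent++;
            m_bitsLeft = (int)(sizeof(size_t) * 8);
        }

        size_t result = m_current & (((size_t)1 << numBits) - 1);
        m_current >>= numBits;  // an immediate shift when numBits is a constant
        m_bitsLeft -= numBits;
        return result;
    }
};
```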
The point of this change is to use a fixed-size shift, which is typically faster than a variable-size shift. The same applies to `Read(int numBits)` when we read a fixed-size nibble.
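A hedged illustration of the fixed-shift point (`ReadNibble` and the buffer parameter are hypothetical): with a compile-time-constant width, the compiler can emit an immediate shift (`shr reg, 4`) instead of a variable CL shift.

```cpp
#include <cstddef>

inline size_t ReadNibble(size_t& buffer)
{
    const int NIBBLE_BITS = 4;                         // fixed, known at compile time
    size_t nibble = buffer & (((size_t)1 << NIBBLE_BITS) - 1);
    buffer >>= NIBBLE_BITS;                            // immediate shift, no CL involved
    return nibble;
}
```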
Yes, but you pay for it with extra memory writes. Is it really a win in the end?
For example, here are timings for 11th-gen Intel from https://www.agner.org/optimize/instruction_tables.pdf (page 350): a constant shift of memory is twice the cost of a variable shift of a register.
I will have to recheck the original codegen, but I think what was happening is that we would do an indirect read and then apply a mask that was constructed via a variable shift of `1`. I guess that was because we need the result in a register and do not want to change the bit stream, and the way `m_pCurrent` and `m_RelPos` were changing did not allow the compiler to hoist/CSE/enregister either the result of the indirect read or the computed mask.
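A sketch of the pattern described above (not the literal original code): each one-bit read performs an indirect load and builds its mask with a variable shift of 1, and because `m_pCurrent` and `m_RelPos` both change as the stream advances, neither the load nor the mask can be hoisted or enregistered across reads.

```cpp
#include <cstddef>

size_t ReadOneBitOldStyle(const size_t*& m_pCurrent, int& m_RelPos)
{
    size_t mask = (size_t)1 << m_RelPos;          // variable (CL) shift builds the mask
    size_t bit  = (*m_pCurrent & mask) ? 1 : 0;   // indirect read, then apply the mask
    if (++m_RelPos == (int)(sizeof(size_t) * 8))  // advance the stream position
    {
        m_RelPos = 0;
        m_pCurrent++;
    }
    return bit;
}
```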
Interestingly, for Zen 3 the table gives no difference whatsoever between the immediate and CL shift versions, yet the profiler finds CL shift operations relatively expensive. I often see that in other code, so I always assumed that variable shifts are somewhat costly. Maybe it is not the instruction itself but the whole sequence of dependent instructions (load 1, load CL, make a mask, do a read, apply the mask), and the sampling profiler just attributes most of that to the shift.
The code does get measurably faster after the change. I had a few other changes that did not improve anything, so I did not include them, but this one definitely helped.
C++ inlines and interleaves statement parts quite aggressively, so it is a bit hard to see what belongs where, but I can see that a few "expensive" CL shifts are gone.
Here is an example of what happened to a loop like the hedged sketch below. It is not an example of expensive/hot code, just something that is easier to read, to see how the codegen changed.

==== original code
(assembly listing omitted)
vs.
==== new
(assembly listing omitted)
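Since the listings themselves are not reproduced here, a hypothetical loop of the general shape under discussion (not the actual gcinfodecoder source) might look like this: with the item width fixed, every shift in the body is an immediate shift, so no CL-based shifts remain for the profiler to flag.

```cpp
#include <cstddef>

// Sum a run of 2-bit items from a buffered word.
// Precondition: count <= bits-per-word / 2.
size_t SumTwoBitItems(size_t buffered, unsigned count)
{
    size_t sum = 0;
    for (unsigned i = 0; i < count; i++)
    {
        sum += buffered & 0x3;  // mask out one 2-bit item
        buffered >>= 2;         // immediate shift: shr reg, 2
    }
    return sum;
}
```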
These are derived by masking the header flags. Masking is cheap, and we may not even be asked for these, so we can do the masking in the accessors.
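A minimal sketch of the accessor approach, with hypothetical flag names (not the actual GcInfoDecoder header layout): store the raw flags once and mask lazily on demand, instead of precomputing every derived value during construction.

```cpp
#include <cstdint>

class HeaderSketch
{
    uint32_t m_headerFlags; // read once from the stream at construction

    enum : uint32_t
    {
        HAS_STACK_BASE_REG   = 0x1, // hypothetical flag bits
        HAS_GENERICS_CONTEXT = 0x2,
    };

public:
    explicit HeaderSketch(uint32_t flags) : m_headerFlags(flags) {}

    // Each accessor masks on demand; callers that never ask pay nothing.
    bool HasStackBaseRegister() const { return (m_headerFlags & HAS_STACK_BASE_REG) != 0; }
    bool HasGenericsContext() const   { return (m_headerFlags & HAS_GENERICS_CONTEXT) != 0; }
};
```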
I believe `__builtin_clz` already encodes the subtraction from `BITS_PER_SIZE_T` within the intrinsic, unlike `_BitScanReverse`.
Right, I've just realized that even though lzcnt is encoded similarly to bsr on x64, the results are offsets from opposite ends.
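A hedged sketch of that asymmetry (the wrapper below is hypothetical): bsr, which `_BitScanReverse` wraps, yields the bit index counted from the least significant end, while `__builtin_clz`/lzcnt counts zeros from the most significant end, so depending on which quantity you want, one of the two forms always needs an adjustment by the word width.

```cpp
#include <cstdint>
#if defined(_MSC_VER)
#include <intrin.h>
#endif

// Index of the highest set bit (precondition: x != 0).
unsigned HighestSetBitIndex(uint32_t x)
{
#if defined(_MSC_VER)
    unsigned long index;
    _BitScanReverse(&index, x);               // bsr: index counted from the LSB end
    return (unsigned)index;
#else
    return 31u - (unsigned)__builtin_clz(x);  // clz counts from the MSB end; convert to an index
#endif
}
```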