-
Notifications
You must be signed in to change notification settings - Fork 8.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Vectorize ROW initialization #15501
Vectorize ROW initialization #15501
Conversation
6f1e61e
to
9fbe3a6
Compare
This comment has been minimized.
This comment has been minimized.
b1321d2
to
a919562
Compare
This comment has been minimized.
This comment has been minimized.
a919562
to
b5c5804
Compare
// implement Reset() efficiently via SIMD and the latter is used to store the past-the-end offset | ||
// into the `charsBuffer`. Even though the `charsBuffer` could be only `rowWidth` large we need them | ||
// to be the same size so that the SIMD code can process both arrays in the same loop simultaneously. | ||
// This wastes up to 5.8% memory but increases overall scrolling performance by around 40%. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Aye. SIMD is free real estate in our CPUs - Let's use it.
Why does vectorization improve performance printing text so much? I thought that all rows were eagerly initialized before your recent work. Is this literally all from reinitializing individual rows as we recycle/circle them? |
Yes, pretty much. It reduces the row initialization cost from around 80ns per 120 columns down to 5ns. OpenConsole with all these recent changes included processes something around 1.7M rows per second, so that's why it has such a big impact. 1.7M sounds like a lot, but it runs fairly close to spending an entire millisecond just initializing the text buffer on startup, whereas this new code won't even really show up in perf traces anymore. (And that other PR will make it a non-issue.) |
|
||
// Fills _charsBuffer with whitespace and correspondingly _charOffsets | ||
// with successive numbers from 0 to _columnCount+1. | ||
#if defined(TIL_SSE_INTRINSICS) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fwiw nobody ever sets this to true?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That is part of #15498. I can pull that change into this branch and rebase it on main so we can merge it immediately.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd love that! I'm personally OK merging these out of order, even if it means that the numbers in the perf discussion part are incorrect.
} while (chars < charsEndLoop); | ||
} | ||
|
||
_mm256_storeu_si256(reinterpret_cast<__m256i*>(charsEndLoop), whitespace); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
so wait, this will write up to 15 things off the end of the buffer? and there's no risk that this is going to stomp anything important?
Like, if the buffer is 17 columns wide... the char offsets buffer starts at alignment 16 from the end of the chars buffer, and the next ROW starts at alignment 16 from the end of the char offsets buffer?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It writes up to 15 bytes off the end of the buffer, which at a granularity of wchar_t
is up to 7 items. It won't stomp anything due to our alignment guarantees in the buffer, which ensures that all buffers start at a 16-byte aligned offset and end on one. If we ever determine that this alignment is unneeded for our performance goals, there's a few techniques we can use to avoid writing outside of the buffer, the most common being that you write the remaining N items in a simple for loop.
e83ef45
to
1aba49d
Compare
This comment has been minimized.
This comment has been minimized.
1aba49d
to
9deecc9
Compare
This comment has been minimized.
This comment has been minimized.
9deecc9
to
a9763f7
Compare
It was a bit of a messy rebase, but I think it's ready now. |
offsets = _mm_add_epi16(offsets, increment); | ||
chars += 8; | ||
charOffsets += 8; | ||
// If _columnCount is something like 120, the actual backing buffer for charOffsets is 121 items large. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
see, i guess this is the part that scares me. every time we talk about the width of the backing buffers we're like, "YUP it's always +1" when in truth it is up to +16 or +32 or something.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I could add an "at least" in there when I merge main in. (It's only up to +8 btw.)
Performance of printing enwik8.txt at the following block sizes:
4KiB (printf): 51MB/s -> 54MB/s
128KiB (cat): 92MB/s -> 103MB/s
Validation Steps Performed
window sizes as observed under a debugger ✅