-
Notifications
You must be signed in to change notification settings - Fork 29.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
src: fix build failure for clang 3.2; consolidate byte swapping code; fix buffer writes for unaligned ucs2 strings #7645
Conversation
Couldn't get this to build on OSX 10.8 and Clang 3.2 (commit cc11a96)
EDIT: I ran |
// http://nodejs.org/api/buffer.html regarding Node's "ucs2" | ||
// encoding specification | ||
if (IsBigEndian()) | ||
SwapBytes16(const_cast<char*>(reinterpret_cast<const char*>(buf)), buflen); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In-place modification of a const buffer is definitely not allowed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@bnoordhuis Does that mean this isn't possible?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, I just botched and misread the pointer reassignment in the original code. Will fix. Thanks for catching @bnoordhuis.
Sorry, turns out clang also defines |
@zbjornson First off, thanks for taking the time to work on this. I don't think this PR is the way to go but that doesn't mean I don't appreciate you working on it. Knowing what I know of clang and gcc (and some quick tests confirm that), I don't think it's the use of intrinsics/builtins that is responsible for the performance improvement, it's that the compiler can:
If I rewrite #7644 to hint the same conditions to the compiler, I get comparable numbers on (I still need to look into the unaligned case. Perhaps there is something we can tweak there as well.) |
Interesting, okay. Keen to try out your reworked version. When I played with similar code, the compilers would only emit PSHUFB and similar with -O3, whereas the intrinsics always mapped correctly. |
I’d expect that to noticeably outperform the |
@addaleax err, I meant to say that this type of thing: char a = data[0];
data[0] = data[1];
data[1] = a; mapped to PSHUFB only with -O3, whereas the builtins reliably mapped to PSHUFB without -O3 (and afaik node is compiled with -Os and /Od). |
@zbjornson Ah – Either way, be aware that, without CPU detection or extra compiler flags,
Pretty sure it’s -O3 by default: Line 103 in fcae5e2
|
@addaleax ahh I'd been trying to find that info about whether or not AVX/SSE extensions will be used in release for quite some time, thanks! If that's the case, then it makes sense to use the smaller code from #7644. (I'd been benchmarking on Windows.) You're right on -O3. (The first line in #7645 (comment) has -Os for some reason, but a normal release build shows -O3.) |
@zbjornson This PR is now building cleanly on my 10.8 machine (d6147aef8e2e9511149cba5809e330fd318e6e38) |
@addaleax ... any further thoughts on this one? |
Note that I want to replace the I think that because this is substantially faster on at least Windows, and potentially on custom builds on other platforms (to be tested), is a good reason to pursue this PR. I don't see any downsides at least. I can make the above change and do more benchmarking this week. |
d6147ae
to
bfd3f49
Compare
c2e098d
to
4a0a1d4
Compare
@bnoordhuis So does this updated PR seem better to you? ref: #7618 (comment) |
// encoding specification | ||
dst.resize(buflen); | ||
SwapBytes(&dst[0], buf, buflen); | ||
std::vector<uint16_t> dst (buf, buf + buflen); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: no space before (
.
dst.resize(buflen); | ||
SwapBytes(&dst[0], buf, buflen); | ||
std::vector<uint16_t> dst (buf, buf + buflen); | ||
size_t nbytes = buflen * sizeof(uint16_t); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Consider writing it as sizeof(dst[0])
.
#define SWAP(a, b) \ | ||
tmp = data[a]; \ | ||
data[a] = data[b]; \ | ||
data[b] = tmp; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this actually faster than std::swap or an inline function?
int align = reinterpret_cast<uintptr_t>(data) % sizeof(uint16_t); | ||
|
||
if (align == 0) { | ||
uint16_t* data16 = reinterpret_cast<uint16_t*>(data); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a strict-aliasing violation, strictly speaking. My PR operates on just char for that reason.
4a0a1d4
to
72fcee9
Compare
Thanks for your review @bnoordhuis and again sorry that this small thing has dragged on. Revisions submitted. As far as the strict aliasing violation you pointed out: I admit that this confused me. When I've asked about reading from char buffers in particular, I get answers along the lines of "it's not UB or a SA violation," "it looks like one but it isn't because you assume that the original pointer is a uint16_t" or "it is but everyone does it" (which seems to be the most correct of the three) (see ref1, ref2 (and later replies)). Note that the violation already existed in string_bytes.cc and this PR just relocates it: As far as avoiding the violation:
Thus, in the latest revision, I use That yields these benchmarks, which are about as good as they get (with node's default build config) aside from aligned 16 on linux (per above): |
72fcee9
to
6853f2b
Compare
It is and that is why node.js is currently built with
That is not what the spec says and not how gcc and clang operate when Think of it like this: when strict aliasing is in effect, and when there are no With (Incidentally, that is why the return values of malloc, calloc and operator new are marked as noalias. They logically can't alias other pointers but if the compiler didn't know that, it would have to assume the worst and emit significantly worse code around dynamic allocations. Apparently there are systems that violate that assumption because clang++ has a |
if (align == 0) { | ||
// MSVC has no strict aliasing, and is able to highly optimize this case. | ||
uint16_t* data16 = reinterpret_cast<uint16_t*>(data); | ||
size_t len16 = nbytes / sizeof(uint16_t); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you write this as sizeof(*data16)
? EDIT: Here and in the other functions.
memcpy(&temp, &data[i], sizeof(uint16_t)); | ||
temp = BSWAP_2(temp); | ||
memcpy(&data[i], &temp, sizeof(uint16_t)); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you use sizeof(temp)
in this block and the blocks below? LGTM apart from that.
Reviewers: fyi, the one change since Ben gave his LGTM was moving |
LGTM, thanks for @mentioning me! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Still LGTM but the commit logs should conform to the style guide.
Did you check what code paths clang and gcc emit? In the other PR, I had to prove that src == dst and had proper alignment before the compiler generated the fast case.
b1b1efb
to
c8281e3
Compare
Fixed commit text.
Surprisingly it doesn't appear necessary to give those hints to the compilers in this incarnation for clang or gcc to emit |
Is this CI situation normal or am I cursed?! |
Ahem, yes, let’s give this another shot: https://ci.nodejs.org/job/node-test-commit/5328/ (It is, unfortunately, normal and the act of complaining about the general brokenness of CI is part of what forms the common core collaborator identity. Welcome to the club! :b) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
More CI attempts needed to land? |
@zbjornson started new CI run: https://ci.nodejs.org/job/node-test-pull-request/4337/ |
Oh didn't notice that there is a conflict. |
c8281e3
to
65cbc45
Compare
Removes use of builtins that are unavailable for older clang. Per benchmarks, only uses builtins on Windows, where speedup is significant. Fixes: nodejs#7618
Between nodejs#3410 and nodejs#7645, bytes were swapped twice on bigendian platforms if buffer was not two-byte aligned. See comment in nodejs#7645.
65cbc45
to
2b69933
Compare
@lpinca rebased |
🎉 7th time's a charm I guess :) |
Okay, I'm going to start landing this.
|
Removes use of builtins that are unavailable for older clang. Per benchmarks, only uses builtins on Windows, where speedup is significant. Also adds test for unaligned ucs2 buffer write. Between #3410 and #7645, bytes were swapped twice on bigendian platforms if buffer was not two-byte aligned. See comment in #7645. PR-URL: #7645 Fixes: #7618 Reviewed-By: Anna Henningsen <anna@addaleax.net> Reviewed-By: Ben Noordhuis <info@bnoordhuis.nl> Reviewed-By: James M Snell <jasnell@gmail.com>
landed in 7420835 , thanks a lot @zbjornson ! |
Concerning backporting, I guess this should be backported to wherever #7157 was backported, which is v6 but not v4. If anyone disagrees let me know. |
Removes use of builtins that are unavailable for older clang. Per benchmarks, only uses builtins on Windows, where speedup is significant. Also adds test for unaligned ucs2 buffer write. Between #3410 and #7645, bytes were swapped twice on bigendian platforms if buffer was not two-byte aligned. See comment in #7645. PR-URL: #7645 Fixes: #7618 Reviewed-By: Anna Henningsen <anna@addaleax.net> Reviewed-By: Ben Noordhuis <info@bnoordhuis.nl> Reviewed-By: James M Snell <jasnell@gmail.com>
Removes use of builtins that are unavailable for older clang. Per benchmarks, only uses builtins on Windows, where speedup is significant. Also adds test for unaligned ucs2 buffer write. Between #3410 and #7645, bytes were swapped twice on bigendian platforms if buffer was not two-byte aligned. See comment in #7645. PR-URL: #7645 Fixes: #7618 Reviewed-By: Anna Henningsen <anna@addaleax.net> Reviewed-By: Ben Noordhuis <info@bnoordhuis.nl> Reviewed-By: James M Snell <jasnell@gmail.com> Conflicts: src/node_buffer.cc
Fixes #7618
Alternative to #7644 that preserve performance.
Also consolidates all byte-swapping code (following what bnoordhuis did in #7644).