Fix CRC16 Hashslot Calculation #399
Not directly related to the PR, but I thought this would be a good place to ask 😄 Is there any background on why CRC16 was chosen as the key slot hash function? Much more effort has been spent in industry on optimizing CRC32 variants. We have CRC32C available as an intrinsic, and there are even faster tricks with SIMD. The intrinsic would probably win against SIMD for smaller data and very likely against the current LUT approach, which will incur L1-L2 misses (would have to measure this). cc @badrishc |
I believe it is done as a precaution, since people expect it to be CRC16 like what Redis is using. |
Ah right, I didn't know it was specced out to Redis. Looking at it now it makes sense as there's 16384 slots. Disregard my message 😄 |
Hi Paulus, can you confirm that the changes you made to CRC logic do not change the hash computation compared to what was there before, so that we retain the compatibility with what Redis uses? |
There was no behaviour change in that minor optimization.
CRC16 Loop body (#198)
Before
; arg1 int -> rdx single-def
; loc0 ushort -> rsi
; loc1 long -> rdi single-def
; OutArgs struct (32) [rsp+0x00] do-not-enreg[XS] addr-exposed "OutgoingArgSpace"
; tmp1 long -> rbx "impSpillLclRefs"
; cse1 long -> rax hoist "CSE #02: aggressive"
; ushort result = 0;
xor esi, esi
; byte* end = data + len;
movsxd rdi, edx
add rdi, rbx
; while (data < end)
cmp rbx, rdi
jae SHORT LOOP_EXIT
; ...
mov LUT_PTR, 0x7FFC59808C18 ; LUT_PTR = 0x7FFC59808C18;
LOOP_ENTRY:
lea rcx, [rbx+0x01] ; rcx = rbx + 1;
mov rbp, rcx ; rbp = rcx;
mov rcx, qword ptr [LUT_PTR] ; rcx = [LUT_PTR];
; tmp = (((result >> 8) ^ *data++) & 0xff)
mov edx, esi ; edx = esi;
sar edx, 8 ; edx >>= 8;
movzx r8, byte ptr [rbx] ; r8 = [rbx]; Load byte from address in rbx into r8
xor edx, r8d ; edx = edx ^ r8d;
movzx rdx, dl ; rdx = dl;
; result = (ushort)(*(LUT_PTR + tmp) ^ (result << 8));
movzx rcx, word ptr [rcx+2*rdx] ; rcx = [rcx + 2*rdx];
shl esi, 8 ; esi <<= 8;
xor ecx, esi ; ecx = ecx ^ esi;
movzx rsi, cx ; rsi = cx;
cmp rbp, rdi
mov rbx, rbp
jb SHORT LOOP_ENTRY
LOOP_EXIT:
; ...
After
; arg1 int -> rdx single-def
; loc0 ushort -> rax
; loc1 long -> rdx single-def
; loc2 long -> rcx
; cse0 long -> r8 hoist "CSE #01: aggressive"
; ushort result = 0;
xor eax, eax
; byte* end = data + len;
movsxd rdx, edx
add rdx, rcx
; while (data < end)
cmp rcx, rdx
jae SHORT EXIT
mov LUT_PTR, 0x2195CA42EB0 ; LUT_PTR = 0x2195CA42EB0;
LOOP_ENTRY: ;; offset=0x0017
; next = data + 1 (the post-incremented pointer, kept in r10)
lea r10, [rcx+0x01]
; nuint index = (nuint)(uint)((result >> 8) ^ *data) & 0xff;
mov r9d, eax ; r9d = eax;
sar r9d, 8 ; r9d >>= 8
movzx rcx, byte ptr [rcx] ; rcx = [rcx]
xor ecx, r9d ; ecx = ecx ^ r9d
movzx rcx, cl ; zero extend
; result = (ushort)(Unsafe.Add(LUT_PTR, rcx) ^ (result << 8));
movzx rcx, word ptr [LUT_PTR+2*rcx] ; rcx = [LUT_PTR + 2*rcx];
shl eax, 8 ; eax <<= 8;
xor eax, ecx ; eax = eax ^ ecx;
movzx rax, ax ; rax = ax;
cmp r10, rdx
mov rcx, r10
jb SHORT LOOP_ENTRY
EXIT:
; ... |
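As a way to check the "no behaviour change" claim independently of the JIT output, the annotated loop can be sketched in Python and compared against a bit-by-bit reference. This is a minimal sketch, not Garnet's code: the function names and the `LUT` construction are assumptions for illustration, and it assumes the CRC-16/XMODEM parameters (polynomial 0x1021, initial value 0) that the Redis cluster specification describes.

```python
POLY = 0x1021  # CRC-16/XMODEM polynomial (assumed, per the Redis cluster spec)


def crc16_bitwise(data: bytes) -> int:
    """Reference implementation: process one bit at a time, no table."""
    crc = 0
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            if crc & 0x8000:
                crc = ((crc << 1) ^ POLY) & 0xFFFF
            else:
                crc = (crc << 1) & 0xFFFF
    return crc


# 256-entry lookup table: entry i is the CRC of the single byte i.
LUT = [crc16_bitwise(bytes([i])) for i in range(256)]


def crc16_table(data: bytes) -> int:
    """Table-driven loop with the same shape as the annotated loop body."""
    result = 0
    for byte in data:
        index = ((result >> 8) ^ byte) & 0xFF
        result = (LUT[index] ^ (result << 8)) & 0xFFFF  # keep 16 bits, like the ushort cast
    return result
```

Comparing both functions over a set of sample keys is one cheap way to catch an accidental change in the hash when the loop is refactored.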
The cluster spec requires us to map keys to the 16384 hash slot space. This is a very common server side operation and has measurable perf impact, so if there is an opportunity to speed it up that would be good. But I think existing clients expect the specific CRC mapping chosen by Redis as they might use this to route keys? (@mgravell can confirm). |
Unrelated, speaking of hash code logic, our server side key hash logic is here: https://github.com/microsoft/garnet/blob/main/libs/storage/Tsavorite/cs/src/core/Utilities/Utility.cs#L179 Any way to improve its speed, while maintaining (or improving) the good hash distribution/spread property, would be interesting as well. |
I guess the issue is that when routing requests, the client sees only the key and it has to perform the hashslot calculation to figure out which shard is responsible for that key. |
Have you verified that clients are doing this CRC computation, or are they just querying some server and using the redirect to determine and cache the mapping? |
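I did not check this at first; I just assumed this is how they would be doing it. However, a quick search reveals that SERedis is doing something like what I said above, because I can see they have their own CRC16 implementation: https://github.com/StackExchange/StackExchange.Redis/blob/61c13c21844ff3e92eb077523dc876688878ba25/src/StackExchange.Redis/ServerSelectionStrategy.cs#L63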
I can 100% confirm that cluster-aware clients try to perform the hash
locally for routing. The alternatives are:
1. perform dumb routing and blindly respond to `-MOVED` (or issue the
command that returns the slot) - adds latency, makes a mess of ordering
2. use a cluster-aware proxy, which just kicks the can up a step to the
proxy, since the proxy still has to route somehow
The client must know the hashing algorithm to route efficiently. At the
moment *only* the predefined crc16 with the specified seed data is
supported. Could another algorithm be negotiated? I mean, maybe, if it is
demonstrably faster, but asking client libraries to support it would be
painfully slow - it basically blocks all clients until they choose to
implement that hash (if they ever do), plus we'd need to come up with a
mechanism to even convey that difference.
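
The local routing described above can be sketched roughly as follows. This is an illustrative Python sketch only, not how any particular client is implemented: the three-node layout, the node addresses, and the function names are hypothetical, hash tags are ignored for brevity, and `binascii.crc_hqx` with an initial value of 0 is used as the CRC-16/XMODEM function.

```python
from binascii import crc_hqx
from bisect import bisect_right

SLOT_COUNT = 16384


def key_slot(key: bytes) -> int:
    """Map a key to a hash slot (hash tags are ignored in this sketch)."""
    return crc_hqx(key, 0) % SLOT_COUNT


# Hypothetical slot layout: each node owns the slots from its start value
# up to (but not including) the next node's start value.
SLOT_STARTS = [0, 5461, 10923]
NODES = ["node-a:6379", "node-b:6379", "node-c:6379"]


def node_for_key(key: bytes) -> str:
    """Pick the node that owns the key's slot without asking any server."""
    slot = key_slot(key)
    return NODES[bisect_right(SLOT_STARTS, slot) - 1]
```

Because the client computes the slot itself, it can send the command to the right node on the first attempt; a `-MOVED` reply is then only needed when its cached slot map is stale.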
|
I think the cost of switching the hash function greatly outweighs any benefit of slightly faster hashing for the cluster slot, given that it's apparently not an implementation detail. Also, I think the naive CRC16 lookup table loop is good enough here, given that the inputs are small.
The server-side key hash, on the other hand, seems to be an implementation detail and could probably be improved, but with good benchmark/profiling data as proof, of course. |
* fix bug and refactor crc16 to HashSlotUtils
* fix migration tests
* add tests for cluster keyslot
* fix formatting errors
This PR addresses the hash slot calculation issue #395 by ensuring that keys that do not contain both an opening and a closing brace (e.g., "Hm{W\x13\x1c") are hashed in their entirety by the CRC16 hash function.
I also added more tests and separated the CRC16 hash slot implementation into its own folder.
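
The intended behaviour can be sketched as below. This is a hedged Python illustration of the hash tag rule as the Redis cluster specification describes it, not the code in this PR: `hash_tag_bytes` and `key_slot` are hypothetical names, and `binascii.crc_hqx` with an initial value of 0 stands in for the CRC16 function.

```python
from binascii import crc_hqx


def hash_tag_bytes(key: bytes) -> bytes:
    """Return the part of the key that gets hashed.

    Only the text between the first '{' and the first '}' after it is used,
    and only when that text is non-empty. Otherwise the whole key is hashed.
    """
    start = key.find(b"{")
    if start == -1:
        return key  # no opening brace
    end = key.find(b"}", start + 1)
    if end == -1 or end == start + 1:
        return key  # no closing brace, or an empty "{}" tag
    return key[start + 1:end]


def key_slot(key: bytes) -> int:
    """Hash slot for a key, honouring hash tags."""
    return crc_hqx(hash_tag_bytes(key), 0) % 16384
```

With this rule, a key such as `Hm{W\x13\x1c` has an opening brace but no closing one, so the full key is hashed, while `{user1000}.following` and `{user1000}.followers` share a slot because only `user1000` is hashed.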