-
Notifications
You must be signed in to change notification settings - Fork 1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
LUCENE-10572: Add support for varhandles in native byte order (still randomized during tests) #888
Conversation
…randomized during tests)
a59e250
to
36de2bb
Compare
Anybody may play and commit ideas to this PR. @rmuir @mikemccand |
I removed the vInt-like encoding in ByteBlockPool and BytesRefHash. After that I was able to switch to native shorts. |
Oooh I missed this @uschindler -- it looks like a nice possible opto for the costly I can benchmark this, but could you maybe modernize it to resolve the conflicts? Thanks! |
Ohhhh, I forgot about this PR. When looking at the conflicts it looks like I need to redo at least the BytesRefHash/Pool code. We can use native order at all places where it is only used in memory and not persisted to disk (BytesRefHash) and where it does not matter (LZ4). |
c1b626c
to
36de2bb
Compare
Hi @mikemccand, In fact, on x86 machines it makes no sense to benchmark it, as the LE byte order is already native :-) This PR only helps with architectures like s390x that have big endian, as the internals of BytesRefHash would never make it into a file format so they can encode their "private data" in native endianness. We still randomize the endianness on testing, so we make sure both variants work. |
@mikemccand, It now uses BytesRefBlockPool, but this one uses BIG ENDIAN byte order (for whatever reason). As I no longer know whcih of those ByteFoobar classes in Util are used internally and not serialized to disk and which ones are serialized to disk I won't change anything for now. So I restored the PR into the "known state" (it adds native varhandles) and changes LZ4 compression to use the native order (which is documented by LZ4 to not matter). So this one only improves LZ4. I have no time to look into the changes in BytesRefHash, so I give it to you to figure out where it is "ok" to change from fixed BE or LE order to native order, but care must be taken that those byte arrays are never persisted/serialized onto index file formats, |
@mikemccand: If you want to see the changes I reverted, see the above comparison: https://github.com/apache/lucene/compare/36de2bb7fa7a0587a102cf5c4d35ac8f94976bbd..c1b626c0636821f4d7c085895359489e7dfa330f Those changes need to be re-applied to the repo in correct files (not sure where this code now lives, looks like BytesRefBlockPool, but no idea, sorry) |
I think I know after looking into those changes what the problem was. Internally BytesRefHash uses BIG ENDIAN, because some parts in the byte array are "UTF-8 like" encoded (if highest bit is set another byte follows). As this is stupid to do and requires only a few bytes more storage, I removed that encoding to always use shorts instead of "byte or BE short". When the encoding no longer matters and must not be "UTF-8 encoding like", it can use native order. But for safety you could also use LE encoding to make use of actual CPUs (ARM is also LE now). So we have 2 posisbilities:
In addition, I think the |
@uschindler <https://github.com/uschindler> pushed 0 commits.
Huh, how do you do that?
Mike McCandless
http://blog.mikemccandless.com
…On Fri, Nov 3, 2023 at 10:42 AM Uwe Schindler ***@***.***> wrote:
@mikemccand <https://github.com/mikemccand>: If you want to see the
changes I reverted, see the above comparison:
https://github.com/apache/lucene/compare/36de2bb7fa7a0587a102cf5c4d35ac8f94976bbd..c1b626c0636821f4d7c085895359489e7dfa330f
Those changes need to be re-applied to the repo in correct files (not sure
where this code now lives, looks like BytesRefBlockPool, but no idea, sorry)
I think I know after looking into those changes what the problem was.
Internally BytesRefHash uses BIG ENDIAN, because some parts in the byte
array are "UTF-8 like" encoded (if highest bit is set another byte
follows). As this is stupid to do and requires only a few bytes more
storage, I removed that encoding to always use shorts instead of "byte or
BE short". When the encoding no longer matters and must not be "UTF-8
encoding like", it can use native order. But for safety you could also use
LE encoding to make use of actual CPUs (ARM is also LE now).
So we have 2 posisbilities:
- Change the internal encoding of bytesrefhash and remove the Big
Endian UTF-8 like encoding (or call it vShort) and switch to Little Endian
shorts
- Use native encoding to also help CPUs like s390 and use native
encoding (which also works). This PR supports this, but it is questionable
for the reasons Robert said.
—
Reply to this email directly, view it on GitHub
<#888 (comment)>, or
unsubscribe
<https://github.com/notifications/unsubscribe-auth/AAGCOXAUIXXARYWAF4PRGQLYCT7GXAVCNFSM5V4VYZVKU5DIOJSWCZC7NNSXTN2JONZXKZKDN5WW2ZLOOQ5TCNZZGI2TOMBUHAZA>
.
You are receiving this because you were mentioned.Message ID:
***@***.***>
|
Thanks @uschindler! Removing vShort and switching to LE (or native -- I didn't understand the problem with that -- this is never (directly) serialized to a Lucene index) short seems good? I guess we lose a bit of RAM efficiency, sometimes taking two bytes instead of one. But we get faster CPU decode. |
I forgot about this PR, we should really apply it. #13076 is another candidate that could make use of this. |
…randomized during tests) (#888)
see https://issues.apache.org/jira/browse/LUCENE-10572
#11608