LUCENE-10572: Add support for varhandles in native byte order (still randomized during tests) #888

uschindler · 2022-05-13T22:57:36Z

see https://issues.apache.org/jira/browse/LUCENE-10572

#11608


uschindler · 2022-05-13T23:13:50Z

Anybody may play and commit ideas to this PR. @rmuir @mikemccand

uschindler · 2022-05-13T23:34:24Z

I removed the vInt-like encoding in ByteBlockPool and BytesRefHash. After that I was able to switch to native shorts.

mikemccand · 2023-11-02T10:32:10Z

Oooh I missed this @uschindler -- it looks like a nice possible opto for the costly BytesRefHash methods, and it looks like (on the issue) you and @rmuir came to agreement on approach (this PR).

I can benchmark this, but could you maybe modernize it to resolve the conflicts? Thanks!

uschindler · 2023-11-02T11:07:27Z

Ohhhh, I forgot about this PR. When looking at the conflicts it looks like I need to redo at least the BytesRefHash/Pool code.

We can use native order at all places where it is only used in memory and not persisted to disk (BytesRefHash) and where it does not matter (LZ4).
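As a minimal sketch of what a native-order access looks like (the class and method names here are made up for illustration and are not the PR's code), the JDK lets you view a `byte[]` as shorts in the platform's own byte order. This is only safe for data that stays in memory, as described above:

```java
import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;
import java.nio.ByteOrder;

// Hypothetical demo class; not part of Lucene.
final class NativeOrderDemo {
  // Views a byte[] as shorts using the byte order of the CPU the JVM runs on.
  static final VarHandle NATIVE_SHORT =
      MethodHandles.byteArrayViewVarHandle(short[].class, ByteOrder.nativeOrder());

  // Writes a short at the given byte offset and reads it back.
  static short roundTrip(byte[] scratch, int offset, short value) {
    NATIVE_SHORT.set(scratch, offset, value);
    return (short) NATIVE_SHORT.get(scratch, offset);
  }

  public static void main(String[] args) {
    byte[] scratch = new byte[8];
    short result = roundTrip(scratch, 3, (short) 0x1234);
    // The value survives the round trip, but the byte layout inside
    // scratch depends on the platform, so it must never reach a file.
    System.out.println(ByteOrder.nativeOrder() + " -> " + result);
  }
}
```

The value read back is always the value written; only the layout of the two bytes in the array differs between platforms.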

uschindler · 2023-11-03T14:23:33Z

Hi @mikemccand,
I reset the branch to the initial commit (without the BytesRefHash & Co. changes). Then I merged and pushed.
I will now try to redo the changes.

In fact, on x86 machines it makes no sense to benchmark it, as the LE byte order is already native :-) This PR only helps with architectures like s390x that have big endian, as the internals of BytesRefHash would never make it into a file format so they can encode their "private data" in native endianness. We still randomize the endianness on testing, so we make sure both variants work.
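To illustrate the idea of randomizing the byte order under test (a rough sketch only: `pickOrder`, its `testing` flag and its `seed` parameter are hypothetical and do not reflect how Lucene's test framework actually selects the order), code that only round-trips in-memory data should behave the same with either order:

```java
import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;
import java.nio.ByteOrder;
import java.util.Random;

// Hypothetical demo class; not part of Lucene.
final class RandomizedOrderDemo {
  // Assumption: production code passes testing=false and gets the native order;
  // tests pass testing=true and get a seed-dependent order.
  static ByteOrder pickOrder(boolean testing, long seed) {
    if (!testing) {
      return ByteOrder.nativeOrder();
    }
    return new Random(seed).nextBoolean() ? ByteOrder.LITTLE_ENDIAN : ByteOrder.BIG_ENDIAN;
  }

  // Writes an int with the given order and reads it back with the same order.
  static int roundTrip(ByteOrder order, int value) {
    VarHandle vh = MethodHandles.byteArrayViewVarHandle(int[].class, order);
    byte[] buf = new byte[4];
    vh.set(buf, 0, value);
    return (int) vh.get(buf, 0);
  }

  public static void main(String[] args) {
    for (long seed = 0; seed < 4; seed++) {
      ByteOrder order = pickOrder(true, seed);
      System.out.println(order + " -> " + roundTrip(order, 0xCAFEBABE));
    }
  }
}
```

If a round trip ever depended on one particular order, running with the other order would expose it even on a little-endian development machine.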

uschindler · 2023-11-03T14:29:21Z

@mikemccand,
I checked the main branch: it no longer uses any varhandles in BytesRefHash and ByteBlockPool. No idea where the code moved to.

It now uses BytesRefBlockPool, but this one uses BIG ENDIAN byte order (for whatever reason). As I no longer know which of those ByteFoobar classes in Util are used only internally and which ones are serialized to disk, I won't change anything for now.

So I restored the PR into the "known state" (it adds native varhandles) and changes LZ4 compression to use the native order (which is documented by LZ4 to not matter). So this one only improves LZ4.

I have no time to look into the changes in BytesRefHash, so I give it to you to figure out where it is "ok" to change from fixed BE or LE order to native order, but care must be taken that those byte arrays are never persisted/serialized onto index file formats.
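On the LZ4 remark: one way to see why byte order can be irrelevant for this kind of use is that checking whether two 4-byte windows hold the same bytes gives the same answer whichever order the ints are read in. The following is only an illustrative sketch with made-up names, not Lucene's LZ4 code:

```java
import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; not part of Lucene.
final class OrderAgnosticMatchDemo {
  // True if the 4 bytes starting at a equal the 4 bytes starting at b.
  // Both reads use the same order, so the result does not depend on it.
  static boolean sameFourBytes(byte[] data, int a, int b, ByteOrder order) {
    VarHandle vh = MethodHandles.byteArrayViewVarHandle(int[].class, order);
    return (int) vh.get(data, a) == (int) vh.get(data, b);
  }

  public static void main(String[] args) {
    byte[] data = "abcdXabcdY".getBytes(StandardCharsets.US_ASCII);
    for (ByteOrder order : new ByteOrder[] {ByteOrder.LITTLE_ENDIAN, ByteOrder.BIG_ENDIAN}) {
      System.out.println(order + ": match(0,5)=" + sameFourBytes(data, 0, 5, order)
          + " match(0,1)=" + sameFourBytes(data, 0, 1, order));
    }
  }
}
```

The int values themselves differ between the two orders, but nothing here ever stores them, which is the property that makes the native order acceptable.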

uschindler · 2023-11-03T14:33:54Z

@mikemccand: If you want to see the changes I reverted, see the above comparison: https://github.com/apache/lucene/compare/36de2bb7fa7a0587a102cf5c4d35ac8f94976bbd..c1b626c0636821f4d7c085895359489e7dfa330f

Those changes need to be re-applied to the repo in correct files (not sure where this code now lives, looks like BytesRefBlockPool, but no idea, sorry)

uschindler · 2023-11-03T14:42:40Z

I think I know, after looking into those changes, what the problem was. Internally BytesRefHash uses BIG ENDIAN, because some parts in the byte array are "UTF-8 like" encoded (if the highest bit is set, another byte follows). As this is stupid to do and dropping it requires only a few bytes more storage, I removed that encoding to always use shorts instead of "byte or BE short". Once the encoding no longer has to be "UTF-8 like", the byte order no longer matters and it can use native order. But for safety you could also use LE encoding, which is native on current CPUs (ARM is also LE now).

So we have 2 possibilities:

- Change the internal encoding of BytesRefHash: remove the Big Endian UTF-8 like encoding (or call it vShort) and switch to Little Endian shorts.
- Use native encoding, which also works and additionally helps CPUs like s390. This PR supports this, but it is questionable for the reasons Robert said.

In addition, I think the & 0x8000 code everywhere can also be removed as the marker bit is obsolete. I did not try that back at that time.
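To make the two encodings concrete, here is a rough sketch, assuming lengths in the range 0 to 32767; the class and method names are invented for illustration and this is not the BytesRefHash code itself. The variable form needs big endian so that the marker bit lands in the first byte, while the fixed form has no marker and can use any order:

```java
import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;
import java.nio.ByteOrder;

// Hypothetical demo class; not part of Lucene.
final class LengthPrefixDemo {
  static final VarHandle BE_SHORT =
      MethodHandles.byteArrayViewVarHandle(short[].class, ByteOrder.BIG_ENDIAN);
  static final VarHandle NATIVE_SHORT =
      MethodHandles.byteArrayViewVarHandle(short[].class, ByteOrder.nativeOrder());

  // "byte or BE short": one byte below 128, else a BE short with the top bit set.
  // Returns the number of bytes written.
  static int writeVar(byte[] buf, int pos, int length) {
    if (length < 128) {
      buf[pos] = (byte) length;
      return 1;
    }
    BE_SHORT.set(buf, pos, (short) (length | 0x8000));
    return 2;
  }

  // The first byte tells the reader which form follows, hence big endian.
  static int readVar(byte[] buf, int pos) {
    if ((buf[pos] & 0x80) == 0) {
      return buf[pos];
    }
    return ((short) BE_SHORT.get(buf, pos)) & 0x7FFF;
  }

  // Fixed form: always two bytes, no marker bit, any byte order works.
  static int writeFixed(byte[] buf, int pos, int length) {
    NATIVE_SHORT.set(buf, pos, (short) length);
    return 2;
  }

  static int readFixed(byte[] buf, int pos) {
    return ((short) NATIVE_SHORT.get(buf, pos)) & 0xFFFF;
  }

  public static void main(String[] args) {
    byte[] buf = new byte[2];
    System.out.println(writeVar(buf, 0, 5) + " byte(s), read " + readVar(buf, 0));
    System.out.println(writeVar(buf, 0, 300) + " byte(s), read " + readVar(buf, 0));
    System.out.println(writeFixed(buf, 0, 300) + " byte(s), read " + readFixed(buf, 0));
  }
}
```

The fixed form costs one extra byte for short values but removes the branch and the masking on every read.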

mikemccand · 2023-11-03T14:57:39Z

@uschindler pushed 0 commits.

Huh, how do you do that?


mikemccand · 2023-11-03T15:04:45Z

Thanks @uschindler! Removing vShort and switching to LE (or native -- I didn't understand the problem with that -- this is never (directly) serialized to a Lucene index) short seems good? I guess we lose a bit of RAM efficiency, sometimes taking two bytes instead of one. But we get faster CPU decode.

uschindler · 2024-02-05T16:07:22Z

I forgot about this PR, we should really apply it. #13076 is another candidate that could make use of this.


uschindler marked this pull request as draft May 13, 2022 22:57

LUCENE-10572: Add support for varhandles in native byte order (still randomized during tests)

36de2bb

uschindler force-pushed the jira/LUCENE-10572 branch from a59e250 to 36de2bb Compare May 13, 2022 23:04

uschindler force-pushed the jira/LUCENE-10572 branch from c1b626c to 36de2bb Compare November 3, 2023 14:14

Merge branch 'main' into jira/LUCENE-10572

beaa1de

fix typo

797ee37

uschindler mentioned this pull request Feb 5, 2024

Adding binary Hamming distance as similarity option for byte vectors #13076

Closed

uschindler marked this pull request as ready for review February 5, 2024 16:06

uschindler and others added 4 commits February 5, 2024 17:10

Merge branch 'apache:main' into jira/LUCENE-10572

9589da4

Rename constant; make it immune to security manager

a260d26

Add changes entry

b12070e

remove wrong issue number

011c1f7

uschindler added the type:enhancement label Feb 5, 2024

uschindler self-assigned this Feb 5, 2024

uschindler added this to the 9.10.0 milestone Feb 5, 2024

uschindler merged commit 9ab84f4 into apache:main Feb 5, 2024
4 checks passed

uschindler deleted the jira/LUCENE-10572 branch February 5, 2024 17:13

asfgit pushed a commit that referenced this pull request Feb 5, 2024

LUCENE-10572: Add support for varhandles in native byte order (still randomized during tests) (#888)

0f33d86
