[Java] Meta compression #203
@bigteech The JavaScript implementation can start cross-language schema compatibility work after this issue is finished.
chaokunyang added a commit that referenced this issue on May 2, 2024
## What does this PR do?

This PR implements the type meta encoding for Java proposed in #1240. The type meta encoding in the xlang spec proposed in #1413 will be finished in another PR based on this one.

The spec has been updated too. Type meta header:

```
|      8 bytes meta header      | meta size |   variable bytes   |  variable bytes   | variable bytes |
+-------------------------------+-----------|--------------------+-------------------+----------------+
| 7 bytes hash + 1 bytes header | 1~2 bytes | current class meta | parent class meta |      ...       |
```

And the encoding for package/class/field names has been updated to:

```
- Package name encoding(omitted when class is registered):
  - encoding algorithm: `UTF8/ALL_TO_LOWER_SPECIAL/LOWER_UPPER_DIGIT_SPECIAL`
  - Header: `6 bits size | 2 bits encoding flags`. The `6 bits size: 0~63` will be used to indicate size `0~62`,
    the value `63` the size need more byte to read, the encoding will encode `size - 62` as a varint next.
- Class name encoding(omitted when class is registered):
  - encoding algorithm: `UTF8/LOWER_UPPER_DIGIT_SPECIAL/FIRST_TO_LOWER_SPECIAL/ALL_TO_LOWER_SPECIAL`
  - header: `6 bits size | 2 bits encoding flags`. The `6 bits size: 0~63` will be used to indicate size `1~64`,
    the value `63` the size need more byte to read, the encoding will encode `size - 63` as a varint next.
- Field info:
  - header(8 bits): `3 bits size + 2 bits field name encoding + polymorphism flag + nullability flag + ref tracking flag`.
    Users can use annotation to provide those info.
    - 2 bits field name encoding:
      - encoding: `UTF8/ALL_TO_LOWER_SPECIAL/LOWER_UPPER_DIGIT_SPECIAL/TAG_ID`
      - If tag id is used, i.e. field name is written by an unsigned varint tag id. 2 bits encoding will be `11`.
    - size of field name:
      - The `3 bits size: 0~7` will be used to indicate length `1~7`, the value `6` the size read more bytes,
        the encoding will encode `size - 7` as a varint next.
      - If encoding is `TAG_ID`, then num_bytes of field name will be used to store tag id.
  - Field name: If type id is set, type id will be used instead. Otherwise meta string encoding length and data will be written instead.
```

## Meta size

Before this PR:

```java
class org.apache.fury.benchmark.data.MediaContent 78
class org.apache.fury.benchmark.data.Media 208
class org.apache.fury.benchmark.data.Image 114
```

With this PR:

```java
class org.apache.fury.benchmark.data.MediaContent 53
class org.apache.fury.benchmark.data.Media 114
class org.apache.fury.benchmark.data.Image 68
```

The size of the class meta is reduced by roughly 32% to 45% (close to half for `Media`), which is a great gain. The size can be reduced further if we introduce a field name hash, but that is not related to this PR. We can discuss it in another PR.

## Related issues

#1240 #203 #202

## Does this PR introduce any user-facing change?

<!-- If any user-facing interface changes, please [open an issue](https://github.com/apache/incubator-fury/issues/new/choose) describing the need to do so and update the document if necessary. -->

- [ ] Does this PR introduce any public API change?
- [ ] Does this PR introduce any binary protocol compatibility change?

## Benchmark

<!-- When the PR has an impact on performance (if you don't know whether the PR will have an impact on performance, you can submit the PR first, and if it will have impact on performance, the code reviewer will explain it), be sure to attach a benchmark data here. -->
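Returning to the 8-bit field info header in the updated spec above, the packing could be sketched roughly as follows. This is only an illustration: the class `FieldHeaderSketch` and its methods are hypothetical, and the bit order (fields laid out from the most significant bit in the order the spec lists them) is an assumption, since the PR text does not state it. The 3-bit size value is taken as already computed.

```java
/** Hypothetical helper, not part of Fury: packs the 8-bit field info header. */
final class FieldHeaderSketch {
    // Assumed layout, most significant bit first:
    // [size:3][field name encoding:2][polymorphic:1][nullable:1][trackingRef:1]
    static int pack(int sizeBits, int encoding, boolean polymorphic,
                    boolean nullable, boolean trackingRef) {
        if (sizeBits < 0 || sizeBits > 0b111) {
            throw new IllegalArgumentException("size must fit in 3 bits");
        }
        if (encoding < 0 || encoding > 0b11) {
            throw new IllegalArgumentException("encoding must fit in 2 bits");
        }
        return (sizeBits << 5)
            | (encoding << 3)
            | (polymorphic ? 0b100 : 0)
            | (nullable ? 0b010 : 0)
            | (trackingRef ? 0b001 : 0);
    }

    static int sizeBits(int header) { return (header >>> 5) & 0b111; }
    static int encoding(int header) { return (header >>> 3) & 0b11; }
    static boolean polymorphic(int header) { return (header & 0b100) != 0; }
    static boolean nullable(int header) { return (header & 0b010) != 0; }
    static boolean trackingRef(int header) { return (header & 0b001) != 0; }

    public static void main(String[] args) {
        // Encoding `11` is TAG_ID per the spec text; the size bits here are arbitrary.
        int header = pack(2, 0b11, false, true, false);
        System.out.println(Integer.toBinaryString(header));
    }
}
```

All five pieces of per-field information fit in a single byte this way, which is where much of the meta size reduction per field would come from.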
Is your feature request related to a problem? Please describe.
Meta share mode can reduce the meta cost of every serialization. It ensures that multiple objects of the same type write their meta only once, which saves space, and it performs better because the meta binary is written by memory copy.
But meta encoding is currently not compressed, so the space cost is larger in auto meta share mode.
In normal meta share mode, the meta is sent only once to each peer, so the space cost can be ignored. But in auto meta share mode, the meta is sent every time serialization happens.
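To make the "write meta only once" idea concrete, here is a minimal sketch. It is not Fury's implementation: `MetaShareSketch`, the one-byte markers, and the use of the class name as a stand-in for the real class meta are all assumptions made for illustration.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical writer, not a Fury class: emits type meta once per type. */
final class MetaShareSketch {
    private final Map<Class<?>, Integer> writtenTypes = new HashMap<>();
    private final ByteArrayOutputStream out = new ByteArrayOutputStream();

    /** Writes a type reference; the full meta is emitted only on first sight. */
    void writeTypeMeta(Class<?> type) {
        Integer id = writtenTypes.get(type);
        if (id != null) {
            out.write(1);   // assumed marker: meta already sent, id follows
            out.write(id);  // sketch only: assumes fewer than 256 types
            return;
        }
        writtenTypes.put(type, writtenTypes.size());
        // Stand-in for the real class meta: just the UTF-8 class name.
        byte[] meta = type.getName().getBytes(StandardCharsets.UTF_8);
        out.write(0);            // assumed marker: new meta follows
        out.write(meta.length);  // sketch only: assumes meta shorter than 256 bytes
        out.write(meta, 0, meta.length);
    }

    int bytesWritten() {
        return out.size();
    }

    public static void main(String[] args) {
        MetaShareSketch writer = new MetaShareSketch();
        writer.writeTypeMeta(String.class);
        int first = writer.bytesWritten();
        writer.writeTypeMeta(String.class);
        int second = writer.bytesWritten() - first;
        System.out.println("first=" + first + " bytes, second=" + second + " bytes");
    }
}
```

The first write pays for the whole meta and every later write of the same type pays only for a small reference. In auto meta share mode that first-write cost recurs on every serialization, which is why compressing the meta itself matters.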
Meta Compression Proposal
Schema consistent
Class will be encoded as an enumerated string by full class name.
Schema evolution
Class meta format:
Meta header
The meta header is a 64-bit number encoded in little-endian order.
`0b0000~0b1110` are used to record num classes. `0b1111` is reserved to indicate that Fury needs to read more bytes for the length, using Fury unsigned int encoding. If the current class has no parent class, or the parent class has no fields to serialize, or we're in a context that serializes fields of the current class only (`ObjectStreamSerializer#SlotInfo` is an example), num classes will be 1.
`flags + all layers class meta`
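The num classes bits and the little-endian layout of the meta header could be sketched like this. The sketch assumes that the 4-bit count lives in the lowest 4 bits of the 64-bit value; the text above does not say where the bits sit, so treat that placement and the names in `MetaHeaderSketch` as hypothetical. The extra length bytes that follow the `0b1111` escape are not modelled.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

/** Hypothetical helper, not a Fury class: handles the num classes bits of the meta header. */
final class MetaHeaderSketch {
    static final int NUM_CLASSES_MASK = 0b1111;   // assumption: lowest 4 bits
    static final int NUM_CLASSES_ESCAPE = 0b1111; // "read more bytes" marker

    /** Stores numClasses, or the escape value when it does not fit, in the low 4 bits. */
    static long withNumClasses(long header, int numClasses) {
        if (numClasses < 0) {
            throw new IllegalArgumentException("numClasses must be non-negative");
        }
        int bits = Math.min(numClasses, NUM_CLASSES_ESCAPE);
        return (header & ~(long) NUM_CLASSES_MASK) | bits;
    }

    /** True when the reader must fetch the real count from the bytes that follow. */
    static boolean needsExtraLength(long header) {
        return (header & NUM_CLASSES_MASK) == NUM_CLASSES_ESCAPE;
    }

    static int inlineNumClasses(long header) {
        return (int) (header & NUM_CLASSES_MASK);
    }

    /** Serializes the 64-bit header in little-endian byte order. */
    static byte[] toLittleEndian(long header) {
        return ByteBuffer.allocate(Long.BYTES)
            .order(ByteOrder.LITTLE_ENDIAN)
            .putLong(header)
            .array();
    }

    public static void main(String[] args) {
        long header = withNumClasses(0x123456789ABCDE00L, 1);
        byte[] bytes = toLittleEndian(header);
        System.out.println("first byte on the wire: " + (bytes[0] & 0xFF));
        System.out.println("needs extra length: " + needsExtraLength(withNumClasses(0L, 20)));
    }
}
```

With little-endian order the low bits arrive in the first byte, so under this assumption a reader could learn the class count after reading a single byte.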
Single layer class meta
Type info of a custom-type field will be written as a one-byte flag instead of inlining its meta, because the field value may be null, and Fury can skip writing this field's type meta if an object of this type is serialized elsewhere in the current object graph.
Field order is left as an implementation detail and is not exposed in the specification; deserialization needs to re-sort fields based on the Fury field comparator. In this way, Fury can compute statistics on field names or types and use a more compact encoding.
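The re-sorting step can be illustrated with a small sketch. The comparator below (type name first, then field name) is only a stand-in, since the actual Fury field comparator is not described here; `FieldOrderSketch` and `FieldInfo` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

/** Hypothetical illustration, not a Fury class: both sides derive the same field order. */
final class FieldOrderSketch {
    /** Hypothetical field descriptor. */
    static final class FieldInfo {
        final String typeName;
        final String name;

        FieldInfo(String typeName, String name) {
            this.typeName = typeName;
            this.name = name;
        }
    }

    // Stand-in comparator; the real Fury comparator may order fields differently.
    static final Comparator<FieldInfo> ORDER =
        Comparator.comparing((FieldInfo f) -> f.typeName).thenComparing(f -> f.name);

    /** Returns field names in the shared, comparator-defined order. */
    static List<String> sortedNames(List<FieldInfo> fields) {
        List<FieldInfo> copy = new ArrayList<>(fields);
        copy.sort(ORDER);
        List<String> names = new ArrayList<>();
        for (FieldInfo f : copy) {
            names.add(f.name);
        }
        return names;
    }

    public static void main(String[] args) {
        List<FieldInfo> asDeclared = Arrays.asList(
            new FieldInfo("java.lang.String", "title"),
            new FieldInfo("int", "width"),
            new FieldInfo("int", "height"));
        System.out.println(sortedNames(asDeclared));
    }
}
```

Because the order is derived from the field set rather than transmitted, writer and reader only need to agree on the comparator, and the wire format is free to group fields however compresses best.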
Class name will be written as an unsigned id if the class is registered.
Field name will be written as an unsigned id if the field is marked with an ID by an annotation.
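Writing a class or field as an unsigned id pays off because small ids take very few bytes. The sketch below assumes a common base-128 varint (7 payload bits per byte, high bit set while more bytes follow); whether Fury's unsigned int encoding matches this byte for byte is not stated in this issue, so `UnsignedVarIntSketch` is illustrative only.

```java
import java.io.ByteArrayOutputStream;

/** Hypothetical base-128 varint codec, shown only to illustrate compact ids. */
final class UnsignedVarIntSketch {
    /** Encodes value as an unsigned varint, least significant 7-bit group first. */
    static byte[] encode(int value) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        while ((value & ~0x7F) != 0) {
            out.write((value & 0x7F) | 0x80); // continuation bit set
            value >>>= 7;
        }
        out.write(value); // last group, continuation bit clear
        return out.toByteArray();
    }

    /** Decodes a varint produced by encode. */
    static int decode(byte[] bytes) {
        int result = 0;
        int shift = 0;
        for (byte b : bytes) {
            result |= (b & 0x7F) << shift;
            if ((b & 0x80) == 0) {
                break;
            }
            shift += 7;
        }
        return result;
    }

    public static void main(String[] args) {
        for (int id : new int[] {5, 300}) {
            byte[] encoded = encode(id);
            System.out.println("id " + id + " -> " + encoded.length + " byte(s), decoded " + decode(encoded));
        }
    }
}
```

A registered class id or annotated field id below 128 costs one byte under this scheme, compared with the tens of bytes a full class name or field name would need.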
Additional context
#80 #202