Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use Uint8Array as internal repr of Bytes and fix #117 #195

Closed
wants to merge 10 commits into from

Conversation

ailisp
Copy link
Member

@ailisp ailisp commented Aug 26, 2022

builder.c code JS Uint8Array -> C uint8_t * and vice versa works, ensured by a few passed tests (valueReturn: JS -> C, readRegister: C -> JS).

To discuss:

  • should we keep Bytes for backward-compatible. Have Bytes Uint8Array | String, or a class which constructor can take a string and check it's all in range, or just alias of Uint8Array, or just drop Bytes and use Uint8Array for low level APIs
  • Uint8Array <> String decode/encode. In nodejs, this is typically via Buffer (a subclass and enhanced version of Uint8Array) or TextEncoder/TextDecoder, but none of them is in quickjs. WIth basic String.charCodeAt/fromCharCode, I can make a simplest latin1 encoder/decoder (each Uint8 is mapped to the same 0-255 char code). With the unicode C library shipped with quickjs, it seems possible to expose a UTF-8 and UTF-16 text encoder/decoder. I think these together are sufficiently good. Other ideas are welcome.
  • We serialize state to JSON, this is problematic with unicode characters. After we have the correct enforcement on storage key / value to be Uint8Array, unicode characters will be correctly rejected, but we must come up a strategy in our sdk's auto-serialization to handle user object with unicode string properties. A few possible strategy:
    • Throw error at any unicode string character, only allow latin1 chars.
    • Always utf-8 decode at serialization, and utf-8 encode at deserialization
    • Use a schema-less binary format to replace JSON, for example, CBOR

@volovyks @austinabell What are your thoughts? Thank you!

@ailisp
Copy link
Member Author

ailisp commented Aug 30, 2022

From discussion with @volovyks , I propose the following design (all alternatives I can think of are in PR description):

  • We will change to type Bytes = Uint8Array | string.
    • Use string is unchecked and auto utf-8 decoded (same as current behavior, and same as browser/nodejs TextDecoder's default encoding).
    • Use string will be warned at compile time.
  • For those api.ts functions that were returning Bytes, we'll return Uint8Array. (Breaking change)
  • Provide TextEncoder, TextDecoder class, implemented in C, can do utf-8. utf-16le/be and latin1 encoding
  • Auto JSON serialization on arguments, state, collections are auto utf-8 decoded to uint8_t* in C, and auto utf-8 encoded when deserialize. (same as current behavior, but we'll explicitly document this fact)
  • Not in this PR, we can provide a CBOR serializer and borsh serializer support, that does not need TextDecoder.

In summary, impact to user is in low level APIs. Function returning Bytes now return Uint8Array. Function taking Bytes can still use string, but recommended to use Uint8Array. High level APIs (nearbindgen & collections) are unchanged. Given that most user contracts are built with high level APIs, the impact is minimum.

Please comment if there's anything looks wrong or there's a better design!

let storageKey = this.keyPrefix + JSON.stringify(key)
if (near.storageRemove(storageKey)) {
return JSON.parse(near.storageGetEvicted())
return JSON.parse(u8ArrayToLatin1(near.storageGetEvicted()))
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we localize conversion such conversions in api.ts?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It depends on a few decisions.
The complexity is arise from a few places:

  • should bytes be alias to string, alias to Uint8Array, or a polymorphic type
  • what should the storage* returns, string or Uint8Array
  • if it returns Uint8Array, what should auto-deserialization do, Uint8Array -> string -> JSON.parse, or a binary-format deserialization?

@ailisp ailisp linked an issue Sep 21, 2022 that may be closed by this pull request
Copy link
Contributor

@austinabell austinabell left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We serialize state to JSON, this is problematic with unicode characters. After we have the correct enforcement on storage key / value to be Uint8Array, unicode characters will be correctly rejected, but we must come up a strategy in our sdk's auto-serialization to handle user object with unicode string properties

Why do you need to reject unicode characters if everything is being translated to utf8?

Throw error at any unicode string character, only allow latin1 chars.

Did you mean only allow utf8 chars? Why latin1? (curiosity) The changes feel like it's actually utf8

In general I think at a high level #195 (comment) makes sense

Comment on lines +5 to +7
readonly keyPrefix: string;

constructor(keyPrefix: Bytes) {
constructor(keyPrefix: string) {
Copy link
Contributor

@austinabell austinabell Sep 23, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why doesn't this accept Bytes input? Wouldn't it make more sense to store the prefix as UInt8Array since it isn't really a utf16 string but bytes (or intended to be)?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, that makes most sense. I'm experimenting different approaches in collections, this one tries to keep backward compatibility but implementation then looks really awkward and not correct

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't it backwards compatible if using Bytes? One of the variants would be string? Does it affect anything if the internal type changes?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes it is backward compatible, if Bytes is string | Uint8Array.

}
throw new Error("bytes: expected string or Uint8Array");
return ret;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
return ret;
return String.fromCharCode(...array);

And you can delete the lines above

@ailisp
Copy link
Member Author

ailisp commented Sep 23, 2022

Why do you need to reject unicode characters if everything is being translated to utf8?

The problem is ambiguity. A utf-8 sequence in byte and a string that is utf-8 encoded from same bytes will be serialized to the same thing, and you cannot know which one it is from when doing deserialize.

Did you mean only allow utf8 chars? Why latin1? (curiosity) The changes feel like it's actually utf8

I want to note that this bullet point is one of the possible approaches, but not the approach implemented in this PR. So yeah your observation "The changes feel like it's actually utf8" is right. This alternative approach is to restrict to must pass a JS string full of latin1 character (char code 0-255 only) to ensure correctness in deserialization.

In general I think at a high level #195 (comment) makes sense

Thanks! Good to know it's a reasonable direction

@no2chem
Copy link

no2chem commented Oct 27, 2022

Hi all, what is the status on this PR?

In the current form, it's pretty much impossible to use bytes in a contract without the data being mangled in UTF-16 conversion issues, making any application which involves raw bytes unusable. As a workaround I am base64 encoding everything but this seems extremely inefficient.

IMO, bytes should = Uint8Array and nothing else. For example, near.ecrecover takes bytes as input. Why would you want the ambiguity of a signature with string.length == 64 being an unknown number bytes and an invalid signature?

I think this PR needs urgent attention as the lack of being able to using a raw bytes array makes the JS SDK unusable for many applications.

@ailisp
Copy link
Member Author

ailisp commented Nov 22, 2022

@no2chem you are right, I'll look into this week

@ailisp ailisp mentioned this pull request Nov 22, 2022
7 tasks
@ailisp
Copy link
Member Author

ailisp commented Dec 2, 2022

superseded by #308 , basically what we agreed here #195 (comment) + @no2chem suggested:

bytes should = Uint8Array and nothing else

are implemented. Let's review and further discuss in #308

@ailisp ailisp closed this Dec 2, 2022
@ailisp ailisp deleted the arraybuffer branch December 2, 2022 14:23
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Change Bytes from string alias to avoid misuse
4 participants