Use Uint8Array as internal repr of Bytes and fix #117 #195

ailisp · 2022-08-26T16:29:45Z

The builder.c code that converts a JS Uint8Array to a C uint8_t * and back now works, as confirmed by a few passing tests (valueReturn: JS -> C, readRegister: C -> JS).

To discuss:

Should we keep Bytes for backward compatibility? The options are: make Bytes the union Uint8Array | string; make it a class whose constructor takes a string and checks that every char code is in range; make it just an alias of Uint8Array; or drop Bytes and use Uint8Array directly in the low-level APIs.
Uint8Array <> string decoding/encoding. In Node.js this is typically done via Buffer (a subclass and enhanced version of Uint8Array) or TextEncoder/TextDecoder, but neither is available in QuickJS. With basic String.charCodeAt/fromCharCode, I can write a minimal latin1 encoder/decoder (each Uint8 maps to the char code with the same 0-255 value). With the Unicode C library shipped with QuickJS, it seems possible to expose UTF-8 and UTF-16 text encoders/decoders. I think these together are good enough. Other ideas are welcome.
We serialize state to JSON, which is problematic with Unicode characters. Once we correctly enforce that storage keys and values are Uint8Array, Unicode characters will be correctly rejected, but we must come up with a strategy for our SDK's auto-serialization to handle user objects with Unicode string properties. A few possible strategies:
- Throw an error on any string character outside latin1; only allow latin1 chars.
- Always UTF-8 encode strings to bytes at serialization, and UTF-8 decode bytes back to strings at deserialization.
- Use a schema-less binary format to replace JSON, for example CBOR.
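
The latin1 mapping described above needs only String.charCodeAt/fromCharCode. A minimal sketch, assuming a strict encoder that rejects char codes above 255; `u8ArrayToLatin1` is the name used in this PR's diff, while `latin1ToU8Array` and both function bodies are illustrative assumptions rather than the PR's code:

```typescript
// Hypothetical name: encode a latin1-only JS string, one byte per char code.
function latin1ToU8Array(s: string): Uint8Array {
  const out = new Uint8Array(s.length);
  for (let i = 0; i < s.length; i++) {
    const code = s.charCodeAt(i);
    if (code > 0xff) {
      // Anything above 255 cannot be stored in a single byte.
      throw new Error(`char code ${code} at index ${i} is outside latin1 (0-255)`);
    }
    out[i] = code;
  }
  return out;
}

// Decode: each byte becomes the character with the same char code.
// (Assumed body; only the name appears in the diff.)
function u8ArrayToLatin1(bytes: Uint8Array): string {
  let s = "";
  bytes.forEach((b) => {
    s += String.fromCharCode(b);
  });
  return s;
}

const cafe = latin1ToU8Array("caf\u00e9");
console.log(cafe.length, u8ArrayToLatin1(cafe) === "caf\u00e9");
```

Because the mapping is one byte per character, encoding followed by decoding returns the original string for any latin1-only input.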

@volovyks @austinabell What are your thoughts? Thank you!

ailisp · 2022-08-30T03:11:53Z

After discussing with @volovyks, I propose the following design (all the alternatives I can think of are in the PR description):

We will change the type to Bytes = Uint8Array | string.
- Passing a string is unchecked, and the string is automatically UTF-8 encoded (same as the current behavior, and UTF-8 is also the default encoding of the browser/Node.js TextDecoder).
- Passing a string will produce a warning at compile time.
The api.ts functions that used to return Bytes will return Uint8Array. (Breaking change)
Provide TextEncoder and TextDecoder classes, implemented in C, that support the utf-8, utf-16le/be and latin1 encodings.
Auto JSON serialization of arguments, state, and collections UTF-8 encodes the JSON string to uint8_t* in C, and UTF-8 decodes it on deserialization. (Same as the current behavior, but we'll explicitly document this fact.)
Not in this PR: we can later provide CBOR and borsh serializers, which do not need TextDecoder.

In summary, the impact on users is limited to the low-level APIs. Functions returning Bytes now return Uint8Array. Functions taking Bytes can still be passed a string, but Uint8Array is recommended. High-level APIs (nearbindgen & collections) are unchanged. Given that most user contracts are built with the high-level APIs, the impact is minimal.
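
A minimal sketch of how a low-level function could accept the proposed union type, assuming strings are UTF-8 encoded on the way in. `toU8Array` and `utf8Encode` are hypothetical names, and the encoder is hand-written here only because TextEncoder is not available in QuickJS; the PR proposes doing this step in C:

```typescript
type Bytes = Uint8Array | string;

// Hand-written UTF-8 encoder (illustrative; lone surrogates become U+FFFD).
function utf8Encode(s: string): Uint8Array {
  const out: number[] = [];
  for (let i = 0; i < s.length; i++) {
    let cp = s.charCodeAt(i);
    // Combine a valid surrogate pair into a single code point.
    if (cp >= 0xd800 && cp <= 0xdbff && i + 1 < s.length) {
      const lo = s.charCodeAt(i + 1);
      if (lo >= 0xdc00 && lo <= 0xdfff) {
        cp = 0x10000 + ((cp - 0xd800) << 10) + (lo - 0xdc00);
        i++;
      }
    }
    if (cp >= 0xd800 && cp <= 0xdfff) cp = 0xfffd; // unpaired surrogate
    if (cp < 0x80) {
      out.push(cp);
    } else if (cp < 0x800) {
      out.push(0xc0 | (cp >> 6), 0x80 | (cp & 0x3f));
    } else if (cp < 0x10000) {
      out.push(0xe0 | (cp >> 12), 0x80 | ((cp >> 6) & 0x3f), 0x80 | (cp & 0x3f));
    } else {
      out.push(
        0xf0 | (cp >> 18),
        0x80 | ((cp >> 12) & 0x3f),
        0x80 | ((cp >> 6) & 0x3f),
        0x80 | (cp & 0x3f)
      );
    }
  }
  return new Uint8Array(out);
}

// Hypothetical normalizer: strings are UTF-8 encoded, Uint8Array passes through.
function toU8Array(value: Bytes): Uint8Array {
  return typeof value === "string" ? utf8Encode(value) : value;
}

console.log(toU8Array("h\u00e9").length); // 3
```

With a helper like this, every function that takes Bytes can work on Uint8Array internally, which matches the proposal that functions returning Bytes always return Uint8Array.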

Please comment if there's anything looks wrong or there's a better design!

volovyks · 2022-09-01T18:36:38Z

src/collections/lookup-map.ts

        let storageKey = this.keyPrefix + JSON.stringify(key)
        if (near.storageRemove(storageKey)) {
-            return JSON.parse(near.storageGetEvicted())
+            return JSON.parse(u8ArrayToLatin1(near.storageGetEvicted()))


Can we localize such conversions in api.ts?

ailisp · 2022-09-02

It depends on a few decisions. The complexity arises from a few open questions:

- Should Bytes be an alias of string, an alias of Uint8Array, or a polymorphic type?
- What should the storage* functions return, string or Uint8Array?
- If they return Uint8Array, what should auto-deserialization do: Uint8Array -> string -> JSON.parse, or a binary-format deserialization?
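
To make the "Uint8Array -> string -> JSON.parse" option concrete, here is a minimal sketch of keeping the conversion in one helper so that collections never touch the byte-to-string step themselves. `deserializeJson` is a hypothetical name, the `Uint8Array | null` input type is an assumption about what a storage read returns, and the body of `u8ArrayToLatin1` is assumed (only its name appears in the diff):

```typescript
// Assumed implementation of the helper named in the diff.
function u8ArrayToLatin1(bytes: Uint8Array): string {
  let s = "";
  bytes.forEach((b) => {
    s += String.fromCharCode(b);
  });
  return s;
}

// Hypothetical helper: the single place where raw storage bytes become a JS value.
// Only safe while the stored JSON contains nothing but latin1 characters.
function deserializeJson(raw: Uint8Array | null): unknown {
  if (raw === null) return null;
  return JSON.parse(u8ArrayToLatin1(raw));
}

// A collection would call deserializeJson(...) on the bytes it read
// instead of converting them to a string inline.
const stored = new Uint8Array([123, 34, 97, 34, 58, 49, 125]); // bytes of {"a":1}
console.log(deserializeJson(stored));
```

Switching to a binary format later would then mean changing only this helper, not every collection.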

austinabell

> We serialize state to JSON, this is problematic with unicode characters. After we have the correct enforcement on storage key / value to be Uint8Array, unicode characters will be correctly rejected, but we must come up a strategy in our sdk's auto-serialization to handle user object with unicode string properties

Why do you need to reject unicode characters if everything is being translated to utf8?

> Throw error at any unicode string character, only allow latin1 chars.

Did you mean only allow utf8 chars? Why latin1? (curiosity) The changes feel like it's actually utf8

In general I think at a high level #195 (comment) makes sense

austinabell · 2022-09-23T06:44:25Z

src/collections/lookup-map.ts

+    readonly keyPrefix: string;

-    constructor(keyPrefix: Bytes) {
+    constructor(keyPrefix: string) {


Why doesn't this accept Bytes input? Wouldn't it make more sense to store the prefix as Uint8Array, since it isn't really a UTF-16 string but bytes (or intended to be)?

ailisp · 2022-09-23

Yeah, that makes the most sense. I'm experimenting with different approaches in collections; this one tries to keep backward compatibility, but the implementation then looks really awkward and not correct.

austinabell · 2022-09-23

Isn't it backwards compatible if using Bytes? One of the variants would be string? Does it affect anything if the internal type changes?

ailisp · 2022-09-26

Yes, it is backward compatible if Bytes is string | Uint8Array.

austinabell · 2022-09-23T06:51:05Z

src/utils.ts

  }
-  throw new Error("bytes: expected string or Uint8Array");
+  return ret;


Suggested change

-  return ret;
+  return String.fromCharCode(...array);

And you can delete the lines above
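
One caveat with the spread form: it passes every byte as a separate function argument, and JavaScript engines cap the number of arguments, so a very large array can fail with a RangeError in some engines. A minimal sketch of a chunked variant, assuming a chunk size of 8192 (an arbitrary choice); `u8ArrayToLatin1Chunked` is a hypothetical name, not part of the PR:

```typescript
// Convert bytes to a latin1 string in fixed-size chunks so that no single
// call receives more than chunkSize arguments.
function u8ArrayToLatin1Chunked(array: Uint8Array, chunkSize = 8192): string {
  let s = "";
  for (let i = 0; i < array.length; i += chunkSize) {
    // subarray creates a view, not a copy; the end index is clamped to length.
    const chunk = array.subarray(i, i + chunkSize);
    // Same effect as String.fromCharCode(...chunk), written with apply.
    s += String.fromCharCode.apply(null, Array.prototype.slice.call(chunk));
  }
  return s;
}

const big = new Uint8Array(20000);
for (let i = 0; i < big.length; i++) big[i] = i & 0xff;
console.log(u8ArrayToLatin1Chunked(big).length); // 20000
```

For the short keys and values typical in contract storage the one-line spread is fine; the chunked form only matters for large payloads.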

ailisp · 2022-09-23T12:54:16Z

> Why do you need to reject unicode characters if everything is being translated to utf8?

The problem is ambiguity. A UTF-8 byte sequence and a string that UTF-8 encodes to those same bytes are serialized to the same thing, so you cannot tell which one it came from when deserializing.
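
The ambiguity can be shown with two different inputs that end up as identical stored bytes. This is a minimal sketch, assuming a write path that UTF-8 encodes strings and passes Uint8Array through unchanged; `toStored` and `utf8Encode` are hypothetical names, and the encoder relies on encodeURIComponent, which percent-encodes the UTF-8 bytes of non-ASCII characters (and throws on lone surrogates):

```typescript
type Bytes = Uint8Array | string;

// UTF-8 encode by parsing the %XX escapes produced by encodeURIComponent.
function utf8Encode(s: string): Uint8Array {
  const escaped = encodeURIComponent(s);
  const out: number[] = [];
  for (let i = 0; i < escaped.length; i++) {
    if (escaped.charAt(i) === "%") {
      out.push(parseInt(escaped.slice(i + 1, i + 3), 16));
      i += 2;
    } else {
      out.push(escaped.charCodeAt(i));
    }
  }
  return new Uint8Array(out);
}

// Hypothetical write path under Bytes = Uint8Array | string.
function toStored(value: Bytes): Uint8Array {
  return typeof value === "string" ? utf8Encode(value) : value;
}

const fromText = toStored("\u00e9"); // the one-character string "é"
const fromRaw = toStored(new Uint8Array([0xc3, 0xa9])); // two raw bytes

// Both are the bytes 0xC3 0xA9; nothing in storage records which input it was.
console.log(Array.prototype.slice.call(fromText), Array.prototype.slice.call(fromRaw));
```

On read, the stored bytes alone cannot say whether the caller should get back a string or a Uint8Array, which is why the type has to be decided by the API rather than inferred from the data.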

> Did you mean only allow utf8 chars? Why latin1? (curiosity) The changes feel like it's actually utf8

I want to note that this bullet point is one of the possible approaches, not the approach implemented in this PR. So yes, your observation "The changes feel like it's actually utf8" is right. The alternative approach is to require a JS string made up entirely of latin1 characters (char codes 0-255 only), to ensure correctness in deserialization.

> In general I think at a high level #195 (comment) makes sense

Thanks! Good to know it's a reasonable direction.

no2chem · 2022-10-27T06:48:32Z

Hi all, what is the status on this PR?

In its current form, it's pretty much impossible to use bytes in a contract without the data being mangled by UTF-16 conversion, which makes any application that involves raw bytes unusable. As a workaround I am base64-encoding everything, but this seems extremely inefficient.
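
For a sense of the size cost of the base64 workaround: padded base64 turns every 3 input bytes into 4 output characters. A minimal sketch of that arithmetic; `base64Length` is a hypothetical helper used only for illustration:

```typescript
// Length of padded base64 text for n input bytes:
// each group of up to 3 bytes becomes 4 characters.
function base64Length(n: number): number {
  return 4 * Math.ceil(n / 3);
}

console.log(base64Length(64)); // 88
console.log(base64Length(32)); // 44
```

So a 64-byte value grows to 88 characters, an overhead of 37.5% before any further encoding of the string itself.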

IMO, bytes should = Uint8Array and nothing else. For example, near.ecrecover takes bytes as input. Why would you want the ambiguity of a signature with string.length == 64 being an unknown number of bytes and an invalid signature?
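
The length mismatch is easy to reproduce. A minimal sketch, assuming the string is UTF-8 encoded when it crosses into the host (as described earlier in this thread); `utf8ByteLength` is a hypothetical helper and no SDK function is called:

```typescript
// Number of bytes the UTF-8 encoding of s would occupy.
function utf8ByteLength(s: string): number {
  let n = 0;
  for (let i = 0; i < s.length; i++) {
    const c = s.charCodeAt(i);
    if (c < 0x80) {
      n += 1;
    } else if (c < 0x800) {
      n += 2;
    } else if (c >= 0xd800 && c <= 0xdbff && i + 1 < s.length) {
      const lo = s.charCodeAt(i + 1);
      if (lo >= 0xdc00 && lo <= 0xdfff) {
        n += 4; // valid surrogate pair, one 4-byte code point
        i++;
      } else {
        n += 3; // lone surrogate, counted as U+FFFD
      }
    } else {
      n += 3;
    }
  }
  return n;
}

// A 64-byte "signature" whose bytes are all >= 0x80, packed one char per byte.
const sigBytes = new Uint8Array(64);
for (let i = 0; i < sigBytes.length; i++) sigBytes[i] = 0x80 + i;
let sigString = "";
sigBytes.forEach((b) => {
  sigString += String.fromCharCode(b);
});

console.log(sigString.length); // 64
console.log(utf8ByteLength(sigString)); // 128
```

The string reports 64 characters, but every character at or above 0x80 takes two bytes in UTF-8, so the host would receive 128 bytes instead of a 64-byte signature.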

I think this PR needs urgent attention, as being unable to use a raw byte array makes the JS SDK unusable for many applications.

ailisp · 2022-11-22T07:07:17Z

@no2chem you are right, I'll look into it this week.

ailisp · 2022-12-02T14:23:06Z

Superseded by #308, which implements basically what we agreed on in #195 (comment), plus what @no2chem suggested:

> bytes should = Uint8Array and nothing else

Let's review and discuss further in #308.

ailisp and others added 7 commits August 23, 2022 17:05

experiment to replace low level of bytes from string to arraybuffer

9a974ad

use uint8array instead of arraybuffer to make API better

630e1ba

make all c api takes uint8array

90290d5

Merge develop into arraybuffer

755f586

log utf8 uint8array

9dc9f67

Merge branch 'arraybuffer' of github.com:near/near-sdk-js into arrayb…

6f77d71

…uffer

Merge refs/heads/develop into arraybuffer

37fc630

ailisp added 3 commits September 1, 2022 15:56

new Bytes interface

6558dd5

Merge branch 'arraybuffer' of github.com:near/near-sdk-js into arrayb…

97baaf4

…uffer

fix and keep lookupmap backward compatible

a4d3373

volovyks reviewed Sep 1, 2022

ailisp linked an issue Sep 21, 2022 that may be closed by this pull request

Change Bytes from string alias to avoid misuse #117

Closed

austinabell reviewed Sep 23, 2022

ailisp mentioned this pull request Nov 22, 2022

Use uint8array #308

Merged

7 tasks

ailisp closed this Dec 2, 2022

ailisp deleted the arraybuffer branch December 2, 2022 14:23
