issue 5336 #5453

Closed

frank-dong-ms-zz
Contributor

fix issue #5336

  1. Use a byte array to create the tensor instead of a string.
  2. Use Unicode encoding instead of UTF-8.

This issue is a little complicated, so please read through the details below:

The user wants to load a pb model in ML.NET. The input tensor, shown below, is a serialized Example object (a binary buffer, not a text string):
inputs['inputs'] tensor_info: dtype: DT_STRING shape: (-1) name: input_example_tensor:0

A workable solution I found is to first convert the Example object to a protobuf-encoded byte array:
example.ToByteArray()
then convert the byte array to a string (char array) using a reliable encoding (ideally Unicode or Base64):
Encoding.Unicode.GetString(example.ToByteArray())
ML.NET then converts the string back to a byte array with the same encoding and passes it to TF.NET:
Encoding.Unicode.GetBytes(((ReadOnlyMemory<char>)(object)data[i]).ToArray());

ML.NET creates the Tensor in CastDataAndReturnAsTensor. Previously we used UTF-8 there to convert the string to a byte array. UTF-8 is not a reliable encoding for arbitrary binary data, as I described in this comment, so I would like to change the encoding to Unicode.
Also, Xiaoyun recently upgraded our TF version in this PR and changed Tensor creation to use string[] instead of byte[][]. In this case we need byte[][], because the input string is itself converted from a protobuf-encoded binary buffer.
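
To illustrate the UTF-8 concern, here is a minimal sketch of a byte → string → byte round trip. The payload is an arbitrary stand-in for a protobuf buffer, not a real serialized Example, and the class and helper names (`EncodingRoundTrip`, `SurvivesUtf8`, `SurvivesBase64`) are hypothetical and not part of ML.NET or TF.NET.

```csharp
using System;
using System.Linq;
using System.Text;

public static class EncodingRoundTrip
{
    // Decode the bytes as UTF-8 text, re-encode, and compare with the input.
    public static bool SurvivesUtf8(byte[] data) =>
        Encoding.UTF8.GetBytes(Encoding.UTF8.GetString(data)).SequenceEqual(data);

    // Same round trip, but through a Base64 string.
    public static bool SurvivesBase64(byte[] data) =>
        Convert.FromBase64String(Convert.ToBase64String(data)).SequenceEqual(data);

    public static void Main()
    {
        // Assumed sample: 0xFF and 0x80 are not valid UTF-8 on their own,
        // so decoding replaces them with U+FFFD and the original bytes are lost.
        byte[] payload = { 0x0A, 0x03, 0xFF, 0x80, 0x01 };

        Console.WriteLine(SurvivesUtf8(payload));   // False
        Console.WriteLine(SurvivesBase64(payload)); // True
    }
}
```

Base64 only helps if the receiving side decodes Base64 as well, which matches the "same encoding on both sides" requirement described above.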

@frank-dong-ms-zz
Contributor Author

I realized that UTF-8 is the default encoding for TensorFlow, so I can't use Unicode encoding here.

@frank-dong-ms-zz frank-dong-ms-zz deleted the frdong/issue-5336 branch October 24, 2020 01:36
@ghost ghost locked as resolved and limited conversation to collaborators Mar 17, 2022