fix:go emoji to java emoji(a character takes up 3-4 bytes of special symbols) #129

Closed
pantianying opened this issue Sep 5, 2019 · 8 comments · Fixed by #131

@pantianying
Member

What would you like to be added:

Why is this needed:
image

@wongoo
Contributor

wongoo commented Sep 5, 2019

@pantianying Please provide a unit test so that we can follow up.

@pantianying
Member Author

P.S.:

  1. When Java returns a string containing emoji, the Go side decodes it as garbled text.

  2. If Java returns emoji inside a complex map structure, the whole serialization fails.

  3. Emoji encoded by Go cannot be received by Java.

@pantianying changed the title from "fix:go emoji to java emoji" to "fix:go emoji to java emoji(a character takes up 3-4 bytes of special symbols)" on Sep 10, 2019
@wongoo
Contributor

wongoo commented Sep 11, 2019

@pantianying I found that Java uses two 16-bit chars (a UTF-16 surrogate pair) to represent the emoji "🤣", while Go uses a single rune. So the length of the emoji is 2 in Java but 1 (counted in runes) in Go.

The Hessian protocol says:

The length is the number of 16-bit characters.

So this is a bug in the Go hessian2 implementation; I will try to fix it.
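
To make the length mismatch concrete, here is a minimal Go sketch that counts the same string three ways. The helper name `javaLength` is hypothetical (it is not part of any hessian2 package); it only illustrates the "number of 16-bit characters" rule quoted above.

```go
package main

import (
	"fmt"
	"unicode/utf16"
	"unicode/utf8"
)

// javaLength (hypothetical helper) returns the number of 16-bit UTF-16
// code units in s, which is what Java's String.length() reports.
func javaLength(s string) int {
	return len(utf16.Encode([]rune(s)))
}

func main() {
	s := "🤣"
	fmt.Println(len(s))                    // UTF-8 bytes: 4
	fmt.Println(utf8.RuneCountInString(s)) // runes (code points): 1
	fmt.Println(javaLength(s))             // UTF-16 code units: 2
}
```

Under the quoted rule, a Hessian string length written from Go would need to be the third value, not the rune count.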

@wongoo
Contributor

wongoo commented Sep 12, 2019

New findings:

  1. Each 16-bit char is encoded to UTF-8 individually (at most 3 bytes) in com.caucho.hessian.io.Hessian2Output.printString(), so a surrogate pair is written as two 3-byte sequences:
  /**
   * Prints a string to the stream, encoded as UTF-8
   *
   * @param v the string to print.
   */
  public void printString(char []v, int strOffset, int length)
    throws IOException
  {
    int offset = _offset;
    byte []buffer = _buffer;

    for (int i = 0; i < length; i++) {
      if (SIZE <= offset + 16) {
        _offset = offset;
        flushBuffer();
        offset = _offset;
      }

      char ch = v[i + strOffset];

      if (ch < 0x80)
        buffer[offset++] = (byte) (ch);
      else if (ch < 0x800) {
        buffer[offset++] = (byte) (0xc0 + ((ch >> 6) & 0x1f));
        buffer[offset++] = (byte) (0x80 + (ch & 0x3f));
      }
      else {
        buffer[offset++] = (byte) (0xe0 + ((ch >> 12) & 0xf));
        buffer[offset++] = (byte) (0x80 + ((ch >> 6) & 0x3f));
        buffer[offset++] = (byte) (0x80 + (ch & 0x3f));
      }
    }

    _offset = offset;
  }
  2. A UTF-8 character is decoded in com.caucho.hessian.io.Hessian2Input.parseUTF8Char(), which accepts only 1- to 3-byte sequences:
  private int parseUTF8Char()
    throws IOException
  {
    int ch = _offset < _length ? (_buffer[_offset++] & 0xff) : read();

    if (ch < 0x80)
      return ch;
    else if ((ch & 0xe0) == 0xc0) {
      int ch1 = read();
      int v = ((ch & 0x1f) << 6) + (ch1 & 0x3f);

      return v;
    }
    else if ((ch & 0xf0) == 0xe0) {
      int ch1 = read();
      int ch2 = read();
      int v = ((ch & 0x0f) << 12) + ((ch1 & 0x3f) << 6) + (ch2 & 0x3f);

      return v;
    }
    else
      throw error("bad utf-8 encoding at " + codeName(ch));
  }

@fangyincheng
Contributor

image

@wongoo
Contributor

wongoo commented Sep 16, 2019

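Java's char is a single 16-bit unit (UCS-2 sized), so characters outside the Basic Multilingual Plane need a surrogate pair, while Go's rune is a 32-bit code point (UCS-4).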

@wongoo
Contributor

wongoo commented Sep 16, 2019

I'm working on it; it may be resolved as early as tomorrow.
