String Field does not handle non-UTF8 values correctly #297
Comments
Hey @cheukwing! Thanks for reporting the issue. I'm not sure how to solve it, but the PR you created (#298) reverts PR #227, which fixed the case where the encoded BCD length was not equal to the original length (for 2 input bytes / ASCII chars you get 1 output byte / hex). Following that PR, we assume that the values we work with (when we specify the field Length) are ASCII chars, int64 or bytes. I'm not sure how we can support UTF-8 strings. Here is how we could do it by working with runes instead of bytes: #299. I took your branch and the tests in it.
To support utf8 strings wouldn't we need to encode |
@adamdecaf I found that we can use utf8.RuneCount(data) to address this. I also think that we should switch to runes only in String and In the #299 I did exactly this: switched to runes without modifying the logic. All tests that were created by @cheukwing in his PR pass. |
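To make the byte-versus-rune distinction concrete, here is a minimal standalone sketch using only the standard library. The helper name `byteAndRuneLen` is made up for illustration and is not part of the library under discussion.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// byteAndRuneLen is a hypothetical helper: it returns the UTF-8 byte length
// of s and the number of runes (code points) in s.
func byteAndRuneLen(s string) (int, int) {
	return len(s), utf8.RuneCountInString(s)
}

func main() {
	// "h\u00fcllo" is "hüllo"; ü takes two bytes in UTF-8.
	b, r := byteAndRuneLen("h\u00fcllo")
	fmt.Println(b, r) // 6 5
}
```

For a `[]byte` value, `utf8.RuneCount(data)` gives the same rune count.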
@cheukwing the UTF-8 support is tricky. I'm still not sure how we should address it. While #299 addresses your need, I added one more test to show the controversy of the change: In the test, the

    spec := &Spec{
        Length:      10,
        Description: "Field",
        Enc:         encoding.Binary,
        Pref:        prefix.Binary.Fixed,
        Pad:         padding.Left(' '),
    }
    str := NewStringValue("hüllo")
    str.SetSpec(spec)
    packed, err := str.Pack()
    require.NoError(t, err)
    assert.Len(t, packed, 10) // fails
    str2 := NewString(spec)
    _, err = str2.Unpack(packed)
    require.NoError(t, err)
    assert.Equal(t, "hüllo", str2.Value()) // fails

The #299 PR works only because of the custom encoding that encodes a UTF-8 char into a single-byte char. It's not possible to have a fixed length for a UTF-8 string, because when you read data from the network you read N bytes, not N UTF-8 chars. Maybe the solution here is to create a custom field? We can use
The issue is not so much about supporting UTF-8, but about (fully) supporting encodings which are not subsets of UTF-8 (e.g. EBCDIC1047). Currently, because Go strings are UTF-8, the encoded length/padding of EBCDIC1047 strings which include characters which are two bytes in UTF-8 will be incorrect. We do not necessarily need to use runes everywhere to support this. I tried your new test on my original PR, and it passed, although I understand that adding this padding logic in the field is not ideal. |
Sounds like we need to read/write the raw bytes to fully support EBCDIC1047 and translate them into Go runes. |
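A rough sketch of that idea, assuming a single-byte encoding: treat the wire data as raw bytes and translate each byte into a Go rune through a lookup table. The table `fromEBCDIC` and the function `decode` are hypothetical and cover only the byte values quoted in this issue; they are not the library's encoder.

```go
package main

import "fmt"

// fromEBCDIC is a hypothetical, partial EBCDIC 1047 table containing only
// the bytes that appear in this issue.
var fromEBCDIC = map[byte]rune{
	0x40: ' ',
	0x88: 'h',
	0x93: 'l',
	0x96: 'o',
	0xDC: '\u00fc', // ü
}

// decode translates raw single-byte-encoded data into a Go (UTF-8) string,
// one rune per input byte.
func decode(raw []byte) (string, error) {
	runes := make([]rune, 0, len(raw))
	for _, b := range raw {
		r, ok := fromEBCDIC[b]
		if !ok {
			return "", fmt.Errorf("no mapping for byte 0x%02X", b)
		}
		runes = append(runes, r)
	}
	return string(runes), nil
}

func main() {
	raw := []byte{0x88, 0xDC, 0x93, 0x93, 0x96}
	s, err := decode(raw)
	// 5 bytes on the wire become a 6-byte UTF-8 string.
	fmt.Println(s, err, len(raw), len(s))
}
```

With this approach the field length is always measured on the raw bytes, so the UTF-8 length of the Go string never enters the length or padding calculation.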
The String field (and potentially other fields) does not handle the encoding of non-UTF8 values correctly, leading to incorrect lengths.
For example, take the value hüllo and the encoding EBCDIC 1047.
In UTF-8, this is the hexadecimal bytes 0x68 0xC3 0xBC 0x6C 0x6C 0x6F, where ü is two bytes (0xC3 0xBC).
In EBCDIC 1047, this is 0x88 0xDC 0x93 0x93 0x96, where ü is one byte (0xDC).
In the String field, padding is done before the data is encoded, and assumes that the length of the data in UTF-8 will be the same as the length of the data in the encoded format.
This means that when we specify a fixed length of 10 with padding, we get 0x40 0x40 0x40 0x40 0x88 0xDC 0x93 0x93 0x96, which is only length 9.
When we specify a variable LL length, we get 0xF0 0xF6 0x88 0xDC 0x93 0x93 0x96, where the prefix 0xF0 0xF6 encodes a length of 6, yet the actual data only has a length of 5.