String Field does not handle non-UTF8 values correctly #297

cheukwing · 2023-11-06T11:48:18Z

The String field (and potentially other fields) does not handle the encoding of non-UTF8 values correctly, leading to incorrect lengths.

For example, take the value hüllo, and encoding EBCDIC 1047.

In UTF-8 hexadecimal bytes, this is 0x68 0xC3 0xBC 0x6C 0x6C 0x6F, where ü is two bytes 0xC3 0xBC.
In EBCDIC 1047, this is 0x88 0xDC 0x93 0x93 0x96, where ü is one byte 0xDC.

In the String field, padding is done before the data is encoded, and assumes the length of the data in UTF8 will be the same as the length of the data in the encoded format.

This means that when we specify a fixed length of 10 with padding, we get 0x40 0x40 0x40 0x40 0x88 0xDC 0x93 0x93 0x96, which is only length 9.
When we specify a variable LL length, we get 0xF0 0xF6 0x88 0xDC 0x93 0x93 0x96, which encoded that it has a length of 6 0xF0 0xF6, yet the actual data only has a length of 5.

spec := &Spec{
	Length:      10,
	Description: "Field",
	Enc:         encoding.EBCDIC1047,
	Pref:        prefix.EBCDIC1047.Fixed,
	Pad:         padding.Left(' '),
}
str := NewString(spec)
str.SetValue("hüllo")

packed, _ := str.Pack()
fmt.Printf("%X\n", packed) // 4040404088DC939396

lengthedSpec := &Spec{
	Length:      10,
	Description: "Field",
	Enc:         encoding.EBCDIC1047,
	Pref:        prefix.EBCDIC1047.LL,
}
str = NewString(lengthedSpec)
str.SetValue("hüllo")
packed, _ = str.Pack()
fmt.Printf("%X\n", packed) // F0F688DC939396

The text was updated successfully, but these errors were encountered:

alovak · 2023-11-08T17:18:20Z

Hey @cheukwing! Thanks for reporting the issue. I'm not sure how to solve this issue, but PR your created #298 reverts this PR #227 which fixed the issue, when encoded BCD length was not equal to the original length (for 2 input bytes / ASCII chars you get 1 output byte / hex). Following that PR we assume that the values we work with (when we specify the field Length) are ASCII chars, int64 or bytes.

I'm not sure how we can support UTF8 strings. Here is how we can do it by working with runes instead of bytes: #299. I took your branch and test in it.

adamdecaf · 2023-11-08T18:48:58Z

To support utf8 strings wouldn't we need to encode []rune? []byte wouldn't work because a rune can be more than one byte.

alovak · 2023-11-08T19:07:32Z

@adamdecaf I found that we can use utf8.RuneCount(data) to address this. I also think that we should switch to runes only in String and padding package (it's uses runes already, but not everywhere).

In the #299 I did exactly this: switched to runes without modifying the logic. All tests that were created by @cheukwing in his PR pass.

alovak · 2023-11-14T13:47:33Z

@cheukwing the utf8 support is tricky. I'm still not sure how we should address it.

While #299 addresses your need, I added one more test to show the controversy of the change:

https://github.com/moov-io/iso8583/pull/299/files#diff-628a377b7ef394e8fa0b877233eee6e2fbc4617798fe045b7cc8f31935c92f51R63

In the test, the Spec encoding is Binary. The test fails, as the value we pack and unpacked value do not match 🤷‍♂️ and the packed length is not 10 as expected by the spec.

	spec := &Spec{
		Length:      10,
		Description: "Field",
		Enc:         encoding.Binary, 
		Pref:        prefix.Binary.Fixed,
		Pad:         padding.Left(' '),
	}
	str := NewStringValue("hüllo")
	str.SetSpec(spec)
	packed, err := str.Pack()
	require.NoError(t, err)

	assert.Len(t, packed, 10) // fails

	str2 := NewString(spec)
	_, err = str2.Unpack(packed)
	require.NoError(t, err)

	assert.Equal(t, "hüllo", str2.Value()) // fails

The #299 PR works only because of custom encoding that encodes utf8 char into single byte char. It's not possible to have a fixed length for the utf8 string as when you read data from the network you read N bytes, not N utf8 chars.

Maybe the solution here is to create custom field? We can use runes but I think that may be confusing in the future. Please, share your thoughts about it all.

cheukwing · 2023-11-14T15:10:36Z

The issue is not so much about supporting UTF-8, but about (fully) supporting encodings which are not subsets of UTF-8 (e.g. EBCDIC1047). Currently, because Go strings are UTF-8, the encoded length/padding of EBCDIC1047 strings which include characters which are two bytes in UTF-8 will be incorrect. We do not necessarily need to use runes everywhere to support this.

I tried your new test on my original PR, and it passed, although I understand that adding this padding logic in the field is not ideal.
The problem with the current implementation is that we encode the data before we pad it, so we do not know how many padding characters we should add if the number of runes differs from the encoded length.

adamdecaf · 2023-11-14T15:19:46Z

Sounds like we need to read/write the raw bytes to fully support EBCDIC1047 and translate them into Go runes.

cheukwing mentioned this issue Nov 6, 2023

Fix encoding non-UTF8 values in String field #298

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

String Field does not handle non-UTF8 values correctly #297

String Field does not handle non-UTF8 values correctly #297

cheukwing commented Nov 6, 2023

alovak commented Nov 8, 2023

adamdecaf commented Nov 8, 2023

alovak commented Nov 8, 2023

alovak commented Nov 14, 2023 •

edited

Loading

cheukwing commented Nov 14, 2023

adamdecaf commented Nov 14, 2023

String Field does not handle non-UTF8 values correctly #297

String Field does not handle non-UTF8 values correctly #297

Comments

cheukwing commented Nov 6, 2023

alovak commented Nov 8, 2023

adamdecaf commented Nov 8, 2023

alovak commented Nov 8, 2023

alovak commented Nov 14, 2023 • edited Loading

cheukwing commented Nov 14, 2023

adamdecaf commented Nov 14, 2023

alovak commented Nov 14, 2023 •

edited

Loading