Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

String Field does not handle non-UTF8 values correctly #297

Open
cheukwing opened this issue Nov 6, 2023 · 6 comments
Open

String Field does not handle non-UTF8 values correctly #297

cheukwing opened this issue Nov 6, 2023 · 6 comments

Comments

@cheukwing
Copy link
Contributor

The String field (and potentially other fields) does not handle the encoding of non-UTF8 values correctly, leading to incorrect lengths.

For example, take the value hüllo, and encoding EBCDIC 1047.

In UTF-8 hexadecimal bytes, this is 0x68 0xC3 0xBC 0x6C 0x6C 0x6F, where ü is two bytes 0xC3 0xBC.
In EBCDIC 1047, this is 0x88 0xDC 0x93 0x93 0x96, where ü is one byte 0xDC.

In the String field, padding is done before the data is encoded, and assumes the length of the data in UTF8 will be the same as the length of the data in the encoded format.

This means that when we specify a fixed length of 10 with padding, we get 0x40 0x40 0x40 0x40 0x88 0xDC 0x93 0x93 0x96, which is only length 9.
When we specify a variable LL length, we get 0xF0 0xF6 0x88 0xDC 0x93 0x93 0x96, which encoded that it has a length of 6 0xF0 0xF6, yet the actual data only has a length of 5.

spec := &Spec{
	Length:      10,
	Description: "Field",
	Enc:         encoding.EBCDIC1047,
	Pref:        prefix.EBCDIC1047.Fixed,
	Pad:         padding.Left(' '),
}
str := NewString(spec)
str.SetValue("hüllo")

packed, _ := str.Pack()
fmt.Printf("%X\n", packed) // 4040404088DC939396

lengthedSpec := &Spec{
	Length:      10,
	Description: "Field",
	Enc:         encoding.EBCDIC1047,
	Pref:        prefix.EBCDIC1047.LL,
}
str = NewString(lengthedSpec)
str.SetValue("hüllo")
packed, _ = str.Pack()
fmt.Printf("%X\n", packed) // F0F688DC939396
@alovak
Copy link
Contributor

alovak commented Nov 8, 2023

Hey @cheukwing! Thanks for reporting the issue. I'm not sure how to solve this issue, but PR your created #298 reverts this PR #227 which fixed the issue, when encoded BCD length was not equal to the original length (for 2 input bytes / ASCII chars you get 1 output byte / hex). Following that PR we assume that the values we work with (when we specify the field Length) are ASCII chars, int64 or bytes.

I'm not sure how we can support UTF8 strings. Here is how we can do it by working with runes instead of bytes: #299. I took your branch and test in it.

@adamdecaf
Copy link
Member

To support utf8 strings wouldn't we need to encode []rune? []byte wouldn't work because a rune can be more than one byte.

@alovak
Copy link
Contributor

alovak commented Nov 8, 2023

@adamdecaf I found that we can use utf8.RuneCount(data) to address this. I also think that we should switch to runes only in String and padding package (it's uses runes already, but not everywhere).

In the #299 I did exactly this: switched to runes without modifying the logic. All tests that were created by @cheukwing in his PR pass.

@alovak
Copy link
Contributor

alovak commented Nov 14, 2023

@cheukwing the utf8 support is tricky. I'm still not sure how we should address it.

While #299 addresses your need, I added one more test to show the controversy of the change:

https://github.com/moov-io/iso8583/pull/299/files#diff-628a377b7ef394e8fa0b877233eee6e2fbc4617798fe045b7cc8f31935c92f51R63

In the test, the Spec encoding is Binary. The test fails, as the value we pack and unpacked value do not match 🤷‍♂️ and the packed length is not 10 as expected by the spec.

	spec := &Spec{
		Length:      10,
		Description: "Field",
		Enc:         encoding.Binary, 
		Pref:        prefix.Binary.Fixed,
		Pad:         padding.Left(' '),
	}
	str := NewStringValue("hüllo")
	str.SetSpec(spec)
	packed, err := str.Pack()
	require.NoError(t, err)

	assert.Len(t, packed, 10) // fails

	str2 := NewString(spec)
	_, err = str2.Unpack(packed)
	require.NoError(t, err)

	assert.Equal(t, "hüllo", str2.Value()) // fails

The #299 PR works only because of custom encoding that encodes utf8 char into single byte char. It's not possible to have a fixed length for the utf8 string as when you read data from the network you read N bytes, not N utf8 chars.

Maybe the solution here is to create custom field? We can use runes but I think that may be confusing in the future. Please, share your thoughts about it all.

@cheukwing
Copy link
Contributor Author

The issue is not so much about supporting UTF-8, but about (fully) supporting encodings which are not subsets of UTF-8 (e.g. EBCDIC1047). Currently, because Go strings are UTF-8, the encoded length/padding of EBCDIC1047 strings which include characters which are two bytes in UTF-8 will be incorrect. We do not necessarily need to use runes everywhere to support this.

I tried your new test on my original PR, and it passed, although I understand that adding this padding logic in the field is not ideal.
The problem with the current implementation is that we encode the data before we pad it, so we do not know how many padding characters we should add if the number of runes differs from the encoded length.

@adamdecaf
Copy link
Member

Sounds like we need to read/write the raw bytes to fully support EBCDIC1047 and translate them into Go runes.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants