Transliteration error #3

zfLQ2qx2 · 2020-03-02T21:13:25Z

Thank you very much for the wonderful library, I am glad it is here!

I did come across one transliteration issue: I believe してください should be shitekudasai instead of shitekutasai. I tried a number of Hiragana-Romaji converters that use different methods, and all of them choose "da" instead of "ta" for the fourth syllable.

mozillazg · 2020-03-04T01:25:00Z

Thanks for your report. I'll fix it later.

zfLQ2qx2 · 2020-03-06T15:11:22Z

@mozillazg I'm curious to see how you do it. Looking at x030.go, I see that 0x3060 is "da", so I don't understand how it becomes "ta" to begin with.

mozillazg · 2020-03-07T02:46:23Z

@zfLQ2qx2 I can't reproduce the issue:

$ go version
go version go1.13.4 darwin/amd64

$ go run unidecode/main.go "してください" | grep shitekudasai
shitekudasai

Let me know if anything was missed.

zfLQ2qx2 · 2020-03-09T00:06:55Z

@mozillazg Looks like the difference is that I'm normalizing the string to fully decomposed form using golang.org/x/text/transform and calling transform.Chain(norm.NFD) prior to transliterating with go-unidecode.

Before Hex: e38197e381a6e3818fe381a0e38195e38184
U+3057 'し' starts at byte position 0
U+3066 'て' starts at byte position 3
U+304F 'く' starts at byte position 6
U+3060 'だ' starts at byte position 9
U+3055 'さ' starts at byte position 12
U+3044 'い' starts at byte position 15

After Hex: e38197e381a6e3818fe3819fe38299e38195e38184
U+3057 'し' starts at byte position 0
U+3066 'て' starts at byte position 3
U+304F 'く' starts at byte position 6
U+305F 'た' starts at byte position 9
U+3099 '゙' starts at byte position 12
U+3055 'さ' starts at byte position 15
U+3044 'い' starts at byte position 18

So it looks like the normalization process changes 0x3060 to 0x305F plus 0x3099 (which is "combining katakana-hiragana voiced sound mark"), and these get transliterated to "ta" and "" respectively. OK, so now I understand where "ta" is coming from. It looks like the workaround is to normalize to the fully composed form (NFC) instead of the fully decomposed form (NFD).

I chose the fully decomposed form because I was trying to match the output of a nodejs function, but honestly there are several test cases for that which are kind of dubious, so I think using the fully composed form and then updating the test cases to match is the way to go.

Apologies for having bothered you with this, but it was interesting to work out.

mozillazg added the question label Aug 20, 2022
