Transliteration error #3

Open
zfLQ2qx2 opened this issue Mar 2, 2020 · 4 comments
@zfLQ2qx2

zfLQ2qx2 commented Mar 2, 2020

Thank you very much for the wonderful library, I am glad it is here!

I did come across one transliteration issue: I believe してください should be shitekudasai instead of shitekutasai. I tried a number of Hiragana-Romaji converters that use different methods, and all of them chose "da" instead of "ta" for the fourth syllable.

@mozillazg
Owner

Thanks for your report. I'll fix it later.

@zfLQ2qx2
Author

zfLQ2qx2 commented Mar 6, 2020

@mozillazg I'm curious to see how you do it. Looking at x030.go, I see that 0x3060 is "da", so I don't understand how it becomes "ta" to begin with.

@mozillazg
Owner

@zfLQ2qx2 I can't reproduce the issue:

$ go version
go version go1.13.4 darwin/amd64

$ go run unidecode/main.go "してください" | grep shitekudasai
shitekudasai

Let me know if anything was missed.

@zfLQ2qx2
Author

zfLQ2qx2 commented Mar 9, 2020

@mozillazg It looks like the difference is that I'm normalizing the string to its fully decomposed form (NFD) using golang.org/x/text/transform, calling transform.Chain(norm.NFD), before transliterating with go-unidecode.

Before Hex: e38197e381a6e3818fe381a0e38195e38184
U+3057 'し' starts at byte position 0
U+3066 'て' starts at byte position 3
U+304F 'く' starts at byte position 6
U+3060 'だ' starts at byte position 9
U+3055 'さ' starts at byte position 12
U+3044 'い' starts at byte position 15

After Hex: e38197e381a6e3818fe3819fe38299e38195e38184
U+3057 'し' starts at byte position 0
U+3066 'て' starts at byte position 3
U+304F 'く' starts at byte position 6
U+305F 'た' starts at byte position 9
U+3099 '゙' starts at byte position 12
U+3055 'さ' starts at byte position 15
U+3044 'い' starts at byte position 18

So it looks like the normalization process changes 0x3060 to 0x305F plus 0x3099 (the "combining katakana-hiragana voiced sound mark"), which are then transliterated to "ta" and "" respectively. OK, so now I understand where "ta" is coming from. The workaround is to normalize to the fully composed form (NFC) instead of the decomposed form (NFD).
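One way to catch this kind of input early is to check for combining marks before transliterating. The sketch below uses only the standard library and assumes that any nonspacing mark (Unicode category Mn, which includes U+3099) signals decomposed text. The function name hasCombiningMarks is made up for illustration. It only detects the problem; the recomposition itself would still be done with the NFC form from golang.org/x/text/unicode/norm.

```go
package main

import (
	"fmt"
	"unicode"
)

// hasCombiningMarks (hypothetical helper) reports whether s contains any
// nonspacing combining mark (category Mn), such as U+3099. A true result
// suggests the string is decomposed and a per-code-point transliterator
// would handle the base letter and the mark separately.
func hasCombiningMarks(s string) bool {
	for _, r := range s {
		if unicode.Is(unicode.Mn, r) {
			return true
		}
	}
	return false
}

func main() {
	composed := "\u3057\u3066\u304f\u3060\u3055\u3044"         // だ as U+3060
	decomposed := "\u3057\u3066\u304f\u305f\u3099\u3055\u3044" // た + U+3099

	fmt.Println("composed has combining marks:", hasCombiningMarks(composed))
	fmt.Println("decomposed has combining marks:", hasCombiningMarks(decomposed))
}
```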

I chose the fully decomposed form because I was trying to match the output of a nodejs function, but honestly several of its test cases are kind of dubious, so I think using the fully composed form and then updating the test cases to match is the way to go.

Apologies for having bothered you with this, but it was interesting to work out.
