feat: add CJK support #35
Conversation
@ngryman I know you are busy, but could you spare some time to review the changes please? This is a pretty significant fix.
Hi, you did a great job in this PR! However, I would still like to point out some of the problems:

**Treating full-width punctuation as words.** This is quite simple, so let's just treat them as punctuation and ignore them in the word count.

**Chinese ready, but not Japanese.** In my closed PR #34, I only did some preliminary tests for CJK words. In this PR, you only tested Simplified Chinese characters and did not test Japanese or Korean words. Another issue for Japanese is whether we need to treat a run of Katakana as one single word. 'アメリカ' (America) is one example: you wouldn't treat it as 4 words, but as 1.

So those are the issues I can come up with right now. I would help contribute to the code if I've got the time.
@f0rb1d Thanks for the review!
Regarding Katakana though, I asked my Japanese-speaking friend and she said treating multiple characters as one word overcomplicates things since there can be "contractions" as well. If you do find this functionality necessary, can you provide a raw algorithm? |
@Josh-Cena MS Word does treat full-width punctuation marks as words, but they're no different from their half-width counterparts; they are the same characters, just in a different Unicode block. As for Katakana, unfortunately, I'm not a native Japanese speaker either. The algorithm you suggested might be hard to implement since there are some "contractions", but if it's merely done by treating a run of Katakana characters as one word, that should be easy to do.
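To make the "different Unicode blocks" point concrete, here is a minimal sketch (not code from this PR; the helper name is my own): full-width ASCII variants occupy U+FF01 to U+FF5E and map to their half-width counterparts by subtracting the fixed offset 0xFEE0, so one could normalize them before applying the existing punctuation rules.

```javascript
// Sketch only: normalize full-width ASCII variants (U+FF01-U+FF5E)
// to half-width by subtracting the fixed offset 0xFEE0.
// Ideographic punctuation like '。' (U+3002) is outside this range
// and would need separate handling.
function toHalfWidth(str) {
  return str.replace(/[\uFF01-\uFF5E]/g, (ch) =>
    String.fromCharCode(ch.charCodeAt(0) - 0xFEE0)
  );
}

console.log(toHalfWidth('你好，世界！')); // '你好,世界!'
```

With this normalization in place, one set of punctuation rules covers both forms.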
Sorry for not replying recently... I've been busy with other stuff. I will come back to this later and pick up on the punctuation and Katakana issues. |
@Josh-Cena @f0rb1d Sorry for the super late answer and thanks for the contribution! I admit that I have pretty limited knowledge here so I'll trust your judgment on this. As long as this feature doesn't bring any regression I think it's a great addition 👏 . Let me know when you both think it's ready to 🚀 |
Signed-off-by: Josh-Cena <sidachen2003@gmail.com>
Hey @ngryman I'm done! No worries since I've been quite busy for the last few months as well, but hope it gets merged soon because as you know this lib is used by a lot of SSGs🤪 Several notes:
@f0rb1d I've handled Katakana like English letters and excluded punctuation from the final word count. Does that look good to you?
Sure, we can merge this without further approvals regarding the feature itself. However, it seems that the PR introduces some regressions: https://travis-ci.org/github/ngryman/reading-time/builds/771866964. The CI status should show up here, but for some reason it's not working 🤔 Could you take a look at these regressions? My answers to your suggestions/comments are below 👇
Right, I wouldn't worry about performance for now. If it happens to be an issue, we can address this later.
Sure, could you add this notice to the README?
When I created this package, there were no such things as lock files 🙀 . I guess it makes me a grandpa 👴
Thanks again! 👌
Ah, wasn't expecting … **Update:** it seems travis-ci.org has stopped working anyway, so I'll add the GitHub workflow here. Also, if you don't object to some new syntax 🤪 we can probably migrate to ES6 classes and get rid of all the …
@ngryman Hey, I think I've found the reason. The … Also, there are some … I've added more unit tests for …
Hey @ngryman, I've shipped two more improvements.

I tried not to introduce any breaking changes. If you are open to breaking changes (by releasing another major version) …
Thanks for the great contribution here 🙇🏻♂️
There are now multiple orthogonal changes involved, so the PR is getting a bit more complex. These are all great changes; it just makes the review a bit more difficult for me. I'm currently moving to a new city and have lots of things to handle right now, so I apologize for the slow review.
That being said, I managed to set aside some time today 🙌.
I'm generally 👍 for everything that you proposed here. I just left a couple of minor suggestions below.
One thing that I would like to do, however, is to move the CJK implementation into a dedicated function (i.e. `readingTimeCJK`). I think it would make sense to have it in a dedicated file as well.
Two main reasons for that:
- It seems pretty unlikely that real-world text bodies would be composed of a balanced mix of Latin and CJK characters. I would consider this as an edge case for now. So I don't think we have to support both at the same time. I'd be fine with either counting Latin or CJK words separately.
- It will make the code review and maintenance easier. If we need to improve/patch the CJK implementation later, there's no risk of regressions for the Latin implementation and vice versa.
Could you move the current CJK implementation to a dedicated function/file and expose it via `index.js`?
Thanks again for your contribution and your patience on this 🙏
@ngryman Thanks for the review! I've addressed your suggestions. However, as both an end user and a contributor to Docusaurus, an SSG that uses this library, I have to contest the idea of moving CJK to a separate file and exposing a different function (maybe I didn't fully understand your intention here).
In the end, my personal preference, as a bilingual, is to stay on the safer side and handle the text as correctly as possible. I did once offer to make CJK opt-in (i.e., we provide one naïve implementation that only applies to Latin languages and one sophisticated implementation that works on both, and provide an option in the exposed …).

I understand that CJK seems to be dominating the logic now, but that's just because it's so sophisticated 🤪

P.S. Regressions can largely be prevented if we have good unit tests. Plus, I will always be available to help if we need more changes, so can we keep it this way for now? We can discuss the pros & cons more …
@Josh-Cena Great, ok, your points make sense to me 👍 I didn't see things that way. Alright, let's merge this. I'll double-check a couple of things by the end of next week, when things settle down a bit on my side, and will release a new version then. I think your 2 suggestions for a …
Background
The library cannot handle CJK text because CJK words are not separated by word boundaries. For example, "你好吗" should count as three words, since each character is a word, but the original algorithm does not recognize this.
#34 already did some preliminary work but has a series of issues. This PR also seems to be stale, so I implemented my own version, including some fixes and refactoring.
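As a rough sketch of the idea (this is not the PR's actual code; the Unicode ranges and helper name are my own), one could count each CJK ideograph, Hiragana character, or Hangul syllable as a word on its own, while counting runs of Latin letters, and, per the Katakana discussion in this thread, runs of Katakana, as one word per run:

```javascript
// Rough sketch, not the PR's actual implementation.
// Per-character words: CJK Unified Ideographs (U+4E00-U+9FFF),
// Hiragana (U+3040-U+309F), Hangul Syllables (U+AC00-U+D7A3).
// Per-run words: Latin letters/digits and Katakana (U+30A0-U+30FF).
function countWords(text) {
  const perChar = text.match(/[\u4E00-\u9FFF\u3040-\u309F\uAC00-\uD7A3]/g) || [];
  const perRun = text.match(/[a-zA-Z0-9\u30A0-\u30FF]+/g) || [];
  return perChar.length + perRun.length;
}

console.log(countWords('你好吗'));      // 3
console.log(countWords('アメリカ'));     // 1
console.log(countWords('Hello world')); // 2
```

A production implementation would need wider ranges (e.g. CJK extensions, Hangul Jamo), but the split between per-character and per-run counting is the core of the fix.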
Contribution
Known issues
Consider the following case: `Hello, world!`

This is two words, the first being `Hello,`, the second `world!`. No problem. But in Chinese, …
Whether the comma and the exclamation mark should be counted as words is ambiguous. To me they shouldn't be, as is the case for English. Nevertheless, because they can't be counted together with the CJK characters, this sentence is still 6 words long instead of 4.
Interestingly, if you use Chinese full-width punctuation characters like …, then the expected result is unambiguously 6. So the original case is still debatable.
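To make the two readings concrete, here is a small illustration (the sentence `你好,世界!` is my own hypothetical stand-in, mirroring the `Hello, world!` example above):

```javascript
// Hypothetical sentence: four CJK characters plus a half-width
// comma and exclamation mark.
const text = '你好,世界!';
const cjkChars = (text.match(/[\u4E00-\u9FFF]/g) || []).length;
const punct = (text.match(/[,.!?;:，。！？；：]/g) || []).length;

console.log(cjkChars);         // 4 (punctuation ignored)
console.log(cjkChars + punct); // 6 (punctuation counted as words)
```

The gap between the two totals is exactly the ambiguity described here: whether detached punctuation should contribute to the word count.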
Test case 16 (`should handle a CJK paragraph with Latin punctuation`) also demonstrates this problem; in order to make it pass, I had to change the expected value from `words: 13` to `words: 15` 😅