Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: add CJK support #35

Merged
merged 18 commits into from
Aug 26, 2021
Merged

feat: add CJK support #35

merged 18 commits into from
Aug 26, 2021

Conversation

Josh-Cena
Copy link
Collaborator

@Josh-Cena Josh-Cena commented May 21, 2021

Background

The library cannot handle CJK words because they are not separated by word bounds. For example, "你好吗" should be three words since each character is a word, but this is not recognized by the original algorithm.

#34 already did some preliminary work but has a series of issues. This PR also seems to be stale, so I implemented my own version, including some fixes and refactoring.

Contribution

  • Defined the range of CJK characters;
  • Refactored the word counting algorithm—more concise, more straightforward;
  • Added some CJK-related unit tests

Known issues

Consider the following case:

Hello, world!

This is two words, the first: Hello,, the second: world!. No problem.

But in Chinese,

你好, 世界!

Whether the comma and the exclamation mark should be counted as word is ambiguous. To me it shouldn't—as is the case for English. Nevertheless, because they can't be counted together with the CJK character, this sentence is still 6-word long instead of 4.

Interestingly, if you use the Chinese full-width punctuation character like

你好,世界!

Then the expected result is unambiguously 6. So the original case is still debatable.

Test case 16 should handle a CJK paragraph with Latin punctuation also demonstrates this problem, but in order to make it pass I had to change the expected value from words: 13 to words: 15 😅

@Josh-Cena
Copy link
Collaborator Author

@ngryman I know you are busy, but would you spend a while to review the changes please? This is a pretty significant fix.

test/reading-time.js Outdated Show resolved Hide resolved
test/reading-time.js Outdated Show resolved Hide resolved
@f0rb1d
Copy link

f0rb1d commented Jun 2, 2021

Hi, you did a great job in this PR! However, I would still like to point some of the problems:

Treating full-width punctuations as words

This is quite simple, so let's just treat them as punctuation and ignore them in the word count.

Chinese ready, but not Japanese

In my closed PR #34, I only did some preliminary tests for CJK words. In this PR, you only tested Simplified Chinese characters and did not test Japanese nor Korean words. Another issue for Japanese is whether we need to treat Katakana as one single word. 'アメリカ' (America) is one example, you don't treat it as 4 words, but treat it as 1.

So that's some issues I can come up with right now. I would help contribute to the code if I've got the time.

@Josh-Cena
Copy link
Collaborator Author

@f0rb1d Thanks for the review!
For full width punctuation, as I mentioned, this is where ambiguity comes in: in word processors like MS Word, you do see punctuation being treated as words. Still, I do agree that only counting words seems better.
For Korean and Japanese, sorry for accidentally forgetting them😂 You see, I don't speak either language, and I wasn't aware of the difference between Katakana or Hiragana. I'm going to fix that and try to come up with a few test cases with Google Translate, but I'd be looking forward to your help on creating some edge cases with these languages

@Josh-Cena
Copy link
Collaborator Author

Regarding Katakana though, I asked my Japanese-speaking friend and she said treating multiple characters as one word overcomplicates things since there can be "contractions" as well. If you do find this functionality necessary, can you provide a raw algorithm?

@f0rb1d
Copy link

f0rb1d commented Jun 4, 2021

@Josh-Cena MS Word did treat full-width punctuations as words, but they're no different from their half-width counterparts - they are the same, it's just that they're in different Unicode zones.

As for Katakana, unfortunately, I'm not a native Japanese speaker either. The algorithm you suggested might be hard to implement since there are some "contractions". But if it's just merely done by treating Katakana characters as one word, that should be easy to do.

@Josh-Cena
Copy link
Collaborator Author

Sorry for not replying recently... I've been busy with other stuff. I will come back to this later and pick up on the punctuation and Katakana issues.

@ngryman
Copy link
Owner

ngryman commented Aug 15, 2021

@Josh-Cena @f0rb1d Sorry for the super late answer and thanks for the contribution!

I admit that I have pretty limited knowledge here so I'll trust your judgment on this. As long as this feature doesn't bring any regression I think it's a great addition 👏 . Let me know when you both think it's ready to 🚀

Signed-off-by: Josh-Cena <sidachen2003@gmail.com>
@Josh-Cena
Copy link
Collaborator Author

Hey @ngryman I'm done! No worries since I've been quite busy for the last few months as well, but hope it gets merged soon because as you know this lib is used by a lot of SSGs🤪

Several notes:

  1. This algorithm works to the best of my knowledge, and I've added a bunch of test cases to ensure that's the case. Performance-wise this may be slower, but it's definitely more correct than before. If you are not happy with that I can try making CJK support opt-in, but I guess performance is not the bottleneck anyways?
  2. I've added a ton of inline comments explaining what I'm doing and calling for potential contributors to improve its rigor. Maybe you can put up some notice on the README as well, explaining that this package may not work the best for some languages / use cases?
  3. I've committed the package-lock.json file as well because this is best practice (to sync dependency versions across forks). If you don't like it feel free to remove it anyways, but you are advised not to.
  4. I've upgraded some dependencies (mocha and onchange) because there are security alerts. No breaking changes.

@f0rb1d I've handled Katakana as English letters and made punctuation not counted towards final word count. Does that look good to you?

Signed-off-by: Josh-Cena <sidachen2003@gmail.com>
Signed-off-by: Josh-Cena <sidachen2003@gmail.com>
Signed-off-by: Josh-Cena <sidachen2003@gmail.com>
@Josh-Cena
Copy link
Collaborator Author

Josh-Cena commented Aug 17, 2021

@ngryman In case @f0rb1d is unavailable to review this, let's just merge it for now since I've addressed their suggestions already. Also love to hear your opinions on my suggestions / questions

@ngryman
Copy link
Owner

ngryman commented Aug 19, 2021

Sure we can merge this without further approvals regarding the feature itself.

However, it seems that the PR introduces some regressions: https://travis-ci.org/github/ngryman/reading-time/builds/771866964. The CI status should show up here, but for some reason, it's not working 🤔

Could you take a look at these regressions?


My answers to your suggestions/comments are below 👇

This algorithm works to the best of my knowledge, and I've added a bunch of test cases to ensure that's the case. Performance-wise this may be slower, but it's definitely more correct than before. If you are not happy with that, I can try making CJK support opt-in, but I guess performance is not the bottleneck anyways?

Right, I wouldn't worry about performance for now. If it happens to be an issue, we can address this later.

I've added a ton of inline comments explaining what I'm doing and calling for potential contributors to improve its rigor. Maybe you can put up some notice on the README as well, explaining that this package may not work the best for some languages / use cases?

Sure, could you add this notice to the README?

I've committed the package-lock.json file as well because this is best practice (to sync dependency versions across forks). If you don't like it, feel free to remove it anyway, but you are advised not to.

When I created this package, there were no such things as lock files 🙀 . I guess it makes me a grandpa 👴
Good call. Thanks for adding it 👍

I've upgraded some dependencies (mocha and onchange) because there are security alerts. No breaking changes.

Thanks again! 👌

@Josh-Cena
Copy link
Collaborator Author

Josh-Cena commented Aug 19, 2021

Ah, wasn't expecting stream to fail, so I didn't run that test. GitHub now also has its own CI: https://github.com/features/actions which you might be interested in looking into. I can help you get started in another PR.

Update. Seems travis-ci.org stops working anyways. I'll add the github workflow here

Also, if you don't object some new syntaxes 🤪 We can probably migrate to ES6 classes and get rid of all the ReadingTimeStream.prototype stuff

Signed-off-by: Josh-Cena <sidachen2003@gmail.com>
@Josh-Cena
Copy link
Collaborator Author

Josh-Cena commented Aug 20, 2021

@ngryman Hey, I think I've found the reason. The readingTime now always treats the text as string and converts Buffer to string as well. This is required because Buffer would read the data as byte stream, but UTF-8 encoded data easily exceeds one byte width. In this case I dropped the use of hex codes as buffer word bounds, and always prefer using string characters.

Also there're some unknown encoding errors, which are just because of a malformed forEach call. (the callback of forEach has two parameters, and the second, which is the element's index, is implicitly passed to analyzer.bind, causing a weird encoding: 1 that's not recognized)

I've added more unit tests for stream to ensure that CJK works here as well.

Signed-off-by: Josh-Cena <sidachen2003@gmail.com>
Signed-off-by: Josh-Cena <sidachen2003@gmail.com>
Signed-off-by: Josh-Cena <sidachen2003@gmail.com>
Signed-off-by: Josh-Cena <sidachen2003@gmail.com>
Signed-off-by: Josh-Cena <sidachen2003@gmail.com>
@Josh-Cena
Copy link
Collaborator Author

Josh-Cena commented Aug 20, 2021

Hey @ngryman, I've shipped two more improvements.

  1. I used JSDoc to better annotate the types, since I guess using TypeScript would be too cumbersome for this small lib.
  2. I decided to export readingTimeStream through the main interface const {readingTimeStream} = require('reading-time'), since the import method in the doc actually doesn't work (it should be require('reading-time/lib/stream') instead of require('reading-time/stream'). Still, I've added a TypeScript declaration for both importing method, so the latter would still work, despite the former probably being better stylistically.

I tried to not introduce any breaking changes. If you are open to breaking changes (by releasing another major version 2.0.0), here are what I would suggest you consider:

  1. Convert readingTimeStream to an ES6 class. This would mean you lose the ability to instantiate without the new keywork, however. I will think it over to see what we can do better, but the current ReadingTimeStream.prototype and util.inherits(ReadingTimeStream, Transform) just don't look ideal to me
  2. Join the community in an effort to migrate all packages to ESM (ES Modules). This would mean converting all require() and module.exports usage to import and export. Optionally, we can release a reading-time.js.cjs file (as what is many libs doing) for those still needing CJS imports.

Signed-off-by: Josh-Cena <sidachen2003@gmail.com>
Signed-off-by: Josh-Cena <sidachen2003@gmail.com>
Signed-off-by: Josh-Cena <sidachen2003@gmail.com>
Signed-off-by: Josh-Cena <sidachen2003@gmail.com>
Copy link
Owner

@ngryman ngryman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the great contribution here 🙇🏻‍♂️

There are now multiple orthogonal changes involved so the PR is getting a bit more complex. These are all great changes, it just makes the review a bit more difficult for me. I'm currently moving to a new city, lots of things to handle right now, so I apologize for the review time.

That being said, I managed to pull off sometime today 🙌.
I'm generally 👍 for everything that you proposed here. I just left a couple of minor suggestions below.

One thing that I would like to do, however, is to move the CJK implementation into a dedicated function (ie. readingTimeCJK). I think that would make sense to have it in a dedicated file as well.

Two main reasons for that:

  1. It seems pretty unlikely that real-world text bodies would be composed of a balanced mix of Latin and CJK characters. I would consider this as an edge case for now. So I don't think we have to support both at the same time. I'd be fine with either counting Latin or CJK words separately.
  2. It will make the code review and maintenance easier. If we need to improve/patch the CJK implementation later, there's no risk of regressions for the Latin implementation and vice versa.

Could you move the current CJK implementation to a dedicated function/file and expose this via the index.js?
Thanks again for your contribution and your patience on this 🙏

README.md Outdated Show resolved Hide resolved
.eslintrc Show resolved Hide resolved
index.d.ts Outdated Show resolved Hide resolved
index.js Outdated Show resolved Hide resolved
@Josh-Cena
Copy link
Collaborator Author

Josh-Cena commented Aug 24, 2021

@ngryman Thanks for the review! I've addressed your suggestions.

However, as both an end user and a contributor to Docusaurus, an SSG that uses this library, I have to contest the idea of moving CJK to a separate file and exposing a different function (maybe I didn't fully understand your intention here).

  1. As this library is primarily used by SSGs like Vuepress and Docusaurus, there isn't an API exposed to the actual end user (those site creators) to select "I am writing Chinese" or "I am writing English" (at least I don't see such an API being necessary). Note that even when sites have i18n, we cannot use the site's locale to infer a specific article's language, because even a Chinese site may have one or two English articles.
  2. If we exposed a unified API readingTime and made it call two sets of logic internally, it becomes our burden to guess whether the text is CJK or Latin, which isn't easy either—should we poll the first ten characters? The first one hundred? What if the text starts with some English quote but is an entirely Chinese article? What would be a fair representation of the article's language?
  3. Even if the CJK logic is separated, the readingTimeCJK function itself won't be largely different from what the current readingTime function would look like, because most of the work done is stripping punctuation away, which can't be avoided with or without Latin letters.

In the end my personal preference, as a bilingual, is to stay on the safer side and handle the text as correctly as possible. I did offer to make CJK opt-in once (i.e., we provide one naïve implementation that only applies to Latin languages and one sophisticated implementation that works on both, and provide an option in the exposed readingTime API called handleCJK), but in the end that doesn't simplify the code base either (it just adds a new naiveReadingTime function).

I understand that CJK seems to be dominating the logic now, but that's just because it's so sophisticated 🤪

P.S. Regressions can largely be prevented if we have good unit tests. Plus I will always be available for help if we need more changes, so can we keep it this way for now? We can discuss more about the pros & cons

@ngryman
Copy link
Owner

ngryman commented Aug 26, 2021

@Josh-Cena Great, ok your points make sense to me 👍 I didn't see things that way.

Alright, let's merge this. I'll double-check a couple of things by the end of next week when things settle down a bit on my side and will release a new version then. I think your 2 suggestions for a 2.0.0 are great. This package definitely needs a new youth :)

@ngryman ngryman merged commit c8d3f43 into ngryman:master Aug 26, 2021
@Josh-Cena
Copy link
Collaborator Author

Friendly ping for @ngryman 👀 Looking forward to the next release a lot

Also #40 is there

@ngryman
Copy link
Owner

ngryman commented Sep 10, 2021

@Josh-Cena Released!

This pull request was closed.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants