feat: add CJK support #35

Josh-Cena · 2021-05-21T04:24:15Z

Background

The library cannot handle CJK words because they are not separated by word bounds. For example, "你好吗" should be three words since each character is a word, but this is not recognized by the original algorithm.

#34 already did some preliminary work but has a series of issues. This PR also seems to be stale, so I implemented my own version, including some fixes and refactoring.

Contribution

Defined the range of CJK characters;
Refactored the word counting algorithm—more concise, more straightforward;
Added some CJK-related unit tests

Known issues

Consider the following case:

Hello, world!

This is two words, the first: Hello,, the second: world!. No problem.

But in Chinese,

你好, 世界!

Whether the comma and the exclamation mark should be counted as word is ambiguous. To me it shouldn't—as is the case for English. Nevertheless, because they can't be counted together with the CJK character, this sentence is still 6-word long instead of 4.

Interestingly, if you use the Chinese full-width punctuation character like

你好，世界！

Then the expected result is unambiguously 6. So the original case is still debatable.

Test case 16 should handle a CJK paragraph with Latin punctuation also demonstrates this problem, but in order to make it pass I had to change the expected value from words: 13 to words: 15 😅

Josh-Cena · 2021-05-29T01:47:25Z

@ngryman I know you are busy, but would you spend a while to review the changes please? This is a pretty significant fix.

test/reading-time.js

f0rb1d · 2021-06-02T20:34:12Z

Hi, you did a great job in this PR! However, I would still like to point some of the problems:

Treating full-width punctuations as words

This is quite simple, so let's just treat them as punctuation and ignore them in the word count.

Chinese ready, but not Japanese

In my closed PR #34, I only did some preliminary tests for CJK words. In this PR, you only tested Simplified Chinese characters and did not test Japanese nor Korean words. Another issue for Japanese is whether we need to treat Katakana as one single word. 'アメリカ' (America) is one example, you don't treat it as 4 words, but treat it as 1.

So that's some issues I can come up with right now. I would help contribute to the code if I've got the time.

Josh-Cena · 2021-06-02T23:30:22Z

@f0rb1d Thanks for the review!
For full width punctuation, as I mentioned, this is where ambiguity comes in: in word processors like MS Word, you do see punctuation being treated as words. Still, I do agree that only counting words seems better.
For Korean and Japanese, sorry for accidentally forgetting them😂 You see, I don't speak either language, and I wasn't aware of the difference between Katakana or Hiragana. I'm going to fix that and try to come up with a few test cases with Google Translate, but I'd be looking forward to your help on creating some edge cases with these languages

Josh-Cena · 2021-06-03T00:24:58Z

Regarding Katakana though, I asked my Japanese-speaking friend and she said treating multiple characters as one word overcomplicates things since there can be "contractions" as well. If you do find this functionality necessary, can you provide a raw algorithm?

f0rb1d · 2021-06-04T05:19:03Z

@Josh-Cena MS Word did treat full-width punctuations as words, but they're no different from their half-width counterparts - they are the same, it's just that they're in different Unicode zones.

As for Katakana, unfortunately, I'm not a native Japanese speaker either. The algorithm you suggested might be hard to implement since there are some "contractions". But if it's just merely done by treating Katakana characters as one word, that should be easy to do.

Josh-Cena · 2021-06-16T10:21:54Z

Sorry for not replying recently... I've been busy with other stuff. I will come back to this later and pick up on the punctuation and Katakana issues.

ngryman · 2021-08-15T09:54:35Z

@Josh-Cena @f0rb1d Sorry for the super late answer and thanks for the contribution!

I admit that I have pretty limited knowledge here so I'll trust your judgment on this. As long as this feature doesn't bring any regression I think it's a great addition 👏 . Let me know when you both think it's ready to 🚀

Signed-off-by: Josh-Cena <sidachen2003@gmail.com>

Josh-Cena · 2021-08-15T12:43:20Z

Hey @ngryman I'm done! No worries since I've been quite busy for the last few months as well, but hope it gets merged soon because as you know this lib is used by a lot of SSGs🤪

Several notes:

This algorithm works to the best of my knowledge, and I've added a bunch of test cases to ensure that's the case. Performance-wise this may be slower, but it's definitely more correct than before. If you are not happy with that I can try making CJK support opt-in, but I guess performance is not the bottleneck anyways?
I've added a ton of inline comments explaining what I'm doing and calling for potential contributors to improve its rigor. Maybe you can put up some notice on the README as well, explaining that this package may not work the best for some languages / use cases?
I've committed the package-lock.json file as well because this is best practice (to sync dependency versions across forks). If you don't like it feel free to remove it anyways, but you are advised not to.
I've upgraded some dependencies (mocha and onchange) because there are security alerts. No breaking changes.

@f0rb1d I've handled Katakana as English letters and made punctuation not counted towards final word count. Does that look good to you?

Signed-off-by: Josh-Cena <sidachen2003@gmail.com>

Josh-Cena · 2021-08-17T09:52:59Z

@ngryman In case @f0rb1d is unavailable to review this, let's just merge it for now since I've addressed their suggestions already. Also love to hear your opinions on my suggestions / questions

ngryman · 2021-08-19T14:09:32Z

Sure we can merge this without further approvals regarding the feature itself.

However, it seems that the PR introduces some regressions: https://travis-ci.org/github/ngryman/reading-time/builds/771866964. The CI status should show up here, but for some reason, it's not working 🤔

Could you take a look at these regressions?

My answers to your suggestions/comments are below 👇

This algorithm works to the best of my knowledge, and I've added a bunch of test cases to ensure that's the case. Performance-wise this may be slower, but it's definitely more correct than before. If you are not happy with that, I can try making CJK support opt-in, but I guess performance is not the bottleneck anyways?

Right, I wouldn't worry about performance for now. If it happens to be an issue, we can address this later.

I've added a ton of inline comments explaining what I'm doing and calling for potential contributors to improve its rigor. Maybe you can put up some notice on the README as well, explaining that this package may not work the best for some languages / use cases?

Sure, could you add this notice to the README?

I've committed the package-lock.json file as well because this is best practice (to sync dependency versions across forks). If you don't like it, feel free to remove it anyway, but you are advised not to.

When I created this package, there were no such things as lock files 🙀 . I guess it makes me a grandpa 👴
Good call. Thanks for adding it 👍

I've upgraded some dependencies (mocha and onchange) because there are security alerts. No breaking changes.

Thanks again! 👌

Josh-Cena · 2021-08-19T14:17:23Z

Ah, wasn't expecting stream to fail, so I didn't run that test. GitHub now also has its own CI: https://github.com/features/actions which you might be interested in looking into. I can help you get started in another PR.

Update. Seems travis-ci.org stops working anyways. I'll add the github workflow here

Also, if you don't object some new syntaxes 🤪 We can probably migrate to ES6 classes and get rid of all the ReadingTimeStream.prototype stuff

Signed-off-by: Josh-Cena <sidachen2003@gmail.com>

Josh-Cena · 2021-08-20T01:50:42Z

@ngryman Hey, I think I've found the reason. The readingTime now always treats the text as string and converts Buffer to string as well. This is required because Buffer would read the data as byte stream, but UTF-8 encoded data easily exceeds one byte width. In this case I dropped the use of hex codes as buffer word bounds, and always prefer using string characters.

Also there're some unknown encoding errors, which are just because of a malformed forEach call. (the callback of forEach has two parameters, and the second, which is the element's index, is implicitly passed to analyzer.bind, causing a weird encoding: 1 that's not recognized)

I've added more unit tests for stream to ensure that CJK works here as well.

Signed-off-by: Josh-Cena <sidachen2003@gmail.com>

Josh-Cena · 2021-08-20T13:21:14Z

Hey @ngryman, I've shipped two more improvements.

I used JSDoc to better annotate the types, since I guess using TypeScript would be too cumbersome for this small lib.
I decided to export readingTimeStream through the main interface const {readingTimeStream} = require('reading-time'), since the import method in the doc actually doesn't work (it should be require('reading-time/lib/stream') instead of require('reading-time/stream'). Still, I've added a TypeScript declaration for both importing method, so the latter would still work, despite the former probably being better stylistically.

I tried to not introduce any breaking changes. If you are open to breaking changes (by releasing another major version 2.0.0), here are what I would suggest you consider:

Convert readingTimeStream to an ES6 class. This would mean you lose the ability to instantiate without the new keywork, however. I will think it over to see what we can do better, but the current ReadingTimeStream.prototype and util.inherits(ReadingTimeStream, Transform) just don't look ideal to me
Join the community in an effort to migrate all packages to ESM (ES Modules). This would mean converting all require() and module.exports usage to import and export. Optionally, we can release a reading-time.js.cjs file (as what is many libs doing) for those still needing CJS imports.

Signed-off-by: Josh-Cena <sidachen2003@gmail.com>

ngryman

Thanks for the great contribution here 🙇🏻‍♂️

There are now multiple orthogonal changes involved so the PR is getting a bit more complex. These are all great changes, it just makes the review a bit more difficult for me. I'm currently moving to a new city, lots of things to handle right now, so I apologize for the review time.

That being said, I managed to pull off sometime today 🙌.
I'm generally 👍 for everything that you proposed here. I just left a couple of minor suggestions below.

One thing that I would like to do, however, is to move the CJK implementation into a dedicated function (ie. readingTimeCJK). I think that would make sense to have it in a dedicated file as well.

Two main reasons for that:

It seems pretty unlikely that real-world text bodies would be composed of a balanced mix of Latin and CJK characters. I would consider this as an edge case for now. So I don't think we have to support both at the same time. I'd be fine with either counting Latin or CJK words separately.
It will make the code review and maintenance easier. If we need to improve/patch the CJK implementation later, there's no risk of regressions for the Latin implementation and vice versa.

Could you move the current CJK implementation to a dedicated function/file and expose this via the index.js?
Thanks again for your contribution and your patience on this 🙏

README.md

.eslintrc

index.d.ts

index.js

Josh-Cena · 2021-08-24T23:55:40Z

@ngryman Thanks for the review! I've addressed your suggestions.

However, as both an end user and a contributor to Docusaurus, an SSG that uses this library, I have to contest the idea of moving CJK to a separate file and exposing a different function (maybe I didn't fully understand your intention here).

As this library is primarily used by SSGs like Vuepress and Docusaurus, there isn't an API exposed to the actual end user (those site creators) to select "I am writing Chinese" or "I am writing English" (at least I don't see such an API being necessary). Note that even when sites have i18n, we cannot use the site's locale to infer a specific article's language, because even a Chinese site may have one or two English articles.
If we exposed a unified API readingTime and made it call two sets of logic internally, it becomes our burden to guess whether the text is CJK or Latin, which isn't easy either—should we poll the first ten characters? The first one hundred? What if the text starts with some English quote but is an entirely Chinese article? What would be a fair representation of the article's language?
Even if the CJK logic is separated, the readingTimeCJK function itself won't be largely different from what the current readingTime function would look like, because most of the work done is stripping punctuation away, which can't be avoided with or without Latin letters.

In the end my personal preference, as a bilingual, is to stay on the safer side and handle the text as correctly as possible. I did offer to make CJK opt-in once (i.e., we provide one naïve implementation that only applies to Latin languages and one sophisticated implementation that works on both, and provide an option in the exposed readingTime API called handleCJK), but in the end that doesn't simplify the code base either (it just adds a new naiveReadingTime function).

I understand that CJK seems to be dominating the logic now, but that's just because it's so sophisticated 🤪

P.S. Regressions can largely be prevented if we have good unit tests. Plus I will always be available for help if we need more changes, so can we keep it this way for now? We can discuss more about the pros & cons

ngryman · 2021-08-26T15:46:02Z

@Josh-Cena Great, ok your points make sense to me 👍 I didn't see things that way.

Alright, let's merge this. I'll double-check a couple of things by the end of next week when things settle down a bit on my side and will release a new version then. I think your 2 suggestions for a 2.0.0 are great. This package definitely needs a new youth :)

Josh-Cena · 2021-09-10T07:53:12Z

Friendly ping for @ngryman 👀 Looking forward to the next release a lot

Also #40 is there

ngryman · 2021-09-10T10:00:02Z

@Josh-Cena Released!

feat: add CJK support

29dd15b

f0rb1d mentioned this pull request Jun 2, 2021

feature: add preliminary support for CJK languages #34

Closed

f0rb1d reviewed Jun 2, 2021

View reviewed changes

test/reading-time.js Outdated Show resolved Hide resolved

f0rb1d reviewed Jun 2, 2021

View reviewed changes

test/reading-time.js Outdated Show resolved Hide resolved

Finalize CJK support

5963e6d

Signed-off-by: Josh-Cena <sidachen2003@gmail.com>

Josh-Cena added 3 commits August 15, 2021 21:16

Bug fix

2f6ac34

Signed-off-by: Josh-Cena <sidachen2003@gmail.com>

Fix typo

a55a673

Signed-off-by: Josh-Cena <sidachen2003@gmail.com>

Better Japanese

61c9e22

Signed-off-by: Josh-Cena <sidachen2003@gmail.com>

Update README

4f15fc2

Signed-off-by: Josh-Cena <sidachen2003@gmail.com>

Josh-Cena added 5 commits August 20, 2021 09:52

Try fixing stream

4ddf6f0

Signed-off-by: Josh-Cena <sidachen2003@gmail.com>

Try adding workflow

a5bf0fc

Signed-off-by: Josh-Cena <sidachen2003@gmail.com>

Final fix

acf5623

Signed-off-by: Josh-Cena <sidachen2003@gmail.com>

More test cases

76f79b0

Signed-off-by: Josh-Cena <sidachen2003@gmail.com>

Update workflow

1a9adf6

Signed-off-by: Josh-Cena <sidachen2003@gmail.com>

Josh-Cena added 3 commits August 20, 2021 21:31

Significantly improve code style

7491dee

Signed-off-by: Josh-Cena <sidachen2003@gmail.com>

Merge branch 'master' into master

732e62f

Remove unnecessary check

42a056b

Signed-off-by: Josh-Cena <sidachen2003@gmail.com>

Josh-Cena added 3 commits August 20, 2021 21:34

Merge branch 'master' of github.com:Josh-Cena/reading-time

cfd5caf

Explicit toString

6bd3600

Signed-off-by: Josh-Cena <sidachen2003@gmail.com>

Fix export

10691c8

Signed-off-by: Josh-Cena <sidachen2003@gmail.com>

ngryman reviewed Aug 24, 2021

View reviewed changes

README.md Outdated Show resolved Hide resolved

.eslintrc Show resolved Hide resolved

index.d.ts Outdated Show resolved Hide resolved

index.js Outdated Show resolved Hide resolved

Address suggestions

7fcc10f

ngryman merged commit c8d3f43 into ngryman:master Aug 26, 2021

ngryman mentioned this pull request Aug 26, 2021

require('reading-time/stream') is not found #28

Closed

Josh-Cena mentioned this pull request Aug 27, 2021

fix: fix a few TS errors #37

Merged

Josh-Cena mentioned this pull request Oct 22, 2021

Help us complete the word counting algorithm! #46

Open

cohu-dev mentioned this pull request Apr 25, 2022

Suggestion: Japanese reading-time #52

Closed

This pull request was closed.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: add CJK support #35

feat: add CJK support #35

Josh-Cena commented May 21, 2021 •

edited

Loading

Josh-Cena commented May 29, 2021

f0rb1d commented Jun 2, 2021

Josh-Cena commented Jun 2, 2021

Josh-Cena commented Jun 3, 2021

f0rb1d commented Jun 4, 2021

Josh-Cena commented Jun 16, 2021

ngryman commented Aug 15, 2021

Josh-Cena commented Aug 15, 2021

Josh-Cena commented Aug 17, 2021 •

edited

Loading

ngryman commented Aug 19, 2021

Josh-Cena commented Aug 19, 2021 •

edited

Loading

Josh-Cena commented Aug 20, 2021 •

edited

Loading

Josh-Cena commented Aug 20, 2021 •

edited

Loading

ngryman left a comment •

edited

Loading

Josh-Cena commented Aug 24, 2021 •

edited

Loading

ngryman commented Aug 26, 2021

Josh-Cena commented Sep 10, 2021

ngryman commented Sep 10, 2021

feat: add CJK support #35

feat: add CJK support #35

Conversation

Josh-Cena commented May 21, 2021 • edited Loading

Background

Contribution

Known issues

Josh-Cena commented May 29, 2021

f0rb1d commented Jun 2, 2021

Treating full-width punctuations as words

Chinese ready, but not Japanese

Josh-Cena commented Jun 2, 2021

Josh-Cena commented Jun 3, 2021

f0rb1d commented Jun 4, 2021

Josh-Cena commented Jun 16, 2021

ngryman commented Aug 15, 2021

Josh-Cena commented Aug 15, 2021

Josh-Cena commented Aug 17, 2021 • edited Loading

ngryman commented Aug 19, 2021

Josh-Cena commented Aug 19, 2021 • edited Loading

Josh-Cena commented Aug 20, 2021 • edited Loading

Josh-Cena commented Aug 20, 2021 • edited Loading

ngryman left a comment • edited Loading

Choose a reason for hiding this comment

Josh-Cena commented Aug 24, 2021 • edited Loading

ngryman commented Aug 26, 2021

Josh-Cena commented Sep 10, 2021

ngryman commented Sep 10, 2021

Josh-Cena commented May 21, 2021 •

edited

Loading

Josh-Cena commented Aug 17, 2021 •

edited

Loading

Josh-Cena commented Aug 19, 2021 •

edited

Loading

Josh-Cena commented Aug 20, 2021 •

edited

Loading

Josh-Cena commented Aug 20, 2021 •

edited

Loading

ngryman left a comment •

edited

Loading

Josh-Cena commented Aug 24, 2021 •

edited

Loading