Mention does not work with some languages (Hebrew) #4642

Closed
oleq opened this issue Apr 4, 2019 · 3 comments · Fixed by ckeditor/ckeditor5-mention#71

Comments

@oleq
Member

oleq commented Apr 4, 2019

  1. Set a feed with some Hebrew text:
     feed: [ 'שנב', 'גקכ', 'Barney', 'Lily', 'Marshall', 'Robin', 'Ted' ]
  2. Switch the keyboard to Hebrew.
  3. Try to autocomplete the Hebrew items (they correspond to "abc" and "def" on an English keyboard).
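
For context, the feed from step 1 would sit inside the editor configuration roughly as sketched below. This is a minimal sketch, not the reporter's actual setup: the '@' marker is an assumption (it matches the patterns discussed later in this thread), `editorConfig` is a hypothetical name, and newer releases of the plugin may expect each feed item to start with the marker.

```javascript
// Hypothetical configuration object. In a real setup it would be passed to
// the editor's create() call with the Mention plugin loaded.
const editorConfig = {
	mention: {
		feeds: [
			{
				// Assumed marker; the issue does not state which one was used.
				marker: '@',
				// Feed copied from the reproduction steps above.
				feed: [ 'שנב', 'גקכ', 'Barney', 'Lily', 'Marshall', 'Robin', 'Ted' ]
			}
		]
	}
};
```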

[Screen recording: Kapture 2019-04-04 at 15 40 35]


Not sure if this is RTL-related or IME-related, but this should work anyway.

@oleq
Member Author

oleq commented Apr 4, 2019

The culprit is this regexp: https://github.com/ckeditor/ckeditor5-mention/blob/master/src/mentionui.js#L504, which only considers Western languages.

We need to come up with something smarter and more inclusive.

@jodator
Contributor

jodator commented Apr 16, 2019

So after some quick research: the native RegExp in JavaScript does not support Unicode category matching with the \pL notation, while the XRegExp library does:

const regExp = xRegExp( createPattern( marker, 0 ) );

const re = new RegExp( '(^| )(\\@)([\\pL0-9]*?)$' );
// console.log( re.toString(), ':' );
console.log( 'RegExp: ', ' @foo', re.test( ' @foo' ) );
console.log( 'RegExp: ', ' @שנב', re.test( ' @שנב' ) );

const xre = xRegExp( '(^| )(\\@)([\\pL0-9]*?)$' );
// console.log( xre.toString(), ':' );
console.log( 'xRegExp: ', ' @foo', xre.test( ' @foo' ) );
console.log( 'xRegExp: ', ' @שנב', xre.test( ' @שנב' ) );

logs:

(^| )(\@)([\pL0-9]*?)$
RegExp:   @foo false
RegExp:   @שנב false
xRegExp:   @foo true
xRegExp:   @שנב true
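
A short sketch of why the native results above are false, even for the plain ASCII input. Without the `u` flag, the native engine treats `\p` as an escaped literal `p`, so the class `[\pL0-9]` only accepts the characters `p`, `L` and the digits. The name `nativeRe` is introduced here for illustration only.

```javascript
// Same pattern string as in the snippet above, compiled by the native RegExp
// (no `u` flag), so `\p` is just an escaped literal "p".
const nativeRe = new RegExp( '(^| )(\\@)([\\pL0-9]*?)$' );

console.log( nativeRe.test( ' @foo' ) ); // false: "f" and "o" are not in [pL0-9]
console.log( nativeRe.test( ' @pL0' ) ); // true: only "p", "L" and digits match
console.log( nativeRe.test( ' @שנב' ) ); // false: Hebrew letters are not in the class
```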

The POC in action:
[Screen recording: Peek 2019-04-16 15-00]


So the above solution has an obvious plus: it allows using a simpler notation for regular expressions, as in other languages.

The drawbacks are as usual:

  • another dependency
  • bigger size of a build
  • I'm lost with licensing (but it has this "friendly" MIT license)

Technical details of the XRegExp library: AFAICS it augments the RegExp object, so if you do xre.toString() you'd get:

/(^| )(\@)([A-Za-zªµºÀ-ÖØ-öø-ˁˆ-ˑˠ-ˤˬˮͰ-ʹͶͷͺ-ͽͿΆΈ-ΊΌΎ-ΡΣ-ϵϷ-ҁҊ-ԯԱ-Ֆՙՠ-ֈא-תׯ-ײؠ-يٮٯٱ-ۓەۥۦۮۯۺ-ۼۿܐܒ-ܯݍ-ޥޱߊ-ߪߴߵߺࠀ-ࠕࠚࠤࠨࡀ-ࡘࡠ-ࡪࢠ-ࢴࢶ-ࢽऄ-हऽॐक़-ॡॱ-ঀঅ-ঌএঐও-নপ-রলশ-হঽৎড়ঢ়য়-ৡৰৱৼਅ-ਊਏਐਓ-ਨਪ-ਰਲਲ਼ਵਸ਼ਸਹਖ਼-ੜਫ਼ੲ-ੴઅ-ઍએ-ઑઓ-નપ-રલળવ-હઽૐૠૡૹଅ-ଌଏଐଓ-ନପ-ରଲଳଵ-ହଽଡ଼ଢ଼ୟ-ୡୱஃஅ-ஊஎ-ஐஒ-கஙசஜஞடணதந-பம-ஹௐఅ-ఌఎ-ఐఒ-నప-హఽౘ-ౚౠౡಀಅ-ಌಎ-ಐಒ-ನಪ-ಳವ-ಹಽೞೠೡೱೲഅ-ഌഎ-ഐഒ-ഺഽൎൔ-ൖൟ-ൡൺ-ൿඅ-ඖක-නඳ-රලව-ෆก-ะาำเ-ๆກຂຄງຈຊຍດ-ທນ-ຟມ-ຣລວສຫອ-ະາຳຽເ-ໄໆໜ-ໟༀཀ-ཇཉ-ཬྈ-ྌက-ဪဿၐ-ၕၚ-ၝၡၥၦၮ-ၰၵ-ႁႎႠ-ჅჇჍა-ჺჼ-ቈቊ-ቍቐ-ቖቘቚ-ቝበ-ኈኊ-ኍነ-ኰኲ-ኵኸ-ኾዀዂ-ዅወ-ዖዘ-ጐጒ-ጕጘ-ፚᎀ-ᎏᎠ-Ᏽᏸ-ᏽᐁ-ᙬᙯ-ᙿᚁ-ᚚᚠ-ᛪᛱ-ᛸᜀ-ᜌᜎ-ᜑᜠ-ᜱᝀ-ᝑᝠ-ᝬᝮ-ᝰក-ឳៗៜᠠ-ᡸᢀ-ᢄᢇ-ᢨᢪᢰ-ᣵᤀ-ᤞᥐ-ᥭᥰ-ᥴᦀ-ᦫᦰ-ᧉᨀ-ᨖᨠ-ᩔᪧᬅ-ᬳᭅ-ᭋᮃ-ᮠᮮᮯᮺ-ᯥᰀ-ᰣᱍ-ᱏᱚ-ᱽᲀ-ᲈᲐ-ᲺᲽ-Ჿᳩ-ᳬᳮ-ᳱᳵᳶᴀ-ᶿḀ-ἕἘ-Ἕἠ-ὅὈ-Ὅὐ-ὗὙὛὝὟ-ώᾀ-ᾴᾶ-ᾼιῂ-ῄῆ-ῌῐ-ΐῖ-Ίῠ-Ῥῲ-ῴῶ-ῼⁱⁿₐ-ₜℂℇℊ-ℓℕℙ-ℝℤΩℨK-ℭℯ-ℹℼ-ℿⅅ-ⅉⅎↃↄⰀ-Ⱞⰰ-ⱞⱠ-ⳤⳫ-ⳮⳲⳳⴀ-ⴥⴧⴭⴰ-ⵧⵯⶀ-ⶖⶠ-ⶦⶨ-ⶮⶰ-ⶶⶸ-ⶾⷀ-ⷆⷈ-ⷎⷐ-ⷖⷘ-ⷞⸯ々〆〱-〵〻〼ぁ-ゖゝ-ゟァ-ヺー-ヿㄅ-ㄯㄱ-ㆎㆠ-ㆺㇰ-ㇿ㐀-䶵一-鿯ꀀ-ꒌꓐ-ꓽꔀ-ꘌꘐ-ꘟꘪꘫꙀ-ꙮꙿ-ꚝꚠ-ꛥꜗ-ꜟꜢ-ꞈꞋ-ꞹꟷ-ꠁꠃ-ꠅꠇ-ꠊꠌ-ꠢꡀ-ꡳꢂ-ꢳꣲ-ꣷꣻꣽꣾꤊ-ꤥꤰ-ꥆꥠ-ꥼꦄ-ꦲꧏꧠ-ꧤꧦ-ꧯꧺ-ꧾꨀ-ꨨꩀ-ꩂꩄ-ꩋꩠ-ꩶꩺꩾ-ꪯꪱꪵꪶꪹ-ꪽꫀꫂꫛ-ꫝꫠ-ꫪꫲ-ꫴꬁ-ꬆꬉ-ꬎꬑ-ꬖꬠ-ꬦꬨ-ꬮꬰ-ꭚꭜ-ꭥꭰ-ꯢ가-힣ힰ-ퟆퟋ-ퟻ豈-舘並-龎ff-stﬓ-ﬗיִײַ-ﬨשׁ-זּטּ-לּמּנּסּףּפּצּ-ﮱﯓ-ﴽﵐ-ﶏﶒ-ﷇﷰ-ﷻﹰ-ﹴﹶ-ﻼA-Za-zヲ-하-ᅦᅧ-ᅬᅭ-ᅲᅳ-ᅵ0-9]*?)$/

which is pretty big.

As an alternative, we could create a similar regex but with explicit Unicode ranges:

const re2 = new RegExp( '(^| )(\\@)([\\u05D0-\\uFB4Fa-zA-Z0-9]*?)$' );
console.log( 'RegExp2: ', ' @foo', re2.test( ' @foo' ) );
console.log( 'RegExp2: ', ' @שנב', re2.test( ' @שנב' ) );

which also works but would require digging deeper into the specifics of each supported language (punctuation, etc.):

RegExp2:   @foo true
RegExp2:   @שנב true
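
To illustrate the caveat about per-language specifics, here is a small sketch probing the same range with a few other inputs. The sample names and the `rangeRe` identifier are made up for this example. The single range U+05D0 to U+FB4F starts above the Latin Extended and Cyrillic blocks, so those letters are rejected, while it spans symbol blocks such as currency signs, so some non-letters are accepted.

```javascript
// Same pattern as re2 above, under a different name to keep the example standalone.
const rangeRe = new RegExp( '(^| )(\\@)([\\u05D0-\\uFB4Fa-zA-Z0-9]*?)$' );

console.log( rangeRe.test( ' @שנב' ) );       // true: Hebrew letters start at U+05D0
console.log( rangeRe.test( ' @Łukasz' ) );    // false: "Ł" is U+0141, below the range
console.log( rangeRe.test( ' @Иван' ) );      // false: Cyrillic (U+0400 block) is below the range
console.log( rangeRe.test( ' @foo\u20AC' ) ); // true: the euro sign (U+20AC) falls inside the range
```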

@jodator
Contributor

jodator commented May 7, 2019

Note: We should use the ES2018 RegExp as @mlewand pointed out here.
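
A minimal sketch of what the ES2018 approach could look like, assuming it refers to Unicode property escapes (`\p{L}` with the `u` flag). This is not the code from the actual fix; `supportsUnicodePropertyEscapes` and `createMentionRegExp` are hypothetical names, and the Latin-only fallback is an assumption for engines that lack the feature. Note that `\@` from the earlier patterns is a syntax error under the `u` flag, so the marker is only escaped when it is a regex syntax character.

```javascript
// Feature detection: engines without ES2018 property escapes throw a
// SyntaxError when compiling `\p{L}` with the `u` flag.
function supportsUnicodePropertyEscapes() {
	try {
		return new RegExp( '^\\p{L}$', 'u' ).test( 'ש' );
	} catch ( error ) {
		return false;
	}
}

// Builds a pattern with the same shape as the ones discussed in this thread.
function createMentionRegExp( marker ) {
	// Escape only regex syntax characters; "@" is left as-is.
	const escapedMarker = marker.replace( /[.*+?^${}()|[\]\\]/g, '\\$&' );
	const unicode = supportsUnicodePropertyEscapes();
	// Assumed fallback: plain Latin letters when \p{L} is unavailable.
	const letters = unicode ? '\\p{L}' : 'a-zA-Z';

	return new RegExp( `(^| )(${ escapedMarker })([${ letters }0-9]*?)$`, unicode ? 'u' : '' );
}

const mentionRe = createMentionRegExp( '@' );

// Results below assume an engine with ES2018 support.
console.log( mentionRe.test( ' @foo' ) );       // true
console.log( mentionRe.test( ' @שנב' ) );       // true
console.log( mentionRe.test( ' @Иван' ) );      // true
console.log( mentionRe.test( ' @foo\u20AC' ) ); // false: the euro sign is not a letter
```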

oleq referenced this issue in ckeditor/ckeditor5-mention Jun 28, 2019
Fix: Mentions should work when different UTF character classes are used in the feed configuration. Closes #38.
mlewand transferred this issue from ckeditor/ckeditor5-mention Oct 9, 2019
mlewand added this to the iteration 25 milestone Oct 9, 2019
mlewand added the status:confirmed, type:bug (This issue reports a buggy (incorrect) behavior.) and package:mention labels Oct 9, 2019