This repository has been archived by the owner on May 10, 2023. It is now read-only.

Add sentence validator for Cantonese #605

Merged 7 commits on Feb 15, 2022
Changes from 4 commits
2 changes: 2 additions & 0 deletions server/lib/validation/index.js
@@ -12,6 +12,7 @@ const ru = require('./languages/ru');
const th = require('./languages/th');
const ur = require('./languages/ur');
const uz = require('./languages/uz');
const yue = require('./languages/yue');

const VALIDATORS = {
bas,
@@ -27,6 +28,7 @@ const VALIDATORS = {
th,
ur,
uz,
yue,
};

module.exports = {
34 changes: 34 additions & 0 deletions server/lib/validation/languages/yue.js
@@ -0,0 +1,34 @@
// Minimum number of characters that qualify as a sentence.
const MIN_LENGTH = 3;

// Maximum number of characters allowed per sentence to keep recordings at a manageable duration.
const MAX_LENGTH = 50;

const INVALIDATIONS = [{
fn: (sentence) => {
return sentence.length < MIN_LENGTH || sentence.length > MAX_LENGTH;
},
error: `Number of characters must be between ${MIN_LENGTH} and ${MAX_LENGTH} (inclusive)`,
}, {
regex: /[0-9]+/,
error: "Sentence should not contain numbers",
}, {
regex: /[<>+*#@%^[\]()\/]/,
error: "Sentence should not contain symbols",
}, {
// 7 or more repeating characters in a row is likely a non-formal spelling or difficult to read.
Member

Interesting, did that happen often in Sentence Collector? 7 repeating characters seems like a lot, but I have absolutely no language experience apart from Latin-based languages.

Contributor Author

This sometimes happens when people dump uncleaned sentences crawled directly from the web, for example sentences with long trailing dots such as 額........ Such sentences are most likely junk.
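
For reference, a minimal sketch of what the pattern in this rule accepts and rejects. The regex is copied from the diff; the helper name `hasLongRepeat` and the sample strings other than 額........ are assumptions made up for illustration.

```javascript
// Same pattern as in the diff: one character followed by six more copies of it,
// so a run of seven or more identical characters matches.
const REPEATED_CHARACTER = /(.)\1{6}/;

// Hypothetical helper, named here only for illustration.
function hasLongRepeat(sentence) {
  return REPEATED_CHARACTER.test(sentence);
}

console.log(hasLongRepeat('額........'));     // true: eight dots in a row
console.log(hasLongRepeat('好......'));       // false: only six dots in a row
console.log(hasLongRepeat('哈哈哈哈哈哈哈')); // true: seven identical characters
```

Since a run of exactly seven already matches, the threshold is "7 or more", as the code comment on the rule says.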

regex: /(.)\1{6}/,
error: "Sentence should not contain 7 or more of the same character in a row",
}, {
// Emoji range from https://www.regextester.com/106421 and
// https://stackoverflow.com/questions/10992921/how-to-remove-emoji-code-using-javascript
regex: /(\u00a9|\u00ae|[\u2000-\u3300]|[\u2580-\u27bf]|\ud83c[\ud000-\udfff]|\ud83d[\ud000-\udfff]|\ud83e[\ud000-\udfff]|[\ue000-\uf8ff])/,
error: "Sentence should not contain emojis or other special Unicode symbols",
}, {
regex: /[\u5427](\s|$)/,
error: 'Sentence should not end with Mandarin particles',
Member

Is this message correct? This would also reject \u5427 followed by a space. Is that also considered ending a sentence?

Contributor Author

Yes, sometimes a space indicates a pause or the end of a sentence. I have amended this rule in the latest commit.
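
To make the behaviour discussed here concrete, a small sketch of the rule as it appears in this revision of the diff (not the amended version, which is not shown). The sample sentences and the helper name `endsWithMandarinParticle` are assumptions for illustration only.

```javascript
// Same pattern as in the diff: U+5427 (吧) followed by whitespace or end of input.
const MANDARIN_PARTICLE = /[\u5427](\s|$)/;

// Hypothetical helper, named here only for illustration.
function endsWithMandarinParticle(sentence) {
  return MANDARIN_PARTICLE.test(sentence);
}

console.log(endsWithMandarinParticle('你去吧'));        // true: 吧 at the end of the sentence
console.log(endsWithMandarinParticle('你去吧 我唔去')); // true: 吧 followed by a space
console.log(endsWithMandarinParticle('酒吧好嘈'));      // false: 吧 in the middle of a word
console.log(endsWithMandarinParticle('你去吧?'));       // false: 吧 followed by punctuation
```

Note that with this pattern a trailing punctuation mark after 吧 prevents a match, since the character after 吧 is then neither whitespace nor the end of the string.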

}];

module.exports = {
INVALIDATIONS,
};
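
The code that consumes `INVALIDATIONS` is not part of this diff. As a rough sketch only, assuming each rule carries either an `fn` predicate or a `regex` plus an `error` message (as in yue.js above), a runner could look like the following. The function name `findFirstError` and the sample sentences are hypothetical, and only a subset of the rules is copied here to keep the example short.

```javascript
const MIN_LENGTH = 3;
const MAX_LENGTH = 50;

// Subset of the rules from yue.js, copied from the diff.
const INVALIDATIONS = [{
  fn: (sentence) => sentence.length < MIN_LENGTH || sentence.length > MAX_LENGTH,
  error: `Number of characters must be between ${MIN_LENGTH} and ${MAX_LENGTH} (inclusive)`,
}, {
  regex: /[0-9]+/,
  error: 'Sentence should not contain numbers',
}, {
  regex: /[<>+*#@%^[\]()\/]/,
  error: 'Sentence should not contain symbols',
}];

// Hypothetical runner (an assumption, not the project's actual code):
// returns the error of the first rule that rejects the sentence, or null if all rules pass.
function findFirstError(sentence, rules) {
  for (const rule of rules) {
    const rejected = typeof rule.fn === 'function'
      ? rule.fn(sentence)
      : rule.regex.test(sentence);
    if (rejected) return rule.error;
  }
  return null;
}

console.log(findFirstError('今日天氣好好', INVALIDATIONS)); // null
console.log(findFirstError('我有3個蘋果', INVALIDATIONS));  // Sentence should not contain numbers
console.log(findFirstError('你好', INVALIDATIONS));         // Number of characters must be between 3 and 50 (inclusive)
```

One detail worth keeping in mind: `sentence.length` counts UTF-16 code units, so a character outside the Basic Multilingual Plane counts as two toward `MIN_LENGTH` and `MAX_LENGTH`.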