diffWords now takes an optional intlSegmenter option
ryota-ka committed Oct 8, 2024
1 parent bdaf7ad commit b9b1798
Showing 3 changed files with 14 additions and 1 deletion.
3 changes: 3 additions & 0 deletions types/diff/diff-tests.ts
@@ -16,6 +16,9 @@ Diff.diffChars(one, other, {
Diff.diffChars(one, other, (value) => {
value; // $ExpectType Change[]
});
+ Diff.diffWords("吾輩は猫である。名前はまだ無い。", "吾輩は猫である。名前はたぬき。", {
+ intlSegmenter: new Intl.Segmenter("ja-JP", { granularity: "word" }),
+ });
// $ExpectType Change[]
Diff.diffLines(
"line\nold value\nline",
9 changes: 9 additions & 0 deletions types/diff/index.d.ts
@@ -32,6 +32,15 @@ export interface WordsOptions extends BaseOptions {
* `true` to ignore leading and trailing whitespace. This is the same as `diffWords()`.
*/
ignoreWhitespace?: boolean | undefined;

+ /**
+ * An optional [`Intl.Segmenter`](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Intl/Segmenter) object (which must have a `granularity` of `'word'`) for `diffWords` to use to split the text into words.
+ *
+ * By default, `diffWords` does not use an `Intl.Segmenter`, just some regexes for splitting text into words. This will tend to give worse results than `Intl.Segmenter` would, but ensures the results are consistent across environments; `Intl.Segmenter` behaviour is only loosely specced and the implementations in browsers could in principle change dramatically in future. If you want to use `diffWords` with an `Intl.Segmenter` but ensure it behaves the same whatever environment you run it in, use an `Intl.Segmenter` polyfill instead of the JavaScript engine's native `Intl.Segmenter` implementation.
+ *
+ * Using an `Intl.Segmenter` should allow better word-level diffing of non-English text than the default behaviour. For instance, `Intl.Segmenter`s can generally identify via built-in dictionaries which sequences of adjacent Chinese characters form words, allowing word-level diffing of Chinese. By specifying a language when instantiating the segmenter (e.g. `new Intl.Segmenter('sv', {granularity: 'word'})`) you can also support language-specific rules, like treating Swedish's colon separated contractions (like *k:a* for *kyrka*) as single words; by default this would be seen as two words separated by a colon.
+ */
+ intlSegmenter?: Intl.Segmenter | undefined;
}

export interface LinesOptions extends BaseOptions {
3 changes: 2 additions & 1 deletion types/diff/tsconfig.json
@@ -2,7 +2,8 @@
"compilerOptions": {
"module": "node16",
"lib": [
- "es6"
+ "es6",
+ "es2022.intl"
],
"noImplicitAny": true,
"noImplicitThis": true,
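
For context on what the new `intlSegmenter` option changes: the doc comment says `diffWords` will use the supplied segmenter, which must have word granularity, to split the text into words. The following is a minimal sketch of that splitting step only, not the library's actual tokenizer. `splitIntoWordSegments` is a hypothetical helper introduced for illustration, and the snippet assumes a runtime that ships `Intl.Segmenter` and a TypeScript configuration that includes the `es2022.intl` lib (the same lib this commit adds to `tsconfig.json`).

```typescript
// Sketch only: `splitIntoWordSegments` is a hypothetical helper, not part of `diff`.
// Assumes a runtime with Intl.Segmenter and the "es2022.intl" lib for typing.

function splitIntoWordSegments(text: string, segmenter: Intl.Segmenter): string[] {
    // The option's doc comment requires word granularity, so check it up front.
    if (segmenter.resolvedOptions().granularity !== "word") {
        throw new Error('The segmenter must have a granularity of "word"');
    }
    // Segments are contiguous slices of the input, so joining them restores the text.
    return Array.from(segmenter.segment(text), (part) => part.segment);
}

const segmenter = new Intl.Segmenter("ja-JP", { granularity: "word" });
const oldText = "吾輩は猫である。名前はまだ無い。";
const newText = "吾輩は猫である。名前はたぬき。";

// Exact boundaries depend on the engine's segmentation data, so no output is shown.
console.log(splitIntoWordSegments(oldText, segmenter));
console.log(splitIntoWordSegments(newText, segmenter));
```

Because the boundaries come from the JavaScript engine's own segmentation data, the printed arrays may differ between environments, which is the consistency trade-off the doc comment describes.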
