
Matching lone surrogates *only* #28

Closed
mathiasbynens opened this issue Jan 24, 2015 · 3 comments


@mathiasbynens

See mathiasbynens/regexpu#16 and https://gist.github.com/mathiasbynens/bbe7f870208abcfec860.

var regenerate = require('regenerate');

var set = regenerate()
  .addRange(0xD800, 0xDBFF) // lone high surrogates
  .addRange(0xDC00, 0xDFFF); // lone low surrogates

var match = '𝌆'.match(RegExp('(' + set.toString() + ')'));
console.log(match == null);
// expected: true
// actual: false, since the surrogate halves are matched

Instead, it would make more sense to match lone surrogates only in such cases.
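
The report above can be reproduced without regenerate. This is a minimal sketch that assumes the generated pattern was equivalent to a plain class over the surrogate code units; the variable names are illustrative and not part of regenerate. Without the `u` flag, a JavaScript regular expression works on UTF-16 code units, so each half of a pair is a candidate match on its own.

```javascript
// Assumption: the generated pattern behaves like this plain code-unit class.
var naive = /[\uD800-\uDFFF]/;
var astral = '\uD834\uDF06'; // U+1D306, the same string as '𝌆'

// The class matches the first half of the pair, so the result is not null.
var m = naive.exec(astral);
var matchedHalf = m !== null && m[0] === '\uD834' && m.index === 0;

// A negative lookahead is enough to protect the high half of a pair...
var loneHighOnly = /[\uD800-\uDBFF](?![\uDC00-\uDFFF])/;
var highHalfMatched = loneHighOnly.test(astral);

// ...but the low half still matches unless the preceding code unit is checked too.
var lowHalfMatched = /[\uDC00-\uDFFF]/.test(astral);
```

The low half is the hard part: deciding whether it is lone requires looking at the code unit before it.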

@mathiasbynens

This doesn’t fully fix the issue. Doing so is hard, since JS doesn’t support lookbehind.

var set = regenerate().addRange(0xD800, 0xDBFF).addRange(0xDC00, 0xDFFF);

var regex = RegExp('^a(?:' + set.toString() + ')b$');
// currently, this results in `/^a(?:[\uD800-\uDBFF](?![\uDC00-\uDFFF])|(?:[^\uD800-\uDBFF]|^)[\uDC00-\uDFFF])b$/`
console.log(regex.test('a\uD834b')); // expected: true; actual: true
console.log(regex.test('a\uDC00b')); // expected: true; actual: false
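
The failing second test can be traced by writing out the pattern from the comment above literally. This is a sketch, not regenerate output: the variable names are illustrative, and the lookbehind variant assumes an engine with ES2018 lookbehind support, which did not exist when this issue was filed.

```javascript
// Pattern copied from the comment above.
var current = /^a(?:[\uD800-\uDBFF](?![\uDC00-\uDFFF])|(?:[^\uD800-\uDBFF]|^)[\uDC00-\uDFFF])b$/;

// The low-surrogate branch must consume the code unit before the surrogate
// (or sit at the start of input). In 'a\uDC00b' that code unit is the `a`,
// which the literal `a` in the pattern has already consumed.
var loneHigh = current.test('a\uD834b');   // true
var loneLow = current.test('a\uDC00b');    // false
// With one extra non-surrogate code unit in front, the branch can match.
var paddedLow = current.test('ax\uDC00b'); // true

// Hypothetical alternative, assuming ES2018 lookbehind: check the preceding
// code unit without consuming it.
var withLookbehind = /^a(?:[\uD800-\uDBFF](?![\uDC00-\uDFFF])|(?<![\uD800-\uDBFF])[\uDC00-\uDFFF])b$/;
var lbLoneHigh = withLookbehind.test('a\uD834b');    // true
var lbLoneLow = withLookbehind.test('a\uDC00b');     // true
var lbPair = withLookbehind.test('a\uD834\uDF06b');  // false
```

Because the emulated check consumes a character, it only works when the pattern around it leaves that character unconsumed.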

@mathiasbynens

Given the above code:

var set = regenerate().addRange(0xD800, 0xDBFF).addRange(0xDC00, 0xDFFF);

var regex = RegExp('^a(?:' + set.toString() + ')b$');
// currently, this results in `/^a(?:[\uD800-\uDBFF](?![\uDC00-\uDFFF])|(?:[^\uD800-\uDBFF]|^)[\uDC00-\uDFFF])b$/`

Consider these three tests:

console.log(regex.test('a\uD834b')); // expected: true
console.log(regex.test('a\uDC00b')); // expected: true
console.log(regex.test('a\uD834\uDF06b')); // expected: false

There are two options:

a. We pass tests 1 and 3 but fail test 2 (i.e. lone low surrogates aren’t matched accurately). This is the current implementation in v1.2.1.
b. We pass tests 1 and 2 but fail test 3 (i.e. surrogate halves in pairs are matched as if they were lone surrogates).

Which is the lesser evil — a or b?

@mathiasbynens

As Marja said:

Paired surrogates are the normal case, so I think cases involving only them should have priority.

Let’s go with a, i.e. the current implementation.
