Restore eszett ß - which is not accented. #44

michaelkirk · 2023-07-11T18:54:58Z

As requested at #12 (comment), ß should not be replaced.

Originally discussed in #12, this behavior regressed in #31.

Since someone requested #31 in the first place, it's likely that this change will surprise at least one person (/cc @gollenia), and probably others.

Is there a recommended alternative for the behavior that people like @gollenia want? Maybe something like https://github.com/anyascii/anyascii ? (disclaimer: I've never used it)

This reverts commit 4b147a0.

missinglink · 2023-07-18T18:30:40Z

I'm not sure how universally true this is, but if performing Unicode decomposition of a single character yields two code points, and one of those belongs to a combining diacritical block, the character is a clear candidate for diacritic removal.

In fact this method can be quite effective.

> 'Š'.normalize('NFKD').split('')
< ["S", "̌"] (2)

> 'Š'.normalize('NFKD').split('').map(c => c.charCodeAt(0).toString(16))
< ["53", "30c"] (2)

On the other hand, if a character cannot be decomposed into two code points then I'd argue it's not 'accented', although again, I'm not sure how universally true this is.

> 'ß'.normalize('NFKD').split('')
< ["ß"] (1)

So yeah, 👍 Eszett is not 'accented' IMO
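
The decomposition heuristic described above can be sketched roughly as follows. This is only an illustration, not what remove-accents does: `looksAccented` and `stripDiacritics` are hypothetical helper names, and the sketch assumes that "combining diacritical block" means the Combining Diacritical Marks range U+0300 to U+036F.

```javascript
// Assumption: only the Combining Diacritical Marks block (U+0300 to U+036F)
// counts as a diacritic here.
const COMBINING_MARK = /[\u0300-\u036f]/;

// Hypothetical helper: true when NFKD splits the character into several
// code points and at least one of them is a combining diacritic.
function looksAccented(char) {
  const parts = Array.from(char.normalize('NFKD'));
  return parts.length > 1 && parts.some((p) => COMBINING_MARK.test(p));
}

// Hypothetical helper: decompose, then drop the combining diacritics.
function stripDiacritics(input) {
  return input.normalize('NFKD').replace(/[\u0300-\u036f]/g, '');
}

console.log(looksAccented('Š'));        // true
console.log(looksAccented('ß'));        // false
console.log(stripDiacritics('Straße')); // Straße
```

Note that NFKD also applies compatibility mappings (for example, ligatures are expanded), so a real implementation might prefer NFD if only accents should change.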

tyxla

Thanks for the PR 👍

Seems like tests are currently failing - see my comment below.

tyxla · 2023-07-20T12:01:04Z

test.js

+
+// See https://github.com/tyxla/remove-accents/issues/12
+tape('ß is not accented', function(t) {
+  t.same(removeAccents.remove('Straße'), 'Straße');


Tape needs t.end() to end the suite - see prior existing tests.

I'd very much like to refactor from tape to jest or another modern runner, but that is work for another PR.

Oh geeze sorry. Apparently I didn't run the tests. I've fixed this (and run the tests this time).

I forgot to call `t.end()` 🤦 The now outdated "ß" -> "ss" was added to the "remove accents from string" test case as part of 1fe0b90. It seems like maybe the misunderstanding was that the string contained every to-be-sanitized character, but that's not true. Since ß now has its own unit test, I've removed it from the "remove accents from string" test.

tyxla · 2023-07-24T12:20:38Z

Thanks again, @michaelkirk 🙌

> Is there a recommended alternative for the behavior that people like @gollenia want? Maybe something like https://github.com/anyascii/anyascii ? (disclaimer: I've never used it)

I guess folks can manually .replace() the Eszett if they need to in their string. It's a special case, so that would be justified IMHO.
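
The manual `.replace()` workaround mentioned above might look like the sketch below. `replaceEszett` is a hypothetical helper name, not part of remove-accents, and mapping ß to "ss" (and capital ẞ to "SS") is an assumption about the transliteration people want.

```javascript
// Hypothetical helper, not part of remove-accents.
// Assumes "ss" / "SS" is the desired transliteration.
function replaceEszett(input) {
  return input
    .replace(/ß/g, 'ss')  // U+00DF LATIN SMALL LETTER SHARP S
    .replace(/ẞ/g, 'SS'); // U+1E9E LATIN CAPITAL LETTER SHARP S
}

console.log(replaceEszett('Straße')); // prints "Strasse"
```

The result could then be passed to `removeAccents.remove()` (the call used in test.js) to handle any remaining accented characters.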

michaelkirk added 2 commits July 11, 2023 11:44

Revert "Added german "sharp s" ß (tyxla#31)"

02c18a9

This reverts commit 4b147a0.

add regression test for ß

8c8597f

michaelkirk mentioned this pull request Jul 18, 2023

Eszett #12

Closed

missinglink mentioned this pull request Jul 18, 2023

experiment using unicode decomposition & regex char ranges #45

Draft

tyxla requested changes Jul 20, 2023

tyxla approved these changes Jul 24, 2023

tyxla merged commit 365d297 into tyxla:master Jul 24, 2023
