Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Restore eszett ß - which is not accented. #44

Merged
merged 3 commits into from
Jul 24, 2023

Conversation

michaelkirk
Copy link
Contributor

@michaelkirk michaelkirk commented Jul 11, 2023

As requested at #12 (comment), ß should not be replaced.

Originally discussed in #12, this behavior regressed at #31

Since someone requested #31 in the first place, it's likely that this change will surprise at least one person (/cc @gollenia), and probably others.

Is there a recommended alternative for the behavior that people like @gollenia want? Maybe something like https://github.com/anyascii/anyascii ? (disclaimer: I've never used it)

@michaelkirk michaelkirk mentioned this pull request Jul 18, 2023
@missinglink
Copy link

missinglink commented Jul 18, 2023

I'm not sure how universally true this is, but... *if performing unicode decomposition of a single character yields two code points, and one of those belongs to a combining diacritical block, this is a clear candidate for diacritical removal.

In fact this method can be quite effective.

> 'Š'.normalize('NFKD').split('')
< ["S", "̌"] (2)

> 'Š'.normalize('NFKD').split('').map(c => c.charCodeAt(0).toString(16))
< ["53", "30c"] (2)

On the other hand, if a character cannot be decomposed into two code points then I'd argue it's not 'accented', although again, I'm not sure how universally true this is.

> 'ß'.normalize('NFKD').split('')
< ["ß"] (1)

So yeah, 👍 Eszett is not a 'accented' IMO

Copy link
Owner

@tyxla tyxla left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR 👍

Seems like tests are currently failing - see my comment below.


// See https://github.com/tyxla/remove-accents/issues/12
tape('ß is not accented', function(t) {
t.same(removeAccents.remove('Straße'), 'Straße');
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tape needs t.end() to end the suite - see prior existing tests.

I'd very much like to refactor from tape to jest or another modern runner, but that is work for another PR.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh geeze sorry. Apparently I didn't run the tests. I've fixed this (and run the tests this time).

I forgot to call `t.end()` 🤦

The now outdated "ß" -> "ss" was added to the "remove accents from string" test case as part of 1fe0b90

It seems like maybe the misunderstanding was that the string contained
every to-be-sanitized character, but that's not true. Since ß now has
it's own unit test, I've removed it from the "remove accents from
string" test.
@tyxla
Copy link
Owner

tyxla commented Jul 24, 2023

Thanks again, @michaelkirk 🙌

Is there a recommended alternative for the behavior that people like @gollenia want? Maybe something like https://github.com/anyascii/anyascii ? (disclaimer: I've never used it)

I guess folks can manually .replace() the Eszett if they need to in their string. It's a special case, so that would be justified IMHO.

@tyxla tyxla merged commit 365d297 into tyxla:master Jul 24, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants