Discussion: Non-Ascii test cases. #428

Insti · 2016-10-31T09:58:15Z

While updating the anagram test cases (issue: #413), the question of handling non-ASCII characters came up, and we decided that we would NOT use non-ASCII characters in those tests.

@NobbZ made a good point:

I do not think that we should test for anything that is not in ASCII.

Most languages do not cope very well with code points beyond ASCII.

There are letters beyond ASCII that would need another normalization step. E.g. the German "es-zett" (ß), which is only allowed as lowercase (a capital version exists and may be used on title pages and in headlines); on capitalisation it is usually turned into "SS". There has also been the long s, and similar characters in ancient Greek (gamma was one of them AFAIK).

Isogram (as of 20161031) also has non-ascii test cases.

Are there other problems that have non-ascii test cases?

I've created this issue so we can discuss the general policy of whether non-ASCII characters should be used in test cases, and have a thread to point to when it comes up again in the future.

Proposal:
All test cases should only use ASCII characters
(Unless extended character handling is integral to the problem.)


rbasso · 2016-10-31T12:01:13Z

Aye!

I believe that the same logic can be applied to a broader class of questions:

Q: Should we test for ...?
A: No! Unless that is a fundamental part of the problem.

Anytime we test for something that is not fundamental, we reduce diversity in the solutions and it gets boring to review code. 😁

ErikSchierboom · 2016-11-01T08:48:24Z

Agreed. I think having the non-ASCII test in Isogram does not really add anything. In fact, I think most people will be confused by it. If someone could come up with another exercise in which non-ASCII character handling does make sense, I'm all for it. But for the current exercises, I think it should be removed.

NobbZ · 2016-11-01T09:45:48Z

I do like to have some non-US-ASCII test cases as an optional part for a very small subset of exercises. They can teach you a lot about Unicode handling if and only if you are open to it and want to learn it.

Anagram is not one of them, since it requires you to normalize characters and some of them can't be normalized. In the example of the German ß vs. SS, is MASSE an anagram of Maße? And what about the other way round?
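The asymmetry in the ß example can be seen in a minimal Python sketch. Python is used here only for illustration, and the helper name `same_letters` is hypothetical, not part of any exercise. It relies on Python 3's built-in case mappings, where uppercasing ß yields "SS" but lowercasing "SS" does not bring ß back.

```python
def same_letters(a: str, b: str) -> bool:
    """Compare two words as multisets of characters after case folding."""
    return sorted(a.casefold()) == sorted(b.casefold())


# Uppercasing expands the single letter ß into two letters.
print("Maße".upper())   # MASSE
# Lowercasing does not reverse that expansion.
print("MASSE".lower())  # masse
# Case folding maps ß to "ss", so this comparison treats the words as equal.
print(same_letters("Maße", "MASSE"))  # True
```

Whether that last result is the "right" answer for an anagram exercise is exactly the open question: the two words do not even have the same number of characters before folding.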

Word-Count, on the other hand, does gain a lot from adding Unicode sugar to separate words from each other. The normalisation problem persists, but its influence is not nearly as strong.
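As a rough illustration of the Word-Count point, here is a small Python sketch. The function name `count_words` is an assumption made for this example, and it treats a "word" as a run of word characters, which may differ from what a given track's exercise specifies.

```python
import re
from collections import Counter


def count_words(text: str) -> Counter:
    # For str patterns in Python 3, the \w class is Unicode-aware, so letters
    # such as "Ä" and "é" count as word characters. It also matches digits
    # and underscores.
    words = re.findall(r"\w+", text)
    # casefold() is a simple case-insensitive key; it performs no
    # normalization of composed vs. decomposed characters.
    return Counter(word.casefold() for word in words)


counts = count_words("Ärger, ärger; café!")
print(counts["ärger"], counts["café"])
```

The normalisation caveat remains: a decomposed "e" followed by a combining accent would not be grouped with the precomposed "é" by this sketch.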

behrtam · 2016-11-03T10:21:37Z

I would agree that non-ASCII test cases don't make sense in every exercise, especially if they add unnecessary complexity, for example if normalization is needed.
But we should add Unicode characters wherever it makes sense. It is 2016, emojis have conquered the world and not everyone can stay in the US bubble. People need to learn how to deal with Unicode at some point in the tracks.

Insti · 2016-11-03T12:34:43Z

It is 2016, emojis have conquered the world and not everyone can stay in the US bubble. People need to learn how to deal with unicode at some point in the tracks.

I agree, this is why there need to be specific exercises that deal with multi-language text handling, and all the gotchas and edge cases involved in that. But that shouldn't be required in an exercise that is about the algorithm for detecting anagrams.

petertseng · 2016-11-04T07:58:53Z

Looks like we're going sort of Unix philosophy with many exercises - do one thing and do it well. This is why we are cutting Unicode from a few text exercises. I welcome the move!

For when the new Unicode exercise(s) solidify, I have a list of things I've seen over the months that could stand to go in it:

Consider creating new Unicode exercise go#200 (because of ledger: handling unicode go#195)
Potential new test in anagram: Same bytes different characters. #318
Add test case to Hamming that shows the difference between len() and chars().count() rust#128
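The last item refers to Rust, where `len()` on a string counts bytes and `chars().count()` counts characters. The same distinction can be sketched in Python by comparing a string with its UTF-8 encoding. The `hamming` helper below is a hypothetical illustration, not the canonical solution of any track.

```python
def hamming(a, b) -> int:
    """Hamming distance over any two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))


left = "na\u00efve"  # "naïve", written with an escape for the precomposed ï
right = "naive"

print(len(left))                  # 5 code points
print(len(left.encode("utf-8")))  # 6 bytes, since ï takes two bytes in UTF-8
print(hamming(left, right))       # 1, comparing code points
```

Comparing the UTF-8 bytes of the same two words instead would fail the length check, which is the kind of surprise such a test case is meant to expose.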

petertseng · 2016-11-04T08:01:41Z

Are there other problems that have non-ascii test cases?

We can try `git grep -Pnl "[\x80-\xFF]"` to find them, ~~though I admit this can have false positives~~. But at least it shouldn't have false negatives... right?

Edit: Actually, it may not have false positives either...

For me, this has found:

exercises/atbash-cipher/canonical-data.json
exercises/bob/canonical-data.json
exercises/forth/canonical-data.json
exercises/isogram/canonical-data.json
exercises/pangram/canonical-data.json
exercises/run-length-encoding/canonical-data.json
exercises/scrabble-score/canonical-data.json
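For anyone without `git grep -P` at hand, the same idea (look for any byte at or above 0x80) can be sketched in Python. The function name `non_ascii_lines` is made up for this example, and it assumes the file content has already been read as bytes.

```python
def non_ascii_lines(data: bytes) -> list[int]:
    """Return the 1-based numbers of lines containing any byte >= 0x80."""
    return [
        number
        for number, line in enumerate(data.splitlines(), start=1)
        # Iterating over bytes yields integers, so this is a plain range check.
        if any(byte >= 0x80 for byte in line)
    ]


sample = b'"plain ascii"\n"stra\xc3\x9fe"\n"more ascii"\n'
print(non_ascii_lines(sample))  # [2]
```

Because every byte of a multi-byte UTF-8 sequence is at or above 0x80, a byte-level check like this should not miss non-ASCII text in UTF-8 encoded files.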

In exercism#428 the decision was made to remove non-ascii test cases from exercises that are not explicitly about extended character set handling. This PR removes the non-ascii test cases from the `canonical_data.json` for this exercise.

ldwoolley · 2016-11-07T07:13:42Z

I have been working in power generation for the last few years, and the frequency with which Unicode (non-ASCII) characters break existing code increases every year. Data collection spreads across the globe and into countries that don't use alphabets contained in ASCII. Many common shortcuts to manage punctuation are bad practice outside of case-sensitive alphabets. I understand KISS, but I can't think of a better place than here in these learning exercises to help people move to thinking in Unicode, and away from using ASCII crutches. Another example of these crutches is the assumption that every language has upper and lower case; not all do (bicameral vs. unicameral alphabets; think Persian, Arabic, Hebrew). I think the test cases should challenge one to handle case, but also handle a situation where the alphabet does not have a case. Anagrams and isograms exist in these languages as well.

NobbZ · 2016-11-07T08:58:43Z

@ldwoolley what you say here is obviously true. That's exactly why we said that we need extra exercises that teach Unicode.

BUT! We need to slow that down a bit. In most languages everything not-US-ASCII is a PITA. Most languages even require you to use external libraries.

So removing non-US-ASCII achieves multiple goals:

Reduce complexity of exercises
Make exercises more focused
Remove dependencies from exercises
Make it easier to have uniform test-data across different tracks.

behrtam · 2016-11-07T09:44:35Z

But would it hurt to have the non-ASCII test cases as part of the test suites, only deactivated/skipped with a comment like "if you are not new to programming and/or care about Unicode, it might be interesting to think about ..."?
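A skipped optional test of this kind could look roughly like the following Python `unittest` sketch. The test class, the test names and the `is_isogram` function are assumptions made for illustration; each track has its own test framework and conventions for skipping.

```python
import unittest


def is_isogram(word: str) -> bool:
    """Hypothetical solution under test: no letter may repeat."""
    letters = [ch for ch in word.casefold() if ch.isalpha()]
    return len(letters) == len(set(letters))


class IsogramTest(unittest.TestCase):
    def test_ascii_word(self):
        self.assertTrue(is_isogram("lumberjack"))

    # Skipped by default; delete the decorator to opt in.
    @unittest.skip("optional: try this if you care about Unicode input")
    def test_non_ascii_word(self):
        self.assertTrue(is_isogram("\u00fcber"))  # "über"
```

Running the file with `python -m unittest` reports the second test as skipped, so it costs nothing for those who ignore it.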

Insti · 2016-11-28T21:43:05Z

Conclusion:
All test cases should only use ASCII characters
(Unless extended character handling is integral to the problem.)

We should add exercises that explicitly deal with multi-language characters: See #455

robkeim · 2017-01-31T23:50:59Z

@Insti this discussion should now be closed, right?

The canonical-data.json files have also now been updated via #441.

exercism/problem-specifications#529 exercism/problem-specifications#441 exercism/problem-specifications#428

This removes the unicode test cases ([x-common/428](exercism/problem-specifications#428), [x-common/434](exercism/problem-specifications#434)) and adds the new white space and lowercase tests ([x-common/624](exercism/problem-specifications#624)).


Insti added question discussion labels Oct 31, 2016

petertseng mentioned this issue Nov 4, 2016

Consider creating new Unicode exercise exercism/go#200

Closed

This was referenced Nov 4, 2016

isogram: Revise canonical test data #433

Merged

run-length-encoding: Remove non-ASCII test cases #434

Merged

This was referenced Nov 5, 2016

Label for policy decisions "hidden" in issues? exercism/discussions#96

Closed

pangram: missing case-sensitivity edge case #266

Closed

This was referenced Nov 6, 2016

pangram: Remove non-ascii test cases. #440

Merged

Update or remove non-ASCII test cases. #441

Closed

Insti added the policy label Nov 28, 2016

Insti mentioned this issue Nov 28, 2016

Suggest specific exercises that deal with multi-language text handling. #455

Open

petertseng mentioned this issue Nov 29, 2016

bob: remove Unicode characters exercism/haskell#446

Merged

petertseng mentioned this issue Jan 14, 2017

No multibyte rune in hamming problem exercism/go#441

Closed

rbasso pushed a commit to exercism/haskell that referenced this issue Feb 1, 2017

forth: remove non-ASCII cases

e6c0172

exercism/problem-specifications#529 exercism/problem-specifications#441 exercism/problem-specifications#428

rbasso pushed a commit to exercism/haskell that referenced this issue Feb 1, 2017

atbash-cipher: remove non-ASCII cases

2ae21ea

exercism/problem-specifications#529 exercism/problem-specifications#441 exercism/problem-specifications#428

rbasso pushed a commit to exercism/haskell that referenced this issue Feb 1, 2017

scrabble-score: remove non-ASCII cases

785a3e5

exercism/problem-specifications#529 exercism/problem-specifications#441 exercism/problem-specifications#428

petertseng mentioned this issue Feb 4, 2017

added rotational-cipher exercise #534

Merged

NobbZ mentioned this issue Feb 20, 2017

[isogram] remove unicode testcase exercism/elixir#297

Closed

petertseng mentioned this issue Mar 4, 2017

Incomplete test coverage for scrabble-score exercism/rust#264

Closed

behrtam mentioned this issue Mar 8, 2017

run-length-encoding: Update test cases exercism/python#425

Merged

Insti closed this as completed Apr 18, 2017

robphoenix mentioned this issue Apr 27, 2017

bob: ensure generator is up-to-date exercism/go#624

Closed

rbasso mentioned this issue May 9, 2017

luhn 1.0.0.2: only test isValid, use (most) x-common cases exercism/haskell#533

Merged

petertseng mentioned this issue May 17, 2017

Add canonical-data for poker #793

Merged

Insti mentioned this issue Sep 8, 2017

pangram: rework tests (discussion) #893

Closed

Insti mentioned this issue Sep 16, 2017

Discussion: Testing 'invalid' input. #902

Open

petertseng mentioned this issue Sep 17, 2017

nucleotide-count: refactoring tests (discussion) #895

Closed

coriolinus mentioned this issue Dec 22, 2017

Pangram: Add tests for accented characters and non-latin scripts #1048

Closed

NobbZ mentioned this issue Dec 22, 2017

Pangram: Add tests for accented characters and non-latin scripts exercism/csharp#502

Closed

coriolinus mentioned this issue Feb 8, 2018

reverse-string: multi-byte character strings #1175

Closed

petertseng mentioned this issue Mar 16, 2018

crypto-square exercise tests contradict README.md exercism/rust#458

Closed

petertseng mentioned this issue Sep 4, 2018

Pangram: Diacritics and ligatures #1318

Closed

rpottsoh mentioned this issue Dec 2, 2018

Add Unicode test to bob #1413

Closed

petertseng mentioned this issue Jun 5, 2019

Pangram: Add non-basic characters in the test suite exercism/haskell#824

Closed

W8CYE mentioned this issue Sep 22, 2022

Anagram: 'non-ascii' character doesn't match with problem description exercism/go#2494

Closed
