Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Discussion: Non-Ascii test cases. #428

Closed
Insti opened this issue Oct 31, 2016 · 12 comments
Closed

Discussion: Non-Ascii test cases. #428

Insti opened this issue Oct 31, 2016 · 12 comments

Comments

@Insti
Copy link
Contributor

Insti commented Oct 31, 2016

While updating the anagram test cases (issue: #413)
Discussion of handling non-ascii characters came up, and we decided that we would NOT use non-ascii characters in those tests.

@NobbZ made a good point:

I do not think, that we should test for anything that is not in ASCII.

  1. Most languages do not cope very well with codepoints beyond ASCII
  2. There are letters beyond ASCII, that would need another step of normalization. Eg the german "es-zett" (ß) which is is only allowed as lowercase (but there is a capital version available which is allowed to be used on titlepages and headlines), on capitalisation it is usually turned into "SS". There has been the long-s, and similar characters in ancient greek (gamma was one of them AFAIK).

Isogram (as of 20161031) also has non-ascii test cases.

Are there other problems that have non-ascii test cases?

I've created this issue so we can discuss the general policy of whether non-ascii characters should be used in test cases and have a a thread to point to when it comes up again in the future.

Proposal:
All test cases should only use ASCII characters
(Unless extended character handling is integral to the problem.)

@rbasso
Copy link
Contributor

rbasso commented Oct 31, 2016

Aye!

I believe that the same logic can be applied to a broader class of questions:

Q: Should we test for ...?
A: No! Unless that is a fundamental part of the problem.

Anytime we testing for something that is nothing fundamental, we reduce diversity in the solutions and it gets boring to review code. 😁

@ErikSchierboom
Copy link
Member

Agreed. I think having the non-ASCII test in Isogram does not really add anything. In fact, I think most people will be confused by it. If someone could come up with another exercise in which non-ASCII character handling does make sense, I'm all for it. But for the current exercises, I think it should be removed.

@NobbZ
Copy link
Member

NobbZ commented Nov 1, 2016

I do like to have some of non US-ASCII test cases as an optional part for a very small subset of exercises. They can teach you a lot of unicode-handling if and only if you are open to it and want to learn it.

Anagram is not one of them, since it requires you to normalize charakters and some of them can't be normalized. In the example of the german ß vs. SS, is MASSE an anagram of Maße? How is it the other way round?

Word-Count on the other hand does gain a lot by adding the Unicode sugar to separate words from each other. But, well, the normalisation problem persists, but its influence is by far not that strong.

@behrtam
Copy link
Contributor

behrtam commented Nov 3, 2016

I would agree that non-ASCII test cases don't make sense in every exercise, especially if they add unnecessary complexity for example if normalization is needed.
But we should add unicode character where ever it makes sense. It is 2016, emojis have conquered the world and not everyone can stay in the US bubble. People need to learn how to deal with unicode at some point in the tracks.

@Insti
Copy link
Contributor Author

Insti commented Nov 3, 2016

It is 2016, emojis have conquered the world and not everyone can stay in the US bubble. People need to learn how to deal with unicode at some point in the tracks.

I agree, this is why there need to be specific exercises that deal with multi-language text handling, and all the gotchas and edge cases involved in that. But that shouldn't be required in an exercise that is about the algorithm for detecting anagrams.

@petertseng
Copy link
Member

petertseng commented Nov 4, 2016

Looks like we're going sort of Unix philosophy with many exercises - do one thing and do it well. This is why we are cutting Unicode from a few text exercises. I welcome the move!

For the time when that new Unicode exercise(s) solidifes, I have a list of things I've seen over the months that could stand to go in it:

@petertseng
Copy link
Member

petertseng commented Nov 4, 2016

Are there other problems that have non-ascii test cases?

We can try git grep -Pnl "[\x80-\xFF]" to find them, though I admit this can have false positives. But at least it shouldn't have false negatives... right?

Edit: Actually, it may not have false positives either...

For me, this has found:

exercises/atbash-cipher/canonical-data.json
exercises/bob/canonical-data.json
exercises/forth/canonical-data.json
exercises/isogram/canonical-data.json
exercises/pangram/canonical-data.json
exercises/run-length-encoding/canonical-data.json
exercises/scrabble-score/canonical-data.json

Insti added a commit to Insti/x-common that referenced this issue Nov 6, 2016
In exercism#428 the decision was made
to remove non-ascii test cases from exercises that are not explicitly
about extended character set handling.

This PR removes the non-ascii test cases from the `canonical_data.json`
for this exercise.
@ldwoolley
Copy link

I have been working in power generation for the last few years, and the frequency that unicode characters (non ASCII) breaks existing code increases every year. Data collection spreads across the globe and into countries that don't use ASCII contained alphabets. Many common shortcuts to manage punctuation are bad practice outside of case sensitive alphabets. I understand KISS, but I can't think of a better place then here in these learning exercises to help people move to thinking in Unicode, and away from using ASCII crutches. Another example, of these crutches is that not all languages have the concept of upper and lower case (bicameral vs. unicameral alphabets, think Persian, Arabic, Hebrew). I think the test cases should challenge one to handle case, but also handle a situation where the alphabet does not have a case. Anagrams and isograms exist in these languages as well.

@NobbZ
Copy link
Member

NobbZ commented Nov 7, 2016

@ldwoolley what you say here is obviously true. Thats exactly the point why we said, that we need extra exercises that teach unicode.

BUT! We need to slow that down a bit. In most languages everything not-US-ASCII is a PITA. Most languages require you to use even external libraries.

So removing non-US-ASCII achieves multiple goals:

  1. Reduce complexity of exercises
  2. Make exercises more focused
  3. Remove dependencies from exercises
  4. Make it easier to have uniform test-data across different tracks.

@behrtam
Copy link
Contributor

behrtam commented Nov 7, 2016

But would it hurt to have the non-ASCII test cases as part of the test suits only that they are deactivated/skipped with a comment like "if you are not new to programming and/or care about unicode it might be interesting to thing about ..."?

@Insti
Copy link
Contributor Author

Insti commented Nov 28, 2016

Conclusion:
All test cases should only use ASCII characters
(Unless extended character handling is integral to the problem.)

We should add exercises that explicitly deal with multi-language characters: See #455

@robkeim
Copy link
Contributor

robkeim commented Jan 31, 2017

@Insti this discussion should now be closed right?

The canonical-data.json files have also now been updated via #441.

behrtam added a commit to behrtam/python that referenced this issue Mar 8, 2017
This removes the unicode test cases ([x-common/428](exercism/problem-specifications#428), [x-common/434](exercism/problem-specifications#434)) and
adds the new white space and lowercase tests ([x-common/624](exercism/problem-specifications#624)).
behrtam added a commit to behrtam/python that referenced this issue Mar 8, 2017
behrtam added a commit to behrtam/python that referenced this issue Mar 16, 2017
behrtam added a commit to exercism/python that referenced this issue Mar 16, 2017
@Insti Insti closed this as completed Apr 18, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

8 participants