Discussion: Non-Ascii test cases. #428
Aye! I believe that the same logic can be applied to a broader class of questions:
Any time we test for something that is not fundamental to the problem, we reduce diversity in the solutions, and reviewing the code gets boring. 😁
Agreed. I think having the non-ASCII test in Isogram does not really add anything. In fact, I think most people will be confused by it. If someone could come up with another exercise in which non-ASCII character handling does make sense, I'm all for it. But for the current exercises, I think it should be removed.
I do like having some non-US-ASCII test cases as an optional part of a very small subset of exercises. They can teach you a lot about Unicode handling, if and only if you are open to it and want to learn it. Anagram is not one of them, since it requires you to normalize characters, and some of them can't be normalized. The German word-count example, on the other hand, gains a lot by adding the Unicode sugar needed to separate words from each other. The normalization problem persists there too, but its influence is far weaker.
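Both halves of this comment can be shown concretely. The sketch below (Python; the sample words are illustrative) demonstrates the Unicode-aware word splitting that benefits a word-count exercise, and the normalization problem that makes character-comparison exercises awkward:

```python
import re
import unicodedata

# Word-count side: in Python 3, \w matches letters from any script,
# so German words split correctly with no extra work.
words = re.findall(r"\w+", "Straße, Bäume und Wälder")
print(words)  # ['Straße', 'Bäume', 'und', 'Wälder']

# Normalization side: the same visible letter can be encoded two ways,
# so naive character comparisons silently disagree.
composed = "\u00fc"      # 'ü' as a single precomposed code point
decomposed = "u\u0308"   # 'u' followed by a combining diaeresis
print(composed == decomposed)                               # False
print(unicodedata.normalize("NFC", decomposed) == composed) # True
```

NFC normalization fixes this pair, but as noted above, not every character has a canonical composed form, which is why normalization alone does not solve the anagram case.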
I would agree that non-ASCII test cases don't make sense in every exercise, especially if they add unnecessary complexity, for example when normalization is needed.
I agree, this is why there need to be specific exercises that deal with multi-language text handling, and all the gotchas and edge cases involved in that. But that shouldn't be required in an exercise that is about the algorithm for detecting anagrams.
Looks like we're going the Unix-philosophy route with many exercises: do one thing and do it well. This is why we are cutting Unicode from a few text exercises. I welcome the move! For when that new Unicode exercise (or exercises) solidifies, I have a list of things I've seen over the months that could stand to go in it:
We can try. Edit: Actually, it may not have false positives either... For me, this has found: exercises/atbash-cipher/canonical-data.json
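For reference, a scan like the one described can be reproduced with a short script. This is a sketch; the glob pattern assumes the repository layout implied by the path mentioned above:

```python
import glob

def non_ascii_lines(path):
    """Yield (line_number, line) pairs for lines containing non-ASCII characters."""
    with open(path, encoding="utf-8") as f:
        for n, line in enumerate(f, start=1):
            if any(ord(ch) > 127 for ch in line):
                yield n, line.rstrip("\n")

# Hypothetical repository layout, matching the path mentioned above:
for path in glob.glob("exercises/*/canonical-data.json"):
    for n, line in non_ascii_lines(path):
        print(f"{path}:{n}: {line}")
```

Because it flags raw code points above 127 rather than guessing at intent, a scan like this has no false negatives for the "contains non-ASCII" question, though a human still has to judge whether each hit is deliberate.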
In exercism#428 the decision was made to remove non-ASCII test cases from exercises that are not explicitly about extended character set handling. This PR removes the non-ASCII test cases from the `canonical-data.json` for this exercise.
I have been working in power generation for the last few years, and the frequency with which Unicode (non-ASCII) characters break existing code increases every year. Data collection spreads across the globe and into countries whose alphabets are not contained in ASCII. Many common shortcuts for managing punctuation are bad practice outside of cased alphabets. I understand KISS, but I can't think of a better place than here, in these learning exercises, to help people move toward thinking in Unicode and away from ASCII crutches. Another example of these crutches: not all languages have the concept of upper and lower case (bicameral vs. unicameral alphabets; think Persian, Arabic, Hebrew). I think the test cases should challenge one to handle case, but also to handle a situation where the alphabet does not have case. Anagrams and isograms exist in these languages as well.
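The case-handling crutches mentioned here are easy to demonstrate. A short Python sketch (the sample words are illustrative):

```python
# German ß shows that case conversion is not one-to-one:
print("ß".upper())          # 'SS'
print("straße".lower())     # 'straße'  (lower() leaves ß alone)
print("straße".casefold())  # 'strasse' (casefold() is meant for caseless matching)

# Unicameral scripts such as Arabic have no case at all,
# so upper()/lower() tricks are no-ops:
word = "سلام"
print(word.upper() == word.lower() == word)  # True
```

Code that assumes `text.lower()` yields a canonical caseless form works for plain ASCII but quietly misbehaves on both of these inputs.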
@ldwoolley, what you say here is obviously true. That's exactly why we said we need extra exercises that teach Unicode. BUT! We need to slow that down a bit. In most languages, everything non-US-ASCII is a PITA; most even require external libraries. So removing non-US-ASCII achieves multiple goals:
But would it hurt to have the non-ASCII test cases as part of the test suites, only deactivated/skipped, with a comment like "if you are not new to programming and/or care about Unicode, it might be interesting to think about ..."?
Conclusion: We should add exercises that explicitly deal with multi-language characters: See #455 |
This removes the unicode test cases ([x-common/428](exercism/problem-specifications#428), [x-common/434](exercism/problem-specifications#434)) and adds the new white space and lowercase tests ([x-common/624](exercism/problem-specifications#624)).
While updating the anagram test cases (issue #413), discussion of handling non-ASCII characters came up, and we decided that we would NOT use non-ASCII characters in those tests.
@NobbZ made a good point:
Isogram (as of 2016-10-31) also has non-ASCII test cases.
Are there other problems that have non-ASCII test cases?
I've created this issue so we can discuss the general policy of whether non-ASCII characters should be used in test cases, and to have a thread to point to when it comes up again in the future.
Proposal:
All test cases should only use ASCII characters
(Unless extended character handling is integral to the problem.)
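As an illustration of why non-ASCII test cases trip people up, a naive per-code-point isogram check (a hypothetical sketch, not any track's actual implementation) gives different answers for the same visible word depending on its encoding form:

```python
import unicodedata

def is_isogram(phrase: str) -> bool:
    # Naive check: treats every Unicode code point as its own letter.
    letters = [c.casefold() for c in phrase if c.isalpha()]
    return len(letters) == len(set(letters))

composed = "n\u00e9e"     # "née" with a precomposed é
decomposed = "ne\u0301e"  # "née" with e + combining acute accent

print(is_isogram(composed))    # True:  'é' and 'e' are distinct code points
print(is_isogram(decomposed))  # False: the base letter 'e' appears twice
                               # (the combining mark is not .isalpha())

# Normalizing first makes the two inputs agree again:
print(unicodedata.normalize("NFC", decomposed) == composed)  # True
```

Whether "née" even *is* an isogram is a judgment call about accents, which is exactly the kind of ambiguity that doesn't belong in an exercise about the basic algorithm.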