titlecase: chars not starting a word can be converted to lowercase #23393
Can we just get rid of this function? Case manipulation is subtle and tricky and not something you want to have coupled with your language runtime version. Title case is worse since it depends not only on case changing but also on what characters are considered to separate words. This seems like a total morass that the standard library should not be getting into.
The naive I agree that anything much more sophisticated in the way of case transformations should go into a package. |
As I argued in #19469, if you want the "strict" behavior you can always do |
The trouble with public functions that are not the Right Way™ to do it is that people use them, and then we fall into a cycle of having to tell people to use the other implementation. This is exactly the problem with {read,write}dlm, which we keep having to tell people not to use in favor of some other CSV reader. (That situation is exacerbated by too many CSV readers and other data ecosystem fragmentation, but the point remains.) If the function is internally useful, we can have a non-exported simple version. That doesn't have this issue.
I rarely work with strings, mostly when hacking the REPL, but even there, I needed
I was not aware of that, but in the only instance I find in the /doc, it uses |
What to do here? Either merge this (my vote), deprecate, or status quo... triage? |
Since this function is now part of the stdlib |
@StefanKarpinski, note that the parsing of Julia itself is Unicode version-dependent, since it depends on Unicode categories to determine what counts as an identifier.
Fair enough, but that doesn't really affect string processing.
This is now an issue for the |
I'm marking this as 1.0 but note that it's "stdlib", so it does not block feature freeze or an alpha.
base/deprecated.jl
Outdated
@@ -1708,6 +1708,9 @@ export hex2num
# PR 23341
@deprecate diagm(A::SparseMatrixCSC) spdiagm(sparsevec(A))

# PR #23393
@deprecate titlecase(s::AbstractString) titlecase(s, false, true)
This is tricky; we don't want to tell people to call the 3-argument version in their code.
+1, I think we should just do this (rebased appropriately of course). Only small issue is how to deal with the old |
There are 2 new arguments for compatibility (which could be turned into keyword arguments now):
If I understand correctly, you suggest 1) that we could just hard-break the |
Maybe I misunderstood --- is the intent to permanently add a third argument, or only use it for the deprecation period? I guess it would be ok to permanently add the argument, but it should be called something involving "spaces" instead of "compat". |
There doesn't seem to be a problem statement anywhere that I can find, so I'm having a hard time understanding what problem is being fixed here.
The problem is that our titlecase function only changes the first characters of words, and only considers spaces to be word separators. Other languages (following Unicode recommendations) also lowercase non-initial word characters, and consider any non-letter to be a word separator.
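To make the two behaviors concrete, here is a minimal sketch in Julia. `titlecase_sketch`, `old_style`, and `new_style` are hypothetical names introduced only for illustration; this is not the Base implementation, and it uses `uppercase` on the first character as a simplification of a true title-case mapping.

```julia
# Hypothetical helper, not the Base implementation.
# `wordsep` decides which characters end a word; `strict` decides whether
# non-initial characters of a word are converted to lowercase.
function titlecase_sketch(s::AbstractString; wordsep = c -> !isletter(c), strict::Bool = true)
    out = IOBuffer()
    startword = true                  # the next non-separator character begins a word
    for c in s
        if wordsep(c)
            print(out, c)             # separators are copied unchanged
            startword = true
        else
            # Simplification: `uppercase` stands in for a real title-case mapping.
            print(out, startword ? uppercase(c) : (strict ? lowercase(c) : c))
            startword = false
        end
    end
    return String(take!(out))
end

# Old behavior as described above: only spaces separate words, the rest is untouched.
old_style(s) = titlecase_sketch(s; wordsep = isspace, strict = false)

# Behavior proposed here: any non-letter separates words, the rest is lowercased.
new_style(s) = titlecase_sketch(s)

println(old_style("hello WORLD-wide web"))   # Hello WORLD-wide Web
println(new_style("hello WORLD-wide web"))   # Hello World-Wide Web
```

The only differences between the two variants are the separator predicate and the handling of non-initial characters, which is what the two commits in this PR change.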
The intent for the
I like the I still don't quite understand what |
That has been answered at least twice in this thread. It's the new behavior implemented here, of lowercasing other characters.
I would be fine with |
While re-reading the OP a few days ago, I realized how bad it was, sorry for that!
Me neither, I was hoping for a suggestion of a better name ;-) |
Let's just go with |
A keyword argument `strict` is added to `titlecase` to control whether to convert those chars to lowercase. The default value is `true`, which makes this change breaking. This is how some languages (e.g. Python) implement this function, and is compatible with http://www.unicode.org/L2/L1999/99190.htm.
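As an illustration of the commit message above, the following sketch assumes a Julia version that includes this change (the `strict` keyword is present in Julia 1.x). The input strings are arbitrary examples.

```julia
s = "the JULIA programming language"

# Default (strict = true): non-initial characters of each word are lowercased.
strict_result = titlecase(s)
println(strict_result)    # The Julia Programming Language

# strict = false: only the first character of each word is changed.
loose_result = titlecase(s, strict = false)
println(loose_result)     # The JULIA Programming Language

# Useful when existing capitals should survive, e.g. acronyms.
acronym_result = titlecase("ISS - international space station", strict = false)
println(acronym_result)   # ISS - International Space Station
```

Code that relied on the previous behavior would need to pass `strict = false` explicitly, which is why the default is described as breaking.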
Rebased accordingly. Compared to the initial version, I unexported the new |
Let's just leave it for now. That would involve a whole discussion about what the best name is.
First commit: `strict` is added to `titlecase` to control whether to convert those chars to lowercase.
This is useful e.g. for REPL: implement Alt-{u,c,l} to change the case of the next word #23379.
[...] the new behavior (`strict=true`) in the future. This is to be compatible with the `istitle` function, so that `istitle(titlecase(s)) == true` when `s` has at least 1 letter. This is also how some languages (e.g. Python) implement it, and is compatible with http://www.unicode.org/L2/L1999/99190.htm.
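The `istitle(titlecase(s))` invariant mentioned above can be checked with a small sketch. `istitle_sketch` is a hypothetical stand-in written for this example, since the exact `istitle` function referred to here may not be available in a given Julia version. It treats every non-letter as a word separator, and the example assumes the strict `titlecase` from Julia 1.x.

```julia
# Hypothetical check, not the Base `istitle`: every word must start with a
# non-lowercase letter and continue without uppercase letters, and the
# string must contain at least one letter.
function istitle_sketch(s::AbstractString)
    startword = true
    seen_letter = false
    for c in s
        if !isletter(c)
            startword = true          # any non-letter ends the current word
        else
            seen_letter = true
            if startword ? islowercase(c) : isuppercase(c)
                return false
            end
            startword = false
        end
    end
    return seen_letter
end

for s in ["hello wORLD, it's 2017", "ALL CAPS", "x"]
    t = titlecase(s)                  # strict = true by default
    println(repr(t), " => ", istitle_sketch(t))
end
```

With `strict = false` the invariant no longer holds in general, since an input such as "ALL CAPS" is returned unchanged.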
Second commit: "titlecase: all non-letters are considered word-separators"
The old behavior is deprecated. This PR is coupled with #23394 but independent in terms of working code.