utfxx vs. UTFxx #10456

Closed
stevengj opened this issue Mar 9, 2015 · 10 comments
Labels: needs decision (A decision on this change is needed)

Comments

@stevengj
Member

stevengj commented Mar 9, 2015

In #10452, @JeffBezanson got rid of most of the lower-case numeric conversions, as discussed in #1470. In #1470, it was similarly suggested that UTFxxString be renamed to UTFxx, and that utfxx(s) be eliminated in favor of UTFxx(s). Most people seemed to be in favor of this?

(One slight tweak is that UTF16String(x::Array{UInt16}) is currently a low-level constructor that accepts a nul-terminated array, whereas utf16(x::Array{UInt16}) adds the nul terminator for you. However, this could be changed to UTF16(x::Array{UInt16}, addnul=true), so that UTF16(x) is the high-level version that adds the nul, whereas UTF16(x, false) is the low-level version that requires x to already have the nul.)
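For readers following along, the two-argument proposal above could be sketched roughly as follows. This is only an illustration in current Julia syntax: `MyUTF16` is a hypothetical stand-in for the real type, and the terminator check on the low-level path is an assumption added for the example, not something the existing constructor is known to do.

```julia
# Hypothetical sketch; `MyUTF16` is an illustrative name, not the Base type.
struct MyUTF16
    data::Vector{UInt16}   # stored with a trailing nul code unit

    # One inner constructor covers both paths:
    #   addnul = true  -> high-level: copy the input and append the nul
    #   addnul = false -> low-level: the input must already end in a nul
    function MyUTF16(x::Vector{UInt16}, addnul::Bool=true)
        if addnul
            return new(vcat(x, UInt16(0)))
        end
        # Assumed check, added for the example only.
        (!isempty(x) && x[end] == 0) ||
            throw(ArgumentError("data must be nul-terminated"))
        return new(x)
    end
end

hi = MyUTF16(UInt16[0x0068, 0x0069])                 # nul appended for the caller
lo = MyUTF16(UInt16[0x0068, 0x0069, 0x0000], false)  # nul supplied by the caller
@assert hi.data == lo.data
```

Because the optional argument is positional, a call such as `MyUTF16(x, false)` does not say at the call site what `false` means.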

stevengj added the needs decision label Mar 9, 2015
@ivarne
Member

ivarne commented Mar 9, 2015

I don't like guessing what a boolean parameter is supposed to mean. How about a UTF16_raw(::Array{Uint16}) or something?

@mbauman
Member

mbauman commented Mar 9, 2015

This is a common pattern that I've run into with constructors, and maybe it deserves a common idiom… especially since the previous sort-of idiom (lower/upper-case) is on the way out. I often want to default to checking argument consistency in constructors, but need an escape hatch for internal code where I already know those arguments are consistent. It's a much harder problem than user-extensible @inbounds, though, and even that hasn't been solved yet (#8227).

+1 for the rename, we can bikeshed an @unchecked_arguments constructor/function annotation or common _unchecked suffix later.

@quinnj
Member

quinnj commented Mar 9, 2015

There was talk of making an Unsafe module, so you could potentially have UTF16 as high-level and Unsafe.UTF16 as low-level.

@stevengj
Member Author

stevengj commented Mar 9, 2015

@quinnj, except that the unsafe constructor needs to be an inner constructor. @ivarne, because it is an inner constructor, I think that rules out any name but UTF16(...)? I don't see the obscurity of the boolean argument as a big issue in this case, because it will mainly be used internally. We could do a keyword, of course, although there is a slight performance hit to that. @mbauman, you're right that this is not an uncommon pattern and it would be nice to have a more general solution.
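As a rough illustration of the keyword alternative mentioned above, here is a minimal sketch in current Julia syntax. `MyUTF16kw` is a hypothetical name used only for this example, and the low-level path deliberately performs no validation, which is an assumption about what "low-level" would mean here.

```julia
# Hypothetical sketch; `MyUTF16kw` is an illustrative name, not the Base type.
struct MyUTF16kw
    data::Vector{UInt16}   # expected to end in a nul code unit

    # The inner constructor is the only place `new` is available, so both
    # the checked and the trusted path have to live under the type's name.
    function MyUTF16kw(x::Vector{UInt16}; addnul::Bool=true)
        addnul && return new(vcat(x, UInt16(0)))  # high-level: append the nul
        return new(x)                             # low-level: trust the caller
    end
end

a = MyUTF16kw(UInt16[0x0068, 0x0069])
b = MyUTF16kw(UInt16[0x0068, 0x0069, 0x0000]; addnul=false)
@assert a.data == b.data
```

With a keyword, the call site spells out `addnul=false`, so the reader does not have to guess what the boolean means.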

@nalimilan
Member

How does renaming all string types fit with @StefanKarpinski's rework of strings? Anyway, I'm not really a fan of removing the String term, as not all users will necessarily associate the UTF acronym with strings -- all of that to save a few letters...

@stevengj
Member Author

stevengj commented Mar 9, 2015

@nalimilan, users are already writing utfxx(x) for conversions; this just makes the typename consistent with that. And if you are willing to type UTF8String, you presumably already know what UTF-8 is, in which case the String adds nothing: UTFxx can hardly refer to anything other than a string type.

I see this as orthogonal to Stefan's rework of ByteString, which doesn't depend on whether we spell it UTF8String or UTF8. (My vague impression was that String or Str might become an alias for the default string type, analogous to Int, in which case users who don't care about the specific encoding would never type "UTF".)

@nalimilan
Member

@stevengj I was more concerned about users seeing Array{UTF8, 1} e.g. after reading a file. People who use the functions are probably much more knowledgeable.

Anyway, I guess my point is that the benefit isn't compelling, and the cost, while small, is not negligible.

@tkelman
Contributor

tkelman commented Mar 10, 2015

UTF8 is not especially readable or immediately obvious as a string type to me. Our Char is a Unicode code point, but in other languages a Unicode code point might be a distinct char type, and a Unicode string a separate type from byte strings, right?

@JeffBezanson
Member

That's a fair point, but it is also not obvious what the difference is between UTF8String(x), convert(UTF8String, x) and utf8(x). Maybe just accept the long names? Personally I don't convert strings to different encodings all that often.

@StefanKarpinski
Member

Made irrelevant by the removal of these names.


8 participants