utfxx vs. UTFxx #10456

stevengj · 2015-03-09T13:34:11Z

In #10452, @JeffBezanson got rid of most of the lower-case numeric conversions, as discussed in #1470. In #1470, it was similarly suggested that UTFxxString be renamed to UTFxx, and that utfxx(s) be eliminated in favor of UTFxx(s). Most people seemed to be in favor of this?

(One slight tweak is that UTF16String(x::Array{UInt16}) is currently a low-level constructor that accepts a nul-terminated array, whereas utf16(x::Array{UInt16}) adds the nul terminator for you. However, this could be changed to UTF16(x::Array{UInt16}, addnul=true), so that UTF16(x) is the high-level version that adds the nul, whereas UTF16(x, false) is the low-level version that requires x to already have the nul.)
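
A minimal sketch of the signature proposed above, assuming a hypothetical stand-in type `SketchUTF16` (not the actual Base definition) and current `struct` syntax. The terminator check in the low-level branch is an added assumption for illustration; the proposal only says that `x` must already have the nul:

```julia
# Hypothetical stand-in for the proposed UTF16 type; not the Base implementation.
struct SketchUTF16
    data::Vector{UInt16}   # code units, always ending in a 0x0000 terminator

    # Inner constructor: addnul=true appends the terminator (high-level path);
    # addnul=false expects the caller to have supplied it (low-level path).
    function SketchUTF16(x::Vector{UInt16}, addnul::Bool=true)
        if addnul
            return new(vcat(x, 0x0000))
        end
        # Assumed sanity check, not part of the original proposal.
        if isempty(x) || x[end] != 0x0000
            throw(ArgumentError("data must be nul-terminated when addnul=false"))
        end
        return new(x)
    end
end

high = SketchUTF16(UInt16[0x0068, 0x0069])                 # terminator added for us
low  = SketchUTF16(UInt16[0x0068, 0x0069, 0x0000], false)  # already terminated
```

Both calls end up holding the same three code units; only the second trusts the caller to have terminated the array.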

ivarne · 2015-03-09T13:38:20Z

I don't like guessing what a boolean parameter is supposed to mean. How about a UTF16_raw(::Array{UInt16}) or something?

mbauman · 2015-03-09T14:05:53Z

This is a common pattern that I've run into with constructors, and maybe it deserves a common idiom… especially since the previous sort-of idiom (lower/upper-case) is on the way out. I often want to default to checking argument consistency in constructors, but need an escape hatch for internal code where I already know those arguments are consistent. It's a much harder problem than user-extensible @inbounds, though, and even that hasn't been solved yet (#8227).

+1 for the rename, we can bikeshed an @unchecked_arguments constructor/function annotation or common _unchecked suffix later.

quinnj · 2015-03-09T14:32:22Z

There was talk of making an Unsafe module, so you could potentially have UTF16 as high-level and Unsafe.UTF16 as low-level.

stevengj · 2015-03-09T14:42:44Z

@quinnj, except that the unsafe constructor needs to be an inner constructor. @ivarne, because it is an inner constructor, I think that rules out any name but UTF16(...)? I don't see the obscurity of the boolean argument as a big issue in this case, because it will mainly be used internally. We could do a keyword, of course, although there is a slight performance hit to that. @mbauman, you're right that this is not an uncommon pattern and it would be nice to have a more general solution.
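
For comparison, the keyword variant mentioned above could be sketched as follows, again with a hypothetical stand-in type (`SketchUTF16Kw`, not a Base type). It stays an inner constructor under the type's own name, while the call site spells out what the flag means:

```julia
# Hypothetical stand-in type; illustrates only the keyword-argument spelling.
struct SketchUTF16Kw
    data::Vector{UInt16}

    # Inner constructor with a keyword flag instead of a positional Bool.
    function SketchUTF16Kw(x::Vector{UInt16}; addnul::Bool=true)
        return new(addnul ? vcat(x, 0x0000) : x)
    end
end

checked = SketchUTF16Kw(UInt16[0x0068, 0x0069])                        # adds the nul
raw     = SketchUTF16Kw(UInt16[0x0068, 0x0069, 0x0000]; addnul=false)  # trusts the caller
```

Whether the keyword form carries a measurable cost depends on the Julia version, so that trade-off is not shown here.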

nalimilan · 2015-03-09T14:45:06Z

How does renaming all string types fit with @StefanKarpinski's rework of strings? Anyway, I'm not really a fan of removing the String term, as not all users will necessarily associate the UTF acronym with strings. All of that to save a few letters...

stevengj · 2015-03-09T16:39:00Z

@nalimilan, users are already writing utfxx(x) for conversions; this just makes the typename consistent with that. And if you are willing to type UTF8String, you presumably already know what UTF-8 is, in which case the String adds nothing: UTFxx can hardly refer to anything other than a string type.

I see this as orthogonal to Stefan's rework of ByteString, which doesn't depend on whether we spell it UTF8String or UTF8. (My vague impression was that String or Str might become an alias for the default string type, analogous to Int, in which case users who don't care about the specific encoding would never type "UTF".)

nalimilan · 2015-03-09T17:14:15Z

@stevengj, I was more concerned about users seeing Array{UTF8, 1}, e.g. after reading a file. People who use the functions are probably much more knowledgeable.

Anyway, I guess my point is that the benefit isn't compelling, and the cost, while small, is not negligible.

tkelman · 2015-03-10T06:52:39Z

UTF8 is not especially readable or immediately obvious as a string type to me. Our Char is a Unicode codepoint, but in other languages a Unicode codepoint might be a distinct char type, and a Unicode string a separate type from byte strings, right?

JeffBezanson · 2015-03-10T15:26:50Z

That's a fair point, but neither is it obvious what the difference is between UTF8String(x), convert(UTF8String, x) and utf8(x). Maybe just accept the long names? Personally I don't convert strings to different encodings all that often.

StefanKarpinski · 2016-09-13T23:14:02Z

Made irrelevant by the removal of these names.

stevengj added the "needs decision" label Mar 9, 2015

StefanKarpinski closed this as completed Sep 13, 2016
