Compact String implementation #6612
This was updated to ASCII rather than ISO-8859-1 in RFC 7230 §3.2.4 (Field Parsing). Latin-1 conflicts with UTF-8, whereas ASCII is compatible (so you can cross-detect between the two).
Interestingly, the current string format does already have an IsAscii flag on it. However, changing the string to be 8-bit rather than 16-bit based on this flag would likely break existing code, e.g. code that uses fixed over strings.
ASCII strings are something that @evincarofautumn experimented with a lot for Mono: the proposal is here http://www.mono-project.com/docs/advanced/runtime/docs/ascii-strings/ (and there are a few more details in the mono-devel list thread about it). I'm sure he can go into great detail about the pros/cons 😄
Hmm, I didn't think about compatibility with code that uses fixed over strings.
Could always stackalloc length * 2 bytes at the start of the fixed statement, then copy and widen into it. Or allow both fixed byte* and fixed char*.
Why use ASCII instead of Latin-1 for the 1-byte version? Latin-1 would permit using one byte for a lot of the accented characters of Western Europe, such as 'è', 'ò', 'ù'... Could it make sense to have a third type at this point, UTF-32, for strings containing characters that cannot be represented in UTF-16 (Chinese characters, Klingon, Emoticons, ...), rather than resorting to the kludge of surrogate pairs (which are not really supported in .NET [1])? It probably would not accomplish much compression in that case, but the implementation could be more correct. [1] De facto, .NET Strings are really UCS-2, not UTF-16.
ASCII and UTF-8 are compatible; Latin-1 and UTF-8 are not, so they require conversions between the two. UTF-8 is just as compact for the non-accented characters and is currently the general interchange format, so using Latin-1 would add conversion load, and allocation, to and from UTF-8; whereas ASCII requires no conversion. Also, using Latin-1 in places where only ASCII is allowed (HTTP headers etc.) adds a validation step to and from. There are fast ASCII algorithms that don't apply to Latin-1, e.g. case-insensitive compare (ignore a single bit), upper and lower casing (flip a bit), etc. If a string is known to be ASCII (1 byte per char) it requires no validation steps and its bytes can be used directly, and conversion to UTF-8 is instant, as its bytes are already also valid UTF-8.
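As a hedged illustration of those single-bit tricks (a sketch of the general technique, not code from this thread): ASCII upper- and lower-case letters differ only in bit 0x20, so casing and case-insensitive comparison need no lookup tables.

```csharp
// Sketch: single-bit ASCII tricks. Assumes the bytes are already known to be ASCII.
static byte ToUpperAscii(byte b) =>
    (byte)(b >= (byte)'a' && b <= (byte)'z' ? b & ~0x20 : b); // clear bit 0x20 to upper-case

static bool EqualsIgnoreCaseAscii(byte a, byte b)
{
    if (a == b) return true;
    int folded = a | 0x20;                  // fold both to lower case
    return folded == (b | 0x20)
        && folded >= 'a' && folded <= 'z';  // only letters may differ by the case bit
}
```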
The reason Mono has disabled fixed on strings is purely to help us locate, in as many places as possible, the code paths that use fixed, and upgrade that code. The idea is to perform a copy of the data on demand, if necessary, when a "fixed" that has not been upgraded is found, preserving compatibility at the expense of a string copy.
I once also tried to make the case that .NET was UCS-2 and not UTF-16 in front of the whole ECMA committee, only to find out that I was wrong :-)
@benaadams, I live in Europe and we use Latin-1. Consider the cultural impact of using Latin-1 from my European perspective. For the US, ASCII might seem like a slam dunk, but if you think about this with European eyes it's not. Could a compiler switch be a good middle ground? Use ASCII as the default, but open up for Latin-1...
@HansOlavS I don't think Latin-1 would be great; I'm not thinking of it from a US perspective (I live in Europe also), but rather from the perspective of high-throughput interchange formats.
Latin-1 (ISO-8859-1) is also a superset of ASCII but requires conversion to and from String, UTF-8 and the Windows native encoding. Only 6% of webpages use ISO-8859-1. It also doesn't actually cover all European languages; ISO-8859 has 16 variants for fuller coverage. Relatedly, you need to be using ISO-8859-7 (Latin/Greek), ISO-8859-15 (Latin-9) or ISO-8859-16 (Fantasy mix) to include the Euro symbol €, as it's not included in Latin-1. The Latin formats don't have the quick processing properties of ASCII, e.g. case-insensitive compare, uppercasing, lowercasing, and require character-translation look-ups; so at best they become a slightly more compressed format, but with translation penalties. It might be worth having an opaque 8-bit intermediate string format that supports operations such as indexOf, split, etc., and then requires translation to String, UTF-8 and ASCII (for subset validation). However, I'm not sure that provides a great advantage over just using UTF-8.
Disclaimer: this is way out of my league... ;) I get that ASCII aligns neatly with a lot of low-level protocols (like HTTP), but my concern was memory footprint and its reduction when loading up my .NET app, which contains a lot of Latin-1-compatible strings. It would be super-nice to have those stored as 8-bit chars in memory instead of the 16 bits .NET now uses. If you're familiar with Delphi (Anders Hejlsberg's predecessor to C#), it had 'string' (which was an ANSI string, depending on your computer's localization settings or whatever) and 'widestring', which was Unicode (not sure of the actual encoding). Actually, my very first reaction to C# and .NET (reading a C# book back in 2002) was that all strings were "unicode" and 16-bit. I thought that was a strange choice regarding memory consumption.
And regarding special cases like €, I don't care, as that is an edge case and I'm happy to have it stored in UTF-16.
ASCII has other benefits: when marshaling out to UTF-8 it requires no processing, it is a direct copy. Anything that uses the top bit would prevent this optimization from working. That was one of the reasons for Mono's work to use ASCII.
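To make the top-bit point concrete, here is a hedged sketch (mine, not from the thread): a buffer is pure ASCII, and therefore already valid UTF-8, exactly when no byte has bit 0x80 set, which can be tested with a single bitwise OR over the data and vectorizes trivially.

```csharp
// Sketch: ASCII detection via the top bit. If this returns true, the bytes can be
// copied to UTF-8 unchanged; if any byte uses the top bit, the direct copy is invalid.
static bool IsAscii(ReadOnlySpan<byte> bytes)
{
    byte accumulated = 0;
    foreach (byte b in bytes)
        accumulated |= b;                 // OR all bytes together
    return (accumulated & 0x80) == 0;     // any top bit set => not ASCII
}
```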
I get the ASCII->UTF-8 compatibility and marshaling and P/Invoke and such. I guess I'm still a bit uneasy with .NET's decision to have all strings in memory as 16-bit UTF-16, and I miss the old days in Delphi where I could have an ANSI (Latin-1) string and a widestring and make a conscious decision about where to use what. But maybe this concern is totally beside the point in this thread and I misunderstood.
By Latin-1 I meant the "proper" implementation done by Microsoft: https://en.wikipedia.org/wiki/Windows-1252. This issue is about compressing strings, and wasting half of the byte values to stay compatible with ASCII does not make sense to me. To check whether a Latin-1 string is valid ASCII, couldn't you simply check that each byte is < 128? Apart from Java, Python has done something similar to what I was thinking: characters that are not representable in UCS-2 would use UCS-4 (4 bytes). If we were interested in being compatible with UTF-8 we could use UTF-8 directly, but it is not the most compact representation: simple characters such as 'è' become 2 bytes, and the more complex ones can become up to 4 bytes long! I was pretty sure that characters between 128 and 255 were the same between Latin-1 and UTF-16; am I wrong?
Why not use a separate type?
@HansOlavS It would be possible to use Latin-1 as the compact encoding. Scanning would be just as fast, because the second Unicode block (U+0080–U+00FF) corresponds to the upper range of Latin-1, as @fanol rightly pointed out. We could also do some clever hacks with the range U+0080–U+009F, which contains rarely used control characters; for example, encoding € as U+0091 ("Private Use 1"). Some ASCII-only optimisations would no longer work. Marshalling would require an extra scan per string. And since Latin-1 has incomplete coverage for many languages, it might not make much of a difference; we should measure real-world apps. I don't like that this optimisation privileges some languages over others, but frankly it's because the character encodings are already biased. @zabulus All the existing APIs are using String.
Sure, you can use a separate type; that's what the UTF-8 string library is trying to achieve. However, this proposal was to see if something more wide-ranging and more deeply embedded in the CLR would be possible, with the side effect that you don't have to make any code changes to use it.
I see one more concern about such strings, besides fixed: how they are marshalled to native code.
As Latin-1 is a subset of UTF-16, the marshalling code, if it has to convert to a C/C++ wstring, would simply widen each byte by adding a 0 byte; one could probably make this faster using SSE intrinsics. It is obvious that in some cases this implementation will have a performance cost; it depends on whether it is more important to use less memory or to be faster...
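A hedged sketch of that widening step (the method name and use of Span are mine, not from the thread): every Latin-1 byte value maps to the UTF-16 code unit with the same numeric value, so conversion is a per-byte zero-extension, which SSE/AVX unpack or widen instructions can do many bytes at a time.

```csharp
// Sketch: widen Latin-1 (ISO-8859-1) bytes to UTF-16 code units by zero-extension.
// Latin-1 bytes 0x00–0xFF correspond directly to Unicode code points U+0000–U+00FF.
static void WidenLatin1ToUtf16(ReadOnlySpan<byte> source, Span<char> destination)
{
    for (int i = 0; i < source.Length; i++)
        destination[i] = (char)source[i];   // high byte is implicitly 0
}
```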
Characters 128–159 differ between the Windows-1252 variant and UTF-16.
Then, additionally, in the fast-path algorithms you need to do a validation check per character, which makes them less of a fast path... (you can't use a straight int, long, or vector bitwise OR)
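To illustrate why that per-character check is more work than the single-OR ASCII test, here is a hedged sketch (mine, not from the thread) of detecting the 0x80–0x9F range where Windows-1252 and Unicode disagree:

```csharp
// Sketch: unlike ASCII detection (one OR plus a top-bit test), checking for the
// Windows-1252 "problem range" 0x80–0x9F needs a range test on every byte.
static bool ContainsWindows1252Conflicts(ReadOnlySpan<byte> bytes)
{
    foreach (byte b in bytes)
        if (b >= 0x80 && b <= 0x9F)   // code points that differ between Windows-1252 and Unicode
            return true;
    return false;
}
```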
Sorry, I guess I didn't actually ask any questions when I proposed this. I guess I'm wondering:
@fanol Right, marshalling to UTF-16 would still be cheap; you could do it with ~2 instructions for each 8 bytes of input.
@mattwarren Around the time we had a discussion on mono-devel-list about my proposal, some MS people (@vancem) indicated that they've thought about doing this in the past. The objections were mainly that it's a large, invasive change (we don't want to break stability & backward compatibility).
@evincarofautumn that's interesting to know, thanks
Though not for the Windows-1252 variant, due to the conflict in characters 128–159.
OK, the Microsoft version of Latin-1, while more clever in its use of the 256 available characters, is not suitable because it is not a subset of UTF-16, while the real Latin-1 (https://en.wikipedia.org/wiki/ISO/IEC_8859-1) is, so that is the one that should be used. Too bad for '€' and the other characters in the block 128–159; strings containing them will not be compressible.
@whoisj, as I saw it, the original question was related to how strings are stored in memory (especially related to caching), not so much about how they are passed around. Whether we use a traditional encoding (as you're suggesting) or a data structure (as I'm suggesting), translation to and from a normal UCS-2 representation would be needed for most native calls, would it not? (I don't deal with a lot of interop, so I plead ignorance on how string values are usually passed to native libraries on various platforms.) I think the structured approach could result in faster conversions than converting traditional encodings, because the converter would be able to work with regions of characters (i.e., each of a known width) rather than having to convert one character at a time (as would be required with, say, UTF-8). What I'm suggesting would also be a new, fully optional type that wouldn't require rewiring the dark innards of System.String.
See https://github.com/dotnet/coreclr/issues/7083#issuecomment-245713571 for why changing
I've added this to our list of potential post-2.0 investments - this is potentially a BIG one though, and it's certainly not going to get a yay/nay right here in this issue. This remains in the future and on our radar.
I'm asking myself: to make the String type more compact, doesn't the problem really lie in the Char type? Would it be possible to encode Char in UTF-8? Doing this we would get:
How could this new Char be expressed so that it effectively occupies only 1 byte, 2 bytes, 3 bytes... depending on the character code? Could a stack-allocated array do this?

```
struct Char
{
    Char value;
    fixed Byte data[]; // this implies that fixed could be used in a "safe" context

    // Used by the compiler to create a Char literal
    unsafe public Char(byte* data, int len)
    {
        this.data = stackalloc(len);
        this.data = data;
    }
}
```

Probably there is a reason why no one has taken this route (Java, Python) and they have instead changed String to no longer be an array of characters.
On second thought, the field 'data' could be declared as a Span<Byte>?
@fanoI I do not believe the CLR has a facility for two instances of the same type to be different sizes. And we definitely wouldn't want to use indirection to the heap.
Perhaps we'll need to make Char 4 bytes wide.
Well, it's quite simple: you don't. Anyway, any attempt at changing the Char type would be a huge breaking change.
If I understood correctly how a Span works, it effectively lives on the stack and can have any size, is this correct? @whoisj making Char 4 bytes will not be useful for this proposal to make String more compact; strings would become larger, indeed. And for interop / unsafe scenarios it seems clear that the useful property of .NET that a String is simply an array of Char should be maintained (fixed should never copy!).
Yes yes, of course - I was being pedantic. 😏
Like pretty much all types, Span has a fixed size. In general, the idea of a variable-sized type is pretty bizarre.
And that implies that char has to have a fixed size. You can't have an array of variable-sized types.
Just my few thoughts:
Background
I came across this idea after seeing that Java has implemented it, see JEP 254 for the full details, plus this InfoQ article for some extra info.
I'm aware of the CoreFX Labs effort to implement a UTF-8 string library, but this proposal doesn't require using a different library, it changes the implementation of all strings in the CoreCLR (a bit ambitious I know!!)
Note: this optimisation works because many types of strings in web-based applications can actually be encoded as `ASCII` or `ISO-8859-1 (Latin-1)`, for instance URLs, HTTP headers, etc. So even with internationalised applications there are still some possible savings.

Implementation
One way of implementing this is to change the internal string layout to include an extra `m_type` field, as in the sketch below. Assuming that there are no padding implications for adding a field like this?
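The original layout snippet is not preserved in this capture; the following is a hedged reconstruction. The existing field names follow CoreCLR's String class, and `m_type` is the proposed addition; its exact placement is an assumption.

```csharp
// Hedged sketch of the proposed layout (not the issue's original snippet):
public sealed class String
{
    private int  m_stringLength;  // existing: number of characters
    private byte m_type;          // proposed: 0 = UTF16, 1 = LATIN1/ASCII
    private char m_firstChar;     // existing: first character of the inline data
    // ... character data follows inline in the object
}
```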
Another approach is to have an abstract `String` base class and then use either a `UTF16` or a `Latin1` implementation via virtual dispatch, but this may have performance implications.

Then the (pseudo) code for the equals method becomes:
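The original pseudo-code block is also missing from this capture; below is a hedged reconstruction of the kind of check described, with `CompareBytes` and `CompareWidened` as hypothetical helpers.

```csharp
// Hedged sketch of the equals pseudo-code (not the issue's original block):
public bool Equals(String other)
{
    if (other == null || this.Length != other.Length)
        return false;

    if (this.m_type == other.m_type)
        return CompareBytes(this, other);     // same representation: compare raw data

    return CompareWidened(this, other);       // mixed: widen the Latin-1 side first
}
```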
Obviously I know this is not a small undertaking; it would require changes to the `String` and `StringBuilder` classes, plus several others. Also, the JIT or intrinsics would have to be involved to ensure that all the methods that require an extra `if (type == LATIN1)` check remain as fast as possible (length, indexOf, etc).

Advantages
Disadvantages

For strings that aren't `ASCII` or `ISO-8859-1 (Latin-1)` this will have an overhead for no gain.

Research
This feature is not worth doing if it doesn't actually save space for common types of .NET applications. So that this can be measured I wrote HeapStringAnalyser, which makes use of CLR MD (see Analysing .NET Memory Dumps with CLR MD for more info). @NickCraver from Stack Overflow was kind enough to run my tool on one of their memory dumps. In this case the `String` data could be compressed from ~3,358 MB down to ~1,789 MB, a pretty impressive saving! But this test would have to be repeated on a far wider range of apps to see if the memory savings are repeated.

Update: there's a really thorough write-up by @evincarofautumn, who has actually implemented this suggestion against the Mono code-base.
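For flavour, here is a hedged sketch of the kind of classification such a heap analysis performs; it is illustrative only, not HeapStringAnalyser's actual code, and it ignores object-header and padding effects.

```csharp
// Sketch: estimate bytes saved if every Latin-1-compatible string were stored 1 byte/char.
static long BytesSavedIfCompacted(string[] heapStrings)
{
    long saved = 0;
    foreach (string s in heapStrings)
    {
        bool latin1 = true;
        foreach (char c in s)
            if (c > 0xFF) { latin1 = false; break; }   // needs full UTF-16 storage

        if (latin1)
            saved += s.Length;                          // 1 byte saved per character
    }
    return saved;
}
```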