
Compact String implementation #6612

mattwarren opened this issue Sep 7, 2016 · 76 comments
Labels: area-Meta, enhancement (Product code improvement that does NOT require public API changes/additions)

@mattwarren
Contributor

Background

I came across this idea after seeing that Java has implemented it; see JEP 254 for the full details, plus this InfoQ article for some extra info.

I'm aware of the CoreFX Labs effort to implement a UTF-8 string library, but this proposal doesn't require using a different library; it changes the implementation of all strings in the CoreCLR (a bit ambitious, I know!)

Note: this optimisation works because many types of strings in web-based applications can actually be encoded as ASCII or ISO-8859-1 (Latin-1), for instance URLs, HTTP headers, etc. So even in internationalised applications there are still some possible savings.

Implementation

One way of implementing this is to change the internal string layout to the following, with an extra m_type field:

public sealed class String : IComparable, ICloneable, IConvertible, IEnumerable
    , IComparable<String>, IEnumerable<char>, IEquatable<String>
{
    [NonSerialized]private int  m_stringLength;
    [NonSerialized]private byte m_type;
    [NonSerialized]private char m_firstChar;
    ....

Assuming that there are no padding implications for adding a field like this?

Another approach is to have an abstract String base class and then use either a UTF16 or Latin1 implementation via virtual dispatch, but this may have performance implications.
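
For illustration only, here is a minimal sketch of what that virtual-dispatch shape could look like; these type and member names are hypothetical and are not CoreCLR's actual internals:

public abstract class CompactString
{
    public abstract int Length { get; }
    public abstract char this[int index] { get; }
}

internal sealed class Latin1String : CompactString
{
    private readonly byte[] _data;                                // one byte per character
    internal Latin1String(byte[] data) => _data = data;
    public override int Length => _data.Length;
    public override char this[int index] => (char)_data[index];  // zero-extend to UTF-16
}

internal sealed class Utf16String : CompactString
{
    private readonly char[] _data;                                // two bytes per character
    internal Utf16String(char[] data) => _data = data;
    public override int Length => _data.Length;
    public override char this[int index] => _data[index];
}

Every Length or indexer access then goes through a virtual call, which is exactly the performance concern with this variant.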

With the m_type field approach, the (pseudo)code for the Equals method becomes:

public bool Equals(string other)
{
    if (other == null || this.m_type != other.m_type)
        return false;
    if (this.m_type == LATIN1)
        return StringLatin1.Equals(this, other);   // compare 1-byte-per-char payloads
    else
        return StringUTF16.Equals(this, other);    // compare 2-byte-per-char payloads
}

Obviously I know this is not a small undertaking; it would require changes to the String and StringBuilder classes, plus several others. Also, the JIT or intrinsics would have to be involved to ensure that all the methods that now require an extra if (type == LATIN1) check remain as fast as possible (Length, IndexOf, etc.).
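
As a rough, self-contained illustration of where that extra check lands, here is a toy model; none of these names exist in CoreCLR, they only show the branch the JIT would need to keep cheap:

using System;

// Toy model of a tagged string payload (hypothetical names, not System.String).
sealed class TaggedString
{
    internal const byte LATIN1 = 0, UTF16 = 1;

    private readonly byte _type;
    private readonly byte[] _latin1;   // payload when _type == LATIN1 (1 byte per char)
    private readonly char[] _utf16;    // payload when _type == UTF16  (2 bytes per char)

    internal TaggedString(byte[] latin1) { _type = LATIN1; _latin1 = latin1; }
    internal TaggedString(char[] utf16)  { _type = UTF16;  _utf16  = utf16; }

    public int Length => _type == LATIN1 ? _latin1.Length : _utf16.Length;

    public int IndexOf(char value)
    {
        if (_type == LATIN1)
        {
            if (value > 0xFF)
                return -1;                          // a non-Latin-1 char can never match
            return Array.IndexOf(_latin1, (byte)value);
        }
        return Array.IndexOf(_utf16, value);
    }
}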

Advantages

  1. Less memory usage overall (per @davidfowl: "At the top of every ASP.NET profile… strings!")
  2. Strings become more cache-friendly, which may give better performance

Disadvantages

  1. Breaks COM or P/Invoke interop that relies on the current string format
  2. If very few strings in the application are ASCII or ISO-8859-1 (Latin-1) this will have an overhead for no gain.
  3. Probably something else I haven't thought about ;-)

Research

This feature is not worth doing if it doesn't actually save space for common types of .NET applications. So that this can be measured I wrote HeapStringAnalyser, which makes use of CLR MD (see Analysing .NET Memory Dumps with CLR MD for more info). @NickCraver from Stack Overflow was kind enough to run my tool on one of their memory dumps, and it gave the following output:

...

Overall 30,703,367 "System.String" objects take up 4,320,235,704 bytes (4,120.10 MB)
Of this underlying byte arrays (as Unicode) take up 3,521,948,162 bytes (3,358.79 MB)
Remaining data (object headers, other fields, etc) is 798,287,542 bytes (761.31 MB), at 26 bytes per object

Actual Encoding that the "System.String" could be stored as (with corresponding data size)
    3,347,868,352 bytes are ASCII
        5,078,902 bytes are ISO-8859-1 (Latin-1)
      169,000,908 bytes are Unicode (UTF-16)
Total: 3,521,948,162 bytes (expected: 3,521,948,162)

Compression Summary:
    1,676,473,627 bytes Compressed (to ISO-8859-1 (Latin-1))
      169,000,908 bytes Uncompressed (as Unicode/UTF-16)
       30,703,367 bytes EXTRA to enable compression (one byte field, per "System.String" object)
Total: 1,876,177,902 bytes, compared to 3,521,948,162 before compression

In this case the String data could be compressed from ~3,358 MB down to ~1,789 MB, a pretty impressive saving! But this test would have to be repeated on a far wider range of apps to see whether the memory savings hold up more generally.
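
For reference, the per-string classification behind those numbers boils down to a single scan of each string's chars; a simplified sketch (not the tool's actual code):

enum CompactEncoding { Ascii, Latin1, Utf16 }

// Narrowest encoding the current contents of a string would fit in.
static CompactEncoding Classify(string s)
{
    var encoding = CompactEncoding.Ascii;
    foreach (char c in s)
    {
        if (c > 0xFF)
            return CompactEncoding.Utf16;      // needs 2 bytes per char, stop early
        if (c > 0x7F)
            encoding = CompactEncoding.Latin1; // fits in 1 byte, but is not ASCII
    }
    return encoding;
}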

Update: there's a really thorough write-up by @evincarofautumn, who has actually implemented this suggestion against the Mono code-base.

@benaadams
Member

benaadams commented Sep 7, 2016

HTTP header field values were updated to ASCII rather than ISO-8859-1 in RFC 7230, §3.2.4 "Field Parsing".

Latin-1 conflicts with UTF-8, whereas ASCII is compatible (so you can cross-detect between the two).

@benaadams
Member

Interestingly, the current string format already has an IsAscii flag on it.

Though it considers ' and - to not be ASCII 😦

However, changing the string to be 8-bit rather than 16-bit based on this flag would likely break fixed statements on strings without special handling (and passing to p/invoke, etc.).

@akoeplinger
Member

akoeplinger commented Sep 7, 2016

ASCII strings are something that @evincarofautumn experimented with a lot for Mono: the proposal is here http://www.mono-project.com/docs/advanced/runtime/docs/ascii-strings/ (and there are a few more details in the mono-devel list thread about it).

I'm sure he can go into great details about the pros/cons 😄

@mattwarren
Contributor Author

However, changing the string to be 8-bit rather than 16-bit based on this flag would likely break fixed statements on strings without special handling (and passing to p/invoke, etc.).

Hmm, I didn't think about compatibility with code that uses fixed against strings; yeah, that's a problem. Looks like the Mono experiment solved it by disabling fixed on strings.

@benaadams
Member

benaadams commented Sep 7, 2016

Looks like the Mono experiment solved it by disabling fixed on strings.

Could always stackalloc length * 2 bytes at the start of the fixed block, then copy and widen with _mm_unpacklo_epi8 & _mm_unpackhi_epi8, then compress back when the fixed block closes. Would be slow though...

Or allow both fixed byte* and fixed char*
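
In terms of the current intrinsics APIs, the widening half of that idea might look roughly like the sketch below; it assumes Sse2.IsSupported, handles exactly 16 bytes, and omits the narrow-back-down step on the way out of the fixed block:

using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

// Widen 16 one-byte chars into 16 UTF-16 code units via punpcklbw/punpckhbw.
// Interleaving with zero bytes turns each byte 0xNN into the little-endian
// 16-bit unit 0x00NN, i.e. the equivalent UTF-16 char (valid for ASCII/Latin-1 data).
static unsafe void Widen16(byte* src, char* dst)
{
    Vector128<byte> bytes = Sse2.LoadVector128(src);
    Vector128<byte> zero  = Vector128<byte>.Zero;
    Sse2.Store((byte*)dst,      Sse2.UnpackLow(bytes, zero));   // _mm_unpacklo_epi8
    Sse2.Store((byte*)dst + 16, Sse2.UnpackHigh(bytes, zero));  // _mm_unpackhi_epi8
}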

@fanoI

fanoI commented Sep 8, 2016

Why use ASCII instead of Latin-1 for the 1-byte version? Latin-1 would permit using one byte for a lot of Western European accented characters such as 'è', 'ò', 'ù'...

Could it make sense to have a third type at this point, UTF-32, for strings containing characters that cannot be represented in UTF-16 (Chinese characters, Klingon, emoticons, ...) without resorting to the kludge of surrogate pairs (which are not really supported in .NET [1])? It probably would not achieve much compression in this case, but the implementation could be more correct.

[1] De facto, .NET strings are really UCS-2, not UTF-16.

@benaadams
Member

benaadams commented Sep 8, 2016

Why use ASCII instead of Latin-1

ASCII and UTF-8 are compatible; Latin-1 and UTF-8 are not, so conversions between the two are required.

UTF-8 is just as compact for non-accented characters and is currently the general interchange format, so using Latin-1 would add conversion load and allocations to and from UTF-8, whereas ASCII requires no conversion.

Also, using Latin-1 in places where only ASCII is allowed (HTTP headers, etc.) adds a validation step in both directions. There are fast ASCII algorithms that don't apply to Latin-1, e.g. case-insensitive compare (ignore a single bit), upper- and lower-casing (flip a bit), etc.

E.g. if a string is known to be ASCII (1 byte per char) it requires no validation steps and its bytes can be used directly; conversion to UTF-8 is instant, as its bytes are already also valid UTF-8.
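
For example, those single-bit tricks come down to bit 0x20; a sketch that is only valid once the chars are already known to be ASCII letters:

// ASCII upper- and lower-case letters differ only in bit 0x20.
static char AsciiToUpper(char c) => (char)(c & ~0x20);   // 'a' -> 'A'
static char AsciiToLower(char c) => (char)(c | 0x20);    // 'A' -> 'a'

// Case-insensitive equality for two ASCII letters: ignore that single bit.
static bool AsciiLetterEqualsIgnoreCase(char a, char b) => ((a ^ b) & ~0x20) == 0;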

@migueldeicaza
Contributor

The reason Mono has disabled fixed on strings is purely to help us locate the code paths that have fixed in as many places as possible and upgrade the code.

The idea is to perform a copy of the data on demand if necessary when a "fixed" that has not been upgraded is found, preserving compatibility at the expense of a string copy.

@migueldeicaza
Contributor

I once also tried to make the case that .NET was UCS-2 and not UTF-16 in front of the whole ECMA committee, only to find out that I was wrong :-)

http://www.unicode.org/faq/utf_bom.html#utf16-11

@HansOlavS

@benaadams, I live in Europe and we use Latin-1. Consider the cultural impact of using Latin-1 from my European perspective. For the US, ASCII might seem like a slam dunk, but if you think about this with European eyes it's not.

Could a compiler switch be a good middle ground? Use ASCII as the default, but open up for Latin-1...

@benaadams
Member

benaadams commented Sep 8, 2016

@HansOlavS I don't think Latin-1 would be great; I'm not thinking of it from a US perspective (I live in Europe also), but more from the perspective of high-throughput interchange formats.

  • String is the correct format for Windows p/invoke, platform internationalisation and UI display.
  • ASCII is the correct format for a lot of protocols with their limited command set and dislike of getting involved with network byte order and endianness (and a straight subset of String with char narrowing).
  • UTF8 is now the general standard for most text; it is both a direct superset of ASCII and endianness-free, as well as having a straightforward vector route to UTF16.

Latin-1 (ISO-8859-1) is also a superset of ASCII, but requires conversion to and from String, UTF-8 and Windows-native encodings. Only 6% of webpages use ISO-8859-1. It also doesn't actually cover all European languages; ISO-8859 has 16 variants for fuller coverage.

Relatedly, you need to be using ISO-8859-7 (Latin/Greek), ISO-8859-15 (Latin-9) or ISO-8859-16 (a fantasy mix) to get the Euro symbol €, as it's not included in Latin-1.

The Latin formats don't have the quick processing properties of ASCII (e.g. case-insensitive compare, uppercasing, lowercasing) and require character-translation lookups; so at best they become a slightly more compressed format, but with translation penalties.

It might be worth having a string8byte opaque intermediate format that supports string operations such as IndexOf, Split, etc., but requires translation to String, UTF-8 and ASCII (for subset validation).

However, I'm not sure that provides a great advantage over using Span<byte> or byte[] and the Encoder namespace?

@HansOlavS

Disclaimer: this is way out of my league... ;)

I get that ASCII aligns neatly with a lot of low-level protocols (like HTTP), but my concern was memory footprint, and reducing it when loading up my .NET app that contains a lot of Latin-1-compatible strings. It would be super-nice to have those stored as 8-bit chars in memory instead of the 16 bits .NET uses now.

If you're familiar with Delphi (Anders Hejlsberg's predecessor to C#), it had 'string' (which was an ANSI string, depending on your computer's localization settings or whatever) and 'widestring', which was Unicode (not sure of the actual encoding).

Actually, my very first reaction to C# and .NET (reading a C# book back in 2002) was that all strings were "Unicode" and 16-bit. I thought that was a strange choice regarding memory consumption.

@HansOlavS

And regarding special cases like €, I don't care as that is an edge-case and I'm happy to have it stored in UTF16.

@migueldeicaza
Contributor

ASCII has other benefits: when marshaling out to UTF-8 it requires no processing. It is a direct copy.


@migueldeicaza
Contributor

Anything that uses the top bit would prevent this optimization from working. That was one of the reasons Mono's work chose to use ASCII.

@HansOlavS

I get the ASCII->UTF-8 compatibility and marshaling and p/invoke and such. I guess I'm still a bit uneasy with .NET's decision to have all strings in memory as 16-bit UTF-16, and I miss the old days in Delphi where I could have an ANSI (Latin-1) string or a widestring and make a conscious decision about where to use what. But maybe this concern is totally beside the point in this thread and I misunderstood.

@fanoI

fanoI commented Sep 8, 2016

By Latin-1 I meant the "proper" implementation done by Microsoft: https://en.wikipedia.org/wiki/Windows-1252.

The issue is about compressing strings, and wasting half of the byte range just to be compatible with ASCII does not make sense to me. To check whether a Latin-1 character is valid ASCII, couldn't you simply check whether its byte is < 127?

Apart from Java, Python has done a similar thing to the one I was thinking of:
http://legacy.python.org/dev/peps/pep-0393/

Characters that are not representable in UCS-2 would use UCS-4 (4 bytes).

If we were interested in being compatible with UTF-8 we could use UTF-8 directly, but it is not the most compact representation: accented characters such as 'è' become 2 bytes, and the more complex ones can become 4 bytes long!

I was pretty sure that characters between 128 and 255 were the same between Latin-1 and UTF-16. Am I wrong?

@zabulus

zabulus commented Sep 8, 2016

Why not use a separate type?

@evincarofautumn
Contributor

@HansOlavS It would be possible to use Latin-1 as the compact encoding. Scanning would be just as fast because the second Unicode block (U+0080–U+00FF) corresponds to the upper range of Latin-1, as @fanoI rightly pointed out. We could also do some clever hacks with the range U+0080–U+009F, which contains rarely used control characters; for example, encoding € as U+0091 (“Private Use 1”).

Some ASCII-only optimisations would no longer work. Marshalling would require an extra scan per string. And since Latin-1 has incomplete coverage for many languages, it might not make much of a difference; we should measure real-world apps.

I don’t like that this optimisation privileges some languages over others, but frankly it’s because the character encodings are already biased.

@zabulus All the existing APIs are using String, so if we change the representation of String then existing code will benefit without modification, as long as it’s not using unsafe/fixed. If we add a new type, only new code will be able to take advantage of the optimisation.

@mattwarren
Contributor Author

mattwarren commented Sep 8, 2016

@zabulus

Why not use a separate type?

Sure, you can use a separate type; that's what the UTF-8 string library is trying to achieve.

However, this proposal was to see if something more wide-ranging and more deeply embedded in the CLR would be possible, with the side effect that you don't have to make any code changes to use it.

@zabulus

zabulus commented Sep 9, 2016

@zabulus All the existing APIs are using String, so if we change the representation of String then existing code will benefit without modification, as long as it’s not using unsafe/fixed. If we add a new type, only new code will be able to take advantage of the optimisation.

I see one more concern about such strings, besides unsafe/fixed: marshaling them with P/Invoke, especially when passing the new strings to the W(ide) versions of the Win32 API.

@fanoI

fanoI commented Sep 9, 2016

As Latin-1 is a subset of UTF-16, the marshalling code, if it needs to convert to a C/C++ wstring, would simply widen each byte by pairing it with a zero byte; one could probably make this faster using SSE intrinsics.
If I'm not wrong, the algorithm there that works for the ASCII part of a UTF-8 string would work to convert a Latin-1 string to UTF-16 too:
https://woboq.com/blog/utf-8-processing-using-simd.html

It is obvious that in some cases this implementation will have a performance cost; it depends on whether it is more important to use less memory or to be faster...

@benaadams
Member

benaadams commented Sep 9, 2016

By Latin-1 I meant the "proper" implementation done by Microsoft: https://en.wikipedia.org/wiki/Windows-1252.

I was pretty sure that characters between 128 and 255 were the same between Latin-1 and UTF-16. Am I wrong?

Characters 128–159 differ between the Windows-1252 variant and UTF-16.

To check whether a Latin-1 character is valid ASCII, couldn't you simply check whether its byte is < 127?

Then, additionally, in the fast-path algorithms you need to do a validation check per character, which makes them less of a fast path... (you can't use a straight int, long, or vector bitwise OR)
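
For reference, the "straight long bitwise or" being referred to is the usual trick of OR-ing 8 bytes at a time and testing the high bits once, which detects any byte >= 0x80 in one pass; a sketch:

using System;
using System.Runtime.InteropServices;

// True if every byte is 0x00-0x7F, i.e. the buffer is pure ASCII.
static bool IsAscii(ReadOnlySpan<byte> bytes)
{
    ReadOnlySpan<ulong> words = MemoryMarshal.Cast<byte, ulong>(bytes);
    ulong accumulated = 0;
    foreach (ulong w in words)
        accumulated |= w;                                   // gather every high bit
    if ((accumulated & 0x8080_8080_8080_8080UL) != 0)
        return false;
    for (int i = words.Length * 8; i < bytes.Length; i++)   // scalar tail
        if (bytes[i] > 0x7F)
            return false;
    return true;
}

A "< 127"-style per-character check, by contrast, can't be folded into a single OR like this, which is the point being made above.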

@mattwarren
Contributor Author

Sorry, I guess I didn't actually ask any questions when I proposed this. I guess I'm wondering:

  1. Has something like this been considered before? Are there fundamental reasons why it can't or won't be done in the CLR?
  2. If not, is this something that would be considered in the future, or are things like System.Text.Utf8 the recommended solution?

@evincarofautumn
Contributor

@fanoI Right, marshalling to UTF-16 would still be cheap. You could do it with ~2 instructions for each 8 bytes of input (movq + punpcklbw). UTF-8 would be slightly more involved.

@mattwarren Around the time we had a discussion on mono-devel-list about my proposal, some MS people (@vancem) indicated that they’ve thought about doing this in the past. The objections were mainly that it’s a large, invasive change (we don’t want to break stability & backward compatibility) and the get_OffsetToStringData API used internally by fixed is less than ideal.

@mattwarren
Contributor Author

mattwarren commented Sep 9, 2016

Around the time we had a discussion on mono-devel-list about my proposal, some MS people (@vancem) indicated that they’ve thought about doing this in the past. The objections were mainly that it’s a large, invasive change (we don’t want to break stability & backward compatibility) and the get_OffsetToStringData API used internally by fixed is less than ideal.

@evincarofautumn that's interesting to know, thanks

@benaadams
Member

Right, marshalling to UTF-16 would still be cheap. You could do it with ~2 instructions for each 8 bytes of input (movq + punpcklbw).

Though not for the Windows-1252 variant, due to the conflict in characters 128–159.

@fanoI

fanoI commented Sep 10, 2016

OK, the Microsoft version of Latin-1, while making more clever use of the limited single-byte range, is not suitable because it is not a subset of UTF-16, while the real Latin-1 (https://en.wikipedia.org/wiki/ISO/IEC_8859-1) is, so that is what should be used. Too bad for '€' and the other characters in the 128–159 block: strings containing them will not be compressible.

@richardtallent

@whoisj, as I saw it, the original question was related to how strings are stored in memory (especially related to caching), not so much about how they are passed around.

Whether we use a traditional encoding (as you're suggesting) or a data structure (as I'm suggesting), translation to and from a normal UCS-2 representation would be needed for most native calls, would it not? (I don't deal with a lot of interop, so I plead ignorance on how string values are usually passed to native libraries on various platforms.)

I think the structured approach could result in faster conversions than converting traditional encodings, because the converter would be able to work with regions of characters (i.e., each of a known width) rather than having to convert one character at a time (as would be required with, say, UTF-8).

What I'm suggesting would also be a new, fully-optional type that wouldn't require rewiring the dark innards of System.String.

@svick
Contributor

svick commented Oct 14, 2016

@richardtallent

What I'm suggesting would also be a new, fully-optional type that wouldn't require rewiring the dark innards of System.String.

See https://github.com/dotnet/coreclr/issues/7083#issuecomment-245713571 for why changing string instead of creating a new type might be preferable.

@danmoseley danmoseley removed their assignment Mar 6, 2017
@danmoseley
Member

danmoseley commented Mar 6, 2017

I've added this to our list of potential post-2.0 investments. This is potentially a BIG one though, and it's certainly not going to get a yay/nay right here in this issue. This remains in Future and on our radar.

@fanoI

fanoI commented Apr 12, 2017

I'm asking myself: to make the String type more compact, doesn't the problem really lie in the Char type? Would it be possible to encode Char in UTF-8?

Doing this we get:

  1. ASCII characters effectively occupy 1 byte
  2. Latin-1 characters (such as 'è') will occupy 2 bytes
  3. It will permit expressing characters that are not accepted today, such as '丽' (0xf0afa080)
  4. String does not change: it remains the array of Chars it is today; the fact that Char is encoded in UTF-8 could be considered an implementation detail (fixed char* buf = str would continue to work without doing a copy; string encoding conversion would be equally fast, since it should not make any difference whether you do UTF-16 -> CP437 or UTF-8 -> CP437; and a for loop over a string would work without surprises even if the string contains weird characters such as '丽')

How do we express this new Char so that it really occupies 1 byte, 2 bytes, 3 bytes... according to the character code? Could a stack-allocated array do this?

struct Char
{
        Char value;
        fixed Byte data[]; // this implies that fixed could be used in a "safe" context

        // Used by compiler to create a Char literal
        unsafe public Char (byte *data, int len) 
        {
                this.data = stackalloc(len);
                this.data = data;
        }
}

Probably there is a reason why no one (Java, Python) has taken this route, and they instead changed String to no longer be an array of characters.

@fanoI

fanoI commented Apr 12, 2017

On second thought, the field 'data' could be declared as Span<byte>, so we have safe code and the effect of having a size of 1, 2, 3 or 4 bytes as needed for the Char represented.

@jnm2
Contributor

jnm2 commented Apr 12, 2017

@fanoI I do not believe the CLR has a facility for two instances of the same type to be different sizes. And we definitely wouldn't want to use indirection to the heap.

@whoisj

whoisj commented Apr 12, 2017

Use of Span<byte> seems to be the most creative solution. 😄

Perhaps we'll need to make char 32 bits like it is in many Linux projects. 😣

@mikedn
Contributor

mikedn commented Apr 12, 2017

How do we express this new Char so that it really occupies 1 byte, 2 bytes, 3 bytes... according to the character code? Could a stack-allocated array do this?

Well, it's quite simple: you don't. If changing char was actually a possibility then the only reasonable option would be to make it 4 bytes in size so it can actually represent a code point.

Anyway, any attempt at changing the String encoding in such a way that chars are no longer fixed size is doomed to fail.

@fanoI

fanoI commented Apr 12, 2017

If I understood correctly how a Span works, it effectively lives on the stack and could have any size. Is this correct?
I'm unsure whether a Span<byte> = { 0xf0, 0xaf, 0xa0, 0x80 } would really occupy 4 bytes on the stack or whether there is overhead.

@whoisj making Char 4 bytes would not be useful for this proposal to make String more compact; strings would become larger, in fact. And for interop / unsafe scenarios it seems clear that the useful property of .NET that a String is simply an array of Char should be maintained (fixed should never copy!).

@whoisj

whoisj commented Apr 12, 2017

making Char 4 bytes would not be useful for this proposal to make ...

Yes yes, of course - I was being pedantic. 😏

@mikedn
Contributor

mikedn commented Apr 12, 2017

If I understood correctly how a Span works, it effectively lives on the stack and could have any size. Is this correct?

Like pretty much all types, Span has a fixed size. In general, the idea of a variable-sized type is pretty bizarre.

it seems clear that the useful property of .NET that a String is simply an array of Char should be maintained (fixed should never copy!).

And that implies that char has to have a fixed size. You can't have an array of a variable-sized type.

@markusschaber

markusschaber commented Oct 19, 2017

Just a few thoughts:

  1. On the question of whether to use ASCII or Latin-1 for the "byte strings": the type tag could differentiate three types (ASCII, Latin-1 and UTF-16). Thus, Latin-1 could still benefit from the smaller storage, while we could also exploit the fact that ASCII is a UTF-8 subset. We could even distinguish a fourth kind for strings which exceed Latin-1 but stay within the BMP (and thus have no surrogates).

  2. For comparison: Python strings are also immutable, but they have a different internal representation. The details are too much to rehash here; it's all described nicely in PEP 393: https://www.python.org/dev/peps/pep-0393/ - while I think it cannot be applied to .NET in a 1:1 fashion, some of the ideas could provide useful inspiration... :-)

  3. Using an "indirect" API (where the string contents are allocated in a separate heap block) might have advantages for string slicing (like "substring"), as we could reference the same backing store (strings are immutable, after all...); see the sketch below.
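
In today's terms, that last point is roughly what ReadOnlyMemory<char> already offers over an existing string: a slice that shares the original backing store instead of copying. A small usage sketch:

string url = "https://example.com/path?query=1";

// The slice shares url's characters; no new character data is allocated.
ReadOnlyMemory<char> path = url.AsMemory(19, 5);   // "/path"

// A copy is only made when a standalone string is actually requested.
string copy = new string(path.Span);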
