-
Notifications
You must be signed in to change notification settings - Fork 73
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
GC story for strings? #145
Comments
Interface Types are supposed to handle the conversion from one string encoding to another |
Yeah, this is somewhat a continuation of my infamous thread on UTF-16 over at interface types. Give it a read if you haven't yet, it's great. There is indeed some overlap between the two, and my impression is that interface types will be useful in the absence of GC, but we'll still need a better story in the presence of GC, most likely integrating with the parts of interface types that apply to both. To give an example: A language may chose to not rely on linear memory at all, but use GC objects exclusively, including to represent strings, and there may be the expectation that these strings can be passed to or received from Web APIs, and that dealing with such strings internally to a module is reasonably efficient. Neither interface types nor GC, in its current forms, are very useful there, unfortunately. |
The GC proposal is not responsible for this kind of interlanguage interop. Different languages represent the same high-level concepts differently in their low-level representations for a variety of reasons. For example, OO languages need strings to have a v-table with references to an i-table and to reflection information, and each of those would need to be organized according the specific OO language's APIs. Some languages/libraries represent strings as trees. Then of course there are the issues with encodings. Given the diversity, it seems difficult for there to be a concept like "wasm string" that different systems could reasonably use for interop without copying or wrapping. This is why the Requirements document explicitly lists this kind of interop as a non-requirement. |
The requirements document also states the opposite in critical success factors, and I am very surprised to see this argument being made. If we don't have a good story for strings in Wasm GC, and given that strings are by far the most common higher data structure in most programming languages (and very common to be consumed and produced by Web APIs), I'd go as far as to say that we are about to make a critical mistake endangering the success of Wasm as part of the Web ecosystem. Not tackling this will significantly undermine the viability of AssemblyScript in favor of systems languages compiling to Wasm for example, and as such some might see this as an unexcusable standards body decision. That's exactly what I was concerned about in my comments on the requirements document. |
There is a middle ground. WebAssembly (and specifically GC) can provide the infrastructure for interop without baking specific interop designs into its core design. So someone can design a specific way to represent strings using GC/wasm types, and then various languages can choose to use that particular representation to facilitate interop. This includes JS, though you'll have to get the browsers to agree to do so. Or someone can design a module interface (with an abstract exported type) for string values and operations that wasm modules can link to and which each browser can implement according to how it implements JS strings. (For this to be efficient, you'll want a better compilation model though, which would be useful for a variety of reasons.) The high-level idea is that interlanguage interop is an ecosystem problem, not a low-level problem, and it is best served in WebAssembly by developing low-level mechanisms that can support various ecosystem solutions. |
To me the middle ground is to ensure good interoperability with JavaScript as a critical success factor, not necessarily between arbitrary languages, and I was under the impression that this is what browser makers and the overwhelming majority of the group want as well. The arguments and suggestions you are bringing up might be worth to explore, of course, who knows what we'll be getting out of it, but so far I don't see a clear merit in making this any more complicated than it needs to be. |
Concur with Ross here. The kind of interoperability you seem to be asking for is already impossible with JS. Most engines, including V8, use a multitude of internal representations for JS strings. The differences are 'papered over' at the source level. |
I am completely stumped by the responses and will have to think about it for a bit. Might well be that betting on Wasm GC and trusting you with the changes made to the requirements document instead of following up on my points was a mistake. |
@dcodeIO It would be helpful if you would more concretely illustrate the use cases you are concerned about. You mention wanting something like |
There seem to be two needs at play here: 1) String manipulation in the GC proposal needs to be competitive with string manipulation in linear memory and 2) JS interop is generally important, and interop with JS strings is particularly important. On the first point, I agree that the current MVP seems to leave a lot on the table since it can only access one array element at a time. It seems that the best solution would be to have instructions akin to the variety of MVP load and store instructions that operate on i8 arrays (or arbitrary integer/number arrays?) rather than linear memories. On the second point, the complexity and diversity of string representations in JS engines make it infeasible and undesireable to expose the underlying structure directly to WebAssembly, as @RossTate pointed out. That leaves two possible paths for first-class JS string interop: A) a "wasm string" type is introduced with instructions corresponding to all the common string operations that can be an abstraction directly on top of an engine's native JS string representation, or B) the JS API defines a coercion from JS strings to GC types and vice versa to make passing strings into and out of WebAssembly simpler. Both of these solutions are problematic. (A) does not fit in with the goal of having the GC proposal represent only low-level representation types and as Ross mentioned, most languages would not be able to use a native "wasm string" as their string type anyway. There's also practically very little difference between (A) and just importing a string type and functions on it from the environment, which will be possible with type imports. (B) is problematic because there is no good way to choose which representation of strings as a GC type to choose. Choosing any particular "blessed" representation would inherently favor some source languages over others, so it would be better to produce a more general solution like interface types, that can work for all languages. |
Actually, I think a first-class immutable string primitive that "papers over" the underlying byte encoding could still help a lot, even for languages that have a different native string representation. For example, while C# might not be able to adopt "wasm string" as the underlying representation of System.String, C# could expose this new type as 'System.Interop.WebAssembly.WasmString' and let C# developers choose if they want to use the new zero-copy WasmString or pay the cost of transcoding to get a classic C# string. I think option (A) is worth pursuing further. |
One specific point: We are discussing strings here on the GC proposal, but the issues are separable. It's very possible strings do not fit in with the goals of the GC proposal (as @tlively just mentioned), but also that @dcodeIO 's requests are valid and should be addressed by adding a string type to wasm in another proposal. My perspective is that we should add a string type to wasm, perhaps doing that in a separate proposal all by itself. The benefit of a string type would be similar to string types in other multi-language VMs (CLR, JVM, and for that matter JS) - they allow quick and easy passing of string data between languages. C won't use it (just like it won't on the CLR), but I understand @dcodeIO to be asking for AssemblyScript to be able to interop with JavaScript the way that TypeScript and many other languages that compile to JS can. That's a good goal! |
Great points above. And I'm realizing (thanks to @tlively's post) that I overlooked a concern raised in the OP about arrays. Sorry @dcodeIO! On arrays, one thought that's come up a few times is having linear memories within GC objects. This came up in at least #94 / #109, and in WebAssembly/interface-types#68 there is some discussion of slices, where I think it would be nice from an interop perspective for a slice to be either a view into a linear memory or a view into non-referential data within a GC object. |
I like the idea of a separate proposal for strings. In Java or equals languages you can hold the "native" string inside the language string like the native array inside an array object. Currently a string from JavaScript can only hold as a erxternref. I can not check an equals, etc. It is like a black box. That with Interfaces types a string must convert on every call. This will never effective. |
I expect even a separate "add a string" proposal is fraught with peril because there are so many fundamentally-incompatible low-level design choices with strings that will ultimately force the proposal to bias towards one subset of languages at the expense of others. One alternative route for JS that I've been imagining is:
The advantage of this approach is that, when targeting a Web embedding, a wasm module can use operations that precisely match JS strings while a module targeting, say, a future Go embedding could import a completely different set of Go-string-appropriate operations. Ultimately, if your aim is to write a single module that can be used across many different hosts, I think you're going to end up with an impedance mismatch unless you make a copy at the wasm/host boundary into and out of the wasm module's own private string representation. But if you're targeting only a single host and the source language is intentionally embracing the host's string type, then the preimport approach seems pretty good. |
I think picking a fixed string type is inevitable, and yes that will bias to a subset of languages, but I disagree with the assumption that that's a downside! JS, the JVM, and the CLR all have a single string type, and all show how successful that can be in letting various languages pass data back and forth easily and efficiently. If we think there's a better example, or that there's something wrong with their approach, I'd be curious to hear details about that. Otherwise I think we should learn from their success - as those VM ecosystems show, some languages will choose to use the blessed string because it's compact, convenient, and efficient, that works with a lot of other code. That's good for the ecosystem, I think. Picking one string type doesn't prevent other languages from not using it - there is no downside to them. And we'll still want Interface Types for the case where a language wants to do strings its way and convert at the boundary. Also I agree the preimport idea is interesting too, and can help with perf issues. |
One obvious downside to fixing strings to something like JS strings is that they effectively have to be encoded as UTF-16 (at least for non-ascii strings) since that's how JS indexes strings. There's no efficient way to hide that from the user without somewhat subtle perf cliffs (e.g. re-encoding the string in a different format). |
Biasing a subset of source languages is a clear downside because it directly contradicts the design goals we publish in our specification. I agree that we should learn from the experiences of the JVM, the CLR, and JS, but only when those lessons are compatible with WebAssembly's firmly established design goals. |
I agree that would be a downside there. A decision here would need to consider lots of factors (most of which I personally don't know enough about atm), including that. In general not every new feature will help every language or not help them equally. Things like Tables, exceptions, SIMD, GC, etc. may not end up used by every language, either because they choose not to or they can't - C can't use exceptions, for example, but it isn't "harmed" by us adding exceptions. Some new features may "bias" towards the languages they make sense for, but that's ok, because wasm as a whole still aims to support all languages. To be clear, I'm not saying "let's do a string type and stop thinking about languages that won't use it!" - we'll still want Interface Types and other things in this area, as I already said. |
I dug into the CLI spec and found that |
@lukewagner How would a preimport change the behavior of the string type in WebAssembly? It can give some performance improvements. But it would ever of type |
You can preimport the (abstract) string type, and that preimport can express requirements on that type such as "must support equality". |
I have no idea how I this should be possible. Also if I look into the Type Imports and Exports. But if this possible then such type would work like an internal string type. |
Yes, there are are various discussions about the shortcomings of that proposal. The "pre"-imports @lukewagner and I mentioned above provide compile-time imports, enabling the importing code to be specialized to the imported types and values/functions. One application of preimports we hope for is things like this, where we can enable the ecosystem to develop a string type that performs roughly just as well as if it were baked in string type and without having to add the type and operations to the core spec of WebAssembly itself. |
The more I think about the pre-import idea, the more I like it as a solution here. Would it allow something like the following (if not, maybe I'm misunderstanding something)?
Does that make sense @lukewagner @RossTate ? And @dcodeIO , if the above is correct, would it give AssemblyScript all it needs? |
I'm not sure I fully grok the type pre-imports. If I understand correctly, a wasm module wanting to use JS strings would need to:
It's unclear to me how to refer to the string type from JS, and how to call native string operations without wrapping them in JavaScript functions that shuffle an argument into |
The need to support "get char at index" is a big part of the concern with any proposal in this space. If "char" in this case is assumed to mean "UTF-16 code unit", and "get char at index" is expected to do O(1) random access, it doesn't leave much room for implementations to have different kinds of strings in practice, regardless of whether the string type is abstract, imported, pre-imported, or anything else. |
First of all, thanks for all the thorough comments pointing out further aspects and seeking for a solution to the problem. Appreciate it! The type import idea is interesting in a way, in that it is about what one would do today (just not as efficient) by importing wrappers around Now I know about the challenges involved in creating something universal, yet my view on this remained surprisingly consistent since the interface types UTF-16 discussion, in that I still reckon that the most reasonable thing to do is to bless the family of pretty much ubiquitous UTF encodings abstracted as an universal Wasm string type, reusing the infrastructure browsers already have for expressing strings in various representations. Or, of course, something slightly more complicated ultimately achieving the same. P.S: On a more general note: Might be that I am in a somewhat special spot due to AS's proximity to JS-land, seeing so many Web(Assembly|)Devs being left behind disappointed who initially were super excited about Wasm. Sure, there is still a constant stream of new people entering Wasm-land, but that won't continue forever and reality will hit us hard. Certainly, i31ref won't save us, but good interop can. It makes me sad to see that we are failing over something that really shouldn't be anything a bunch of the smartest theorists and programmers can't solve. Feeling tempted to close this paragraph with a famous quote by Brendan Eich, but won't, because it may be invalidated if we only do what the quote says so we can eventually add a conditional or even replace a word. |
A native string type (and even a host provided string module) sound like a really bad idea to me. It risks this type being used pervasively which brings more code inside the Wasm module(s) in contact with it, which means more code locations needing to deal directly with conversions, which brings inefficiency, code bloat, and most importantly, since these unknown string representations can generate arbitrary errors, more code that needs to be error aware, and potentially has to deal with errors it wasn't tested with. The advantage of arrays of i8 or i16 is exactly that they're binary, they can't error while in that format. In fact, they may actual non-text binary which is often useful. For an unknown string representation you have no guarantee what "bits" it is going to preserve. This will bring non-determinism into Wasm. When it comes to strings, it's important to realize that the bulk of code just stores them or passes them on, and only a small amount of code needs to care about the actual text contained in it, and understanding unicode details. You thus want to optimize for a code structure where the bulk of code does NOT need to deal with encoding errors of implementation formats. A built-in implementation dependent string will do the opposite. Wasm GC should not be an extension of JS, it is a low-level building block for a wide variety of languages that can choose their own representations. Let's not burden them with JS implementation details. |
Yes, not clear to me either. But I assume this would be something like providing the pre-import in JS from
My hope (which @lukewagner can correct) is that that wouldn't be a problem with the pre-import approach. The imported type would have a "get char" but it would be abstract and not make any claims as to speed of access. One could provide a pre-import that is
I suspect you are right, there is a vacuum here that will want to be filled with a string type. But if pre-imports can work the way I outlined, I think it can work out without speccing such a type. On the web we'll pre-import JS strings, and if someone wants to run their wasm off the web too, and have it run the exact same way, they could link and pre-import a wasm module that implements something equivalent to the JS string type (that's why I asked about this before). It's possible that that will lead - in an organic way, without any spec work - to a JS string-like type being popular in the whole wasm ecosystem. Or maybe wasm on the server (say in the CLR or JVM) will prefer another type, who knows. |
@dcodeIO while thinking on that I found myself wondering how important the actual string type is, and am wondering now whether a useful way of thinking is to model strings as immutable arrays of u8 or u16, which can be bridged to host strings via the Interface Types conversions. Arrays already provide the indexed element load instruction needed to iterate or do random access, and the most important operations to optimize are bulk-memory ops like concat & slice that are common with strings. At the Interface Types layer, raising an A UTF-8 conversion would be applied on raising/lowering from/to I think this might give me what I would want for a JS subset if I weren't willing to directly use JS strings via function imports, and if engines optimize it it's potentially very cheap when using 16-bit strings. However there may be other benefits to a more directly shared type that's string-specific. |
I agree that part of the challenges can, theoretically, be solved by interface types, but not all. For instance, we're still left with:
|
@dcodeIO Thanks for writing that up! I wouldn't object to add WTF-8 and WTF-16 types to wasm. But adding two types means you're not giving the ecosystem a clear signal, and likely you'd end up with two ecosystems of strings, which means copies between them. Also, both of those types would not have optimal interop with strings on the CLR, JVM, or Python, for example (as your proposal says, they'd need to perform a check at the boundary - and also copy if there is an 8/16 mismatch), so I guess those are not in the 90% you expect the proposal to cover? But for many people those are very important platforms. I don't have numbers, but I'd strongly suspect the sum of those is far above 10% no matter how you look at it. (To really get something like 90%, I'd guess you need to also add UTF-8 and UTF-16 - but that just helps with some hosts, and still doesn't fully address the checks and copies between the 4 types.) Overall I think that's a reasonable proposal, and I wouldn't have a problem with it. But the benefits seem moderate. I'd lean towards waiting on doing anything to see how the ecosystem evolves, which string types are used in practice, and how bad the overhead of copying etc. is - we can always consider adding these types later. |
I don't think that it is WebAssembly's business to give a clear signal (-> to bias, especially not against JS). We had that discussion earlier, and people agreed iirc. I do not agree with your assessment that we'll end up with two ecosystems, because definitely-not-a-proposal is designed to be exactly inclusive across languages, engines and the Web using either encoding, making the bulk of languages, engines and the Web work together seamlessly.
The check at the boundary is much cheaper than alloc+copy->garbage, since the common case is that strings are well-formed. Still a net-win. This is also noted in the document, as are the reasons for picking WTF. Indeed we should make a list of languages and the encodings these (would) use (in WebAssembly with Universal Strings). For instance, I'm not so certain that all of the CLR, JVM and Python actually/strictly require well-formedness, or if one does, that this cannot be handled more efficiently otherwise within their WebAssembly target's string implementation.
I am happy to discuss a PR adding them. Needs to figures out trapping behavior. For instance, trapping behavior would immediately disqualify some languages, while picking WTF does not disqualify anyone. Previous discussions also indicated that a small set of initial encodings is preferred by some group members, but I (remember my words: I'm the good guy here) am happy to discuss and be convinced! Not gatekeeping my stuff at all.
Have been wondering if you'd make that argument actually. Interesting. |
Very interesting - this may be a crucial point here! I disagree with that statement. Allocation and collection may be pretty cheap in a highly-tuned GC, Maybe I'm making too much of an So in general, I'm worried about
It may be possible to avoid the full check, for example storing a "valid" bit. But that may increase the object size and/or add VM complexity. I guess that's what you refer to here:
I don't know enough about those details, but I'm curious! |
What I can tell is that it isn't that easy in for instance TLSF, which is what I have the most experience with, but that's also more a predictable-performance MM than anything else. Other MMs typically optimize for small block sizes, and often can't guarantee the same fast operation for larger ones. But even then, if a string is small, the overhead of checking is small anyway. Regarding a GC, I guess Immix is very close to just bump allocations, but even there what you say creates fragmentation, and more fragmentation means that it has to obtain a new block from the global allocator more often, and will have to opportunistically defragment the block eventually, again at a cost. As such, oversimplifying to a comparison with an ideal bump allocator doesn't seem like a well-thought-through argument to me. More allocations, more copies and more garbage is pretty much always bad as far as I can tell. I can see if I can gather more intel on this, and if you or your coworkers can as well, that'd be great!
Interesting idea! Might even be possible to set the bit upon construction or re-encoding of the string, making definitely-not-a-proposal even more awesome! 👍 (mind making a PR?) |
There are a few more technical details, but allocation in the nursery really does become essentially a bump, and there is no fragmentation issue:
(of course those benefits only accrue to the nursery; heap allocation in general definitely has fragmentation issues, even in a GC language) While searching I found this quote which I think summarizes it very well:
|
Thanks, these are interesting pieces of information and I'll look into these in more detail. As a first response I can offer
from the first piece, so there's still some GC pressure there by filling up the nursery faster, triggering a scan of the nursery more often which has to promote objects to the tenured region more often and stuff like that. Not saying that this isn't great, because it looks impressive to me in general, but also not |
Gone through it now and made notes: One (more generally) interesting aspect is that GGC assumes that copying is possible (it typically is if everything is JS with no external references), which might not be the case everywhere, especially off the web. Wondering if my observation has impact in practice, though, because copying may be hidden behind the
Then, this summarizes my first impression above quite well: One reduces the amounts of major GCs to do, but still has the cost of minor GC including whatever a minor GC implies, like promoting objects or more work to do overall simply because the concept is there. Might be small, though, but it's not free. Then there are also situations like
indicating not immediately obvious overhead in minor GCs that are more complex than bump allocating, leading to
indicating additional work to be necessary to enable the concept. and that there are other implied drawbacks again. You are still right that
but then the conclusion of the ideal-case (only) micro-benchmark still correctly (and I value that!) states
Hope this writeup, which is really just my impression from reading the pieces (and I only have good enough experience, I'm not an expert), is useful :) |
On another note on WTF, I found this document on unicode.org, and under 2.7 Unicode Strings it says:
So seems Java and C# don't strictly need well-formedness just as JS. Perhaps it's even fair to conclude that this is rather the norm than the exception, so the WTF family of encodings makes perfect sense. Also mentions other benefits I haven't brought up yet. |
Why should there be different encoded strings? There can be different factory instructions for an abstract string type. The internal representation of the string should be a job of the host. Java use internally also different encodings for strings. A |
The idea behind both 8 bit and 16 bit encodings (these are the two used in practice) is a use case like this: If we specify only an 8-bit encoding, and a module using a 16-bit encoding in its language passes a string to another module using a 16-bit encoding in its language, and now the second module wants to know the length of the string in its 16-bit encoding again for instance, then we end up in a situation where the first module has to (implicitly) re-encode, and the second module has to implicitly re-encode again. That's essentially the double re-encoding problem motivating WebAssembly/interface-types#13 years ago, and it also applies if no encoding is specced but the engine picks one. |
The JavaScript language has also different implementations of a string. And there is no re-encode needed to return the length. You think too much in low level programming language. A sting is a high level object on the . It can be a list of 8 bit and 16 bit chunks and more more. A length is only an attribute of it. Important is that a string reference is immutable. A string reference from u8 and from u16 can be equals that a ref.eq must work between both. Also a concat must be possible. If I need to know the encoding from the creating module then every module must re-encode it to its own encoding. Then there are exists more as one string reference type. But last not least. If I pass such string to JavaScript then the call to length() must also work without knowing of the crating encoding. |
Noticed that you are working on JWebAssembly, which I think will have to solve very much the same challenges AssemblyScript has. If you want, I can offer to meet up with you on say Discord so we can discuss this in more detail? Curious to see how your adventure is going, and to have your ideas. (btw, seems we are both from .de) |
@dcodeIO We can discuss this on https://discordapp.com/channels/453584038356058112/731539251698729030/765641907065192460 or any other location. |
Agree with @dcodeIO and @Horcrux7 on a lot of points here. Wasm needs efficient immutable UTF-16-ish strings which are fully compatible with JS/DOM string in both directions. It doesn't make sense to copy strings on Wasm boundary, for languages, like AssemblyScript, Java and Kotlin, which are fine with using JS string encoding. @jakobkummerow I wondering, is some form of this implementable in engines? |
@skuzmich : Sure, we obviously have support for JavaScript strings, and exposing those to Wasm would be easy. We don't have UTF-8/WTF-8 support yet, and adding that likely wouldn't be pleasant in terms of implementation complexity, but it's certainly doable. I think it mostly boils down to a spec question: what kinds of strings and string operations do we want to have in Wasm? I agree that for JS<->Wasm interop in particular, having compatible strings would likely be very handy. I have no particular opinion on what Wasm strings should look like. Following the discussion here with interest. Given the complexities around strings, I'm inclined to think that while they feel related to the GC proposal, they probably deserve to be split out into their own proposal, just to let us focus on one thing at a time. As an example: JavaScript, by spec, requires UCS2/WTF-16, so JS engines support that. It turns out that lots of memory is occupied by strings in many web apps, and many of those strings are one-byte strings, so V8 actually uses a one-byte ("latin1") encoding internally when possible. That makes everything much more complex, and requires copying conversions every now and then, but the memory savings are considered worth it. I see this as one data point illustrating how tough the tradeoff space is, and what kinds of ugly compromises Wasm strings may have to make in order to satisfy a number of competing goals. |
My understanding is that 1-byte strings in JS engines are a 1-byte subset of Unicode (eg, ISO 8859-1) which maintains the core invariants:
Copying from a 1-byte to a 2-byte string would only be required if you were already copying it for some reason, such as a concatenation. None of these conditions would be true of UTF-8/WTF-8 strings which seem to be proposed as the alternative to UCS-2/WTF-16. |
@jakobkummerow thanks! People had concern that multiple internal representation of JS strings could prevent Wasm from efficiently using them. It is reassuring to hear that it is doable. JS String operations that Kotlin's String class would care about are:
Last two are easily implemented using others, but if they are optimized for some variants of internal representation, it would be nice to have them in Wasm. |
Btw this issue may relate to discussion about UTF16 vs UTF8 interop here: |
I think the interface types question about UTF-16 vs WTF-16 is quite separate from the question about whether, within a component, we want to allow passing strings across certain language boundaries. Here's my mental model of the situation, but I may be misunderstanding the point of components, so please correct me if this doesn't match others' intentions. At big component boundaries, surrogate-checked UTF-16 makes sense to me, maybe with an opt-in for WTF-16 omitting surrogate checks. Coming from JS, this regularity will cost a linear-time scan, in addition to probably flattening (of the internal rope data structure) and maybe copying the string (depending what it ends up linking to). I'd be fine with including an unchecked WTF-16 domstring type initially, but I have no strong opinions, but I don't really see the rush. Within a component, it will often be important to mix multiple programming languages, but it is even more important to avoid linear-time copies, checks, conversions between 8-bit and 16-bit string representations, and flattening of ropes. The importance is greater because there is more frequent expected back-and-forth within small modules of a component. This back-and-forth is amplified by how there can be reference circularities within a component, whereas there is a hierarchical relationship between components. One example of such tight back-and-forth interaction which @dcodeIO provides is interaction is between AssemblyScript and JavaScript. In Igalia, we're interested in providing good cross-language support between C/C++ and JavaScript. In both cases, copying, flattening, converting between 8-bit and 16-bit, or doing a linear-time check over the string to detect missing surrogate pairs would add significant cost, due to the intention to enable lots and lots of strings crossing the boundary. For the within-component case, the ideal solution would allow not just malformed surrogate sequences and 8-bit strings as V8 has, but also would allow JS Strings implemented as ropes (e.g., as the result of concatenation) to be passed to Wasm, possibly manipulated, and passed back, without verifying paired surrogates or even flattening the string. To add a new type: my own personal aesthetics would be to start simpler, with a single built-in type and instructions to manipulate it, rather than the more complex pre-import strategy, but either option could be OK. (I'm OK with a built-in type being a little opinionated; I'm not sure if true neutrality is ever possible.) I agree with others on this thread that a solution in this space is related to the Wasm GC proposal, but probably makes sense to pursue separately from this repository. In principle, such a separate proposal wouldn't actually depend on the GC proposal, since it could be refcounted, but I think we'd want to make it a subtype of externref (so, overall, it's very similar to typed function references). This new string type (or pre-import scheme) could provide a lot of value for both @dcodeIO 's and Igalia's use cases. I suggest that we continue with the current MVP approach, to start out both Wasm GC and interface types without a solution for zero-copy, fully expressive string sharing across languages and components, while working in parallel on further proposals in this area such as the domstring interface type and a built-in Wasm string type (or pre-import scheme). |
Now that we have https://github.com/WebAssembly/stringref, I'll go ahead and close this issue. Without that additional proposal, the options for strings include importing them as anyrefs or using byte arrays. |
As of the MVP document, strings can be expressed as either an
(array i8)
or(array i16)
per a language's string encoding, but with only one character at a time being accessible witharray.get
and no compatibility with JS strings I'd imagine that the MVP will not be very useful beyond pure experimentation if a language wants to target GC primarily. Think comparing, hashing, substrings, etc.The Post-MVP document for instance mentions bulk-operations, in particular
array.copy
which is a start but won't help with strings, and to my surprise doesn't mention strings so far.Hence I have questions:
The text was updated successfully, but these errors were encountered: