Display list serialization / safety. #1800
Comments
@gankro opened an issue so we have a single place to discuss DL serialization :) |
cc @jrmuizel |
Yeah, I was always worried about removing the flat arrays. I wonder if what we really want is something like Cap'n Proto, which is a zero copy serialization framework. I had looked into it before but had concerns about overengineering things. But if it turns out to improve maintainability, it might be worth it... |
Background

As far as correctness is concerned, the largest constraint is that the display list needs to support being sent over IPC. This means data must be serialized into an array of bytes in some way (no pointers). And for security reasons, if one process is compromised, it shouldn't be able to trigger Undefined Behaviour in the other process. So basically bad DLs should either fail to parse, fail to process, or produce a (probably junk) picture. (GPU attacking content through the DL should be ~impossible regardless of what we do.)

Currently all that bincode really buys us here is enum/bool tag validation.

The current design of the display lists is two arrays: DisplayItems followed by a GlyphCache. The GlyphCache is basically a list of every glyph that will be found somewhere in the DisplayItems (

During construction we store the DisplayItems in serialized form, and the GlyphCache as a

The reason DisplayItems are always serialized, at this point, is that our current serialization strategy (bincode) significantly compresses them, saving on memory usage (which is ultimately time savings at this scale). Compression is so significant because the SpecificDisplayItem enum is lopsided.

Some DisplayItems also come with auxiliary arrays. For instance, a TextDisplayItem effectively has a

During consumption, we deserialize contents "on demand". This avoids allocating an entire array for the deserialized results (which would be huge).

Problems With The Current Design

Bincode is pretty slow. A lot of this is terrible codegen. @jrmuizel is looking into this.

The glyph cache is wasteful in several ways.

We copy on "both sides" of the IPC. Copying on the backend is probably necessary for safety. The frontend is more complicated (more on that later).

For auxiliary arrays, we have to jump over and potentially deserialize them, even if we don't access them right away. This is really bad for cache usage.
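To make the "lopsided enum" point concrete, here is a minimal sketch. The `Item` enum and its variants are hypothetical stand-ins, not WebRender's real SpecificDisplayItem, and the one-byte tag encoding is hand-rolled for illustration rather than bincode's actual wire format. It only shows why a tag-plus-payload encoding can be much smaller than the in-memory enum, where every variant pays for the largest one.

```rust
use std::mem::size_of;

// Hypothetical stand-ins for display items; not WebRender's real types.
pub enum Item {
    PopStackingContext,                  // no payload
    Rect { color: [f32; 4] },            // 16 bytes of payload
    PushStackingContext {
        transform: [f32; 16],            // 64 bytes
        perspective: [f32; 16],          // 64 bytes
    },
}

/// In-memory size: every variant occupies as much as the largest one.
pub fn in_memory_size() -> usize {
    size_of::<Item>()
}

/// Hand-rolled encoding: a one-byte tag followed by only the payload
/// that the variant actually carries.
pub fn encode(item: &Item, out: &mut Vec<u8>) {
    match item {
        Item::PopStackingContext => out.push(0),
        Item::Rect { color } => {
            out.push(1);
            for c in color {
                out.extend_from_slice(&c.to_le_bytes());
            }
        }
        Item::PushStackingContext { transform, perspective } => {
            out.push(2);
            for v in transform.iter().chain(perspective.iter()) {
                out.extend_from_slice(&v.to_le_bytes());
            }
        }
    }
}
```

With these made-up types, a `PopStackingContext` encodes to a single byte, while the same value sitting in a `Vec<Item>` takes at least 128 bytes.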
Potential Design Changes

So there are, in my mind, two big ideas we can look at:

N-Arrays

Rather than having a linear display list, we break the display list into n arrays, one for each kind of DisplayItem (and potentially one for Clips?). The actual display list then becomes a

In the backend/ipc-channel, these arrays will all be concatenated into one, with a table of byte offsets at the front.

Some loose notes:

We get some compression "for free" under this system, because some of the lopsidedness is lost. For instance the Clip array won't have to give each clip space for PushStackingContext's two Matrices. But PushStackingContext's array will still be lopsided because one of those is itself an Option. Those could also be outlined into their own array, but there are probably diminishing returns (or net losses) in taking this too far.

We can eliminate serde completely for POD display items. Transmuting their entire array will be safe.

The GlyphCache doesn't need to exist anymore, as the backend can just jump to the TextDisplayItem array and process that. This saves us some bandwidth.

We don't need to "walk over" auxiliary arrays that we don't want to process right now, as they'll be out of line. However, in exchange we may take more cache misses jumping to one of the arrays (but we should generally process the contents of a given array monotonically, so that might be fine?).

We lose some bandwidth recording all the array indices. However, maybe not: if a kind of item isn't multiply referenced (everything but clips?), we can just remember where we last were in that item's array! So basically the backend would have n offsets it maintains as it processes. When it sees "TextDisplayItem"'s tag in the DL, it reads out

N-arrays can also be done partially. For instance, only outlining clips or other problematic items.

This whole thing is a lot more of a maintenance burden than the current linear system, though.
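A rough sketch of the "n offsets" idea from the notes above, under heavy assumptions: `Kind`, `RectItem`, `TextItem`, `NArrayList` and `Walker` are all made-up names, there are only two item kinds, and nothing here is serialized. The main list holds only tags in paint order, each kind lives in its own array, and the consumer keeps one cursor per array instead of storing indices.

```rust
// Hypothetical item kinds; the real display list has many more.
#[derive(Clone, Copy, Debug, PartialEq)]
pub enum Kind { Rect, Text }

#[derive(Debug, PartialEq)]
pub struct RectItem { pub width: f32 }

#[derive(Debug, PartialEq)]
pub struct TextItem { pub glyphs: Vec<u32> }

#[derive(Default)]
pub struct NArrayList {
    pub order: Vec<Kind>,      // the "actual" display list: tags only
    pub rects: Vec<RectItem>,  // one array per kind of item
    pub texts: Vec<TextItem>,
}

impl NArrayList {
    pub fn push_rect(&mut self, item: RectItem) {
        self.order.push(Kind::Rect);
        self.rects.push(item);
    }
    pub fn push_text(&mut self, item: TextItem) {
        self.order.push(Kind::Text);
        self.texts.push(item);
    }
    pub fn walk(&self) -> Walker<'_> {
        Walker { list: self, pos: 0, next_rect: 0, next_text: 0 }
    }
}

#[derive(Debug, PartialEq)]
pub enum ItemRef<'a> { Rect(&'a RectItem), Text(&'a TextItem) }

/// Consumer side: one cursor per array, advanced whenever that tag is seen.
pub struct Walker<'a> {
    list: &'a NArrayList,
    pos: usize,
    next_rect: usize,
    next_text: usize,
}

impl<'a> Iterator for Walker<'a> {
    type Item = ItemRef<'a>;
    fn next(&mut self) -> Option<ItemRef<'a>> {
        let list = self.list; // copy the shared reference out of `self`
        let kind = *list.order.get(self.pos)?;
        self.pos += 1;
        match kind {
            Kind::Rect => {
                // A tag without a matching entry ends the walk instead of panicking.
                let item = list.rects.get(self.next_rect)?;
                self.next_rect += 1;
                Some(ItemRef::Rect(item))
            }
            Kind::Text => {
                let item = list.texts.get(self.next_text)?;
                self.next_text += 1;
                Some(ItemRef::Text(item))
            }
        }
    }
}
```

Multiply referenced items (clips, per the note above) would still need explicit indices; this sketch only covers the "remember where we last were" case.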
IPCBuilder

In theory we could give DisplayListBuilder an actual IPC channel, and have it push/serialize items directly into that. Since the backend needs to copy on its side anyway, and also internally receives the DL in chunks, this is saving us a huge allocation and copy on the frontend with little change to the backend.

Copying on the frontend is currently hard to avoid because gecko wants to "hold on" to completed display lists to combine them. It's possible @jrmuizel's work will eliminate this? I'm not sure if there are other complications.

Note that this is slightly incompatible with the N-Arrays approach. Although it may just be as simple as "the backend maintains n arrays". But that might lead to too much IPC traffic, since it will be harder to buffer? |
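For illustration only, the IPCBuilder idea might look roughly like this. The sketch uses `std::sync::mpsc` as a stand-in for a real IPC channel, and `StreamingBuilder`, `push_item`, `finish` and `receive_all` are hypothetical names, not existing WebRender or ipc-channel APIs. Items are buffered into fixed-size chunks and sent as they fill up, so the frontend never holds the whole serialized list.

```rust
use std::sync::mpsc::{Receiver, Sender};

/// Hypothetical builder that streams serialized items in chunks.
pub struct StreamingBuilder {
    tx: Sender<Vec<u8>>,
    buf: Vec<u8>,
    chunk_size: usize,
}

impl StreamingBuilder {
    pub fn new(tx: Sender<Vec<u8>>, chunk_size: usize) -> Self {
        StreamingBuilder { tx, buf: Vec::with_capacity(chunk_size), chunk_size }
    }

    /// Appends already-serialized item bytes, flushing once a chunk is full.
    pub fn push_item(&mut self, bytes: &[u8]) {
        self.buf.extend_from_slice(bytes);
        if self.buf.len() >= self.chunk_size {
            self.flush();
        }
    }

    fn flush(&mut self) {
        if !self.buf.is_empty() {
            let chunk = std::mem::replace(&mut self.buf, Vec::with_capacity(self.chunk_size));
            // Sketch only: a real builder would have to handle a closed channel.
            let _ = self.tx.send(chunk);
        }
    }

    /// Sends the last partial chunk; dropping `self` closes the channel.
    pub fn finish(mut self) {
        self.flush();
    }
}

/// Backend side: copies each chunk into its own buffer as it arrives.
pub fn receive_all(rx: Receiver<Vec<u8>>) -> Vec<u8> {
    let mut out = Vec::new();
    for chunk in rx {
        out.extend_from_slice(&chunk);
    }
    out
}
```

The chunk size is the buffering knob mentioned above: with n arrays there would be n such buffers, which is where the concern about extra IPC traffic comes from.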
Thanks for writing this up! It looks like your last sentence in the Background section is incomplete. I'll follow up in detail tomorrow. |
So I tried out a rustc with rust-lang/rust#45012 and while it may have helped the serialization code a little, the generated code is still bad. Right now it looks like llvm is getting overwhelmed when the structs are big or complicated. I took a quick look at the deserialization code and it still didn't seem great, but I didn't look closely. I'll investigate further. |
Yeah, it looks like if you're serializing a struct that has two Options in it llvm gets all confused and generates bad code. I'll file a rust issue about it and maybe @sunfishcode can help. |
Fascinating. When I looked into this stuff last I pretty quickly ran into fundamental issues around aliasing, which led me to the conclusion that this stuff is better done on MIR. But if there's lower hanging fruit in LLVM that would be great. |
I filed rust-lang/rust#45068 about the serialization codegen issue. |
Now that #1830 has landed our serialization code is decent enough. I've since started looking at the deserialization code. The current serde/bincode approach is going to give llvm a really rough job of generating good code. The generated rust code involves lots of functions returning the values of their fields. Because rust doesn't do a good job with return-value-optimization (rust-lang/rust#34861), it looks like we end up copying a lot of data around during deserialization. Once the sub structs get big enough llvm starts using calls to memcpy and our chances of good performance decrease further. An additional impediment to good codegen will be the branchiness caused by the fine-grained error handling. I'm going to try to put together a stand-alone test that shows off the bad codegen and we can see what can be done on the rust side. At the same time we probably need to have a solution that doesn't require heroics from the compiler. One can imagine a serde-style macro that, instead of returning a value, would take an &mut and fill in the fields and only return a bool on success or failure. We could pass in a SpecificDisplayItem initialized to mem::uninitialized(). This should be enough to avoid having all values being moved all over the place. @dtolnay how easy would it be to build an unsafe deserializer like this? |
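A hand-written sketch of the "take an &mut and return a bool" shape described above, to show what such a macro might expand to. `RectItem`, `read_f32`, `read_bool` and `fill_rect` are hypothetical names and the byte layout is invented. To keep it safe, the target starts from `Default` rather than `mem::uninitialized()` (which has since been deprecated in favour of `MaybeUninit`), so this is the safe cousin of the proposal rather than the unsafe deserializer itself.

```rust
// Hypothetical item with an invented wire layout: x, y as little-endian
// f32, then one byte for the bool.
#[derive(Debug, Default, PartialEq)]
pub struct RectItem {
    pub x: f32,
    pub y: f32,
    pub visible: bool,
}

/// Reads one f32 into `out`, advancing `input`. Returns false on short input.
fn read_f32(input: &mut &[u8], out: &mut f32) -> bool {
    if input.len() < 4 {
        return false;
    }
    let (head, rest) = input.split_at(4);
    *out = f32::from_le_bytes([head[0], head[1], head[2], head[3]]);
    *input = rest;
    true
}

/// Reads one bool into `out`; only the bytes 0 and 1 are accepted.
fn read_bool(input: &mut &[u8], out: &mut bool) -> bool {
    match input.split_first() {
        Some((&0, rest)) => { *out = false; *input = rest; true }
        Some((&1, rest)) => { *out = true; *input = rest; true }
        _ => false,
    }
}

/// Fills `out` field by field instead of returning a RectItem by value,
/// so no intermediate struct has to be moved out of nested calls.
pub fn fill_rect(input: &mut &[u8], out: &mut RectItem) -> bool {
    read_f32(input, &mut out.x)
        && read_f32(input, &mut out.y)
        && read_bool(input, &mut out.visible)
}
```

On failure `out` may be partially written, so a caller would have to discard it; that is part of the trade-off against returning a fully built value.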
@jrmuizel Thanks for following up on this stuff. I tend to agree with you in regards to the compiler heroics. From talking to @gankro, the safety guarantees we get from using serde/bincode are that we know that enums / bools are valid values (although they could be modified to a different valid enum value, without detection) and perhaps some extra level of bounds checking. There's nothing that gets validated for ints / floats, since any bit pattern in those is valid. There are, of course, other benefits we get too (such as compression). The other extreme from using serde / bincode seems to be a straight copy / transmute of the DL, and we manually write (unsafe) validation helpers for each struct type in the DL that check the validity of each field type. I think we could reasonably do this validation while we walk the DL during the flatten stage. It should be fairly easy to make this close to optimal in terms of performance while allowing validation? I know it's far from ideal (having to manually audit the validation code, and keep it up to date), but it seems like it could work, as a worst case, if we can't sort out these codegen issues? |
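As a sketch of what one of those per-struct validation helpers could look like, assuming a made-up `LineItem` with an invented 6-byte layout (neither is WebRender's real definition): the float is accepted as-is because any bit pattern is a valid f32, while the enum tag and the bool byte are checked before a typed value is produced. This version stays in safe Rust by parsing bytes rather than transmuting, so it illustrates the checks rather than the zero-copy part.

```rust
use std::convert::TryFrom;

// Hypothetical enum with explicit tag values.
#[derive(Clone, Copy, Debug, PartialEq)]
#[repr(u8)]
pub enum Style { Solid = 0, Dotted = 1, Dashed = 2 }

impl TryFrom<u8> for Style {
    type Error = ();
    fn try_from(tag: u8) -> Result<Self, ()> {
        match tag {
            0 => Ok(Style::Solid),
            1 => Ok(Style::Dotted),
            2 => Ok(Style::Dashed),
            _ => Err(()), // any other tag would be UB if transmuted blindly
        }
    }
}

#[derive(Debug, PartialEq)]
pub struct LineItem {
    pub width: f32,
    pub style: Style,
    pub wavy: bool,
}

/// Validates an invented 6-byte record: 4 bytes of width, 1 style tag, 1 bool.
pub fn validate_line(raw: &[u8]) -> Option<LineItem> {
    if raw.len() != 6 {
        return None;
    }
    // Any bit pattern is a valid f32, so there is nothing to check here.
    let width = f32::from_le_bytes([raw[0], raw[1], raw[2], raw[3]]);
    let style = Style::try_from(raw[4]).ok()?;
    let wavy = match raw[5] {
        0 => false,
        1 => true,
        _ => return None,
    };
    Some(LineItem { width, style, wavy })
}
```

Note that a compromised sender can still swap one valid tag for another without detection, which matches the limitation described above.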
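Serde is interested in supporting this style -- tracked in serde-rs/serde#855. We had been looking at it for the csv crate, not to avoid memcpy of the return values but to reuse memory allocations across rows deserialized from the same csv file. The return value is Result<(), D::Error> which for Bincode is 8 bytes so it should be just as efficient as a bool. Does this seem like it would fit the bill? |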
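The size claim for a `Result<(), E>` with a boxed error can be checked with a small sketch. `DecodeErrorKind` and `DecodeError` below are hypothetical stand-ins for a boxed error type, not bincode's own definitions; the point is only that when the error is a `Box`, the `Ok(())` case fits in the pointer's null niche, so the whole `Result` is pointer-sized (8 bytes on a 64-bit target).

```rust
use std::mem::size_of;

// Hypothetical error type, boxed so that the Result stays small.
#[derive(Debug)]
pub enum DecodeErrorKind {
    UnexpectedEof,
    InvalidTag(u32),
    Custom(String),
}

pub type DecodeError = Box<DecodeErrorKind>;

/// Size of the return value of an in-place deserializer.
pub fn result_size() -> usize {
    size_of::<Result<(), DecodeError>>()
}

/// Example of the in-place signature: nothing is returned by value on success.
pub fn fill_u32(input: &[u8], out: &mut u32) -> Result<(), DecodeError> {
    if input.len() < 4 {
        return Err(Box::new(DecodeErrorKind::UnexpectedEof));
    }
    *out = u32::from_le_bytes([input[0], input[1], input[2], input[3]]);
    Ok(())
}
```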
This would be fantastic to have!
Though I expect the fine-grained (per-field) error-handling that serde
wants (needs?) to produce will be a bigger bottleneck for us.
…On Sun, Oct 8, 2017 at 10:04 PM, David Tolnay ***@***.***> wrote:
One can imagine a serde style macro that instead of returning a value
would take an &mut and fill in the fields and only return a bool on success
or failure.
Serde is interested in supporting this style -- tracked in
serde-rs/serde#855. We had
been looking at it for the csv crate, not to avoid memcpy of the return
values but to reuse memory allocations across rows deserialized from the
same csv file. The return value is Result<(), D::Error> which for Bincode
is 8 bytes so it should be just as efficient as a bool. Does this seem like
it would fit the bill?
|
I've filed bincode-org/bincode#206 to start investigating bincode deserialization performance. |
I have zero confidence in our ability to manually maintain the safety of this over the years to come. Let's try everything else that we can first. |
So both serialization and deserialization no longer have atrocious code. However, looking at profiles and generated code, it looks like there's still room for improvement. I've filed https://bugs.llvm.org/show_bug.cgi?id=34911 to help reduce some of the length checks. |
Splitting out the ClipScrollNodes to their own array makes a lot of sense to me. That means that the entire ClipScrollTree could be built in one pass and then the rest of the display list later. |
I think this can be closed now, in favor of more specific issues if there is remaining work. Please re-open if you think otherwise. |
We are experiencing some performance issues with the current display list serialization and deserialization implementation.
We believe that the majority of this is related to codegen issues between serde and rustc. It's possible these will be resolved in the future, but it would be good to have written down what the benefits of the current implementation are. That way, if we consider any changes to the current method, we can ensure we don't miss any of the required functionality.
As an extreme example, consider if we had something similar to the old method - the display list is a set of flat arrays of data, which is simply copied / shared to WR.
The benefits I'm aware of with the current implementation:
The value of (1) seems clear (although we may want to quantify that somehow if we haven't already?). It's less clear to me how (2) works - could we write up an example or two of how this provides safety?