-
Notifications
You must be signed in to change notification settings - Fork 14
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Proposed RFC Feature: Multi-device Resources #120
Comments
Based on the discussions here #32 I think the RFC looks good for the most part. The design around DeviceBuffer and MultiDeviceBuffer sounds good. A few points below ->
|
Thank you for the response! Here are our thoughts to the points you raised:
Is that the part you were looking for, or are we misunderstanding something here?
We think our main goal here should be to get these changes in as quickly as possible so that any optimizations and additional features/improvements can be built on later. This concerns points 4 and 5 mainly. The reason behind this is again, that this whole thing is a considerable amount of changes and the development costs compound the longer it takes, since every rebase takes additional time and more changes/fixes. |
|
Regarding In particular, the following methods and members would need to be updated:
The question is, how should this be realized, since these methods now need to be able to, instead of managing a Proposed solutions
|
Just so I understand correctly let me go over how I am imagining the initial setup of resource binding will work for multi-gpu support at RPI level. Currently RPI has RPI::ShaderResourceGroup (RPI::Srg for short) which has a bunch of functionality but we only need to focus on two variables here -> RHI::ShaderResourceGroup (RHI::Srg for short) and RHI::ShaderResourceGroupData (RHI::SRGdata for short). RPI currently manages RHI::SRGdata and allows passes to populate it with resource views. This object is then fed into RHI::Srg which uses it to compile and bind all the resource to the native RHI backend object (i.e Descriptor set/descriptor table/argument table). With multi gpu support I am assuming that each pass at the RPI level still is only meant to be run on one device. Since one pass can only run on one device then a pass can contain RPI::SRG that is only meant to hold views for one device only. This actually makes code at the RPI level simpler. RPI::SRG can behave the same way where one RPI::SRG can have still only have one RHI::SRGdata and one RHI::Srg. RHI::SRG does not need to be split to RHI::DeviceSRG as it will only ever be used on one device. RPI::SRG api related to Get/Set views can stay the same and it's contents can stay the same. The only thing that will change is the higher level code that is invoking this api. The higher level pass code knows which device it will need to run on (from the render pipeline). Lets say this device index is called m_deviceIndex and lives within Pass class. Now any code that is invoking the api of RPI::SRG it will need to get the resource view via the device index. Based on our last conversation since the view hierarchy did not go through Device split I am assuming it is still DX12::BufferView->RHI::BufferView->RHI::ResourceView->RHI::DeviceObject->RHI::Object and RHI::DeviceResource has cache of all the created RH::ResourceViews. So GetResourceView will need to be expanded to take a device index. It will use the device index to cache the RHI::ResourceView within the correct RHI::DeviceResource. So RPI (i.e the higher level code) will provide the device index when creating/getting resource view. This RHI::ResourceView can then be used by the RPI::SRG and no api changes will be needed. Let me know if this is confusing and if anything needs expanding or if I missed something myself and am completely on the wrong track. |
It's not confusing, but unfortunately, you are on the wrong track. It's true that if passes only run on a single device you would not need multi-device SRGs. But passes are not the only place where SRGs are used. Think of the scene SRG which is created independently from a pass and is very likely needed on multiple devices. Another way you can think of it: if there are no multi-device SRGs, you cannot have multi-device Draw-/DispatchItems and we already had a lengthy discussion about why we need those. So, unfortunately, we need a multi-device SRG. Given this, could you please again consider the API changes proposed (i.e. using a |
Ahh I see your point about multi-device SRG. Ok Let me backtrack a bit. In order to support multi-device RHI::SRGs this is how I would think the new design may look like. Remember that RPI::Srg is the one that populates RHI::SRGdata which is just a container object and passes it to RHI::Srg. Lets say that now RHI::Srg goes through the split to RHI::DeviceSRG and now RHI::SRG can contain multiple DX12::SRGs. Same thing applies to RHI::ShaderResourceGroupPool. RPI::Srg will now need to manage multiple RHI::SRGdata, one per device. So RHI::SRGdata api will not change but RPI::SRG api will need to change. I think your suggestion can work here. So RPI::SRG's Get/Set methods can be modified to take in azstd::pair<MultiDeviceResource, ResourceViewDescriptor> but it needs to know the device index which will tell it which RHI::SRGdata to update. Maybe it can get the device index from the ResourceViewDescriptor? You would also want to keep the current api where Get/Set methods also take in a RHI::ResourceView as the higher level code may want to keep a cached view around and just want to pass that in instead of a pair. So now we have RPI::SRG that is now managing multiple RHI::SRGdata and one multidevice RHI::SRG. Next we will need to change RHI::SRG's Compile api. Currently RHI::SRG takes a RHI::SRGdata and essentially updates the native RHI object from that. We will need change the api to now take multiple RHI::SRGdata and internally it will call Compile on DX12::SRG with the correct RHI::SRGdata. So just to conclude we will need to modify RPI:SRG api and RHI::SRG's compile api. Does this work? |
Thanks for your suggestions! What you are describing is basically what we have in mind for the multi-device SRGD class, except that you plan to implement it in the RPI::SRG instead. This would work, but we think it would still be nice to have a MDSRGD class to separate this functionality from the RPI, have it testable separately and consistent with the other multi-device classes which are implemented on the RHI rather than the RPI level. If you don't have any strong objections to having a separate multi-device SRGD class, we will implement it like this. You are of course right that the API of the RPI::SRG has to be adapted later then as well to reflect these changes in the MDSRGD class, when we port the RPI to use the MD classes. |
yeah we could push the multidevice functionality to RHI::SRGdata. It would keep RPI clean. I am not opposed to pushing the functionality down to RHI and basically have a RHI::MultiDeviceSRGData contian multiple RHI::DeviceSRGdata and RPI is only aware of RHI::MDSRGData. That can work too. |
Since this RFC is accepted please open a PR and move this RFC to this folder - https://github.com/o3de/sig-graphics-audio/tree/main/rfcs where we will track all the new RFCs. Thanks. |
Here you go: #137 |
@moudgils I've been working on using the MD resource classes in the RPI. Doing that I ran into the issue that I needed to use MD classes in the scope attachments. We also do so in our prototype. However, as I was thinking about it more, I realized that we might not want that. The basic idea that we are following is that a scope (and thus a pass) is attached to one device. In the prototype for example, the MD SRGs are submitted for compilation by the passes. But that means depending on the device mask of that MD SRG, multiple SD SRGs are unnecessarily compiled, even though only one of them is required for the specific task. The better solution would be to transition from MD to SD already when submitting for compilation in the passes rather than between compilation and execution. This would mean, that the passes need to be more aware of the device they are running on, i.e., call device specific methods of the RPI classes (passing their device index) that store the MD objects. An example here is the following (bringing us back to the scope attachments that triggered my thinking around this): The What are your thoughts about this? |
RFC: Multi-device Resources
Summary
This proposal outlines how resources (buffers, images, etc.) and related classes and structs are prepared for multi-device support in Atom. This is the first necessary step towards multi-device support and already comprises of a significant amount of changes. The main idea here is to introduce a new layer of data structures in the RHI, which support multiple devices and then forward calls to the RHI backends for each device. A non-goal of this proposal is to already support multi-device execution, although changes to the frame graph already have to be made for this to work as well as the transition from multi-device objects to single-device execution, which currently happens when advancing from frame graph compilation to execution. A side goal is to keep the backend and RPI code as unaffected as possible to (1) avoid having to implement multi-device features multiple times within every backend and (2) avoid all higher-level code (RPI and above, as well as any third-party Gems) having to change much. However, this is not completely possible to avoid.
What is the relevance of this feature?
The ultimate goal is to support multi-device execution within Atom. This could for example be used to render stereoscopic images for virtualy reality headsets on two devices simultaneously, i.e., one device per eye. The first step towards this goal is removing the globally accessible device. When looking into this, one quickly finds that in most places, one does not know on which device a resource should then be allocated on, since there is not enough context for the execution. For example, when mesh data is loaded and uploaded to the device, one does not know which device to use, since one doesn't know which render passes will use that mesh and where they will be executed. It is actually likely, that the data is required on multiple devices. Therefore, support for having resources on multiple devices is necessary and this needs to be established before any execution on multiple devices can be accomplished.
Feature/Technical Design Description
The basic idea of the technical solution is to introduce new classes and structs in the RHI which support multi-device resources. This includes most of the classes derived from
DeviceObject
with few exceptions. The base class for these new classes will be aMultiDeviceObject
, which stores a device mask, denoting on which devices this resource will be present. This device mask is a 32-bit field, where each bit represents one device. TheRHISystem
will support an array of multiple devices where each device can be accessed through an index (corresponding to the bit position in the device mask, e.g., the least significant bit corresponds to the device with index 0). Internally, each multi-device object will then contain a map, which maps from the device index to the correspondingDeviceObject
-derived object. An example is given in the following UML diagram, where we show this for the buffer resources.Any API calls on a multi-device object will be forwarded to all managed device objects, unless otherwise specified, e.g., through providing a mask, selecting which devices should apply the call. The masks may also contain bits for devices that are not actually available, which will be ignored. This allows us to have an
AllDevicesMask
mask, where all bits are set independently of how many are actually available. The default device has device index 0 and will be also available through aDefaultDeviceMask
, where only the least significant bit is set. The following is a list of all the classes, where both a single-device and a multi-device version will be available. All "final" classes, i.e., those that are not further derived from, contain a map in the respective multi-device class, which maps to the corresponding single-device class objects.DeviceObject classes without multi-device version
There are a few classes in the
DeviceObject
hierarchy, which do not get a multi-device class version. The first few classes of these areResourceView
-derived classes, i.e.,BufferView
andImageView
. Outside their RHI backend derivatives, these two only have two members each, the corresponding resource (buffer or image) and a descriptor, which is device independent. Since the memory management regarding views is quite complicated as is, as the resources store references to their views and a layer of multi-device objects would complicate this even more, we opt to change the higher-level interface at this point. Instead of a multi-device view, we use the combination of the resource and descriptor instead of binding them together.The classes
AliasedHeap
,AliasedAttachmentAllocator
andCommandQueue
seem to not need a multi-device version, since they are never used outside of a single-device context. However, the classCommandQueue
will actually handle multi-device classes and translate them to single-device objects, without needing a separate multi-device version, as we do not currently plan to be able to submit work to multiple devices at the same time. More details will be discussed further below.Other classes and structs with both a single-device and multi-device version
Other classes and structs, which do not themselves derive from
DeviceObject
, need multi-device support as well. Most of them are descriptors, requests and items (for copy, dispatch and draw). They need a multi-device version, because one or more of their members reference someDeviceObject
derivative and are used both in the higher-level API, where we now want to use multi-device objects, as well as in the RHI backends, where we need their single-device version. In most cases, those classes and structs are passed as parameters to some methods of the resource classes. Therefore, the conversion happens mostly inside the multi-device classes, when they loop over their single-device object map and forward the call.Some of the multi-device classes need a map as well in order to be able to be used properly or more efficiently. These classes are
ImageSubresourceLayoutPlaced
,IndirectBufferWriter
,PipelineLibraryDescriptor
,ShaderResourceGroupData
and the*Item
classes. The derivation ofImageSubresourceLayoutPlaced
fromImageSubresourceLayout
and their polymorphic usage actually makes it difficult to have a multi-device version of them without using unnecessarily performance-impacting dynamic casts to figure out which one is actually behind a pointer. Therefore, we propose to merge the two and, if necessary at all, add a boolean, whether the offset is used or not.Here is a list of the affected data structures:
Classes:
Structs:
Switching from multi-device to single-device for execution
At some point we have to switch from multi-device resources to single-device resources. The last chance for that is when submitting work to a specific device. Therefore, we propose (at least for now) to use a single frame graph for multiple devices. Doing this should help to synchronize the work between multiple devices, by being able to use already present synchronization mechanisms in the frame graph, just extended for multiple devices. This is not fleshed out in detail yet and not the goal of this proposal, but in order to get this code to work at all, we need to have some point of transition. For now, we basically transition between frame graph compilation and execution. Since we are not actually submitting work to multiple devices yet, our plan is to always transition to the default device at the moment.
As mentioned before, the multi-device versions of
*Item
classes contain a map to their single device version. This is not strictly necessary, since a conversion could simply be done when necessary, but since they are potentially kept for multiple submissions and even multiple frames, it is more efficient to cache the single-device structs as well. The conversion from the multi-device to single-device version happens in theRHI::CommandList::Submit
methods, which forward the calls to the corresponding backend.Planned git history
We intend to divide the pull request for this proposal into four commits, which should simplify the review. Also, looking at the changes as a whole, git will not detect renaming of files properly, which it still does, when looking at the first commit only for example.
The first commit is "simply" a rename of all relevant structs and classes to
Device*
. Since this can mostly be done by search and replace with little manual intervention, there should not be much that could fail here.In a smaller, second commit we actually initialize multiple devices in the
RHISystem
. We introduce a new command line switch (--rhi-multiple-devices
) to initialize all available devices, otherwise only one device is initialized as before. Note again that we will allocate resources on these devices, but only submit commands to the default device.The next commit introduces the new multi-device classes with the original naming of the classes that were renamed in the first commit.
Finally, in the last commit, we will rewrite everything to use the new multi-device RHI classes rather than the
Device*
classes. This concerns mainly the RPI classes of those resources, but also direct uses of the RHI classes, which constitute a considerable amount of the changeset. This will be the critical commit to check, if everything is working.What are the advantages of the feature?
RHI::Device
is not a global anymore and it is possible to use multiple devices.What are the disadvantages of the feature?
How will this be implemented or integrated into the O3DE environment?
The feature will be developed in a branch parallel to the development branch, until properly tested and ready for merging. When merged, third-party projects will have to make some smaller adjustments to the new API changes. During the testing and development phase, we will need help for MacOS/Metal and mobile plattforms, since we do not have the necessary hardware and software available to test this.
Are there any alternatives to this feature?
We tried an approach to introduce multi-device support in the RPI resources, which turned out not to be possible due to some objects needing multi-device support in the RHI. Another approach would be to support multi-device resources directly within the RHI resources without an additional layer at the cost of having to implement everything for each backend which, would probably add unnecessary code duplication at hardly any performance benefit. It would also be possible not to have multi-device resources at all, but that would complicate any further multi-device developments, since resources then have to be held on multiple devices at a much higher level in the code base, leading to the inconvenient case that every feature that should support multiple devices has to keep track of resources on each device manually. This would likely also lead to a lot of code duplication.
How will users learn this feature?
Most of these changes should not affect users at all. Any changes necessary to higher level code should be straightforward and easy to implement. The required changes will be documented properly and changes in o3de itself and the atom-sampleviewer will provide plenty examples.
Are there any open questions?
The naming of the classes and structs is still something that is open for debate. The least amount of changes in the higher level code base would happen if the multi-device versions would take over the names of the single-device versions and single-device versions get a new name. I.e., the current
Buffer
class will be renamed toDeviceBuffer
and the multi-device version of it will simply be namedBuffer
. An alternative would be to call the classesDeviceBuffer
andMultiDeviceBuffer
respectively.The text was updated successfully, but these errors were encountered: