Some performance bottlenecks for UAV loads #3
Comments
Thanks for the detailed info. CUDA has a different memory model than DirectX/OpenGL/Vulkan (raw data vs. data fetched through resource descriptors), so it makes sense that Nvidia has different hardware paths for these. I am just used to AMD's architecture, which is more generic.

The loops should definitely be slightly unrolled, because most GPUs benefit from being able to issue multiple loads at once and then wait for them together. With a loop, you issue each load separately and then wait for it. Because this loop doesn't have any ALU work to hide the load latency, it would definitely be better to unroll at least by 2x or 4x. But that obviously increases register pressure, so results are highly GPU and compiler specific.

The start address and the address mask are there to keep the compiler from merging multiple 1d loads into wider 4d loads. This is because I want to benchmark 1d load performance and wide (2d and 4d) load performance separately. If the compiler were allowed to merge them, all linear tests would become wide 4d load tests, which is obviously not the intention.

The big perf changes caused by different masking operators are very strange indeed. ALU should be irrelevant here, because the L1$ latency is much higher than any ALU latency. The memory access pattern doesn't change at all with different masking either, so it must be that the compiler optimizes the loops differently depending on the masking. Maybe it partially unrolls some cases, because the loop length (256) is known at compile time. If this test app were DX12 based, we could use the new DX12 PIX to inspect the Nvidia shader compiler output and see what's happening.

The length of the loop should not matter for performance (unless it becomes very short), since the test case is designed to fit fully in the compute unit's L1 cache. The group should also be wide enough to fulfill all possible coalescing requirements. This is another strange result (for Fermi). Obviously, in real workloads a larger loop size would often mean a larger working set per group, potentially thrashing the L1$ (depending, of course, on whether there's more data locality in program order vs. in neighboring groups).

DX11 has loose default guarantees of data visibility. Data written by one thread group is not guaranteed to be visible to other thread groups during the same dispatch unless the "globallycoherent" buffer attribute is used together with a memory barrier instruction. This allows AMD to use the L1$ safely for UAV reads in the common (non-globallycoherent) case. It works fine, since a single group always runs fully on the same CU (= single L1 cache). UAV writes are a more complicated matter, though, especially since there's a separate K$ for the scalar units that is not coherent with the L1$. I would assume the AMD shader compiler marks UAVs that are written and the driver then makes some pessimistic assumptions. GCN asm also has tags on individual load/store instructions that allow changing the cache protocol.
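To make the visibility rule concrete, here is a minimal HLSL sketch (the buffer names, register slots, and kernel are illustrative, not taken from perftest): a plain UAV only guarantees that writes are visible within the writing thread group during a dispatch, while a `globallycoherent` UAV combined with a device memory barrier makes the writes visible to other groups as well.

```hlsl
// Plain UAV: within a dispatch, writes are only guaranteed to be visible to the
// writing thread group, so the driver is free to serve reads from the per-CU L1$.
RWByteAddressBuffer localResults : register(u0);

// globallycoherent UAV: writes become visible to other thread groups in the same
// dispatch once a device memory barrier has been issued, which typically forces
// the accesses past the non-coherent L1$.
globallycoherent RWByteAddressBuffer sharedResults : register(u1);

[numthreads(64, 1, 1)]
void CSMain(uint3 tid : SV_DispatchThreadID)
{
    sharedResults.Store(tid.x * 4, tid.x);

    // Ensure this group's device writes have completed before continuing.
    // Cross-group ordering (which group reads what, and when) still has to be
    // arranged by the algorithm itself, e.g. via atomics.
    DeviceMemoryBarrierWithGroupSync();

    localResults.Store(tid.x * 4, sharedResults.Load(tid.x * 4));
}
```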
GPUs can potentially hide fetch latency even within loops: the scheduler can switch to another warp/wavefront if the results for the current one are not yet ready (AMD GCN can keep up to 10 wavefronts in flight per CU). But this is limited by available GPU resources, primarily by GPR count. I am not sure whether that is enough to completely hide latency in such a synthetic shader; it would be interesting to find out for different GPUs. Real-world shaders with fetch loops can behave differently, because other parts of the shader can require many GPRs, which limits occupancy and thus the GPU's ability to hide latency in the fetch-intensive part.
I understand the purpose of the address masks, but it seems that they slow down performance even for 3d/4d loads in some cases. At the same time I noticed other cases where the mask is required to prevent scalar load merging (SRV texture loads). So for now I have not found a single solution that works well for all cases.

I suppose the performance dependency on loop length on Kepler is not related to the access pattern. I initially started experimenting with loop length after I had fully unrolled the loop and noticed poor performance for 3d/4d raw UAV loads. I suspected the reason was instruction cache overflow and tried to reduce the iteration count. But later I observed that loop length also affects performance for the loop without unrolling. Maybe the compiler chooses different optimization strategies regarding loop unrolling or register allocation. This behavior is specific to Kepler; other GPUs behave more predictably. Address mask options also affect how the different loop techniques perform relative to each other. Some results for raw buffer loads on Kepler:
[table: raw buffer load results on Kepler]
Thanks for the useful tool.
I added support for UAV loads in the fork https://github.com/ash3D/perftest/tree/UAV_load (branch UAV_load). The UAV results turned out to be somewhat slower than SRV on NVIDIA Kepler (GeForce GTX 760M). Previously I had obtained higher performance with UAV compared to SRV under certain conditions in a similar benchmark, so I started to experiment with the shaders and eventually got roughly a 2X speedup. The things I tried:
Loop unrolling.
This improved 1d and 2d raw buffer loads but significantly worsened 3d and 4d loads on Kepler. Unrolling typed UAV buffer and texture loads resulted in crashes during benchmark execution on Kepler (but Intel worked well).
I also tried partial loop unrolling. This eliminated the big slowdown for 3d/4d loads on Kepler, but in general partial unrolling performance was closer to the original (no unrolling), and often slower. Different unroll factors worked best under different conditions (load width, access pattern) in a somewhat unpredictable way.
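For context, here is a rough HLSL sketch of the loop shapes being compared; the buffer names, mask usage, and iteration count are illustrative rather than the actual perftest shader code.

```hlsl
ByteAddressBuffer sourceBuffer;  // raw SRV; the UAV variants use RWByteAddressBuffer

// Baseline: plain loop, one 4d raw load issued and consumed per iteration.
// The caller is expected to write the returned sum to a UAV so the loads
// are not dead-code eliminated.
float4 loadLoop(uint startAddress, uint mask)
{
    float4 acc = 0.0;
    [loop]
    for (uint i = 0; i < 256; ++i)
        acc += asfloat(sourceBuffer.Load4(((startAddress + i) | mask) * 16));
    return acc;
}

// Partial unroll by 4: issue several independent loads back to back so their
// latencies can overlap, then consume them together. A full unroll would simply
// put [unroll] on the baseline loop and let the compiler flatten all 256 iterations.
float4 loadLoopUnroll4(uint startAddress, uint mask)
{
    float4 acc = 0.0;
    [loop]
    for (uint i = 0; i < 256; i += 4)
    {
        float4 a = asfloat(sourceBuffer.Load4(((startAddress + i + 0) | mask) * 16));
        float4 b = asfloat(sourceBuffer.Load4(((startAddress + i + 1) | mask) * 16));
        float4 c = asfloat(sourceBuffer.Load4(((startAddress + i + 2) | mask) * 16));
        float4 d = asfloat(sourceBuffer.Load4(((startAddress + i + 3) | mask) * 16));
        acc += a + b + c + d;
    }
    return acc;
}
```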
Loop iteration count reduction.
The big 3d/4d load slowdown on Kepler with the unrolled loop suggested reducing the iteration count. This unexpectedly also improved 1d/2d performance (a 2X scaledown gave a >2X performance gain). Even more unexpectedly, it improved performance of the original loop without unrolling.
Such behavior was detected on Kepler only. I tested Intel and Fermi a little bit and found mostly linear performance scaling there.
Removing the read start address and address mask.
Reading the start address from a cbuffer (used for the unaligned tests) harmed UAV performance even when the value is 0. The address mask, which is intended to prevent the compiler from merging multiple narrow loads, also affected wide load performance. It seems that NVIDIA GPUs perform wide raw buffer loads sequentially anyway, so the performance gain from removing the address mask here apparently comes from something else. There are other places, though, where the address mask does appear to prevent narrow-load merging on Kepler (e.g. scalar 8/16/32-bit texture SRV loads).
Removing the address mask also fixed the big slowdown for 3d/4d loads on Kepler with the unrolled loop.
I also experimented with a different way of applying the address mask: `&= ~mask` instead of `|= mask`. It unexpectedly improved performance, and in some specific cases performance oddly became better than with no mask at all. The modifications I mentioned also affected SRV performance to some extent, but UAV performance was much more sensitive.
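A sketch of the two masking variants being compared (the constant buffer layout and names here are illustrative, not the actual benchmark shader): with a mask of 0, both forms leave the linear address unchanged, but the compiler can no longer prove that at compile time.

```hlsl
RWBuffer<float4> sourceUAV;

// Coming from a constant buffer prevents the compiler from constant-folding
// the start address and the mask away.
cbuffer LoadConstants
{
    uint startAddress;
    uint elementsMask;
};

// Original form: OR the mask into the address ("|= mask").
float4 loadLoopOrMask()
{
    float4 acc = 0.0;
    [loop]
    for (uint i = 0; i < 256; ++i)
        acc += sourceUAV[(startAddress + i) | elementsMask];
    return acc;
}

// Variant: AND with the inverted mask ("&= ~mask"), which unexpectedly ran
// faster on Kepler in some cases.
float4 loadLoopAndNotMask()
{
    float4 acc = 0.0;
    [loop]
    for (uint i = 0; i < 256; ++i)
        acc += sourceUAV[(startAddress + i) & ~elementsMask];
    return acc;
}
```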
The results ultimately became close to the expected theoretical peak rates of the Kepler architecture. NVIDIA GPUs implement SRV loads in the read-only TMU pipeline, so performance differs from CUDA, which uses the read/write LSU pipeline. It also differs significantly from AMD GCN: on GCN, all 4 of the 32-bit fetch units used for bilinear texture sampling can be utilized for buffer accesses (for wide loads/stores or coalesced 1d access). NVIDIA TMU fetch units are 64-bit beginning with Fermi (it is able to filter 64-bit RGBA16F textures at full rate), but apparently only 1 of the 4 is used for buffer reads. I have observed similar behavior before with GT200, except that its fetch units are 32-bit.
UAV accesses are served by the LSU pipeline on NVIDIA GPUs. Kepler has a 2:1 LD/ST-to-TMU ratio, but UAVs are cached in L2 only. Initially, UAV loads were slower than SRV in the benchmark, but after the shader modifications described above, UAV performance became faster than SRV for invariant loads. The ratio is still not 2X, but close to it. Linear and random UAV load performance varied over a wide range (probably due to the increased L2 access rate) and can be much faster or slower than SRV in different cases. SRV performance is very stable (it is the same for invariant/linear/random reads).
I also briefly tested second-generation NVIDIA Fermi (GeForce GTX 460). Fermi has an L1 cache for the LSU pipeline (combined with shared memory), so UAV performance turned out to be better: invariant UAV reads are 2X faster than SRV reads. Linear and random UAV read performance is still not as stable as SRV, but it is much better than on Kepler. Also, Fermi is not subject to the big performance drop for 3d/4d UAV raw loads with a long unrolled loop.