Perform performance tests with updated GLADOS #29
Comments
Yes of course! Next year will be fine. I just added that request to keep it in mind :-) |
Did you also change the stable version 0.2.0? |
Don't need to. GLADOS was updated internally, the function call itself didn't change. The stable version thus automatically profits from this change as well. |
I recently profiled PARIS on my laptop GPU, the backprojection kernel now swallows 50% of computational time. This is a major improvement compared to the 98% measured in November. |
Wow! What have you changed? Is it due to the difference in GPU performance? Is the run time now shorter (on our GTX cards)? When will you next be at the HZDR? |
Two things have changed since then:
I believe 1. to be the main source of the performance gain, as all that GLADOS threading overhead shouldn't affect GPU execution time. I'll try to come on Wednesday, depending on how early I can leave university. If that doesn't work out I'll be there Thursday. |
So projection data buffering becomes more important again? |
Maybe, I'd have to profile wait times on the CPU to see that. |
Jan, I would like to analyze and profile the current code together with you (and also with Stephan and Tobias). Can we spend at least 1 h on Wednesday? To be honest, I do not understand either of the two points mentioned! I always have Tobias's solution for RISA in mind - and there are obviously significant differences. |
The first point relates to CUDA's execution strategy: a block is always executed in groups of 32 threads (such a group is called a warp), even if the block does not have enough threads to fill the last group. This is why the block size should always be a multiple of 32. Otherwise some execution units on the GPU are guaranteed to sit idle. |
Thanks for the explanations. My misunderstanding is more related to #2. I became aware of it while looking at the timing profiles of the two CUDA students, where I noticed many idle areas. Am I right, Jan? |
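To make the warp argument concrete, here is a minimal sketch of the arithmetic. It is plain Python, not PARIS or GLADOS code; the helper names `warp_usage` and `round_up_to_warp` are made up for illustration, and the warp size of 32 is the value quoted above.

```python
WARP_SIZE = 32  # threads per warp, as stated in the comment above


def warp_usage(block_size, warp_size=WARP_SIZE):
    """Return (warps, idle_lanes, utilization) for a single thread block.

    A block is scheduled in whole warps, so a partially filled last warp
    still occupies a full group of `warp_size` lanes.
    """
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    warps = -(-block_size // warp_size)  # integer ceiling division
    lanes = warps * warp_size
    idle_lanes = lanes - block_size
    return warps, idle_lanes, block_size / lanes


def round_up_to_warp(block_size, warp_size=WARP_SIZE):
    """Smallest multiple of `warp_size` that is >= block_size."""
    return -(-block_size // warp_size) * warp_size
```

For example, a block of 100 threads occupies 4 warps (128 lanes) and leaves 28 lanes idle, while a block of 128 threads fills its 4 warps completely.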
see CPT_2016_Extend-FDK/doc/pres_backhaus_stelzig.pdf |
Yes, there are idle areas. I believe* this issue has been resolved by eliminating the GLADOS pipeline from PARIS, as there is no more waiting involved between stages (i.e. weighting, filtering, backprojection). Previously, the projections had to be transferred from one stage to the next, which led to blocking if both stages wanted to access the shared buffer simultaneously. In short: GPU kernel execution is a bit too fast for the host. The management of GPU data (i.e. transferring it between stages) takes longer than processing that data, resulting in the gaps seen in the presentation. By eliminating the data transfer between stages those gaps should disappear. * I didn't have time to actually profile it / my laptop GPU seemingly doesn't support this type of profiling. |
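The gaps can be illustrated with a very rough model. This is a sketch under a strong assumption that has not been measured on PARIS: for every projection the host-side handoff and the GPU kernel run strictly one after the other, without overlap. The function name and the numbers in the example are hypothetical.

```python
def gpu_busy_fraction(kernel_time, handoff_time):
    """Fraction of wall time the GPU spends in kernels.

    Assumes (for illustration only) that each projection costs
    `handoff_time` on the host followed by `kernel_time` on the GPU,
    with no overlap between the two.
    """
    if kernel_time < 0 or handoff_time < 0:
        raise ValueError("times must be non-negative")
    total = kernel_time + handoff_time
    if total == 0:
        raise ValueError("at least one time must be positive")
    return kernel_time / total
```

With a kernel that takes 1 time unit and a handoff that takes 3, the GPU is busy only a quarter of the time; with no handoff at all the busy fraction is 1.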
As for Wednesday: I can't promise I'll actually make it to the HZDR. If we want to profile with four persons present we should rather do it on Thursday. |
I really do not understand why GLADOS has/had a pipeline. Many stages can be stacked together into a data pipeline, can't they? |
What is the difference between 1 or 4 persons??? |
GLADOS pipelines are not useless in general. However, most stages in PARIS execute so fast that the host management routines actually take longer than the GPU kernels. In this special case the GLADOS pipeline doesn't offer any benefit. In all other cases the pipeline pattern is still useful. For the sake of an example, let us imagine that we want to execute the backprojection kernel five times in a row (independently) for a number of volumes. In this case, GPU execution per stage would consume considerably more time than the overhead incurred by using the GLADOS pipeline, effectively masking it. Why did I build the pipeline structure? Because when I first came up with the idea I didn't know the execution times of the individual stages; I believed them to be roughly equal. If that had been the case the GLADOS pipeline would still be useful. However, the backprojection consumes most of the time, resulting in busy-waiting for the other stages. |
If we want four people present, we have a meeting. If we have a meeting, we have to fix a date - which I can't guarantee for Wednesday. |
Okay .. then Thursday - 14:00? |
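The trade-off described above can be sketched with a toy timing model. It is not GLADOS code, and it rests on two assumptions: the pipelined variant overlaps its stages perfectly (the best case for a pipeline) but pays a fixed handoff cost per stage and item, while the fused variant runs all stages back to back without any handoff. Function names and numbers are hypothetical.

```python
def pipelined_time(stage_times, handoff, n_items):
    """Total time of an idealized pipeline.

    Every stage pays `handoff` per item on top of its own time. After the
    pipeline has filled, one item completes per slowest-stage interval.
    """
    per_stage = [t + handoff for t in stage_times]
    fill = sum(per_stage)  # the first item passes through all stages
    return fill + (n_items - 1) * max(per_stage)


def fused_time(stage_times, n_items):
    """Total time when all stages run back to back with no handoff."""
    return n_items * sum(stage_times)


# Fast stages: the handoff (5) dominates the kernels (1, 1, 2).
fast = [1, 1, 2]
# Slow stages: the kernels (100, 100, 200) dominate the handoff (5).
slow = [100, 100, 200]

fast_pipelined = pipelined_time(fast, handoff=5, n_items=100)
fast_fused = fused_time(fast, n_items=100)
slow_pipelined = pipelined_time(slow, handoff=5, n_items=100)
slow_fused = fused_time(slow, n_items=100)
```

In this model the fast stages take 712 time units pipelined versus 400 fused, so the pipeline only adds overhead. With the slow stages the pipeline needs 20710 units versus 40000 fused, because the handoff cost is masked by the long kernels.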
Fine by me. |
Tobias, Stephan, Micha: fine for you? |
ok |
Ok! |
Very well. So Jan, could you please prepare information on the two points mentioned? Additionally, could you please provide a timing schedule (profile) for, let's say, 2-4 projections being processed and backprojected into a "full-size" volume? As an example, the profile Tobias presented in his Diploma defense and in the CPC paper shall be used as a reference ( profiler.pdf ). Thx. |
Ok
From: AB [mailto:notifications@github.com]
Sent: Monday, 6 February 2017 16:40
To: HZDR-FWDF/PARIS
Cc: BodenS; Assign
Subject: Re: [HZDR-FWDF/PARIS] Perform performance tests with updated GLADOS (#29)
|
As described in the (poetic) title