Add gaussian broadening to SLIT source #214

lkeegan · 2024-03-08T09:32:30Z

param2.x: width of gaussian broadening in direction perpendicular to both slit and v
param2.y: width of gaussian broadening in direction of slit
resolves Gaussian divergence for slit source #213

- `param2.x`: width of gaussian broadening in direction perpendicular to both slit and v - `param2.y`: width of gaussian broadening in direction of slit

fangq · 2024-03-09T00:08:41Z

@lkeegan, thanks for the patch. I think this is a great extension to the slit source and I'd be happy to include it.

I do have a few minor requests

can you also update the documentation on the slit source in mcxlab.m with the new options?
I noticed that you defined 9 new registers in this new code block. by using the make register make command that I just added in the Makefile, I noticed that there is an increase of register counts in 4 out of 16 kernel templates. Here is the register count diff

fangq@taote:~$ diff reg_original.txt reg_newslit.txt 
42c42
< ptxas info    : Used 103 registers, 472 bytes cmem[0], 324 bytes cmem[2]
---
> ptxas info    : Used 106 registers, 472 bytes cmem[0], 324 bytes cmem[2]
50c50
< ptxas info    : Used 105 registers, 472 bytes cmem[0], 388 bytes cmem[2]
---
> ptxas info    : Used 110 registers, 472 bytes cmem[0], 368 bytes cmem[2]
54c54
< ptxas info    : Used 94 registers, 472 bytes cmem[0], 336 bytes cmem[2]
---
> ptxas info    : Used 98 registers, 472 bytes cmem[0], 336 bytes cmem[2]
58c58
< ptxas info    : Used 94 registers, 472 bytes cmem[0], 324 bytes cmem[2]
---
> ptxas info    : Used 92 registers, 472 bytes cmem[0], 324 bytes cmem[2]
62c62
< ptxas info    : Used 128 registers, 472 bytes cmem[0], 332 bytes cmem[2]
---
> ptxas info    : Used 126 registers, 472 bytes cmem[0], 332 bytes cmem[2]
66c66
< ptxas info    : Used 92 registers, 472 bytes cmem[0], 324 bytes cmem[2]
---
> ptxas info    : Used 100 registers, 472 bytes cmem[0], 324 bytes cmem[2]

the increase of register counts ranges from 3-5. It is not a huge deal, although currently MCX's speed is largely capped by the register counts, so any optimization of registers would give a better chance to run more simultaneous blocks.

I understand that the use of new registers is for improving readability, which is also important. I am wondering if it is possible for you to try reusing some of your registers, (of course, add comments to indicate the updated meaning during multiple uses).

If you pull again, and add the register target in your Makefile, you can also use make register to compare the register savings.

let me know if you have any trouble doing this.

thanks again

fangq · 2024-03-10T16:15:03Z

@lkeegan, I managed to cut down the registers from 9 to only 2. let me merge this patch first and add my modifications, please test if the function still work as expected.

fangq · 2024-03-10T16:24:06Z

here is a diff of register counts compared to pre-merged mcx

fangq@taote:~/space/git/Temp/mcx/src$ diff p1.txt p2.txt 
45c45
< ptxas info    : Used 130 registers, 472 bytes cmem[0], 332 bytes cmem[2]
---
> ptxas info    : Used 129 registers, 472 bytes cmem[0], 332 bytes cmem[2]
49c49
< ptxas info    : Used 105 registers, 472 bytes cmem[0], 388 bytes cmem[2]
---
> ptxas info    : Used 106 registers, 472 bytes cmem[0], 368 bytes cmem[2]
53c53
< ptxas info    : Used 94 registers, 472 bytes cmem[0], 336 bytes cmem[2]
---
> ptxas info    : Used 98 registers, 472 bytes cmem[0], 336 bytes cmem[2]
61c61
< ptxas info    : Used 128 registers, 472 bytes cmem[0], 332 bytes cmem[2]
---
> ptxas info    : Used 122 registers, 472 bytes cmem[0], 332 bytes cmem[2]
65c65
< ptxas info    : Used 92 registers, 472 bytes cmem[0], 324 bytes cmem[2]
---
> ptxas info    : Used 100 registers, 472 bytes cmem[0], 324 bytes cmem[2]

3 templates still see growth of registers (1, 4, 8), compensated by drop of registers in 2 other templates (-1 and -6). tested on CUDA 10.2. other cuda versions may see different numbers.

fangq · 2024-03-10T19:53:38Z

src/mcx_core.cu

@@ -1462,6 +1462,29 @@ __device__ inline int launchnewphoton(MCXpos* p, MCXdir* v, Stokes* s, MCXtime*
                        rotatevector(v, 1.f, 0.f, sphi, cphi);
                    }

+                    if (MCX_SRC_SLIT && (launchsrc->param2.x > 0.f || launchsrc->param2.y > 0.f)) {


the first condition should be gcfg->srctype == MCX_SRC_SLIT. I will post my fix, just leave here for a note.

…ia-uvm

lkeegan · 2024-03-11T09:16:12Z

@fangq many thanks for the quick review and the performance improvements & fixes to the PR!

I tested the master branch and it all works great, thanks!

If you are interested I also tried make register locally with cuda 11.8 and 12.4, and for these versions there was no change in the number of registers used between any of the variants of the code.

Also, if reducing the number of registers used is key for good performance then it looks like separating the parallel and perpendicular contributions to v allows more register re-use (using cuda 10.2, sm_35):

master branch version using 22 registers: https://godbolt.org/z/8hs9bvY9G
alternative using 17 registers: https://godbolt.org/z/bhYhehnf5

fangq · 2024-03-12T18:10:26Z

thanks, I wasn't aware of this compiler explorer feature, thanks for sharing it!

yes, if you run nsight-compute to profile mcx kernels, you can see that the current throughput is largely limited by register counts (similarly for mcxcl and mmc). mcx has minimal memory overhead and thread divergence.

from the godbolt page, it seems the output was generated from -Xptxas -v, which is the same way as make register. I am afraid implementing just this function alone won't report the same register use as having it as part of the larger mcx kernel, because the number of registers that can be reused or optimized is dependent on the context. I also use -O3 in the compilation which could also be a bit aggressive. if you see other key factors that can reduce registers in the larger mcx kernels, I would love to hear about it.

on the flip side, I would not go to the other extreme of destroy readability to trade small gains in registers either.

Add gaussian broadening to SLIT source

411a007

- `param2.x`: width of gaussian broadening in direction perpendicular to both slit and v - `param2.y`: width of gaussian broadening in direction of slit

lkeegan mentioned this pull request Mar 8, 2024

Use mcx SLIT source for MSOTAcuity and MSOTInvision sources IMSY-DKFZ/simpa#266

Merged

3 tasks

fangq merged commit 56eca8a into fangq:master Mar 10, 2024
27 checks passed

fangq added a commit that referenced this pull request Mar 10, 2024

[optimize] reduce gaussian slit register use from 9 to 2, #214

a2279d6

fangq added a commit that referenced this pull request Mar 10, 2024

[optimize] cut hyperboloid gaussian register use from 15 to 3, #127,#214

7b6e0e0

fangq reviewed Mar 10, 2024

View reviewed changes

fangq added a commit that referenced this pull request Mar 10, 2024

[ci] revert to windows-2019, add help info for #214, add note on nvid…

4e7f404

…ia-uvm

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add gaussian broadening to SLIT source #214

Add gaussian broadening to SLIT source #214

lkeegan commented Mar 8, 2024

fangq commented Mar 9, 2024

fangq commented Mar 10, 2024

fangq commented Mar 10, 2024

fangq Mar 10, 2024

lkeegan commented Mar 11, 2024

fangq commented Mar 12, 2024 •

edited

Loading

Add gaussian broadening to SLIT source #214

Add gaussian broadening to SLIT source #214

Conversation

lkeegan commented Mar 8, 2024

fangq commented Mar 9, 2024

fangq commented Mar 10, 2024

fangq commented Mar 10, 2024

fangq Mar 10, 2024

Choose a reason for hiding this comment

lkeegan commented Mar 11, 2024

fangq commented Mar 12, 2024 • edited Loading

fangq commented Mar 12, 2024 •

edited

Loading