
Create simulation volumes on the GPU #243

Merged
merged 10 commits into IMSY-DKFZ:main on Sep 21, 2023

Conversation

lkeegan
Contributor

@lkeegan lkeegan commented Sep 20, 2023

  • use gpu in create_simulation_volume() if available
  • construct structures as needed instead of all at once
  • empty cache between each structure to reduce GPU ram usage
  • use float32 type throughout when constructing arrays
  • increase allowed tolerance for so2 in test from 1e-8 to 1e-7 due to gpu float32 precision
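A minimal sketch of the approach these bullet points describe: pick the GPU device if available, construct each structure only when needed in float32, and empty the CUDA cache between structures. The `create_volume` function and the `Structure.render()` interface are hypothetical stand-ins for illustration, not SIMPA's actual API.

```python
import torch

def create_volume(structures, shape):
    """Build a simulation volume structure-by-structure, on the GPU if available.

    Illustrative sketch only: `structures` is assumed to be an iterable of
    objects with a `render(shape, device)` method returning a float32 tensor.
    """
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    volume = torch.zeros(shape, dtype=torch.float32, device=device)
    for structure in structures:
        # construct each structure only when needed, in float32, on `device`
        volume += structure.render(shape, device)
        if device.type == "cuda":
            # free cached GPU memory between structures to reduce peak RAM usage
            torch.cuda.empty_cache()
    # move the result back to the CPU as a numpy array
    return volume.cpu().numpy()
```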

Please check the following before creating the pull request (PR):

  • Did you run automatic tests?
  • Did you run manual tests?
  • Is the code provided in the PR still backwards compatible to previous SIMPA versions?

Runtime of `create_simulation_volume()` in the test script is reduced from 30.5s to 7.2s.

Collaborator

@kdreher kdreher left a comment


Concerning the way we handle float32 and float64, I guess we have these options:

  1. Doing all the calculations in torch and numpy in float32
  2. Doing the calculations in torch in float32 but then cast them to float64 for the rest of the pipeline.

What do you think?

torch.cuda.empty_cache()
# convert volumes back to CPU
for key in volumes.keys():
    volumes[key] = volumes[key].cpu().numpy()
Collaborator


I think this might cause problems later in the pipeline, as the arrays returned here are of type np.float32, and later they will be used in conjunction with other np arrays that are natively generated as np.float64. For me, the acoustic simulation crashes because of that, but if we put np.float64(...) here, then it runs about as fast.
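As a sketch of the fix described here (the `volumes` contents are stand-in data, not SIMPA's actual code), the CPU conversion step could cast back to float64 like this; `.astype(np.float64)` does the same job as wrapping the array in `np.float64(...)`:

```python
import numpy as np
import torch

# stand-in for volumes computed on the GPU in float32
volumes = {"mua": torch.rand(4, 4, dtype=torch.float32)}

# convert volumes back to CPU, casting to float64 so downstream numpy code
# (which defaults to float64) mixes cleanly with these arrays
for key in volumes.keys():
    volumes[key] = volumes[key].cpu().numpy().astype(np.float64)
```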

Contributor Author


Are you running the generate_in_silico_data script you sent me earlier? I'd like to reproduce the crash (I agree with your comment, but when I run your script the acoustic part seems to work as before)

Collaborator


Yes, that script breaks in MATLAB when I don't cast the volumes to float64, with this error message:

                        < M A T L A B (R) >
              Copyright 1984-2023 The MathWorks, Inc.
         R2023a Update 5 (9.14.0.2337262) 64-bit (glnxa64)
                           July 24, 2023

Warning: Unrecognized command line option: automation.
Warning: Unrecognized command line option: wait.

To get started, type doc.
For product information, visit www.mathworks.com.

639

750

0.3823

Error using kWaveGrid/set.t_array
t_array must be evenly spaced.

Error in simulate_2D (line 94)
kgrid.t_array = makeTime(kgrid, medium.sound_speed, 0.3);
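For what it's worth, the "t_array must be evenly spaced" error is exactly the kind of failure float32 precision can cause: the spacing of k*dt accumulated in float32 varies relatively by orders of magnitude more than in float64. A quick illustration (dt and step count chosen to resemble the k-Wave output in this thread, not taken from SIMPA itself):

```python
import numpy as np

dt = 1.8473e-8  # time step similar to the one k-Wave reports
n = 3600        # roughly the number of time steps in the log

# t_array = k * dt in both precisions
t32 = np.arange(n, dtype=np.float32) * np.float32(dt)
t64 = np.arange(n, dtype=np.float64) * dt

# relative spread of consecutive step sizes: float32 spacing is measurably
# uneven, while float64 spacing is uniform to well below any sensible tolerance
spread32 = np.ptp(np.diff(t32)) / dt
spread64 = np.ptp(np.diff(t64)) / dt
print(spread32, spread64)
```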

Contributor Author


Strange, I can't reproduce it:

2023-09-21 14:55:03,421 - INFO - ['/usr/local/bin/matlab', '-nodisplay', '-nosplash', '-automation', '-wait', '-r', "addpath('/export/home/lkeegan/simpa/simpa/core/simulation_modules/acoustic_forward_module');simulate_2D('/tmp/tmp_simpa/Forearm_10000/Forearm_10000_Wavelength_800.hdf5.mat');exit;"]

                        < M A T L A B (R) >
              Copyright 1984-2020 The MathWorks, Inc.
           R2020a (9.8.0.1323502) 64-bit (glnxa64)
                      February 25, 2020

Warning: Unrecognized command line option: automation. 
Warning: Unrecognized command line option: wait. 
 
To get started, type doc.
For product information, visit www.mathworks.com.
 
   639

   750

    0.3823

Making time!
Running k-Wave simulation...
  start time: 21-Sep-2023 14:55:07
  reference sound speed: 1624m/s
Warning: Support for ver('distcomp') will be removed in a future release.  Use ver('parallel') instead. 
> In ver>locGetSingleToolboxInfo (line 283)
  In ver (line 56)
  In verLessThan (line 39)
  In kspaceFirstOrder_inputChecking (line 1306)
  In kspaceFirstOrder2D (line 537)
  In kspaceFirstOrder3DC (line 532)
  In kspaceFirstOrder2DG (line 76)
  In simulate_2D (line 157) 
  WARNING: visualisation plot scale may not be optimal for given source.
  dt: 18.4729ns, t_end: 66.4655us, time steps: 3599
  input grid size: 639 by 750 grid points (63.9 by 75mm)
  maximum supported frequency: 7.3999MHz by 7.4115MHz
  expanding computational grid...
  computational grid size: 675 by 800 grid points
  precomputation completed in 0.69861s
  saving input files to disk...
  completed in 0.78956s
┌───────────────────────────────────────────────────────────────┐
│                  kspaceFirstOrder-CUDA v1.3                   │
├───────────────────────────────────────────────────────────────┤
│ Reading simulation configuration:                        Done │
│ Selected GPU device id:                                     0 │
│ GPU device name:                        NVIDIA A100-PCIE-40GB │
│ Number of CPU threads:                                     64 │
│ Processor name: AMD EPYC 7452 32-Core Processor               │
├───────────────────────────────────────────────────────────────┤
│                      Simulation details                       │
├───────────────────────────────────────────────────────────────┤
│ Domain dimensions:                                  675 x 800 │
│ Medium type:                                               2D │
│ Simulation time steps:                                   3599 │
├───────────────────────────────────────────────────────────────┤
│                        Initialization                         │
├───────────────────────────────────────────────────────────────┤
│ Memory allocation:                                       Done │
│ Data loading:                                            Done │
│ Elapsed time:                                           0.02s │
├───────────────────────────────────────────────────────────────┤
│ FFT plans creation:                                      Done │
│ Pre-processing phase:                                    Done │
│ Elapsed time:                                           0.43s │
├───────────────────────────────────────────────────────────────┤
│                    Computational resources                    │
├───────────────────────────────────────────────────────────────┤
│ Current host memory in use:                             329MB │
│ Current device memory in use:                          2236MB │
│ Expected output file size:                              178MB │
├───────────────────────────────────────────────────────────────┤
│                          Simulation                           │
├──────────┬────────────────┬──────────────┬────────────────────┤
│ Progress │  Elapsed time  │  Time to go  │  Est. finish time  │
├──────────┼────────────────┼──────────────┼────────────────────┤
│     0%   │        0.001s  │      1.803s  │  21/09/23 14:55:11 │
│     5%   │        0.112s  │      2.110s  │  21/09/23 14:55:12 │
│    10%   │        0.226s  │      2.026s  │  21/09/23 14:55:12 │
│    15%   │        0.327s  │      1.846s  │  21/09/23 14:55:11 │
│    20%   │        0.423s  │      1.687s  │  21/09/23 14:55:11 │
│    25%   │        0.519s  │      1.553s  │  21/09/23 14:55:11 │
│    30%   │        0.614s  │      1.431s  │  21/09/23 14:55:11 │
│    35%   │        0.711s  │      1.318s  │  21/09/23 14:55:12 │
│    40%   │        0.807s  │      1.208s  │  21/09/23 14:55:12 │
│    45%   │        0.903s  │      1.101s  │  21/09/23 14:55:12 │
│    50%   │        0.998s  │      0.997s  │  21/09/23 14:55:11 │
│    55%   │        1.094s  │      0.894s  │  21/09/23 14:55:11 │
│    60%   │        1.190s  │      0.792s  │  21/09/23 14:55:11 │
│    65%   │        1.286s  │      0.691s  │  21/09/23 14:55:11 │
│    70%   │        1.382s  │      0.591s  │  21/09/23 14:55:11 │
│    75%   │        1.477s  │      0.491s  │  21/09/23 14:55:11 │
│    80%   │        1.573s  │      0.392s  │  21/09/23 14:55:11 │
│    85%   │        1.670s  │      0.293s  │  21/09/23 14:55:12 │
│    90%   │        1.766s  │      0.195s  │  21/09/23 14:55:12 │
│    95%   │        1.861s  │      0.097s  │  21/09/23 14:55:12 │
├──────────┴────────────────┴──────────────┴────────────────────┤
│ Elapsed time:                                           1.96s │
├───────────────────────────────────────────────────────────────┤
│ Sampled data post-processing:                            Done │
│ Elapsed time:                                           0.00s │
├───────────────────────────────────────────────────────────────┤
│                            Summary                            │
├───────────────────────────────────────────────────────────────┤
│ Peak host memory in use:                                329MB │
│ Peak device memory in use:                             2238MB │
├───────────────────────────────────────────────────────────────┤
│ Total execution time:                                   3.48s │
├───────────────────────────────────────────────────────────────┤
│                       End of computation                      │
└───────────────────────────────────────────────────────────────┘

time_step =

   1.8473e-08


number_time_steps =

        3599

2023-09-21 14:55:18,334 - INFO - Simulating the acoustic forward process...[Done]

Collaborator


I guess the other structures should also be adjusted, as I mentioned in PR #242

Contributor Author


Yes, good point, I'll add them to this PR

@lkeegan
Contributor Author

lkeegan commented Sep 21, 2023

Concerning the way we handle float32 and float64, I guess we have these options:

  1. Doing all the calculations in torch and numpy in float32
  2. Doing the calculations in torch in float32 but then cast them to float64 for the rest of the pipeline.

What do you think?

I think it depends on what benefit you get from using float64:

  • if all the pipeline steps use float32 (which at this point it looks like they mostly do?) then using float32 everywhere probably makes sense
  • but if you have (or may in the future have) pipeline steps that use float64 (and where you notice the loss of precision if you use float32), then option 2 is probably better
  • as you said there's not much difference in terms of run-time, but using float64 does double the disk space you need to store them (and increases the time taken to read/write them to disk)

If you do choose option 1, I guess that changing the format you write data to disk is probably a breaking change(?), which should be done in a separate PR; so either way, this PR should be fixed to cast the volumes back to float64 as you suggested.
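The disk-space point in the bullets above is easy to quantify; for a 2D slice at the grid size appearing in the k-Wave log (639 x 750 points, chosen here just for illustration), float64 takes exactly twice the bytes of float32:

```python
import numpy as np

# one 2D slice at the grid size from the k-Wave log (639 x 750 points)
vol32 = np.zeros((639, 750), dtype=np.float32)
vol64 = vol32.astype(np.float64)

print(vol32.nbytes, vol64.nbytes)  # float64 uses twice the memory/disk
```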

@kdreher
Collaborator

kdreher commented Sep 21, 2023

I don't think we actually need the precision of float64 anywhere, so I don't see a reason why we shouldn't have everything in float32.

Collaborator

@kdreher kdreher left a comment


looks good

@kdreher kdreher merged commit ed02703 into IMSY-DKFZ:main Sep 21, 2023
12 checks passed