Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Omp target offload #1212

Merged
merged 44 commits into from
May 5, 2023
Merged
Show file tree
Hide file tree
Changes from 32 commits
Commits
Show all changes
44 commits
Select commit Hold shift + click to select a range
98e185e
Add GPU-accelerated version of accumulate
jamesp-epcc Feb 18, 2022
fe95922
Map data independently from code offload (not working yet)
jamesp-epcc Mar 3, 2022
af168f2
Implement pixel distortion update on GPU
jamesp-epcc Mar 18, 2022
f014a0c
Prevent crash in GPU distortion update loop
jamesp-epcc Apr 1, 2022
ffe777c
Fix a bug in GPU pixel bounds computation
jamesp-epcc Apr 1, 2022
38da021
Fix null pointer error with certain compilers when some photon data n…
jamesp-epcc Jun 10, 2022
b12e813
Fix various bugs so that output from GPU matches CPU version
jamesp-epcc Jul 22, 2022
6dddbb2
Add finalizeGPU method to clean everything up and copy back result
jamesp-epcc Jul 22, 2022
94c9318
Remove some unused and commented out stuff
jamesp-epcc Jul 22, 2022
3bd5298
Tidy up function arguments and fix diffStepRandom size bug
jamesp-epcc Jul 25, 2022
7acb4c4
Remove some unneeded GPU pointers. Support single precision images on…
jamesp-epcc Aug 1, 2022
c915d0b
Remove GPU-specific bounds data and use same structures as CPU
jamesp-epcc Aug 1, 2022
5a12cbc
Remove separate GPU arrays for boundary points and distortions
jamesp-epcc Aug 1, 2022
be0a14b
Copy data back from GPU in addDelta and subtractDelta
jamesp-epcc Aug 1, 2022
3b237e0
Release GPU memory after completion
jamesp-epcc Aug 19, 2022
1d77aa9
Remove manual memory management from GPU code (still needs tested!)
jamesp-epcc Aug 22, 2022
45bb1da
Remove custom GPU data structures and use same ones as on CPU
jamesp-epcc Aug 29, 2022
863af63
Allow GPU version to build with setup.py
jamesp-epcc Sep 30, 2022
6724bef
Add GPU support to setup script
jamesp-epcc Oct 21, 2022
6bad511
Get GPU sensor model to pass tests
jamesp-epcc Dec 2, 2022
903e5b9
Make resume feature work with GPU
jamesp-epcc Dec 9, 2022
8fadc6c
Tidy up code. Remove redundant non-GPU versions of functions
jamesp-epcc Jan 20, 2023
550af7a
Work around incorrect Clang definition of __CUDA_ARCH__
jamesp-epcc Feb 6, 2023
ae8b6ef
Ensure -lgalsim is passed to linker only once and is before -lfftw3
jamesp-epcc Feb 6, 2023
2e8028e
Wrap all GPU directives in #ifdefs
jamesp-epcc Feb 20, 2023
0423cf1
Fix function ordering, whitespace and comments. Turn off debug mode
jamesp-epcc Mar 17, 2023
c31a68a
Merge random arrays into one to get around Clang runtime argument limit
jamesp-epcc Mar 24, 2023
cdaa15b
Add const getter methods to PhotonArray instead of using const_cast
jamesp-epcc Mar 24, 2023
496530e
Move pixel distortion update back out into separate method (still nee…
jamesp-epcc Mar 31, 2023
62e11ca
Remove hardcoded abs_length_table and use values passed in from Pytho…
jamesp-epcc Mar 31, 2023
1107b33
Use GPU version of updatePixelDistortions in initialisation
jamesp-epcc Mar 31, 2023
cf8ff66
Get rid of separate GPU and CPU versions of updatePixelDistortions
jamesp-epcc Mar 31, 2023
e10f9cd
Fix various issues with comments
jamesp-epcc Apr 10, 2023
64dfa4b
Use simpler method to calculate size of image data arrays
jamesp-epcc Apr 14, 2023
49e7dc9
Call calculateConversionDepth instead of inlining
jamesp-epcc Apr 14, 2023
b7d08de
Add comment explaining that GPU requires raw pointers rather than obj…
jamesp-epcc Apr 14, 2023
895253c
Call Bounds::includes instead of inlining
jamesp-epcc Apr 14, 2023
f5c7b19
Use same version of updatePixelBounds on CPU and GPU
jamesp-epcc Apr 14, 2023
e0882cf
Use unique_ptr<bool[]> to ensure array gets deleted properly
jamesp-epcc Apr 21, 2023
6290a4e
Move updatePixelBounds back to original location and fix formatting
jamesp-epcc Apr 24, 2023
0502f08
Use += operator in updatePixelBounds to make code clearer
jamesp-epcc Apr 28, 2023
4454bf0
Add back old code in insidePixel for documentation purposes and impro…
jamesp-epcc Apr 28, 2023
cd83a3c
Move one-time data initialisation to constructor
jamesp-epcc Apr 28, 2023
39f941e
Define integer min and max functions and use them for clarity
jamesp-epcc May 3, 2023
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
4 changes: 4 additions & 0 deletions include/galsim/Laguerre.h
Original file line number Diff line number Diff line change
Expand Up @@ -28,6 +28,10 @@
#if defined(__GNUC__) && __GNUC__ >= 6
#pragma GCC diagnostic ignored "-Wint-in-bool-context"
#endif

// Clang incorrectly defines __CUDA_ARCH__ in host code when building for
// OpenMP target offload, so we have to undefine it or Eigen gets confused
#undef __CUDA_ARCH__
#include "Eigen/Dense"
using Eigen::VectorXd;
using Eigen::MatrixXd;
Expand Down
6 changes: 6 additions & 0 deletions include/galsim/PhotonArray.h
Original file line number Diff line number Diff line change
Expand Up @@ -90,6 +90,12 @@ namespace galsim {
double* getDXDZArray() { return _dxdz; }
double* getDYDZArray() { return _dydz; }
double* getWavelengthArray() { return _wave; }
const double* getXArray() const { return _x; }
const double* getYArray() const { return _y; }
const double* getFluxArray() const { return _flux; }
const double* getDXDZArray() const { return _dxdz; }
const double* getDYDZArray() const { return _dydz; }
const double* getWavelengthArray() const { return _wave; }
bool hasAllocatedAngles() const { return _dxdz != 0 && _dydz != 0; }
bool hasAllocatedWavelengths() const { return _wave != 0; }
/**
Expand Down
40 changes: 33 additions & 7 deletions include/galsim/Silicon.h
Original file line number Diff line number Diff line change
Expand Up @@ -34,20 +34,25 @@

namespace galsim
{

class PUBLIC_API Silicon
{
public:
Silicon(int numVertices, double numElec, int nx, int ny, int qDist,
double diffStep, double pixelSize, double sensorThickness, double* vertex_data,
const Table& tr_radial_table, Position<double> treeRingCenter,
const Table& abs_length_table, bool transpose);

template <typename T>
bool insidePixel(int ix, int iy, double x, double y, double zconv,
ImageView<T> target, bool* off_edge=0) const;

void scaleBoundsToPoly(int i, int j, int nx, int ny,
~Silicon();

bool insidePixel(int ix, int iy, double x, double y, double zconv,
Bounds<int>& targetBounds, bool* off_edge,
int emptypolySize,
Bounds<double>* pixelInnerBoundsData,
Bounds<double>* pixelOuterBoundsData,
Position<float>* horizontalBoundaryPointsData,
Position<float>* verticalBoundaryPointsData,
Position<double>* emptypolyData) const;

void scaleBoundsToPoly(int i, int j, int nx, int ny,
const Polygon& emptypoly, Polygon& result,
double factor) const;

Expand All @@ -72,6 +77,8 @@ namespace galsim
template <typename T>
void initialize(ImageView<T> target, Position<int> orig_center);

void finalize();

template <typename T>
double accumulate(const PhotonArray& photons, int i1, int i2,
BaseDeviate rng, ImageView<T> target);
Expand Down Expand Up @@ -249,6 +256,12 @@ namespace galsim

void updatePixelBounds(int nx, int ny, size_t k);

void updatePixelBoundsGPU(int nx, int ny, size_t k,
Bounds<double>* pixelInnerBoundsData,
Bounds<double>* pixelOuterBoundsData,
Position<float>* horizontalBoundaryPointsData,
Position<float>* verticalBoundaryPointsData);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's not have two ways to do this please. Rewrite the existing updatePixelBounds to work with the GPU. Don't repeat everything in a second function.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not fixed yet. You still have both of these functions.


Polygon _emptypoly;
mutable std::vector<Polygon> _testpoly;

Expand All @@ -265,6 +278,19 @@ namespace galsim
Table _abs_length_table;
bool _transpose;
ImageAlloc<double> _delta;
std::shared_ptr<bool> _changed;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm pretty sure this is wrong. This used to be a local vector. And _changed.get() is still being used as a bare C array. This might explain the crashes you were seeing. So after fixing this, probably could try to put back the min/max usage, which IMO is more readable than the two step you switched to.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I moved _changed to be a member variable rather than local to avoid the overhead of allocating it on the GPU every time update is called. The reason it's now a shared_ptr to a bare array instead of a vector is that bool vectors have an optimised implementation that packs multiple bools into each byte, so you can't get a pointer to the raw data, which the GPU requires.

I agree that std::min and std::max would be more readable, but surprisingly the GPU doesn't support them (I was using them initially but they caused weird bugs that took a long time to track down). We could implement custom, GPU-friendly min and max functions rather than writing out the algorithm though.


// GPU data
std::vector<double> _abs_length_table_GPU;
std::vector<Position<double> > _emptypolyGPU;
double _abs_length_arg_min, _abs_length_arg_max;
double _abs_length_increment;
int _abs_length_size;

// need to keep a pointer to the last target image's data and its data type
// so we can release it on the GPU later
void* _targetData;
bool _targetIsDouble;
};

PUBLIC_API int SetOMPThreads(int num_threads);
Expand Down
41 changes: 34 additions & 7 deletions setup.py
Original file line number Diff line number Diff line change
Expand Up @@ -81,6 +81,9 @@ def all_files_from(dir, ext=''):
'-Wno-shorten-64-to-32','-fvisibility=hidden','-stdlib=libc++'],
'clang w/ manual OpenMP' : ['-O2','-std=c++11','-Xpreprocessor','-fopenmp',
'-Wno-shorten-64-to-32','-fvisibility=hidden','-stdlib=libc++'],
'clang w/ GPU' : ['-O2','-msse2','-std=c++11','-fopenmp','-fopenmp-targets=nvptx64-nvidia-cuda',
'-Wno-openmp-mapping','-Wno-unknown-cuda-version',
'-Wno-shorten-64-to-32','-fvisibility=hidden', '-DGALSIM_USE_GPU'],
'unknown' : [],
}
lopt = {
Expand All @@ -90,6 +93,8 @@ def all_files_from(dir, ext=''):
'clang w/ OpenMP' : ['-stdlib=libc++','-fopenmp'],
'clang w/ Intel OpenMP' : ['-stdlib=libc++','-liomp5'],
'clang w/ manual OpenMP' : ['-stdlib=libc++','-lomp'],
'clang w/ GPU' : ['-fopenmp','-fopenmp-targets=nvptx64-nvidia-cuda',
'-Wno-openmp-mapping','-Wno-unknown-cuda-version'],
'unknown' : [],
}

Expand Down Expand Up @@ -143,7 +148,11 @@ def get_compiler_type(compiler, check_unknown=True, output=False):
# with the openmp flag and see if it works.
if output:
print('Compiler is Clang. Checking if it is a version that supports OpenMP.')
if try_openmp(compiler, 'clang w/ OpenMP'):
if supports_gpu(compiler):
if output:
print("Yay! This version of clang supports GPU!")
return 'clang w/ GPU'
elif try_openmp(compiler, 'clang w/ OpenMP'):
if output:
print("Yay! This version of clang supports OpenMP!")
return 'clang w/ OpenMP'
Expand Down Expand Up @@ -193,6 +202,23 @@ def get_compiler_type(compiler, check_unknown=True, output=False):
else:
return 'unknown'

# Check whether this build of Clang supports offloading to GPU via OpenMP
def supports_gpu(compiler):
# Print out compiler's targets
cc = compiler.compiler_so[0]
if cc == 'ccache':
cc = compiler.compiler_so[1]
cmd = [cc,'-print-targets']
p = subprocess.Popen(cmd, stdout=subprocess.PIPE, stderr=subprocess.STDOUT)
lines = p.stdout.readlines()
# Look for 'nvptx' in the output. May need a more general check in future to support
# other GPU architectures
for line in lines:
line = line.decode().strip()
if 'nvptx' in line:
return True
return False

# Check for the fftw3 library in some likely places
def find_fftw_lib(output=False):
import distutils.sysconfig
Expand Down Expand Up @@ -435,7 +461,7 @@ def try_compile(cpp_code, compiler, cflags=[], lflags=[], prepend=None, check_wa
cpp_name = cpp_file.name

# Just get a named temporary file to write to:
with tempfile.NamedTemporaryFile(delete=False, suffix='.os', dir=local_tmp) as o_file:
with tempfile.NamedTemporaryFile(delete=False, suffix='.o', dir=local_tmp) as o_file:
o_name = o_file.name

# Another named temporary file for the executable
Expand Down Expand Up @@ -486,7 +512,8 @@ def try_compile(cpp_code, compiler, cflags=[], lflags=[], prepend=None, check_wa
print('Trying link command:')
print(' '.join(cmd))
print('Output was:')
print(' ',b' '.join(lines).decode())
#print(' ',b' '.join(lines).decode())
print(' ',b' '.join(lines))
returncode = p.returncode
except OSError as e:
if debug:
Expand Down Expand Up @@ -538,7 +565,8 @@ def try_compile(cpp_code, compiler, cflags=[], lflags=[], prepend=None, check_wa
print('Trying link command:')
print(' '.join(cmd))
print('Output was:')
print(' ',b' '.join(lines).decode())
#print(' ',b' '.join(lines).decode())
print(' ',b' '.join(lines))
returncode = p.returncode
except OSError as e:
if debug:
Expand Down Expand Up @@ -823,8 +851,6 @@ def add_dirs(builder, output=False):
if hasattr(builder, 'library_dirs'):
if fftw_libpath != '':
builder.library_dirs.append(fftw_libpath)
builder.libraries.append('galsim') # Make sure galsim comes before fftw3
builder.libraries.append(os.path.split(fftw_lib)[1].split('.')[0][3:])
fftw_include = os.path.join(os.path.split(fftw_libpath)[0], 'include')
if os.path.isfile(os.path.join(fftw_include, 'fftw3.h')):
print('Include directory for fftw3 is ',fftw_include)
Expand Down Expand Up @@ -1276,7 +1302,8 @@ def run_tests(self):
ext=Extension("galsim._galsim",
py_sources,
depends = cpp_sources + headers + inst,
undef_macros = undef_macros)
undef_macros = undef_macros,
extra_link_args = ["-lfftw3"])

build_dep = ['setuptools>=38', 'pybind11>=2.2', 'numpy>=1.17']
run_dep = ['astropy', 'LSSTDESC.Coord']
Expand Down
4 changes: 4 additions & 0 deletions src/RealGalaxy.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -20,6 +20,10 @@
#if defined(__GNUC__) && __GNUC__ >= 6
#pragma GCC diagnostic ignored "-Wint-in-bool-context"
#endif

// Clang incorrectly defines __CUDA_ARCH__ in host code when building for
// OpenMP target offload, so we have to undefine it or Eigen gets confused
#undef __CUDA_ARCH__
#include "Eigen/Dense"

#include "RealGalaxy.h"
Expand Down
Loading