Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Omp target offload #1212

Merged
merged 44 commits into from
May 5, 2023
Merged

Conversation

jamesp-epcc
Copy link
Contributor

This is my GPU acceleration work on the sensor model, rebased against the current main branch. Obviously this is a very extensive change, so I'm open to discussing how it's implemented and making changes to better fit the existing code. There is a single unified code base for both GPU and CPU implementations; when offloading is disabled (which can be done via an environment variable or by building with a non-GPU-aware compiler), the offloaded regions act like OpenMP parallel regions, so the loops still take advantage of multiple CPU cores as before. A version of Clang with offloading support is required to build this for GPU.

@jamesp-epcc jamesp-epcc added optimization/performance Related to the speed and/or memory consumption of some aspect of the code desc Of possible interest to LSST DESC members looking for a project labels Feb 13, 2023
@jamesp-epcc
Copy link
Contributor Author

I see the CI is failing. Apparently gcc doesn't like me mapping this via OpenMP. Unfortunately this is needed to build a working GPU version with Clang. I had thought it should be harmless to have the offload directives there even for CPU builds, but it might be necessary to wrap them in #ifdefs so that the compiler only sees them when actually building for GPU.

@jamesp-epcc
Copy link
Contributor Author

I've wrapped the GPU directives in #ifdefs and the checks now pass. The original code already had similar checks in place for the OpenMP parallel directives so it's not too big a departure from how we previously did things. I still need to investigate whether actually building and running the GPU version in CI is feasible.

Copy link
Member

@rmjarvis rmjarvis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This PR is pretty hard to review right now. The src/Silicon.cpp file in particular.

You moved some of the functions around, which means they don't show up as a useful diff currently. There is a full function removed, and then a very similar function added in a different location. You also removed some inline comments, which makes git think blocks of code that are otherwise identical are actually different. There are also some new functions, which are just old functions with some slight changes to data structures, not actually (I think!) any new code.

Please reorganize this so that the Silicon.cpp file show on github as a reasonable diff, so I can review what the real changes are. Thanks!

setup.py Outdated
@@ -47,7 +47,7 @@
raise

# Turn this on for more verbose debugging output about compile attempts.
debug = False
debug = True
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This needs to be turned back off.

src/Silicon.cpp Outdated
}

bool Silicon::insidePixel(int ix, int iy, double x, double y, double zconv,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you please move this and searchNeighbors back up to where they used to be, so git will show me what the differences actually are? Putting them down here means I have to manually compare the two versions to check that nothing substantive changed. Likewise make sure addDelta and subtractDelta show up here as regular diffs, since they're also in different places now.

src/Silicon.cpp Outdated
}

template <typename T>
double Silicon::accumulate(const PhotonArray& photons, int i1, int i2,
BaseDeviate rng, ImageView<T> target)
BaseDeviate rng, ImageView<T> target)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks like you maybe have tabs in the new version? There shouldn't be any whitespace change here.

src/Silicon.cpp Outdated
185528756957.328400, 182815356489.945160, 263157894736.842072,
398406374501.992065, 558659217877.094971, 469483568075.117371,
833333333333.333374, 917431192660.550415, 1058201058201.058228
};
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this is right. We pass the abs_length table from python. This shouldn't be hard-coded into the C++ layer.

src/Silicon.cpp Outdated
if (tableIdx < 0) tableIdx = 0;
int tableIdx1 = tableIdx + 1;
if (tableIdx > 239) tableIdx = 239;
if (tableIdx1 > 239) tableIdx1 = 239;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Too many special numbers here. 255.0, 5.0, 239. These all need to be computed from the input abs_length_table you get from python. Don't assume these are necessarily the values forever.

src/Silicon.cpp Outdated
double* photonsDXDZ = photonsMutable.getDXDZArray();
double* photonsDYDZ = photonsMutable.getDYDZArray();
double* photonsFlux = photonsMutable.getFluxArray();
double* photonsWavelength = photonsMutable.getWavelengthArray();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you really need pointers to mutable values? If not, you should just add const versions of these in the PhotonArray class that returns const double* pointers rather than using const_cast.

src/Silicon.cpp Outdated
y0 += dydz * dz_pixel; // dy in pixels
}
xdbg<<" => "<<x0<<','<<y0;
// This is the reverse of depth. zconv is how far above the substrate the e- converts.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You got rid of most of these inline comments in your refactoring. Please add them back.

Bounds<double>* pixelInnerBoundsData,
Bounds<double>* pixelOuterBoundsData,
Position<float>* horizontalBoundaryPointsData,
Position<float>* verticalBoundaryPointsData);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's not have two ways to do this please. Rewrite the existing updatePixelBounds to work with the GPU. Don't repeat everything in a second function.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not fixed yet. You still have both of these functions.

src/Silicon.cpp Show resolved Hide resolved
src/Silicon.cpp Outdated
for (int i = 0; i < imageDataSize; i++) {
targetData[i] += deltaData[i];
deltaData[i] = 0.0;
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I assume this should be a call to addDelta, right? Looks like an oversight in the previous code, but it's more important now that addDelta isn't just a one-liner.

@rmjarvis rmjarvis added this to the v2.5 milestone Mar 28, 2023
@jamesp-epcc
Copy link
Contributor Author

I have addressed Mike's feedback and rebased against the current main branch.

  • The function re-ordering, the whitespace changes and the loss of comments were an artefact of me adding GPU versions of the functions initially alongside the originals, then later renaming them and removing the originals, but hopefully all those issues are resolved now.
  • The values from _abs_length_table are now taken from the table passed in from the Python layer rather than being hardcoded (this was meant to be a temporary change but I forgot about it, apologies).
  • Const getter methods have been added to PhotonArray instead of const_casting it.
  • The separate CPU version of updatePixelDistortions has now been removed. By making a few changes I was able to use the GPU-enabled version for all purposes instead.
  • The pixel distortion update has been moved back out of the update method, as it was originally. addDelta is not being used in update, but I have added a comment explaining why.

src/Silicon.cpp Show resolved Hide resolved
@@ -265,6 +278,19 @@ namespace galsim
Table _abs_length_table;
bool _transpose;
ImageAlloc<double> _delta;
std::shared_ptr<bool> _changed;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm pretty sure this is wrong. This used to be a local vector. And _changed.get() is still being used as a bare C array. This might explain the crashes you were seeing. So after fixing this, probably could try to put back the min/max usage, which IMO is more readable than the two step you switched to.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I moved _changed to be a member variable rather than local to avoid the overhead of allocating it on the GPU every time update is called. The reason it's now a shared_ptr to a bare array instead of a vector is that bool vectors have an optimised implementation that packs multiple bools into each byte, so you can't get a pointer to the raw data, which the GPU requires.

I agree that std::min and std::max would be more readable, but surprisingly the GPU doesn't support them (I was using them initially but they caused weird bugs that took a long time to track down). We could implement custom, GPU-friendly min and max functions rather than writing out the algorithm though.

if (y < ny) changed[(x * ny) + y] = true; // pixel above
if (y > 0) changed[(x * ny) + (y - 1)] = true; // pixel below
if (y < ny) changedData[(x * ny) + y] = true; // pixel above
if (y > 0) changedData[(x * ny) + (y - 1)] = true; // pixel below
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here is where you use changeData, but this is a pointer that you got from _changed.get(), and _changed is just a single shared_ptr, not a vector. So this looks like undefined behavior here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

_changed.get() will return the raw bool pointer from inside the shared pointer, so this works as intended. See above for explanation of why this is no longer a vector.

Bounds<double>* pixelInnerBoundsData,
Bounds<double>* pixelOuterBoundsData,
Position<float>* horizontalBoundaryPointsData,
Position<float>* verticalBoundaryPointsData);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not fixed yet. You still have both of these functions.

src/Silicon.cpp Outdated Show resolved Hide resolved
src/Silicon.cpp Outdated Show resolved Hide resolved
src/Silicon.cpp Show resolved Hide resolved
src/Silicon.cpp Outdated Show resolved Hide resolved
src/Silicon.cpp Show resolved Hide resolved
src/Silicon.cpp Outdated Show resolved Hide resolved
@jamesp-epcc
Copy link
Contributor Author

The CI tests are failing as they are unable to install codecov. However I think this is unrelated to my changes (see here: home-assistant/core#91283 ).

@rmjarvis
Copy link
Member

rmjarvis commented Apr 17, 2023

Just remove codecov from the ci script.

We actually haven't been using it for a while, since they now recommend a bash uploader. And they recently removed codecov from pypi, so pip can't find it anymore. Probably to help discourage people from using the obsolete uploader.

src/Silicon.cpp Outdated Show resolved Hide resolved
src/Silicon.cpp Outdated Show resolved Hide resolved
Copy link
Member

@rmjarvis rmjarvis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, I think I'm mostly happy with this now. Just a couple minor comments.

But I do still want Josh to take a look at some of the OpenMP GPU directives to see if he has any comments there.

src/Silicon.cpp Outdated Show resolved Hide resolved
src/Silicon.cpp Outdated Show resolved Hide resolved
_emptypolyGPU[i].x = _emptypoly[i].x;
_emptypolyGPU[i].y = _emptypoly[i].y;
}
Position<double>* emptypolyData = _emptypolyGPU.data();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think my above comment still applies. I also suspect most of this (after setting targetData and deltaData) belongs in the constructor. Not in initialize, since I think they are not going to change for each target image.

Copy link
Member

@jmeyers314 jmeyers314 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me. I left a couple small comments.

src/Silicon.cpp Outdated Show resolved Hide resolved
src/Silicon.cpp Outdated Show resolved Hide resolved
Rename pixelInnerBoundsSize variable to pixelBoundsSize
@rmjarvis rmjarvis merged commit 6d9d0e9 into GalSim-developers:main May 5, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
desc Of possible interest to LSST DESC members looking for a project optimization/performance Related to the speed and/or memory consumption of some aspect of the code
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants