CARE Algorithms
I'm trying to figure out a consistent design for CARE algorithms and need your feedback. As an example I'm using ArrayFill, renamed to fill_n to be more consistent with the std library (ignoring iterators): https://en.cppreference.com/w/cpp/algorithm/fill_n. Note that these options are not mutually exclusive.
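For reference, the std counterpart this naming follows takes an iterator and a count rather than a begin/end pair:

```cpp
#include <algorithm>
#include <vector>

// std::fill_n assigns `value` to the first `n` elements starting at `first`:
//   OutputIt fill_n(OutputIt first, Size n, const T& value);
inline void demo() {
   std::vector<int> v(4, 0);
   std::fill_n(v.begin(), 3, 7); // v is now {7, 7, 7, 0}
}
```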
Option 1: CARE has complete control over what the behavior should be
Advantages: Easy to use, few template instantiations, potentially optimized algorithms that don't need to use RAJA
Disadvantages: No/limited knowledge of the application (we can get the last space arr was touched, but we don't know if there is a pathological pattern that moves data back and forth)
Requirements: Is the data expected to be in the same space it started in?
template <typename T>
void fill_n(care::host_device_ptr<T> arr, int n, T val) {
   CARE_STREAM_LOOP(i, 0, n) {
      arr[i] = val;
   } CARE_STREAM_LOOP_END
}
Option 2: Application decides behavior based on RAJA policies
Advantages: Easy to use, application has complete control
Disadvantages: More template instantiations, algorithms must use RAJA and therefore do not allow platform-specific optimizations
Requirements: Algorithms must use RAJA. Compile-time choice.
// https://raja.readthedocs.io/en/v0.12.0/feature/policies.html
template <typename ExecutionPolicy, typename T>
void fill_n(ExecutionPolicy policy, care::host_device_ptr<T> arr, int n, T val) {
   CARE_LOOP(policy, i, 0, n) {
      arr[i] = val;
   } CARE_LOOP_END
}
Option 3: Application decides what platform to run on, but the implementation details are left to CARE
Advantages: Application has some control but doesn't have to worry about the implementation details, few template instantiations, allows both a run-time choice and a compile-time choice (since enum values can be used as template arguments)
Disadvantages: Figuring out platform-specific optimizations is a hard challenge (we can leverage the std library on the host, though, potentially algorithms provided by CUDA in the future, and we can always fall back to RAJA). Note that these platforms are really software platforms, not hardware platforms.
// https://github.com/LLNL/camp/blob/master/include/camp/resource/platform.hpp#L21
enum class Platform {
   undefined = 0,
   host = 1,
   cuda = 2,
   omp_target = 4,
   hip = 8,
   sycl = 16
};

template <typename T>
void fill_n(Platform platform, care::host_device_ptr<T> arr, int n, T val) {
   if (platform == Platform::host) {
      std::fill_n(arr.data(), n, val);
   }
   else if (platform == Platform::cuda) {
      // A separate __global__ kernel (call it fill_n_kernel here) is needed,
      // since the host function itself cannot be launched as a kernel.
      fill_n_kernel<<<NUM_BLOCKS, NUM_THREADS_PER_BLOCK>>>(arr.data(GPU), n, val);
   }
   // etc...
}
Option 4: Write our own policies: sequential, openmp, cuda, hip
Advantages: CARE has control over what algorithms to implement
Disadvantages: Increased complexity for users and implementers
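One possible shape for Option 4, sketched with tag types and overload resolution. The policy names are illustrative, not actual CARE types, and raw pointers stand in for care::host_device_ptr to keep the sketch self-contained:

```cpp
#include <algorithm>

// Hypothetical CARE-owned policy tags (illustrative names).
struct sequential {};
struct openmp {};
// cuda and hip tags would follow the same pattern.

// Dispatch happens at compile time via overload resolution on the tag.
template <typename T>
void fill_n(sequential, T* arr, int n, T val) {
   std::fill_n(arr, n, val);
}

template <typename T>
void fill_n(openmp, T* arr, int n, T val) {
   #pragma omp parallel for
   for (int i = 0; i < n; ++i) {
      arr[i] = val;
   }
}
```

A caller would write fill_n(sequential{}, data, n, 0.0); swapping the tag changes the backend without touching the rest of the call site.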
Option 5: Write our own enum of hardware platforms: host, device, any others?
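A sketch of what Option 5 might look like, again with raw pointers standing in for care::host_device_ptr; the two-value enum is my guess at the minimal set, and the device branch just loops on the host since no device code is shown:

```cpp
#include <algorithm>

// Hypothetical two-value hardware enum (host vs. device).
enum class HardwarePlatform { host, device };

template <typename T>
void fill_n(HardwarePlatform where, T* arr, int n, T val) {
   if (where == HardwarePlatform::host) {
      std::fill_n(arr, n, val);
   }
   else {
      // A real implementation would launch a kernel or fall back to RAJA;
      // the sketch substitutes a plain host loop.
      for (int i = 0; i < n; ++i) {
         arr[i] = val;
      }
   }
}
```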
Option 6: Provide a resource set (e.g. we have 1 GPU, or 1 CPU, or 36 CPU threads, or a combination).
Advantages: Application tells CARE what resources to use, but leaves the implementation up to CARE.
Disadvantages: Highly complex and architecture dependent.
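To make Option 6 concrete, a minimal sketch of what a resource-set description might look like; the struct and field names are hypothetical, and only the host path is implemented:

```cpp
#include <algorithm>

// Hypothetical description of the resources the application hands to CARE.
struct ResourceSet {
   int num_gpus = 0;
   int num_cpu_threads = 1;
};

template <typename T>
void fill_n(const ResourceSet& res, T* arr, int n, T val) {
   if (res.num_gpus > 0) {
      // CARE would pick a device implementation here; the sketch falls
      // through to the host path since no device code is shown.
   }
   // Host path; a real implementation could thread this based on
   // res.num_cpu_threads.
   std::fill_n(arr, n, val);
}
```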