CARE Algorithms
I'm trying to figure out a consistent design for CARE algorithms and need your feedback. As an example I'm using ArrayFill, renamed to fill_n to be more consistent with the std library (ignoring iterators): https://en.cppreference.com/w/cpp/algorithm/fill_n. Note that these options are not mutually exclusive.
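For reference, the std counterpart this naming follows takes an iterator and a count rather than a begin/end pair:

```cpp
#include <algorithm>
#include <vector>

// std::fill_n assigns `value` to the first `n` elements starting at `first`:
//   OutputIt fill_n(OutputIt first, Size n, const T& value);
inline void demo() {
   std::vector<int> v(4, 0);
   std::fill_n(v.begin(), 3, 7); // v is now {7, 7, 7, 0}
}
```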
Option 1: CARE has complete control over what the behavior should be
Advantages: Easy to use, few template instantiations, potentially optimized algorithms that don't need to use RAJA
Disadvantages: No/limited knowledge of the application (we can get the last space arr was touched, but we don't know if there is a pathological pattern that moves data back and forth)
Requirements: Is the data expected to be in the same space it started in?
template <typename T>
void fill_n(care::host_device_ptr<T> arr, int n, T val) {
   CARE_STREAM_LOOP(i, 0, n) {
      arr[i] = val;
   } CARE_STREAM_LOOP_END
}
Option 2: Application decides behavior based on RAJA policies
Advantages: Easy to use, application has complete control
Disadvantages: More template instantiations, algorithms must use RAJA and therefore do not allow platform-specific optimizations
Requirements: Algorithms must use RAJA. Compile-time choice.
// https://raja.readthedocs.io/en/v0.12.0/feature/policies.html
template <typename ExecutionPolicy, typename T>
void fill_n(ExecutionPolicy policy, care::host_device_ptr<T> arr, int n, T val) {
   CARE_LOOP(policy, i, 0, n) {
      arr[i] = val;
   } CARE_LOOP_END
}
Option 3: Application decides what platform to run on, but the implementation details are left to CARE
Advantages: Application has some control but doesn't have to worry about the implementation details, few template instantiations, allows both a run-time choice and a compile-time choice (since enum values can be used as template arguments)
Disadvantages: Figuring out platform-specific optimizations is a hard challenge (we can leverage the std library on the host, though, potentially algorithms provided by CUDA in the future, and we can always fall back to RAJA). Note that these platforms are really software platforms, not hardware platforms.
// https://github.com/LLNL/camp/blob/master/include/camp/resource/platform.hpp#L21
enum class Platform {
   undefined = 0,
   host = 1,
   cuda = 2,
   omp_target = 4,
   hip = 8,
   sycl = 16
};

template <typename T>
void fill_n(Platform platform, care::host_device_ptr<T> arr, int n, T val) {
   if (platform == Platform::host) {
      std::fill_n(arr.data(), n, val);
   }
   else if (platform == Platform::cuda) {
      // A separate __global__ kernel (call it fill_n_kernel here) is needed,
      // since the host function itself cannot be launched as a kernel.
      fill_n_kernel<<<NUM_BLOCKS, NUM_THREADS_PER_BLOCK>>>(arr.data(GPU), n, val);
   }
   // etc...
}
Option 4: Write our own policies: sequential, openmp, cuda, hip
Advantages: CARE has control over what algorithms to implement
Disadvantages: Increased complexity for users and implementers
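One possible shape for Option 4, sketched with tag types and overload resolution. The policy names are illustrative, not actual CARE types, and raw pointers stand in for care::host_device_ptr to keep the sketch self-contained:

```cpp
#include <algorithm>

// Hypothetical CARE-owned policy tags (illustrative names).
struct sequential {};
struct openmp {};
// cuda and hip tags would follow the same pattern.

// Dispatch happens at compile time via overload resolution on the tag.
template <typename T>
void fill_n(sequential, T* arr, int n, T val) {
   std::fill_n(arr, n, val);
}

template <typename T>
void fill_n(openmp, T* arr, int n, T val) {
   #pragma omp parallel for
   for (int i = 0; i < n; ++i) {
      arr[i] = val;
   }
}
```

A caller would write fill_n(sequential{}, data, n, 0.0); swapping the tag changes the backend without touching the rest of the call site.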
Option 5: Write our own enum of hardware platforms: host, device, any others?
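A sketch of what Option 5 might look like, again with raw pointers standing in for care::host_device_ptr; the two-value enum is my guess at the minimal set, and the device branch just loops on the host since no device code is shown:

```cpp
#include <algorithm>

// Hypothetical two-value hardware enum (host vs. device).
enum class HardwarePlatform { host, device };

template <typename T>
void fill_n(HardwarePlatform where, T* arr, int n, T val) {
   if (where == HardwarePlatform::host) {
      std::fill_n(arr, n, val);
   }
   else {
      // A real implementation would launch a kernel or fall back to RAJA;
      // the sketch substitutes a plain host loop.
      for (int i = 0; i < n; ++i) {
         arr[i] = val;
      }
   }
}
```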
Option 6: Provide a resource set (e.g. we have 1 GPU, or 1 CPU, or 36 CPU threads, or a combination).
Advantages: Application tells CARE what resources to use, but leaves the implementation up to CARE.
Disadvantages: Highly complex and architecture dependent.
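To make Option 6 concrete, a minimal sketch of what a resource-set description might look like; the struct and field names are hypothetical, and only the host path is implemented:

```cpp
#include <algorithm>

// Hypothetical description of the resources the application hands to CARE.
struct ResourceSet {
   int num_gpus = 0;
   int num_cpu_threads = 1;
};

template <typename T>
void fill_n(const ResourceSet& res, T* arr, int n, T val) {
   if (res.num_gpus > 0) {
      // CARE would pick a device implementation here; the sketch falls
      // through to the host path since no device code is shown.
   }
   // Host path; a real implementation could thread this based on
   // res.num_cpu_threads.
   std::fill_n(arr, n, val);
}
```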