Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Develop stream 2024-06-26 #575

Merged
merged 174 commits into from
Aug 3, 2024

Conversation

Snektron
Copy link
Contributor

@Snektron Snektron commented Jun 26, 2024

@Naraenda is taking this over since @Snektron is on holiday.

This PR adds the following merge commits. Emphasis on important changes.

  • 46faf89 test: make tests pass with hardened libstc++ (resolves Failing tests with libstdc++ assertions due to unsigned overflow #570)
  • 39c9d9b ci(gitlab-ci.yml): re-enable most of the debug builds
  • c6ea6aa test: improved test of device histogram_even for half/bfloat16
  • a40cbdb fix: fix warnings when building rocprim from rocthrust when using hipcc
  • 10c6175 feat: add partial_sort and partial_sort_copy algorithms
  • 4a26848 feat: add deterministic versions of rocprim::inclusive_scan, rocprim::exclusive_scan, rocprim::inclusive_scan_by_key, rocprim::exclusive_scan_by_key, and rocprim::reduce_by_key
  • 9c7c297 style: format attributes on a separate line
  • fc2de52 perf: remove oracle in nth element
  • a4ae29c ci(gitlab-ci.yml): compress autotuning artifacts before uploading them
  • fab7fd6 refactor: improve code-reuse of rocprim::detail::merge_path
  • f586d3e test: add extra tests to nth-element
  • d09bae6 feat: add nth-element algorithm
  • 43c7ce7 perf: add auto tuning to reduce by key
  • d63cd11 feat: add large index support to device segmented reduce
  • 110f2d0 test: fix sizes generated by benchmark utility for random segments
  • c523e13 ci: build everything using cmake-minimum and cmake-latest workflows
  • d4aec9d test: fix benchmark minmax underoverflow
  • 941984e test: disable naive batch memcpy in benchmarks
  • 94e17b4 docs: document thread level methods
  • f90e4e7 refactor: only autotune sleep-based lookback scan for gfx908
  • 7b6b04b docs: add documentation for autotuning
  • e2aec7e test: improve test logging
  • 1617f92 refactor: deprecate raw_storage and replace it by uninitialized_array due to possible undefined behavior with raw_storage
  • f7209de refactor: deprecate use of rocprim::thread_load and rocprim::thread_store
  • 917b151 fix: fix block histogram test failures
  • 441a934 docs: document thread types
  • 2455a40 ci(gitlab-ci.yml): disable debug builds due to long compilation times
  • 9fc5d18 fix: remove memory fence workaround on gfx10 and gfx11
  • c07e212 fix: fix unroll warnings in debug min size releases
  • ad746b4 perf: add tuning to device partition
  • 9148c77 test: add benchmark flag to specify seed
  • 9e849db docs: add contribution guidelines
  • 71ba9ce fix: prevent rocprim::adjacent_difference benchmarks from using more than 8 gb vram
  • d3b3441 test: fix windows test assertions
  • c5a9d07 fix: disable force inline on windows
  • 58c96d9 refactor: remove device device warpsize usage in host function in radix sort benchmark generator
  • fc2ac73 feat: add an overload of rocprim::match_any which accepts the maximum number of bits as a run-time parameter
  • 61ca0d1 test: change default size in adjacent difference benchmark
  • 246fe37 fix: various fixes for batch memcpy
  • de58f35 perf: add auto tuning to transform

@Naraenda
Copy link
Member

Naraenda commented Jul 9, 2024

Rebased and added a fix that resolves #570

@Naraenda
Copy link
Member

Naraenda commented Jul 9, 2024

Fixed doxygen errors in 8d5b5e8

@Naraenda Naraenda force-pushed the develop_stream_20240626 branch from 8d5b5e8 to 4651b07 Compare July 18, 2024 16:22
@Naraenda
Copy link
Member

Naraenda commented Jul 18, 2024

4651b07 Noticed CI failures after targeting a later commit in develop. rocm-cmake was recently-ish changed and now requires building. This commit resolves that when the rocm-cmake is freshly downloaded through FetchContent.

6edfe13 Should fix one of the test failures on MI200 with very large inputs.

Also rebased it, again.

CC @stanleytsang-amd

@Naraenda Naraenda mentioned this pull request Jul 18, 2024
@stanleytsang-amd
Copy link
Collaborator

4651b07 Noticed CI failures after targeting a later commit in develop. rocm-cmake was recently-ish changed and now requires building. This commit resolves that when the rocm-cmake is freshly downloaded through FetchContent.

6edfe13 Should fix one of the test failures on MI200 with very large inputs.

Also rebased it, again.

CC @stanleytsang-amd

Thanks Nara, I will rerun CI right now.

@Naraenda
Copy link
Member

I managed to test this on a MI200. It turns out, for reasons I'm not fully aware of, that running hipStreamSynchronize after the cooperative kernel launch is required. Moving that statement out of the debug synchronous branch seems to resolve the hanging on MI200 (4314c76). I'm checking the implementation in ROCm/CLR, but nothing out of the ordinary seems to jump out.

CC @stanleytsang-amd

@stanleytsang-amd
Copy link
Collaborator

I managed to test this on a MI200. It turns out, for reasons I'm not fully aware of, that running hipStreamSynchronize after the cooperative kernel launch is required. Moving that statement out of the debug synchronous branch seems to resolve the hanging on MI200 (4314c76). I'm checking the implementation in ROCm/CLR, but nothing out of the ordinary seems to jump out.

CC @stanleytsang-amd

Re-running CI now. It does seem strange that an explicit hipStreamSynchronize is required here since this function only deals with one stream and all kernels are being executed on that stream and I don't see any data race conditions here. Maybe we can make a small reproducer and file a bug report with the HIP team?

@stanleytsang-amd
Copy link
Collaborator

I managed to test this on a MI200. It turns out, for reasons I'm not fully aware of, that running hipStreamSynchronize after the cooperative kernel launch is required. Moving that statement out of the debug synchronous branch seems to resolve the hanging on MI200 (4314c76). I'm checking the implementation in ROCm/CLR, but nothing out of the ordinary seems to jump out.

CC @stanleytsang-amd

It seems as though the hang still persists on MI200 on CI, even with the hipStreamSychronize().

@Naraenda
Copy link
Member

Rebased & resolved conflicts

Copy link
Collaborator

@stanleytsang-amd stanleytsang-amd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Naraenda
Copy link
Member

Can you update the packaging version at https://github.com/ROCm/rocPRIM/pull/575/files#diff-1e7de1ae2d059d21e1dd75d5812d5a34b0222cef273b7c3a2af62eb747f9d20aL133 to 3.3.0 as well? Thanks.

Done in 8d06855.

@stanleytsang-amd
Copy link
Collaborator

Can you update the packaging version at https://github.com/ROCm/rocPRIM/pull/575/files#diff-1e7de1ae2d059d21e1dd75d5812d5a34b0222cef273b7c3a2af62eb747f9d20aL133 to 3.3.0 as well? Thanks.

Done in 8d06855.

Thanks.

@stanleytsang-amd stanleytsang-amd self-requested a review July 31, 2024 15:59
@Beanavil Beanavil force-pushed the develop_stream_20240626 branch from 8d06855 to 5f20e73 Compare August 2, 2024 14:13
NB4444 and others added 26 commits August 2, 2024 14:21
…ontinuity

These tests take extremely long time to build with clang from ROCm 6.1+.
@Beanavil Beanavil force-pushed the develop_stream_20240626 branch from 5f20e73 to 87abd8c Compare August 2, 2024 14:22
@stanleytsang-amd stanleytsang-amd self-requested a review August 3, 2024 01:03
@stanleytsang-amd stanleytsang-amd merged commit dbb52d5 into ROCm:develop Aug 3, 2024
10 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Failing tests with libstdc++ assertions due to unsigned overflow