
Additional unit testing for run-to-run bitwise reproducibility #435

Merged
umfranzw merged 1 commit into ROCm:develop from test_bitwise_repro_rebased on Aug 9, 2024

Conversation

umfranzw (Collaborator)

This change requires rocPRIM PR #575.
I've rebased it on top of rocThrust PR #431.
You can check out commit db784c4 for a more specific diff.

Currently, we have tests that verify that the deterministic algorithms
for scan and reduce (available via the det and par_det_nosync
execution policies) produce the same results when run twice in a row
within the same test executable.

This change adds new bitwise reproducibility testing that is performed
across multiple test executable runs.

To do this, the test executable must be run twice. On the first run,
we hash all the inputs and outputs to/from the deterministic function
calls, and store those hashes in an SQLite database. We also insert
information about the ROCm version, rocThrust version, and GPU architecture
into the database, since the results the deterministic algorithms produce
are allowed to vary if any of those factors change.

On the second run, we hash the inputs/outputs again and check to see
if a row matching them (plus the ROCm version, rocThrust version, etc.) exists
in the database.
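As a rough illustration of this two-run flow, here is a minimal sketch in Python. The table and column names are hypothetical (the real implementation is C++ and lives in tests/bitwise_repro/); the point is the insert-on-first-run, lookup-on-second-run logic:

```python
import sqlite3
import zlib

def record_or_check(db_path, rocm_ver, rocthrust_ver, gpu_arch,
                    func_name, input_bytes, output_bytes, generate):
    """Hypothetical sketch: insert hashes on the generate run,
    look them up on subsequent runs."""
    con = sqlite3.connect(db_path)
    con.execute("""CREATE TABLE IF NOT EXISTS repro (
        rocm_ver TEXT, rocthrust_ver TEXT, gpu_arch TEXT,
        func_name TEXT, input_hash INTEGER, output_hash INTEGER)""")
    # Hash the raw input/output bytes of the deterministic call.
    row = (rocm_ver, rocthrust_ver, gpu_arch, func_name,
           zlib.crc32(input_bytes), zlib.crc32(output_bytes))
    if generate:
        # First run: record the hashes along with the environment info.
        con.execute("INSERT INTO repro VALUES (?, ?, ?, ?, ?, ?)", row)
        con.commit()
        return True
    # Second run: a matching row must already exist.
    cur = con.execute(
        "SELECT 1 FROM repro WHERE rocm_ver=? AND rocthrust_ver=? "
        "AND gpu_arch=? AND func_name=? AND input_hash=? AND output_hash=?",
        row)
    return cur.fetchone() is not None
```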

This change:

  • Adds SQLite as a dependency in the CMake files. SQLite code is public domain.
  • Adds a library called CRCpp as a dependency in the CMake files. This library
    provides a checksum algorithm that we can use for hashing. It uses a BSD 3-clause
    license.
  • Adds the database-based testing to the existing reproducibility tests in
    test/test_reproducibility.cpp
  • Adds classes for database creation/manipulation and building hashes in
    tests/bitwise_repro/

Because the run-to-run testing accesses a disk-based database file, it's pretty
slow, so I've disabled it by default. To turn it on, you must define an
environment variable called ROCTHRUST_BWR_PATH and set it to the path to
the SQLite database file to use for the testing.

If the database does not exist, to generate it, you must also define an
environment variable called ROCTHRUST_BWR_GENERATE and set it to 1.
This creates the database file given by ROCTHRUST_BWR_PATH and inserts
data into it as the run-to-run reproducibility tests are executed (the tests
will all pass in this case).

If you want to take an existing database and update it for a new
ROCm version/rocThrust version/GPU architecture, you can set
ROCTHRUST_BWR_GENERATE=1 and point ROCTHRUST_BWR_PATH to
the existing database file.
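For illustration, interpreting these two environment variables might look like the following sketch (the function name and return shape are mine, not from the PR; only the variable names ROCTHRUST_BWR_PATH and ROCTHRUST_BWR_GENERATE come from the description above):

```python
import os

def bwr_config():
    """Read the ROCTHRUST_BWR_* variables described above."""
    path = os.environ.get("ROCTHRUST_BWR_PATH")
    if path is None:
        return None  # run-to-run testing is disabled by default
    # Generate mode creates/updates the database instead of checking it.
    generate = os.environ.get("ROCTHRUST_BWR_GENERATE") == "1"
    return {"db_path": path, "generate": generate}
```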

@umfranzw umfranzw requested a review from NguyenNhuDi July 22, 2024 20:22
@Snektron Snektron assigned Snektron and unassigned Snektron Jul 23, 2024
@Snektron Snektron self-requested a review July 23, 2024 19:42
@Snektron Snektron left a comment (Collaborator)

This seems way too over-engineered to me.

  • Why do we need to store the hashes for all ROCm versions if the hash is allowed to change between versions? Can this not be a simple file that's committed to the repository and regenerated when the implementation is changed? This way any changes can be reviewed and we don't need this over-engineered database.
  • Why do we need an entire dependency for computing a simple CRC? A custom CRC32 implementation is like 20 lines: https://rosettacode.org/wiki/CRC-32

@umfranzw

umfranzw commented Jul 25, 2024

Thanks for the feedback @Snektron.

  • Why do we need to store the hashes for all ROCm versions if the hash is allowed to change between versions? Can this not be a simple file that's committed to the repository and regenerated when the implementation is changed? This way any changes can be reviewed and we don't need this over engineered database.

As I understand it, the requirement is that the deterministic algorithms must produce the same result for every unique combination of (ROCm version, rocThrust version, GPU architecture). So suppose that we went with the approach of storing the hashes in a simple file that's committed to the repository, and updated that file on every rocThrust version change. If I'm understanding the problem and your suggestion correctly, then I think that the file would need to contain result hashes for all possible versions of ROCm that the current version of rocThrust can be run on, for all possible GPU architectures that are supported.

This is basically the same as what the SQLite database file is currently. Storing the info in a database saves the trouble of writing code to locate the hash for a given call signature in the file and compare it with the current one (though I guess we could use separate simple files for each combination, name them in a way that makes lookup easy, and store the hashes in a consistent order).

Feel free to let me know if I'm misinterpreting your suggestion - I'm not particularly attached to the database approach, and I'm totally fine with switching to something else if it's simpler.

  • Why do we need an entire dependency for computing a simple crc? a custom crc32 implementation is like 20 lines https://rosettacode.org/wiki/CRC-32

That's fair. I actually tried to write a version myself initially, but it was too slow without the precomputed table optimization that's used in the link you posted. I'll have to check if we have the license permission to use the Rosetta Code version directly - if not, I'll educate myself on that optimization and see if I can roll our own implementation.
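For reference, the table-driven variant being discussed is short. Here is a sketch using the reflected polynomial 0xEDB88320 with the same parameters zlib uses (this is an illustration of the optimization, not the implementation that eventually landed in the PR):

```python
# Precompute the 256-entry lookup table once; this is the optimization
# that makes the byte-at-a-time loop fast.
_CRC_TABLE = []
for _byte in range(256):
    _crc = _byte
    for _ in range(8):
        _crc = (_crc >> 1) ^ 0xEDB88320 if _crc & 1 else _crc >> 1
    _CRC_TABLE.append(_crc)

def crc32(data: bytes) -> int:
    """CRC-32 (zlib-compatible), one table lookup per input byte."""
    crc = 0xFFFFFFFF
    for b in data:
        crc = (crc >> 8) ^ _CRC_TABLE[(crc ^ b) & 0xFF]
    return crc ^ 0xFFFFFFFF
```

The standard check value for this CRC variant is crc32(b"123456789") == 0xCBF43926, which makes it easy to unit-test against a known answer.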

@Snektron

As I understand it, the requirement is that the deterministic algorithms must produce the same result for every unique combination of (ROCm version, rocThrust version, GPU architecture). So suppose that we went with the approach of storing the hashes in a simple file that's committed to the repository, and updated that file on every rocThrust version change. If I'm understanding the problem and your suggestion correctly, then I think that the file would need to contain result hashes for all possible versions of ROCm that the current version of rocThrust can be run on, for all possible GPU architectures that are supported.

The idea was that because the file with the hashes is committed into the repository, it is automatically tied to a particular version of rocThrust. By extension, a particular version of ROCm contains a particular version of rocThrust. Similarly, a particular version of rocThrust is tied to a particular version of ROCm, and by extension a particular version of clang that it contains. In theory it's all one big package: when, for example, clang is updated and reorders operations (it's allowed to do this when -ffast-math is enabled), that change only makes it over to the customer when a new version of ROCm is released, at which point we have already updated the hashes: they would always be those for the latest version of the compiler, for the latest version of ROCm, and for the latest versions of rocThrust and rocPRIM. I guess that does depend on a relatively tight integration loop between releases of the compiler and updates to rocPRIM and rocThrust.

Are you expecting that people mix and match ROCm/ROCm clang and rocThrust versions?

@umfranzw

umfranzw commented Jul 26, 2024

I guess I had assumed that people would be able to mix and match ROCm and rocThrust versions - for example, if they install rocThrust or ROCm via the GitHub repos, they are free to check out whatever release tags they want, provided they can build and run together. I'm not sure how likely this is for the customers who've requested the bitwise reproducibility testing, but I thought it might be nice to cover the case.

Aside from the mix-and-match issue, there are a couple other less-significant factors I just wanted to mention as well - feel free to let me know what you think about them. Both rocFFT and rocSPARSE have implemented similar bitwise reproducibility testing in the past using databases, so I checked in with some of the devs on those projects. The reasons they gave for choosing to use a database mainly included:

  • The ability to query and compare results without having to do any text parsing.
  • The database stores a history that allows you to see how environmental changes affect reproducibility over time.
  • The bitwise repro tests are pretty slow. If a CI test machine has multiple GPUs installed, then multiple tests can be run in parallel, and they can read/write to the same database file simultaneously.
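On the last point, one way SQLite supports several processes sharing a database file is write-ahead-log (WAL) journaling mode, which lets readers proceed while another process writes. The PR doesn't say which journaling mode is used, so this is only an illustrative sketch:

```python
import sqlite3

def open_shared(db_path):
    """Open a database file for use by multiple concurrent test processes.

    Assumption: WAL mode plus a lock timeout is sufficient; the actual
    rocThrust tests may configure SQLite differently."""
    con = sqlite3.connect(db_path, timeout=30)  # wait up to 30 s on locks
    con.execute("PRAGMA journal_mode=WAL")
    return con
```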

@Snektron

Aside from the mix-and-match issue, there are a couple other less-significant factors I just wanted to mention as well - feel free to let me know what you think about them. Both rocFFT and rocSPARSE have implemented similar bitwise reproducibility testing in the past using databases, so I checked in with some of the devs on those projects. The reasons they gave for choosing to use a database mainly included:

If there is prior art on this topic I'm not as much against it.

* The ability to query and compare results without having to do any text parsing.
* The database stores a history that allows you to see how environmental changes affect reproducibility over time.

This is true, and you also don't need to run a bunch of git commands. I guess the downside is that one needs additional CI tooling for this to be visible. Out of curiosity, do rocFFT and rocSPARSE have tooling available for this?

@umfranzw

This is true, and you also don't need to run a bunch of git commands. I guess the downside is that one needs additional CI tooling for this to be visible. Out of curiosity, do rocFFT and rocSPARSE have tooling available for this?

I don't believe there are currently any existing CI tools that have been built specifically for looking at the history of bitwise reproducibility - I got the sense that the devs' idea was that, if such a history needed to be produced, it could be done pretty easily with some SQL queries (or a Python script that runs them).
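As a sketch of the kind of query meant here, assuming a hypothetical table keyed on version and architecture with per-call hashes (the real schema may differ):

```python
import sqlite3

def repro_history(db_path, func_name):
    """List every (ROCm, rocThrust, architecture) combination for which
    a hash has been recorded for one algorithm, oldest ROCm first."""
    con = sqlite3.connect(db_path)
    return con.execute(
        "SELECT rocm_ver, rocthrust_ver, gpu_arch, output_hash "
        "FROM repro WHERE func_name = ? ORDER BY rocm_ver",
        (func_name,)).fetchall()
```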

@umfranzw umfranzw force-pushed the test_bitwise_repro_rebased branch 2 times, most recently from 0e5d030 to 99d1b99 Compare July 29, 2024 20:30
@umfranzw

I've removed the CRCpp library and replaced it with a custom implementation.

@umfranzw umfranzw force-pushed the test_bitwise_repro_rebased branch from 99d1b99 to fa03072 Compare August 2, 2024 17:42
@umfranzw

umfranzw commented Aug 2, 2024

Added documentation to describe how to invoke the new tests.

@umfranzw umfranzw force-pushed the test_bitwise_repro_rebased branch from fa03072 to 4fc0c44 Compare August 6, 2024 15:54
@umfranzw

umfranzw commented Aug 6, 2024

Rebased to resolve conflicts and refactored a bit to remove compiler warnings when building other tests.

@stanleytsang-amd stanleytsang-amd left a comment (Collaborator)

CI failures seem unrelated to the PR.

@umfranzw umfranzw dismissed Snektron’s stale review August 9, 2024 20:33

Hi @Snektron, apologies for not requesting a re-review earlier. I need to merge this change today, and I'll need to dismiss this review in order to do that. If you have any further concerns about this change, please don't hesitate to reach out and we can chat.

@umfranzw umfranzw merged commit 5883c6a into ROCm:develop Aug 9, 2024
8 of 17 checks passed