
TestCUDASimulation.AllDeviceIdValues causes test failures on multi-gpu systems #395

Closed
Robadob opened this issue Oct 16, 2020 · 5 comments · Fixed by #418

Robadob (Member) commented Oct 16, 2020

Tests haven't been run recently on Mavericks. When running with NO_SEATBELTS in either configuration, 26 tests fail. This is unusual, given that all tests pass for both @ptheywood and me on our local machines. This was discovered on PR #371; however, testing has shown the same applies to the master branch.

What the failing tests seem to have in common is that they all make use of environment properties, but I haven't dug into it further than that.

Full list of failing tests:

[  FAILED  ] 26 tests, listed below:
[  FAILED  ] AgentEnvironmentTest.Get_float
[  FAILED  ] AgentEnvironmentTest.Get_double
[  FAILED  ] AgentEnvironmentTest.Get_int8_t
[  FAILED  ] AgentEnvironmentTest.Get_uint8_t
[  FAILED  ] AgentEnvironmentTest.Get_int16_t
[  FAILED  ] AgentEnvironmentTest.Get_uint16_t
[  FAILED  ] AgentEnvironmentTest.Get_int32_t
[  FAILED  ] AgentEnvironmentTest.Get_uint32_t
[  FAILED  ] AgentEnvironmentTest.Get_int64_t
[  FAILED  ] AgentEnvironmentTest.Get_uint64_t
[  FAILED  ] AgentEnvironmentTest.Get_arrayElement_float
[  FAILED  ] AgentEnvironmentTest.Get_arrayElement_double
[  FAILED  ] AgentEnvironmentTest.Get_arrayElement_int8_t
[  FAILED  ] AgentEnvironmentTest.Get_arrayElement_uint8_t
[  FAILED  ] AgentEnvironmentTest.Get_arrayElement_int16_t
[  FAILED  ] AgentEnvironmentTest.Get_arrayElement_uint16_t
[  FAILED  ] AgentEnvironmentTest.Get_arrayElement_uint32_t
[  FAILED  ] AgentEnvironmentTest.Get_arrayElement_int32_t
[  FAILED  ] AgentEnvironmentTest.Get_arrayElement_uint64_t
[  FAILED  ] AgentEnvironmentTest.Get_arrayElement_int64_t
[  FAILED  ] DeviceAPITest.getStepCounter
[  FAILED  ] DeviceAPITest.getStepCounterFunctionCondition
[  FAILED  ] SubEnvironmentManagerTest.SubDeviceAPIGetDefault
[  FAILED  ] SubEnvironmentManagerTest.SubDeviceAPIGetMasterChange
[  FAILED  ] SubEnvironmentManagerTest.SubSubDeviceAPIGetDefault
[  FAILED  ] SubEnvironmentManagerTest.SubSubDeviceAPIGetMasterChange

Robadob (Member, Author) commented Oct 18, 2020

It seems to be linked to having multiple devices available.
With CUDA_VISIBLE_DEVICES=0 ./tests or CUDA_VISIBLE_DEVICES=1 ./tests (and the same for 2 and 3), all tests pass.
With CUDA_VISIBLE_DEVICES=0,1 ./tests, the 26 tests fail.

Robadob (Member, Author) commented Oct 19, 2020

If the test TestCUDASimulation.AllDeviceIdValues is disabled, all tests pass. This test does not do anything on single-device systems, hence the inconsistency between machines.

The test runs an empty model on each device, then returns to device 0 via cudaSetDevice. The cause is not immediately clear; it could potentially be fixed by making CUDASimulation perform the device change, or by adding a cudaDeviceReset().
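For reference, the failing pattern is roughly the following (a minimal sketch using the raw CUDA runtime API and googletest; run_empty_model() is a hypothetical stand-in for constructing and stepping an empty CUDASimulation on the selected device):

    #include <cuda_runtime.h>
    #include <gtest/gtest.h>

    // Hypothetical helper: build and step an empty model on the given device.
    void run_empty_model(int device_id);

    TEST(TestCUDASimulation, AllDeviceIdValues) {
        int device_count = 0;
        ASSERT_EQ(cudaSuccess, cudaGetDeviceCount(&device_count));
        // Run an empty model on every visible device in turn.
        for (int d = 0; d < device_count; ++d) {
            run_empty_model(d);
        }
        // Manually switch back to device 0 for the tests that follow; on
        // multi-gpu systems this is where later environment-property tests
        // start failing.
        ASSERT_EQ(cudaSuccess, cudaSetDevice(0));
    }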

Robadob changed the title from "Tests suite failing on Mavericks" to "TestCUDASimulation.AllDeviceIdValues causes test failures on multi-gpu systems" on Oct 19, 2020
ptheywood (Member) commented

Running applyConfig again with device 0 seems a lot more sensible than the manual setDevice(0) I used when I originally wrote the test, now that F2 has the better destruction logic.
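In practice that would look something like the following (a sketch only; 'model' stands for the same empty model description the test already builds, and the CUDAConfig().device_id / applyConfig() usage is an assumption based on the discussion here, not a confirmed snippet of the test):

    // Sketch: rather than a bare cudaSetDevice(0) at the end of the test,
    // construct another simulation pinned to device 0 and let applyConfig()
    // perform the device selection through CUDASimulation's own logic.
    CUDASimulation sim(model);       // 'model' = the existing empty model description
    sim.CUDAConfig().device_id = 0;  // assumed config field name
    sim.applyConfig();               // switches back to device 0 internally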

ptheywood (Member) commented Oct 19, 2020

Adding a cudaDeviceReset to the test after setting the device back to 0 (either explicitly or via another CUDASimulation) allows the test to pass on multi-gpu systems.
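In test terms the workaround is roughly this, appended after the loop over devices (sketch only; the reset discards the device-side state that later tests would otherwise see as stale):

    // After exercising every device, switch back to device 0 and wipe its
    // context so subsequent tests re-initialise device-side state from scratch.
    ASSERT_EQ(cudaSuccess, cudaSetDevice(0));
    ASSERT_EQ(cudaSuccess, cudaDeviceReset());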

This is not a true fix, however; F2 in general needs to be a bit more device aware (singletons, init, device memory in general).

We shouldn't support anyone calling cudaSetDevice/cudaDeviceReset within host functions; subsequent failures would be on them. Once (if) we go multi-gpu it will become more relevant, but that's not an urgent concern.

Robadob (Member, Author) commented Oct 19, 2020

Unconfirmed, but it appears that the singletons/device-reset detection are getting confused by device switching:

  • Run some models on Dev0, occupying the singletons
  • Switch to Dev1
  • CUDASim detects what it thinks is a device reset
  • Acting on that apparent device reset, it purges the host singletons
  • Switch back to Dev0
  • Host singletons were reset, so they no longer match the device singletons

The fix would be to add some level of multi-device awareness to the singletons/CUDASim.
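One way to get that awareness (a sketch only, not the actual change that landed in #418; DeviceSingleton is a hypothetical stand-in for Curve/EnvironmentManager) is to key the singleton instance on the currently active device and guard creation with a mutex:

    #include <cuda_runtime.h>
    #include <map>
    #include <mutex>

    // Hypothetical per-device singleton: one instance per CUDA device, so a
    // device switch no longer clobbers state belonging to another device.
    class DeviceSingleton {
     public:
        static DeviceSingleton& getInstance() {
            int device = 0;
            cudaGetDevice(&device);              // follow the currently active device
            static std::mutex mtx;
            static std::map<int, DeviceSingleton> instances;
            std::lock_guard<std::mutex> lock(mtx);
            return instances[device];            // lazily created per device
        }
     private:
        // ... per-device state (curve registrations, environment properties, ...)
    };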

Robadob mentioned this issue on Nov 6, 2020
Robadob added a commit that referenced this issue Nov 18, 2020
Make Curve singleton thread-safe and unique per-device
Make EnvironmentManager singleton thread-safe and unique per-device
Add mutex locks around agent function execution, to prevent environment data being changed by defragment() before or during execution
Add nvcc arg --default-stream per-thread to CMake (and then comment it out, as it would require far more testing)
Fix double inclusion of common.cmake if configuring with an example as root
Modify initialisation of CUDASimulation so that EnvironmentManager is not used before the device has been selected
Add tests for multi-threading and multi-device CUDASimulation execution
Add a reduced set of multi-thread and multi-device tests for RTC
Fix a bug where rtc_offsets were reset to 0 when EnvironmentManager::defragment() was called
Required for #245
Closes #395
mondus pushed a commit that referenced this issue Nov 18, 2020