
Add multi-GPU unit test environment #3741
Merged: 5 commits into dmlc:master on Sep 29, 2018

Conversation

hcho3 (Collaborator) commented Sep 28, 2018:

  • Add a new group of agents to Jenkins CI to run multi-GPU workloads. The agents will use a p2.8xlarge instance.
  • Add a simple test for gpu_hist (a rough sketch of what such a test might look like follows this list).
  • Add C++ tests for multi-GPU. Any test prefixed with MGPU_ will run on the multi-GPU instance.
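
For illustration, here is a rough Python sketch of the kind of multi-GPU gpu_hist check this adds. The actual test lives in tests/python-mgpu/test_gpu_updaters.py; the parameters, the n_gpus option, and the tolerance below are assumptions made for illustration, not the exact code.

import numpy as np
import xgboost as xgb

def train_and_eval(param, dtrain, num_rounds=10):
    """Train for a few rounds and return the per-round train RMSE."""
    res = {}
    xgb.train(param, dtrain, num_rounds,
              evals=[(dtrain, 'train')], evals_result=res)
    return res['train']['rmse']

def test_gpu_hist():
    rng = np.random.RandomState(1994)
    dtrain = xgb.DMatrix(rng.randn(1000, 10), label=rng.randn(1000))

    cpu_param = {'tree_method': 'hist', 'max_depth': 3}
    # 'n_gpus': -1 requests all visible devices in the 0.8x-era API (assumption).
    gpu_param = {'tree_method': 'gpu_hist', 'max_depth': 3, 'n_gpus': -1}

    cpu_res = train_and_eval(cpu_param, dtrain)
    gpu_res = train_and_eval(gpu_param, dtrain)

    # CPU and multi-GPU results should agree within a loose tolerance.
    np.testing.assert_allclose(cpu_res, gpu_res, rtol=1e-2)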

hcho3 (Collaborator, Author) commented Sep 29, 2018:

@RAMitchell @trivialfis The test tests/python-mgpu/test_gpu_updaters.py has failed due to a difference between the CPU and multi-GPU results:

======================================================================
FAIL: test_gpu_hist (test_gpu_updaters.TestGPU)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/workspace/tests/python-mgpu/test_gpu_updaters.py", line 31, in test_gpu_hist
    assert_gpu_results(cpu_results, gpu_results)
  File "/workspace/tests/python-mgpu/test_gpu_updaters.py", line 15, in assert_gpu_results
    .format(gpu_res["eval"][-1], cpu_res["eval"][-1])
AssertionError: GPU result don' match with CPU result (GPU: 42.155769, CPU: 41.519295)
-------------------- >> begin captured stdout << ---------------------

Should we be concerned about this error?

trivialfis (Member) commented:
@hcho3 I will look into gpu_hist after #3643. For now I don't have much confidence in the multi-GPU setup.

hcho3 (Collaborator, Author) commented Sep 29, 2018:

@trivialfis Would you like to add a multi-GPU test to #3643? If so, I can disable the failing test for now and merge this PR to master.

trivialfis (Member) commented:
@hcho3 I will add a C++ test for that.

RAMitchell (Member) commented:
@hcho3 This is amazing, thanks. My view is that we should disable any failing tests, merge this, and then work to re-enable the tests.

hcho3 (Collaborator, Author) commented Sep 29, 2018:

@RAMitchell Got it. Let me go ahead and disable the failing test.
@trivialfis For the C++ tests, do you think we should run the same suite of tests on the multi-GPU machine or a different suite?

hcho3 (Collaborator, Author) commented Sep 29, 2018:

@trivialfis One idea I have is to prefix the names of Google Test cases with MGPU_ to indicate that they should run on the multi-GPU machine. Something like:

TEST(FooBarTest, MGPU_RunBar) { 
    // Use device ordinals 0, 1, 2, ...
}

hcho3 (Collaborator, Author) commented Sep 29, 2018:

Actually, I kind of like the idea of keeping all tests together. For Python tests, we can add an mgpu attribute to every test that should run with multiple GPUs; a minimal sketch of how that tagging might look is below.
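
A minimal sketch, assuming the Python suite is run with nose and the mgpu tag is applied via nose's attrib plugin; the attribute name matches the convention discussed here, while the test body, parameters, and the n_gpus option are illustrative assumptions:

from nose.plugins.attrib import attr
import numpy as np
import xgboost as xgb

@attr('mgpu')  # select on the multi-GPU agent with: nosetests -a mgpu
               # exclude on single-GPU agents with:  nosetests -a '!mgpu'
def test_gpu_hist_multi_gpu_smoke():
    rng = np.random.RandomState(0)
    dtrain = xgb.DMatrix(rng.randn(500, 5), label=rng.randn(500))
    # 'n_gpus': -1 requests all visible GPUs (assumption about the era's API).
    param = {'tree_method': 'gpu_hist', 'n_gpus': -1, 'max_depth': 2}
    bst = xgb.train(param, dtrain, num_boost_round=5)
    assert bst.predict(dtrain).shape[0] == 500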

trivialfis (Member) commented:
Currently I put the multi-GPU test code inside (at the end of) the normal CUDA test files, wrapped with:

#if defined(XGBOOST_USE_NCCL)
TEST(Foo, MultiGpuBar) {
  // do something.
}
#endif

But there are other options too; I would love to hear other opinions.

hcho3 (Collaborator, Author) commented Sep 29, 2018:

@trivialfis I don't think that would work, because one of the test configurations compiles with NCCL (hence XGBOOST_USE_NCCL is defined) but runs on a single GPU.

I'm open to suggestions, but for now I am using the MGPU_ prefix (for C++) and the mgpu attribute (for Python) to indicate that a given test should use more than one GPU. Why don't you look over this pull request and comment on the aesthetics?

trivialfis (Member) commented:
@hcho3 I agree with keeping all tests together. But I don't think running multi-GPU tests on a single GPU would introduce problems, since any multi-GPU code should also run on a single GPU. If such difficulties do arise, we can pass the --gtest_filter="-*.MGPU*" flag to the test executable.

Another option for not splitting up the tests would be adding a definition like #define TEST_WITH_MULTI_GPU, but to me that feels a little redundant.

I don't have much experience with this, and it's still at an early stage.

hcho3 (Collaborator, Author) commented Sep 29, 2018:

@trivialfis It is certainly true that one should be able to run multi-GPU code on a single-GPU machine. However, clearly indicating the target machine of each test is still a good idea. Personally, I prefer the MGPU_ prefix over the XGBOOST_USE_NCCL macro, since the former is more concise and communicates the intent more clearly.

My rationale for separating single- and multi-GPU tests is that provisioning a large instance like p2.8xlarge takes significantly longer than provisioning a p2.xlarge. So to reduce the test backlog, we want to run on the multi-GPU instance only those tests that actually require more than one GPU.

But at any rate, this is an early attempt. We can always improve it later as needs arise.

hcho3 merged commit b50bc2c into dmlc:master on Sep 29, 2018
hcho3 deleted the add_mgpu_test branch on Sep 29, 2018
alois-bissuel pushed a commit to criteo-forks/xgboost that referenced this pull request Dec 4, 2018
* Add multi-GPU unit test environment

* Better assertion message

* Temporarily disable failing test

* Distinguish between multi-GPU and single-GPU CPP tests

* Consolidate Python tests. Use attributes to distinguish multi-GPU Python tests from single-GPU counterparts