
Add multi-GPU unit test environment #3741
Merged: 5 commits into dmlc:master on Sep 29, 2018

Conversation

hcho3 (Collaborator) commented Sep 28, 2018:

  • Add a new group of agents to Jenkins CI to run multi-GPU workloads. The agents will use a p2.8xlarge instance.
  • Add a simple test for gpu_hist (a rough sketch of what such a test might look like follows this list).
  • Add C++ tests for multi-GPU. Any test prefixed with MGPU_ will run on the multi-GPU instance.
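
For illustration, here is a rough Python sketch of the kind of multi-GPU gpu_hist check this adds. The actual test lives in tests/python-mgpu/test_gpu_updaters.py; the parameters, the n_gpus option, and the tolerance below are assumptions made for illustration, not the exact code.

import numpy as np
import xgboost as xgb

def train_and_eval(param, dtrain, num_rounds=10):
    """Train for a few rounds and return the per-round train RMSE."""
    res = {}
    xgb.train(param, dtrain, num_rounds,
              evals=[(dtrain, 'train')], evals_result=res)
    return res['train']['rmse']

def test_gpu_hist():
    rng = np.random.RandomState(1994)
    dtrain = xgb.DMatrix(rng.randn(1000, 10), label=rng.randn(1000))

    cpu_param = {'tree_method': 'hist', 'max_depth': 3}
    # 'n_gpus': -1 requests all visible devices in the 0.8x-era API (assumption).
    gpu_param = {'tree_method': 'gpu_hist', 'max_depth': 3, 'n_gpus': -1}

    cpu_res = train_and_eval(cpu_param, dtrain)
    gpu_res = train_and_eval(gpu_param, dtrain)

    # CPU and multi-GPU results should agree within a loose tolerance.
    np.testing.assert_allclose(cpu_res, gpu_res, rtol=1e-2)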

hcho3 (Collaborator, Author) commented Sep 29, 2018:

@RAMitchell @trivialfis The test tests/python-mgpu/test_gpu_updaters.py has failed due to a difference between the CPU and multi-GPU results:

======================================================================
FAIL: test_gpu_hist (test_gpu_updaters.TestGPU)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/workspace/tests/python-mgpu/test_gpu_updaters.py", line 31, in test_gpu_hist
    assert_gpu_results(cpu_results, gpu_results)
  File "/workspace/tests/python-mgpu/test_gpu_updaters.py", line 15, in assert_gpu_results
    .format(gpu_res["eval"][-1], cpu_res["eval"][-1])
AssertionError: GPU result don' match with CPU result (GPU: 42.155769, CPU: 41.519295)
-------------------- >> begin captured stdout << ---------------------

Should we be concerned about this error?

trivialfis (Member) commented:
@hcho3 I will look into gpu_hist after #3643. For now I don't have much confidence in the multi-GPU setup.

hcho3 (Collaborator, Author) commented Sep 29, 2018:

@trivialfis Would you like to add a multi-GPU test to #3643? If so, I can disable the failing test for now and merge this PR to master.

trivialfis (Member) commented:
@hcho3 I will add a C++ test for that.

RAMitchell (Member) commented:
@hcho3 This is amazing, thanks. My view is that we should disable any failing tests, merge this, and then work to re-enable the tests.

hcho3 (Collaborator, Author) commented Sep 29, 2018:

@RAMitchell Got it. Let me go ahead and disable the failing test.
@trivialfis For the C++ tests, do you think we should run the same suite of tests on the multi-GPU machine or a different suite?

hcho3 (Collaborator, Author) commented Sep 29, 2018:

@trivialfis One idea I have is to prefix the names of Google Test cases with MGPU_ to indicate that they should run on the multi-GPU machine. Something like:

TEST(FooBarTest, MGPU_RunBar) { 
    // Use device ordinals 0, 1, 2, ...
}

hcho3 (Collaborator, Author) commented Sep 29, 2018:

Actually, I kind of like the idea of keeping all tests together. For Python tests, we can add an mgpu attribute to every test that should run with multiple GPUs; a minimal sketch of how that tagging might look is below.
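
A minimal sketch, assuming the Python suite is run with nose and the mgpu tag is applied via nose's attrib plugin; the attribute name matches the convention discussed here, while the test body, parameters, and the n_gpus option are illustrative assumptions:

from nose.plugins.attrib import attr
import numpy as np
import xgboost as xgb

@attr('mgpu')  # select on the multi-GPU agent with: nosetests -a mgpu
               # exclude on single-GPU agents with:  nosetests -a '!mgpu'
def test_gpu_hist_multi_gpu_smoke():
    rng = np.random.RandomState(0)
    dtrain = xgb.DMatrix(rng.randn(500, 5), label=rng.randn(500))
    # 'n_gpus': -1 requests all visible GPUs (assumption about the era's API).
    param = {'tree_method': 'gpu_hist', 'n_gpus': -1, 'max_depth': 2}
    bst = xgb.train(param, dtrain, num_boost_round=5)
    assert bst.predict(dtrain).shape[0] == 500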

trivialfis (Member) commented:
Currently I put the multi-GPU test code inside (at the end of) the normal CUDA test files, wrapped with:

#if defined(XGBOOST_USE_NCCL)
TEST(Foo, MultiGpuBar) {
  // do something.
}
#endif

But there are other options too; I would love to hear other opinions.

hcho3 (Collaborator, Author) commented Sep 29, 2018:

@trivialfis I don't think that would work, because one of the test configurations compiles with NCCL (hence XGBOOST_USE_NCCL is defined) but runs on a single GPU.

I'm open to suggestions, but for now I am using the MGPU_ prefix (for C++) and the mgpu attribute (for Python) to indicate that a given test should use more than one GPU. Why don't you look over this pull request and comment on the aesthetics?

trivialfis (Member) commented:
@hcho3 I agree with keeping all tests together. But I don't think running multi-GPU tests on a single GPU would introduce problems, since any multi-GPU code should also run on a single GPU. If such difficulties do arise, we can pass the --gtest_filter="-*.MGPU*" flag to the test executable.

Another option for not splitting up the tests would be adding a definition like #define TEST_WITH_MULTI_GPU, but to me that feels a little redundant.

I don't have much experience with this, and it's still at an early stage.

hcho3 (Collaborator, Author) commented Sep 29, 2018:

@trivialfis It is certainly true that one should be able to run multi-GPU code on a single-GPU machine. However, clearly indicating the target machine of each test is still a good idea. Personally, I prefer the MGPU_ prefix over the XGBOOST_USE_NCCL macro, since the former is more concise and communicates the intent more clearly.

My rationale for separating single- and multi-GPU tests is that provisioning a large instance like p2.8xlarge takes significantly longer than provisioning a p2.xlarge. So to reduce the test backlog, we want to run on the multi-GPU instance only those tests that actually require more than one GPU.

But at any rate, this is an early attempt. We can always improve it later as needs arise.

hcho3 merged commit b50bc2c into dmlc:master on Sep 29, 2018
hcho3 deleted the add_mgpu_test branch on Sep 29, 2018
alois-bissuel pushed a commit to criteo-forks/xgboost that referenced this pull request Dec 4, 2018
* Add multi-GPU unit test environment

* Better assertion message

* Temporarily disable failing test

* Distinguish between multi-GPU and single-GPU CPP tests

* Consolidate Python tests. Use attributes to distinguish multi-GPU Python tests from single-GPU counterparts