Can I turn on the “per thread stream” about libcudf ? [QST] #5596

Closed
chenrui17 opened this issue Jun 28, 2020 · 4 comments
Labels: CMake, libcudf, question

Comments

@chenrui17
Contributor

chenrui17 commented Jun 28, 2020

I found that the default cuDF build uses the legacy default stream, so I rebuilt libcudf as follows to support multi-stream operation on the GPU:
cmake .. -DCMAKE_INSTALL_PREFIX=/opt/rapids -DCMAKE_CXX11_ABI=ON -DPER_THREAD_DEFAULT_STREAM=ON
I have three questions about streams:

  1. With per-thread default stream enabled, read_parquet performs better, with a speedup of up to 1.2x on average. Can I simply switch the build option from the default stream to the per-thread default stream so that each CPU thread gets its own stream? Is this guaranteed to work and to improve performance?
  2. With per-thread default stream enabled, the performance of "read_parquet + groupby_aggregate" is not as expected: it gets worse compared to read_parquet only. The profile also still shows 0.7% of activity on the default stream, and I don't know why. The Nsight profile is:
    image
  3. With per-thread default stream enabled, kernel overlap is not bad, but the overlap of HostToDevice and DeviceToHost copies is not as expected, like this:
    image
    Please give me some advice.
chenrui17 added the Needs Triage and question labels Jun 28, 2020
@chenrui17
Contributor Author

These are the qdrep files for per-thread-stream and legacy-default-stream.
My CPU thread count is 12 and my GPU thread count is 3, and I use a semaphore to control which CPU threads acquire the device.

nsight-profile.zip
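
For readers following along, the semaphore gating described in this comment (12 CPU threads, at most 3 holding the GPU at a time) could look roughly like the sketch below. This is only an illustration under stated assumptions, not the code used to produce the profiles: `GpuSemaphore` and `run_tasks` are hypothetical names, and the GPU work is replaced by a short sleep.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical counting semaphore built from a mutex and a condition
// variable (C++20 code could use std::counting_semaphore instead).
class GpuSemaphore {
 public:
  explicit GpuSemaphore(int permits) : permits_(permits) {
    assert(permits > 0);
  }

  // Blocks until a permit is free, then takes it.
  void acquire() {
    std::unique_lock<std::mutex> lock(mutex_);
    cv_.wait(lock, [this] { return permits_ > 0; });
    --permits_;
  }

  // Returns a permit and wakes one waiting thread.
  void release() {
    {
      std::lock_guard<std::mutex> lock(mutex_);
      ++permits_;
    }
    cv_.notify_one();
  }

 private:
  std::mutex mutex_;
  std::condition_variable cv_;
  int permits_;
};

// Starts `cpu_threads` workers, of which at most `gpu_slots` may be inside
// the gated region at once. Returns the peak number of workers observed
// inside the region.
int run_tasks(int cpu_threads, int gpu_slots) {
  GpuSemaphore sem(gpu_slots);
  std::atomic<int> active{0};
  std::atomic<int> peak{0};
  std::vector<std::thread> workers;

  for (int i = 0; i < cpu_threads; ++i) {
    workers.emplace_back([&] {
      sem.acquire();
      int now = ++active;
      // Record the high-water mark of concurrent holders.
      int seen = peak.load();
      while (now > seen && !peak.compare_exchange_weak(seen, now)) {
      }
      // Placeholder for the real GPU work issued by this CPU thread
      // (assumption: with PTDS it would go to this thread's own stream).
      std::this_thread::sleep_for(std::chrono::milliseconds(5));
      --active;
      sem.release();
    });
  }
  for (auto& t : workers) t.join();
  return peak.load();
}
```

Calling `run_tasks(12, 3)` mirrors the thread counts given above; the returned peak never exceeds the number of slots, so with this kind of gating at most 3 per-thread streams can have work in flight at once.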

@jrhemstad
Contributor

PTDS (per-thread default stream) is largely untested and should be considered experimental.

You will find answers to a few of your questions about per-thread default stream here: https://developer.nvidia.com/blog/gpu-pro-tip-cuda-7-streams-simplify-concurrency/

kkraus14 added the CMake and libcudf labels and removed the Needs Triage label Jun 30, 2020
@harrism
Member

harrism commented Jul 20, 2020

PTDS support is in progress. It should work successfully for cuDF now in 0.15 (current development branch), as long as you use the RMM default memory resource, cnmem_memory_resource, or pool_memory_resource. You will probably get better overlap with pool_memory_resource since it synchronizes the device less.

@kkraus14
Collaborator

Closing as this was implemented in #4995
