
Adds DeviceBatchMemcpy algorithm and tests #359

Merged
merged 2 commits into NVIDIA:main from feature/device-batch-memcpy
Dec 30, 2022

Conversation

@elstehle (Collaborator) commented Aug 18, 2021

Algorithm Overview

The DeviceBatchMemcpy algorithm takes N input buffers and N output buffers and copies buffer_size[i] bytes from the i-th input buffer to the i-th output buffer. If any input buffer aliases memory from any output buffer, the behavior is undefined. If any output buffer aliases memory of another output buffer, the behavior is undefined. Input buffers can alias one another.
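For orientation, the interface as it eventually landed is cub::DeviceMemcpy::Batched (per the file names and later comments in this thread). Below is a minimal host-side usage sketch following CUB's usual two-phase temporary-storage convention; parameter names are illustrative:

#include <cstdint>
#include <cub/device/device_memcpy.cuh>

// Sketch: d_buffer_srcs/d_buffer_dsts are device arrays of N pointers,
// d_buffer_sizes the per-buffer byte counts.
void BatchedCopy(void **d_buffer_srcs, void **d_buffer_dsts,
                 std::uint32_t *d_buffer_sizes, std::uint32_t num_buffers,
                 cudaStream_t stream)
{
  // Phase 1: query the required temporary storage size
  void *d_temp_storage = nullptr;
  std::size_t temp_storage_bytes = 0;
  cub::DeviceMemcpy::Batched(d_temp_storage, temp_storage_bytes, d_buffer_srcs,
                             d_buffer_dsts, d_buffer_sizes, num_buffers, stream);

  // Phase 2: allocate the temporary storage and run the batched memcpy
  cudaMalloc(&d_temp_storage, temp_storage_bytes);
  cub::DeviceMemcpy::Batched(d_temp_storage, temp_storage_bytes, d_buffer_srcs,
                             d_buffer_dsts, d_buffer_sizes, num_buffers, stream);
  cudaStreamSynchronize(stream);
  cudaFree(d_temp_storage);
}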

Implementation Details

We distinguish each buffer by its size and assign it to one of three size classes (a minimal classification sketch follows this list):

  1. Thread-level buffer (TLEV buffer). A buffer that is processed by one or more threads but not a whole warp (e.g., up to 32 bytes).
  2. Warp-level buffer (WLEV buffer). A buffer that is processed by a whole warp (e.g., above 32 bytes but only up to 1024 bytes).
  3. Block-level buffer (BLEV buffer). A buffer that is processed by one or more thread blocks. The number of thread blocks assigned to such a buffer is proportional to its size (e.g., all buffers above 1024 bytes).
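A minimal sketch of this classification, using the example thresholds from the list above (in the actual implementation the thresholds are tuning-policy parameters):

#include <cstdint>

enum class SizeClass { TLEV, WLEV, BLEV };

// Sketch only: bins a buffer into one of the three size classes
__host__ __device__ inline SizeClass ClassifyBufferBySize(std::uint32_t buffer_size)
{
  if (buffer_size <= 32)   return SizeClass::TLEV; // a few threads
  if (buffer_size <= 1024) return SizeClass::WLEV; // a full warp
  return SizeClass::BLEV;                          // one or more thread blocks
}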

Step 1: Partitioning Buffers by Size

  1. Each thread block loads a tile of buffer_size[i] values.
  2. Threads compute a three-bin histogram over their assigned chunk of ITEMS_PER_THREAD buffer sizes, binning each buffer by the size class it falls into.
  3. An exclusive prefix sum is computed over the histograms. The prefix sum's aggregate reflects the number of buffers that fall into each size class. The prefix sum of each thread corresponds to the relative offset within each partition.
  4. Scatter the buffers into their partition. For each buffer, we scatter the tuple {tile_buffer_id, buffer_size}, where tile_buffer_id is the buffer id relative to the tile (i.e., from the interval [0, TILE_SIZE)). buffer_size is only defined for buffers that belong to the TLEV partition and, in that case, corresponds to the buffer's size in bytes.
Before partitioning:

tile_buffer_id    0   1   2   3   4   5     6   7
tile_buffer_size  3   37  17  4   9   4242  11  2000

After partitioning (T = TLEV, W = WLEV, B = BLEV):

size class        T   T   T   T   T   W     B   B
tile_buffer_id    0   2   3   4   6   1     5   7
tile_buffer_size  3   17  4   9   11  -     -   -

Note that the partitioning does not necessarily need to be stable. Stability may be desirable if we expect neighbouring buffers to hold neighbouring byte segments.

After the partitioning, each partition represents all the buffers that belong to the respective size class (i.e., one of TLEV, WLEV, BLEV). Depending on the size class, a different logic is applied. We process each partition separately.
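To make Steps 1-4 concrete, here is an illustrative sketch of the per-tile partitioning pass. This is not the PR's agent code; names, thresholds, and the blocked item arrangement are assumptions:

#include <cstdint>
#include <cub/block/block_scan.cuh>

struct BinCounts
{
  std::uint32_t bin[3]; // 0: TLEV, 1: WLEV, 2: BLEV
};

struct BinSum
{
  __device__ BinCounts operator()(const BinCounts &a, const BinCounts &b) const
  {
    return {{a.bin[0] + b.bin[0], a.bin[1] + b.bin[1], a.bin[2] + b.bin[2]}};
  }
};

struct PartitionedBuffer
{
  std::uint32_t tile_buffer_id;
  std::uint32_t buffer_size; // only meaningful for TLEV buffers
};

template <int BLOCK_THREADS, int ITEMS_PER_THREAD>
__device__ void PartitionTile(const std::uint32_t (&sizes)[ITEMS_PER_THREAD],
                              PartitionedBuffer (&partitions)[BLOCK_THREADS * ITEMS_PER_THREAD])
{
  using BlockScanT = cub::BlockScan<BinCounts, BLOCK_THREADS>;
  __shared__ typename BlockScanT::TempStorage scan_storage;

  auto classify = [](std::uint32_t size) {
    return size <= 32 ? 0 : (size <= 1024 ? 1 : 2); // example thresholds
  };

  // Step 2: per-thread three-bin histogram over this thread's chunk of sizes
  BinCounts counts{{0, 0, 0}};
  for (int i = 0; i < ITEMS_PER_THREAD; ++i)
    counts.bin[classify(sizes[i])]++;

  // Step 3: exclusive scan; offsets are this thread's relative offsets into
  // each partition, aggregate holds the tile-wide per-class totals
  BinCounts offsets, aggregate;
  BlockScanT(scan_storage).ExclusiveScan(counts, offsets, BinCounts{{0, 0, 0}}, BinSum{}, aggregate);

  // Partitions are laid out as [TLEV... | WLEV... | BLEV...]
  std::uint32_t base[3] = {0, aggregate.bin[0], aggregate.bin[0] + aggregate.bin[1]};

  // Step 4: scatter each buffer's {tile_buffer_id, buffer_size} tuple
  for (int i = 0; i < ITEMS_PER_THREAD; ++i)
  {
    std::uint32_t tile_buffer_id = threadIdx.x * ITEMS_PER_THREAD + i;
    int c = classify(sizes[i]);
    partitions[base[c] + offsets.bin[c]++] = {tile_buffer_id, sizes[i]};
  }
}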

Step 2.a: Copying TLEV Buffers

Usually, TLEV buffers are buffers of only a few bytes. Vectorised loads and stores do not really pay off here, as there are only a few bytes that can actually be read from a four-byte-aligned address. It also does not pay off to have two different code paths for (a) loading individual bytes from non-aligned addresses and (b) doing vectorised loads from aligned addresses.

Instead, we use the BlockRunLengthDecode algorithm to (a) coalesce reads and writes and (b) load-balance the number of bytes copied by each thread. Specifically, we are able to assign neighbouring bytes to neighbouring threads.

The following table illustrates how the first 8 bytes from the TLEV buffers get assigned to threads.

size class        T   T   T   T   T
tile_buffer_id    0   2   3   4   6
tile_buffer_size  3   17  4   9   11

[1] run_length_decode

                  t0  t1  t2  t3  t4  ...
buffer_id         0   0   0   2   2   2   2   2
byte_of_buffer    0   1   2   0   1   2   3   4

[1] Use BlockRunLengthDecode with the tile_buffer_id as the "unique_items" and each buffer's size as the respective run's length. The result of the run-length decode yields the assignment of threads to buffers, along with the specific byte within each buffer.
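A sketch of how Step 2.a can be expressed with cub::BlockRunLengthDecode (the block-level primitive added in PR #354). The surrounding bookkeeping, the constants, and the byte-wise copy are illustrative; srcs/dsts stand for the tile's TLEV source/destination pointers indexed by tile_buffer_id:

#include <cstdint>
#include <cub/block/block_run_length_decode.cuh>

template <int BLOCK_THREADS, int RUNS_PER_THREAD, int BYTES_PER_THREAD>
__device__ void CopyTlevBuffers(std::uint32_t (&tlev_buffer_ids)[RUNS_PER_THREAD],
                                std::uint32_t (&tlev_buffer_sizes)[RUNS_PER_THREAD],
                                const std::uint8_t *const *srcs,
                                std::uint8_t *const *dsts)
{
  using BlockRunLengthDecodeT =
    cub::BlockRunLengthDecode<std::uint32_t, BLOCK_THREADS, RUNS_PER_THREAD, BYTES_PER_THREAD>;
  __shared__ typename BlockRunLengthDecodeT::TempStorage temp_storage;

  // [1] Runs are the TLEV buffer ids; run lengths are the buffers' sizes
  std::uint32_t total_bytes = 0;
  BlockRunLengthDecodeT block_rld(temp_storage, tlev_buffer_ids, tlev_buffer_sizes, total_bytes);

  // Decode one window at a time; neighbouring threads get neighbouring bytes
  std::uint32_t decoded_offset = 0;
  while (decoded_offset < total_bytes)
  {
    std::uint32_t buffer_id[BYTES_PER_THREAD];
    std::uint32_t byte_of_buffer[BYTES_PER_THREAD];
    block_rld.RunLengthDecode(buffer_id, byte_of_buffer, decoded_offset);

    for (int i = 0; i < BYTES_PER_THREAD; ++i)
    {
      std::uint32_t flat_id = decoded_offset + threadIdx.x * BYTES_PER_THREAD + i;
      if (flat_id < total_bytes)
        dsts[buffer_id[i]][byte_of_buffer[i]] = srcs[buffer_id[i]][byte_of_buffer[i]];
    }
    decoded_offset += BLOCK_THREADS * BYTES_PER_THREAD;
  }
}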

Step 2.b: Copying WLEV Buffers

A full warp is assigned to each WLEV buffer. Loads from the input buffer are vectorised (aliased to a wider data type), loading 4, 8, or even 16 bytes at a time, starting from the input buffer's first address that is aligned to the aliased data type. The implementation of the vectorised copy is based on @gaohao95's (thanks!) string gather improvement in https://github.com/rapidsai/cudf/pull/7980/files

I think we want the vectorised copy to be a reusable component, but I wanted to coordinate on what exactly that would look like first. Should it be (a) a warp-/block-level copy, or should we (b) separate it into a warp-/block-level vectorised load (which may also get the async copy) and a warp-/block-level vectorised store?
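For intuition, a much-simplified warp-level sketch of the idea: copy head bytes until the source is 4-byte aligned, stream the aligned body with vectorised loads, then finish the tail byte-wise. The PR's implementation is more elaborate (e.g., it funnel-shifts so stores can be vectorised too); this is not that code:

#include <cstdint>

__device__ void WarpVectorisedCopy(int lane, std::uint8_t *dst,
                                   const std::uint8_t *src, int num_bytes)
{
  constexpr int WARP_THREADS = 32;

  // Head: single bytes until src reaches 4-byte alignment
  int head = static_cast<int>((4 - (reinterpret_cast<std::uintptr_t>(src) & 3)) & 3);
  if (head > num_bytes)
    head = num_bytes;
  for (int i = lane; i < head; i += WARP_THREADS)
    dst[i] = src[i];

  // Body: 4-byte vectorised loads from the aligned region
  int body_words = (num_bytes - head) / 4;
  const std::uint32_t *src_alias = reinterpret_cast<const std::uint32_t *>(src + head);
  for (int w = lane; w < body_words; w += WARP_THREADS)
  {
    std::uint32_t word = src_alias[w];
    // dst may be unaligned, so store the word byte-wise
    for (int b = 0; b < 4; ++b)
      dst[head + 4 * w + b] = reinterpret_cast<std::uint8_t *>(&word)[b];
  }

  // Tail: remaining bytes after the last full word
  for (int i = head + 4 * body_words + lane; i < num_bytes; i += WARP_THREADS)
    dst[i] = src[i];
}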

Step 2.c: Enqueueing BLEV Buffers

These are buffers that may be very large. We want to avoid a scenario where one very large buffer is copied by a single thread block while other thread blocks sit idle. To avoid this, BLEV buffers are put into a queue that is picked up in a subsequent kernel. In that subsequent kernel, the number of thread blocks assigned to each buffer is proportional to the buffer's size.
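In the consumer kernel, one way to map a thread block's tile to its BLEV buffer (the binary search mentioned later in this thread) is to search an exclusive prefix sum over the per-buffer tile counts. A sketch with assumed names:

#include <cstdint>

// tile_offsets has num_blev_buffers + 1 entries: an exclusive prefix sum over
// per-buffer tile counts, with the total tile count as the last entry.
// Returns the buffer i with tile_offsets[i] <= tile_id < tile_offsets[i + 1].
__device__ std::uint32_t TileToBuffer(const std::uint32_t *tile_offsets,
                                      std::uint32_t num_blev_buffers,
                                      std::uint32_t tile_id)
{
  std::uint32_t lo = 0, hi = num_blev_buffers;
  while (lo < hi)
  {
    std::uint32_t mid = (lo + hi) / 2;
    if (tile_offsets[mid + 1] <= tile_id)
      lo = mid + 1;
    else
      hi = mid;
  }
  return lo;
}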

@elstehle elstehle changed the title Feature/device batch memcpy Adds DeviceBatchMemcpy algorithm and tests Aug 18, 2021
@gevtushenko gevtushenko self-requested a review August 23, 2021 09:42
@gevtushenko (Collaborator) left a comment

Thank you! You've done tremendous work here. It's also a vital algorithm to have. I was a bit concerned with the complexity of the implementation, though. I decided to benchmark this algorithm and found a strange performance drop. For buffers of 256 std::uint32_t items, the performance seems quite impressive.

[benchmark chart]

CUB here denotes your implementation, and the memcpy represents the bandwidth of cudaMemcpyAsync applied to the sum of all buffer sizes.

But when I changed the underlying type to std::uint64_t (that is, doubled the buffer size), I observed the following.

[benchmark chart]

The code produces the correct result, so I'm not sure what the reason is. At this point, I decided to check a different approach. I applied a three-way partition, which produced reorderings for small/medium/large segments. Then I used existing facilities to copy the data.

[benchmark chart]

To handle large buffers with multiple thread blocks, I used atomic operations: I increment a value assigned to a large buffer to get the tile position within that buffer (a minimal sketch follows below). Here are the results, where I vary the number of buffers with a fixed size of 64 MB.

[benchmark chart]
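For reference, a minimal sketch of that atomic tile-scheduling idea (names assumed): each thread block claims the next tile of a large buffer by bumping a per-buffer counter.

// tile_counter is a per-buffer device counter initialised to zero
__device__ unsigned int ClaimNextTile(unsigned int *tile_counter)
{
  __shared__ unsigned int tile;
  if (threadIdx.x == 0)
    tile = atomicAdd(tile_counter, 1u);
  __syncthreads();
  return tile;
}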

The simple approach seems to perform a bit better in this particular case. I also checked the same test for 1 GB segments, and it's still better there.

[benchmark chart]

Here's the bandwidth for copying extremely small buffers (2 items of std::uint32_t type):

[benchmark chart]

Another interesting question is the temporary storage size. You currently require temporary storage proportional to four times num_buffers; I've managed to use only two times num_buffers.

To summarise the data above, the simple implementation:

  • uses half as much memory
  • is faster in some cases
  • almost fits into the screen 😄
  • uses existing facilities.

It's just a proof of concept. Therefore, I haven't considered buffers that can't be aliased to a wider type (i.e., unaligned buffers) or sizes that are not a multiple of 32 bits. I don't expect this to change the results significantly, though. Anyway, we might consider requiring alignment if it turns out to be the source of the performance issues. Padding arrays is quite a common practice.

Could you consider the simple algorithm and check how it can be helpful in your case? Even if the algorithm I've mentioned happens to be slower, I hope you'll be able to incorporate some of the ideas as building blocks of the proposed PR. For example, it could be used to deal with small numbers of buffers. Here is the CUB branch I've used for testing, and here is the benchmark and partition-based implementation.

I am looking forward to checking your results on the second stage of review!

@elstehle (Collaborator, Author) commented Aug 25, 2021

Thanks for the feedback and the preliminary evaluation, @senior-zero 👍

Fundamentally, our ideas are quite similar. You do a three-way partition over all the problems. I proposed a kernel-fused version of the "three-way partitioning", fused with the implementation for copying small and medium buffers. The goal is to solve small and medium buffers straight in that kernel instead of having to write them into a "queue" first and read them back in later. I wanted to circumvent the extra read of each problem's size and the write of its id, as well as the extra read of the partitioned id. This definitely makes the implementation more complex and, I totally agree, I'm not sure that complexity is worth it.

When I conceived this, I assumed the "worst case" scenario. In theory, let's assume these type sizes: buffer_src, buffer_dst, and buffer_size are each 4 bytes, and the average buffer size is 4 bytes. For N buffers, the fused version incurs (4 + 4 + 4 + 2 * 4) * N bytes of memory transfers. If we did a preliminary three-way partition upfront, it would be (4 + 4) * N + (4 + 4 + 4 + 4 + 2 * 4) * N. So 20 bytes per buffer versus 32 bytes per buffer: basically an extra read of buffer_size, a write of buffer_id, and another read of buffer_id.
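Spelling out that accounting (term labels are mine):

$$\underbrace{(4_{\mathrm{src}} + 4_{\mathrm{dst}} + 4_{\mathrm{size}} + 2 \cdot 4_{\mathrm{payload}})\,N}_{\text{fused}} = 20N$$

$$\underbrace{(4_{\mathrm{size}} + 4_{\mathrm{id}})\,N}_{\text{partition pass}} + \underbrace{(4_{\mathrm{id}} + 4_{\mathrm{src}} + 4_{\mathrm{dst}} + 4_{\mathrm{size}} + 2 \cdot 4_{\mathrm{payload}})\,N}_{\text{copy pass}} = 32N$$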

Now, we also see that, unfortunately, we cannot sustain anywhere near peak memory bandwidth for such tiny buffers. So the question is whether we want to take the theoretical model into consideration at all.

I see three decisions we need to make:
(1) I think whether to kernel-fuse or not to kernel-fuse is the key decision we have to make. We'll probably need an apples-to-apples comparison with identical implementations of the "small" buffer logic to see the performance difference. I'll try to evaluate this in the coming days. Then we can make an educated decision about code complexity versus performance.

(2) What I also like is using atomics for the scheduling/load balancing of large buffers. The performance drop you see going from 1KB to 2KB buffers is a combination of a configuration discrepancy (my bad) and a general performance regression when the tile size (or "task" size, i.e., the most granular unit assigned to thread blocks) is too small. The binary search seems to dominate in that case. I also want to see if streaming reads and writes will alleviate this. So we'll also need to compare these two mechanisms and factor out other side effects too.

(3) What is left is the actual implementation of how we're copying small, medium, and large buffers, respectively. I think it is easy to exchange one for the other. Once we've figured out the former two decisions, this will be easy.

So I would proceed in that order. Does that sound good?

As for:

uses half as much memory

That can easily be done for the kernel-fused version too, right? It's just a matter of trading memory for more coalesced accesses. That is, I'm materialising the source and destination pointers of large buffers instead of keeping the indirection. I'm also fine with having the indirection in this particular case.

is faster in some cases

I'm all in for fast 😁 We just need a more differentiated and elaborate evaluation to track down where the difference actually comes from.

almost fits into the screen 😄

💯

uses existing facilities.

I'm all in for using existing building blocks. The problem is that I didn't assume the pointers to be aligned and so had to devise special treatment to be able to vectorise some loads/stores. If we can get the performance from existing building blocks, let's go for that. Otherwise let's make it a reusable building block.

@jrhemstad (Collaborator) commented

I'm all in for using existing building blocks. The problem is that I didn't assume the pointers to be aligned and so had to devise special treatment to be able to vectorise some loads/stores. If we can get the performance from existing building blocks, let's go for that. Otherwise let's make it a reusable building block.

I've long wanted a cuda::memcpy that would handle runtime-determined alignment as well as take a CG parameter to use multiple threads to perform the copy. That seems like the best place to put such a building block, as it could have widespread applicability.

@elstehle elstehle changed the title Adds DeviceBatchMemcpy algorithm and tests [WIP] Adds DeviceBatchMemcpy algorithm and tests Oct 13, 2021
@elstehle (Collaborator, Author) commented

I'm currently gathering results of a few more benchmarks that will hopefully help us make an informed decision about which of the scheduling mechanisms to pursue (preliminary three-way partition vs. single-pass prefix scan-based). I'll post the results shortly.

In the meanwhile, PR #354, on which this PR builds, should be ready for review.

@alliepiper (Collaborator) commented

FYI, I'm starting the 1.15 RC next week so I'm bumping this to 1.16. I'll try to get to NVIDIA/cccl#1006 before the release.

@alliepiper alliepiper added this to the 1.16.0 milestone Oct 14, 2021
@alliepiper alliepiper added the helps: rapids label Oct 14, 2021
@alliepiper alliepiper marked this pull request as draft October 14, 2021 20:27
@elstehle (Collaborator, Author) commented Oct 18, 2021

So I ran the first batch of benchmarks. I'll add more throughout the week.

Methodology

  • Benchmarks were run on a V100
  • We allocate two large buffers on device memory: one for the input, one for the output
  • We generate an array of buffer_sizes. Buffer sizes are uniform random in the interval [<Min. buffer size>, <max. buffer size>]
  • We generate an offsets array for the input buffer batch, which will alias into the input memory allocation, and an offsets array for the output buffer batch, which will alias into the output memory allocation.
  • These offsets can be generated one of two ways (depending on the experiment):
    • CONSECUTIVE (C): offset[0] = 0; offset[i] = offset[i-1] + buffer_sizes[i-1];
    • SHUFFLE (S): the offsets are generated "somewhat" like CONSECUTIVE but are then shuffled. That makes sure that the bytes of buffer[i] are at a different location than those of buffer[j] for i != j (a host-side sketch of both modes follows this list).
  • Further, offsets and sizes are made to comply with a configurable AtomicT. That is, offsets will be aligned to integer multiples of AtomicT, and buffer_sizes will likewise be integer multiples of AtomicT.
  • The charts show the achieved memory throughput on the y-axis (i.e., all the required memory transfers, such as reading buffer sizes, reading buffer offsets, reading the bytes to be copied, and writing the bytes to be copied, divided by the total run time)
  • The charts label the input on the x-axis: <INPUT-OFFSET-GEN>_<OUTPUT-OFFSET-GEN>_<Min. buffer size>_<max. buffer size>
    • For instance, the label C_S_1_8 means:
    • C: the input buffers are consecutive in memory
    • S: the output buffers are shuffled ("random" writes)
    • 1: the minimum buffer size is 1
    • 8: the maximum buffer size is 8
  • Generally, we compared different aspects of the three-way partition implementation (TWP) (see here) versus the single-pass prefix scan-based implementation (SPPS) (this PR, originally).
  • This is the branch that the benchmarks were run on:
Compilation example / details
nvcc -DTWP_TLEV_CPY=0 -DLD_VECTORIZED=0 -DATOMIC_CPY_TYPE=uint8_t -DTLEV_ALIAS_TYPE=uint8_t -DWLEV_MIN_SIZE=17000 -DBLEV_MIN_SIZE=17000 -Xptxas -v -lineinfo --generate-code arch=compute_70,code=sm_70 -DTHRUST_IGNORE_CUB_VERSION_CHECK -I<your-thrust-path> -I<your-cub-path> test_device_batch_memcpy.cu -o test_memcpy && ./test_memcpy 
  • TWP_TLEV_CPY: whether to use TWP's small-buffer copying logic inside of SPPS
  • LD_VECTORIZED: whether to enable CUB vectorized loads inside TWP's copy logic
  • ATOMIC_CPY_TYPE: buffers will be aligned to this type and their sizes will be integer multiples of it
  • TLEV_ALIAS_TYPE: the data type being copied; its size may not exceed that of ATOMIC_CPY_TYPE
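For concreteness, a host-side sketch of the two offset-generation modes described in the list above. This is one plausible reading of SHUFFLE (permute the buffer order before laying the buffers out back-to-back); the alignment handling per AtomicT is omitted:

#include <algorithm>
#include <cstdint>
#include <numeric>
#include <random>
#include <vector>

std::vector<std::size_t> MakeOffsets(const std::vector<std::uint32_t> &buffer_sizes, bool shuffle)
{
  std::vector<std::uint32_t> order(buffer_sizes.size());
  std::iota(order.begin(), order.end(), 0U);
  if (shuffle)
  {
    std::mt19937 rng{42};
    std::shuffle(order.begin(), order.end(), rng);
  }
  // Offsets of the (possibly permuted) buffers within one large allocation
  std::vector<std::size_t> offsets(buffer_sizes.size());
  std::size_t running = 0;
  for (std::uint32_t id : order)
  {
    offsets[id] = running;
    running += buffer_sizes[id];
  }
  return offsets;
}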

Copying of small buffers logic

  • For these tests, the thresholds for medium (aka "WLEV") and large (aka "BLEV") buffers were set so high that all buffers would be copied by the copy-small-buffer or copy-TLEV-buffer logic, respectively.
  • The benchmarks serve two purposes:
    • identify which implementation to choose for the copy small buffer logic.
    • get an initial idea of the "scheduling overhead" (the "scheduling" being the logic that partitions the buffers into "small", "medium", and "large" buffers).
  • Both implementations were adapted and made configurable to perform aliased loads of TLEV_ALIAS_TYPE. Various TLEV_ALIAS_TYPE were tested.
  • The "copy small buffer logic" from TWP was ported into SPPS. This allowed to factor out performance differences due to scheduling differences. Similarly, it allowed to compare the scheduling overhead.

No Aliased Loads, No Buffer Size Variance

using AtomicT=uint8_t; using TLEV_ALIAS_TYPE=uint8_t;

[chart: cpy1_alias1_tlv_no_variance]

Data
Min. buffer size | Max. buffer size | in_gen | out_gen | src size | dst size | sizes size | data_size | total | duration (SPPS) | BW (SPPS) | duration (TWP) | BW (TWP) | relative performance
2 2 CONSECUTIVE CONSECUTIVE 0.357914 0.357914 0.357914 0.333333 1.333333 4.621500 309.781000 9.954430 143.821000 46.43%
4 4 CONSECUTIVE CONSECUTIVE 0.214748 0.214748 0.214748 0.400000 1.000000 3.467490 309.660000 6.699330 160.276000 51.76%
8 8 CONSECUTIVE CONSECUTIVE 0.119305 0.119305 0.119305 0.444444 0.777778 2.687550 310.741000 3.803490 219.570000 70.66%
16 16 CONSECUTIVE CONSECUTIVE 0.063161 0.063161 0.063161 0.470588 0.647059 2.201060 315.655000 2.124030 327.102000 103.63%
32 32 CONSECUTIVE CONSECUTIVE 0.032538 0.032538 0.032538 0.484848 0.575758 1.669120 370.384000 1.985600 311.349000 84.06%
64 64 CONSECUTIVE CONSECUTIVE 0.016519 0.016519 0.016519 0.492308 0.538462 1.382300 418.264000 2.191580 263.813000 63.07%
128 128 CONSECUTIVE CONSECUTIVE 0.008324 0.008324 0.008324 0.496124 0.519380 1.289440 432.498000 3.027520 184.204000 42.59%
256 256 CONSECUTIVE CONSECUTIVE 0.004178 0.004178 0.004178 0.498054 0.509727 1.540160 355.363000 3.386910 161.597000 45.47%
512 512 CONSECUTIVE CONSECUTIVE 0.002093 0.002093 0.002093 0.499024 0.504872 1.684000 321.914000 3.484130 155.592000 48.33%
1024 1024 CONSECUTIVE CONSECUTIVE 0.001048 0.001048 0.001048 0.499512 0.502439 1.548160 348.471000 5.698080 94.679100 27.17%
4096 4096 CONSECUTIVE CONSECUTIVE 0.000262 0.000262 0.000262 0.499878 0.500610 3.697090 145.392000 5.926180 90.703700 62.39%
2 2 CONSECUTIVE SHFL 0.357914 0.357914 0.357914 0.333333 1.333333 49.238400 29.076000 56.410400 25.379300 87.29%
4 4 CONSECUTIVE SHFL 0.214748 0.214748 0.214748 0.400000 1.000000 29.612200 36.260200 34.172600 31.421100 86.65%
8 8 CONSECUTIVE SHFL 0.119305 0.119305 0.119305 0.444444 0.777778 17.402500 47.989100 20.556500 40.626100 84.66%
16 16 CONSECUTIVE SHFL 0.063161 0.063161 0.063161 0.470588 0.647059 9.428540 73.688400 10.772400 64.495900 87.53%
32 32 CONSECUTIVE SHFL 0.032538 0.032538 0.032538 0.484848 0.575758 3.217150 192.162000 8.831460 70.001500 36.43%
64 64 CONSECUTIVE SHFL 0.016519 0.016519 0.016519 0.492308 0.538462 1.791100 322.800000 8.575740 67.419100 20.89%
128 128 CONSECUTIVE SHFL 0.008324 0.008324 0.008324 0.496124 0.519380 1.414910 394.145000 8.726940 63.903200 16.21%
256 256 CONSECUTIVE SHFL 0.004178 0.004178 0.004178 0.498054 0.509727 1.517700 360.623000 8.655330 63.234500 17.53%
512 512 CONSECUTIVE SHFL 0.002093 0.002093 0.002093 0.499024 0.504872 1.620580 334.512000 8.639520 62.746800 18.76%
1024 1024 CONSECUTIVE SHFL 0.001048 0.001048 0.001048 0.499512 0.502439 1.554850 346.972000 8.712640 61.920300 17.85%
4096 4096 CONSECUTIVE SHFL 0.000262 0.000262 0.000262 0.499878 0.500610 3.682690 145.960000 8.768610 61.301200 42.00%
2 2 SHFL CONSECUTIVE 0.357914 0.357914 0.357914 0.333333 1.333333 18.644400 76.787500 26.246000 54.547600 71.04%
4 4 SHFL CONSECUTIVE 0.214748 0.214748 0.214748 0.400000 1.000000 12.255600 87.612000 16.030800 66.979900 76.45%
8 8 SHFL CONSECUTIVE 0.119305 0.119305 0.119305 0.444444 0.777778 7.418370 112.576000 9.342270 89.392900 79.41%
16 16 SHFL CONSECUTIVE 0.063161 0.063161 0.063161 0.470588 0.647059 4.461890 155.713000 5.186690 133.953000 86.03%
32 32 SHFL CONSECUTIVE 0.032538 0.032538 0.032538 0.484848 0.575758 2.757440 224.199000 3.560380 173.637000 77.45%
64 64 SHFL CONSECUTIVE 0.016519 0.016519 0.016519 0.492308 0.538462 1.771780 326.322000 3.488540 165.734000 50.79%
128 128 SHFL CONSECUTIVE 0.008324 0.008324 0.008324 0.496124 0.519380 1.511550 368.945000 4.146560 134.492000 36.45%
256 256 SHFL CONSECUTIVE 0.004178 0.004178 0.004178 0.498054 0.509727 1.510910 362.242000 4.460800 122.694000 33.87%
512 512 SHFL CONSECUTIVE 0.002093 0.002093 0.002093 0.499024 0.504872 1.647780 328.990000 4.812100 112.654000 34.24%
1024 1024 SHFL CONSECUTIVE 0.001048 0.001048 0.001048 0.499512 0.502439 1.550660 347.910000 6.096960 88.485000 25.43%
4096 4096 SHFL CONSECUTIVE 0.000262 0.000262 0.000262 0.499878 0.500610 3.689700 145.683000 5.451840 98.595400 67.68%
2 2 SHFL SHFL 0.357914 0.357914 0.357914 0.333333 1.333333 55.553100 25.770900 62.612900 22.865200 88.72%
4 4 SHFL SHFL 0.214748 0.214748 0.214748 0.400000 1.000000 33.896400 31.677200 41.284300 26.008500 82.10%
8 8 SHFL SHFL 0.119305 0.119305 0.119305 0.444444 0.777778 19.720600 42.348100 22.110000 37.771800 89.19%
16 16 SHFL SHFL 0.063161 0.063161 0.063161 0.470588 0.647059 10.819900 64.212600 12.307200 56.452800 87.92%
32 32 SHFL SHFL 0.032538 0.032538 0.032538 0.484848 0.575758 4.889980 126.425000 11.500200 53.757100 42.52%
64 64 SHFL SHFL 0.016519 0.016519 0.016519 0.492308 0.538462 2.364290 244.542000 11.406400 50.688100 20.73%
128 128 SHFL SHFL 0.008324 0.008324 0.008324 0.496124 0.519380 1.471460 378.999000 11.314400 49.289500 13.01%
256 256 SHFL SHFL 0.004178 0.004178 0.004178 0.498054 0.509727 1.523420 359.267000 11.319400 48.351900 13.46%
512 512 SHFL SHFL 0.002093 0.002093 0.002093 0.499024 0.504872 1.624220 333.761000 11.340800 47.801000 14.32%
1024 1024 SHFL SHFL 0.001048 0.001048 0.001048 0.499512 0.502439 1.547520 348.615000 11.365800 47.465900 13.62%
4096 4096 SHFL SHFL 0.000262 0.000262 0.000262 0.499878 0.500610 3.674820 146.273000 11.286400 47.625900 32.56%

Scheduling: TWP vs. SPPS; No Aliased Loads, No Buffer Size Variance

using AtomicT=uint8_t; using TLEV_ALIAS_TYPE=uint8_t;

Here, the small-buffer copying logic from TWP was moved into SPPS. Hence, we aim to limit the difference between the two to the scheduling (i.e., the partitioning into small, medium, and large buffers).

[chart: cpy1_alias1_twp_smb_no_variance]

Data
Min. buffer size | Max. buffer size | in_gen | out_gen | src size | dst size | sizes size | data_size | total | duration (SPPS) | BW (SPPS) | duration (TWP) | BW (TWP) | relative performance
2 2 CONSECUTIVE CONSECUTIVE 0.357914 0.357914 0.357914 0.333333 1.333333 5.986530 239.146000 10.124700 141.403000 59.13%
4 4 CONSECUTIVE CONSECUTIVE 0.214748 0.214748 0.214748 0.400000 1.000000 4.014370 267.475000 6.655810 161.324000 60.31%
8 8 CONSECUTIVE CONSECUTIVE 0.119305 0.119305 0.119305 0.444444 0.777778 2.444420 341.649000 3.843840 217.265000 63.59%
16 16 CONSECUTIVE CONSECUTIVE 0.063161 0.063161 0.063161 0.470588 0.647059 1.522910 456.214000 2.134660 325.474000 71.34%
32 32 CONSECUTIVE CONSECUTIVE 0.032538 0.032538 0.032538 0.484848 0.575758 1.891680 326.807000 2.098430 294.608000 90.15%
64 64 CONSECUTIVE CONSECUTIVE 0.016519 0.016519 0.016519 0.492308 0.538462 2.169700 266.474000 2.306210 250.701000 94.08%
128 128 CONSECUTIVE CONSECUTIVE 0.008324 0.008324 0.008324 0.496124 0.519380 3.181600 175.283000 3.193500 174.629000 99.63%
256 256 CONSECUTIVE CONSECUTIVE 0.004178 0.004178 0.004178 0.498054 0.509727 3.356580 163.058000 3.381950 161.834000 99.25%
512 512 CONSECUTIVE CONSECUTIVE 0.002093 0.002093 0.002093 0.499024 0.504872 4.032830 134.422000 3.437120 157.720000 117.33%
1024 1024 CONSECUTIVE CONSECUTIVE 0.001048 0.001048 0.001048 0.499512 0.502439 4.960060 108.767000 5.690910 94.798400 87.16%
4096 4096 CONSECUTIVE CONSECUTIVE 0.000262 0.000262 0.000262 0.499878 0.500610 4.012350 133.968000 5.090080 105.603000 78.83%
2 2 CONSECUTIVE SHFL 0.357914 0.357914 0.357914 0.333333 1.333333 50.586300 28.301300 56.376300 25.394600 89.73%
4 4 CONSECUTIVE SHFL 0.214748 0.214748 0.214748 0.400000 1.000000 31.578500 34.002300 34.485900 31.135700 91.57%
8 8 CONSECUTIVE SHFL 0.119305 0.119305 0.119305 0.444444 0.777778 18.002200 46.390600 20.469100 40.799700 87.95%
16 16 CONSECUTIVE SHFL 0.063161 0.063161 0.063161 0.470588 0.647059 9.599460 72.376400 10.282200 67.570300 93.36%
32 32 CONSECUTIVE SHFL 0.032538 0.032538 0.032538 0.484848 0.575758 7.941060 77.850500 8.469500 72.993000 93.76%
64 64 CONSECUTIVE SHFL 0.016519 0.016519 0.016519 0.492308 0.538462 8.077950 71.573700 7.940700 72.810800 101.73%
128 128 CONSECUTIVE SHFL 0.008324 0.008324 0.008324 0.496124 0.519380 8.560700 65.144200 8.692510 64.156400 98.48%
256 256 CONSECUTIVE SHFL 0.004178 0.004178 0.004178 0.498054 0.509727 8.261570 66.248400 8.651580 63.261900 95.49%
512 512 CONSECUTIVE SHFL 0.002093 0.002093 0.002093 0.499024 0.504872 8.394620 64.577300 8.646400 62.696900 97.09%
1024 1024 CONSECUTIVE SHFL 0.001048 0.001048 0.001048 0.499512 0.502439 8.386400 64.329100 8.718530 61.878500 96.19%
4096 4096 CONSECUTIVE SHFL 0.000262 0.000262 0.000262 0.499878 0.500610 7.356190 73.071200 8.759580 61.364300 83.98%
2 2 SHFL CONSECUTIVE 0.357914 0.357914 0.357914 0.333333 1.333333 20.325500 70.436500 26.223800 54.593700 77.51%
4 4 SHFL CONSECUTIVE 0.214748 0.214748 0.214748 0.400000 1.000000 13.297300 80.749000 15.982300 67.183200 83.20%
8 8 SHFL CONSECUTIVE 0.119305 0.119305 0.119305 0.444444 0.777778 7.824000 106.740000 9.367580 89.151300 83.52%
16 16 SHFL CONSECUTIVE 0.063161 0.063161 0.063161 0.470588 0.647059 4.461790 155.716000 5.174910 134.258000 86.22%
32 32 SHFL CONSECUTIVE 0.032538 0.032538 0.032538 0.484848 0.575758 3.184740 194.118000 3.553890 173.955000 89.61%
64 64 SHFL CONSECUTIVE 0.016519 0.016519 0.016519 0.492308 0.538462 3.536380 163.491000 3.474500 166.404000 101.78%
128 128 SHFL CONSECUTIVE 0.008324 0.008324 0.008324 0.496124 0.519380 4.170270 133.727000 4.114560 135.538000 101.35%
256 256 SHFL CONSECUTIVE 0.004178 0.004178 0.004178 0.498054 0.509727 5.023330 108.955000 4.459870 122.720000 112.63%
512 512 SHFL CONSECUTIVE 0.002093 0.002093 0.002093 0.499024 0.504872 6.173220 87.815300 4.829820 112.241000 127.81%
1024 1024 SHFL CONSECUTIVE 0.001048 0.001048 0.001048 0.499512 0.502439 5.178560 104.177000 6.105310 88.363900 84.82%
4096 4096 SHFL CONSECUTIVE 0.000262 0.000262 0.000262 0.499878 0.500610 4.300130 125.002000 5.436320 98.876800 79.10%
2 2 SHFL SHFL 0.357914 0.357914 0.357914 0.333333 1.333333 55.162900 25.953200 61.979200 23.099000 89.00%
4 4 SHFL SHFL 0.214748 0.214748 0.214748 0.400000 1.000000 36.166200 29.689100 41.238000 26.037700 87.70%
8 8 SHFL SHFL 0.119305 0.119305 0.119305 0.444444 0.777778 20.368300 41.001600 23.170100 36.043500 87.91%
16 16 SHFL SHFL 0.063161 0.063161 0.063161 0.470588 0.647059 11.429500 60.787900 12.343000 56.289000 92.60%
32 32 SHFL SHFL 0.032538 0.032538 0.032538 0.484848 0.575758 11.688100 52.892800 11.503300 53.742600 101.61%
64 64 SHFL SHFL 0.016519 0.016519 0.016519 0.492308 0.538462 10.376100 55.721000 11.411700 50.664600 90.93%
128 128 SHFL SHFL 0.008324 0.008324 0.008324 0.496124 0.519380 10.765200 51.803900 11.311900 49.300400 95.17%
256 256 SHFL SHFL 0.004178 0.004178 0.004178 0.498054 0.509727 10.185700 53.733700 11.342400 48.253900 89.80%
512 512 SHFL SHFL 0.002093 0.002093 0.002093 0.499024 0.504872 10.719000 50.574100 11.362600 47.709300 94.34%
1024 1024 SHFL SHFL 0.001048 0.001048 0.001048 0.499512 0.502439 10.132100 53.245700 11.375600 47.425100 89.07%
4096 4096 SHFL SHFL 0.000262 0.000262 0.000262 0.499878 0.500610 8.641660 62.201700 11.305000 47.547500 76.44%

No Aliased Loads, Varying Buffer Size

using AtomicT=uint8_t; using TLEV_ALIAS_TYPE=uint8_t;

We now look at varying buffer sizes, where buffer sizes are uniformly distributed in [<Min. buffer size>, <max. buffer size>]. This highlights how resilient a method is to load imbalance.

[chart: cpy1_alias1_tlv]

Data
Min. buffer size | Max. buffer size | in_gen | out_gen | src size | dst size | sizes size | data_size | total | duration (SPPS) | BW (SPPS) | duration (TWP) | BW (TWP) | relative performance
1 2 CONSECUTIVE CONSECUTIVE 0.357914 0.357914 0.357914 0.249993 1.249993 4.596700 291.985000 8.607710 155.926000 53.40%
1 4 CONSECUTIVE CONSECUTIVE 0.214748 0.214748 0.214748 0.249994 0.849994 3.141120 290.557000 5.651420 161.495000 55.58%
1 8 CONSECUTIVE CONSECUTIVE 0.119305 0.119305 0.119305 0.250006 0.583340 2.129500 294.132000 3.647010 171.745000 58.39%
1 16 CONSECUTIVE CONSECUTIVE 0.063161 0.063161 0.063161 0.249998 0.426469 1.453890 314.961000 2.061310 222.149000 70.53%
1 32 CONSECUTIVE CONSECUTIVE 0.032538 0.032538 0.032538 0.250025 0.340934 1.083390 337.898000 1.477500 247.766000 73.33%
1 64 CONSECUTIVE CONSECUTIVE 0.016519 0.016519 0.016519 0.249996 0.296150 0.906016 350.974000 1.486850 213.867000 60.94%
1 128 CONSECUTIVE CONSECUTIVE 0.008324 0.008324 0.008324 0.249913 0.273169 0.862752 339.974000 1.688900 173.671000 51.08%
1 256 CONSECUTIVE CONSECUTIVE 0.004178 0.004178 0.004178 0.249916 0.261589 0.882688 318.209000 1.858720 151.115000 47.49%
1 512 CONSECUTIVE CONSECUTIVE 0.002093 0.002093 0.002093 0.249880 0.255728 1.059650 259.130000 2.133860 128.681000 49.66%
1 1024 CONSECUTIVE CONSECUTIVE 0.001048 0.001048 0.001048 0.249926 0.252852 0.846240 320.829000 2.534530 107.120000 33.39%
1 4096 CONSECUTIVE CONSECUTIVE 0.000262 0.000262 0.000262 0.249683 0.250416 1.787710 150.405000 3.003650 89.518400 59.52%
1 2 CONSECUTIVE SHFL 0.357914 0.357914 0.357914 0.249993 1.249993 48.814200 27.495400 56.818300 23.622100 85.91%
1 4 CONSECUTIVE SHFL 0.214748 0.214748 0.214748 0.249994 0.849994 28.978200 31.495200 32.700900 27.909800 88.62%
1 8 CONSECUTIVE SHFL 0.119305 0.119305 0.119305 0.250006 0.583340 16.414600 38.158600 19.218200 32.591700 85.41%
1 16 CONSECUTIVE SHFL 0.063161 0.063161 0.063161 0.249998 0.426469 8.966460 51.070000 9.698660 47.214500 92.45%
1 32 CONSECUTIVE SHFL 0.032538 0.032538 0.032538 0.250025 0.340934 4.959360 73.815100 6.123740 59.779700 80.99%
1 64 CONSECUTIVE SHFL 0.016519 0.016519 0.016519 0.249996 0.296150 3.322850 95.697500 5.237570 60.713000 63.44%
1 128 CONSECUTIVE SHFL 0.008324 0.008324 0.008324 0.249913 0.273169 2.408350 121.790000 4.298780 68.231600 56.02%
1 256 CONSECUTIVE SHFL 0.004178 0.004178 0.004178 0.249916 0.261589 1.752480 160.275000 3.876450 72.458000 45.21%
1 512 CONSECUTIVE SHFL 0.002093 0.002093 0.002093 0.249880 0.255728 1.450690 189.280000 3.607840 76.108200 40.21%
1 1024 CONSECUTIVE SHFL 0.001048 0.001048 0.001048 0.249926 0.252852 0.955200 284.232000 3.463330 78.392300 27.58%
1 4096 CONSECUTIVE SHFL 0.000262 0.000262 0.000262 0.249683 0.250416 1.826110 147.243000 3.522820 76.325800 51.84%
1 2 SHFL CONSECUTIVE 0.357914 0.357914 0.357914 0.249993 1.249993 19.359000 69.330400 26.145700 51.334200 74.04%
1 4 SHFL CONSECUTIVE 0.214748 0.214748 0.214748 0.249994 0.849994 12.373200 73.762500 16.047400 56.873800 77.10%
1 8 SHFL CONSECUTIVE 0.119305 0.119305 0.119305 0.250006 0.583340 7.264740 86.218700 9.101380 68.819900 79.82%
1 16 SHFL CONSECUTIVE 0.063161 0.063161 0.063161 0.249998 0.426469 4.251100 107.717000 4.975070 92.042400 85.45%
1 32 SHFL CONSECUTIVE 0.032538 0.032538 0.032538 0.250025 0.340934 2.534140 144.457000 3.146750 116.334000 80.53%
1 64 SHFL CONSECUTIVE 0.016519 0.016519 0.016519 0.249996 0.296150 1.587200 200.345000 2.541380 125.124000 62.45%
1 128 SHFL CONSECUTIVE 0.008324 0.008324 0.008324 0.249913 0.273169 1.161250 252.584000 2.278080 128.754000 50.97%
1 256 SHFL CONSECUTIVE 0.004178 0.004178 0.004178 0.249916 0.261589 1.019070 275.623000 2.418820 116.123000 42.13%
1 512 SHFL CONSECUTIVE 0.002093 0.002093 0.002093 0.249880 0.255728 1.100930 249.413000 2.571230 106.792000 42.82%
1 1024 SHFL CONSECUTIVE 0.001048 0.001048 0.001048 0.249926 0.252852 0.944928 287.322000 2.851710 95.205300 33.14%
1 4096 SHFL CONSECUTIVE 0.000262 0.000262 0.000262 0.249683 0.250416 2.053470 130.940000 3.184030 84.446900 64.49%
1 2 SHFL SHFL 0.357914 0.357914 0.357914 0.249993 1.249993 55.349800 24.248800 61.491900 21.826800 90.01%
1 4 SHFL SHFL 0.214748 0.214748 0.214748 0.249994 0.849994 33.883500 26.935600 37.047000 24.635500 91.46%
1 8 SHFL SHFL 0.119305 0.119305 0.119305 0.250006 0.583340 19.901200 31.473300 21.161800 29.598400 94.04%
1 16 SHFL SHFL 0.063161 0.063161 0.063161 0.249998 0.426469 10.758600 42.563100 11.525800 39.729700 93.34%
1 32 SHFL SHFL 0.032538 0.032538 0.032538 0.250025 0.340934 5.817920 62.922100 7.820100 46.812200 74.40%
1 64 SHFL SHFL 0.016519 0.016519 0.016519 0.249996 0.296150 3.873250 82.098600 6.578460 48.337800 58.88%
1 128 SHFL SHFL 0.008324 0.008324 0.008324 0.249913 0.273169 2.744930 106.856000 5.650620 51.908100 48.58%
1 256 SHFL SHFL 0.004178 0.004178 0.004178 0.249916 0.261589 1.927710 145.706000 5.295460 53.041600 36.40%
1 512 SHFL SHFL 0.002093 0.002093 0.002093 0.249880 0.255728 1.533060 179.110000 5.134270 53.481100 29.86%
1 1024 SHFL SHFL 0.001048 0.001048 0.001048 0.249926 0.252852 1.016000 267.223000 5.050750 53.754000 20.12%
1 4096 SHFL SHFL 0.000262 0.000262 0.000262 0.249683 0.250416 2.069410 129.932000 5.016030 53.604500 41.26%

16B-aligned buffers, 4B-aliased copies, Varying Buffer Size

using AtomicT=uint4; using TLEV_ALIAS_TYPE=uint32_t;

This experiment analyses the benefit that aliased loads (i.e., reinterpret_cast<TLEV_ALIAS_TYPE*>(..)) could have. Note that the buffer size was bumped to be at least 16B here. We also tested what impact vectorised loads would have.

[chart: cpy16_alias4]

Data
Min. buffer size | Max. buffer size | in_gen | out_gen | src size | dst size | sizes size | data_size | total | duration (SPPS) | BW (SPPS) | duration (TWP) | BW (TWP) | duration (TWP, uint4 VECT) | BW (TWP, uint4 VECT)
16 16 CONSECUTIVE CONSECUTIVE 0.033554 0.033554 0.033554 0.250000 0.343750 0.787264 468.837000 1.163460 317.243000 1.162980 317.374000
16 16 CONSECUTIVE CONSECUTIVE 0.033554 0.033554 0.033554 0.250000 0.343750 0.781184 472.486000 1.165380 316.721000 1.161570 317.759000
16 16 CONSECUTIVE CONSECUTIVE 0.033554 0.033554 0.033554 0.250000 0.343750 0.792992 465.451000 1.168290 315.931000 1.166080 316.530000
16 16 CONSECUTIVE CONSECUTIVE 0.033554 0.033554 0.033554 0.250000 0.343750 0.793984 464.869000 1.168860 315.776000 1.162880 317.401000
32 32 CONSECUTIVE CONSECUTIVE 0.022370 0.022370 0.022370 0.333333 0.395833 0.769184 552.563000 1.028770 413.138000 0.831776 308.463000
64 64 CONSECUTIVE CONSECUTIVE 0.013422 0.013422 0.013422 0.400000 0.437500 0.802336 585.493000 1.062530 442.117000 0.671552 386.203000
128 128 CONSECUTIVE CONSECUTIVE 0.007457 0.007457 0.007457 0.444444 0.465278 0.867104 576.157000 1.267010 394.305000 0.763072 344.623000
256 256 CONSECUTIVE CONSECUTIVE 0.003948 0.003948 0.003948 0.470588 0.481618 0.997184 518.593000 1.258050 411.060000 0.982496 270.170000
512 512 CONSECUTIVE CONSECUTIVE 0.002034 0.002034 0.002034 0.484848 0.490530 1.165660 451.848000 1.420580 370.767000 1.217820 219.087000
1024 1024 CONSECUTIVE CONSECUTIVE 0.001032 0.001032 0.001032 0.492308 0.495192 1.124000 473.050000 2.384060 223.026000 1.493440 179.165000
4096 4096 CONSECUTIVE CONSECUTIVE 0.000261 0.000261 0.000261 0.498047 0.498776 2.643140 202.622000 2.457500 217.927000 1.751390 152.954000
16 16 CONSECUTIVE SHFL 0.033554 0.033554 0.033554 0.250000 0.343750 4.878300 75.661300 5.574110 66.216600 5.578180 66.168400
16 16 CONSECUTIVE SHFL 0.033554 0.033554 0.033554 0.250000 0.343750 4.876960 75.682100 5.567520 66.295000 5.589150 66.038400
16 16 CONSECUTIVE SHFL 0.033554 0.033554 0.033554 0.250000 0.343750 4.876290 75.692600 5.567840 66.291200 5.579940 66.147500
16 16 CONSECUTIVE SHFL 0.033554 0.033554 0.033554 0.250000 0.343750 4.863940 75.884800 5.579390 66.153900 5.601250 65.895800
32 32 CONSECUTIVE SHFL 0.022370 0.022370 0.022370 0.333333 0.395833 2.298430 184.919000 4.205730 101.058000 3.670780 69.895800
64 64 CONSECUTIVE SHFL 0.013422 0.013422 0.013422 0.400000 0.437500 1.488800 315.531000 2.546590 184.467000 2.840000 91.322400
128 128 CONSECUTIVE SHFL 0.007457 0.007457 0.007457 0.444444 0.465278 0.930208 537.071000 2.278430 219.268000 2.357250 111.559000
256 256 CONSECUTIVE SHFL 0.003948 0.003948 0.003948 0.470588 0.481618 1.015680 509.150000 2.528100 204.554000 2.127580 124.762000
512 512 CONSECUTIVE SHFL 0.002034 0.002034 0.002034 0.484848 0.490530 1.155390 455.865000 2.660350 197.982000 1.935900 137.822000
1024 1024 CONSECUTIVE SHFL 0.001032 0.001032 0.001032 0.492308 0.495192 1.134820 468.542000 2.987360 177.986000 1.912030 139.941000
4096 4096 CONSECUTIVE SHFL 0.000261 0.000261 0.000261 0.498047 0.498776 2.639420 202.907000 2.936930 182.353000 1.874210 142.931000
16 16 SHFL CONSECUTIVE 0.033554 0.033554 0.033554 0.250000 0.343750 2.227260 165.718000 2.585470 142.759000 2.585380 142.764000
16 16 SHFL CONSECUTIVE 0.033554 0.033554 0.033554 0.250000 0.343750 2.230660 165.466000 2.594050 142.287000 2.594500 142.262000
16 16 SHFL CONSECUTIVE 0.033554 0.033554 0.033554 0.250000 0.343750 2.224450 165.928000 2.583010 142.895000 2.594080 142.285000
16 16 SHFL CONSECUTIVE 0.033554 0.033554 0.033554 0.250000 0.343750 2.224380 165.933000 2.587460 142.649000 2.600770 141.919000
32 32 SHFL CONSECUTIVE 0.022370 0.022370 0.022370 0.333333 0.395833 1.671490 254.278000 1.898340 223.892000 1.747550 146.818000
64 64 SHFL CONSECUTIVE 0.013422 0.013422 0.013422 0.400000 0.437500 1.261920 372.260000 1.294340 362.937000 1.144320 226.646000
128 128 SHFL CONSECUTIVE 0.007457 0.007457 0.007457 0.444444 0.465278 0.943456 529.530000 1.437020 347.655000 0.928096 283.345000
256 256 SHFL CONSECUTIVE 0.003948 0.003948 0.003948 0.470588 0.481618 0.990752 521.960000 1.576290 328.070000 1.074750 246.979000
512 512 SHFL CONSECUTIVE 0.002034 0.002034 0.002034 0.484848 0.490530 1.180320 446.237000 1.808800 291.189000 1.290300 206.780000
1024 1024 SHFL CONSECUTIVE 0.001032 0.001032 0.001032 0.492308 0.495192 1.127840 471.440000 2.414750 220.192000 1.533250 174.513000
4096 4096 SHFL CONSECUTIVE 0.000261 0.000261 0.000261 0.498047 0.498776 2.654460 201.757000 2.420610 221.249000 1.748260 153.228000
16 16 SHFL SHFL 0.033554 0.033554 0.033554 0.250000 0.343750 5.552960 66.468800 6.301150 58.576400 6.299970 58.587400
16 16 SHFL SHFL 0.033554 0.033554 0.033554 0.250000 0.343750 5.555360 66.440100 6.294780 58.635600 6.299710 58.589800
16 16 SHFL SHFL 0.033554 0.033554 0.033554 0.250000 0.343750 5.568160 66.287400 6.288480 58.694400 6.313380 58.463000
16 16 SHFL SHFL 0.033554 0.033554 0.033554 0.250000 0.343750 5.553220 66.465800 6.325470 58.351200 6.300960 58.578200
32 32 SHFL SHFL 0.022370 0.022370 0.022370 0.333333 0.395833 3.292060 129.105000 5.389600 78.859800 4.105500 62.494700
64 64 SHFL SHFL 0.013422 0.013422 0.013422 0.400000 0.437500 1.858560 252.756000 3.066980 153.168000 3.226050 80.394200
128 128 SHFL SHFL 0.007457 0.007457 0.007457 0.444444 0.465278 0.987744 505.787000 2.813630 177.560000 2.630940 99.953400
256 256 SHFL SHFL 0.003948 0.003948 0.003948 0.470588 0.481618 0.987456 523.702000 3.006820 171.987000 2.349660 112.970000
512 512 SHFL SHFL 0.002034 0.002034 0.002034 0.484848 0.490530 1.174460 448.462000 3.121380 168.740000 2.081310 128.193000
1024 1024 SHFL SHFL 0.001032 0.001032 0.001032 0.492308 0.495192 1.128220 471.279000 3.200990 166.107000 1.975810 135.424000
4096 4096 SHFL SHFL 0.000261 0.000261 0.000261 0.498047 0.498776 2.652130 201.935000 3.174370 168.713000 1.910080 140.246000

@alliepiper alliepiper modified the milestones: 1.16.0, 1.17.0 Feb 7, 2022
@elstehle elstehle force-pushed the feature/device-batch-memcpy branch from 8f6d447 to f657812 Compare April 18, 2022 08:40
@elstehle elstehle changed the title [WIP] Adds DeviceBatchMemcpy algorithm and tests Adds DeviceBatchMemcpy algorithm and tests Apr 19, 2022
@elstehle (Collaborator, Author) commented

Sorry for the wait. I did another clean up pass over the code of this PR.

I've long wanted a cuda::memcpy that would handle runtime-determined alignment as well as take a CG parameter to use multiple threads to perform the copy. That seems like the best place to put such a building block, as it could have widespread applicability.

Agreed, @jrhemstad. I believe there's a recurring need for it, especially when dealing with string data. I often find myself needing to load string data into shared memory for further processing, so I tried to be agnostic to the destination data space. That is, the copy supports vectorised stores of 4B, 8B, and 16B, where 4B and 8B stores are friendly to the shared memory space (reducing bank conflicts), and 16B stores are presumably more efficient for global memory.

I hope I've been able to take a first step in that direction. In the interest of getting this PR through, I haven't exposed it as a stand-alone CG-/block-level algorithm yet and have hidden it under the detail namespace for now. I plan to expose it in the public API and add typed tests in a follow-up PR. This is currently the signature (as I believe we don't have CG in CUB yet(?)):

VectorizedCopy(int32_t thread_rank, int32_t group_size, void *dest, ByteOffsetT num_bytes, const void *src)
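For illustration only, a block-wide call site might then look as follows (the exact namespace of the detail-layer helper is assumed here):

// Hypothetical call site: the whole thread block cooperates on one buffer
cub::detail::VectorizedCopy(threadIdx.x, blockDim.x, dst_ptr, num_bytes, src_ptr);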

@alliepiper (Collaborator) commented

We don't expose CG in the CUB APIs; this would require some more discussion before we add anything like that. It may be better suited to the senders/receivers-based APIs that @senior-zero is working on. For now, let's try to find a way to pass the same info in without adding any dependencies.

@gevtushenko (Collaborator) left a comment

There are some issues in the example. Please check whether they are on the algorithm side or not.

@gevtushenko (Collaborator) left a comment

A few minor comments

@alliepiper alliepiper modified the milestones: 2.0.0, 2.1.0 Jul 25, 2022
@miscco (Collaborator) left a comment

partial review until the agent

Comment on lines +126 to +125
// The pointer is already aligned, so no extra bytes need to be loaded
if (offset == 0)
{
LoadVectorAndFunnelShiftR<true>(aligned_ptr, bit_shift, data_out);
}
// Otherwise, we need to load extra bytes and perform funnel-shifting
else
{
LoadVectorAndFunnelShiftR<false>(aligned_ptr, bit_shift, data_out);
}
A collaborator commented:

I wonder if there would be any advantage to dispatching to a code path where the offset is statically known. It looks like that would allow bit_shift to be statically known as well.

Suggested change
if (offset == 0)
{
LoadVectorAndFunnelShiftR<true>(aligned_ptr, bit_shift, data_out);
}
// Otherwise, we need to load extra bytes and perform funnel-shifting
else
{
LoadVectorAndFunnelShiftR<false>(aligned_ptr, bit_shift, data_out);
}
switch (offset)
{
  case 0: LoadVectorAndFunnelShiftR<0>(...); break;
  case 1: LoadVectorAndFunnelShiftR<1>(...); break;
  case 2: LoadVectorAndFunnelShiftR<2>(...); break;
  case 3: LoadVectorAndFunnelShiftR<3>(...); break;
}

@elstehle (Collaborator, Author) replied:

Interesting idea! Preliminary results suggest it does not rise above the noise, but I'll do a more thorough run and follow up.

@elstehle (Collaborator, Author) followed up:

I've run some more benchmarks on this suggestion with code paths that use immediate bit-shift values. Performance remained unchanged for DeviceMemcpy::Batched.

My hypothesis is that we're bottlenecked by the memory subsystem, as I also don't see significant performance changes from some other changes that I'd expect to positively impact performance.

@gevtushenko (Collaborator) left a comment

The work seems to be in progress, so I'll finish review for now to make comments visible.


// Ensure the prefix callback has finished using its temporary storage and that it can be reused
// in the next stage
CTA_SYNC();
A collaborator commented:

temp_storage.blev_buffer_offset is not used in PartitionBuffersBySize. Since look-back is rather expensive, do you think there's any advantage in overlapping the decoupled look-back with PartitionBuffersBySize in other warps? The BLevBuffScanPrefixCallbackOpT storage should be around 4 ints, so putting it into a struct instead of a union shouldn't increase shared memory requirements significantly, but we would get rid of one sync and overlap some operations.

gevtushenko added a commit to gevtushenko/thrust that referenced this pull request Dec 29, 2022
@gevtushenko gevtushenko added the testing: gpuCI in progress and testing: gpuCI passed labels and removed the testing: gpuCI in progress label Dec 29, 2022
@gevtushenko gevtushenko marked this pull request as ready for review December 30, 2022 09:04
@gevtushenko gevtushenko merged commit 423f54e into NVIDIA:main Dec 30, 2022