Adds DeviceBatchMemcpy algorithm and tests #359

elstehle · 2021-08-18T15:56:17Z

Algorithm Overview

The DeviceBatchMemcpy takes N input buffers and N output buffers and copies buffer_size[i] bytes from the i-th input buffer to the i-th output buffer. If any input buffer aliases memory from any output buffer the behavior is undefined. If any output buffer aliases memory of another output buffer the behavior is undefined. Input buffers can alias one another.

Implementation Details

We distinguish each buffer by its size and assign it to one of three size classes:

Thread-level buffer (TLEV buffer). A buffer that is processed by one or more threads but not a whole warp (e.g., up to 32 bytes).
Warp-level buffer (WLEV buffer). A buffer that is processed by a whole warp (e.g., above 32 bytes but only up to 1024 bytes).
Block-level buffer (BLEV buffer). A buffer that is processed by one or more thread blocks. The number of thread blocks assigned to such a buffer is proportional to its size (e.g., all buffers above 1024 bytes).

Step 1: Partitioning Buffers by Size

Each thread block loads a tile of buffer_size[i].
Threads compute a three-bin histogram over their assigned buffer_size[ITEMS_PER_THREAD] chunk. Binning buffers by the size class they fall into
An exclusive prefix sum is computed over the histograms. The prefix sum's aggregate reflects the number of buffers that fall into each size class. The prefix sum of each thread corresponds to the relative offset within each partition.
Scatter the buffers into their partition. For each buffer, we scatter the tuple: {tile_buffer_id, buffer_size}, where tile_buffer_id is the buffer id, relative to the tile (i.e., from the interval [0, TILE_SIZE)). buffer_size is only defined for buffers that belong to the tlev partition and corresponds to the buffer's size (number of bytes) in that case.

tile_buffer_id	0	1	2	3	4	5	6	7
tile_buffer_sizes	3	37	17	4	9	4242	11	2000

	T	T	T	T	T	W	B	B
tile_buffer_id	0	2	3	4	6	1	5	7
tile_buffer_size	3	17	4	9	11	-	-	-

Note, the partitioning does not necessarily need to be stable. It may be desired if we expect neighbouring buffers to hold neighbouring byte segments.

After the partitioning, each partition represents all the buffers that belong to the respective size class (i.e., one of TLEV, WLEV, BLEV). Depending on the size class, a different logic is applied. We process each partition separately.

Step 2.a: Copying TLEV Buffers

Usually, TLEV buffers are buffers of only a few bytes. Vectorised loads and stores do not really pay off here, as there's only few bytes that can actually be read from a four byte-aligned address. It does not pay off to have the two different code paths for (a) loading individual bytes from non-aligned adrresses and (b) doing vectorised loads from aligned addresses.

Instead, we use the BlockRunLengthDecode algorithm to both (a) coalesce reads and writes as well as (b) load balance the number of bytes copied by each thred. Specifically, we are able to assign neighbouring bytes to neighbouring threads.

The following tables illustrates how the first 8 bytes from the TLEV buffers are getting assigned to threads.

	T	T	T	T	T

tile_buffer_id	0	2	3	4	6
tile_buffer_size	3	17	4	9	11	-	-	-
[1] run_length_decode

	t0	t1	t2	t3	t4
buffer_id	0	0	0	2	2	2	2	2
byte_of_buffer	0	1	2	0	1	2	3	4

[1] Use BlockRunLengthDecode using the tile_buffer_id as the "unique_items" and each buffer's size as the respective run's length. The result from the run-length decode yields the assignment from threads to the buffer along with the specific byte from that buffer.

Step 2.b: Copying WLEV Buffers

A full warp is assigned to each WLEV buffer. Loads from the input buffer are vectorised (aliased to a wider data type), loading 4, 8 or even 16 bytes at a time from the input buffer's first address that is aligned to such aliased data type. The implementation for the vectorised copy is based on @gaohao95's (thanks!) string gather improvement in https://github.com/rapidsai/cudf/pull/7980/files

I think we want to have the vectorised copy as a reusable component. But I wanted to coordinate on what exactly that would look like first. Should this be (a) a warp-/block-level copy or should we (b) separate it into a warp-&block-level vectorised load (which will also have the async copy, maybe) and a warp-&block-level vectorised store?

Step 2.c: Enqueueing BLEV Buffers

These are buffers that may be very large. We want to avoid a scenario where there's potentially one very large buffer that a single thread block is copying while other thread blocks are sitting idle. To avoid this, BLEV buffers will be put into a queue that will be picked up in a subsequent kernel. In the subsequent kernel, the number of thred blocks getting assigned to each buffer is proportional to the buffer's size.

gevtushenko

Thank you! You've done tremendous work here. It's also a vital algorithm to have. I was a bit concerned with the complexity of implementation, though. I've decided to benchmark this algorithm and found a strange performance drop. For buffers of 256 std::uint32_t items the performance seems quite impressive.

CUB here denotes your implementation, and the memcpy represents the bandwidth of cudaMemcpyAsync applied to the sum of all buffer sizes.

But when I've changed the underlying type to std::uint64_t (that is, increased the buffer size twice), I've observed the following.

The code produces the correct result, so I'm not sure what's the reason. At this point, I've decided to check a different approach. I applied a three-way partition which produced reorderings for small/medium/large segments. Then I used existing facilities to copy data.

To handle large buffers by multiple thread blocks, I used atomic operations. I increment a value assigned to a large buffer to get the tile position in that buffer. Here are the results where I vary the number of buffers with a fixed size of 64 MB.

The simple approach seems to perform a bit better in this particular case. I also checked the same test for 1GB segments, and it's still better there.

Here's the bandwidth for copying extremely small buffers - 2 items of std::uint32_t type:

Another interesting question is the temporary storage size. You currently require four num_buffers. I've managed to use only two num_buffers.

To conclude the data above, the simple implementation:

uses two times less memory
is faster in some cases
almost fits into the screen 😄
uses existing facilities.

It's just a proof of concept. Therefore I haven't considered unaliased buffers or sizes that are not a multiple of 32 bits. I don't expect this to change the results significantly, though. Anyway, we might consider requiring this if this is the source of performance issues. Padding arrays is quite common practice.

Could you consider the simple algorithm and check how it can be helpful in your case? Even if the algorithm I've mentioned happens to be slower, I hope you'll be able to incorporate some of the ideas as building blocks of the proposed PR. For example, it could be used to deal with small numbers of buffers. Here is the CUB branch I've used for testing, and here is the benchmark and partition-based implementation.

I am looking forward to checking your results on the second stage of review!

cub/device/dispatch/dispatch_batch_memcpy.cuh

test/test_block_run_length_decode.cu

cub/agent/agent_batch_memcpy.cuh

cub/device/dispatch/dispatch_batch_memcpy.cuh

elstehle · 2021-08-25T14:37:08Z

Thanks for the feedback and the preliminary evaluation, @senior-zero 👍

Fundamentally, our ideas are quite similar. You do a three-way partition on all the problems. I proposed to have a kernel-fused version of the "three-way partitioning" that is fused with the implementation for copying small and medium buffers. The goal being that we can solve small and medium buffers straight in the kernel instead of having to write them into a "queue" first and later read them back in. I wanted to circumvent the extra reads of the problems' sizes and writing their id out, as well as another extra read of the partitioned id. This definitely makes the implementation more complex and, I totally agree, I'm not sure if that complexity is worth it.

When I had conceived this, I assumed the "worst case" scenario. In theory, let's assume these type sizes: buffer_src, buffer_dst, buffer_size is each 4 bytes. The average buffer size is 4 bytes. For N buffers the fused version incurs (4 + 4 + 4 + 2 * 4) * N memory transfers. If we did a preliminary three-way partition upfront, it would be (4 + 4) * N + (4 + 4 + 4 + 4 + 2 * 4) * N. So 20 bytes per buffer versus 32 bytes per buffer, basically an extra read of buffer_size, a write of buffer_id, and another read of buffer_id.

Now, we also see that, unfortunately, we cannot sustain anywhere near peak memory bandwidth for such tiny buffers. So question is whether we want to take the theoretical model into consideration at all.

I see the three decisions we need to make:
(1) I think whether to kernel-fuse or to not kernel-fuse is the key decision we have to make. We'll probably need an apples-to-apples comparison that has identical implementations for the "small" buffer logic to see the performance difference. I'll try to evaluate this in the coming days. Then we can make an educated decision about code complexity versus performance.

(2) What I also like is using atomics for the scheduling/load-balancing of large buffers. The performance drop you see going from 1KB to 2KB buffers is a combination of a configuration discrepancy (my bad) and general performance regression when the tile size (or "task" size, i.e., the most granular unit getting assigned to thread blocks) is too small. The binary search seems to dominate in that case. I also want to see if streaming reads and writes will alleviate this. So we'll also need to compare these two mechanisms and factor out other side effects too.

(3) What is left, is the actual implementation of how we're copying small buffers, medium buffers, and large buffers, respectively. I think this it is easy to exchange one for the other. Once we figured out the former two decisions, this will be easy.

So I would proceed in that order. Does that sound good?

As for:

uses two times less memory

That can easily be done for the kernel-fused version too, right? It's just a matter of trading memory for more coalesced accesses. I.e., I'm materialising the buffer's source and destination pointers for large buffers instead of having the indirection. I'm also fine to have indirection in this particular case.

is faster in some cases

I'm all in for fast 😁 We just need to have a more differentiated and elaborate evaluation to track down where the difference actually comes from.

almost fits into the screen 😄

💯

uses existing facilities.

I'm all in for using existing building blocks. The problem is that I didn't assume the pointers to be aligned and so had to devise special treatment to be able to vectorise some loads/stores. If we can get the performance from existing building blocks, let's go for that. Otherwise let's make it a reusable building block.

jrhemstad · 2021-08-25T17:43:33Z

I'm all in for using existing building blocks. The problem is that I didn't assume the pointers to be aligned and so had to devise special treatment to be able to vectorise some loads/stores. If we can get the performance from existing building blocks, let's go for that. Otherwise let's make it a reusable building block.

I've long wanted a cuda::memcpy that would handle runtime determined alignment as well as take a CG parameter to use multiple threads to perform the copy. That seems like the best place to put such a building block as it could have widespread applicability.

cub/device/dispatch/dispatch_batch_memcpy.cuh

elstehle · 2021-10-13T13:18:59Z

I'm currently gathering results of a few more benchmarks that hopefully will help us make an informed decision about which of the scheduling mechanisms to pursue (preliminary three-way partition vs. single-pass prefix scan-based). I'll post the results shortly.

In the meanwhile, PR #354, on which this PR builds, should be ready for review.

alliepiper · 2021-10-14T20:26:50Z

FYI, I'm starting the 1.15 RC next week so I'm bumping this to 1.16. I'll try to get to NVIDIA/cccl#1006 before the release.

elstehle · 2021-10-18T19:56:24Z

So I ran the first batch of benchmarks. I'll add more throughout the week.

Methodology

Benchmarks ran on V100
We allocate two large buffers on device memory: one for the input, one for the output
We generate an array of buffer_sizes. Buffer sizes are uniform random in the interval [<Min. buffer size>, <max. buffer size>]
We generate an offsets array for the input buffer batch, which will alias into the input memory allocation, and an offsets array for the output buffer batch, which will alias into the output memory allocation.
These offsets can be generated one of two ways (depending on the experiment):
- CONSECUTIVE (C): offset[0] = 0; offset[i] = offset[i-1] + buffer_sizes[i-1];
- SHUFFLE (S): the offsets are "somewhat" similar to CONSECUTIVE but then the offsets are being shuffled. That makes sure that bytes of buffer[i] are at a different location than buffer[j] for i!=j.
Further, offsets and sizes would be made to comply with a configurable AtomicT. That is, offsets will be aligned to integer multipes of AtomicT and buffer_sizes will as well be integer multiples of AtomicT.
The charts label the achieve memory throughput on the y-axis (i.e., all the required memory transfers, such as reading buffer sizes, reading buffer offsets, reading the bytes to be copied, and writing the bytes to be copied, divided by the total run time)
The charts label the input on the x-axis: <INPUT-OFFSET-GEN>_<OUTPUT-OFFSET-GEN>_<Min. buffer size>_<max. buffer size>
- For instance, the label C_S_1_8 means:
- C: the input will be consecutive buffers in
- S: the output buffers are shuffled ("random" writes)
- 1: the minimum buffer size is 1
- 8: the maximum buffer size is 8
Generally, we compared different aspects of the three-way partition implementation (TWP) (see here) versus the single-pass prefix scan-based implementation (SPPS) (this PR, originally).
This the branch that benchmarks are run on:

Compilation example / details

nvcc -DTWP_TLEV_CPY=0 -DLD_VECTORIZED=0 -DATOMIC_CPY_TYPE=uint8_t -DTLEV_ALIAS_TYPE=uint8_t -DWLEV_MIN_SIZE=17000 -DBLEV_MIN_SIZE=17000 -Xptxas -v -lineinfo --generate-code arch=compute_70,code=sm_70 -DTHRUST_IGNORE_CUB_VERSION_CHECK -I<your-thrust-path> -I<your-cub-path> test_device_batch_memcpy.cu -o test_memcpy && ./test_memcpy

TWP_TLEV_CPY whether to use TWP's small buffer copying logic inside of SPPS
LD_VECTORIZED whether to enable CUB vectorized loads inside TWP's copy logic
ATOMIC_CPY_TYPE buffers will be aligned and their size an integer multiple of this type
TLEV_ALIAS_TYPE the data type being copied. This may not exceed ATOMIC_CPY_TYPE

Copying of small buffers logic

For these tests the thresholds for medium (aka "WLEV" buffers) and large (aka "BLEV") buffers was set so high that all buffers would be copied by the copy small buffer or copy TLEV buffer logic, respectively.
The benchmarks serve two purposes:
- identify which implementation to choose for the copy small buffer logic.
- get an initial idea of the "scheduling overhead" (the "scheduling" being the logic that partitions the buffers into "small", "medium", and "large" buffers.
Both implementations were adapted and made configurable to perform aliased loads of TLEV_ALIAS_TYPE. Various TLEV_ALIAS_TYPE were tested.
The "copy small buffer logic" from TWP was ported into SPPS. This allowed to factor out performance differences due to scheduling differences. Similarly, it allowed to compare the scheduling overhead.

No Aliased Loads, No Buffer Size Variance

using AtomicT=uint8_t; using TLEV_ALIAS_TYPE=uint8_t;

Data

Min. buffer size:	max. buffer size:	in_gen:	out_gen:	src size:	dst size:	sizes size:	data_size:	total:	duration (SPPS):	BW (SPPS):	duration (TWP):	BW (TWP):	relative performance
2	2	CONSECUTIVE	CONSECUTIVE	0.357914	0.357914	0.357914	0.333333	1.333333	4.621500	309.781000	9.954430	143.821000	46.43%
4	4	CONSECUTIVE	CONSECUTIVE	0.214748	0.214748	0.214748	0.400000	1.000000	3.467490	309.660000	6.699330	160.276000	51.76%
8	8	CONSECUTIVE	CONSECUTIVE	0.119305	0.119305	0.119305	0.444444	0.777778	2.687550	310.741000	3.803490	219.570000	70.66%
16	16	CONSECUTIVE	CONSECUTIVE	0.063161	0.063161	0.063161	0.470588	0.647059	2.201060	315.655000	2.124030	327.102000	103.63%
32	32	CONSECUTIVE	CONSECUTIVE	0.032538	0.032538	0.032538	0.484848	0.575758	1.669120	370.384000	1.985600	311.349000	84.06%
64	64	CONSECUTIVE	CONSECUTIVE	0.016519	0.016519	0.016519	0.492308	0.538462	1.382300	418.264000	2.191580	263.813000	63.07%
128	128	CONSECUTIVE	CONSECUTIVE	0.008324	0.008324	0.008324	0.496124	0.519380	1.289440	432.498000	3.027520	184.204000	42.59%
256	256	CONSECUTIVE	CONSECUTIVE	0.004178	0.004178	0.004178	0.498054	0.509727	1.540160	355.363000	3.386910	161.597000	45.47%
512	512	CONSECUTIVE	CONSECUTIVE	0.002093	0.002093	0.002093	0.499024	0.504872	1.684000	321.914000	3.484130	155.592000	48.33%
1024	1024	CONSECUTIVE	CONSECUTIVE	0.001048	0.001048	0.001048	0.499512	0.502439	1.548160	348.471000	5.698080	94.679100	27.17%
4096	4096	CONSECUTIVE	CONSECUTIVE	0.000262	0.000262	0.000262	0.499878	0.500610	3.697090	145.392000	5.926180	90.703700	62.39%
2	2	CONSECUTIVE	SHFL	0.357914	0.357914	0.357914	0.333333	1.333333	49.238400	29.076000	56.410400	25.379300	87.29%
4	4	CONSECUTIVE	SHFL	0.214748	0.214748	0.214748	0.400000	1.000000	29.612200	36.260200	34.172600	31.421100	86.65%
8	8	CONSECUTIVE	SHFL	0.119305	0.119305	0.119305	0.444444	0.777778	17.402500	47.989100	20.556500	40.626100	84.66%
16	16	CONSECUTIVE	SHFL	0.063161	0.063161	0.063161	0.470588	0.647059	9.428540	73.688400	10.772400	64.495900	87.53%
32	32	CONSECUTIVE	SHFL	0.032538	0.032538	0.032538	0.484848	0.575758	3.217150	192.162000	8.831460	70.001500	36.43%
64	64	CONSECUTIVE	SHFL	0.016519	0.016519	0.016519	0.492308	0.538462	1.791100	322.800000	8.575740	67.419100	20.89%
128	128	CONSECUTIVE	SHFL	0.008324	0.008324	0.008324	0.496124	0.519380	1.414910	394.145000	8.726940	63.903200	16.21%
256	256	CONSECUTIVE	SHFL	0.004178	0.004178	0.004178	0.498054	0.509727	1.517700	360.623000	8.655330	63.234500	17.53%
512	512	CONSECUTIVE	SHFL	0.002093	0.002093	0.002093	0.499024	0.504872	1.620580	334.512000	8.639520	62.746800	18.76%
1024	1024	CONSECUTIVE	SHFL	0.001048	0.001048	0.001048	0.499512	0.502439	1.554850	346.972000	8.712640	61.920300	17.85%
4096	4096	CONSECUTIVE	SHFL	0.000262	0.000262	0.000262	0.499878	0.500610	3.682690	145.960000	8.768610	61.301200	42.00%
2	2	SHFL	CONSECUTIVE	0.357914	0.357914	0.357914	0.333333	1.333333	18.644400	76.787500	26.246000	54.547600	71.04%
4	4	SHFL	CONSECUTIVE	0.214748	0.214748	0.214748	0.400000	1.000000	12.255600	87.612000	16.030800	66.979900	76.45%
8	8	SHFL	CONSECUTIVE	0.119305	0.119305	0.119305	0.444444	0.777778	7.418370	112.576000	9.342270	89.392900	79.41%
16	16	SHFL	CONSECUTIVE	0.063161	0.063161	0.063161	0.470588	0.647059	4.461890	155.713000	5.186690	133.953000	86.03%
32	32	SHFL	CONSECUTIVE	0.032538	0.032538	0.032538	0.484848	0.575758	2.757440	224.199000	3.560380	173.637000	77.45%
64	64	SHFL	CONSECUTIVE	0.016519	0.016519	0.016519	0.492308	0.538462	1.771780	326.322000	3.488540	165.734000	50.79%
128	128	SHFL	CONSECUTIVE	0.008324	0.008324	0.008324	0.496124	0.519380	1.511550	368.945000	4.146560	134.492000	36.45%
256	256	SHFL	CONSECUTIVE	0.004178	0.004178	0.004178	0.498054	0.509727	1.510910	362.242000	4.460800	122.694000	33.87%
512	512	SHFL	CONSECUTIVE	0.002093	0.002093	0.002093	0.499024	0.504872	1.647780	328.990000	4.812100	112.654000	34.24%
1024	1024	SHFL	CONSECUTIVE	0.001048	0.001048	0.001048	0.499512	0.502439	1.550660	347.910000	6.096960	88.485000	25.43%
4096	4096	SHFL	CONSECUTIVE	0.000262	0.000262	0.000262	0.499878	0.500610	3.689700	145.683000	5.451840	98.595400	67.68%
2	2	SHFL	SHFL	0.357914	0.357914	0.357914	0.333333	1.333333	55.553100	25.770900	62.612900	22.865200	88.72%
4	4	SHFL	SHFL	0.214748	0.214748	0.214748	0.400000	1.000000	33.896400	31.677200	41.284300	26.008500	82.10%
8	8	SHFL	SHFL	0.119305	0.119305	0.119305	0.444444	0.777778	19.720600	42.348100	22.110000	37.771800	89.19%
16	16	SHFL	SHFL	0.063161	0.063161	0.063161	0.470588	0.647059	10.819900	64.212600	12.307200	56.452800	87.92%
32	32	SHFL	SHFL	0.032538	0.032538	0.032538	0.484848	0.575758	4.889980	126.425000	11.500200	53.757100	42.52%
64	64	SHFL	SHFL	0.016519	0.016519	0.016519	0.492308	0.538462	2.364290	244.542000	11.406400	50.688100	20.73%
128	128	SHFL	SHFL	0.008324	0.008324	0.008324	0.496124	0.519380	1.471460	378.999000	11.314400	49.289500	13.01%
256	256	SHFL	SHFL	0.004178	0.004178	0.004178	0.498054	0.509727	1.523420	359.267000	11.319400	48.351900	13.46%
512	512	SHFL	SHFL	0.002093	0.002093	0.002093	0.499024	0.504872	1.624220	333.761000	11.340800	47.801000	14.32%
1024	1024	SHFL	SHFL	0.001048	0.001048	0.001048	0.499512	0.502439	1.547520	348.615000	11.365800	47.465900	13.62%
4096	4096	SHFL	SHFL	0.000262	0.000262	0.000262	0.499878	0.500610	3.674820	146.273000	11.286400	47.625900	32.56%

Scheduling: TWP vs. SPPS; No Aliased Loads, No Buffer Size Variance

using AtomicT=uint8_t; using TLEV_ALIAS_TYPE=uint8_t;

Here, the small buffer copying logic from TWP was moved into SPPS. Hence, we aim to limit the difference to be the scheduling (i.e., the partitioning into small, medium, and large buffers).

Data

Min. buffer size:	max. buffer size:	in_gen:	out_gen:	src size:	dst size:	sizes size:	data_size:	total:	duration (SPPS):	BW (SPPS):	duration (TWP):	BW (TWP):	relative performance
2	2	CONSECUTIVE	CONSECUTIVE	0.357914	0.357914	0.357914	0.333333	1.333333	5.986530	239.146000	10.124700	141.403000	59.13%
4	4	CONSECUTIVE	CONSECUTIVE	0.214748	0.214748	0.214748	0.400000	1.000000	4.014370	267.475000	6.655810	161.324000	60.31%
8	8	CONSECUTIVE	CONSECUTIVE	0.119305	0.119305	0.119305	0.444444	0.777778	2.444420	341.649000	3.843840	217.265000	63.59%
16	16	CONSECUTIVE	CONSECUTIVE	0.063161	0.063161	0.063161	0.470588	0.647059	1.522910	456.214000	2.134660	325.474000	71.34%
32	32	CONSECUTIVE	CONSECUTIVE	0.032538	0.032538	0.032538	0.484848	0.575758	1.891680	326.807000	2.098430	294.608000	90.15%
64	64	CONSECUTIVE	CONSECUTIVE	0.016519	0.016519	0.016519	0.492308	0.538462	2.169700	266.474000	2.306210	250.701000	94.08%
128	128	CONSECUTIVE	CONSECUTIVE	0.008324	0.008324	0.008324	0.496124	0.519380	3.181600	175.283000	3.193500	174.629000	99.63%
256	256	CONSECUTIVE	CONSECUTIVE	0.004178	0.004178	0.004178	0.498054	0.509727	3.356580	163.058000	3.381950	161.834000	99.25%
512	512	CONSECUTIVE	CONSECUTIVE	0.002093	0.002093	0.002093	0.499024	0.504872	4.032830	134.422000	3.437120	157.720000	117.33%
1024	1024	CONSECUTIVE	CONSECUTIVE	0.001048	0.001048	0.001048	0.499512	0.502439	4.960060	108.767000	5.690910	94.798400	87.16%
4096	4096	CONSECUTIVE	CONSECUTIVE	0.000262	0.000262	0.000262	0.499878	0.500610	4.012350	133.968000	5.090080	105.603000	78.83%
2	2	CONSECUTIVE	SHFL	0.357914	0.357914	0.357914	0.333333	1.333333	50.586300	28.301300	56.376300	25.394600	89.73%
4	4	CONSECUTIVE	SHFL	0.214748	0.214748	0.214748	0.400000	1.000000	31.578500	34.002300	34.485900	31.135700	91.57%
8	8	CONSECUTIVE	SHFL	0.119305	0.119305	0.119305	0.444444	0.777778	18.002200	46.390600	20.469100	40.799700	87.95%
16	16	CONSECUTIVE	SHFL	0.063161	0.063161	0.063161	0.470588	0.647059	9.599460	72.376400	10.282200	67.570300	93.36%
32	32	CONSECUTIVE	SHFL	0.032538	0.032538	0.032538	0.484848	0.575758	7.941060	77.850500	8.469500	72.993000	93.76%
64	64	CONSECUTIVE	SHFL	0.016519	0.016519	0.016519	0.492308	0.538462	8.077950	71.573700	7.940700	72.810800	101.73%
128	128	CONSECUTIVE	SHFL	0.008324	0.008324	0.008324	0.496124	0.519380	8.560700	65.144200	8.692510	64.156400	98.48%
256	256	CONSECUTIVE	SHFL	0.004178	0.004178	0.004178	0.498054	0.509727	8.261570	66.248400	8.651580	63.261900	95.49%
512	512	CONSECUTIVE	SHFL	0.002093	0.002093	0.002093	0.499024	0.504872	8.394620	64.577300	8.646400	62.696900	97.09%
1024	1024	CONSECUTIVE	SHFL	0.001048	0.001048	0.001048	0.499512	0.502439	8.386400	64.329100	8.718530	61.878500	96.19%
4096	4096	CONSECUTIVE	SHFL	0.000262	0.000262	0.000262	0.499878	0.500610	7.356190	73.071200	8.759580	61.364300	83.98%
2	2	SHFL	CONSECUTIVE	0.357914	0.357914	0.357914	0.333333	1.333333	20.325500	70.436500	26.223800	54.593700	77.51%
4	4	SHFL	CONSECUTIVE	0.214748	0.214748	0.214748	0.400000	1.000000	13.297300	80.749000	15.982300	67.183200	83.20%
8	8	SHFL	CONSECUTIVE	0.119305	0.119305	0.119305	0.444444	0.777778	7.824000	106.740000	9.367580	89.151300	83.52%
16	16	SHFL	CONSECUTIVE	0.063161	0.063161	0.063161	0.470588	0.647059	4.461790	155.716000	5.174910	134.258000	86.22%
32	32	SHFL	CONSECUTIVE	0.032538	0.032538	0.032538	0.484848	0.575758	3.184740	194.118000	3.553890	173.955000	89.61%
64	64	SHFL	CONSECUTIVE	0.016519	0.016519	0.016519	0.492308	0.538462	3.536380	163.491000	3.474500	166.404000	101.78%
128	128	SHFL	CONSECUTIVE	0.008324	0.008324	0.008324	0.496124	0.519380	4.170270	133.727000	4.114560	135.538000	101.35%
256	256	SHFL	CONSECUTIVE	0.004178	0.004178	0.004178	0.498054	0.509727	5.023330	108.955000	4.459870	122.720000	112.63%
512	512	SHFL	CONSECUTIVE	0.002093	0.002093	0.002093	0.499024	0.504872	6.173220	87.815300	4.829820	112.241000	127.81%
1024	1024	SHFL	CONSECUTIVE	0.001048	0.001048	0.001048	0.499512	0.502439	5.178560	104.177000	6.105310	88.363900	84.82%
4096	4096	SHFL	CONSECUTIVE	0.000262	0.000262	0.000262	0.499878	0.500610	4.300130	125.002000	5.436320	98.876800	79.10%
2	2	SHFL	SHFL	0.357914	0.357914	0.357914	0.333333	1.333333	55.162900	25.953200	61.979200	23.099000	89.00%
4	4	SHFL	SHFL	0.214748	0.214748	0.214748	0.400000	1.000000	36.166200	29.689100	41.238000	26.037700	87.70%
8	8	SHFL	SHFL	0.119305	0.119305	0.119305	0.444444	0.777778	20.368300	41.001600	23.170100	36.043500	87.91%
16	16	SHFL	SHFL	0.063161	0.063161	0.063161	0.470588	0.647059	11.429500	60.787900	12.343000	56.289000	92.60%
32	32	SHFL	SHFL	0.032538	0.032538	0.032538	0.484848	0.575758	11.688100	52.892800	11.503300	53.742600	101.61%
64	64	SHFL	SHFL	0.016519	0.016519	0.016519	0.492308	0.538462	10.376100	55.721000	11.411700	50.664600	90.93%
128	128	SHFL	SHFL	0.008324	0.008324	0.008324	0.496124	0.519380	10.765200	51.803900	11.311900	49.300400	95.17%
256	256	SHFL	SHFL	0.004178	0.004178	0.004178	0.498054	0.509727	10.185700	53.733700	11.342400	48.253900	89.80%
512	512	SHFL	SHFL	0.002093	0.002093	0.002093	0.499024	0.504872	10.719000	50.574100	11.362600	47.709300	94.34%
1024	1024	SHFL	SHFL	0.001048	0.001048	0.001048	0.499512	0.502439	10.132100	53.245700	11.375600	47.425100	89.07%
4096	4096	SHFL	SHFL	0.000262	0.000262	0.000262	0.499878	0.500610	8.641660	62.201700	11.305000	47.547500	76.44%

No Aliased Loads, Varying Buffer Size

using AtomicT=uint8_t; using TLEV_ALIAS_TYPE=uint8_t;

We now look at varying buffer sizes, where buffer sizes are uniformly distributed in [<Min. buffer size>, <max. buffer size>]. This highlights how resilient a method is to load imbalance.

Data

Min. buffer size:	max. buffer size:	in_gen:	out_gen:	src size:	dst size:	sizes size:	data_size:	total:	duration (SPPS):	BW (SPPS):	duration (TWP):	BW (TWP):	relative performance
1	2	CONSECUTIVE	CONSECUTIVE	0.357914	0.357914	0.357914	0.249993	1.249993	4.596700	291.985000	8.607710	155.926000	53.40%
1	4	CONSECUTIVE	CONSECUTIVE	0.214748	0.214748	0.214748	0.249994	0.849994	3.141120	290.557000	5.651420	161.495000	55.58%
1	8	CONSECUTIVE	CONSECUTIVE	0.119305	0.119305	0.119305	0.250006	0.583340	2.129500	294.132000	3.647010	171.745000	58.39%
1	16	CONSECUTIVE	CONSECUTIVE	0.063161	0.063161	0.063161	0.249998	0.426469	1.453890	314.961000	2.061310	222.149000	70.53%
1	32	CONSECUTIVE	CONSECUTIVE	0.032538	0.032538	0.032538	0.250025	0.340934	1.083390	337.898000	1.477500	247.766000	73.33%
1	64	CONSECUTIVE	CONSECUTIVE	0.016519	0.016519	0.016519	0.249996	0.296150	0.906016	350.974000	1.486850	213.867000	60.94%
1	128	CONSECUTIVE	CONSECUTIVE	0.008324	0.008324	0.008324	0.249913	0.273169	0.862752	339.974000	1.688900	173.671000	51.08%
1	256	CONSECUTIVE	CONSECUTIVE	0.004178	0.004178	0.004178	0.249916	0.261589	0.882688	318.209000	1.858720	151.115000	47.49%
1	512	CONSECUTIVE	CONSECUTIVE	0.002093	0.002093	0.002093	0.249880	0.255728	1.059650	259.130000	2.133860	128.681000	49.66%
1	1024	CONSECUTIVE	CONSECUTIVE	0.001048	0.001048	0.001048	0.249926	0.252852	0.846240	320.829000	2.534530	107.120000	33.39%
1	4096	CONSECUTIVE	CONSECUTIVE	0.000262	0.000262	0.000262	0.249683	0.250416	1.787710	150.405000	3.003650	89.518400	59.52%
1	2	CONSECUTIVE	SHFL	0.357914	0.357914	0.357914	0.249993	1.249993	48.814200	27.495400	56.818300	23.622100	85.91%
1	4	CONSECUTIVE	SHFL	0.214748	0.214748	0.214748	0.249994	0.849994	28.978200	31.495200	32.700900	27.909800	88.62%
1	8	CONSECUTIVE	SHFL	0.119305	0.119305	0.119305	0.250006	0.583340	16.414600	38.158600	19.218200	32.591700	85.41%
1	16	CONSECUTIVE	SHFL	0.063161	0.063161	0.063161	0.249998	0.426469	8.966460	51.070000	9.698660	47.214500	92.45%
1	32	CONSECUTIVE	SHFL	0.032538	0.032538	0.032538	0.250025	0.340934	4.959360	73.815100	6.123740	59.779700	80.99%
1	64	CONSECUTIVE	SHFL	0.016519	0.016519	0.016519	0.249996	0.296150	3.322850	95.697500	5.237570	60.713000	63.44%
1	128	CONSECUTIVE	SHFL	0.008324	0.008324	0.008324	0.249913	0.273169	2.408350	121.790000	4.298780	68.231600	56.02%
1	256	CONSECUTIVE	SHFL	0.004178	0.004178	0.004178	0.249916	0.261589	1.752480	160.275000	3.876450	72.458000	45.21%
1	512	CONSECUTIVE	SHFL	0.002093	0.002093	0.002093	0.249880	0.255728	1.450690	189.280000	3.607840	76.108200	40.21%
1	1024	CONSECUTIVE	SHFL	0.001048	0.001048	0.001048	0.249926	0.252852	0.955200	284.232000	3.463330	78.392300	27.58%
1	4096	CONSECUTIVE	SHFL	0.000262	0.000262	0.000262	0.249683	0.250416	1.826110	147.243000	3.522820	76.325800	51.84%
1	2	SHFL	CONSECUTIVE	0.357914	0.357914	0.357914	0.249993	1.249993	19.359000	69.330400	26.145700	51.334200	74.04%
1	4	SHFL	CONSECUTIVE	0.214748	0.214748	0.214748	0.249994	0.849994	12.373200	73.762500	16.047400	56.873800	77.10%
1	8	SHFL	CONSECUTIVE	0.119305	0.119305	0.119305	0.250006	0.583340	7.264740	86.218700	9.101380	68.819900	79.82%
1	16	SHFL	CONSECUTIVE	0.063161	0.063161	0.063161	0.249998	0.426469	4.251100	107.717000	4.975070	92.042400	85.45%
1	32	SHFL	CONSECUTIVE	0.032538	0.032538	0.032538	0.250025	0.340934	2.534140	144.457000	3.146750	116.334000	80.53%
1	64	SHFL	CONSECUTIVE	0.016519	0.016519	0.016519	0.249996	0.296150	1.587200	200.345000	2.541380	125.124000	62.45%
1	128	SHFL	CONSECUTIVE	0.008324	0.008324	0.008324	0.249913	0.273169	1.161250	252.584000	2.278080	128.754000	50.97%
1	256	SHFL	CONSECUTIVE	0.004178	0.004178	0.004178	0.249916	0.261589	1.019070	275.623000	2.418820	116.123000	42.13%
1	512	SHFL	CONSECUTIVE	0.002093	0.002093	0.002093	0.249880	0.255728	1.100930	249.413000	2.571230	106.792000	42.82%
1	1024	SHFL	CONSECUTIVE	0.001048	0.001048	0.001048	0.249926	0.252852	0.944928	287.322000	2.851710	95.205300	33.14%
1	4096	SHFL	CONSECUTIVE	0.000262	0.000262	0.000262	0.249683	0.250416	2.053470	130.940000	3.184030	84.446900	64.49%
1	2	SHFL	SHFL	0.357914	0.357914	0.357914	0.249993	1.249993	55.349800	24.248800	61.491900	21.826800	90.01%
1	4	SHFL	SHFL	0.214748	0.214748	0.214748	0.249994	0.849994	33.883500	26.935600	37.047000	24.635500	91.46%
1	8	SHFL	SHFL	0.119305	0.119305	0.119305	0.250006	0.583340	19.901200	31.473300	21.161800	29.598400	94.04%
1	16	SHFL	SHFL	0.063161	0.063161	0.063161	0.249998	0.426469	10.758600	42.563100	11.525800	39.729700	93.34%
1	32	SHFL	SHFL	0.032538	0.032538	0.032538	0.250025	0.340934	5.817920	62.922100	7.820100	46.812200	74.40%
1	64	SHFL	SHFL	0.016519	0.016519	0.016519	0.249996	0.296150	3.873250	82.098600	6.578460	48.337800	58.88%
1	128	SHFL	SHFL	0.008324	0.008324	0.008324	0.249913	0.273169	2.744930	106.856000	5.650620	51.908100	48.58%
1	256	SHFL	SHFL	0.004178	0.004178	0.004178	0.249916	0.261589	1.927710	145.706000	5.295460	53.041600	36.40%
1	512	SHFL	SHFL	0.002093	0.002093	0.002093	0.249880	0.255728	1.533060	179.110000	5.134270	53.481100	29.86%
1	1024	SHFL	SHFL	0.001048	0.001048	0.001048	0.249926	0.252852	1.016000	267.223000	5.050750	53.754000	20.12%
1	4096	SHFL	SHFL	0.000262	0.000262	0.000262	0.249683	0.250416	2.069410	129.932000	5.016030	53.604500	41.26%

16B-aligned buffers, 4B-aliased copies, Varying Buffer Size

using AtomicT=uint4; using TLEV_ALIAS_TYPE=uint32_t;

This experiment analyses the benefit that aliasing loads (i.e., reinterpret_cast<TLEV_ALIAS_TYPE*>(..)) could have. Note that the buffer size of was bumped to be at least 16B here. We also tested what impact vectorised loads would have.

Data

Min. buffer size:	max. buffer size:	in_gen:	out_gen:	src size:	dst size:	sizes size:	data_size:	total:	duration (SPPS):	BW (SPPS):	duration (TWP):	BW (TWP):	duration (TWP) uint4 VECT:	BW (TWP) uint4 VECT:
16	16	CONSECUTIVE	CONSECUTIVE	0.033554	0.033554	0.033554	0.250000	0.343750	0.787264	468.837000	1.163460	317.243000	1.162980	317.374000
16	16	CONSECUTIVE	CONSECUTIVE	0.033554	0.033554	0.033554	0.250000	0.343750	0.781184	472.486000	1.165380	316.721000	1.161570	317.759000
16	16	CONSECUTIVE	CONSECUTIVE	0.033554	0.033554	0.033554	0.250000	0.343750	0.792992	465.451000	1.168290	315.931000	1.166080	316.530000
16	16	CONSECUTIVE	CONSECUTIVE	0.033554	0.033554	0.033554	0.250000	0.343750	0.793984	464.869000	1.168860	315.776000	1.162880	317.401000
32	32	CONSECUTIVE	CONSECUTIVE	0.022370	0.022370	0.022370	0.333333	0.395833	0.769184	552.563000	1.028770	413.138000	0.831776	308.463000
64	64	CONSECUTIVE	CONSECUTIVE	0.013422	0.013422	0.013422	0.400000	0.437500	0.802336	585.493000	1.062530	442.117000	0.671552	386.203000
128	128	CONSECUTIVE	CONSECUTIVE	0.007457	0.007457	0.007457	0.444444	0.465278	0.867104	576.157000	1.267010	394.305000	0.763072	344.623000
256	256	CONSECUTIVE	CONSECUTIVE	0.003948	0.003948	0.003948	0.470588	0.481618	0.997184	518.593000	1.258050	411.060000	0.982496	270.170000
512	512	CONSECUTIVE	CONSECUTIVE	0.002034	0.002034	0.002034	0.484848	0.490530	1.165660	451.848000	1.420580	370.767000	1.217820	219.087000
1024	1024	CONSECUTIVE	CONSECUTIVE	0.001032	0.001032	0.001032	0.492308	0.495192	1.124000	473.050000	2.384060	223.026000	1.493440	179.165000
4096	4096	CONSECUTIVE	CONSECUTIVE	0.000261	0.000261	0.000261	0.498047	0.498776	2.643140	202.622000	2.457500	217.927000	1.751390	152.954000
16	16	CONSECUTIVE	SHFL	0.033554	0.033554	0.033554	0.250000	0.343750	4.878300	75.661300	5.574110	66.216600	5.578180	66.168400
16	16	CONSECUTIVE	SHFL	0.033554	0.033554	0.033554	0.250000	0.343750	4.876960	75.682100	5.567520	66.295000	5.589150	66.038400
16	16	CONSECUTIVE	SHFL	0.033554	0.033554	0.033554	0.250000	0.343750	4.876290	75.692600	5.567840	66.291200	5.579940	66.147500
16	16	CONSECUTIVE	SHFL	0.033554	0.033554	0.033554	0.250000	0.343750	4.863940	75.884800	5.579390	66.153900	5.601250	65.895800
32	32	CONSECUTIVE	SHFL	0.022370	0.022370	0.022370	0.333333	0.395833	2.298430	184.919000	4.205730	101.058000	3.670780	69.895800
64	64	CONSECUTIVE	SHFL	0.013422	0.013422	0.013422	0.400000	0.437500	1.488800	315.531000	2.546590	184.467000	2.840000	91.322400
128	128	CONSECUTIVE	SHFL	0.007457	0.007457	0.007457	0.444444	0.465278	0.930208	537.071000	2.278430	219.268000	2.357250	111.559000
256	256	CONSECUTIVE	SHFL	0.003948	0.003948	0.003948	0.470588	0.481618	1.015680	509.150000	2.528100	204.554000	2.127580	124.762000
512	512	CONSECUTIVE	SHFL	0.002034	0.002034	0.002034	0.484848	0.490530	1.155390	455.865000	2.660350	197.982000	1.935900	137.822000
1024	1024	CONSECUTIVE	SHFL	0.001032	0.001032	0.001032	0.492308	0.495192	1.134820	468.542000	2.987360	177.986000	1.912030	139.941000
4096	4096	CONSECUTIVE	SHFL	0.000261	0.000261	0.000261	0.498047	0.498776	2.639420	202.907000	2.936930	182.353000	1.874210	142.931000
16	16	SHFL	CONSECUTIVE	0.033554	0.033554	0.033554	0.250000	0.343750	2.227260	165.718000	2.585470	142.759000	2.585380	142.764000
16	16	SHFL	CONSECUTIVE	0.033554	0.033554	0.033554	0.250000	0.343750	2.230660	165.466000	2.594050	142.287000	2.594500	142.262000
16	16	SHFL	CONSECUTIVE	0.033554	0.033554	0.033554	0.250000	0.343750	2.224450	165.928000	2.583010	142.895000	2.594080	142.285000
16	16	SHFL	CONSECUTIVE	0.033554	0.033554	0.033554	0.250000	0.343750	2.224380	165.933000	2.587460	142.649000	2.600770	141.919000
32	32	SHFL	CONSECUTIVE	0.022370	0.022370	0.022370	0.333333	0.395833	1.671490	254.278000	1.898340	223.892000	1.747550	146.818000
64	64	SHFL	CONSECUTIVE	0.013422	0.013422	0.013422	0.400000	0.437500	1.261920	372.260000	1.294340	362.937000	1.144320	226.646000
128	128	SHFL	CONSECUTIVE	0.007457	0.007457	0.007457	0.444444	0.465278	0.943456	529.530000	1.437020	347.655000	0.928096	283.345000
256	256	SHFL	CONSECUTIVE	0.003948	0.003948	0.003948	0.470588	0.481618	0.990752	521.960000	1.576290	328.070000	1.074750	246.979000
512	512	SHFL	CONSECUTIVE	0.002034	0.002034	0.002034	0.484848	0.490530	1.180320	446.237000	1.808800	291.189000	1.290300	206.780000
1024	1024	SHFL	CONSECUTIVE	0.001032	0.001032	0.001032	0.492308	0.495192	1.127840	471.440000	2.414750	220.192000	1.533250	174.513000
4096	4096	SHFL	CONSECUTIVE	0.000261	0.000261	0.000261	0.498047	0.498776	2.654460	201.757000	2.420610	221.249000	1.748260	153.228000
16	16	SHFL	SHFL	0.033554	0.033554	0.033554	0.250000	0.343750	5.552960	66.468800	6.301150	58.576400	6.299970	58.587400
16	16	SHFL	SHFL	0.033554	0.033554	0.033554	0.250000	0.343750	5.555360	66.440100	6.294780	58.635600	6.299710	58.589800
16	16	SHFL	SHFL	0.033554	0.033554	0.033554	0.250000	0.343750	5.568160	66.287400	6.288480	58.694400	6.313380	58.463000
16	16	SHFL	SHFL	0.033554	0.033554	0.033554	0.250000	0.343750	5.553220	66.465800	6.325470	58.351200	6.300960	58.578200
32	32	SHFL	SHFL	0.022370	0.022370	0.022370	0.333333	0.395833	3.292060	129.105000	5.389600	78.859800	4.105500	62.494700
64	64	SHFL	SHFL	0.013422	0.013422	0.013422	0.400000	0.437500	1.858560	252.756000	3.066980	153.168000	3.226050	80.394200
128	128	SHFL	SHFL	0.007457	0.007457	0.007457	0.444444	0.465278	0.987744	505.787000	2.813630	177.560000	2.630940	99.953400
256	256	SHFL	SHFL	0.003948	0.003948	0.003948	0.470588	0.481618	0.987456	523.702000	3.006820	171.987000	2.349660	112.970000
512	512	SHFL	SHFL	0.002034	0.002034	0.002034	0.484848	0.490530	1.174460	448.462000	3.121380	168.740000	2.081310	128.193000
1024	1024	SHFL	SHFL	0.001032	0.001032	0.001032	0.492308	0.495192	1.128220	471.279000	3.200990	166.107000	1.975810	135.424000
4096	4096	SHFL	SHFL	0.000261	0.000261	0.000261	0.498047	0.498776	2.652130	201.935000	3.174370	168.713000	1.910080	140.246000

elstehle · 2022-04-19T17:01:15Z

Sorry for the wait. I did another clean up pass over the code of this PR.

I've long wanted a cuda::memcpy that would handle runtime determined alignment as well as take a CG parameter to use multiple threads to perform the copy. That seems like the best place to put such a building block as it could have widespread applicability.

Agreed, @jrhemstad. I believe there's a recurring need for it, especially when dealing with string data. I often find myself needing to load string data into shared memory for further processing, so I tried to be transparent to the destination data space. I.e., to support vectorised stores of 4B, 8B, and 16B. Where 4B and 8B stores are friendly to the shared memory space, reducing bank conflicts and the 16B stores are presumably more efficient for global memory stores.

I hope I've been able to take a first step into that direction. In the interest of getting this PR through, I haven't exposed it as a stand-alone CG/block-level algorithm yet and hidden it under the detail namespace for now. I plan to expose it to the public API and add typed tests in a follow-up PR. This is currently the signature (as I believe we don't have CG in CUB yet(?)):

VectorizedCopy(int32_t thread_rank, int32_t group_size, void *dest, ByteOffsetT num_bytes, const void *src)

alliepiper · 2022-04-22T16:22:11Z

We don't expose CG in the CUB APIs, this would require some more discussion before we added anything like that. That may be better suited to the senders/receivers based APIs that @senior-zero is working on. For now, let's try to find a way to pass the same info in without adding any dependencies.

cub/device/dispatch/dispatch_batch_memcpy.cuh

cub/device/device_memcpy.cuh

cub/agent/agent_batch_memcpy.cuh

cub/device/device_memcpy.cuh

gevtushenko

There are some issues in the example. Please, check if it's on the algorithm side or not.

gevtushenko

A few minor comments

cub/device/device_memcpy.cuh

cub/device/dispatch/dispatch_batch_memcpy.cuh

cub/agent/agent_batch_memcpy.cuh

miscco

partial review until the agent

cub/agent/agent_batch_memcpy.cuh

jrhemstad · 2022-12-13T19:23:10Z

cub/agent/agent_batch_memcpy.cuh

+  if (offset == 0)
+  {
+    LoadVectorAndFunnelShiftR<true>(aligned_ptr, bit_shift, data_out);
+  }
+  // Otherwise, we need to load extra bytes and perform funnel-shifting
+  else
+  {
+    LoadVectorAndFunnelShiftR<false>(aligned_ptr, bit_shift, data_out);
+  }


I wonder if there would be any advantage to dispatching to where the offset is statically known. It looks like that would allow bit_shift to be known statically as well.

Suggested change

if (offset == 0)

{

LoadVectorAndFunnelShiftR<true>(aligned_ptr, bit_shift, data_out);

}

// Otherwise, we need to load extra bytes and perform funnel-shifting

else

{

LoadVectorAndFunnelShiftR<false>(aligned_ptr, bit_shift, data_out);

}

switch(offset)

case 0: LoadVectorAndFunnelShiftR<0>(...)

case 1: LoadVectorAndFunnelShiftR<1>(...)

case 2: LoadVectorAndFunnelShiftR<2>(...)

case 3: LoadVectorAndFunnelShiftR<3>(...)

Interesting idea! Preliminary results look like it does not cut above the noise, but I'll do a more thorough run and follow up.

I've ran some more benchmarks on this suggestion with code paths that use immediate bit-shift values. Performance remained unchanged for DeviceMemcpy::Batched.

My hypothesis is that we're bottlenecked by the memory subsystem, as I also don't see significant performance changes for some other changes that I'd expect to otherwise positively impact performance.

test/test_device_batch_memcpy.cu

gevtushenko

The work seems to be in progress, so I'll finish review for now to make comments visible.

cub/device/dispatch/dispatch_batch_memcpy.cuh

gevtushenko · 2022-12-19T13:04:12Z

cub/agent/agent_batch_memcpy.cuh

+
+    // Ensure the prefix callback has finished using its temporary storage and that it can be reused
+    // in the next stage
+    CTA_SYNC();


temp_storage.blev_buffer_offset is not used in PartitionBuffersBySize. Since look-back is rather expensive, do you think there's any advantage in overlapping of decoupled look-back with the PartitionBuffersBySize in other warps? BLevBuffScanPrefixCallbackOpT storage should be like 4 ints, so putting it into struct instead of a union shouldn't increase shared memory requirements significantly, but we would get rid of one sync and overlap some operations.

cub/agent/agent_batch_memcpy.cuh

cub/device/dispatch/dispatch_batch_memcpy.cuh

…h-memcpy

elstehle changed the title ~~Feature/device batch memcpy~~ Adds DeviceBatchMemcpy algorithm and tests Aug 18, 2021

gevtushenko self-requested a review August 23, 2021 09:42

gevtushenko suggested changes Aug 24, 2021

View reviewed changes

elstehle force-pushed the feature/device-batch-memcpy branch from 1e96c33 to d568803 Compare October 4, 2021 17:43

elstehle force-pushed the feature/device-batch-memcpy branch from d568803 to 4f44fae Compare October 11, 2021 12:52

jrhemstad reviewed Oct 11, 2021

View reviewed changes

cub/device/dispatch/dispatch_batch_memcpy.cuh Outdated Show resolved Hide resolved

jrhemstad reviewed Oct 11, 2021

View reviewed changes

cub/device/dispatch/dispatch_batch_memcpy.cuh Outdated Show resolved Hide resolved

elstehle changed the title ~~Adds DeviceBatchMemcpy algorithm and tests~~ [WIP] Adds DeviceBatchMemcpy algorithm and tests Oct 13, 2021

alliepiper added this to the 1.16.0 milestone Oct 14, 2021

alliepiper added the helps: rapids Helps or needed by RAPIDS. label Oct 14, 2021

alliepiper marked this pull request as draft October 14, 2021 20:27

jrhemstad mentioned this pull request Nov 18, 2021

[FEA] pack/unpack functions to merge/split (multiple) device_buffer(s) rapidsai/cudf#9726

Open

alliepiper modified the milestones: 1.16.0, 1.17.0 Feb 7, 2022

elstehle force-pushed the feature/device-batch-memcpy branch from 8f6d447 to f657812 Compare April 18, 2022 08:40

elstehle changed the title ~~[WIP] Adds DeviceBatchMemcpy algorithm and tests~~ Adds DeviceBatchMemcpy algorithm and tests Apr 19, 2022

alliepiper requested review from gevtushenko and jrhemstad April 22, 2022 16:22

gevtushenko reviewed May 24, 2022

View reviewed changes

cub/device/dispatch/dispatch_batch_memcpy.cuh Outdated Show resolved Hide resolved

gevtushenko suggested changes May 25, 2022

View reviewed changes

gevtushenko suggested changes Jul 20, 2022

View reviewed changes

alliepiper modified the milestones: 2.0.0, 2.1.0 Jul 25, 2022

miscco reviewed Sep 6, 2022

View reviewed changes

elstehle requested a review from gevtushenko December 13, 2022 18:57

jrhemstad reviewed Dec 13, 2022

View reviewed changes

gevtushenko reviewed Dec 14, 2022

View reviewed changes

test/test_device_batch_memcpy.cu Outdated Show resolved Hide resolved

gevtushenko reviewed Dec 14, 2022

View reviewed changes

test/test_device_batch_memcpy.cu Outdated Show resolved Hide resolved

gevtushenko suggested changes Dec 19, 2022

View reviewed changes

gevtushenko approved these changes Dec 24, 2022

View reviewed changes

adds DeviceMemcpy::Batched

fa9c1bd

elstehle force-pushed the feature/device-batch-memcpy branch from f321135 to fa9c1bd Compare December 28, 2022 08:39

Merge remote-tracking branch 'upstream/main' into feature/device-batc…

9ac15eb

…h-memcpy

elstehle mentioned this pull request Dec 28, 2022

Adds benchmarks for DeviceMemcpy::Batched alliepiper/thrust_benchmark#11

Merged

gevtushenko added a commit to gevtushenko/thrust that referenced this pull request Dec 29, 2022

Testing NVIDIA/cub#359

3acdcea

gevtushenko added a commit to gevtushenko/thrust that referenced this pull request Dec 29, 2022

Testing NVIDIA/cub#359

b6ec184

gevtushenko added testing: gpuCI in progress Started gpuCI testing. testing: gpuCI passed Passed gpuCI testing. and removed testing: gpuCI in progress Started gpuCI testing. labels Dec 29, 2022

gevtushenko marked this pull request as ready for review December 30, 2022 09:04

gevtushenko merged commit 423f54e into NVIDIA:main Dec 30, 2022

elstehle mentioned this pull request Jan 10, 2023

[FEA] Multi-buffer copy algorithm #297

Closed

This was referenced Jan 13, 2023

[FEA] Multiple buffer copy kernel rapidsai/cudf#7076

Closed

Using CUB (planned) 2.1.0's batched memcpy cupy/cupy#7329

Open

elstehle mentioned this pull request Apr 24, 2023

[FEA]: Explore potential improvements for DeviceMemcpy::Batched NVIDIA/cccl#59

Open

3 tasks

GregoryKimball mentioned this pull request Aug 24, 2023

[FEA] Story - Improve performance with long strings rapidsai/cudf#13048

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Adds DeviceBatchMemcpy algorithm and tests #359

Adds DeviceBatchMemcpy algorithm and tests #359

elstehle commented Aug 18, 2021 •

edited

Loading

gevtushenko left a comment

elstehle commented Aug 25, 2021 •

edited

Loading

jrhemstad commented Aug 25, 2021

elstehle commented Oct 13, 2021

alliepiper commented Oct 14, 2021

elstehle commented Oct 18, 2021 •

edited

Loading

elstehle commented Apr 19, 2022

alliepiper commented Apr 22, 2022

gevtushenko left a comment

gevtushenko left a comment

miscco left a comment

jrhemstad Dec 13, 2022

elstehle Dec 13, 2022

elstehle Dec 28, 2022

gevtushenko left a comment

gevtushenko Dec 19, 2022

-  if (offset == 0)
-  {
-    LoadVectorAndFunnelShiftR<true>(aligned_ptr, bit_shift, data_out);
-  }
-  // Otherwise, we need to load extra bytes and perform funnel-shifting
-  else
-  {
-    LoadVectorAndFunnelShiftR<false>(aligned_ptr, bit_shift, data_out);
-  }
+switch(offset)
+case 0: LoadVectorAndFunnelShiftR<0>(...)
+case 1: LoadVectorAndFunnelShiftR<1>(...)
+case 2: LoadVectorAndFunnelShiftR<2>(...)
+case 3: LoadVectorAndFunnelShiftR<3>(...)

Adds DeviceBatchMemcpy algorithm and tests #359

Adds DeviceBatchMemcpy algorithm and tests #359

Conversation

elstehle commented Aug 18, 2021 • edited Loading

Algorithm Overview

Implementation Details

Step 1: Partitioning Buffers by Size

Step 2.a: Copying TLEV Buffers

Step 2.b: Copying WLEV Buffers

Step 2.c: Enqueueing BLEV Buffers

gevtushenko left a comment

Choose a reason for hiding this comment

elstehle commented Aug 25, 2021 • edited Loading

jrhemstad commented Aug 25, 2021

elstehle commented Oct 13, 2021

alliepiper commented Oct 14, 2021

elstehle commented Oct 18, 2021 • edited Loading

Methodology

Copying of small buffers logic

No Aliased Loads, No Buffer Size Variance

Scheduling: TWP vs. SPPS; No Aliased Loads, No Buffer Size Variance

No Aliased Loads, Varying Buffer Size

16B-aligned buffers, 4B-aliased copies, Varying Buffer Size

elstehle commented Apr 19, 2022

alliepiper commented Apr 22, 2022

gevtushenko left a comment

Choose a reason for hiding this comment

gevtushenko left a comment

Choose a reason for hiding this comment

miscco left a comment

Choose a reason for hiding this comment

jrhemstad Dec 13, 2022

Choose a reason for hiding this comment

elstehle Dec 13, 2022

Choose a reason for hiding this comment

elstehle Dec 28, 2022

Choose a reason for hiding this comment

gevtushenko left a comment

Choose a reason for hiding this comment

gevtushenko Dec 19, 2022

Choose a reason for hiding this comment

elstehle commented Aug 18, 2021 •

edited

Loading

elstehle commented Aug 25, 2021 •

edited

Loading

elstehle commented Oct 18, 2021 •

edited

Loading