Optimizations for matrix multiplication do not seem to work correctly #205

spisladqo opened this issue Nov 15, 2024 · 0 comments

What the experiment is about

For my paper, I want to experiment with how Vortex responds to optimizations of linear algebra programs. Using the built-in Vortex profiler, I tested the performance of the first two kernels from the tutorial by Cedric Nugteren, https://cnugteren.github.io/tutorial/pages/page1.html; the driver code was made similar to the one in tests/opencl/sgemm. You can find the code of the tests here (branch: optimization_check): https://github.com/spisladqo/vortex/tree/optimization_check. My fork is a little outdated, but I believe there have been no major changes to the memory system or the profiler, so these experiments should be reproducible. I have read the Vortex documentation and inspected the code, and I have also watched the Vortex introduction video on YouTube (https://www.youtube.com/watch?v=h1xDQILSZnI) to understand the underlying memory architecture before running the experiments. Nevertheless, I have run into numerous issues, which I explain below. I hope for your quick reply.

Briefly, about the tests from the tutorial:

  • The first test uses a naive implementation: one thread computes one element of the resulting matrix, and no optimizations are applied.
  • The second test uses a user-defined macro TS (tile size) to fit TS * TS submatrices of A and B into local memory. This helps because threads in the same row/column of a tile can reuse elements of the input matrices without issuing extra global memory requests. (Both kernels are sketched below for reference.)
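
For reference, here are the two kernels roughly as they appear in the tutorial (abbreviated; see the link above for the exact code). TS is the tile-size macro, the matrices are stored column-major, and K is assumed to be a multiple of TS:

    // Kernel 1: naive, one work-item per element of C, all reads go to global memory.
    __kernel void myGEMM1(const int M, const int N, const int K,
                          const __global float* A, const __global float* B,
                          __global float* C) {
        const int globalRow = get_global_id(0);              // row of C
        const int globalCol = get_global_id(1);              // column of C
        float acc = 0.0f;
        for (int k = 0; k < K; k++)
            acc += A[k*M + globalRow] * B[globalCol*K + k];  // 2*K global loads per work-item
        C[globalCol*M + globalRow] = acc;
    }

    // Kernel 2: tiled, each work-group stages TS x TS tiles of A and B in local memory.
    __kernel void myGEMM2(const int M, const int N, const int K,
                          const __global float* A, const __global float* B,
                          __global float* C) {
        const int row = get_local_id(0);                     // local row within the tile
        const int col = get_local_id(1);                     // local column within the tile
        const int globalRow = TS*get_group_id(0) + row;
        const int globalCol = TS*get_group_id(1) + col;
        __local float Asub[TS][TS];                          // together 2 * TS * TS floats
        __local float Bsub[TS][TS];                          //   of local memory per work-group
        float acc = 0.0f;
        for (int t = 0; t < K/TS; t++) {
            Asub[col][row] = A[(TS*t + col)*M + globalRow];  // one global load from A per tile
            Bsub[col][row] = B[globalCol*K + (TS*t + row)];  // one global load from B per tile
            barrier(CLK_LOCAL_MEM_FENCE);
            for (int k = 0; k < TS; k++)
                acc += Asub[k][row] * Bsub[col][k];          // TS reads served from local memory
            barrier(CLK_LOCAL_MEM_FENCE);
        }
        C[globalCol*M + globalRow] = acc;
    }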

First, I ran these tests under both rtlsim and simx to see whether the two simulators differ in the number of memory requests. The configuration was left at the defaults and TS was set to 4. Only the memory request counts are shown below.

For matrices of size 32:

  • rtlsim
    Command: ./ci/blackbox.sh --app=kernel1 --driver=rtlsim --args=-n32 --perf=2
    Output: PERF: memory requests=3863 (reads=816, writes=3047)
    Command: ./ci/blackbox.sh --app=kernel2 --driver=rtlsim --args=-n32 --perf=2
    Output: PERF: memory requests=22012 (reads=9749, writes=12263)
  • simx
    Command: ./ci/blackbox.sh --app=kernel1 --driver=simx --args=-n32 --perf=2
    Output: PERF: memory requests=3862 (reads=816, writes=3046)
    Command: ./ci/blackbox.sh --app=kernel2 --driver=simx --args=-n32 --perf=2
    Output: PERF: memory requests=21368 (reads=9106, writes=12262)

For matrices of size 128:

  • rtlsim
    Command: ./ci/blackbox.sh --app=kernel1 --driver=rtlsim --args=-n128 --perf=2
    Output: PERF: memory requests=183260 (reads=145653, writes=37607)
    Command: ./ci/blackbox.sh --app=kernel2 --driver=rtlsim --args=-n128 --perf=2
    Output: PERF: memory requests=480343 (reads=295280, writes=185063)
  • simx
    Command: ./ci/blackbox.sh --app=kernel1 --driver=simx --args=-n128 --perf=2
    Output: PERF: memory requests=183094 (reads=145488, writes=37606)
    Command: ./ci/blackbox.sh --app=kernel2 --driver=simx --args=-n128 --perf=2
    Output: PERF: memory requests=465401 (reads=280339, writes=185062)

The difference was no more than 4%, which I considered acceptable, so I decided to continue with simx to save time. It can already be seen that, under the default configuration, kernel2 issues significantly more memory requests than kernel1.

I further assume that the counter named “memory requests” counts only global memory requests, because there are separate counters called “lmem reads” and “lmem writes”, which count local memory requests and, from what I can observe, are not included in “memory requests”. Please correct me if I’m wrong.

Experiment

First, I needed to find out which configurations these tests should be run under. I tried to find information about how OpenCL work-groups are mapped onto the Vortex hardware so that I could choose TS correctly. I found some mentions of group_size in kernel/src/vx_spawn.c and elsewhere, but they do not seem to be related to my OpenCL tests. Please correct me if I missed something.
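
For context, the only place in my setup where a work-group size is specified at all is the local work size passed to clEnqueueNDRangeKernel on the host side. Roughly (an illustrative sketch, not a verbatim copy of my driver, which follows tests/opencl/sgemm; variable names are placeholders):

    // Illustrative host-side launch: each work-group is one TS x TS tile.
    // 'queue', 'kernel' and 'n' are assumed to be set up as in tests/opencl/sgemm.
    size_t global_size[2] = {n, n};    // one work-item per element of C
    size_t local_size[2]  = {TS, TS};  // one work-group = one TS x TS tile
    cl_int err = clEnqueueNDRangeKernel(queue, kernel, 2, NULL,
                                        global_size, local_size,
                                        0, NULL, NULL);
    // The open question is how such a TS * TS work-group is then mapped
    // onto Vortex threads/warps/cores.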

My first guess was that choosing TS such that TS * TS equals the total number of threads per core would be fine (I assume threads * warps = total threads per core, although this is not clearly stated in docs/simulation.md).
With --threads=32 and --warps=8 that gives 32 * 8 = 256 threads per core, so I took TS=16 (16 * 16 = 256), which loads 2 * 16 * 16 = 512 elements into local memory.

  • Running the test:
    Command: ./ci/blackbox.sh --app=kernel2 --driver=simx --threads=32 --warps=8 --cores=2 --args=-n32 --perf=2
    Output:
    [screenshot: photo1]

  • So, the configuration wasn’t correct. However, when trying to set more warps I get:
    Command: ./ci/blackbox.sh --app=kernel2 --driver=simx --threads=16 --warps=16 --cores=2 --args=-n32 --perf=2
    Output:
    [screenshot: photo2]

Interestingly, it says that the needed memory is 2048 bytes, even though I only allocated 512 elements (2 matrices of 16 * 16 elements each). I can’t understand where the number 2048 comes from.
I get the same result when passing --threads=8 and --warps=16.
So it seems like local memory has a fixed size, and TS=16 is too big for it. Also, when I double or halve the number of warps (without changing the number of threads), the available memory becomes twice as large or twice as small.
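
If it helps to clarify my setup: the standard way to check this limit from the host would be something like the sketch below; I have not verified what value Vortex’s OpenCL runtime actually reports here.

    #include <stdio.h>
    #include <CL/cl.h>

    // Hedged sketch: query how much local memory the device advertises.
    static void print_local_mem_size(cl_device_id device) {
        cl_ulong lmem_size = 0;
        clGetDeviceInfo(device, CL_DEVICE_LOCAL_MEM_SIZE,
                        sizeof(lmem_size), &lmem_size, NULL);
        printf("CL_DEVICE_LOCAL_MEM_SIZE = %llu bytes\n",
               (unsigned long long)lmem_size);
        // Kernel2 needs 2 * TS * TS * sizeof(float) bytes of local memory per work-group.
    }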

  • When I set TS to 8, all of the following commands had almost the same output:
    Command: ./ci/blackbox.sh --app=kernel2 --driver=simx --threads=16 --warps=4 --cores=2 --args=-n32 --perf=2
    Output: Memory access violation from 0x13040 to 0x13044, curent flags=1, access flags=2
    Command: ./ci/blackbox.sh --app=kernel2 --driver=simx --threads=8 --warps=8 --cores=2 --args=-n32 --perf=2
    Output: Memory access violation from 0x13020 to 0x13024, curent flags=1, access flags=2
    Commands:
    ./ci/blackbox.sh --app=kernel2 --driver=simx --threads=32 --warps=2 --cores=2 --args=-n32 --perf=2
    ./ci/blackbox.sh --app=kernel2 --driver=simx --threads=4 --warps=16 --cores=2 --args=-n32 --perf=2
    ./ci/blackbox.sh --app=kernel2 --driver=simx --threads=2 --warps=32 --cores=2 --args=-n32 --perf=2
    Output: Memory access violation from 0x13000 to 0x13004, curent flags=1, access flags=2

My guess about picking TS such that TS * TS = warps * threads was incorrect.

Then I had the idea that a work-group may be mapped to exactly one warp. In that case I have to pick TS=4, because TS=8 would already make the tile bigger than the number of threads in a warp (max threads = 32, but 8 * 8 = 64). With this assumption in mind, I ran the tests under the following configurations:

  • Kernel2:
    Command: ./ci/blackbox.sh --app=kernel2 --driver=simx --threads=16 --warps=4 --cores=2 --args=-n32 --perf=2
    Output:
    [screenshot: photo3]

The test ran successfully, so the work-group <-> hardware-warp mapping seems the most realistic one so far.
Analyzing the test: there were lots of local memory requests (32768 + 8192) and 28861 global memory requests. Running kernel1 with the same configuration gives:

  • Kernel1:
    Command: ./ci/blackbox.sh --app=kernel1 --driver=simx --threads=16 --warps=4 --cores=2 --args=-n32 --perf=2
    Output:
    [screenshot: photo4]

0 local memory requests, and at the same time only 8394 global memory requests. So the kernel that uses the optimization performs worse than the unoptimized one.

Results

To sum up, it is still not clear to me how the Vortex memory system works, so picking optimal configurations turned out to be quite difficult. Nevertheless, the current experiments show that the kernel that should issue fewer global memory requests can actually issue more of them, even though that kernel does use local memory. I don’t understand why it issues more global memory requests than kernel1. I focused on the second test because it introduces the local memory optimization, which does not seem to work. I hope to experiment with more tests after I understand the results of this experiment.

I am asking the following questions to understand why I got these results; if my experiments were done incorrectly, I am counting on your quick reply to help me set up correct experiments for my paper:

  1. Can you explain how OpenCL work-groups relate to the other concepts (threads, warps, cores) in Vortex? Is this mapping configurable?
  2. Is local memory set to a fixed size and then divided between warps? Is there a way to configure the local memory size?
  3. Running kernel2 with different values of TS shows that the required local memory is always 4 times larger than what the test allocates (2 matrices, each of size TS * TS). It may be my mistake, but I still cannot understand why this happens; it would really help me to understand it.
  4. Am I mistaken somewhere and have I chosen poor configurations for my experiments, or is this a profiler/Vortex memory design problem?
  5. I noticed that the profiler only accepts --perf=1 or --perf=2, and in the code these two cases sit in one switch statement, separated by break; (see the sketch after this list for the pattern I mean). Is there a way to collect both core and memory information during one test? If not, what is the reason behind that?
  6. docs/simulation.md does not specify whether --threads=n means n threads per warp or n threads in total, and the same goes for --warps=. However, docs/fpga_setup.md clearly states that the number of threads is per warp and the number of warps is per core. I relied on the threads-per-warp / warps-per-core interpretation, but please correct me if I’m wrong. I hope clarifying this can make the Vortex documentation better.
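
To illustrate what I mean in question 5, the structure I am referring to looks roughly like this (paraphrased for illustration, not a verbatim copy of the Vortex source):

    // Paraphrased illustration of the --perf handling mentioned in question 5;
    // identifier names are placeholders.
    switch (perf_class) {
    case 1:
        // dump core-level counters only
        break;
    case 2:
        // dump memory-system counters only
        break;
    }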