Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Added explicit register pressure estimate for SIMD and tuned [Dynamic]LinearQuantization operations #2945

Merged

Conversation

AlexandreEichenberger
Copy link
Collaborator

@AlexandreEichenberger AlexandreEichenberger commented Sep 18, 2024

When deciding the amount of unrolling of a simd loop, it is helpful to take into account the amount of register pressure. If unrolling a loop that use a lot of vector SIMD registers, this lead to spilling and thus lower performance.

Up to now, the register pressure was estimated as the number of SIMD ops / 2, roughly. That worked well for small loops. Linear Quantization, however, has a lot of operations that are very sequential, thus do not need many registers. But since it had a lot of operations, its estimated register pressure was high, and thus prevented loop unrolling.

I have added a new "metrics" in the map that describe the generic operations used in a loop. Before it was only "generic add" -> "3", meaning this kernel uses 3 generic adds.

I have now added a new enum value, GenericOps::EstimatedVectorRegisterPressure , that optionally let us specify an estimate of the register pressure for that kernel. When none is provided, the old approach is used; when one is provided, then that value is used.

A quick test for DynamicLinearQuantization on arm64 shows a lowering from 117us (no simd) -> 46us (simd, old unrolling) -> 38us (simd, new unrolling). Unrolling (unrollVL) went from 2 to 4.

A second issue that is addressed is the Linear Quantization. The loop has "complicated code" that simdize very well until it hit the SIMD conversion of float32 into int8. LLVM compiler generates code that is slow.

I have tried various scheme to accelerate this conversion. First I tried a hybrid approach where I use vector extract and insert element to proceed with scalar elements of a vector. The code generated by mlir/llvm for extract and insert is unfortunately worst than the original code.

I then tried to bypass the extract and insert by using memory. I save a SIMD vector into a buffer (of VL elements). Then I proceed with scalar load from that buffer, and save the int8 result in another buffer. Then that buffer is re-loaded as a SIMD vector. That approach did better than the above approach, but not as well as the next approach. Also, it has a privatization issue for the parallel loops.

The approach that worked best was to have the first loop perform all of the operations until the conversion from float to int. It saved its value (in SIMD computation mode) into a temporary buffer. A second loop then iterate over all values in the temporary buffer and convert them and save them into the output. The drawback of this approach is that it need a second buffer of the same size as the input. But each loop can perform at their best performance with significant unrolling.

A lot of assembly was looked at, to detect the best pattern, in addition to performance evaluation on z16.

For quantized linear, performance went from 158us to 124us for 64K values with 1 thread, and from 35us to 30us with 6 threads
For dynamic quantized linear, performance went from 174us to 161us for 64k values with 1 thread, and remained at 50us.

Note that I also discovered that the non-sims code (from the onnx-mlir) obtained with the -disable-simd still has a lot of SIMD introduced by the LLVM backend. So simply disabling onnx-mlir simd is not sufficient to have a code with/without SIMD. Nevertheless, onnx-mlir generated simd code outperforms code generated by the backend.

Signed-off-by: Alexandre Eichenberger <alexe@us.ibm.com>
@AlexandreEichenberger AlexandreEichenberger marked this pull request as draft September 18, 2024 19:20
Signed-off-by: Alexandre Eichenberger <alexe@us.ibm.com>
Signed-off-by: Alexandre Eichenberger <alexe@us.ibm.com>
Signed-off-by: Alexandre Eichenberger <alexe@us.ibm.com>
Signed-off-by: Alexandre Eichenberger <alexe@us.ibm.com>
Signed-off-by: Alexandre Eichenberger <alexe@us.ibm.com>
Signed-off-by: Alexandre Eichenberger <alexe@us.ibm.com>
Signed-off-by: Alexandre Eichenberger <alexe@us.ibm.com>
Signed-off-by: Alexandre Eichenberger <alexe@us.ibm.com>
@AlexandreEichenberger AlexandreEichenberger marked this pull request as ready for review September 20, 2024 01:25
@@ -2073,6 +2073,29 @@ void VectorBuilder::multiReduction(ArrayRef<Value> inputVecArray,
}
}

Value VectorBuilder::extractElement(Value vector, int64_t index) const {
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added for experiments on how to get better performance, currently not in use but were verified to work during experiments associated with the PR

@@ -663,6 +663,12 @@ int64_t computeSuitableUnrollFactor(mlir::MemRefType memRefType,
// Cap totVL so that it is at most maxUnrollVL * archVL.
int64_t capVLForMaxUnroll(
mlir::MemRefType memRefType, int64_t totVL, int64_t maxUnrollVL);
// In some type conversion loops we may have a given totVL based on a given
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also needed for some hybrid loops where multiple precisions where present in the same SIMD loop. Currently not needed in the (best performing) final version, but would be useful to similar situations in the future.

@AlexandreEichenberger AlexandreEichenberger changed the title Added register pressure estimate to be explicit when needed Added explicit register pressure estimate for SIMD and tuned [Dynamic]LinearQuantization operations Sep 20, 2024
Copy link
Collaborator

@tungld tungld left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Thanks for various optimizations!

@AlexandreEichenberger AlexandreEichenberger merged commit 5c53b7e into onnx:main Sep 24, 2024
7 checks passed
@jenkins-droid
Copy link
Collaborator

Jenkins Linux ppc64le Build #14716 [push] Added explicit register ... started at 09:31

@jenkins-droid
Copy link
Collaborator

Jenkins Linux amd64 Build #15686 [push] Added explicit register ... started at 08:20

@jenkins-droid
Copy link
Collaborator

Jenkins Linux s390x Build #15689 [push] Added explicit register ... started at 09:20

@jenkins-droid
Copy link
Collaborator

Jenkins Linux amd64 Build #15686 [push] Added explicit register ... passed after 1 hr 5 min

@jenkins-droid
Copy link
Collaborator

Jenkins Linux s390x Build #15689 [push] Added explicit register ... passed after 1 hr 28 min

@jenkins-droid
Copy link
Collaborator

Jenkins Linux ppc64le Build #14716 [push] Added explicit register ... passed after 2 hr 3 min

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants