IREE compiler support for model quantization #12005
Replies: 2 comments 1 reply
-
Quantization in general is a broad topic. It touches many layers across the stack: model authoring, framework exporting, which hardware we are talking about, etc. So the answer depends on what specifics you are interested in.

In general, IREE compiles input models down to runtime scheduling logic and accelerator (GPU, CPU, etc.) executables. We support a broad range of hardware here: arm/x86_64/riscv for CPU, and various kinds of GPUs (AMD, Apple, ARM, Intel, NVIDIA, Qualcomm, etc.). Different hardware may have different capabilities regarding int8 / int4 / etc. support, and different APIs / software stacks may not expose those smaller bitwidths yet (though they may in the future). For example, if you drive older generations of NVIDIA GPUs using CUDA, you don't have native int4 support there. If you drive GPUs via Vulkan, there is no native int4 support for any kind of GPU either, even though certain hardware (e.g., newer generations of NVIDIA) does support it; it's just not exposed yet. So you can see that the landscape is too diverse to answer with a simple yes or no. :)

But to still answer your question: as long as the input model is exported to MLIR with proper int8/int4/etc. types and the target hardware natively supports them, we should support the compilation flow and generate performant code. For targets that don't, we can also emulate them with other bitwidths (int32) to make the model runnable at least; it's not going to be performant though.

Specifically, right now the support for int8 across various CPU/GPU targets is progressing well, especially for mobile-focused architectures like ARM CPU or Vulkan for GPU. Models should be runnable, and we are working on adding more accelerated implementations by going through native int8 intrinsics. For int4, we haven't looked into it much yet.
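To make the int32 emulation point above a bit more concrete, here is a minimal pure-Python sketch of a quantized dot product whose int8 or int4 operands are carried in wider integers and accumulated in a wrapped 32-bit value. This is only an illustration of the general idea, not IREE's actual lowering: the function names (`quantize`, `wrap_int32`, `emulated_dot`, `quantized_dot`) are hypothetical, and symmetric per-tensor quantization is an assumption made to keep the example short.

```python
def quantize(values, num_bits):
    """Symmetric per-tensor quantization to signed num_bits integers (assumed scheme)."""
    qmax = 2 ** (num_bits - 1) - 1          # 127 for int8, 7 for int4
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    # Round to the nearest integer and clamp into [-qmax, qmax].
    q = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return q, scale


def wrap_int32(x):
    """Wrap a Python int into the signed 32-bit range, like an int32 register."""
    x &= 0xFFFFFFFF
    return x - 0x100000000 if x >= 0x80000000 else x


def emulated_dot(qa, qb):
    """Dot product of narrow integers, accumulated in an emulated int32."""
    acc = 0
    for a, b in zip(qa, qb):
        acc = wrap_int32(acc + a * b)
    return acc


def quantized_dot(a, b, num_bits):
    """Quantize both inputs, multiply-accumulate in int32, then rescale to float."""
    qa, scale_a = quantize(a, num_bits)
    qb, scale_b = quantize(b, num_bits)
    return emulated_dot(qa, qb) * scale_a * scale_b


if __name__ == "__main__":
    a = [0.5, -1.0, 0.25, 0.75]
    b = [1.0, 0.5, -0.5, 0.25]
    print("float:", sum(x * y for x, y in zip(a, b)))
    print("int8 :", quantized_dot(a, b, 8))
    print("int4 :", quantized_dot(a, b, 4))
```

The result is numerically the same as what a native narrow-integer path would compute, but every element occupies a full-width integer, which is why emulation keeps a model runnable without being performant.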
-
Thanks. So we can use the following workflow to build a quantized model on top of IREE.
-
Can the IREE stack support model quantization components (INT8, INT4, etc.)?
Some existing projects already support this, such as
https://github.com/sophgo/tpu-mlir