
Compiling MobileNetV2 from documentation: "failed to legalize operation 'stablehlo.convolution' that was explicitly marked illegal" #19852

Open
metal3d opened this issue Jan 30, 2025 · 7 comments
Labels: bug 🐞 Something isn't working · integrations/stablehlo StableHLO (JAX/TensorFlow/etc) import and conversion · integrations/tensorflow TensorFlow model import and conversion

Comments

@metal3d commented Jan 30, 2025

What happened?

Hello,

I successfully converted MobileNet V2 to MLIR, but then iree-compile failed to create the .vmfb file.

I followed the documentation page: https://iree.dev/guides/deployment-configurations/gpu-vulkan/#compile-a-program

Steps to reproduce your issue

# prepare workspace
mkdir -p Projects/ML/ireetest
cd Projects/ML/ireetest
python3.12 -mvenv venv
source venv/bin/activate

pip install tensorflow iree-base-compiler iree-base-runtime iree-tools-tf

# download mobilenet v2 from tfhub (that is now kaggle)
mkdir models
cd models
curl -L -o mobilenetv2.tar.gz \
  https://www.kaggle.com/api/v1/models/google/mobilenet-v2/tensorFlow2/035-128-classification/2/download
tar xf mobilenetv2.tar.gz
cd ..

# checks:
ls -lah models/
total 7.4M
drwxr-xr-x. 1 metal3d metal3d   82 Jan 30 09:01 .
drwxr-xr-x. 1 metal3d metal3d   50 Jan 30 09:19 ..
-rw-r--r--. 1 metal3d metal3d 6.2M Jan 30 09:01 mobilenetv2.tar.gz
-rwx------. 1 metal3d metal3d 1.3M Nov 15  2023 saved_model.pb
drwxr-x--x. 1 metal3d metal3d   88 Nov 15  2023 variables


# in python:
python -c "import tensorflow.compat.v2 as tf;model = tf.saved_model.load('./models/');print(list(model.signatures.keys()))"
# output:
['serving_default']


# import
iree-import-tf \
  --tf-import-type=savedmodel_v1 \
  --tf-savedmodel-exported-names=serving_default \
  ./models/ -o iree_input.mlir

# checks:
ls -lah iree_input.mlir
-rw-r--r--. 1 metal3d metal3d 6.5M Jan 30 09:12 iree_input.mlir

### The problem starts here:

# compile:
 iree-compile --iree-hal-target-backends=vulkan-spirv iree_input.mlir --iree-vulkan-target=ampere -o mobilenet.vmfb
-:549:11: error: failed to legalize operation 'stablehlo.convolution' that was explicitly marked illegal
-:549:11: note: see current operation: %420 = "stablehlo.convolution"(%415, %419) <{batch_group_count = 1 : i64, dimension_numbers = #stablehlo.conv<[b, 0, 1, f]x[0, 1, i, o]->[b, 0, 1, f]>, feature_group_count = 16 : i64, padding = dense<1> : tensor<2x2xi64>, precision_config = [#stablehlo<precision DEFAULT>, #stablehlo<precision DEFAULT>], rhs_dilation = array<i64: 1, 1>, window_strides = array<i64: 1, 1>}> : (tensor<?x64x64x16xf32>, tensor<3x3x1x16xf32>) -> tensor<?x64x64x16xf32>

I also tried importing with --tf-import-type=savedmodel_v2 and then compiling:

iree-compile --iree-hal-target-backends=vulkan-spirv iree_input.mlir -o mobilenet.vmfb
-:1:1: error: outer module does not contain a vm.module op
-:1:1: note: see current operation:
"builtin.module"() ({
^bb0:
}) : () -> ()
error opening input file: failed to generate bytecode

What component(s) does this issue relate to?

Compiler, Python

Version information

IREE compiler version 3.1.0rc20250107 @ d224220
LLVM version 20.0.0git
Optimized build

Additional context

Running on Fedora 41 - GPU is RTX 3060 - using "Vulkan"

metal3d added the bug 🐞 Something isn't working label Jan 30, 2025
ScottTodd added the integrations/tensorflow and integrations/stablehlo labels Feb 4, 2025
@ScottTodd (Member)

I can reproduce this in Colab: https://colab.research.google.com/gist/ScottTodd/39b0ac7f054650011b2a4012d34b6afa/iree-issue19852.ipynb

MobileNetV2 should work and StableHLO should be stable. Neither is currently the case, but that can be fixed. This is especially worth fixing since, as you point out, our documentation for Vulkan and some other backends uses that as the first example. I have a task filed on #18174 to "replace TensorFlow MobileNet example with something more recent / supported" too...

For this specific issue, a few things stand out to me:

@ScottTodd (Member)

Trying to see how we'd get to the VHLO dialect of StableHLO from TF, since ideally we wouldn't need a downstream import tool at all; we'd just consume StableHLO that the framework exports.

The StableHLO website has tutorials for JAX and PyTorch to StableHLO and StableHLO --> TF, but not TF --> StableHLO?

@ScottTodd (Member)

Maybe we could bundle stablehlo-translate as part of iree-tools-tf and run serialize_portable_artifact(module, target_version) after the TensorFlow pass pipelines: https://openxla.org/stablehlo/compatibility. Though we'd actually need stablehlo-translate to match the version of StableHLO that TensorFlow has, not the version that IREE has, so it would need to be part of the tensorflow Python package 🤔.
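
For reference, the compatibility doc linked above describes this round-trip with the stablehlo-translate CLI. A minimal sketch (assuming a stablehlo-translate binary matching the StableHLO version that TensorFlow vendors; the file names and target version are illustrative):

# Serialize to a versioned portable artifact, then deserialize it back to StableHLO for the consumer (here, IREE).
stablehlo-translate --serialize iree_input.mlir --target=1.8.0 > iree_input_portable.mlir.bc
stablehlo-translate --deserialize iree_input_portable.mlir.bc > iree_input_stablehlo.mlir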

@ScottTodd (Member) commented Feb 4, 2025

Tried to use the stablehlo nightly releases from close to the installed tensorflow version but I think I'm holding the APIs wrong? Or maybe stablehlo can't handle the additional dialects like ml_program in the program. Can't tell much from "ValueError: failed to serialize module".

# for python 3.11
pip install -f https://github.com/openxla/stablehlo/releases/expanded_assets/dev-wheels stablehlo==1.8.0.1730182293+acc379ab
from mlir.dialects import stablehlo

with open("iree_input_text.mlir", "r") as f:
  data = f.read()
  print(data[-1000:])
  
  serialized = stablehlo.serialize_portable_artifact_str(
      data, stablehlo.get_current_version()
  )
%from_elements_7 : tensor<2xindex>, tensor<2xindex> -> tensor<2xindex>
      %223 = stablehlo.dynamic_broadcast_in_dim %211, %222, dims = [0, 1] : (tensor<?x1001xf32>, tensor<2xindex>) -> tensor<?x1001xf32>
      %224 = stablehlo.dynamic_broadcast_in_dim %213, %222, dims = [0, 1] : (tensor<?x1xf32>, tensor<2xindex>) -> tensor<?x1001xf32>
      %225 = stablehlo.subtract %223, %224 : tensor<?x1001xf32>
      shape.assuming_yield %225 : tensor<?x1001xf32>
    }
    %217 = stablehlo.exponential %216 : tensor<?x1001xf32>
    %218 = stablehlo.reduce(%217 init: %cst_5) applies stablehlo.add across dimensions = [1] : (tensor<?x1001xf32>, tensor<f32>) -> tensor<?xf32>
    %dim_8 = tensor.dim %218, %c0 : tensor<?xf32>
    %from_elements_9 = tensor.from_elements %dim_8, %c1 : tensor<2xindex>
    %219 = shape.shape_of %217 : tensor<?x1001xf32> -> tensor<2xindex>
    %220 = shape.cstr_broadcastable %219, %from_elements_9 : tensor<2xindex>, tensor<2xindex>
    return %210 : tensor<?x1001xf32>
  }
}
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
ValueError: failed to serialize module

The above exception was the direct cause of the following exception:

SystemError                               Traceback (most recent call last)
<ipython-input-13-2f38caeacad8> in <cell line: 0>()
      4   print(data[-1000:])
      5 
----> 6   serialized = stablehlo.serialize_portable_artifact_str(
      7       data, stablehlo.get_current_version()
      8   )

SystemError: <built-in method serialize_portable_artifact_str of PyCapsule object at 0x7ba3a1c1ea30> returned a result with an exception set
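
One rough way to test the "extra dialects" theory is a purely textual scan of the imported file for dialect prefixes (only an approximation, since it also matches attribute and type names):

# Count "dialect.op"-looking tokens by their dialect prefix.
grep -oE '\b[a-z_]+\.[a-z_]+\b' iree_input_text.mlir | cut -d. -f1 | sort | uniq -c | sort -rn | head

If dialects like ml_program, shape, or tensor show up alongside stablehlo (the printed tail above already contains shape.assuming_yield and tensor.from_elements), that would be consistent with the serialization failure.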

@ScottTodd (Member)

Bisected IREE releases:

!pip install iree-compiler==20240226.813
!iree-compile --version
!iree-compile --iree-hal-target-backends=llvm-cpu iree_input.mlir -o mobilenet_cpu.vmfb

IREE (https://iree.dev/):
  IREE compiler version 20240226.813 @ 14895845b13cb776b33116c49998e2629e2fa1b8
  LLVM version 19.0.0git
  Optimized build
-:545:10: error: 'stablehlo.convolution' op attribute 'window_strides' failed to satisfy constraint: 64-bit signless integer elements attribute
-:545:10: note: see current operation: %277 = "stablehlo.convolution"(%16, %26) {batch_group_count = 1 : i64, dimension_numbers = #stablehlo.conv<[b, 0, 1, f]x[0, 1, i, o]->[b, 0, 1, f]>, feature_group_count = 1 : i64, padding = dense<[[0, 1], [0, 1]]> : tensor<2x2xi64>, precision_config = [#stablehlo<precision DEFAULT>, #stablehlo<precision DEFAULT>], rhs_dilation = array<i64: 1, 1>, window_strides = array<i64: 2, 2>} : (tensor<?x128x128x3xf32>, tensor<3x3x3x16xf32>) -> tensor<?x64x64x16xf32>
-:545:10: note: in bytecode version 1 produced by: MLIR20.0.0git
!pip install iree-compiler==20240410.859
!iree-compile --version
!iree-compile --iree-hal-target-backends=llvm-cpu iree_input.mlir -o mobilenet_cpu.vmfb

IREE (https://iree.dev/):
  IREE compiler version 20240410.859 @ b4273a4bfc66ba6dd8f62f6483d74d42a7b936f1
  LLVM version 19.0.0git
  Optimized build
-:549:11: error: failed to legalize operation 'stablehlo.convolution' that was explicitly marked illegal
-:549:11: note: see current operation: %420 = "stablehlo.convolution"(%415, %419) {batch_group_count = 1 : i64, dimension_numbers = #stablehlo.conv<[b, 0, 1, f]x[0, 1, i, o]->[b, 0, 1, f]>, feature_group_count = 16 : i64, padding = dense<1> : tensor<2x2xi64>, precision_config = [#stablehlo<precision DEFAULT>, #stablehlo<precision DEFAULT>], rhs_dilation = array<i64: 1, 1>, window_strides = array<i64: 1, 1>} : (tensor<?x64x64x16xf32>, tensor<3x3x1x16xf32>) -> tensor<?x64x64x16xf32>

So... it's not good that this is broken in multiple ways, but it does give a date range for the "failed to legalize operation" error: between 20240226.813 and 20240410.859. The earliest change that looks relevant is #16561.

@ScottTodd (Member)

Tried https://www.kaggle.com/models/google/mobilenet-v3/tensorFlow2/small-075-224-classification instead of mobilenet-v2 using IREE compiler version 3.1.0rc20250107 @ d2242207764230ad398585a5771f9d54ce91b4c8, got a different error:

-:269:14: error: failed to legalize operation 'stablehlo.dynamic_broadcast_in_dim' that was explicitly marked illegal
-:269:14: note: see current operation: %2591 = "stablehlo.dynamic_broadcast_in_dim"(%176, %2590) <{broadcast_dimensions = array<i64: 0, 1, 2, 3>}> : (tensor<?x112x112x16xf32>, tensor<4xindex>) -> tensor<?x112x112x16xf32>

@ScottTodd (Member)

I think we may drop TensorFlow support, or at least heavily de-emphasize it. I've filed a few issues to help with planning there and sent a PR to switch those docs to a working ONNX example instead.

Hope that helps!
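
For anyone landing here before the docs update, the ONNX route looks roughly like this (a sketch only: the model file name is illustrative, e.g. a MobileNetV2 exported to ONNX or downloaded from the ONNX Model Zoo):

pip install "iree-base-compiler[onnx]" iree-base-runtime

# Import the ONNX model into MLIR, then compile for Vulkan as before.
iree-import-onnx mobilenetv2.onnx -o mobilenetv2_onnx.mlir
iree-compile mobilenetv2_onnx.mlir \
  --iree-hal-target-backends=vulkan-spirv \
  --iree-vulkan-target=ampere \
  -o mobilenetv2_onnx.vmfb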

ScottTodd added a commit that referenced this issue Feb 6, 2025

Progress on #18174, updating some stale documentation.

> [!NOTE]
> Demo here: https://scotttodd.github.io/iree/guides/deployment-configurations/cpu/

Changes included:

* Switch examples to use ONNX instead of TensorFlow given that users are trying to use TensorFlow and failing: #19852
* Add more documentation for CPU targets and features for #18561
* Standardize some formatting across CPU/CUDA/ROCm/Vulkan pages
* Adjust some parts of the ONNX guide now that support is more mature