Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DNNL's libiomp.dylib clashes with system's MKL #907

Closed
blueberry opened this issue Jul 15, 2020 · 32 comments
Closed

DNNL's libiomp.dylib clashes with system's MKL #907

blueberry opened this issue Jul 15, 2020 · 32 comments
Labels

Comments

@blueberry
Copy link

Hi,

This is literary the same issue as the previously resolved #629, but on macOS.

Namely, on my Linux box everything works well, according to the solution from #629. However, on my macOS laptop, the duplicate libiomp issue arises because javacpp's dnnl jar bundles libiomp.dylib and tries to load it, but my system's MKL have already loaded it.

Interestingly, the same setup on Linux has no issues.

I have set up jvm options, and they work on Linux, but seem to be ignored on macOS?
"-Dorg.bytedeco.javacpp.mklml.load=mkl_rt"
"-Dorg.bytedeco.javacpp.pathsfirst=true"

Another hint: when I go into my local maven ropository's javacpp dnnl jar, and butcher it by deleting libiomp.dylib and repackaging it again, this issue disappears.

Any ideas how to approach this?

@saudet
Copy link
Member

saudet commented Jul 15, 2020

libiomp.dylib isn't getting bundled. Where did you see that?

@blueberry
Copy link
Author

A typo: it's libomp.dylib in .m2/repository/org/bytedeco/dnnl/1.5-1.5.3

But everything else stands: when I remove it, the problem disappears.

@saudet
Copy link
Member

saudet commented Jul 15, 2020

It's a dependency of libdnnl.1.dylib. It won't be able to load without it.

@blueberry
Copy link
Author

Update: the linux binary bundles libgomp.so, while Windows doesn't bundle anything.

Regarding it being dependency of libdnnl, do you have idea why it isn't a problem on linux, while it is a problem on mac?
Also, is there a way to tell javacpp to not load it, similarly to the org.bytedeco.javacpp.mklml.load=mkl_rt property?

@saudet
Copy link
Member

saudet commented Jul 15, 2020

Not loading it will result in DNNL not loading. It won't work.
No, I do not know why it seems to work fine on Linux.

@blueberry
Copy link
Author

Interestingly, 1.3-1.5.3 does not bundle it.

@saudet
Copy link
Member

saudet commented Jul 15, 2020

The proper solution is probably to build DNNL without OpenMP, using TBB instead.
If you have some time to give it a try, please do!

@blueberry
Copy link
Author

Now it's 3AM here, so not now, but otherwise yes. Can you point me to the proper place to change that?

@blueberry
Copy link
Author

blueberry commented Jul 15, 2020

Also, did you see my message about openmp not being bundled in 1.3 and earlier versions? Was it introduced in DNNL upstream, or it's a javacpp-specific change?

@saudet
Copy link
Member

saudet commented Jul 15, 2020

@saudet
Copy link
Member

saudet commented Jul 15, 2020

It's also possible to build without OpenMP or TBB, that's what it does on Mac by default.

@blueberry
Copy link
Author

Now I'm confused. It's building it without OMP and TBB, yet, the lib is bundled?

@saudet
Copy link
Member

saudet commented Jul 15, 2020

You can check old versions of the cppbuild.sh in the repository as well. It's not something that changes only upstream, no.

@saudet
Copy link
Member

saudet commented Jul 15, 2020

No, it's building with OpenMP. That's why it's bundling the library, otherwise it wouldn't be able to load.

@saudet
Copy link
Member

saudet commented Jul 15, 2020

You might be OK without OpenMP or TBB, but some users actually use DNNL for performance reasons, so I don't think it would be a good idea to just disable OpenMP. I think using TBB would be satisfactory, so let's please do that.

@blueberry
Copy link
Author

Now I see that this change was made recently:
a65e4f6#diff-a614b90320b01007ddc3bc0ec9ef679f

If I understood correctly, previously it did not bundle OpenMP, and there were performance issues (on mac or on all systems)?

I definitely care about performance, too. In my experience, previous builds on Linux were equally performant as the latest (might be different on mac).

At least we identified when this issue appeared, and you gave me valuable pointer about TBB. Will try to make it work in a few days, and will report here (or cry for help :)

saudet added a commit that referenced this issue Jul 20, 2020
…ONNX Runtime, etc (issue #907)

 * Fix loading issue with `opencv_ximgproc` (issue #911)
saudet added a commit that referenced this issue Jul 20, 2020
…ONNX Runtime, etc (issue #907)

 * Fix loading issue with `opencv_ximgproc` (issue #911)
@saudet
Copy link
Member

saudet commented Jul 20, 2020

Ok, I've made the switch. Please give it a try with 1.5.1-1.5.4-SNAPSHOT. Thanks!

@blueberry
Copy link
Author

Hi @saudet

It works better now. I still have some issue on macOS, but that is probably DNNL, instead of javacpp. I've looked at the recent changes, and -liomp switch is still there, and the only change seems to be DNNL 1.5 -> 1.5.1 Can you be more specific about what switch you are referring to, so I can investigate further?

Thank you!

@saudet
Copy link
Member

saudet commented Jul 20, 2020

Just the switch from OpenMP to TBB. There's no more OpenMP or -lomp for that matter.

@blueberry
Copy link
Author

I was expecting that, but looking at this:
https://github.com/bytedeco/javacpp-presets/tree/master/dnnl
The latest change seems to be 4 days ago, and the relevant macOS line from the cppbuild.sh is still:

"$CMAKE" -DCMAKE_BUILD_TYPE=Release -DCMAKE_INSTALL_PREFIX=$INSTALL_PATH -DCMAKE_INSTALL_LIBDIR="lib" -DCMAKE_C_FLAGS="-I/usr/local/include -L/usr/local/lib -lomp" -DCMAKE_CXX_FLAGS="-I/usr/local/include -L/usr/local/lib -lomp" -DARCH_OPT_FLAGS='' -DMKLDNN_BUILD_EXAMPLES=OFF -DMKLDNN_BUILD_TESTS=OFF .
        make -j $MAKEJ

I must be looking at the wrong place?

@blueberry
Copy link
Author

Hm, something is off here. On Linux, 1.5.1-1.5.4-SNAPSHOT became almost twice as slow in my examples as 1.5-1.5.4-SNAPSHOT.

Did you make the switch to TBB on macOS only, or on all systems (Linux and Windows were not affected by the OpenMP loading issue previously).

@saudet
Copy link
Member

saudet commented Jul 20, 2020

I must be looking at the wrong place?

Yes, you are, see commit 083ec08.

Hm, something is off here. On Linux, 1.5.1-1.5.4-SNAPSHOT became almost twice as slow in my examples as 1.5-1.5.4-SNAPSHOT.

It's possible that TBB is slower. Please ask upstream about this, and figure out what you'd like to do.

@blueberry
Copy link
Author

I know what I'd like. I was happy with OpenMP on Linux, I am happy with TBB on macOS :)

Is it possible to keep OpenMP on Linux and Windows, and only fix the macOS conflict by using TBB on macOS?

@saudet
Copy link
Member

saudet commented Jul 20, 2020

Why? You're saying it's not working for you on Mac anyway, so if you're happy with OpenMP everywhere, we'll switch back to that! Please make up your mind or I will make it up for you.

@blueberry
Copy link
Author

blueberry commented Jul 20, 2020

I guess I didn't explain this well. The TBB switch fixed the OpenMP loading issue on macOS.

@saudet
Copy link
Member

saudet commented Jul 20, 2020

Of course, it doesn't use OpenMP anymore. In any case, I'll leave it at TBB for now, and when you figure out how you want it exactly over the next few days/weeks, we'll repatch that accordingly. Sounds good?

@blueberry
Copy link
Author

Sure! Thanks!

@blueberry
Copy link
Author

blueberry commented Jul 20, 2020

I propose to make the change as minimal as possible, and just fix the macOS issue with minimal changes, without changing what worked well before (Linux and Windows).
The build that introduced the macOS issue by explicitly linking lomp is this one: a65e4f6#diff-b3cd65e7169d82012ff0d6582f35df4d

My suggestion is to:

  1. Revert the Linux (and Windows) settings to whatever they were before the TBB change (instead of explicitly linking OpenMP or TBB)
  2. Keep macOS build with explicit TBB as it is now, as this fixes the issue.

This way, DNNL will work well on Linux and Windows, and the issue will be resolved on macOS.

  • Why it is (in my opinion) better than the total switch to TBB on all systems: OpenMP is considerably faster for moderate work sizes and TBB seems to not catch up completely for even larger workloads. This is something I confirmed on my machine, but I've also thrawled discussions on Intel forums that confirm that. Even if TBB had slight advantage somewhere, OpenMP is the default for DNN, so I guess it should not be changed unless really necessary (as the case on macOS).

  • Why I argue against the explicit linking to OpenMP (or TBB) on Linux and Windows: The macOS issue was introduced by explicit linking. Linux used multiple threads through OpenMP with the old settings that did not mention OpenMP explicitly. I am not sure why and how, but I am 100% positive that on my machines DNNL used all cores and the speed was as expected with this minimal setting (and considerably faster than TBB). I also tested and confirmed that by simply switching back and forth between 1.5-1.5.4-SNAPSHOT and 1.5.1-1.5.4-SNAPSHOT.

@saudet
Copy link
Member

saudet commented Jul 21, 2020

You're saying the build doesn't work for you on Mac anyway, so please get that working first, and we'll see after that.

saudet added a commit that referenced this issue Jul 22, 2020
@saudet
Copy link
Member

saudet commented Jul 22, 2020

It turns out that libomp.dylib is the same library as libiomp5.dylib. I've reverted to OpenMP, hacked the build on Mac to use libiomp5.dylib and it seems to work fine with MKL for me. Please let me know if you still encounter any issues though. Thanks!

@blueberry
Copy link
Author

@saudet I can confirm that it works on Linux and on macOS (albeit strangely it clashes on mac when I run my JVM from emacs, but I can live with that constraint; when I start JVM from terminal, it loads the libs without complaints).

Thank you so much. As far as I'm concerned, this issue can be closed.

@saudet
Copy link
Member

saudet commented Sep 10, 2020

Workaround included in version 1.5.4. Thanks for reporting and for testing this!

@saudet saudet closed this as completed Sep 10, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

2 participants