Memory should be completely released after an OOM happens #17126
Should have already been fixed in #16194. Could you try a nightly build to verify?
With the nightly build I first got this error: and I solved(?) it by setting MXNET_USE_FUSION=0. After that, on the first call after the OOM I get the following (from another model, not the same one that threw the OOM error):
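For reference, a minimal sketch of the fusion workaround mentioned above, assuming the variable has to be visible before mxnet is imported (the alternative is exporting MXNET_USE_FUSION=0 in the shell):

```python
import os

# Disable pointwise operator fusion before importing mxnet so the backend picks it up.
os.environ["MXNET_USE_FUSION"] = "0"

import mxnet as mx
print(mx.__version__)
```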
The problem here is that the call in mshadow isn't requesting memory from our memory pool. We are in the process of deprecating mshadow.
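As a rough illustration of why that matters (a conceptual sketch, not MXNet's actual storage manager): a pooled allocator can react to an allocation failure by dropping its cached blocks and retrying, while a raw allocation issued outside the pool gets none of that recovery.

```python
# Conceptual sketch only; MXNet's real GPU pooled storage manager is implemented in C++.
class CachingAllocator:
    def __init__(self, raw_alloc):
        self.raw_alloc = raw_alloc        # e.g. a thin wrapper around cudaMalloc
        self.cache = {}                   # size -> list of previously freed buffers

    def alloc(self, size):
        if self.cache.get(size):
            return self.cache[size].pop()
        try:
            return self.raw_alloc(size)
        except MemoryError:
            self.cache.clear()            # release everything the pool is holding
            return self.raw_alloc(size)   # retry once after freeing the cache

    def free(self, size, buf):
        self.cache.setdefault(size, []).append(buf)
```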
@leezu I'm willing to, but I'm stuck with the build on a gtest error and I cannot find a complete build guide covering all the prerequisites. Can you help me with this? I have gtest in /usr/src/gtest and /usr/local/lib/gtest/. I tried:

cmake -DBLAS=open -DUSE_CUDA=1 -DUSE_CUDA_PATH=/usr/local/cuda -DUSE_CUDNN=1 -DUSE_MKL_IF_AVAILABLE=ON -DGTEST_ROOT=/usr/local/lib/gtest/ -DCMAKE_BUILD_TYPE=Release -GNinja ..

but I get this error:
@lorenzob thank you. A better error message should indeed be provided here.
@lorenzob Can you provide some repro instructions for the error where you needed to disable fusion?
@leezu Thanks, I just did a straight checkout from GitHub to get started and missed the recursive part from the docs. I still need a little more help with the install. I upgraded cmake to 3.14 (3.10.2 with CUDA 10 does not work, see: https://root-forum.cern.ch/t/intallation-cuda-cublas-device-library-advanced-set-to-notfound/33206/9 ). I also had to manually link liblapack.so: sudo ln -s /usr/lib/x86_64-linux-gnu/liblapack.so.3 /usr/lib/liblapack.so to fix "/usr/bin/ld: cannot find -llapack" from ninja (and I had to install ninja as well). I'm following this doc: https://mxnet.apache.org/get_started/build_from_source. I've seen ubuntu_core.sh and ubuntu_python.sh, but I prefer to do it step by step, and I'm also on Mint 19.2.

To add the module to my conda env I ran: python python/setup.py install. This added an mxnet entry to the pip list, but not an mxnet-cu100. With only the mxnet module it does not work: if I ask for mxnet.version I get AttributeError: module 'mxnet' has no attribute 'version'. I manually added mxnet-cu100 1.6.0b20191102 and it works (still with the broken OOM), but I think that is not the right thing to do. What are the final steps? Is there a doc I missed?

@ptrendx https://github.com/deepinsight/insightface Download the model from here: https://www.dropbox.com/s/tj96fsm6t6rq8ye/model-r100-arcface-ms1m-refine-v2.zip?dl=0 and extract model-r100-ii into models, keeping the subfolder. Copy the attached test into deploy and run it from that folder. I've noticed that I get the error only if the batch size is greater than one.
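A hedged reconstruction of the kind of batched inference call described above (this is not the attached test script; the checkpoint prefix, epoch, batch size, and output handling are assumptions, and the real insightface deploy code may slice an internal output of the symbol):

```python
import mxnet as mx
import numpy as np

prefix, epoch = "models/model-r100-ii/model", 0   # assumed checkpoint location and epoch
batch_size = 16                                    # a batch size > 1 reproduces the error here

# Load the symbol and parameters, then bind an inference-only module on the GPU.
sym, arg_params, aux_params = mx.model.load_checkpoint(prefix, epoch)
mod = mx.mod.Module(symbol=sym, context=mx.gpu(0), label_names=None)
mod.bind(data_shapes=[("data", (batch_size, 3, 112, 112))], for_training=False)
mod.set_params(arg_params, aux_params)

# Forward a batch of random 112x112 images as a stand-in for real face crops.
data = mx.nd.array(np.random.rand(batch_size, 3, 112, 112))
mod.forward(mx.io.DataBatch([data]), is_train=False)
embeddings = mod.get_outputs()[0].asnumpy()
print(embeddings.shape)
```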
@lorenzob you need to uninstall all mxnet packages first (i.e. uninstall mxnet-cu100), and then install the source compiled version. It is expected that only the mxnet package shows up in the pip list. The attribute error you experienced is due to having two versions installed (in my experience). Sorry to hear you experienced issues with the cmake & cuda setup. The requirements on a recent CMake version will be properly declared once #17031 is reviewed and merged.
@leezu I added the mxnet-cuX package after I saw this error: AttributeError: module 'mxnet' has no attribute 'cpu'. Right after the build and install, this is the situation:
The module is found inside the conda env but is completely empty. The last line from the ninja build is this:
Did the build complete correctly?
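A small sanity check for this situation (a sketch; mxnet.runtime.Features() is assumed to be available in recent MXNet builds):

```python
import mxnet as mx

# Check which mxnet is actually being imported and whether CUDA was built in.
print(mx.__version__)                      # should report the self-compiled version
print(mx.__file__)                         # should point into the conda env / source install
features = mx.runtime.Features()
print(features.is_enabled("CUDA"), features.is_enabled("CUDNN"))
print(mx.nd.zeros((2, 2), ctx=mx.gpu(0)))  # fails if the CUDA build is broken
```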
@lorenzob I think the problem is that you didn't uninstall all mxnet versions obtained via pip before installing the self-compiled version, i.e. first uninstall every pip mxnet package and only then install the self-compiled one.
@leezu I did remove all the mxnet packages and also created a new conda env to be sure, but I used the "setup.py install" script to install the module rather than pip directly. Now it works, thanks, but the error is still there:
@leezu @ptrendx I used commit d000c3, which I think includes #17114. I no longer get the "Operator is non-differentiable" error even if I do not set MXNET_USE_FUSION. Setting MXNET_USE_FUSION=0 never solved the OOM error. I still get the OOM if I use more than 10 112x112 images in one batch (for inference). When I get the first OOM I'm actually filling all the available free memory (about 3 GB), and memory usage remains near 100% after the first OOM exception (I added a time.sleep and checked this with nvidia-smi). The first OOM and the second one are different. First:
Second:
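A minimal sketch of the kind of check described above, assuming mx.context.gpu_memory_info is available in this build (the tensor shapes are hypothetical and only chosen to exceed roughly 3 GB):

```python
import time
import mxnet as mx

ctx = mx.gpu(0)
try:
    big = mx.nd.zeros((64, 4096, 4096), ctx=ctx)   # hypothetical allocation large enough to OOM
    mx.nd.waitall()
except mx.base.MXNetError as e:
    print("first OOM:", e)

time.sleep(10)                                      # leave time to inspect nvidia-smi
free, total = mx.context.gpu_memory_info(0)         # bytes free / total on GPU 0
print("free:", free, "total:", total)               # free stays near zero after the first OOM

small = mx.nd.zeros((8, 3, 112, 112), ctx=ctx)      # a small batch that should fit on its own
mx.nd.waitall()                                      # raises the second, different OOM
```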
I seem to be having the same problem.
@lorenzob were you able to solve the out of memory errors? @szha, if the problem is in the call to mshadow, is there a way of working around mshadow to avoid it? Thanks.
@ballcue No, but I did not do further tests with the latest versions since the last post. |
@lorenzob Got it, thank you.
Description
After a "cudaMalloc failed: out of memory" error is raised, everything becomes unusable and further calls raise more out of memory errors (even if a smaller batch is provided).
If I load two models both become unusable after the OOM.
Error Message
To Reproduce
Invoke the model with a batch large enough to get an OOM. Now call it again with a small batch that would not throw an OOM if called on its own. This small batch throws an OOM too.
It looks like the memory from the previous call is not completely released.
What have you tried to solve it?
mxnet.context.current_context().empty_cache()
mxnet.gpu(0).empty_cache()
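For context, a sketch of how those calls were tried (model_forward, large_batch, and small_batch are hypothetical placeholders for the actual inference call):

```python
import mxnet as mx

try:
    out = model_forward(large_batch)           # hypothetical call that triggers the first OOM
    mx.nd.waitall()
except mx.base.MXNetError:
    # attempted workaround: ask the pooled allocator to drop its cached blocks
    mx.context.current_context().empty_cache()
    mx.gpu(0).empty_cache()

out = model_forward(small_batch)               # still raises an OOM afterwards
mx.nd.waitall()
```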
Environment
We recommend using our script for collecting the diagnostic information. Run the following command and paste the outputs below: