Replies: 6 comments
-
Out of memory (OOM). Did you use MPI? DeePMD-kit only supports OpenMP for training.
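A back-of-the-envelope way to see why the se_a descriptor can exhaust GPU memory: its environment matrix scales with batch size × atoms × selected neighbors. A minimal sketch, where every concrete number is an illustrative assumption rather than a value from this thread:

```python
# Rough, hypothetical estimate of the memory held by the se_a
# environment matrix for one batch. The factor 4 inside the product
# is the four components stored per neighbor (1/r plus the scaled
# relative-position vector); all sizes here are assumptions.
def se_a_env_mat_bytes(batch_size, natoms, sel, bytes_per_float=4):
    return batch_size * natoms * sel * 4 * bytes_per_float

bytes_used = se_a_env_mat_bytes(batch_size=128, natoms=500, sel=300)
print(f"{bytes_used / 1024**3:.2f} GiB")  # → 0.29 GiB for this one tensor
```

Intermediate activations in the embedding net multiply this several times over, so shrinking `sel` or the batch size is usually the first lever to try.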
-
I do use OpenMP for the training, as you can see from the job file that I use to submit it: module purge
-
I guess you submitted to a GPU card that is already in use. Please check it.
-
I submit the job with Slurm, and I assume Slurm gives me a free card. But anyway, I accessed the node via ssh and ran "nvidia-smi"; the output is the following: Thu Oct 29 16:49:48 2020 +-----------------------------------------------------------------------------+ which suggests that I am indeed the only user of the GPU.
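If it helps to script that check, nvidia-smi can also emit machine-readable output that is easy to parse. The parser and the sample line below are hypothetical, just to show the idea:

```python
# Hypothetical parser for one line of
#   nvidia-smi --query-gpu=index,memory.used,memory.total --format=csv,noheader
# The sample line is made up; real output depends on the node.
def parse_gpu_memory(csv_line):
    index, used, total = [field.strip() for field in csv_line.split(",")]
    # fields look like "15140 MiB"; keep just the integer MiB value
    return int(index), int(used.split()[0]), int(total.split()[0])

sample = "0, 15140 MiB, 16160 MiB"
print(parse_gpu_memory(sample))  # → (0, 15140, 16160)
```

A card showing near-total memory already used before your job starts would point to contention rather than a bug in the training itself.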
-
Hmmm, can you run a test? I feel weird about this...
-
I tried it and it works perfectly.
-
Hi,
I am trying to generate a model for a ternary alloy using a GPU. When I use the SeA descriptor, the training crashes with the error "Resource exhausted", as you can see in the attached slurm_err file. I also attach the input file (test_json).
On my cluster (an IBM Power9), I have installed the DeePMD using the following protocol:
$ module load autoload profile/deeplrn tensorflow/2.3.0--cuda--10.1 cmake
$ python3 -m venv deepmd
$ source deepmd/bin/activate
$ pip3 install scikit-build
$ pip3 install setuptools_scm
$ pip3 install --no-use-pep517 deepmd-kit
I tested the installation using the SeAR descriptor and it works perfectly (although it doesn't accelerate when using multiple GPUs; is this normal?).
I have also tested the SeA descriptor case on another cluster where DeePMD was installed via conda, as you suggested. It gives the same error as well.
I performed many tests with different batch sizes (from 128 down to 1) and database sizes (from 105K configurations down to 14K), and I got the same error in every case.
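For reference, the knobs mentioned above live in the training input JSON. A minimal, hypothetical fragment (the field names follow the DeePMD-kit input convention, but the values are placeholders, not the contents of the attached test_json):

```json
{
  "model": {
    "descriptor": {
      "type": "se_a",
      "sel": [60, 60, 60],
      "rcut": 6.0
    }
  },
  "training": {
    "batch_size": 1
  }
}
```

For a ternary alloy, `sel` holds one maximum-neighbor count per element, and oversized entries inflate the descriptor's memory footprint.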
Any suggestions are sincerely welcome, thanks.
test_json.txt
slurm_err.txt