dp_train with GPU (maybe Bug) #35

fkxie · 2019-05-30T15:20:03Z

Hi,
I want to accelerate training with a GPU.
When I run dp_test and dp_frz, they both print GPU device information like this:

2019-05-30 15:09:16.824616: I tensorflow/core/platform/cpu_feature_guard.cc:140] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 AVX512F FMA
2019-05-30 15:09:22.765409: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1356] Found device 0 with properties: 
name: Tesla V100-PCIE-32GB major: 7 minor: 0 memoryClockRate(GHz): 1.38
totalMemory: 31.72GiB freeMemory: 31.31GiB
2019-05-30 15:09:23.098334: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1356] Found device 1 with properties: 
name: Tesla V100-PCIE-32GB major: 7 minor: 0 memoryClockRate(GHz): 1.38
totalMemory: 31.72GiB freeMemory: 31.31GiB
2019-05-30 15:09:23.437009: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1356] Found device 2 with properties: 
name: Tesla V100-PCIE-32GB major: 7 minor: 0 memoryClockRate(GHz): 1.38

But when I run dp_train, no GPU information is printed, so I suspect it is still training on the CPU. This may be a bug.

F.K.xie

jameswind · 2019-05-31T08:35:06Z
Maintainer

Please try:
1) run nvidia-smi on your compute node to check whether a GPU is actually in use;
2) run the training example with water_smth.json as the input file. Training should take about 2 s per 100 batches.

Best,
Linfeng
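
The first check can be scripted so it runs alongside dp_train. This is a minimal sketch, assuming Python 3.7+ on the compute node and an NVIDIA driver that provides nvidia-smi; the helpers query_gpus and parse_gpu_csv are hypothetical and not part of DeePMD-kit. It asks nvidia-smi for each GPU's utilization and memory use in CSV form.

```python
import shutil
import subprocess
from typing import List, Optional, Tuple

# Per-GPU fields to request; each is a documented nvidia-smi --query-gpu property.
QUERY_FIELDS = "index,name,utilization.gpu,memory.used"

# (index, name, utilization in %, memory used in MiB)
GpuRow = Tuple[int, str, Optional[int], Optional[int]]


def _to_int(field: str) -> Optional[int]:
    """Convert a numeric CSV field; values such as '[N/A]' become None."""
    field = field.strip()
    return int(field) if field.isdigit() else None


def parse_gpu_csv(text: str) -> List[GpuRow]:
    """Parse `--format=csv,noheader,nounits` output into GpuRow tuples."""
    rows: List[GpuRow] = []
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        index, rest = line.split(",", 1)
        # rsplit keeps any comma inside the GPU name attached to the name
        name, util, mem = rest.rsplit(",", 2)
        rows.append((int(index), name.strip(), _to_int(util), _to_int(mem)))
    return rows


def query_gpus() -> Optional[List[GpuRow]]:
    """Return per-GPU usage, or None if nvidia-smi is missing or fails on this node."""
    if shutil.which("nvidia-smi") is None:
        return None
    result = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY_FIELDS}", "--format=csv,noheader,nounits"],
        capture_output=True,
        text=True,
    )
    if result.returncode != 0:
        return None
    return parse_gpu_csv(result.stdout)


if __name__ == "__main__":
    gpus = query_gpus()
    if gpus is None:
        print("nvidia-smi is unavailable on this node")
    else:
        for index, name, util, mem in gpus:
            print(f"GPU {index} ({name}): utilization {util}%, memory used {mem} MiB")
```

Run it in a second shell while training. TensorFlow usually reserves most of a visible GPU's memory as soon as it starts, so the utilization column is the better sign that dp_train is actually computing on the GPU.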

fkxie · 2019-05-31T13:51:37Z
Author

Hi,

The training time on my machine is about 2 s per 100 batches, so there may be no problem after all.

Anyway, thanks for your reply.

Best,
F.K.xie