train a new flu biLSTM model #28
Replies: 2 comments 1 reply
-
Oh, I just found that the pics can't be opened by clicking directly; please copy and paste the links into your browser. Thank you!
-
It seems like the issue is that TensorFlow is not using a GPU -- just make sure that this is the case, e.g.: https://stackoverflow.com/questions/38009682/how-to-tell-if-tensorflow-is-using-gpu-acceleration-from-inside-python-shell. 32 GB should be sufficient, though make sure that the sequence lengths of the HAs are around the same and that there are no long outliers. If I remember correctly, training a model for one full pass over a training dataset of ~50k HAs with the code should take a day or so.
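For reference, a quick way to run that check from inside Python (a minimal sketch using the standard TF 2.x APIs `tf.config.list_physical_devices` and `tf.test.is_built_with_cuda`):

```python
import tensorflow as tf

# list the GPUs TensorFlow can actually see; an empty list means
# training will silently fall back to the CPU
gpus = tf.config.list_physical_devices('GPU')
print('GPUs visible to TensorFlow:', gpus)

# also check whether this TF binary was built with CUDA support at all
print('Built with CUDA:', tf.test.is_built_with_cuda())
```

If the list is empty on a machine with V100s, the usual culprits are a CPU-only TensorFlow build or a CUDA/cuDNN version mismatch.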
-
Hi Dr. Hie,
I am training a new flu model (with >80,000 HA sequences) using your biLSTM code now (I set up the conda env with the same libraries and versions as you suggested).
I used a Tesla V100 (32 GB of GPU memory) and kept all the parameters and hyper-parameters of your model, but I ran into an "out of memory" error. So I tried the multi-GPU training strategy provided by TF, because there are 4 Tesla V100s in my machine.
In the multi-GPU implementation, I put the contents of "get_model" inside "with strategy.scope():", as shown in the following pic:
https://github.com/patience111/records/blob/main/mutation.png, and I found that training worked.
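The setup above can be sketched roughly like this (a minimal stand-in, not the repository's actual `get_model`; the layer sizes and vocabulary are illustrative assumptions):

```python
import tensorflow as tf

# sketch of wrapping model construction in a MirroredStrategy scope so
# variables are mirrored across the available GPUs (stand-in model only)
strategy = tf.distribute.MirroredStrategy()
print('Replicas in sync:', strategy.num_replicas_in_sync)

def get_model(vocab_size=25, embed_dim=20, hidden_dim=128):
    # placeholder for the repo's get_model; dimensions are assumptions
    model = tf.keras.Sequential([
        tf.keras.layers.Embedding(vocab_size, embed_dim, mask_zero=True),
        tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(hidden_dim)),
        tf.keras.layers.Dense(vocab_size, activation='softmax'),
    ])
    model.compile(optimizer='adam',
                  loss='sparse_categorical_crossentropy')
    return model

with strategy.scope():
    # everything that creates variables must happen inside the scope
    model = get_model()
```

With no visible GPUs, `MirroredStrategy` falls back to a single CPU replica, so the same script runs either way.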
But when I opened "flu_train.log", I found that training is very slow (I've trained some simple biLSTMs before, and they weren't this slow), and it seemed that training is not running on the GPUs.
https://github.com/patience111/records/blob/main/flu_biLSTM_train.png
I noticed two pieces of info:
"2022-03-09 08:52:41.934760: I tensorflow/core/platform/cpu_feature_guard.cc:143] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX512F" and
"2022-03-09 08:57:09.508503: E tensorflow/core/grappler/optimizers/meta_optimizer.cc:563] function_optimizer failed: Invalid argument: Node 'replica_1/model/lstm_1/StatefulPartitionedCall_replica_1/StatefulPartitionedCall_2_24': Connecting to invalid output 30 of source node replica_1/model/lstm_1/StatefulPartitionedCall which has 30 outputs."
Especially the latter one with "E". I searched online but couldn't find a good solution (some said this problem is caused by the TF multi-GPU strategy and that the latest TF 2.8 has fixed it).
So do you have any suggestions? Thank you very much!
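(One way to check whether ops are actually landing on a GPU is TF's built-in device-placement logging, `tf.debugging.set_log_device_placement`; a minimal sketch:)

```python
import tensorflow as tf

# print the device each op is assigned to; GPU-placed ops log "GPU:0", etc.
tf.debugging.set_log_device_placement(True)

# run a tiny op to trigger a placement log line
a = tf.constant([[1.0, 2.0]])
b = tf.constant([[3.0], [4.0]])
c = tf.matmul(a, b)
print(c.numpy())  # [[11.]]
```

If the logged devices are all CPU, the slowdown in "flu_train.log" would be explained by the GPUs not being used at all.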