train a new flu biLSTM model #28
Replies: 2 comments 1 reply
-
Oh, I just found that the pics can't be opened by clicking directly; please copy and paste the links into your browser. Thank you!
-
It seems like the issue is that TensorFlow is not using a GPU -- just make sure that this is the case, e.g.: https://stackoverflow.com/questions/38009682/how-to-tell-if-tensorflow-is-using-gpu-acceleration-from-inside-python-shell. 32 GB should be sufficient, though make sure that the sequence lengths of the HAs are around the same and that there are no long outliers. If I remember correctly, training a model for one full pass over a training dataset of ~50k HAs with the code should take a day or so.
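For reference, a quick way to run that check from inside Python (a minimal sketch using the standard TF 2.x APIs `tf.config.list_physical_devices` and `tf.test.is_built_with_cuda`):

```python
import tensorflow as tf

# list the GPUs TensorFlow can actually see; an empty list means
# training will silently fall back to the CPU
gpus = tf.config.list_physical_devices('GPU')
print('GPUs visible to TensorFlow:', gpus)

# also check whether this TF binary was built with CUDA support at all
print('Built with CUDA:', tf.test.is_built_with_cuda())
```

If the list is empty on a machine with V100s, the usual culprits are a CPU-only TensorFlow build or a CUDA/cuDNN version mismatch.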
-
Hi Dr. Hie,
I am training a new flu model (with >80,000 HA sequences) using your biLSTM code now (I set up the conda env with the same libraries and versions as you suggested).
I used a Tesla V100 (32 GB of GPU memory) and kept all the parameters and hyper-parameters of your model, but I ran into an "out of memory" error. So I tried the multi-GPU training strategy provided by TF, because there are 4 Tesla V100s in my machine.
In the multi-GPU implementation, I put the contents of "get_model" inside "with strategy.scope():", as shown in the following pic:
https://github.com/patience111/records/blob/main/mutation.png, and I found that training worked.
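The setup above can be sketched roughly like this (a minimal stand-in, not the repository's actual `get_model`; the layer sizes and vocabulary are illustrative assumptions):

```python
import tensorflow as tf

# sketch of wrapping model construction in a MirroredStrategy scope so
# variables are mirrored across the available GPUs (stand-in model only)
strategy = tf.distribute.MirroredStrategy()
print('Replicas in sync:', strategy.num_replicas_in_sync)

def get_model(vocab_size=25, embed_dim=20, hidden_dim=128):
    # placeholder for the repo's get_model; dimensions are assumptions
    model = tf.keras.Sequential([
        tf.keras.layers.Embedding(vocab_size, embed_dim, mask_zero=True),
        tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(hidden_dim)),
        tf.keras.layers.Dense(vocab_size, activation='softmax'),
    ])
    model.compile(optimizer='adam',
                  loss='sparse_categorical_crossentropy')
    return model

with strategy.scope():
    # everything that creates variables must happen inside the scope
    model = get_model()
```

With no visible GPUs, `MirroredStrategy` falls back to a single CPU replica, so the same script runs either way.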
But when I opened "flu_train.log", I found that training is very slow (I've trained some simple biLSTMs before, and they weren't this slow), and it seemed that training is not running on the GPUs.
https://github.com/patience111/records/blob/main/flu_biLSTM_train.png
I noticed two pieces of info:
"2022-03-09 08:52:41.934760: I tensorflow/core/platform/cpu_feature_guard.cc:143] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX512F" and
"2022-03-09 08:57:09.508503: E tensorflow/core/grappler/optimizers/meta_optimizer.cc:563] function_optimizer failed: Invalid argument: Node 'replica_1/model/lstm_1/StatefulPartitionedCall_replica_1/StatefulPartitionedCall_2_24': Connecting to invalid output 30 of source node replica_1/model/lstm_1/StatefulPartitionedCall which has 30 outputs."
Especially the latter one with "E". I searched online but couldn't find a good solution (some said this problem is caused by the TF multi-GPU strategy and that the latest TF 2.8 has fixed it).
So do you have any suggestions? Thank you very much!
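(One way to check whether ops are actually landing on a GPU is TF's built-in device-placement logging, `tf.debugging.set_log_device_placement`; a minimal sketch:)

```python
import tensorflow as tf

# print the device each op is assigned to; GPU-placed ops log "GPU:0", etc.
tf.debugging.set_log_device_placement(True)

# run a tiny op to trigger a placement log line
a = tf.constant([[1.0, 2.0]])
b = tf.constant([[3.0], [4.0]])
c = tf.matmul(a, b)
print(c.numpy())  # [[11.]]
```

If the logged devices are all CPU, the slowdown in "flu_train.log" would be explained by the GPUs not being used at all.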