NNI remote mode is not working as expected #4035
Hi @OuYaozhong, thanks for your detailed description of this issue. There is a fix for the GPU release issue that will hopefully resolve your problem, as I mentioned at #3905 (comment). It is already in the master branch. You can build NNI from source code or wait for the next release (v2.5), which will be out soon.
@acured Thanks for your help. I have tried the updated source code, but it seems to solve only one of the problems; the other situations mentioned in my original comment still exist. Here I describe the problem again with more log files.
In Brief
In Detail
Fundamental Info:
All the experiments are run in remote mode.
Experiments
Each condition shows a different phenomenon.
All the running logs for these conditions, from both the remote slave and the local master, are packaged and uploaded.
That is all.
Appendix
1. The Dockerfile used to build the docker image
2. The logs are packaged in a tar file and attached to this comment.
Hi @OuYaozhong, about 'trialConcurrency', there is a note at https://nni.readthedocs.io/en/stable/reference/experiment_config.html?highlight=trialConcurrency#trialconcurrency that says 'The real concurrency also depends on hardware resources and may be less than this value.' Could you have a try with 'trialGpuNumber: 2'?
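For concreteness, the suggestion above corresponds to an experiment-config fragment roughly like the following; the values are only illustrative, not taken from the reporter's actual file:

```yaml
# Illustrative fragment only: request 2 GPUs per trial and cap concurrency at 4.
# The effective concurrency can still be lower if the hardware cannot supply
# 2 GPUs per trial for 4 trials at once.
trialConcurrency: 4
trialGpuNumber: 2
trialCommand: python3 mnist.py
```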
Have you tried setting trialGpuNumber: none, or leaving it unset?
That was in local mode; I will try remote mode later.
Yeah, in local mode it works for me too.
I see the "pythonPath" setting in your config. Is that a folder or a Python file? It should be a folder; if .../python is not a folder, can you change it to .../bin and have a try?
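To illustrate, in the remote machine list pythonPath should point at the environment's bin directory rather than at the python executable itself. The host, user and path below are placeholders, assuming a conda environment:

```yaml
trainingService:
  platform: remote
  machineList:
    - host: 192.0.2.10                                  # placeholder address
      user: worker                                      # placeholder user
      # pythonPath must be a directory (the env's bin/), not .../bin/python
      pythonPath: /home/worker/miniconda3/envs/nni/bin  # placeholder path
```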
Yeah, you are right. I had noticed that problem before but forgot about it recently. I am finding some resources to check it. Please wait a while and I will post the result as soon as possible.
@acured In fact, I found that the problem is related to the reuse setting. If I change that setting, the problem disappears. But it is strange why this is related to reuse. Before I opened this issue, I had noticed this parameter and thought it was not related to this problem, because it seems to be used only to accelerate remote training. Can you give some explanation?
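For reference, the setting in question sits in the training service section of the experiment config; a sketch is below. The exact spelling depends on the NNI version (reuseMode in v2-style configs, remoteConfig.reuse in the legacy format), and which value resolves the issue is not spelled out above, so treat the value shown as an assumption:

```yaml
# v2-style config; in the legacy format the same switch is remoteConfig.reuse.
trainingService:
  platform: remote
  reuseMode: false          # assumption: turning environment reuse off avoided the stuck trials
  machineList:
    - host: 192.0.2.10      # placeholder
      user: worker          # placeholder
```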
Glad your problem is solved. For more information about "reuse" mode you can look here.
Yeah, I read the document again yesterday, but I still wonder why the problem is related to the reuse mode.
I have encountered some strange behavior of NNI in remote mode.
In brief, what happened to me is the same as what @guoxiaojie-schinper reported.
I am running the demo from the NNI repo, /example/trial/mnist-pytorch.
If config_remote.yml is run locally on the remote machine (with the trainingService changed to local, of course), everything is normal.
But if the same config_remote.yml is run from my local machine (a MacBook Pro), with the slave worker being a workstation with an Nvidia GeForce 2080 GPU, it does not work, exactly as @guoxiaojie-schinper described.
In detail,
Environment: NNI on both the local and remote machines is installed by
python3 -m pip install --upgrade nni
in a conda environment.
config_remote.yml (if used for remote mode):
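Roughly, the remote-mode config for this example looks like the sketch below; the host, user, credentials and paths are placeholders rather than the values actually used:

```yaml
searchSpaceFile: search_space.json
trialCommand: python3 mnist.py
trialCodeDirectory: .
trialGpuNumber: 1
trialConcurrency: 4
tuner:
  name: TPE
  classArgs:
    optimize_mode: maximize
trainingService:
  platform: remote
  machineList:
    - host: 192.0.2.10                                  # placeholder address
      user: worker                                      # placeholder user
      sshKeyFile: ~/.ssh/id_rsa                         # placeholder credential
      pythonPath: /home/worker/miniconda3/envs/nni/bin  # placeholder env path
```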
config_remote.yml (if used in local mode):
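Assuming only the training service block differs from the remote sketch above, the local variant would look roughly like this:

```yaml
# Same fields as the remote sketch, with only the training service swapped.
trainingService:
  platform: local
  useActiveGpu: true   # assumption: set when the GPU is also used by other processes
```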
Description:
-> 2.1 If I set `trialGpuNumber = 1` and `trialCommand = python3 mnist.py`, the phenomenon is the same as @guoxiaojie-schinper reported. All the tasks show waiting status forever, and the NNIManager log shows: INFO (TrialDispatcher) TrialDispatcher: 1 live env, and 1 reusable, but no GPU available so request a new one. Running top and nvidia-smi on the remote machine confirms that the task is not actually running, given the low CPU usage and the absence of any related GPU process. The waiting status can persist for several hours. Whether or not nvidia-smi is added to the `trialCommand` (as in the commented-out trialCommand in the config), the phenomenon is the same.
-> 2.2 If I set `trialGpuNumber > 1`, NNI tells me this exceeds the limit and that none of the machines can satisfy it. In fact my remote machine has only one GPU, so this behavior is reasonable.
-> 2.3 If I set `trialGpuNumber = 0` and `trialCommand = python3 mnist.py`, whether inside or outside the docker container, even though `trialConcurrency = 4` is set, only one task runs and the other three keep waiting until the running one finishes. Unlike conditions 2.1 and 2.2 above, where the tasks never run, in this condition the tasks run one by one, beyond the control of the `trialConcurrency` argument. The task runs on the CPU (it takes up to 4 minutes for this mnist demo from the NNI repo to reach the full 800% usage of my 8-core i7 CPU). It just takes longer than on the GPU; it still runs rather than waiting forever. (I am not confident about the following, because I do not remember it exactly and the situation shows up rarely: maybe something randomly uses the GPU in the first task or the first several tasks.)
-> 2.4 If I set `trialGpuNumber = 0` again, but add nvidia-smi to the `trialCommand`, i.e. `trialCommand: nvidia-smi && which python3 && python3 mnist.py`, and run outside the docker container, the task runs with the GPU after about 4 minutes, which is much slower than the normal case in 2.5. The GPU usage is confirmed both by the nvidia-smi command on the remote machine (a related process shows up) and by the output speed of NNI. But it seems that each subsequent task waits even longer before it starts using the GPU.
-> 2.5 If I set `trialGpuNumber = 0` again, but add nvidia-smi to the `trialCommand`, i.e. `trialCommand: nvidia-smi && which python3 && python3 mnist.py`, and run inside the docker container, the tasks keep running with the GPU normally, but still one by one. (The settings varied across conditions 2.1–2.5 are condensed in the sketch below.)
Originally posted by @OuYaozhong in #3905 (comment)
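To make the pattern across conditions 2.1–2.5 easier to compare, the variations reduce to two knobs in the config; the block below only restates the conditions above and is not an additional experiment:

```yaml
# 2.1  trialGpuNumber: 1, plain trialCommand            -> all trials wait forever
# 2.2  trialGpuNumber: 2  (more than available GPUs)    -> rejected, as expected
# 2.3  trialGpuNumber: 0, plain trialCommand            -> CPU only, one trial at a time
# 2.4  trialGpuNumber: 0, nvidia-smi prefix, no docker  -> GPU after ~4 min, one at a time
# 2.5  trialGpuNumber: 0, nvidia-smi prefix, in docker  -> GPU used normally, one at a time
trialGpuNumber: 0
trialCommand: nvidia-smi && which python3 && python3 mnist.py
```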