This repository has been archived by the owner on Sep 18, 2024. It is now read-only.

NNI remote mode is working beyond expectation #4035

Closed
OuYaozhong opened this issue Aug 6, 2021 · 18 comments
@OuYaozhong

OuYaozhong commented Aug 6, 2021

I have run into some strange behavior of NNI in remote mode.


In brief, what happened to me is the same as @guoxiaojie-schinper.

I am running the demo from the NNI repo, /example/trial/mnist-pytorch.

If I run config_remote.yml directly on the remote machine (with trainingService changed to local, of course), everything works as expected.

But if the same config_remote.yml is run on my local machine (a MacBook Pro), with a workstation carrying an Nvidia GeForce 2080 GPU as the slave worker, it fails in exactly the same way @guoxiaojie-schinper described.


In detail:

Environment: NNI on both the local and remote machines is installed with python3 -m pip install --upgrade nni inside a conda environment.

config_remote.yml (remote mode):

searchSpaceFile: search_space.json
#trialCommand: nvidia-smi && which python3 && python3 mnist.py
trialCommand: python3 mnist.py
trialGpuNumber: 1
trialConcurrency: 4
maxTrialNumber: 20
nniManagerIp: 10.113.217.251
debug: true
tuner:
  name: TPE
  classArgs:
    optimize_mode: maximize
trainingService:
  platform: remote
  machineList:
    - host: 10.113.217.230 
      user: root
      sshKeyFile: ~/.ssh/nni_docker
      port: 8145
      pythonPath: /opt/conda/envs/py38torch190cu111/bin
      useActiveGpu: true
      maxTrialNumberPerGpu: 8

config_remote.yml (local mode):

searchSpaceFile: search_space.json
#trialCommand: nvidia-smi && which python3 && python3 mnist.py
trialCommand: python3 mnist.py
trialGpuNumber: 1
trialConcurrency: 4
maxTrialNumber: 20
nniManagerIp: 10.113.217.230
debug: true
tuner:
  name: TPE
  classArgs:
    optimize_mode: maximize
trainingService:
  platform: local
  useActiveGpu: true
  maxTrialNumberPerGpu: 8

Description:

  1. If I run the code and config (local mode, using the second yml file) directly on the remote machine, everything runs as expected: the number of tasks on the GPU matches trialConcurrency, the GPU is actually used by NNI, and the output speed and waiting time are as expected.
  2. If I run the code and config (remote mode, using the first yml file) on my local machine (a MacBook Pro with the latest OS) connected to the remote machine, several strange phenomena occur. I list them below.

-> 2.1 If I set trialGpuNumber = 1 and trialCommand = python3 mnist.py, the behavior is the same as @guoxiaojie-schinper reported: all tasks show WAITING status forever, and the NNIManager log shows INFO (TrialDispatcher) TrialDispatcher: 1 live env, and 1 reusable, but no GPU available so request a new one. Running top and nvidia-smi on the remote machine confirms that the task is not actually running: CPU usage is low and there is no related GPU process. The WAITING status can persist for several hours. Whether or not nvidia-smi is added to trialCommand (as in the commented-out line above), the behavior is the same.

-> 2.2 If I set trialGpuNumber > 1, NNI tells me the request exceeds the limit and that no machine can meet it. Since my remote machine only has one GPU, this is reasonable.

-> 2.3 If I set trialGpuNumber = 0 and trialCommand = python3 mnist.py, whether inside or outside the docker, only one task runs even though trialConcurrency = 4; the other three keep waiting until the running one finishes. Unlike [1] and [2] above, where all tasks wait forever, in this condition the tasks run one by one, ignoring the trialConcurrency argument. The task runs on the CPU (it takes up to 4 minutes for this mnist demo from the NNI repo to reach the full 800% CPU usage of the 8-core i7). It just takes longer than on the GPU; it still runs rather than waiting forever.
(I am not very confident about the following, because I don't remember it exactly or the situation shows up rarely:) sometimes the first task, or the first several tasks, will randomly use the GPU.

-> 2.4 If I set trialGpuNumber = 0 again but add nvidia-smi to trialCommand, i.e. trialCommand: nvidia-smi && which python3 && python3 mnist.py, and run outside the docker, the task starts using the GPU after about 4 minutes, which is much slower than the normal case in 2.5. I confirmed the GPU usage with nvidia-smi on the remote machine (a related process shows up) and from NNI's output speed. But it seems that each subsequent task waits even longer before it starts using the GPU.

-> 2.5 If I set trialGpuNumber = 0 again, add nvidia-smi to trialCommand, i.e. trialCommand: nvidia-smi && which python3 && python3 mnist.py, and run inside the docker, the tasks always run with the GPU and behave normally, but still one by one.

Originally posted by @OuYaozhong in #3905 (comment)

@acured
Contributor

acured commented Aug 9, 2021

Hi @OuYaozhong, thanks for the detailed description of this issue.

There is a fix for the GPU release issue, which will hopefully fix your problem, as I mentioned at #3905 (comment). It is already in the master branch.

You can build NNI from source code or wait for the next release (v2.5), which will be out soon.

@OuYaozhong
Author

@acured Thanks for your help.

I have tried the newly updated source code.

But it seems to solve only one problem: the one where NNI keeps waiting forever when trialGpuNumber: 1 is set, and the GPU is still used when trialGpuNumber: 0 is set.

The other situations mentioned in my original comment still exist.

Here I describe the problem again, with more log files.


In Brief

  1. Tasks run one by one, no matter what value trialConcurrency is set to.
  2. If nvidia-smi is not added to trialCommand and the trial runs directly in Ubuntu (i.e. outside the docker), only the first task uses the GPU, or tasks wait a much longer time before using the GPU (with trialGpuNumber: 1 and useActiveGpu: true).

In Detail

Fundamental info:

(Slave) Remote workstation:

  1. Outside environment: Ubuntu 20.04.2 LTS with a self-created conda env (Python 3.6)
  2. Inside environment: Docker image built from continuumio/miniconda3:latest (the Dockerfile is attached below)
  3. NNI installation: installed into the self-created conda env from the latest source code, following the second instruction of the NNI docs

(Master) Local machine:

  4. NNI installed into a self-created conda env from the latest source code, following the second instruction of the NNI docs (the same as on the remote slave)
  5. Machine: MacBook Pro 2020 with the latest OS

And all the experiments are run in remote mode.

Experiments
I have tried 8 conditions in total, varying:

If Docker is used: [True, False]
If nvidia-smi is added: [True, False]
trialGpuNumber: [0, 1]

Each condition shows different phenomena.
The 8 conditions are listed below (a config sketch for one of them follows the list).

  1. [No Docker, no nvidia-smi, trialGpuNumber: 1]
  2. [No Docker, with nvidia-smi, trialGpuNumber: 1]
  3. [Docker, no nvidia-smi, trialGpuNumber: 1]
  4. [Docker, with nvidia-smi, trialGpuNumber: 1]
  5. [No Docker, no nvidia-smi, trialGpuNumber: 0]
  6. [No Docker, with nvidia-smi, trialGpuNumber: 0]
  7. [Docker, no nvidia-smi, trialGpuNumber: 0]
  8. [Docker, with nvidia-smi, trialGpuNumber: 0]
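
As a concrete example, condition 2 above (no Docker, with nvidia-smi, trialGpuNumber: 1) corresponds to the following fragment of config_remote.yml, with everything else kept unchanged from the config shown earlier (a sketch for reference only):

# condition 2: run outside Docker, prepend nvidia-smi, request one GPU per trial
trialCommand: nvidia-smi && which python3 && python3 mnist.py
trialGpuNumber: 1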

The full logs for all of these conditions, from both the remote slave and the local master, are packaged and uploaded below.

  1. The main problem is that the tasks run one by one, even though trialConcurrency is set to 4 or 8.
    The nnimanager.log shows DEBUG (TrialDispatcher) TrialDispatcher: wait GPU, live environment 1, reusable 1, TMP_NO_AVAILABLE_GPU. But in fact trialGpuNumber is 1, and neither the GPU's memory nor its compute resources are fully used.

  2. Even in the conditions where the GPU is used, whether for all tasks or only the first few, no GPU metrics are saved in the log files, either on the remote slave or on the local master machine.

  3. With trialGpuNumber: 1, the GPU is used, but the duration of each task varies a lot. Based on my observation with nvidia-smi, some tasks start using the GPU soon after they start running, while others wait a long time before using it. Outside the docker, a task sometimes waits over 10 minutes and is still waiting; inside the docker the behavior is more stable, i.e. each trial waits about 4 minutes. ('Wait' here does not mean the WAITING status; it means the task is running but shows no GPU usage.)

  4. Normally, when run locally on the remote machine, each task takes about 2 minutes, trialConcurrency is respected instead of tasks running one by one, no time is wasted waiting for GPU usage, and the task durations are even, i.e. the variation is small (an example is shown in the image below).
    In all the conditions mentioned above, by contrast, most tasks take more than 4 minutes, spend a lot of time waiting for the GPU, and the variation in duration across tasks is large.
    截屏2021-08-10 00 10 40

  5. In remote mode, the network monitor shows no obvious data transmission. This project is small and there is no dataset that needs to be uploaded; all the large files are downloaded while the code is running. So I don't think the waiting of running tasks is caused by large data transfers.

That's all.
If more experiments are needed, please feel free to tell me and I will provide the log files.

Appendix

1. The docker file used to build the docker image

FROM continuumio/miniconda3
WORKDIR /root

# Unpack pre-downloaded conda packages into the conda package cache
COPY nni_dependence.tar /opt/conda/pkgs/
WORKDIR /opt/conda/pkgs
RUN tar -xf nni_dependence.tar
RUN rm nni_dependence.tar
RUN ls /opt/conda/pkgs
WORKDIR /root

# Create the conda env referenced by pythonPath in config_remote.yml
RUN conda create -n py38torch190cu111 python=3.8 -y
RUN echo "conda activate py38torch190cu111" >> ~/.bashrc
SHELL ["/bin/bash", "--login",  "-c"]
RUN conda env list
RUN conda install pytorch torchvision torchaudio cudatoolkit=11.1 -c pytorch -c nvidia

# Install git, NNI itself, and clone the repo for the mnist-pytorch example
RUN apt install git -y
RUN python3 -m pip install --upgrade nni
RUN git clone https://github.com/microsoft/nni.git
RUN pwd && ls
RUN echo root:***** | chpasswd
RUN pip install dill

# OpenSSH server so the NNI manager can reach the container as a remote machine
RUN apt-get update
RUN apt-get install openssh-server openssh-client -y

# Allow root login over SSH and (re)start sshd when the container runs
CMD sed -i '/^#PermitRootLogin/a\PermitRootLogin yes' /etc/ssh/sshd_config && /etc/init.d/ssh restart

2. The logs are packaged in a tar file and attached to this comment.
Download Link (1.5 GB)
In the attached logs:

  1. inside means using Docker;
    outside means not using Docker, i.e. running directly in Ubuntu 20.04 with the self-created conda env.
  2. with nvidia-smi means nvidia-smi is added to the trialCommand;
    without nvidia-smi means nvidia-smi is not added to the trialCommand.
  3. gpu0 means trialGpuNumber: 0;
    gpu1 means trialGpuNumber: 1.

@acured
Contributor

acured commented Aug 11, 2021

Hi @OuYaozhong, about trialConcurrency there is a note at https://nni.readthedocs.io/en/stable/reference/experiment_config.html?highlight=trialConcurrency#trialconcurrency: 'The real concurrency also depends on hardware resources and may be less than this value.' Could you have a try with trialGpuNumber: 2?

@OuYaozhong
Author

@acured

  1. I have tried trialGpuNumber: 2 as suggested; NNI threw an error indicating that the required GPU number exceeds the real number of GPUs: Training service error: TrialDispatcher: REQUIRE_EXCEED_TOTAL Required GPU number 2 is too large, no machine can meet.
  2. I share your view that the real concurrency depends on the available resources of the remote hardware. But in fact, based on observations over many trials, NNI does not make tasks wait for resources to become available when trialConcurrency is clearly set larger than the available resources. Frequently, NNI lets trialConcurrency tasks run together, which drives some of them to CUDA out-of-memory errors. Maybe that is due to the program's dynamic GPU memory usage; however, the small mnist-pytorch task I am experimenting with should be outside that scope, because its GPU usage seems fast and stable.

@acured
Contributor

acured commented Aug 11, 2021

Have you tried setting trialGpuNumber: none, or not setting it at all?
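
(To be clear, by 'not set' I mean omitting the field entirely rather than giving it a value, e.g. a minimal sketch based on your config above:)

searchSpaceFile: search_space.json
trialCommand: python3 mnist.py
# trialGpuNumber is simply left out, not set to none/None
trialConcurrency: 4
maxTrialNumber: 20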

@OuYaozhong
Author

The same ERROR: Training service error: TrialDispatcher: REQUIRE_EXCEED_TOTAL Required GPU number 2 is too large, no machine can meet
截屏2021-08-11 11 25 04

截屏2021-08-11 11 24 40

@OuYaozhong
Author

OuYaozhong commented Aug 11, 2021

Have you tried setting trialGpuNumber: none, or not setting it at all?

  1. If I set it to None or none, nnictl throws an error:
    (set none) ERROR: Config V2 validation failed: ValueError("ExperimentConfig: trial_gpu_number has bad value 'none'") ERROR: 'NoneType' object has no attribute 'get'
    (set None) ERROR: Config V2 validation failed: ValueError("ExperimentConfig: trial_gpu_number has bad value 'None'") ERROR: 'NoneType' object has no attribute 'get'

  2. If it is not set at all, I can create the experiment normally, but the tasks still run one by one, wait a long time with no GPU usage, and their duration is clearly longer than in the local case.

截屏2021-08-11 11 40 16

截屏2021-08-11 11 40 25

  3. If it is not set and I run inside the docker, I can create the experiment normally, but the tasks still run one by one, the same as in 2.
    The first task takes 8 minutes, much longer than the local case.

截屏2021-08-11 11 52 08

截屏2021-08-11 11 54 48

@acured
Contributor

acured commented Aug 12, 2021

It works for me...
image

@acured
Contributor

acured commented Aug 12, 2021

That was local; I will try remote later.

@OuYaozhong
Author

That was local; I will try remote later.

Yeah, local mode works for me too.

@acured
Contributor

acured commented Aug 12, 2021

Also works...
image

@acured
Contributor

acured commented Aug 12, 2021

I see your config setting for "pythonPath"; is that a folder or a python file? It should be a folder. If .../python is not a folder, can you change it to .../bin and have a try?

ref: https://nni.readthedocs.io/en/stable/reference/experiment_config.html?highlight=pythonPath#pythonpath
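
For example, the machineList entry would look something like this (a sketch reusing the values from your config above; the point is only that pythonPath ends in a directory):

machineList:
  - host: 10.113.217.230
    user: root
    sshKeyFile: ~/.ssh/nni_docker
    port: 8145
    # pythonPath should be the directory that contains the interpreter,
    # e.g. the env's bin directory, not the python executable itself
    pythonPath: /opt/conda/envs/py38torch190cu111/bin
    useActiveGpu: true
    maxTrialNumberPerGpu: 8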

@OuYaozhong
Author

I see your config setting for "pythonPath"; is that a folder or a python file? It should be a folder. If .../python is not a folder, can you change it to .../bin and have a try?

ref: https://nni.readthedocs.io/en/stable/reference/experiment_config.html?highlight=pythonPath#pythonpath

Yeah, you are right. I noticed that problem before but had forgotten about it recently.

I am finding some resources to check this. Please wait a while and I will post the result as soon as possible.

@OuYaozhong
Author

OuYaozhong commented Aug 12, 2021

@acured
Hi, thanks for your help. Your tests helped me solve the problem.

In fact, I found that the problem lies in reuseMode, not in trialConcurrency or pythonPath.

If I set reuseMode: false, both the docker and the outside environment work normally, just like running in local mode: tasks run simultaneously according to the trialConcurrency setting and start using the GPU quickly.
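
For reference, the only change I made is in the trainingService section; a minimal sketch of my config above, assuming reuseMode sits at this level of the v2 schema:

trainingService:
  platform: remote
  reuseMode: false   # with reuse off, trials run concurrently and pick up the GPU quickly
  machineList:
    - host: 10.113.217.230
      user: root
      sshKeyFile: ~/.ssh/nni_docker
      port: 8145
      pythonPath: /opt/conda/envs/py38torch190cu111/bin
      useActiveGpu: true
      maxTrialNumberPerGpu: 8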

But it is strange: why is this related to reuseMode?

Before I opened this issue, I had noticed this parameter but thought it wasn't related to this problem, because it seemed to be meant only to accelerate remote training.

Can you give some explanation?

image

image

@acured
Contributor

acured commented Aug 13, 2021

Glad your problem is solved. For more information about "reuse" mode you can look here.

@OuYaozhong
Author

Glad your problem is solved. For more information about "reuse" mode you can look here.

Yeah, I read the document again yesterday, but I still wonder why the problem is related to reuseMode. It seems that reuseMode is only meant to accelerate the experiments, not to make them run one by one.

@acured
Contributor

acured commented Aug 13, 2021

It also works for me when I set reuse to true.
image

@OuYaozhong
Author

It also works for me when I set reuse to true.
image

Well, thanks for your help.
That's enough.
