This repository has been archived by the owner on Sep 18, 2024. It is now read-only.

NNI remote mode is working beyond expectation #4035

Closed
OuYaozhong opened this issue Aug 6, 2021 · 18 comments
@OuYaozhong

OuYaozhong commented Aug 6, 2021

I have run into some strange behavior of NNI in remote mode.


In brief, what happened to me is the same as @guoxiaojie-schinper.

I am running the demo from the NNI repo, /example/trial/mnist-pytorch.

If I run config_remote.yml directly on the remote machine (with trainingService changed to local, of course), everything works as expected.

But if the same config_remote.yml is run on my local machine (a MacBook Pro), with a workstation carrying an Nvidia GeForce 2080 GPU as the slave worker, it fails in exactly the same way @guoxiaojie-schinper described.


In detail:

Environment: NNI on both the local and remote machines is installed with python3 -m pip install --upgrade nni inside a conda environment.

config_remote.yml (remote mode):

searchSpaceFile: search_space.json
#trialCommand: nvidia-smi && which python3 && python3 mnist.py
trialCommand: python3 mnist.py
trialGpuNumber: 1
trialConcurrency: 4
maxTrialNumber: 20
nniManagerIp: 10.113.217.251
debug: true
tuner:
  name: TPE
  classArgs:
    optimize_mode: maximize
trainingService:
  platform: remote
  machineList:
    - host: 10.113.217.230 
      user: root
      sshKeyFile: ~/.ssh/nni_docker
      port: 8145
      pythonPath: /opt/conda/envs/py38torch190cu111/bin
      useActiveGpu: true
      maxTrialNumberPerGpu: 8

config_remote.yml (local mode):

searchSpaceFile: search_space.json
#trialCommand: nvidia-smi && which python3 && python3 mnist.py
trialCommand: python3 mnist.py
trialGpuNumber: 1
trialConcurrency: 4
maxTrialNumber: 20
nniManagerIp: 10.113.217.230
debug: true
tuner:
  name: TPE
  classArgs:
    optimize_mode: maximize
trainingService:
  platform: local
  useActiveGpu: true
  maxTrialNumberPerGpu: 8

Description:

  1. If I run the code and config (local mode, using the second yml file) directly on the remote machine, everything runs as expected: the number of tasks on the GPU matches trialConcurrency, the GPU is actually used by NNI, and the output speed and waiting time are as expected.
  2. If I run the code and config (remote mode, using the first yml file) on my local machine (a MacBook Pro with the latest OS) connected to the remote machine, several strange phenomena occur. I list them below.

-> 2.1 If I set trialGpuNumber = 1 and trialCommand = python3 mnist.py, the behavior is the same as @guoxiaojie-schinper reported: all tasks show WAITING status forever, and the NNIManager log shows INFO (TrialDispatcher) TrialDispatcher: 1 live env, and 1 reusable, but no GPU available so request a new one. Running top and nvidia-smi on the remote machine confirms that the task is not actually running: CPU usage is low and there is no related GPU process. The WAITING status can persist for several hours. Whether or not nvidia-smi is added to trialCommand (as in the commented-out line above), the behavior is the same.

-> 2.2 If I set trialGpuNumber > 1, NNI tells me the request exceeds the limit and that no machine can meet it. Since my remote machine only has one GPU, this is reasonable.

-> 2.3 If I set trialGpuNumber = 0 and trialCommand = python3 mnist.py, whether inside or outside the docker, only one task runs even though trialConcurrency = 4; the other three keep waiting until the running one finishes. Unlike [1] and [2] above, where all tasks wait forever, in this condition the tasks run one by one, ignoring the trialConcurrency argument. The task runs on the CPU (it takes up to 4 minutes for this mnist demo from the NNI repo to reach the full 800% CPU usage of the 8-core i7). It just takes longer than on the GPU; it still runs rather than waiting forever.
(I am not very confident about the following, because I don't remember it exactly or the situation shows up rarely:) sometimes the first task, or the first several tasks, will randomly use the GPU.

-> 2.4 If I set trialGpuNumber = 0 again but add nvidia-smi to trialCommand, i.e. trialCommand: nvidia-smi && which python3 && python3 mnist.py, and run outside the docker, the task starts using the GPU after about 4 minutes, which is much slower than the normal case in 2.5. I confirmed the GPU usage with nvidia-smi on the remote machine (a related process shows up) and from NNI's output speed. But it seems that each subsequent task waits even longer before it starts using the GPU.

-> 2.5 If I set trialGpuNumber = 0 again, add nvidia-smi to trialCommand, i.e. trialCommand: nvidia-smi && which python3 && python3 mnist.py, and run inside the docker, the tasks always run with the GPU and behave normally, but still one by one.

Originally posted by @OuYaozhong in #3905 (comment)

@acured
Contributor

acured commented Aug 9, 2021

Hi @OuYaozhong, thanks for the detailed description of this issue.

There is a fix for the GPU release issue, which will hopefully fix your problem, as I mentioned at #3905 (comment). It is already in the master branch.

You can build NNI from source code or wait for the next release (v2.5), which will be out soon.

@OuYaozhong
Author

@acured Thanks for your help.

I have tried the newly updated source code.

But it seems to solve only one problem: the one where NNI keeps waiting forever when trialGpuNumber: 1 is set, and the GPU is still used when trialGpuNumber: 0 is set.

The other situations mentioned in my original comment still exist.

Here I describe the problem again, with more log files.


In Brief

  1. Tasks run one by one, no matter what value trialConcurrency is set to.
  2. If nvidia-smi is not added to trialCommand and the trial runs directly in Ubuntu (i.e. outside the docker), only the first task uses the GPU, or tasks wait a much longer time before using the GPU (with trialGpuNumber: 1 and useActiveGpu: true).

In Detail

Fundamental info:

(Slave) Remote workstation:

  1. Outside environment: Ubuntu 20.04.2 LTS with a self-created conda env (Python 3.6)
  2. Inside environment: Docker image built from continuumio/miniconda3:latest (the Dockerfile is attached below)
  3. NNI installation: installed into the self-created conda env from the latest source code, following the second instruction of the NNI docs

(Master) Local machine:

  4. NNI installed into a self-created conda env from the latest source code, following the second instruction of the NNI docs (the same as on the remote slave)
  5. Machine: MacBook Pro 2020 with the latest OS

And all the experiments are run in remote mode.

Experiments
I have tried 8 conditions in total, varying:

If Docker is used: [True, False]
If nvidia-smi is added: [True, False]
trialGpuNumber: [0, 1]

Each condition shows different phenomena.
The 8 conditions are listed below (a config sketch for one of them follows the list).

  1. [No Docker, no nvidia-smi, trialGpuNumber: 1]
  2. [No Docker, with nvidia-smi, trialGpuNumber: 1]
  3. [Docker, no nvidia-smi, trialGpuNumber: 1]
  4. [Docker, with nvidia-smi, trialGpuNumber: 1]
  5. [No Docker, no nvidia-smi, trialGpuNumber: 0]
  6. [No Docker, with nvidia-smi, trialGpuNumber: 0]
  7. [Docker, no nvidia-smi, trialGpuNumber: 0]
  8. [Docker, with nvidia-smi, trialGpuNumber: 0]
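
As a concrete example, condition 2 above (no Docker, with nvidia-smi, trialGpuNumber: 1) corresponds to the following fragment of config_remote.yml, with everything else kept unchanged from the config shown earlier (a sketch for reference only):

# condition 2: run outside Docker, prepend nvidia-smi, request one GPU per trial
trialCommand: nvidia-smi && which python3 && python3 mnist.py
trialGpuNumber: 1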

The full logs for all of these conditions, from both the remote slave and the local master, are packaged and uploaded below.

  1. The main problem is that the tasks run one by one, even though trialConcurrency is set to 4 or 8.
    The nnimanager.log shows DEBUG (TrialDispatcher) TrialDispatcher: wait GPU, live environment 1, reusable 1, TMP_NO_AVAILABLE_GPU. But in fact trialGpuNumber is 1, and neither the GPU's memory nor its compute resources are fully used.

  2. Even in the conditions where the GPU is used, whether for all tasks or only the first few, no GPU metrics are saved in the log files, either on the remote slave or on the local master machine.

  3. With trialGpuNumber: 1, the GPU is used, but the duration of each task varies a lot. Based on my observation with nvidia-smi, some tasks start using the GPU soon after they start running, while others wait a long time before using it. Outside the docker, a task sometimes waits over 10 minutes and is still waiting; inside the docker the behavior is more stable, i.e. each trial waits about 4 minutes. ('Wait' here does not mean the WAITING status; it means the task is running but shows no GPU usage.)

  4. Normally, when run locally on the remote machine, each task takes about 2 minutes, trialConcurrency is respected instead of tasks running one by one, no time is wasted waiting for GPU usage, and the task durations are even, i.e. the variation is small (an example is shown in the image below).
    In all the conditions mentioned above, by contrast, most tasks take more than 4 minutes, spend a lot of time waiting for the GPU, and the variation in duration across tasks is large.
    截屏2021-08-10 00 10 40

  5. In remote mode, the network monitor shows no obvious data transmission. This project is small and there is no dataset that needs to be uploaded; all the large files are downloaded while the code is running. So I don't think the waiting of running tasks is caused by large data transfers.

That's all.
If more experiments are needed, please feel free to tell me and I will provide the log files.

Appendix

1. The docker file used to build the docker image

FROM continuumio/miniconda3
WORKDIR /root

# Unpack pre-downloaded conda packages into the conda package cache
COPY nni_dependence.tar /opt/conda/pkgs/
WORKDIR /opt/conda/pkgs
RUN tar -xf nni_dependence.tar
RUN rm nni_dependence.tar
RUN ls /opt/conda/pkgs
WORKDIR /root

# Create the conda env referenced by pythonPath in config_remote.yml
RUN conda create -n py38torch190cu111 python=3.8 -y
RUN echo "conda activate py38torch190cu111" >> ~/.bashrc
SHELL ["/bin/bash", "--login",  "-c"]
RUN conda env list
RUN conda install pytorch torchvision torchaudio cudatoolkit=11.1 -c pytorch -c nvidia

# Install git, NNI itself, and clone the repo for the mnist-pytorch example
RUN apt install git -y
RUN python3 -m pip install --upgrade nni
RUN git clone https://github.com/microsoft/nni.git
RUN pwd && ls
RUN echo root:***** | chpasswd
RUN pip install dill

# OpenSSH server so the NNI manager can reach the container as a remote machine
RUN apt-get update
RUN apt-get install openssh-server openssh-client -y

# Allow root login over SSH and (re)start sshd when the container runs
CMD sed -i '/^#PermitRootLogin/a\PermitRootLogin yes' /etc/ssh/sshd_config && /etc/init.d/ssh restart

2. The logs are packaged in a tar file and attached to this comment.
Download Link (1.5 GB)
In the attached logs:

  1. inside means using Docker;
    outside means not using Docker, i.e. running directly in Ubuntu 20.04 with the self-created conda env.
  2. with nvidia-smi means nvidia-smi is added to the trialCommand;
    without nvidia-smi means nvidia-smi is not added to the trialCommand.
  3. gpu0 means trialGpuNumber: 0;
    gpu1 means trialGpuNumber: 1.

@acured
Contributor

acured commented Aug 11, 2021

Hi @OuYaozhong, about trialConcurrency there is a note at https://nni.readthedocs.io/en/stable/reference/experiment_config.html?highlight=trialConcurrency#trialconcurrency: 'The real concurrency also depends on hardware resources and may be less than this value.' Could you have a try with trialGpuNumber: 2?

@OuYaozhong
Author

@acured

  1. I have tried trialGpuNumber: 2 as suggested; NNI threw an error indicating that the required GPU number exceeds the real number of GPUs: Training service error: TrialDispatcher: REQUIRE_EXCEED_TOTAL Required GPU number 2 is too large, no machine can meet.
  2. I share your view that the real concurrency depends on the available resources of the remote hardware. But in fact, based on observations over many trials, NNI does not make tasks wait for resources to become available when trialConcurrency is clearly set larger than the available resources. Frequently, NNI lets trialConcurrency tasks run together, which drives some of them to CUDA out-of-memory errors. Maybe that is due to the program's dynamic GPU memory usage; however, the small mnist-pytorch task I am experimenting with should be outside that scope, because its GPU usage seems fast and stable.

@acured
Contributor

acured commented Aug 11, 2021

Have you tried setting trialGpuNumber: none, or not setting it at all?
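
(To be clear, by 'not set' I mean omitting the field entirely rather than giving it a value, e.g. a minimal sketch based on your config above:)

searchSpaceFile: search_space.json
trialCommand: python3 mnist.py
# trialGpuNumber is simply left out, not set to none/None
trialConcurrency: 4
maxTrialNumber: 20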

@OuYaozhong
Author

The same ERROR: Training service error: TrialDispatcher: REQUIRE_EXCEED_TOTAL Required GPU number 2 is too large, no machine can meet
截屏2021-08-11 11 25 04

截屏2021-08-11 11 24 40

@OuYaozhong
Author

OuYaozhong commented Aug 11, 2021

Have you tried setting trialGpuNumber: none, or not setting it at all?

  1. If I set it to None or none, nnictl throws an error:
    (set none) ERROR: Config V2 validation failed: ValueError("ExperimentConfig: trial_gpu_number has bad value 'none'") ERROR: 'NoneType' object has no attribute 'get'
    (set None) ERROR: Config V2 validation failed: ValueError("ExperimentConfig: trial_gpu_number has bad value 'None'") ERROR: 'NoneType' object has no attribute 'get'

  2. If it is not set at all, I can create the experiment normally, but the tasks still run one by one, wait a long time with no GPU usage, and their duration is clearly longer than in the local case.

截屏2021-08-11 11 40 16

截屏2021-08-11 11 40 25

  3. If it is not set and I run inside the docker, I can create the experiment normally, but the tasks still run one by one, the same as in 2.
    The first task takes 8 minutes, much longer than the local case.

截屏2021-08-11 11 52 08

截屏2021-08-11 11 54 48

@acured
Contributor

acured commented Aug 12, 2021

It works for me...
image

@acured
Contributor

acured commented Aug 12, 2021

That was local; I will try remote later.

@OuYaozhong
Author

That was local; I will try remote later.

Yeah, local mode works for me too.

@acured
Contributor

acured commented Aug 12, 2021

Also works...
image

@acured
Contributor

acured commented Aug 12, 2021

I see your config setting for "pythonPath"; is that a folder or a python file? It should be a folder. If .../python is not a folder, can you change it to .../bin and have a try?

ref: https://nni.readthedocs.io/en/stable/reference/experiment_config.html?highlight=pythonPath#pythonpath
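
For example, the machineList entry would look something like this (a sketch reusing the values from your config above; the point is only that pythonPath ends in a directory):

machineList:
  - host: 10.113.217.230
    user: root
    sshKeyFile: ~/.ssh/nni_docker
    port: 8145
    # pythonPath should be the directory that contains the interpreter,
    # e.g. the env's bin directory, not the python executable itself
    pythonPath: /opt/conda/envs/py38torch190cu111/bin
    useActiveGpu: true
    maxTrialNumberPerGpu: 8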

@OuYaozhong
Author

I see your config setting for "pythonPath"; is that a folder or a python file? It should be a folder. If .../python is not a folder, can you change it to .../bin and have a try?

ref: https://nni.readthedocs.io/en/stable/reference/experiment_config.html?highlight=pythonPath#pythonpath

Yeah, you are right. I noticed that problem before but had forgotten about it recently.

I am finding some resources to check this. Please wait a while and I will post the result as soon as possible.

@OuYaozhong
Author

OuYaozhong commented Aug 12, 2021

@acured
Hi, thanks for your help. Your tests helped me solve the problem.

In fact, I found that the problem lies in reuseMode, not in trialConcurrency or pythonPath.

If I set reuseMode: false, both the docker and the outside environment work normally, just like running in local mode: tasks run simultaneously according to the trialConcurrency setting and start using the GPU quickly.
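
For reference, the only change I made is in the trainingService section; a minimal sketch of my config above, assuming reuseMode sits at this level of the v2 schema:

trainingService:
  platform: remote
  reuseMode: false   # with reuse off, trials run concurrently and pick up the GPU quickly
  machineList:
    - host: 10.113.217.230
      user: root
      sshKeyFile: ~/.ssh/nni_docker
      port: 8145
      pythonPath: /opt/conda/envs/py38torch190cu111/bin
      useActiveGpu: true
      maxTrialNumberPerGpu: 8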

But it is strange: why is this related to reuseMode?

Before I opened this issue, I had noticed this parameter but thought it wasn't related to this problem, because it seemed to be meant only to accelerate remote training.

Can you give some explanation?

image

image

@acured
Contributor

acured commented Aug 13, 2021

Glad your problem is solved. For more information about "reuse" mode you can look here.

@OuYaozhong
Author

Glad your problem is solved. For more information about "reuse" mode you can look here.

Yeah, I read the document again yesterday, but I still wonder why the problem is related to reuseMode. It seems that reuseMode is only meant to accelerate the experiments, not to make them run one by one.

@acured
Contributor

acured commented Aug 13, 2021

It also works for me when I set reuse to true.
image

@OuYaozhong
Author

It also works for me when I set reuse to true.
image

Well, thanks for your help.
That's enough.
