Updated Dockerfile of MLflow Kubernetes examples #2472

Merged
merged 7 commits into optuna:master from 0x41head:mlflow_docker
Mar 31, 2021

Conversation

0x41head
Contributor

Motivation

#2049

Description of the changes

Bumped the PyTorch dependencies to the latest stable versions.

@crcrpar crcrpar linked an issue Mar 12, 2021 that may be closed by this pull request
@HideakiImamura
Member

@crcrpar Could you review this PR if you have time?

@HideakiImamura
Member

@0x41head Thanks for the PR. The CI failure should be fixed by merging the master branch.

Member

@HideakiImamura HideakiImamura left a comment


I tried to execute the commands in examples/kubernetes/mlflow/README.md, but the worker pods failed as follows. It seems that examples/kubernetes/mlflow/pytorch_lightning_distributed.py uses the legacy val_percent_check argument. Do you have any idea?

(venv) mamu@HideakinoMacBook-puro mlflow % kubectl get pod
NAME                  READY   STATUS      RESTARTS   AGE
mlflow-0              1/1     Running     0          41s
postgres-0            1/1     Running     0          41s
study-creator-92jgf   0/1     Completed   0          41s
worker-b2mh5          0/1     Error       2          41s
worker-nrnb2          0/1     Error       2          41s
(venv) mamu@HideakinoMacBook-puro mlflow % kubectl logs worker-b2mh5
pytorch_lightning_distributed.py:128: ExperimentalWarning: MLflowCallback is experimental (supported from v1.4.0). The interface can change in the future.
  callbacks=[MLflowCallback(tracking_uri="http://mlflow:5000/", metric_name="val_accuracy")],
[W 2021-03-16 23:18:31,031] Trial 4 failed because of the following error: TypeError("__init__() got an unexpected keyword argument 'val_percent_check'")
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/optuna/_optimize.py", line 217, in _run_trial
    value_or_values = func(trial)
  File "pytorch_lightning_distributed.py", line 106, in objective
    callbacks=[metrics_callback],
  File "/usr/local/lib/python3.7/site-packages/pytorch_lightning/trainer/connectors/env_vars_connector.py", line 39, in insert_env_defaults
    return fn(self, **kwargs)
TypeError: __init__() got an unexpected keyword argument 'val_percent_check'
Traceback (most recent call last):
  File "pytorch_lightning_distributed.py", line 128, in <module>
    callbacks=[MLflowCallback(tracking_uri="http://mlflow:5000/", metric_name="val_accuracy")],
  File "/usr/local/lib/python3.7/site-packages/optuna/study.py", line 394, in optimize
    show_progress_bar=show_progress_bar,
  File "/usr/local/lib/python3.7/site-packages/optuna/_optimize.py", line 76, in _optimize
    progress_bar=progress_bar,
  File "/usr/local/lib/python3.7/site-packages/optuna/_optimize.py", line 163, in _optimize_sequential
    trial = _run_trial(study, func, catch)
  File "/usr/local/lib/python3.7/site-packages/optuna/_optimize.py", line 268, in _run_trial
    raise func_err
  File "/usr/local/lib/python3.7/site-packages/optuna/_optimize.py", line 217, in _run_trial
    value_or_values = func(trial)
  File "pytorch_lightning_distributed.py", line 106, in objective
    callbacks=[metrics_callback],
  File "/usr/local/lib/python3.7/site-packages/pytorch_lightning/trainer/connectors/env_vars_connector.py", line 39, in insert_env_defaults
    return fn(self, **kwargs)
TypeError: __init__() got an unexpected keyword argument 'val_percent_check'

@0x41head
Contributor Author

0x41head commented Mar 17, 2021

@HideakiImamura, I am unable to reproduce your errors in my local environment. Neither worker-b2mh5 nor worker-nrnb2 (nor any other worker pod) shows up for me. I am using minikube with Docker as my driver.

kali@kali:~/Desktop/optuna/examples/kubernetes/mlflow$ kubectl get pod
NAME                  READY   STATUS      RESTARTS   AGE
mlflow-0              1/1     Running     0          25m
postgres-0            1/1     Running     0          25m
study-creator-g5x79   0/1     Completed   0          25m

@HideakiImamura
Member

Sorry for the late reply. The worker-xxx pods are started by study-creator-xxx and are deleted after they complete. For example:

mamu@HideakinoMacBook-puro ~ % kubectl get pod
NAME                  READY   STATUS      RESTARTS   AGE
mlflow-0              1/1     Running     1          5d9h
postgres-0            1/1     Running     1          5d9h
study-creator-92jgf   0/1     Completed   0          5d9h

Could you try running kubectl get pod again a few tens of seconds after the command is executed?

@0x41head
Contributor Author

0x41head commented Mar 22, 2021

Thank you for commenting. I looked into the matter and you were absolutely correct. However, after running the same set of commands on the current Optuna master branch, I got the same errors,

kali@kali:~/mo/optuna/examples/kubernetes/mlflow$ kubectl get pod
NAME                  READY   STATUS      RESTARTS   AGE
mlflow-0              1/1     Running     0          4m11s
postgres-0            1/1     Running     0          4m11s
study-creator-7h6d8   0/1     Completed   0          4m12s
worker-42pnk          0/1     Error       1          4m12s
worker-mnl7v          0/1     Error       1          4m12s

which leads me to believe that these errors might not be caused by this PR.

@HideakiImamura
Member

In the current master branch, the worker pods fail with the following error, whose cause seems to be different from the error above. We need to fix both of them in separate PRs, so I'll create another issue for that. (Thanks to this PR, I was able to notice the bug. Thanks!)

mamu@HideakinoMacBook-puro mlflow % kubectl logs worker-kg66q
/usr/local/lib/python3.7/site-packages/optuna/_experimental.py:83: ExperimentalWarning: MLflowCallback is experimental (supported from v1.4.0). The interface can change in the future.
  ExperimentalWarning,
GPU available: False, used: False
TPU available: False, using: 0 TPU cores
/usr/local/lib/python3.7/site-packages/pytorch_lightning/utilities/distributed.py:25: DeprecationWarning: Argument `val_percent_check` is now set by `limit_val_batches` since v0.8.0 and this argument will be removed in v0.10.0
  warnings.warn(*args, **kwargs)

  | Name  | Type       | Params
-------------------------------------
0 | model | Sequential | 72 K  
Downloading http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz to /usr/src/MNIST/raw/train-images-idx3-ubyte.gz
0it [00:00, ?it/s][W 2021-03-23 11:49:21,342] Setting status of trial#4 as TrialState.FAIL because of the following error: <HTTPError 522: 'Origin Connection Time-out'>
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/optuna/study.py", line 734, in _run_trial
    result = func(trial)
  File "pytorch_lightning_distributed.py", line 110, in objective
    trainer.fit(model)
  File "/usr/local/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 1044, in fit
    results = self.run_pretrain_routine(model)
  File "/usr/local/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 1185, in run_pretrain_routine
    self.reset_val_dataloader(ref_model)
  File "/usr/local/lib/python3.7/site-packages/pytorch_lightning/trainer/data_loading.py", line 343, in reset_val_dataloader
    self.num_val_batches, self.val_dataloaders = self._reset_eval_dataloader(model, 'val')
  File "/usr/local/lib/python3.7/site-packages/pytorch_lightning/trainer/data_loading.py", line 270, in _reset_eval_dataloader
    dataloaders = self.request_dataloader(getattr(model, f'{mode}_dataloader'))
  File "/usr/local/lib/python3.7/site-packages/pytorch_lightning/trainer/data_loading.py", line 364, in request_dataloader
    dataloader = dataloader_fx()
  File "pytorch_lightning_distributed.py", line 91, in val_dataloader
    datasets.MNIST(DIR, train=False, download=True, transform=transforms.ToTensor()),
  File "/usr/local/lib/python3.7/site-packages/torchvision/datasets/mnist.py", line 70, in __init__
    self.download()
  File "/usr/local/lib/python3.7/site-packages/torchvision/datasets/mnist.py", line 137, in download
    download_and_extract_archive(url, download_root=self.raw_folder, filename=filename, md5=md5)
  File "/usr/local/lib/python3.7/site-packages/torchvision/datasets/utils.py", line 249, in download_and_extract_archive
    download_url(url, download_root, filename, md5)
  File "/usr/local/lib/python3.7/site-packages/torchvision/datasets/utils.py", line 83, in download_url
    raise e
  File "/usr/local/lib/python3.7/site-packages/torchvision/datasets/utils.py", line 71, in download_url
    reporthook=gen_bar_updater()
  File "/usr/local/lib/python3.7/urllib/request.py", line 247, in urlretrieve
    with contextlib.closing(urlopen(url, data)) as fp:
  File "/usr/local/lib/python3.7/urllib/request.py", line 222, in urlopen
    return opener.open(url, data, timeout)
  File "/usr/local/lib/python3.7/urllib/request.py", line 531, in open
    response = meth(req, response)
  File "/usr/local/lib/python3.7/urllib/request.py", line 641, in http_response
    'http', request, response, code, msg, hdrs)
  File "/usr/local/lib/python3.7/urllib/request.py", line 569, in error
    return self._call_chain(*args)
  File "/usr/local/lib/python3.7/urllib/request.py", line 503, in _call_chain
    result = func(*args)
  File "/usr/local/lib/python3.7/urllib/request.py", line 649, in http_error_default
    raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 522: Origin Connection Time-out
Traceback (most recent call last):
  File "pytorch_lightning_distributed.py", line 128, in <module>
    callbacks=[MLflowCallback(tracking_uri="http://mlflow:5000/", metric_name="val_accuracy")],
  File "/usr/local/lib/python3.7/site-packages/optuna/study.py", line 339, in optimize
    func, n_trials, timeout, catch, callbacks, gc_after_trial, None
  File "/usr/local/lib/python3.7/site-packages/optuna/study.py", line 682, in _optimize_sequential
    self._run_trial_and_callbacks(func, catch, callbacks, gc_after_trial)
  File "/usr/local/lib/python3.7/site-packages/optuna/study.py", line 713, in _run_trial_and_callbacks
    trial = self._run_trial(func, catch, gc_after_trial)
  File "/usr/local/lib/python3.7/site-packages/optuna/study.py", line 734, in _run_trial
    result = func(trial)
  File "pytorch_lightning_distributed.py", line 110, in objective
    trainer.fit(model)
  File "/usr/local/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 1044, in fit
    results = self.run_pretrain_routine(model)
  File "/usr/local/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 1185, in run_pretrain_routine
    self.reset_val_dataloader(ref_model)
  File "/usr/local/lib/python3.7/site-packages/pytorch_lightning/trainer/data_loading.py", line 343, in reset_val_dataloader
    self.num_val_batches, self.val_dataloaders = self._reset_eval_dataloader(model, 'val')
  File "/usr/local/lib/python3.7/site-packages/pytorch_lightning/trainer/data_loading.py", line 270, in _reset_eval_dataloader
    dataloaders = self.request_dataloader(getattr(model, f'{mode}_dataloader'))
  File "/usr/local/lib/python3.7/site-packages/pytorch_lightning/trainer/data_loading.py", line 364, in request_dataloader
    dataloader = dataloader_fx()
  File "pytorch_lightning_distributed.py", line 91, in val_dataloader
    datasets.MNIST(DIR, train=False, download=True, transform=transforms.ToTensor()),
  File "/usr/local/lib/python3.7/site-packages/torchvision/datasets/mnist.py", line 70, in __init__
    self.download()
  File "/usr/local/lib/python3.7/site-packages/torchvision/datasets/mnist.py", line 137, in download
    download_and_extract_archive(url, download_root=self.raw_folder, filename=filename, md5=md5)
  File "/usr/local/lib/python3.7/site-packages/torchvision/datasets/utils.py", line 249, in download_and_extract_archive
    download_url(url, download_root, filename, md5)
  File "/usr/local/lib/python3.7/site-packages/torchvision/datasets/utils.py", line 83, in download_url
    raise e
  File "/usr/local/lib/python3.7/site-packages/torchvision/datasets/utils.py", line 71, in download_url
    reporthook=gen_bar_updater()
  File "/usr/local/lib/python3.7/urllib/request.py", line 247, in urlretrieve
    with contextlib.closing(urlopen(url, data)) as fp:
  File "/usr/local/lib/python3.7/urllib/request.py", line 222, in urlopen
    return opener.open(url, data, timeout)
  File "/usr/local/lib/python3.7/urllib/request.py", line 531, in open
    response = meth(req, response)
  File "/usr/local/lib/python3.7/urllib/request.py", line 641, in http_response
    'http', request, response, code, msg, hdrs)
  File "/usr/local/lib/python3.7/urllib/request.py", line 569, in error
    return self._call_chain(*args)
  File "/usr/local/lib/python3.7/urllib/request.py", line 503, in _call_chain
    result = func(*args)
  File "/usr/local/lib/python3.7/urllib/request.py", line 649, in http_error_default
    raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 522: Origin Connection Time-out
0it [00:31, ?it/s]

@0x41head
Contributor Author

Sorry, I didn't compare the logs of the two branches. I will look into this error ASAP.

@HideakiImamura
Member

@0x41head Are you interested in fixing the bug (#2513)? I think the same technique in #2505 can be applied.

@0x41head
Contributor Author

Sure @HideakiImamura 👍

@0x41head
Contributor Author

0x41head commented Mar 23, 2021

@HideakiImamura, a quick change from MNIST to FashionMNIST in kubernetes/mlflow/pytorch_lightning_distributed.py, on the master branch, brings me back to the original error that we faced in this PR (a rough sketch of the change is below). Should I open a new PR for the HTTP 522 error, or make all the changes in this PR?
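For reference, the dataset swap I tried is roughly the following. This is only a minimal sketch: DIR and the DataLoader settings are assumptions based on the traceback, not a verbatim copy of the example script.

# Hedged sketch of the MNIST -> FashionMNIST swap in the val_dataloader;
# DIR and batch_size are assumed for illustration.
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

DIR = "/usr/src"  # download directory guessed from the log output above

def val_dataloader():
    # datasets.MNIST(...) fails with HTTP 522 when downloading from yann.lecun.com,
    # so FashionMNIST is used instead:
    return DataLoader(
        datasets.FashionMNIST(DIR, train=False, download=True, transform=transforms.ToTensor()),
        batch_size=128,  # assumed value
    )

The worker log after this change is below: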

kali@kali:~/Desktop/optuna/examples/kubernetes/mlflow$ kubectl logs worker-99lcl
pytorch_lightning_distributed.py:128: ExperimentalWarning: MLflowCallback is experimental (supported from v1.4.0). The interface can change in the future.
  callbacks=[MLflowCallback(tracking_uri="http://mlflow:5000/", metric_name="val_accuracy")],
[W 2021-03-23 14:33:32,529] Trial 2 failed because of the following error: TypeError("__init__() got an unexpected keyword argument 'val_percent_check'")
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/optuna/_optimize.py", line 217, in _run_trial
    value_or_values = func(trial)
  File "pytorch_lightning_distributed.py", line 106, in objective
    callbacks=[metrics_callback],
  File "/usr/local/lib/python3.7/site-packages/pytorch_lightning/trainer/connectors/env_vars_connector.py", line 39, in insert_env_defaults
    return fn(self, **kwargs)
TypeError: __init__() got an unexpected keyword argument 'val_percent_check'
Traceback (most recent call last):
  File "pytorch_lightning_distributed.py", line 128, in <module>
    callbacks=[MLflowCallback(tracking_uri="http://mlflow:5000/", metric_name="val_accuracy")],
  File "/usr/local/lib/python3.7/site-packages/optuna/study.py", line 394, in optimize
    show_progress_bar=show_progress_bar,
  File "/usr/local/lib/python3.7/site-packages/optuna/_optimize.py", line 76, in _optimize
    progress_bar=progress_bar,
  File "/usr/local/lib/python3.7/site-packages/optuna/_optimize.py", line 163, in _optimize_sequential
    trial = _run_trial(study, func, catch)
  File "/usr/local/lib/python3.7/site-packages/optuna/_optimize.py", line 268, in _run_trial
    raise func_err
  File "/usr/local/lib/python3.7/site-packages/optuna/_optimize.py", line 217, in _run_trial
    value_or_values = func(trial)
  File "pytorch_lightning_distributed.py", line 106, in objective
    callbacks=[metrics_callback],
  File "/usr/local/lib/python3.7/site-packages/pytorch_lightning/trainer/connectors/env_vars_connector.py", line 39, in insert_env_defaults
    return fn(self, **kwargs)
TypeError: __init__() got an unexpected keyword argument 'val_percent_check'

@HideakiImamura
Member

Thanks for the investigation. I think we should fix both bugs in separate PRs, and after that come back to this PR.

The cause of TypeError: __init__() got an unexpected keyword argument 'val_percent_check' seems to be Lightning-AI/pytorch-lightning#2213. Replacing val_percent_check with limit_val_batches should work.
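A minimal sketch of that replacement, assuming a Trainer call roughly like the one in the traceback (the other argument values are assumptions for illustration, not the actual example code):

import pytorch_lightning as pl

def build_trainer(metrics_callback):
    # val_percent_check=0.1 raises the TypeError on recent PyTorch Lightning;
    # limit_val_batches is its replacement since v0.8.0.
    return pl.Trainer(
        logger=False,
        max_epochs=10,                 # assumed value
        limit_val_batches=0.1,         # replaces the removed val_percent_check
        callbacks=[metrics_callback],
    )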

@0x41head
Contributor Author

@HideakiImamura #2514 fixes the issues we faced during this PR.
Logs:

kali@kali:~/Desktop/optuna/examples/kubernetes/mlflow$ kubectl get pod
NAME                  READY   STATUS      RESTARTS   AGE
mlflow-0              1/1     Running     0          18m
postgres-0            1/1     Running     0          18m
study-creator-755nc   0/1     Completed   0          18m
worker-k4sb9          0/1     Completed   0          18m
worker-lpxzd          0/1     Completed   0          18m

@himkt himkt self-assigned this Mar 31, 2021
@crcrpar crcrpar removed their assignment Mar 31, 2021
Member

@HideakiImamura HideakiImamura left a comment


Thanks for the update and sorry for the late reply. LGTM!

Member

@himkt himkt left a comment


I'm not familiar with Kubernetes, but I confirmed that the jobs were successfully launched on my local minikube with the updated Dockerfile. Thank you @0x41head for updating the dependency!

@himkt himkt merged commit 473a5bd into optuna:master Mar 31, 2021
@0x41head 0x41head deleted the mlflow_docker branch March 31, 2021 14:52
@HideakiImamura HideakiImamura added this to the v2.7.0 milestone Apr 2, 2021
Successfully merging this pull request may close these issues.

Update Dockerfile of MLflow Kubernetes examples