Updated Dockerfile of MLflow Kubernetes examples #2472

Merged
merged 7 commits into optuna:master from 0x41head:mlflow_docker
Mar 31, 2021

Conversation

0x41head
Contributor

Motivation

#2049

Description of the changes

Bumped the PyTorch dependencies to the latest stable versions.

@crcrpar crcrpar linked an issue Mar 12, 2021 that may be closed by this pull request
@HideakiImamura
Member

@crcrpar Could you review this PR if you have time?

@HideakiImamura
Member

@0x41head Thanks for the PR. The CI failure should be fixed by merging the master branch.

Member

@HideakiImamura HideakiImamura left a comment


I tried to execute the commands in examples/kubernetes/mlflow/README.md, but the worker pods failed as follows. It seems that examples/kubernetes/mlflow/pytorch_lightning_distributed.py uses the legacy val_percent_check argument. Do you have any idea?

(venv) mamu@HideakinoMacBook-puro mlflow % kubectl get pod
NAME                  READY   STATUS      RESTARTS   AGE
mlflow-0              1/1     Running     0          41s
postgres-0            1/1     Running     0          41s
study-creator-92jgf   0/1     Completed   0          41s
worker-b2mh5          0/1     Error       2          41s
worker-nrnb2          0/1     Error       2          41s
(venv) mamu@HideakinoMacBook-puro mlflow % kubectl logs worker-b2mh5
pytorch_lightning_distributed.py:128: ExperimentalWarning: MLflowCallback is experimental (supported from v1.4.0). The interface can change in the future.
  callbacks=[MLflowCallback(tracking_uri="http://mlflow:5000/", metric_name="val_accuracy")],
[W 2021-03-16 23:18:31,031] Trial 4 failed because of the following error: TypeError("__init__() got an unexpected keyword argument 'val_percent_check'")
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/optuna/_optimize.py", line 217, in _run_trial
    value_or_values = func(trial)
  File "pytorch_lightning_distributed.py", line 106, in objective
    callbacks=[metrics_callback],
  File "/usr/local/lib/python3.7/site-packages/pytorch_lightning/trainer/connectors/env_vars_connector.py", line 39, in insert_env_defaults
    return fn(self, **kwargs)
TypeError: __init__() got an unexpected keyword argument 'val_percent_check'
Traceback (most recent call last):
  File "pytorch_lightning_distributed.py", line 128, in <module>
    callbacks=[MLflowCallback(tracking_uri="http://mlflow:5000/", metric_name="val_accuracy")],
  File "/usr/local/lib/python3.7/site-packages/optuna/study.py", line 394, in optimize
    show_progress_bar=show_progress_bar,
  File "/usr/local/lib/python3.7/site-packages/optuna/_optimize.py", line 76, in _optimize
    progress_bar=progress_bar,
  File "/usr/local/lib/python3.7/site-packages/optuna/_optimize.py", line 163, in _optimize_sequential
    trial = _run_trial(study, func, catch)
  File "/usr/local/lib/python3.7/site-packages/optuna/_optimize.py", line 268, in _run_trial
    raise func_err
  File "/usr/local/lib/python3.7/site-packages/optuna/_optimize.py", line 217, in _run_trial
    value_or_values = func(trial)
  File "pytorch_lightning_distributed.py", line 106, in objective
    callbacks=[metrics_callback],
  File "/usr/local/lib/python3.7/site-packages/pytorch_lightning/trainer/connectors/env_vars_connector.py", line 39, in insert_env_defaults
    return fn(self, **kwargs)
TypeError: __init__() got an unexpected keyword argument 'val_percent_check'

@0x41head
Contributor Author

0x41head commented Mar 17, 2021

@HideakiImamura, I am unable to reproduce your errors in my local environment. Neither worker-b2mh5 nor worker-nrnb2 (nor any other worker pod) shows up for me. I am using minikube with Docker as my driver.

kali@kali:~/Desktop/optuna/examples/kubernetes/mlflow$ kubectl get pod
NAME                  READY   STATUS      RESTARTS   AGE
mlflow-0              1/1     Running     0          25m
postgres-0            1/1     Running     0          25m
study-creator-g5x79   0/1     Completed   0          25m

@HideakiImamura
Member

Sorry for the late reply. The worker-xxx pods are started by study-creator-xxx and are deleted after they complete. For example:

mamu@HideakinoMacBook-puro ~ % kubectl get pod
NAME                  READY   STATUS      RESTARTS   AGE
mlflow-0              1/1     Running     1          5d9h
postgres-0            1/1     Running     1          5d9h
study-creator-92jgf   0/1     Completed   0          5d9h

Could you try running kubectl get pod again a few tens of seconds after the command is executed?

@0x41head
Contributor Author

0x41head commented Mar 22, 2021

Thank you for commenting. I looked into the matter and you were absolutely correct. However, after running the same set of commands on the current Optuna master branch, I got the same errors,

kali@kali:~/mo/optuna/examples/kubernetes/mlflow$ kubectl get pod
NAME                  READY   STATUS      RESTARTS   AGE
mlflow-0              1/1     Running     0          4m11s
postgres-0            1/1     Running     0          4m11s
study-creator-7h6d8   0/1     Completed   0          4m12s
worker-42pnk          0/1     Error       1          4m12s
worker-mnl7v          0/1     Error       1          4m12s

which leads me to believe that these errors might not be caused by this PR.

@HideakiImamura
Member

In the current master branch, the worker pods fail with the following error, whose cause seems to be different from the error above. We need to fix both of them in separate PRs, so I'll create another issue for that. (Thanks to this PR, I was able to notice the bug. Thanks!)

mamu@HideakinoMacBook-puro mlflow % kubectl logs worker-kg66q
/usr/local/lib/python3.7/site-packages/optuna/_experimental.py:83: ExperimentalWarning: MLflowCallback is experimental (supported from v1.4.0). The interface can change in the future.
  ExperimentalWarning,
GPU available: False, used: False
TPU available: False, using: 0 TPU cores
/usr/local/lib/python3.7/site-packages/pytorch_lightning/utilities/distributed.py:25: DeprecationWarning: Argument `val_percent_check` is now set by `limit_val_batches` since v0.8.0 and this argument will be removed in v0.10.0
  warnings.warn(*args, **kwargs)

  | Name  | Type       | Params
-------------------------------------
0 | model | Sequential | 72 K  
Downloading http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz to /usr/src/MNIST/raw/train-images-idx3-ubyte.gz
0it [00:00, ?it/s][W 2021-03-23 11:49:21,342] Setting status of trial#4 as TrialState.FAIL because of the following error: <HTTPError 522: 'Origin Connection Time-out'>
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/optuna/study.py", line 734, in _run_trial
    result = func(trial)
  File "pytorch_lightning_distributed.py", line 110, in objective
    trainer.fit(model)
  File "/usr/local/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 1044, in fit
    results = self.run_pretrain_routine(model)
  File "/usr/local/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 1185, in run_pretrain_routine
    self.reset_val_dataloader(ref_model)
  File "/usr/local/lib/python3.7/site-packages/pytorch_lightning/trainer/data_loading.py", line 343, in reset_val_dataloader
    self.num_val_batches, self.val_dataloaders = self._reset_eval_dataloader(model, 'val')
  File "/usr/local/lib/python3.7/site-packages/pytorch_lightning/trainer/data_loading.py", line 270, in _reset_eval_dataloader
    dataloaders = self.request_dataloader(getattr(model, f'{mode}_dataloader'))
  File "/usr/local/lib/python3.7/site-packages/pytorch_lightning/trainer/data_loading.py", line 364, in request_dataloader
    dataloader = dataloader_fx()
  File "pytorch_lightning_distributed.py", line 91, in val_dataloader
    datasets.MNIST(DIR, train=False, download=True, transform=transforms.ToTensor()),
  File "/usr/local/lib/python3.7/site-packages/torchvision/datasets/mnist.py", line 70, in __init__
    self.download()
  File "/usr/local/lib/python3.7/site-packages/torchvision/datasets/mnist.py", line 137, in download
    download_and_extract_archive(url, download_root=self.raw_folder, filename=filename, md5=md5)
  File "/usr/local/lib/python3.7/site-packages/torchvision/datasets/utils.py", line 249, in download_and_extract_archive
    download_url(url, download_root, filename, md5)
  File "/usr/local/lib/python3.7/site-packages/torchvision/datasets/utils.py", line 83, in download_url
    raise e
  File "/usr/local/lib/python3.7/site-packages/torchvision/datasets/utils.py", line 71, in download_url
    reporthook=gen_bar_updater()
  File "/usr/local/lib/python3.7/urllib/request.py", line 247, in urlretrieve
    with contextlib.closing(urlopen(url, data)) as fp:
  File "/usr/local/lib/python3.7/urllib/request.py", line 222, in urlopen
    return opener.open(url, data, timeout)
  File "/usr/local/lib/python3.7/urllib/request.py", line 531, in open
    response = meth(req, response)
  File "/usr/local/lib/python3.7/urllib/request.py", line 641, in http_response
    'http', request, response, code, msg, hdrs)
  File "/usr/local/lib/python3.7/urllib/request.py", line 569, in error
    return self._call_chain(*args)
  File "/usr/local/lib/python3.7/urllib/request.py", line 503, in _call_chain
    result = func(*args)
  File "/usr/local/lib/python3.7/urllib/request.py", line 649, in http_error_default
    raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 522: Origin Connection Time-out
Traceback (most recent call last):
  File "pytorch_lightning_distributed.py", line 128, in <module>
    callbacks=[MLflowCallback(tracking_uri="http://mlflow:5000/", metric_name="val_accuracy")],
  File "/usr/local/lib/python3.7/site-packages/optuna/study.py", line 339, in optimize
    func, n_trials, timeout, catch, callbacks, gc_after_trial, None
  File "/usr/local/lib/python3.7/site-packages/optuna/study.py", line 682, in _optimize_sequential
    self._run_trial_and_callbacks(func, catch, callbacks, gc_after_trial)
  File "/usr/local/lib/python3.7/site-packages/optuna/study.py", line 713, in _run_trial_and_callbacks
    trial = self._run_trial(func, catch, gc_after_trial)
  File "/usr/local/lib/python3.7/site-packages/optuna/study.py", line 734, in _run_trial
    result = func(trial)
  File "pytorch_lightning_distributed.py", line 110, in objective
    trainer.fit(model)
  File "/usr/local/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 1044, in fit
    results = self.run_pretrain_routine(model)
  File "/usr/local/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 1185, in run_pretrain_routine
    self.reset_val_dataloader(ref_model)
  File "/usr/local/lib/python3.7/site-packages/pytorch_lightning/trainer/data_loading.py", line 343, in reset_val_dataloader
    self.num_val_batches, self.val_dataloaders = self._reset_eval_dataloader(model, 'val')
  File "/usr/local/lib/python3.7/site-packages/pytorch_lightning/trainer/data_loading.py", line 270, in _reset_eval_dataloader
    dataloaders = self.request_dataloader(getattr(model, f'{mode}_dataloader'))
  File "/usr/local/lib/python3.7/site-packages/pytorch_lightning/trainer/data_loading.py", line 364, in request_dataloader
    dataloader = dataloader_fx()
  File "pytorch_lightning_distributed.py", line 91, in val_dataloader
    datasets.MNIST(DIR, train=False, download=True, transform=transforms.ToTensor()),
  File "/usr/local/lib/python3.7/site-packages/torchvision/datasets/mnist.py", line 70, in __init__
    self.download()
  File "/usr/local/lib/python3.7/site-packages/torchvision/datasets/mnist.py", line 137, in download
    download_and_extract_archive(url, download_root=self.raw_folder, filename=filename, md5=md5)
  File "/usr/local/lib/python3.7/site-packages/torchvision/datasets/utils.py", line 249, in download_and_extract_archive
    download_url(url, download_root, filename, md5)
  File "/usr/local/lib/python3.7/site-packages/torchvision/datasets/utils.py", line 83, in download_url
    raise e
  File "/usr/local/lib/python3.7/site-packages/torchvision/datasets/utils.py", line 71, in download_url
    reporthook=gen_bar_updater()
  File "/usr/local/lib/python3.7/urllib/request.py", line 247, in urlretrieve
    with contextlib.closing(urlopen(url, data)) as fp:
  File "/usr/local/lib/python3.7/urllib/request.py", line 222, in urlopen
    return opener.open(url, data, timeout)
  File "/usr/local/lib/python3.7/urllib/request.py", line 531, in open
    response = meth(req, response)
  File "/usr/local/lib/python3.7/urllib/request.py", line 641, in http_response
    'http', request, response, code, msg, hdrs)
  File "/usr/local/lib/python3.7/urllib/request.py", line 569, in error
    return self._call_chain(*args)
  File "/usr/local/lib/python3.7/urllib/request.py", line 503, in _call_chain
    result = func(*args)
  File "/usr/local/lib/python3.7/urllib/request.py", line 649, in http_error_default
    raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 522: Origin Connection Time-out
0it [00:31, ?it/s]

@0x41head
Contributor Author

Sorry, I didn't compare the logs of the two branches. I will look into this error ASAP.

@HideakiImamura
Member

@0x41head Are you interested in fixing the bug (#2513)? I think the same technique in #2505 can be applied.

@0x41head
Contributor Author

Sure @HideakiImamura 👍

@0x41head
Contributor Author

0x41head commented Mar 23, 2021

@HideakiImamura, a quick change from MNIST to FashionMNIST in kubernetes/mlflow/pytorch_lightning_distributed.py, on the master branch, brings me back to the original error that we faced in this PR (a rough sketch of the change is below). Should I open a new PR for the HTTP 522 error, or make all the changes in this PR?
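For reference, the dataset swap I tried is roughly the following. This is only a minimal sketch: DIR and the DataLoader settings are assumptions based on the traceback, not a verbatim copy of the example script.

# Hedged sketch of the MNIST -> FashionMNIST swap in the val_dataloader;
# DIR and batch_size are assumed for illustration.
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

DIR = "/usr/src"  # download directory guessed from the log output above

def val_dataloader():
    # datasets.MNIST(...) fails with HTTP 522 when downloading from yann.lecun.com,
    # so FashionMNIST is used instead:
    return DataLoader(
        datasets.FashionMNIST(DIR, train=False, download=True, transform=transforms.ToTensor()),
        batch_size=128,  # assumed value
    )

The worker log after this change is below: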

kali@kali:~/Desktop/optuna/examples/kubernetes/mlflow$ kubectl logs worker-99lcl
pytorch_lightning_distributed.py:128: ExperimentalWarning: MLflowCallback is experimental (supported from v1.4.0). The interface can change in the future.
  callbacks=[MLflowCallback(tracking_uri="http://mlflow:5000/", metric_name="val_accuracy")],
[W 2021-03-23 14:33:32,529] Trial 2 failed because of the following error: TypeError("__init__() got an unexpected keyword argument 'val_percent_check'")
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/optuna/_optimize.py", line 217, in _run_trial
    value_or_values = func(trial)
  File "pytorch_lightning_distributed.py", line 106, in objective
    callbacks=[metrics_callback],
  File "/usr/local/lib/python3.7/site-packages/pytorch_lightning/trainer/connectors/env_vars_connector.py", line 39, in insert_env_defaults
    return fn(self, **kwargs)
TypeError: __init__() got an unexpected keyword argument 'val_percent_check'
Traceback (most recent call last):
  File "pytorch_lightning_distributed.py", line 128, in <module>
    callbacks=[MLflowCallback(tracking_uri="http://mlflow:5000/", metric_name="val_accuracy")],
  File "/usr/local/lib/python3.7/site-packages/optuna/study.py", line 394, in optimize
    show_progress_bar=show_progress_bar,
  File "/usr/local/lib/python3.7/site-packages/optuna/_optimize.py", line 76, in _optimize
    progress_bar=progress_bar,
  File "/usr/local/lib/python3.7/site-packages/optuna/_optimize.py", line 163, in _optimize_sequential
    trial = _run_trial(study, func, catch)
  File "/usr/local/lib/python3.7/site-packages/optuna/_optimize.py", line 268, in _run_trial
    raise func_err
  File "/usr/local/lib/python3.7/site-packages/optuna/_optimize.py", line 217, in _run_trial
    value_or_values = func(trial)
  File "pytorch_lightning_distributed.py", line 106, in objective
    callbacks=[metrics_callback],
  File "/usr/local/lib/python3.7/site-packages/pytorch_lightning/trainer/connectors/env_vars_connector.py", line 39, in insert_env_defaults
    return fn(self, **kwargs)
TypeError: __init__() got an unexpected keyword argument 'val_percent_check'

@HideakiImamura
Member

Thanks for the investigation. I think we should fix both bugs in separate PRs, and after that come back to this PR.

The cause of TypeError: __init__() got an unexpected keyword argument 'val_percent_check' seems to be Lightning-AI/pytorch-lightning#2213. Replacing val_percent_check with limit_val_batches should work.
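A minimal sketch of that replacement, assuming a Trainer call roughly like the one in the traceback (the other argument values are assumptions for illustration, not the actual example code):

import pytorch_lightning as pl

def build_trainer(metrics_callback):
    # val_percent_check=0.1 raises the TypeError on recent PyTorch Lightning;
    # limit_val_batches is its replacement since v0.8.0.
    return pl.Trainer(
        logger=False,
        max_epochs=10,                 # assumed value
        limit_val_batches=0.1,         # replaces the removed val_percent_check
        callbacks=[metrics_callback],
    )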

@0x41head
Contributor Author

@HideakiImamura #2514 fixes the issues we faced during this PR.
Logs:

kali@kali:~/Desktop/optuna/examples/kubernetes/mlflow$ kubectl get pod
NAME                  READY   STATUS      RESTARTS   AGE
mlflow-0              1/1     Running     0          18m
postgres-0            1/1     Running     0          18m
study-creator-755nc   0/1     Completed   0          18m
worker-k4sb9          0/1     Completed   0          18m
worker-lpxzd          0/1     Completed   0          18m

@himkt himkt self-assigned this Mar 31, 2021
@crcrpar crcrpar removed their assignment Mar 31, 2021
Member

@HideakiImamura HideakiImamura left a comment


Thanks for the update and sorry for the late reply. LGTM!

Member

@himkt himkt left a comment


I'm not familiar with Kubernetes, but I confirmed that the jobs were successfully launched on my local minikube with the updated Dockerfile. Thank you @0x41head for updating the dependency!

@himkt himkt merged commit 473a5bd into optuna:master Mar 31, 2021
@0x41head 0x41head deleted the mlflow_docker branch March 31, 2021 14:52
@HideakiImamura HideakiImamura added this to the v2.7.0 milestone Apr 2, 2021
Successfully merging this pull request may close these issues.

Update Dockerfile of MLflow Kubernetes examples