Notebook SageMaker_SSH_Notebook.ipynb fails due to docker-compose #43

Closed
djmarti opened this issue Nov 14, 2023 · 5 comments

djmarti commented Nov 14, 2023

The notebook SageMaker_SSH_Notebook.ipynb throws an error related to docker compose:


INFO:sagemaker.local.image:docker command: docker-compose -f /tmp/tmpxkrcbq9c/docker-compose.yaml up --build --abort-on-container-exit

time="2023-11-14T20:11:59Z" level=warning msg="a network with name sagemaker-local exists but was not created by compose.\nSet `external: true` to use an existing network"
network sagemaker-local was found but has incorrect label com.docker.compose.network set to ""

---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
File ~/anaconda3/envs/python3/lib/python3.10/site-packages/sagemaker/local/image.py:296, in _SageMakerContainer.train(self, input_data_config, output_data_config, hyperparameters, environment, job_name)
    295 try:
--> 296     _stream_output(process)
    297 except RuntimeError as e:
    298     # _stream_output() doesn't have the command line. We will handle the exception
    299     # which contains the exit code and append the command line to it.

File ~/anaconda3/envs/python3/lib/python3.10/site-packages/sagemaker/local/image.py:984, in _stream_output(process)
    983 if exit_code != 0:
--> 984     raise RuntimeError("Process exited with code: %s" % exit_code)
    986 return exit_code

RuntimeError: Process exited with code: 1

Not sure what is causing the error. I was able to run the same notebook content just three months ago. Any hint or suggestion will be greatly appreciated.
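The log above complains that the existing sagemaker-local network has an empty com.docker.compose.network label. A minimal sketch for looking at that label directly, assuming `docker network inspect` prints its usual JSON array of network objects; the function names here are hypothetical and are not part of the SageMaker SDK or SSH Helper:

```python
import json
import subprocess


def compose_label_status(inspect_json, label="com.docker.compose.network"):
    """Map each network name to the value of its Compose label ("" if absent)."""
    networks = json.loads(inspect_json)
    report = {}
    for net in networks:
        # "Labels" may be missing or null for networks created outside Compose.
        labels = net.get("Labels") or {}
        report[net.get("Name", "<unnamed>")] = labels.get(label, "")
    return report


def inspect_network(name="sagemaker-local"):
    """Run `docker network inspect` for one network and summarize its label."""
    result = subprocess.run(
        ["docker", "network", "inspect", name],
        capture_output=True,
        text=True,
        check=False,
    )
    if result.returncode != 0:
        # The network does not exist or Docker is not reachable.
        return {"error": result.stderr.strip()}
    return compose_label_status(result.stdout)
```

Calling `inspect_network()` in a notebook cell or terminal session shows whether the label is empty, which is the condition the warning describes. It only reports the state; it does not change the network.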

djmarti changed the title from "Notebook SageMaker_SSH_Notebook.ipynb fails due to docker.compose" to "Notebook SageMaker_SSH_Notebook.ipynb fails due to docker-compose" on Nov 14, 2023
@ivan-khvostishkov
Contributor

Hi @djmarti, thanks for bringing up this important observation. The issue is probably rooted in recent changes in docker-compose: docker/compose#10797.

As a workaround, please downgrade docker-compose:

sudo curl -L "https://github.com/docker/compose/releases/download/v2.18.1/docker-compose-$(uname -s)-$(uname -m)" -o /usr/local/bin/docker-compose

There's nothing we can do on the SageMaker SSH Helper side, so I'll keep this issue open until SageMaker Notebooks get an update.
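To confirm which docker-compose version is actually on the PATH after the downgrade, something like the following sketch could be used. It assumes the `docker-compose -v` output contains a dotted version such as v2.18.1. The thread does not say which release introduced the change, so "newer than 2.18.1" here only means "newer than the suggested workaround version", not "known to be affected". All names are hypothetical:

```python
import re
import subprocess

# The version suggested as a workaround in this thread.
WORKAROUND_VERSION = (2, 18, 1)


def parse_compose_version(text):
    """Extract (major, minor, patch) from `docker-compose -v` output."""
    match = re.search(r"v?(\d+)\.(\d+)\.(\d+)", text)
    if match is None:
        raise ValueError(f"no version found in: {text!r}")
    return tuple(int(part) for part in match.groups())


def is_newer_than_workaround(text):
    """True if the reported version is newer than the workaround version."""
    return parse_compose_version(text) > WORKAROUND_VERSION


def installed_compose_version():
    """Ask the docker-compose binary on the PATH for its version."""
    output = subprocess.run(
        ["docker-compose", "-v"],
        capture_output=True,
        text=True,
        check=True,
    ).stdout
    return parse_compose_version(output)
```

If `installed_compose_version()` still returns something newer than the workaround version, another docker-compose binary is probably shadowing the one written to /usr/local/bin.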


djmarti commented Nov 17, 2023

Thanks Ivan for your prompt response and for the workaround. I think I gave a misleading hint. I am still unable to run the notebook after downgrading docker-compose to version 2.18.1. I checked that the version of docker-compose is the expected one:

$ whereis docker-compose
docker-compose: /usr/local/bin/docker-compose
$ docker-compose -v
Docker Compose version v2.18.1

But now I get an error that smells like a permission error:

e13eeylbz4-algo-1-c64ep  | 2023-11-17 01:34:50,818 sagemaker_pytorch_container.training INFO     Invoking user training script.
e13eeylbz4-algo-1-c64ep  | 2023-11-17 01:34:50,875 botocore.credentials INFO     Found credentials from IAM Role: BaseNotebookInstanceEc2InstanceRole
e13eeylbz4-algo-1-c64ep  | 2023-11-17 01:34:51,010 sagemaker-training-toolkit ERROR    Reporting training FAILURE
e13eeylbz4-algo-1-c64ep  | 2023-11-17 01:34:51,010 sagemaker-training-toolkit ERROR    Framework Error: 
e13eeylbz4-algo-1-c64ep  | Traceback (most recent call last):
e13eeylbz4-algo-1-c64ep  |   File "/opt/conda/lib/python3.9/site-packages/sagemaker_training/trainer.py", line 88, in train
e13eeylbz4-algo-1-c64ep  |     entrypoint()
e13eeylbz4-algo-1-c64ep  |   File "/opt/conda/lib/python3.9/site-packages/sagemaker_pytorch_container/training.py", line 153, in main
e13eeylbz4-algo-1-c64ep  |     train(environment.Environment())
e13eeylbz4-algo-1-c64ep  |   File "/opt/conda/lib/python3.9/site-packages/sagemaker_pytorch_container/training.py", line 100, in train
e13eeylbz4-algo-1-c64ep  |     entry_point.run(uri=training_environment.module_dir,
e13eeylbz4-algo-1-c64ep  |   File "/opt/conda/lib/python3.9/site-packages/sagemaker_training/entry_point.py", line 92, in run
e13eeylbz4-algo-1-c64ep  |     files.download_and_extract(uri=uri, path=environment.code_dir)
e13eeylbz4-algo-1-c64ep  |   File "/opt/conda/lib/python3.9/site-packages/sagemaker_training/files.py", line 138, in download_and_extract
e13eeylbz4-algo-1-c64ep  |     s3_download(uri, dst)
e13eeylbz4-algo-1-c64ep  |   File "/opt/conda/lib/python3.9/site-packages/sagemaker_training/files.py", line 174, in s3_download
e13eeylbz4-algo-1-c64ep  |     s3.Bucket(bucket).download_file(key, dst)
e13eeylbz4-algo-1-c64ep  |   File "/opt/conda/lib/python3.9/site-packages/boto3/s3/inject.py", line 277, in bucket_download_file
e13eeylbz4-algo-1-c64ep  |     return self.meta.client.download_file(
e13eeylbz4-algo-1-c64ep  |   File "/opt/conda/lib/python3.9/site-packages/boto3/s3/inject.py", line 190, in download_file
e13eeylbz4-algo-1-c64ep  |     return transfer.download_file(
e13eeylbz4-algo-1-c64ep  |   File "/opt/conda/lib/python3.9/site-packages/boto3/s3/transfer.py", line 326, in download_file
e13eeylbz4-algo-1-c64ep  |     future.result()
e13eeylbz4-algo-1-c64ep  |   File "/opt/conda/lib/python3.9/site-packages/s3transfer/futures.py", line 103, in result
e13eeylbz4-algo-1-c64ep  |     return self._coordinator.result()
e13eeylbz4-algo-1-c64ep  |   File "/opt/conda/lib/python3.9/site-packages/s3transfer/futures.py", line 266, in result
e13eeylbz4-algo-1-c64ep  |     raise self._exception
e13eeylbz4-algo-1-c64ep  |   File "/opt/conda/lib/python3.9/site-packages/s3transfer/tasks.py", line 269, in _main
e13eeylbz4-algo-1-c64ep  |     self._submit(transfer_future=transfer_future, **kwargs)
e13eeylbz4-algo-1-c64ep  |   File "/opt/conda/lib/python3.9/site-packages/s3transfer/download.py", line 354, in _submit
e13eeylbz4-algo-1-c64ep  |     response = client.head_object(
e13eeylbz4-algo-1-c64ep  |   File "/opt/conda/lib/python3.9/site-packages/botocore/client.py", line 530, in _api_call
e13eeylbz4-algo-1-c64ep  |     return self._make_api_call(operation_name, kwargs)
e13eeylbz4-algo-1-c64ep  |   File "/opt/conda/lib/python3.9/site-packages/botocore/client.py", line 960, in _make_api_call
e13eeylbz4-algo-1-c64ep  |     raise error_class(parsed_response, operation_name)
e13eeylbz4-algo-1-c64ep  | botocore.exceptions.ClientError: An error occurred (403) when calling the HeadObject operation: Forbidden
e13eeylbz4-algo-1-c64ep  | 
e13eeylbz4-algo-1-c64ep  | An error occurred (403) when calling the HeadObject operation: Forbidden
e13eeylbz4-algo-1-c64ep  | 2023-11-17 01:34:51,011 sagemaker-training-toolkit ERROR    Encountered exit_code 1

A permission error is surprising because I didn't have issues before and because there haven't been any changes in my setup.

@ivan-khvostishkov
Contributor

Looking at the exception stack trace, this again looks related to SageMaker itself rather than to SSH Helper. The container is downloading the code from S3, most likely from the default bucket, which looks like s3://sagemaker-eu-west-1-555555555555/ . Could you check that this bucket exists, that you can access it from your notebook instance (e.g. by running the aws s3 cp command from the Terminal), and that it is located in the same region as your notebook?

If the above steps don't help, please raise a support case:
https://docs.aws.amazon.com/awssupport/latest/user/case-management.html
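The bucket checks above could also be scripted from the notebook. The sketch below assumes boto3 is available on the instance and that the default bucket follows the sagemaker-<region>-<account id> pattern shown above; the helper names are hypothetical. It only tests bucket-level access with HeadBucket and GetBucketLocation, so it can pass while the object-level HeadObject call from the traceback still fails:

```python
def default_bucket_name(region, account_id):
    """Build the default bucket name in the form shown in this thread."""
    return f"sagemaker-{region}-{account_id}"


def normalize_location(location_constraint):
    """Translate S3's LocationConstraint value into a region name."""
    # S3 reports None for us-east-1 and the legacy value "EU" for eu-west-1.
    if location_constraint is None:
        return "us-east-1"
    if location_constraint == "EU":
        return "eu-west-1"
    return location_constraint


def check_bucket(bucket, expected_region, s3_client=None):
    """Report whether the bucket is reachable and in the expected region."""
    if s3_client is None:
        import boto3  # imported lazily so the helpers above work without it

        s3_client = boto3.client("s3")
    try:
        s3_client.head_bucket(Bucket=bucket)
        location = s3_client.get_bucket_location(Bucket=bucket)
    except Exception as exc:  # broad on purpose: this is a diagnostic sketch
        return {"bucket": bucket, "accessible": False, "error": str(exc)}
    region = normalize_location(location.get("LocationConstraint"))
    return {
        "bucket": bucket,
        "accessible": True,
        "region": region,
        "same_region": region == expected_region,
    }
```

For example, `check_bucket(default_bucket_name("eu-west-1", "555555555555"), "eu-west-1")` would report on the placeholder bucket name used above; substitute your own region and account id.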

djmarti closed this as completed on Dec 13, 2023

djmarti commented Dec 13, 2023

Apologies for the long delay. I retried with the exact same code and the problem is gone, which is consistent with your suggestion that this was something related to SageMaker. Everything works as expected, closing the ticket.

@ivan-khvostishkov
Contributor

I've run into a similar HeadObject message myself; in my case it looks like the notebook instance had been running for a very long time. After I stopped and started the instance again, the issue was gone.
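The stop/start cycle can be done from the console, or scripted roughly as below. This is a sketch that assumes boto3 and permissions for the SageMaker StopNotebookInstance and StartNotebookInstance APIs; the function name is hypothetical. It has to run from somewhere other than the notebook instance being restarted, since stopping the instance ends any process running on it:

```python
def restart_notebook_instance(name, sm_client=None):
    """Stop a notebook instance, wait until it is stopped, then start it again."""
    if sm_client is None:
        import boto3  # imported lazily; a client can also be passed in

        sm_client = boto3.client("sagemaker")

    # Stop and block until the instance reports the Stopped status.
    sm_client.stop_notebook_instance(NotebookInstanceName=name)
    sm_client.get_waiter("notebook_instance_stopped").wait(
        NotebookInstanceName=name
    )

    # Start again and block until the instance is back in service.
    sm_client.start_notebook_instance(NotebookInstanceName=name)
    sm_client.get_waiter("notebook_instance_in_service").wait(
        NotebookInstanceName=name
    )
```

Whether a restart helps in a given case is not established by this thread; it is simply what made the message go away here.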
