
[Bug Report] - SageMaker Pipelines to Run Jobs Locally #3635

Open
fjpa121197 opened this issue Oct 25, 2022 · 7 comments

fjpa121197 commented Oct 25, 2022

Link to the notebook

Following example from: https://github.com/aws/amazon-sagemaker-examples/blob/main/sagemaker-pipelines/tabular/local-mode/sagemaker-pipelines-local-mode.ipynb

My notebook with the error (I made some minor modifications to it): https://github.com/fjpa121197/aws-sagemaker-training/blob/main/sagemaker-pipelines-local-mode.ipynb

Describe the bug
I'm trying to follow this tutorial to run SageMaker Pipelines locally and test them before using managed resources. I have created a pipeline definition that includes preprocessing, training, and evaluation steps. I can create the pipeline without any problem, but when executing it, I hit an error in the evaluation step: the model.tar.gz file is not downloaded into the container at the expected directory, so the model can't be used for evaluation.

Error:

Starting pipeline step: 'AbaloneEval'
Container jhais7c823-algo-1-7ko39  Creating
Container jhais7c823-algo-1-7ko39  Created
Attaching to jhais7c823-algo-1-7ko39
jhais7c823-algo-1-7ko39  | Traceback (most recent call last):
jhais7c823-algo-1-7ko39  |   File "/opt/ml/processing/input/code/evaluation.py", line 16, in <module>
jhais7c823-algo-1-7ko39  |     with tarfile.open(model_path) as tar:
jhais7c823-algo-1-7ko39  |   File "/miniconda3/lib/python3.8/tarfile.py", line 1603, in open
jhais7c823-algo-1-7ko39  |     return func(name, "r", fileobj, **kwargs)
jhais7c823-algo-1-7ko39  |   File "/miniconda3/lib/python3.8/tarfile.py", line 1667, in gzopen
jhais7c823-algo-1-7ko39  |     fileobj = GzipFile(name, mode + "b", compresslevel, fileobj)
jhais7c823-algo-1-7ko39  |   File "/miniconda3/lib/python3.8/gzip.py", line 173, in __init__
jhais7c823-algo-1-7ko39  |     fileobj = self.myfileobj = builtins.open(filename, mode or 'rb')
jhais7c823-algo-1-7ko39  | FileNotFoundError: [Errno 2] No such file or directory: '/opt/ml/processing/model/model.tar.gz'

jhais7c823-algo-1-7ko39 exited with code 1
Aborting on container exit...
Container jhais7c823-algo-1-7ko39  Stopping
Container jhais7c823-algo-1-7ko39  Stopped
Pipeline step 'AbaloneEval' FAILED. Failure message is: RuntimeError: Failed to run: ['docker-compose', '-f', 'C:\\Users\\FRANCI~1.PAR\\AppData\\Local\\Temp\\tmp188wz79r\\docker-compose.yaml', 'up', '--build', '--abort-on-container-exit']
Pipeline execution 1012b92d-36c6-4499-b898-d78d7a2bea8a FAILED because step 'AbaloneEval' failed.
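
For context, the failing lines in evaluation.py look roughly like this (a sketch reconstructed from the traceback above; the exact script is in the linked notebook):

import tarfile

# The evaluation step's ProcessingInput is expected to download the
# training artifact to this path inside the processing container.
model_path = "/opt/ml/processing/model/model.tar.gz"

# This open() is what raises FileNotFoundError when the artifact was
# never downloaded into the container.
with tarfile.open(model_path) as tar:
    tar.extractall(path=".")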

As I understand it, the evaluation step definition is as follows:

Job Name:  script-abalone-eval-2022-10-25-09-04-44-205
Inputs:
[{'InputName': 'input-1', 'AppManaged': False, 'S3Input': {'S3Uri': <sagemaker.workflow.properties.Properties object at 0x000002647A7F1DC0>, 'LocalPath': '/opt/ml/processing/model', 'S3DataType': 'S3Prefix', 'S3InputMode': 'File', 'S3DataDistributionType': 'FullyReplicated', 'S3CompressionType': 'None'}},
 {'InputName': 'input-2', 'AppManaged': False, 'S3Input': {'S3Uri': <sagemaker.workflow.properties.Properties object at 0x000002647A11EB80>, 'LocalPath': '/opt/ml/processing/test', 'S3DataType': 'S3Prefix', 'S3InputMode': 'File', 'S3DataDistributionType': 'FullyReplicated', 'S3CompressionType': 'None'}},
 {'InputName': 'code', 'AppManaged': False, 'S3Input': {'S3Uri': 's3://sagemaker-local-pipeline-tutorials/script-abalone-eval-2022-10-25-09-04-44-205/input/code/evaluation.py', 'LocalPath': '/opt/ml/processing/input/code', 'S3DataType': 'S3Prefix', 'S3InputMode': 'File', 'S3DataDistributionType': 'FullyReplicated', 'S3CompressionType': 'None'}}]
Outputs:  [{'OutputName': 'evaluation', 'AppManaged': False, 'S3Output': {'S3Uri': 's3://sagemaker-local-pipeline-tutorials/script-abalone-eval-2022-10-25-09-04-44-205/output/evaluation', 'LocalPath': '/opt/ml/processing/evaluation', 'S3UploadMode': 'EndOfJob'}}]

And my eval_args definition is as follows:

eval_args = script_eval.run(
    inputs=[
        ProcessingInput(
            source=step_train.properties.ModelArtifacts.S3ModelArtifacts,
            destination="/opt/ml/processing/model",
        ),
        ProcessingInput(
            source=step_process.properties.ProcessingOutputConfig.Outputs["test"].S3Output.S3Uri,
            destination="/opt/ml/processing/test",
        ),
    ],
    outputs=[
        ProcessingOutput(output_name="evaluation", source="/opt/ml/processing/evaluation"),
    ],
    code="code/evaluation.py",
)

The source for the first input refers to the step_train step defined earlier, and it should download the model artifacts into the container, but it does not. The second input works fine: the test data is downloaded, but the model artifacts are not.

I'm not sure if there is a replacement for the source=step_train.properties.ModelArtifacts.S3ModelArtifacts argument.
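
One thing I noticed while debugging: the reference can only be inspected as an unresolved pipeline expression before execution, which is why the job definition above prints a Properties object instead of an S3 URI. A sketch (assuming the SDK's Properties.expr attribute and the AbaloneTrain step name from the notebook):

# Sketch: inspect the unresolved expression behind the reference.
ref = step_train.properties.ModelArtifacts.S3ModelArtifacts

# Prints something like:
#   {'Get': 'Steps.AbaloneTrain.ModelArtifacts.S3ModelArtifacts'}
# The concrete S3 URI is only substituted when the step actually runs.
print(ref.expr)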

Am I doing something wrong? I don't think it's permissions/policy related, since I don't get any AccessDenied errors.

I'm using sagemaker 2.113.0.

Thanks in advance


fjpa121197 commented Oct 25, 2022

@pmhargis-aws
@kirit93

pmhargis-aws (Contributor) commented

Thanks for running the sample.

This reference is supposed to locate the model file on S3: 'source=step_train.properties.ModelArtifacts.S3ModelArtifacts'

Can you look at the training job stats and confirm that the "model.tar.gz" file is in fact stored in S3 at the prescribed location?
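
One quick way to verify is to list the bucket prefix with boto3 (a sketch; the bucket name is taken from your job definition above, and the prefix here is illustrative, so adjust it to wherever your pipeline writes model artifacts):

import boto3

# List everything under the pipeline's output prefix and look for a
# model.tar.gz key produced by the training step.
s3 = boto3.client("s3")
resp = s3.list_objects_v2(
    Bucket="sagemaker-local-pipeline-tutorials",
    Prefix="sagemaker-pipelines-local-mode-example-1/",
)
for obj in resp.get("Contents", []):
    print(obj["Key"], obj["Size"])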


fjpa121197 commented Oct 25, 2022

Hi @pmhargis-aws

Is it possible to see the resolved value of that reference? Based on this reference about the properties attribute, it should point to the S3 location, right? Where can I check the training job stats? The last logging line I see is INFO:root:Stored trained model at /opt/ml/model/xgboost-model, which comes from the training step container.

The location of that file is the following:
"s3://sagemaker-local-pipeline-tutorials/sagemaker-pipelines-local-mode-example-1/model/AbaloneTrain-1666688809-7519/"

I'm not sure where the last part, AbaloneTrain-1666688809-7519, comes from; maybe from the name defined in TrainingStep()?

Does the reference source=step_train.properties.ModelArtifacts.S3ModelArtifacts point to "s3://sagemaker-local-pipeline-tutorials/sagemaker-pipelines-local-mode-example-1/model/", and is it failing to download the model.tar.gz file because of the extra AbaloneTrain-1666688809-7519 suffix?


kirit93 commented Oct 26, 2022

@fjpa121197 - I just ran your notebook and it worked for me with no issues. Are you consistently seeing this?

fjpa121197 (Author) commented

Really? I haven't been able to run the notebook successfully in the roughly ten attempts I made yesterday.

Is there another way to make the same reference and be able to proceed with the evaluation?

Rainymood commented

This helped me, maybe it helps you: docker/docker-py#3099

Everything runs fine for me now, but before this I couldn't get the SageMaker pipeline to work locally, even though I could build and run the individual containers from the pipeline on their own. I'm not 100% sure how this works, but it seems the sagemaker package I used still depended on an older version of the docker package that did not have this fix (see the link above); a quick way to check your installed version is sketched after the list below.

  • Windows 10, 64-bit
  • Python 3.8.0
  • sagemaker 2.144.0
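
To check whether your environment already includes that fix, you can print the installed version of the docker package and compare it against the release that contains the linked change (a sketch):

from importlib.metadata import version

# The fix referenced above (docker/docker-py#3099) is only present in
# newer docker-py releases; if this prints an older version, upgrading
# (e.g. pip install --upgrade docker) may be what's needed.
print(version("docker"))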

pmhargis-aws (Contributor) commented

@Rainymood Thanks for the update. I wonder if Kirit and I did not run into this problem because we run the samples on native macOS; the Docker fix you reference addresses Windows sockets.
