[BEAM-13314]Revise recommendations to manage Python pipeline dependencies. #16938

Merged
merged 12 commits into from
Mar 29, 2022
46 changes: 44 additions & 2 deletions website/www/site/content/en/documentation/runtime/environments.md
@@ -42,11 +42,11 @@ For optimal user experience, we also recommend you use the latest released versi

### Building and pushing custom containers

Beam [SDK container images](https://hub.docker.com/search?q=apache%2Fbeam&type=image) are built from Dockerfiles checked into the [Github](https://github.com/apache/beam) repository and published to Docker Hub for every release. You can build customized containers in one of two ways:
Beam [SDK container images](https://hub.docker.com/search?q=apache%2Fbeam&type=image) are built from Dockerfiles checked into the [Github](https://github.com/apache/beam) repository and published to Docker Hub for every release. You can build customized containers in one of three ways:

1. **[Writing a new](#writing-new-dockerfiles) Dockerfile based on a released container image**. This is sufficient for simple additions to the image, such as adding artifacts or environment variables.
2. **[Modifying](#modifying-dockerfiles) a source Dockerfile in [Beam](https://github.com/apache/beam)**. This method requires building from Beam source but allows for greater customization of the container (including replacement of artifacts or base OS/language versions).

3. **[Modifying](#modify-existing-base-image) an existing container image to make it compatible with Apache Beam Runners**. This method is used when users start from an existing image, and configure the image to be compatible with Apache Beam Runners.
#### Writing a new Dockerfile based on an existing published container image {#writing-new-dockerfiles}

1. Create a new Dockerfile that designates a base image using the [FROM instruction](https://docs.docker.com/engine/reference/builder/#from).
@@ -171,6 +171,48 @@ creates a Java 8 SDK image with appropriate licenses in `/opt/apache/beam/third_

By default, no licenses/notices are added to the docker images.

#### Modifying an existing container image to make it compatible with Apache Beam Runners {#modify-existing-base-image}
Beam offers a way to provide your own custom container image. The easiest way to build a new custom image that is compatible with Apache Beam Runners is to use a [multi-stage build](https://docs.docker.com/develop/develop-images/multistage-build/) process. This copies over the necessary artifacts from a default Apache Beam base image to build your custom container image.
Contributor

Is this a multi-stage build?

Contributor Author
@AnandInguva, Mar 10, 2022

Yes. It uses a multi-stage build to copy the required artifacts from Apache Beam's base image into the provided custom image.


1. Copy necessary artifacts from Apache Beam base image to your image.
```
# This can be any container image,
FROM python:3.7-bullseye

# Install SDK. (needed for Python SDK)
RUN pip install --no-cache-dir apache-beam[gcp]==2.35.0

# Copy files from official SDK image, including script/dependencies.
COPY --from=apache/beam_python3.7_sdk:2.35.0 /opt/apache/beam /opt/apache/beam
Contributor

I wonder if this is sufficient. IIRC, /opt/apache/beam only contains the boot program? All the base_image_requirements are in site_packages?

Contributor Author

they are in dist_packages

Contributor

The requirements will be installed at runtime, when we pip-install the staged Apache Beam SDK here:

err := pipInstallPackage(files, workDir, sdkWhlFile, false, false, []string{"gcp"})


# Perform any additional customizations if desired

# Set the entrypoint to Apache Beam SDK launcher.
ENTRYPOINT ["/opt/apache/beam/boot"]

```
>**NOTE**: This example assumes necessary dependencies (in this case, Python 3.7 and pip) have been installed on the existing base image. Installing the Apache Beam SDK into the image will ensure that the image has the necessary SDK dependencies and reduce the worker startup time.
>The version specified in the `RUN` instruction must match the version used to launch the pipeline.<br>
tvalentyn marked this conversation as resolved.
>**Users need to make sure that whatever base image they use has the same Python/Java interpreter version that they used to run the pipeline**.
Collaborator

Consider: "Make sure that the Python or Java runtime version specified in the base image is the same as the version used to run the pipeline."



2. [Build](https://docs.docker.com/engine/reference/commandline/build/) and [push](https://docs.docker.com/engine/reference/commandline/push/) the image using Docker.
```
export BASE_IMAGE="apache/beam_python3.7_sdk:2.25.0"
export IMAGE_NAME="myremoterepo/mybeamsdk"
export TAG="latest"

# Optional - pull the base image into your local Docker daemon to ensure
# you have the most up-to-date version of the base image locally.
docker pull "${BASE_IMAGE}"

docker build -f Dockerfile -t "${IMAGE_NAME}:${TAG}" .
```

3. If your runner is running remotely, retag the image and [push](https://docs.docker.com/engine/reference/commandline/push/) the image to your repository.
```
docker push "${IMAGE_NAME}:${TAG}"
```
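
The notes above ask that the Beam version and the Python interpreter version in the image match the ones used to launch the pipeline. A minimal pre-flight sketch of that check, assuming the SDK image tag follows the `apache/beam_python<major>.<minor>_sdk:<version>` pattern used in the Dockerfile. `check_image_matches` is a hypothetical helper and is not part of the Beam SDK:

```python
import re
import sys

# Hypothetical helper (not part of the Beam SDK). Assumes image tags shaped
# like "apache/beam_python3.7_sdk:2.35.0".
_SDK_IMAGE = re.compile(r"beam_python(\d+)\.(\d+)_sdk:([\w.]+)$")


def check_image_matches(image, beam_version, python_version=None):
    """Return a list of mismatch messages; an empty list means no mismatch."""
    if python_version is None:
        python_version = sys.version_info[:2]  # interpreter launching the pipeline
    match = _SDK_IMAGE.search(image)
    if match is None:
        return [f"unrecognized image name: {image}"]
    major, minor, image_beam = match.groups()
    problems = []
    if (int(major), int(minor)) != tuple(python_version):
        problems.append(
            f"image targets Python {major}.{minor}, "
            f"launcher uses {python_version[0]}.{python_version[1]}"
        )
    if image_beam != beam_version:
        problems.append(f"image has Beam {image_beam}, launcher uses {beam_version}")
    return problems
```

The check only covers the SDK image named in `COPY --from`. The interpreter in an arbitrary base image such as `python:3.7-bullseye` still has to be verified separately.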

## Running pipelines with custom container images {#running-pipelines}

@@ -45,6 +45,16 @@ If your pipeline uses public packages from the [Python Package Index](https://py
The runner will use the `requirements.txt` file to install your additional dependencies onto the remote workers.

**Important:** Remote workers will install all packages listed in the `requirements.txt` file. Because of this, it's very important that you delete non-PyPI packages from the `requirements.txt` file, as stated in step 2. If you don't remove non-PyPI packages, the remote workers will fail when attempting to install packages from sources that are unknown to them.
> **NOTE**: An alternative to `pip freeze` is to use a library like [pip-tools](https://github.com/jazzband/pip-tools) to compile the all the dependencies required for the pipeline from a `--requirements_file`, where only top-level dependencies are mentioned.
Collaborator

"compile the all" -> "compile all"

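The **Important** paragraph above asks for non-PyPI packages to be removed from `requirements.txt`. A rough sketch of how `pip freeze` output could be screened for that, assuming that editable installs (`-e ...`) and direct references (`name @ url`) are the entries to review. `split_requirements` is a hypothetical helper, and the heuristic cannot detect a private package that is pinned with a plain `==`:

```python
def split_requirements(freeze_lines):
    """Split `pip freeze` lines into plain pins and entries to review by hand."""
    pinned, review = [], []
    for raw in freeze_lines:
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        # Editable installs and direct references point at a VCS URL or a
        # local path, which remote workers usually cannot resolve.
        if line.startswith("-e ") or " @ " in line:
            review.append(line)
        else:
            pinned.append(line)
    return pinned, review
```

Only the `pinned` entries would then be written to the `requirements.txt` passed to the runner. The `review` entries need the handling described for local or non-PyPI dependencies.
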
## Custom Containers {#custom-containers}
AnandInguva marked this conversation as resolved.

You can pass a [container](https://hub.docker.com/search?q=apache%2Fbeam&type=image) image with all the dependencies that are needed for the pipeline instead of `requirements.txt`. [Follow the instructions on how to run pipeline with Custom Container images](https://beam.apache.org/documentation/runtime/environments/#running-pipelines).

1. If you are using a custom container image, we recommend that you install the dependencies from the `--requirements_file` directly into your image at build time. In this case, you do not need to pass `--requirements_file` option at runtime, which will reduce the pipeline startup time.

# Add these lines with the path to the requirements.txt to the Dockerfile
COPY <path to requirements.txt> /tmp/requirements.txt
RUN python -m pip install -r /tmp/requirements.txt


## Local or non-PyPI Dependencies {#local-or-nonpypi}
@@ -53,7 +63,7 @@ If your pipeline uses packages that are not available publicly (e.g. packages th

1. Identify which packages are installed on your machine and are not public. Run the following command:

pip freeze
pip freeze

This command lists all packages that are installed on your machine, regardless of where they were installed from.

@@ -123,3 +133,20 @@ If your pipeline uses non-Python packages (e.g. packages that require installati
--setup_file /path/to/setup.py

**Note:** Because custom commands execute after the dependencies for your workflow are installed (by `pip`), you should omit the PyPI package dependency from the pipeline's `requirements.txt` file and from the `install_requires` parameter in the `setuptools.setup()` call of your `setup.py` file.

## Pre-building SDK container image
tvalentyn marked this conversation as resolved.

In the pre-building step, we install pipeline dependencies on the container image prior to the job submission. This would speed up the pipeline execution.\
Contributor

Something is missing here. Let's add an introductory sentence.

In pipeline execution modes where a Beam runner launches SDK workers in Docker containers, the additional pipeline dependencies (specified via --requirements_file and other runtime options) are installed into the containers at runtime. This can increase the worker startup time. However, it may be possible to pre-build the SDK containers and perform the dependency installation once before the workers start.

To use pre-building the dependencies from `requirements.txt` on the container image. Follow the steps below.
Contributor

Suggested change
To use pre-building the dependencies from `requirements.txt` on the container image. Follow the steps below.
To pre-build the container image before the pipeline submission, follow the steps below.

1. Provide the container engine. We support `local_docker` and `cloud_build`(requires a GCP project with Cloud Build API enabled).
Contributor

Suggested change
1. Provide the container engine. We support `local_docker` and `cloud_build`(requires a GCP project with Cloud Build API enabled).
1. Provide the container engine. We support `local_docker` (requires local installation of Docker) and `cloud_build`(requires a GCP project with Cloud Build API enabled).


--prebuild_sdk_container_engine <execution_environment>
Contributor

Suggested change
--prebuild_sdk_container_engine <execution_environment>
--prebuild_sdk_container_engine <container_engine>

2. To pass a base image for pre-building dependencies, enable this flag. If not, apache beam's base image would be used.
Contributor

Note that this may not work on an arbitrary base image; the base image should follow the same contract as Apache Beam's base image for installing dependencies in setup_only mode: https://github.com/apache/beam/blob/master/sdks/python/container/boot.go#L49

Contributor
@tvalentyn, Mar 7, 2022

  1. prebuild_sdk_container_engine (not enginer)
  2. Good point that the container needs to have the official entry point for this to work. I think all the container-customization mechanisms we suggest recommend, one way or another, using Beam's boot entry point.
  3. As a part of making prebuilding not experimental, I think we should remove prebuild_sdk_container_base_image and just use the --sdk_container_image flag for this purpose. I don't see the need for two different flags.

Contributor Author

+1 to the point 3

Contributor

@y1chi took care of #3 in #17032. Thanks, @y1chi .

Contributor Author

As @y1chi pointed out, it may not work if the user doesn't follow apache beam's contract. But we do instruct them to follow the contract in some way.

So, I assume we can introduce this section as part of the instruction?

Contributor
@tvalentyn, Mar 17, 2022

I would remove #2 now that we don't need a special flag and use the standard --sdk_container_image flag for this purpose.

Contributor Author

I feel like we can keep 2 and update the pipeline option to --sdk_container_image=....


--sdk_container_image <location_to_base_image>
3. To push the container image, pre-built locally with `local_docker` , to a remote repository(eg: docker registry), provide URL to the remote registry by passing
Contributor

If using local_docker engine, provide a URL for the remote registry to which the image will be pushed by passing...


--docker_registry_push_url <IMAGE_URL>
Contributor

I am confused - what is a sample value of this param? Is it supposed to be the image name+tag or just the registry?

Contributor Author
@AnandInguva, Mar 17, 2022

It should be a registry. We generate the image tag. Image name is coded as beam_python_prebuilt_sdk at [1].

Maybe it can be worded as --docker_registry_push_url <registry_URL>

[1]

Contributor

Can we give an example of the expected value? As a user reading this doc it is still not obvious.

Contributor Author

We can add an example. Also, let me see if I can make the wording simpler.

> To use Docker, the `--sdk_container_image` should be compatible with Apache Beam Runner. Please follow the [instructions](https://beam.apache.org/documentation/runtime/environments/#building-and-pushing-custom-containers) on how to build a base container image compatible with Apache Beam.

Contributor

Suggestion to add to the notes:

The pre-building feature requires the Apache Beam SDK for Python, version 2.25.0 or later.

The container images created during prebuilding will persist beyond the pipeline runtime.
Once your job is finished or stopped, you can remove the pre-built image from the container registry.

If your pipeline is using a custom container image, most likely you will not benefit from the prebuilding step, as extra dependencies can be preinstalled in the custom image at build time. If you still would like to use prebuilding with custom images, use Apache Beam SDK 2.38.0 or newer and supply your custom image via the --sdk_container_image pipeline option.

**NOTE**: This feature is available only for the `DataflowRunner`.
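
The flags discussed in this thread can be collected into one argument list. A minimal sketch, assuming the options are passed as `--flag=value` strings. `prebuild_args` and the sample values are hypothetical; only the flag names and the two engine names come from the diff above:

```python
SUPPORTED_ENGINES = ("local_docker", "cloud_build")


def prebuild_args(engine, requirements_file, registry_url=None,
                  sdk_container_image=None):
    """Build the pipeline flags for pre-building the SDK container."""
    if engine not in SUPPORTED_ENGINES:
        raise ValueError(f"unsupported container engine: {engine}")
    args = [
        "--runner=DataflowRunner",  # per the note above, Dataflow only
        f"--requirements_file={requirements_file}",
        f"--prebuild_sdk_container_engine={engine}",
    ]
    if registry_url:
        # Registry only; per the thread, the image name and tag are generated.
        args.append(f"--docker_registry_push_url={registry_url}")
    if sdk_container_image:
        # Optional custom base image (Beam 2.38.0 or newer, per the thread).
        args.append(f"--sdk_container_image={sdk_container_image}")
    return args
```

The resulting list could be handed to `apache_beam.options.pipeline_options.PipelineOptions` or appended to the command line that launches the pipeline. A call such as `prebuild_args("local_docker", "requirements.txt", registry_url="example.com/my-registry")` uses a placeholder registry, not a recommended value.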