[BEAM-13314] Revise recommendations to manage Python pipeline dependencies. (#16938)

Co-authored-by: tvalentyn <tvalentyn@users.noreply.github.com>
AnandInguva and tvalentyn authored Mar 29, 2022
1 parent dc5e209 commit c84818d
Showing 2 changed files with 81 additions and 3 deletions.
46 changes: 44 additions & 2 deletions website/www/site/content/en/documentation/runtime/environments.md
@@ -42,11 +42,11 @@ For optimal user experience, we also recommend you use the latest released versi

### Building and pushing custom containers

Beam [SDK container images](https://hub.docker.com/search?q=apache%2Fbeam&type=image) are built from Dockerfiles checked into the [Github](https://github.com/apache/beam) repository and published to Docker Hub for every release. You can build customized containers in one of two ways:
Beam [SDK container images](https://hub.docker.com/search?q=apache%2Fbeam&type=image) are built from Dockerfiles checked into the [GitHub](https://github.com/apache/beam) repository and published to Docker Hub for every release. You can build customized containers in one of three ways:

1. **[Writing a new](#writing-new-dockerfiles) Dockerfile based on a released container image**. This is sufficient for simple additions to the image, such as adding artifacts or environment variables.
2. **[Modifying](#modifying-dockerfiles) a source Dockerfile in [Beam](https://github.com/apache/beam)**. This method requires building from Beam source but allows for greater customization of the container (including replacement of artifacts or base OS/language versions).

3. **[Modifying](#modify-existing-base-image) an existing container image to make it compatible with Apache Beam Runners**. Use this method when you start from an existing image and need to configure it to be compatible with Apache Beam Runners.

#### Writing a new Dockerfile based on an existing published container image {#writing-new-dockerfiles}

1. Create a new Dockerfile that designates a base image using the [FROM instruction](https://docs.docker.com/engine/reference/builder/#from).
@@ -172,6 +172,48 @@ creates a Java 8 SDK image with appropriate licenses in `/opt/apache/beam/third_

By default, no licenses/notices are added to the docker images.

#### Modifying an existing container image to make it compatible with Apache Beam Runners {#modify-existing-base-image}

You can also start from your own container image and make it compatible with Apache Beam Runners. The easiest way to do this is a [multi-stage build](https://docs.docker.com/develop/develop-images/multistage-build/), which copies the necessary artifacts from a default Apache Beam base image into your custom container image.

1. Copy necessary artifacts from Apache Beam base image to your image.
```
# This can be any container image.
FROM python:3.7-bullseye
# Install the Apache Beam SDK (needed for the Python SDK).
RUN pip install --no-cache-dir apache-beam[gcp]==2.35.0
# Copy files from official SDK image, including script/dependencies.
COPY --from=apache/beam_python3.7_sdk:2.35.0 /opt/apache/beam /opt/apache/beam
# Perform any additional customizations if desired
# Set the entrypoint to Apache Beam SDK launcher.
ENTRYPOINT ["/opt/apache/beam/boot"]
```
>**NOTE**: This example assumes necessary dependencies (in this case, Python 3.7 and pip) have been installed on the existing base image. Installing the Apache Beam SDK into the image will ensure that the image has the necessary SDK dependencies and reduce the worker startup time.
>The version specified in the `RUN` instruction must match the version used to launch the pipeline.<br>
>**Make sure that the Python or Java runtime version specified in the base image is the same as the version used to run the pipeline.**

2. [Build](https://docs.docker.com/engine/reference/commandline/build/) the image using Docker.
```
export BASE_IMAGE="apache/beam_python3.7_sdk:2.35.0"
export IMAGE_NAME="myremoterepo/mybeamsdk"
export TAG="latest"
# Optional - pull the base image into your local Docker daemon to ensure
# you have the most up-to-date version of the base image locally.
docker pull "${BASE_IMAGE}"
docker build -f Dockerfile -t "${IMAGE_NAME}:${TAG}" .
```

3. If your runner runs remotely, retag the image if needed and [push](https://docs.docker.com/engine/reference/commandline/push/) it to your remote repository.
```
docker push "${IMAGE_NAME}:${TAG}"
```
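
The note in step 1 asks that the SDK version and the language runtime version in the image match the ones used to launch the pipeline. A minimal sketch of such a check follows. It assumes the image follows the naming pattern of the example above (`apache/beam_python<major>.<minor>_sdk:<version>`); the function name `check_image_matches` is hypothetical and is not part of the Beam SDK.

```python
import re

# Assumed naming pattern, taken from the image used in the example above:
#   apache/beam_python<major>.<minor>_sdk:<beam_version>
_IMAGE_RE = re.compile(r"^apache/beam_python(\d+)\.(\d+)_sdk:(\d+\.\d+\.\d+)$")


def check_image_matches(image, python_version, beam_version):
    """Return a list of mismatch messages; an empty list means the versions agree.

    `python_version` is a (major, minor, ...) tuple such as sys.version_info.
    `beam_version` is passed in explicitly so this sketch does not need
    apache_beam to be installed.
    """
    match = _IMAGE_RE.match(image)
    if match is None:
        return [f"unrecognized image name: {image}"]

    image_python = (int(match.group(1)), int(match.group(2)))
    image_beam = match.group(3)

    problems = []
    if image_python != tuple(python_version[:2]):
        problems.append(
            f"image uses Python {image_python[0]}.{image_python[1]}, "
            f"launcher uses Python {python_version[0]}.{python_version[1]}"
        )
    if image_beam != beam_version:
        problems.append(
            f"image uses Beam {image_beam}, launcher uses Beam {beam_version}"
        )
    return problems
```

For example, `check_image_matches("apache/beam_python3.7_sdk:2.35.0", sys.version_info, "2.35.0")` returns an empty list only when the launching interpreter is Python 3.7.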

## Running pipelines with custom container images {#running-pipelines}

@@ -45,6 +45,17 @@ If your pipeline uses public packages from the [Python Package Index](https://py
The runner will use the `requirements.txt` file to install your additional dependencies onto the remote workers.

**Important:** Remote workers will install all packages listed in the `requirements.txt` file. Because of this, it's very important that you delete non-PyPI packages from the `requirements.txt` file, as stated in step 2. If you don't remove non-PyPI packages, the remote workers will fail when attempting to install packages from sources that are unknown to them.

> **NOTE**: An alternative to `pip freeze` is to use a library like [pip-tools](https://github.com/jazzband/pip-tools) to compile the full set of dependencies required for the pipeline from a `--requirements_file` that lists only top-level dependencies.

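
As a rough illustration of removing non-PyPI entries from `pip freeze` output, the sketch below keeps only plain `name==version` lines and drops editable installs and direct URL or local-path references. This is a heuristic, not a Beam or pip feature: the function name `keep_pinned_requirements` is hypothetical, and a package installed from a private index still appears as `name==version`, so it has to be removed by hand.

```python
import re

# Matches plain pinned requirements such as "numpy==1.21.5".
_PINNED_RE = re.compile(r"^[A-Za-z0-9][A-Za-z0-9._-]*==\S+$")


def keep_pinned_requirements(freeze_lines):
    """Keep only 'name==version' lines from `pip freeze` output.

    Editable installs ("-e ..."), direct references ("name @ file:///...")
    and blank lines are dropped.
    """
    kept = []
    for line in freeze_lines:
        line = line.strip()
        if _PINNED_RE.match(line):
            kept.append(line)
    return kept
```

The result can be written back to `requirements.txt`, one entry per line, after reviewing it for private packages.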
## Custom Containers {#custom-containers}

Instead of using `requirements.txt`, you can pass a [container](https://hub.docker.com/search?q=apache%2Fbeam&type=image) image that already includes all the dependencies the pipeline needs. Follow the instructions on [how to run pipelines with custom container images](https://beam.apache.org/documentation/runtime/environments/#running-pipelines).

1. If you are using a custom container image, we recommend that you install the dependencies from the `--requirements_file` directly into your image at build time. In this case, you do not need to pass the `--requirements_file` option at runtime, which reduces the pipeline startup time.

# Add these lines with the path to the requirements.txt to the Dockerfile
COPY <path to requirements.txt> /tmp/requirements.txt
RUN python -m pip install -r /tmp/requirements.txt


## Local or non-PyPI Dependencies {#local-or-nonpypi}
@@ -53,7 +64,7 @@ If your pipeline uses packages that are not available publicly (e.g. packages th

1. Identify which packages are installed on your machine and are not public. Run the following command:

pip freeze
pip freeze

This command lists all packages that are installed on your machine, regardless of where they were installed from.

@@ -123,3 +134,28 @@ If your pipeline uses non-Python packages (e.g. packages that require installati
--setup_file /path/to/setup.py

**Note:** Because custom commands execute after the dependencies for your workflow are installed (by `pip`), you should omit the PyPI package dependency from the pipeline's `requirements.txt` file and from the `install_requires` parameter in the `setuptools.setup()` call of your `setup.py` file.

## Pre-building SDK container image

In pipeline execution modes where a Beam runner launches SDK workers in Docker containers, the additional pipeline dependencies (specified via `--requirements_file` and other runtime options) are installed into the containers at runtime. This can increase the worker startup time.
However, it may be possible to pre-build the SDK containers and perform the dependency installation once, before the workers start. To pre-build the container image before pipeline submission, provide the pipeline options described below.

1. Provide the container engine. Beam supports `local_docker` (requires a local installation of Docker) and `cloud_build` (requires a GCP project with the Cloud Build API enabled).

--prebuild_sdk_container_engine=<container_engine>

2. If you use the `local_docker` engine, provide the URL of the remote registry to which the image will be pushed:

--docker_registry_push_url=<remote_registry_url>
# Example: --docker_registry_push_url=<registry_name>/beam
# pre-built image will be pushed to the <registry_name>/beam/beam_python_prebuilt_sdk:<unique_image_tag>
# <unique_image_tag> tag is generated by Beam SDK.

**NOTE:** `docker_registry_push_url` must be a remote registry.
> The pre-building feature requires the Apache Beam SDK for Python, version 2.25.0 or later.
The container images created during prebuilding will persist beyond the pipeline runtime.
Once your job is finished or stopped, you can remove the pre-built image from the container registry.

>If your pipeline is using a custom container image, you will most likely not benefit from the pre-building step, because extra dependencies can be preinstalled in the custom image at build time. If you would still like to use pre-building with custom images, use Apache Beam SDK 2.38.0 or newer and supply your custom image via the `--sdk_container_image` pipeline option.

**NOTE**: This feature is available only for the `Dataflow Runner v2`.
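
The options above can be assembled programmatically before they are handed to the pipeline. The following is a minimal sketch that only builds the list of command-line flags named in this section; the helper `prebuild_flags` is hypothetical and is not part of the Beam SDK, and it does not check whether the registry URL is actually remote.

```python
# Engines named in the section above.
SUPPORTED_ENGINES = ("local_docker", "cloud_build")


def prebuild_flags(engine, registry_push_url=None, sdk_container_image=None):
    """Build the pre-building pipeline flags as a list of strings.

    Raises ValueError for an unknown engine, or when `local_docker` is
    selected without a registry URL to push the image to.
    """
    if engine not in SUPPORTED_ENGINES:
        raise ValueError(f"unsupported container engine: {engine}")

    flags = [f"--prebuild_sdk_container_engine={engine}"]

    if registry_push_url:
        flags.append(f"--docker_registry_push_url={registry_push_url}")
    elif engine == "local_docker":
        raise ValueError("local_docker requires a remote registry URL")

    # Optional: pre-build on top of a custom image (per the note above,
    # this needs Apache Beam SDK 2.38.0 or newer).
    if sdk_container_image:
        flags.append(f"--sdk_container_image={sdk_container_image}")

    return flags
```

The returned list can be appended to the other arguments used to launch the pipeline, for example alongside `--requirements_file`.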
