[BEAM-13314]Revise recommendations to manage Python pipeline dependencies. #16938
Changes from 11 commits
d00f744
68e851f
072ead7
422d9ba
f9ec6f4
98facb0
d2b55ba
c367ab3
f712244
758ee0a
860c4f0
9ad0ba9
Original file line number | Diff line number | Diff line change | ||
---|---|---|---|---|
|
@@ -42,11 +42,11 @@ For optimal user experience, we also recommend you use the latest released versi | |||
|
||||
### Building and pushing custom containers | ||||
|
||||
Beam [SDK container images](https://hub.docker.com/search?q=apache%2Fbeam&type=image) are built from Dockerfiles checked into the [Github](https://github.com/apache/beam) repository and published to Docker Hub for every release. You can build customized containers in one of two ways: | ||||
Beam [SDK container images](https://hub.docker.com/search?q=apache%2Fbeam&type=image) are built from Dockerfiles checked into the [Github](https://github.com/apache/beam) repository and published to Docker Hub for every release. You can build customized containers in one of three ways: | ||||
|
||||
1. **[Writing a new](#writing-new-dockerfiles) Dockerfile based on a released container image**. This is sufficient for simple additions to the image, such as adding artifacts or environment variables. | ||||
2. **[Modifying](#modifying-dockerfiles) a source Dockerfile in [Beam](https://github.com/apache/beam)**. This method requires building from Beam source but allows for greater customization of the container (including replacement of artifacts or base OS/language versions). | ||||
|
||||
3. **[Modifying](#modify-existing-base-image) an existing container image to make it compatible with Apache Beam Runners**. Use this method when you start from an existing image and need to configure it to be compatible with Apache Beam Runners. | ||||
#### Writing a new Dockerfile based on an existing published container image {#writing-new-dockerfiles} | ||||
|
||||
1. Create a new Dockerfile that designates a base image using the [FROM instruction](https://docs.docker.com/engine/reference/builder/#from). | ||||
|
@@ -171,6 +171,48 @@ creates a Java 8 SDK image with appropriate licenses in `/opt/apache/beam/third_ | |||
|
||||
By default, no licenses/notices are added to the docker images. | ||||
|
||||
#### Modifying an existing container image to make it compatible with Apache Beam Runners {#modify-existing-base-image} | ||||
Beam offers a way to provide your own custom container image. The easiest way to build a new custom image that is compatible with Apache Beam Runners is to use a [multi-stage build](https://docs.docker.com/develop/develop-images/multistage-build/) process. This copies over the necessary artifacts from a default Apache Beam base image to build your custom container image. | ||||
|
||||
1. Copy necessary artifacts from Apache Beam base image to your image. | ||||
``` | ||||
# This can be any container image. | ||||
FROM python:3.7-bullseye | ||||
|
||||
# Install SDK. (needed for Python SDK) | ||||
RUN pip install --no-cache-dir apache-beam[gcp]==2.35.0 | ||||
|
||||
# Copy files from official SDK image, including script/dependencies. | ||||
COPY --from=apache/beam_python3.7_sdk:2.35.0 /opt/apache/beam /opt/apache/beam | ||||
Review comment: I wonder if this is sufficient. IIRC, /opt/apache/beam only contains the boot program? All the base_image_requirements are in site_packages?
Review comment: They are in dist_packages.
Review comment: The requirements will be installed at runtime, when we pip-install the staged Apache Beam SDK here: beam/sdks/python/container/piputil.go, line 164 in d46bd07
|
||||
|
||||
# Perform any additional customizations if desired | ||||
|
||||
# Set the entrypoint to Apache Beam SDK launcher. | ||||
ENTRYPOINT ["/opt/apache/beam/boot"] | ||||
|
||||
``` | ||||
>**NOTE**: This example assumes necessary dependencies (in this case, Python 3.7 and pip) have been installed on the existing base image. Installing the Apache Beam SDK into the image will ensure that the image has the necessary SDK dependencies and reduce the worker startup time. | ||||
>The version specified in the `RUN` instruction must match the version used to launch the pipeline.<br> | ||||
tvalentyn marked this conversation as resolved.
|
||||
>**Make sure that the Python or Java runtime version specified in the base image is the same as the version used to run the pipeline.** | ||||
|
||||
|
||||
2. [Build](https://docs.docker.com/engine/reference/commandline/build/) and [push](https://docs.docker.com/engine/reference/commandline/push/) the image using Docker. | ||||
``` | ||||
export BASE_IMAGE="apache/beam_python3.7_sdk:2.35.0" | ||||
export IMAGE_NAME="myremoterepo/mybeamsdk" | ||||
export TAG="latest" | ||||
|
||||
# Optional - pull the base image into your local Docker daemon to ensure | ||||
# you have the most up-to-date version of the base image locally. | ||||
docker pull "${BASE_IMAGE}" | ||||
|
||||
docker build -f Dockerfile -t "${IMAGE_NAME}:${TAG}" . | ||||
``` | ||||
|
||||
3. If your runner is running remotely, retag the image and [push](https://docs.docker.com/engine/reference/commandline/push/) the image to your repository. | ||||
``` | ||||
docker push "${IMAGE_NAME}:${TAG}" | ||||
``` | ||||
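
The version-matching notes in the steps above can be checked before launching a pipeline. The following is a minimal sketch, not part of Beam: `parse_sdk_image` and `image_mismatches` are hypothetical helper names, and the sketch assumes the image follows the `apache/beam_python<major.minor>_sdk:<version>` naming used in the examples above. Custom image names will not match this pattern and would need their own check.

```python
import re

# Assumption: the image name follows the official naming pattern shown above,
# e.g. "apache/beam_python3.7_sdk:2.35.0".
_IMAGE_PATTERN = re.compile(r"^apache/beam_python(\d+\.\d+)_sdk:(\S+)$")


def parse_sdk_image(image):
    """Return (python_version, beam_version) parsed from an SDK image name."""
    match = _IMAGE_PATTERN.match(image)
    if match is None:
        raise ValueError(f"unrecognized SDK image name: {image!r}")
    return match.group(1), match.group(2)


def image_mismatches(image, python_version, beam_version):
    """List the ways the image disagrees with the launch environment."""
    image_python, image_beam = parse_sdk_image(image)
    problems = []
    if image_python != python_version:
        problems.append(
            f"Python {python_version} launches the pipeline, "
            f"but the image provides Python {image_python}"
        )
    if image_beam != beam_version:
        problems.append(
            f"Beam {beam_version} launches the pipeline, "
            f"but the image provides Beam {image_beam}"
        )
    return problems


# Hypothetical launch environment: Python 3.8 with Beam 2.35.0.
matching = image_mismatches("apache/beam_python3.7_sdk:2.35.0", "3.7", "2.35.0")
mismatched = image_mismatches("apache/beam_python3.7_sdk:2.35.0", "3.8", "2.35.0")
print(matching)
print(mismatched)
```

In a real launcher script, the two version strings could be taken from `sys.version_info` and `apache_beam.__version__` rather than hard-coded.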
|
||||
## Running pipelines with custom container images {#running-pipelines} | ||||
|
||||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -45,6 +45,17 @@ If your pipeline uses public packages from the [Python Package Index](https://py | |
The runner will use the `requirements.txt` file to install your additional dependencies onto the remote workers. | ||
|
||
**Important:** Remote workers will install all packages listed in the `requirements.txt` file. Because of this, it's very important that you delete non-PyPI packages from the `requirements.txt` file, as stated in step 2. If you don't remove non-PyPI packages, the remote workers will fail when attempting to install packages from sources that are unknown to them. | ||
> **NOTE**: An alternative to `pip freeze` is to use a library like [pip-tools](https://github.com/jazzband/pip-tools) to compile all the dependencies required for the pipeline from a `--requirements_file`, where only top-level dependencies are mentioned. | ||
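
As an illustration of the cleanup described in the **Important** note above, the following minimal sketch separates plain `name==version` lines in `pip freeze` output from lines that point at a VCS URL or a local path. `split_frozen_requirements` is a hypothetical helper, not part of Beam or pip. The heuristic is an assumption: a private package can also freeze as `name==version`, so the pinned list still needs a manual review.

```python
import re

# Matches plain pinned lines such as "numpy==1.21.5" (optionally with extras).
_PINNED = re.compile(r"^[A-Za-z0-9][A-Za-z0-9._-]*(\[[^\]]+\])?==[^\s;]+$")


def split_frozen_requirements(lines):
    """Split `pip freeze` lines into (pinned, needs_review)."""
    pinned, needs_review = [], []
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        if _PINNED.match(line):
            pinned.append(line)
        else:
            # For example "-e git+https://..." or "pkg @ file:///...".
            needs_review.append(line)
    return pinned, needs_review


# Hypothetical `pip freeze` output used only for illustration.
frozen = [
    "apache-beam==2.35.0",
    "numpy==1.21.5",
    "-e git+https://example.com/team/internal-lib.git#egg=internal_lib",
    "my-local-pkg @ file:///home/me/my-local-pkg",
]
pinned, needs_review = split_frozen_requirements(frozen)
print(pinned)
print(needs_review)
```

Only the `pinned` lines would be candidates for `requirements.txt`; the `needs_review` lines are the ones a remote worker is unlikely to be able to install.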
|
||
## Custom Containers {#custom-containers} | ||
AnandInguva marked this conversation as resolved.
|
||
|
||
You can pass a [container](https://hub.docker.com/search?q=apache%2Fbeam&type=image) image with all the dependencies that are needed for the pipeline instead of `requirements.txt`. [Follow the instructions on how to run pipelines with custom container images](https://beam.apache.org/documentation/runtime/environments/#running-pipelines). | ||
|
||
1. If you are using a custom container image, we recommend that you install the dependencies from the `--requirements_file` directly into your image at build time. In this case, you do not need to pass the `--requirements_file` option at runtime, which reduces the pipeline startup time. | ||
|
||
# Add these lines with the path to the requirements.txt to the Dockerfile | ||
COPY <path to requirements.txt> /tmp/requirements.txt | ||
RUN python -m pip install -r /tmp/requirements.txt | ||
|
||
|
||
## Local or non-PyPI Dependencies {#local-or-nonpypi} | ||
|
@@ -53,7 +64,7 @@ If your pipeline uses packages that are not available publicly (e.g. packages th | |
|
||
1. Identify which packages are installed on your machine and are not public. Run the following command: | ||
|
||
pip freeze | ||
pip freeze | ||
|
||
This command lists all packages that are installed on your machine, regardless of where they were installed from. | ||
|
||
|
@@ -123,3 +134,25 @@ If your pipeline uses non-Python packages (e.g. packages that require installati | |
--setup_file /path/to/setup.py | ||
|
||
**Note:** Because custom commands execute after the dependencies for your workflow are installed (by `pip`), you should omit the PyPI package dependency from the pipeline's `requirements.txt` file and from the `install_requires` parameter in the `setuptools.setup()` call of your `setup.py` file. | ||
|
||
## Pre-building SDK container image | ||
tvalentyn marked this conversation as resolved.
|
||
|
||
In pipeline execution modes where a Beam runner launches SDK workers in Docker containers, the additional pipeline dependencies (specified via `--requirements_file` and other runtime options) are installed into the containers at runtime. This can increase the worker startup time. | ||
However, it may be possible to pre-build the SDK containers and perform the dependency installation once before the workers start. To pre-build the container image before pipeline submission, provide the pipeline options mentioned below. | ||
1. Provide the container engine. Beam supports `local_docker` (requires a local installation of Docker) and `cloud_build` (requires a GCP project with the Cloud Build API enabled). | ||
|
||
--prebuild_sdk_container_engine=<container_engine> | ||
2. To pass a base image for pre-building dependencies, provide `--sdk_container_image`. Otherwise, Apache Beam's base [image](https://hub.docker.com/search?q=apache%2Fbeam&type=image) is used. | ||
Review comment: As discussed offline, let's remove this and line 156.
||
|
||
--sdk_container_image=<location_to_base_image> | ||
3. If using the `local_docker` engine, pass the URL of the remote registry to which the image will be pushed: | ||
|
||
--docker_registry_push_url=<remote_registry_url> | ||
# Example: --docker_registry_push_url=<registry_name>/beam | ||
tvalentyn marked this conversation as resolved.
|
||
# pre-built image will be pushed to the <registry_name>/beam/beam_python_prebuilt_sdk:<unique_image_tag> | ||
# <unique_image_tag> tag is generated by Beam SDK. | ||
|
||
**NOTE:** `docker_registry_push_url` must be a remote registry. | ||
Review comment: @y1chi if the user uses pre-building and doesn't provide I recall it would fail with error something like this
||
> To use Docker, the `--sdk_container_image` should be compatible with Apache Beam Runner. Please follow the [instructions](https://beam.apache.org/documentation/runtime/environments/#building-and-pushing-custom-containers) on how to build a base container image compatible with Apache Beam. | ||
|
||
Review comment: Suggestion to add to the notes: The pre-building feature requires the Apache Beam SDK for Python, version 2.25.0 or later. The container images created during prebuilding will persist beyond the pipeline runtime. If your pipeline is using a custom container image, most likely you will not benefit from prebuilding step as extra dependencies can be preinstalled in the custom image at build time. If you still would like to use prebuilding with custom images, use Apache Beam SDK 2.38.0 or newer and supply your custom image in via the
||
**NOTE**: This feature is available only for the `Dataflow Runner v2`. |
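
The pre-building options listed above can be assembled programmatically. The following is a minimal sketch under stated assumptions: `prebuild_flags` is a hypothetical helper, not a Beam API, and treating a missing registry URL as an error for `local_docker` is an assumption based on step 3 above. The flag names themselves come from the text.

```python
# Engines named in the text above.
SUPPORTED_ENGINES = ("local_docker", "cloud_build")


def prebuild_flags(engine, registry_url=None, base_image=None):
    """Build the pipeline option flags for pre-building an SDK container."""
    if engine not in SUPPORTED_ENGINES:
        raise ValueError(
            f"unsupported engine {engine!r}; expected one of {SUPPORTED_ENGINES}"
        )
    flags = [f"--prebuild_sdk_container_engine={engine}"]
    if engine == "local_docker":
        # Assumption: local_docker needs a remote registry to push to.
        if not registry_url:
            raise ValueError("local_docker requires --docker_registry_push_url")
        flags.append(f"--docker_registry_push_url={registry_url}")
    if base_image:
        flags.append(f"--sdk_container_image={base_image}")
    return flags


# Hypothetical registry name used only for illustration.
flags = prebuild_flags("local_docker", registry_url="myregistry.example.com/beam")
print(flags)
```

The resulting list could be appended to the other command-line arguments of the pipeline, or passed to `PipelineOptions(flags)` from `apache_beam.options.pipeline_options`.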
Review comment: Is this multi-stage build?
Review comment: Yes. Using multi stage build process to copy required artifacts from Apache Beam's base image to the provided custom image