[BEAM-13314]Revise recommendations to manage Python pipeline dependencies. #16938
Changes from 8 commits
d00f744
68e851f
072ead7
422d9ba
f9ec6f4
98facb0
d2b55ba
c367ab3
f712244
758ee0a
860c4f0
9ad0ba9
Original file line number | Diff line number | Diff line change | ||
---|---|---|---|---|
|
@@ -42,11 +42,11 @@ For optimal user experience, we also recommend you use the latest released versi | |||
|
||||
### Building and pushing custom containers | ||||
|
||||
Beam [SDK container images](https://hub.docker.com/search?q=apache%2Fbeam&type=image) are built from Dockerfiles checked into the [Github](https://github.com/apache/beam) repository and published to Docker Hub for every release. You can build customized containers in one of two ways: | ||||
Beam [SDK container images](https://hub.docker.com/search?q=apache%2Fbeam&type=image) are built from Dockerfiles checked into the [Github](https://github.com/apache/beam) repository and published to Docker Hub for every release. You can build customized containers in one of three ways: | ||||
|
||||
1. **[Writing a new](#writing-new-dockerfiles) Dockerfile based on a released container image**. This is sufficient for simple additions to the image, such as adding artifacts or environment variables. | ||||
2. **[Modifying](#modifying-dockerfiles) a source Dockerfile in [Beam](https://github.com/apache/beam)**. This method requires building from Beam source but allows for greater customization of the container (including replacement of artifacts or base OS/language versions). | ||||
|
||||
3. **[Modifying](#modify-existing-base-image) an existing container image to make it compatible with Apache Beam Runners**. This method is used when users start from an existing image, and configure the image to be compatible with Apache Beam Runners. | ||||
#### Writing a new Dockerfile based on an existing published container image {#writing-new-dockerfiles} | ||||
|
||||
1. Create a new Dockerfile that designates a base image using the [FROM instruction](https://docs.docker.com/engine/reference/builder/#from). | ||||
|
@@ -171,6 +171,48 @@ creates a Java 8 SDK image with appropriate licenses in `/opt/apache/beam/third_ | |||
|
||||
By default, no licenses/notices are added to the docker images. | ||||
|
||||
#### Modifying an existing container image to make it compatible with Apache Beam Runners {#modify-existing-base-image} | ||||
Beam offers a way to provide your own custom container image. The easiest way to build a new custom image that is compatible with Apache Beam Runners is to use a [multi-stage build](https://docs.docker.com/develop/develop-images/multistage-build/) process. This copies over the necessary artifacts from a default Apache Beam base image to build your custom container image. | ||||
|
||||
1. Copy necessary artifacts from Apache Beam base image to your image. | ||||
``` | ||||
# This can be any container image, | ||||
FROM python:3.7-bullseye | ||||
|
||||
# Install SDK. (needed for Python SDK) | ||||
RUN pip install --no-cache-dir apache-beam[gcp]==2.35.0 | ||||
|
||||
# Copy files from official SDK image, including script/dependencies. | ||||
COPY --from=apache/beam_python3.7_sdk:2.35.0 /opt/apache/beam /opt/apache/beam | ||||
Review comment: I wonder if this is sufficient. IIRC, /opt/apache/beam only contains the boot program? All the base_image_requirements are in site_packages?
Review comment: they are in dist_packages
Review comment: the requirements will be installed at runtime, when we pip-install the staged apache beam sdk here: beam/sdks/python/container/piputil.go, Line 164 in d46bd07
|
||||
|
||||
# Perform any additional customizations if desired | ||||
|
||||
# Set the entrypoint to Apache Beam SDK launcher. | ||||
ENTRYPOINT ["/opt/apache/beam/boot"] | ||||
|
||||
``` | ||||
>**NOTE**: This example assumes necessary dependencies (in this case, Python 3.7 and pip) have been installed on the existing base image. Installing the Apache Beam SDK into the image will ensure that the image has the necessary SDK dependencies and reduce the worker startup time. | ||||
>The version specified in the `RUN` instruction must match the version used to launch the pipeline.<br> | ||||
tvalentyn marked this conversation as resolved.
|
||||
>**Users need to make sure that whatever base image they use has the same Python/Java interpreter version that they used to run the pipeline**. | ||||
Review comment: Consider: "Make sure that the Python or Java runtime version specified in the base image is the same as the version used to run the pipeline."
||||
|
||||
|
||||
2. [Build](https://docs.docker.com/engine/reference/commandline/build/) and [push](https://docs.docker.com/engine/reference/commandline/push/) the image using Docker. | ||||
``` | ||||
export BASE_IMAGE="apache/beam_python3.7_sdk:2.35.0" | ||||
export IMAGE_NAME="myremoterepo/mybeamsdk" | ||||
export TAG="latest" | ||||
|
||||
# Optional - pull the base image into your local Docker daemon to ensure | ||||
# you have the most up-to-date version of the base image locally. | ||||
docker pull "${BASE_IMAGE}" | ||||
|
||||
docker build -f Dockerfile -t "${IMAGE_NAME}:${TAG}" . | ||||
``` | ||||
|
||||
3. If your runner is running remotely, retag the image and [push](https://docs.docker.com/engine/reference/commandline/push/) the image to your repository. | ||||
``` | ||||
docker push "${IMAGE_NAME}:${TAG}" | ||||
``` | ||||
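
The note in step 1 requires the SDK version in the `RUN` instruction to match the version used to launch the pipeline. As a minimal sketch of how that could be checked before building, the helper below scans Dockerfile text for the two version pins used in the example above. The function names and the regular expressions are assumptions made for illustration; they are not part of Beam or Docker.

```python
import re

# Sample Dockerfile text, copied from the example in step 1.
SAMPLE_DOCKERFILE = """FROM python:3.7-bullseye
RUN pip install --no-cache-dir apache-beam[gcp]==2.35.0
COPY --from=apache/beam_python3.7_sdk:2.35.0 /opt/apache/beam /opt/apache/beam
ENTRYPOINT ["/opt/apache/beam/boot"]
"""


def beam_versions_in_dockerfile(dockerfile_text):
    """Return (pip_version, image_version); each is None if the pin is absent."""
    # Matches "apache-beam==X" and "apache-beam[extras]==X".
    pip_match = re.search(
        r"apache-beam(?:\[[^\]]*\])?==([0-9][\w.]*)", dockerfile_text)
    # Matches "--from=apache/beam_python<py>_sdk:X".
    image_match = re.search(
        r"--from=apache/beam_python[\d.]+_sdk:([0-9][\w.]*)", dockerfile_text)
    return (pip_match.group(1) if pip_match else None,
            image_match.group(1) if image_match else None)


def mismatched_pins(dockerfile_text, launch_version):
    """List the Dockerfile pins that differ from the launch SDK version."""
    pip_version, image_version = beam_versions_in_dockerfile(dockerfile_text)
    pins = (("pip install", pip_version), ("COPY --from", image_version))
    return [name for name, version in pins if version != launch_version]


if __name__ == "__main__":
    # Hypothetical launch version; in practice use the SDK version that
    # submits the pipeline.
    print(mismatched_pins(SAMPLE_DOCKERFILE, "2.35.0"))
```

An empty list means both pins agree with the launch version. Any other result names the instruction to fix before running `docker build`.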
|
||||
## Running pipelines with custom container images {#running-pipelines} | ||||
|
||||
|
Original file line number | Diff line number | Diff line change | ||||
---|---|---|---|---|---|---|
|
@@ -45,6 +45,16 @@ If your pipeline uses public packages from the [Python Package Index](https://py | |||||
The runner will use the `requirements.txt` file to install your additional dependencies onto the remote workers. | ||||||
|
||||||
**Important:** Remote workers will install all packages listed in the `requirements.txt` file. Because of this, it's very important that you delete non-PyPI packages from the `requirements.txt` file, as stated in step 2. If you don't remove non-PyPI packages, the remote workers will fail when attempting to install packages from sources that are unknown to them. | ||||||
> **NOTE**: An alternative to `pip freeze` is to use a library like [pip-tools](https://github.com/jazzband/pip-tools) to compile the all the dependencies required for the pipeline from a `--requirements_file`, where only top-level dependencies are mentioned. | ||||||
Review comment: "compile the all" -> "compile all"
||||||
## Custom Containers {#custom-containers} | ||||||
AnandInguva marked this conversation as resolved.
|
||||||
|
||||||
You can pass a [container](https://hub.docker.com/search?q=apache%2Fbeam&type=image) image with all the dependencies that are needed for the pipeline instead of `requirements.txt`. [Follow the instructions on how to run pipeline with Custom Container images](https://beam.apache.org/documentation/runtime/environments/#running-pipelines). | ||||||
|
||||||
1. If you are using a custom container image, we recommend that you install the dependencies from the `--requirements_file` directly into your image at build time. In this case, you do not need to pass the `--requirements_file` option at runtime, which reduces the pipeline startup time. | ||||||
|
||||||
# Add these lines with the path to the requirements.txt to the Dockerfile | ||||||
COPY <path to requirements.txt> /tmp/requirements.txt | ||||||
RUN python -m pip install -r /tmp/requirements.txt | ||||||
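
The recommendation above implies a simple rule for assembling pipeline options: pass `--sdk_container_image`, and pass `--requirements_file` only when the dependencies were not installed into the image. A minimal sketch of that rule follows. The helper name and its parameters are hypothetical, and the image name is the placeholder used elsewhere in this change.

```python
def container_pipeline_args(sdk_container_image,
                            requirements_file=None,
                            deps_baked_into_image=True):
    """Build the container-related pipeline arguments as a list of strings.

    Hypothetical helper: it only assembles flags and does not call Beam.
    """
    args = [f"--sdk_container_image={sdk_container_image}"]
    # Only ask the workers to install requirements at startup when the
    # dependencies are not already present in the image.
    if requirements_file and not deps_baked_into_image:
        args.append(f"--requirements_file={requirements_file}")
    return args


if __name__ == "__main__":
    print(container_pipeline_args("myremoterepo/mybeamsdk:latest",
                                  requirements_file="requirements.txt"))
```

The resulting list can be appended to the other arguments passed to the pipeline at launch.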
|
||||||
|
||||||
## Local or non-PyPI Dependencies {#local-or-nonpypi} | ||||||
|
@@ -53,7 +63,7 @@ If your pipeline uses packages that are not available publicly (e.g. packages th | |||||
|
||||||
1. Identify which packages are installed on your machine and are not public. Run the following command: | ||||||
|
||||||
pip freeze | ||||||
pip freeze | ||||||
|
||||||
This command lists all packages that are installed on your machine, regardless of where they were installed from. | ||||||
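
As a rough aid for reviewing the `pip freeze` output, the sketch below separates plain `name==version` pins from entries that point at a local path, a direct URL, or an editable install. This is a heuristic and an assumption of this example: a `name==version` pin does not prove that the package is on PyPI (it may come from a private index), so the first list still needs a manual check. The function name is hypothetical.

```python
def split_freeze_output(freeze_text):
    """Split `pip freeze` output into (pinned, needs_review) lists."""
    pinned, needs_review = [], []
    for raw_line in freeze_text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        # Editable installs ("-e ...") and direct references ("name @ url")
        # are not plain version pins, so flag them for review.
        if line.startswith("-e ") or " @ " in line or "==" not in line:
            needs_review.append(line)
        else:
            pinned.append(line)
    return pinned, needs_review


if __name__ == "__main__":
    # Hypothetical freeze output used only for illustration.
    sample = (
        "apache-beam==2.35.0\n"
        "mylib @ file:///home/me/mylib\n"
        "-e git+https://example.com/repo.git#egg=tool\n"
        "numpy==1.21.5\n"
    )
    pinned, needs_review = split_freeze_output(sample)
    print("pinned:", pinned)
    print("needs review:", needs_review)
```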
|
||||||
|
@@ -123,3 +133,20 @@ If your pipeline uses non-Python packages (e.g. packages that require installati | |||||
--setup_file /path/to/setup.py | ||||||
|
||||||
**Note:** Because custom commands execute after the dependencies for your workflow are installed (by `pip`), you should omit the PyPI package dependency from the pipeline's `requirements.txt` file and from the `install_requires` parameter in the `setuptools.setup()` call of your `setup.py` file. | ||||||
|
||||||
## Pre-building SDK container image | ||||||
tvalentyn marked this conversation as resolved.
|
||||||
|
||||||
In the pre-building step, we install pipeline dependencies on the container image prior to the job submission. This would speed up the pipeline execution.\ | ||||||
Review comment: Something is missing here. Let's add an introductory sentence. In pipeline execution modes where a Beam runner launches SDK workers in Docker containers, the additional pipeline dependencies (specified via |
||||||
To pre-build the dependencies from `requirements.txt` into the container image, follow the steps below. | ||||||
|
||||||
1. Provide the container engine. We support `local_docker` and `cloud_build` (requires a GCP project with the Cloud Build API enabled). | ||||||
|
||||||
|
||||||
--prebuild_sdk_container_engine <execution_environment> | ||||||
|
||||||
2. To use a custom base image for pre-building dependencies, pass the following flag. Otherwise, Apache Beam's base image is used. | ||||||
Review comment: Note that this may not work on an arbitrary base image. The base image should follow the same contract to install dependencies in a setup_only mode as Apache Beam's base image: https://github.com/apache/beam/blob/master/sdks/python/container/boot.go#L49
Review comment: +1 to the point 3
Review comment: As @y1chi pointed out, it may not work if the user doesn't follow Apache Beam's contract. But we do instruct them to follow the contract in some way. So, I assume we can introduce this section as part of the instruction?
Review comment: I would remove #2 now that we don't need a special flag and use the standard --sdk_container_image flag for this purpose.
Review comment: I feel like we can keep 2 and update the pipeline option to |
||||||
|
||||||
--sdk_container_image <location_to_base_image> | ||||||
3. To push the container image, pre-built locally with `local_docker`, to a remote repository (e.g. a Docker registry), provide the URL of the remote registry by passing | ||||||
Review comment: If using |
||||||
|
||||||
--docker_registry_push_url <IMAGE_URL> | ||||||
Review comment: I am confused - what is a sample value of this param? Is it supposed to be the image name+tag or just the registry?
Review comment: It should be a registry. We generate the image tag. Image name is coded as May be it can worded as [1]
Review comment: Can we give an example of the expected value? As a user reading this doc it is still not obvious.
Review comment: We can add an example. Also let me see if I can make the wording simpler.
||||||
> To use Docker, the `--sdk_container_image` should be compatible with Apache Beam Runner. Please follow the [instructions](https://beam.apache.org/documentation/runtime/environments/#building-and-pushing-custom-containers) on how to build a base container image compatible with Apache Beam. | ||||||
|
||||||
Review comment: Suggestion to add to the notes: The pre-building feature requires the Apache Beam SDK for Python, version 2.25.0 or later. The container images created during prebuilding will persist beyond the pipeline runtime. If your pipeline is using a custom container image, most likely you will not benefit from prebuilding step as extra dependencies can be preinstalled in the custom image at build time. If you still would like to use prebuilding with custom images, use Apache Beam SDK 2.38.0 or newer and supply your custom image in via the |
||||||
**NOTE**: This feature is available only for the `DataflowRunner`. |
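
To make the three steps above concrete, here is a minimal sketch that assembles the pre-building flags into an argument list. The flag names and the two engine values come from the steps above. The helper name, its parameters, and the registry value are hypothetical placeholders, and the helper only builds strings without invoking Beam.

```python
SUPPORTED_ENGINES = ("local_docker", "cloud_build")


def prebuild_args(engine, registry_push_url=None, base_image=None):
    """Return the pre-building pipeline flags as a list of strings."""
    if engine not in SUPPORTED_ENGINES:
        raise ValueError(
            f"Unsupported engine {engine!r}; expected one of {SUPPORTED_ENGINES}")
    args = [f"--prebuild_sdk_container_engine={engine}"]
    if base_image:
        # Optional: custom base image, which must be compatible with Beam.
        args.append(f"--sdk_container_image={base_image}")
    if registry_push_url:
        # Optional: where to push an image pre-built with local_docker.
        args.append(f"--docker_registry_push_url={registry_push_url}")
    return args


if __name__ == "__main__":
    # "registry.example.com/my-team" is a made-up registry location.
    print(prebuild_args("local_docker",
                        registry_push_url="registry.example.com/my-team"))
```

Rejecting unknown engine names early gives a clearer error than a failure that would otherwise surface only at job submission.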
Review comment: Is this multi-stage build?
Review comment: Yes. Using multi stage build process to copy required artifacts from Apache Beam's base image to the provided custom image