[Bug]: The --impersonate_service_account pipeline option may be accidentally used at runtime in Python BigQuery IO.
#32030
What happened?
The `--impersonate_service_account` option allows a principal to submit Dataflow jobs on behalf of another account in a delegation chain. This option is used when creating a GCP credential at job submission:
beam/sdks/python/apache_beam/internal/gcp/auth.py
Line 184 in 121ac71
beam/sdks/python/apache_beam/runners/dataflow/internal/apiclient.py
Lines 278 to 285 in 121ac71
However, some Beam IOs might store a copy of the pipeline options, which contain the impersonation credential. When a pipeline with such IOs is serialized into the Runner API representation and the IO DoFns are deserialized on the runner, we might accidentally capture the `--impersonate_service_account` pipeline option and incorrectly use it at runtime. When this happens, the worker logs might contain a line such as 'Impersonating <... service account name ...>' at runtime. Such logs should only appear at job submission.
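The capture described above can be illustrated with a simplified sketch that does not use Beam at all. `FakeOptions`, `CapturingFn`, and the service account name are hypothetical stand-ins, and JSON stands in for the Runner API serialization; the point is only that an option stored on an object at submission time travels with it to the worker.

```python
import json


class FakeOptions:
    """Hypothetical stand-in for pipeline options (not a Beam class)."""

    def __init__(self, impersonate_service_account=None):
        self.impersonate_service_account = impersonate_service_account


class CapturingFn:
    """Hypothetical DoFn-like object that stores a copy of the options."""

    def __init__(self, options):
        self.options = options

    def to_payload(self):
        # Stands in for serialization at job submission.
        return json.dumps(vars(self.options))

    @classmethod
    def from_payload(cls, payload):
        # Stands in for deserialization on the worker.
        return cls(FakeOptions(**json.loads(payload)))

    def credential_source(self):
        # What a worker-side client would choose based on the captured options.
        target = self.options.impersonate_service_account
        if target:
            return f"impersonating {target}"
        return "worker default credentials"


# Submission side: the launcher was started with an impersonation target.
submit_fn = CapturingFn(
    FakeOptions("launcher-target@example.iam.gserviceaccount.com"))

# Worker side: the option set at submission is still present after the round trip.
worker_fn = CapturingFn.from_payload(submit_fn.to_payload())
print(worker_fn.credential_source())
```

In this sketch the worker-side object reports impersonation only because the submission-time option was stored on it before serialization; an object built from options without that field falls back to the worker's own credentials.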
Note that creating an impersonated credential might be necessary at job submission for purposes other than submitting a Dataflow job, for example to run pre-submission validations of GCP resources. To this end, PR #26662 created a different mechanism to authenticate a BQ client. Unfortunately, it inadvertently caused pipelines using BQ IO to incorrectly execute the impersonation flow at runtime, starting from Apache Beam Python SDK 2.49.0.
Workaround
To work around the issue until Beam 2.59.0 becomes available, add the following code to the beginning of the pipeline launcher:
Issue Priority
Priority: 2 (default / most bugs should be filed as P2)
Issue Components