
[Failing Test]: Python ARM PostCommit failing after #28385 #29076

Closed
Abacn opened this issue Oct 19, 2023 · 9 comments
Labels: awaiting triage, bug, done & done (Issue has been reviewed after it was closed for verification, followups, etc.), failing test, P1, permared, python, tests

Comments

@Abacn (Contributor) commented Oct 19, 2023

What happened?

From #28385 (comment)

The Python ARM PostCommit started failing after this change:

Last passing run: https://github.com/apache/beam/actions/runs/6499261265 at 223dded (before this the suite was flaky on some Python versions, but at least one Python version could succeed).

First failing run: https://github.com/apache/beam/actions/runs/6502267792/job/17661039890 at 0586161.

Four PRs were merged between those runs, and this is the only Python change among them.

The error message is:

2023-10-13T04:08:33.8802524Z  WARNING. apache_beam.runners.dataflow.dataflow_runner:dataflow_runner.py:202
2023-10-13T03:48:37.128Z: JOB_MESSAGE_WARNING: A worker was unable to start up.
Error: Unable to pull container image due to error: image pull request failed with error:
Error response from daemon: manifest for us.gcr.io/apache-beam-testing/github-actions/beam_python3.9_sdk:20231013-001134756942144 not found: manifest unknown:
Failed to fetch "20231013-001134756942144" from request "/v2/apache-beam-testing/github-actions/beam_python3.9_sdk/manifests/20231013-001134756942144"..
This is likely due to an invalid SDK container image URL.
Please verify any provided SDK container image is valid and that Dataflow workers have permissions to pull image.

2023-10-13T04:08:33.8845726Z  ERROR apache_beam.runners.dataflow.dataflow_runner:dataflow_runner.py:770
Console URL: https://console.cloud.google.com/dataflow/jobs/us-central1/2023-10-12_20_46_29-13701384909164481751?project=apache-beam-testing
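
For context, the Dataflow worker pulls whatever SDK container image the pipeline options point it at, and the error above means the tag the workflow expected was never published to the registry. A minimal sketch of how a pipeline points Dataflow at a custom SDK container (the project, bucket, and tag below are placeholders, not the values the postcommit actually uses):

    # Minimal sketch: pointing a Dataflow pipeline at a custom SDK container image.
    # Project, bucket, and image tag are placeholders; the ARM postcommit builds and
    # pushes its own tag, and the error above means that tag did not exist.
    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions

    options = PipelineOptions(
        runner="DataflowRunner",
        project="my-gcp-project",            # placeholder
        region="us-central1",
        temp_location="gs://my-bucket/tmp",  # placeholder
        sdk_container_image="us.gcr.io/my-project/beam_python3.9_sdk:my-tag",
    )

    with beam.Pipeline(options=options) as p:
        _ = p | beam.Create([1, 2, 3]) | beam.Map(print)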

Another Python ARM test, https://github.com/apache/beam/actions/workflows/beam_Python_ValidatesContainer_Dataflow_ARM.yml, is healthy though.

Issue Failure

Failure: Test is continually failing

Issue Priority

Priority: 1 (unhealthy code / failing or flaky postcommit so we cannot be sure the product is healthy)

Issue Components

  • Component: Python SDK
  • Component: Java SDK
  • Component: Go SDK
  • Component: Typescript SDK
  • Component: IO connector
  • Component: Beam YAML
  • Component: Beam examples
  • Component: Beam playground
  • Component: Beam katas
  • Component: Website
  • Component: Spark Runner
  • Component: Flink Runner
  • Component: Samza Runner
  • Component: Twister2 Runner
  • Component: Hazelcast Jet Runner
  • Component: Google Cloud Dataflow Runner
@AnandInguva (Contributor) commented:

#29022 - tests are failing during staging, where the stager downloads the packages listed in `requirements.txt`. This is happening because of a pip upgrade: with the newer pip, the SHAs of the cached packages no longer match the expected SHAs.
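
Roughly what that means (an illustrative sketch, not Beam's actual stager code): the staging step keeps downloaded packages in a local cache and verifies each artifact's SHA-256 against an expected hash, so a pip upgrade that resolves a different artifact for the same pin produces a mismatch:

    # Illustrative sketch only (not Beam's stager code): how a SHA-256 mismatch
    # between a cached package file and the expected hash surfaces.
    import hashlib

    def sha256_of(path):
        digest = hashlib.sha256()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(1 << 20), b""):
                digest.update(chunk)
        return digest.hexdigest()

    def check_cached_package(path, expected_sha256):
        actual = sha256_of(path)
        if actual != expected_sha256:
            # A newer pip can resolve a different artifact (e.g. another wheel)
            # for the same requirement pin, so the cached file's hash no longer
            # matches the expected one and staging fails.
            raise ValueError(
                "Hash mismatch for %s: expected %s, got %s"
                % (path, expected_sha256, actual))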

Also, there are BigQuery tests failing due to a permission error.

[screenshot]

@AnandInguva (Contributor) commented Oct 19, 2023

https://github.com/apache/beam/actions/runs/6576849694 - that run shows the permissions/IAM errors for the BigQuery tests are resolved, but there are still some errors that need investigation.

[screenshots]

@AnandInguva (Contributor) commented Oct 19, 2023

cc: @Abacn @ahmedabu98 This is the failing test I found in https://ge.apache.org/s/lanaestwlqgxk/console-log/raw: FAILED apache_beam/io/gcp/bigquery_test.py::PubSubBigQueryIT::test_streaming_inserts - google.api_core.exceptions.DeadlineExceeded: 504 Deadline Exceeded. Can you help debug it?
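
For reference, google.api_core.exceptions.DeadlineExceeded means the client-side RPC deadline expired before BigQuery responded. A hedged sketch of how such a transient 504 can be retried on the client side (illustrative only; the query and the retry settings are assumptions, not what the test currently does):

    # Illustrative sketch: retrying a call that intermittently raises
    # DeadlineExceeded (HTTP 504) with google-api-core's retry helper.
    # The query and the retry/timeout values are placeholders.
    from google.api_core import exceptions, retry
    from google.cloud import bigquery

    retry_on_deadline = retry.Retry(
        predicate=retry.if_exception_type(exceptions.DeadlineExceeded),
        initial=1.0,
        maximum=30.0,
        timeout=300.0,
    )

    client = bigquery.Client()

    @retry_on_deadline
    def run_probe_query():
        return list(client.query("SELECT 1").result(timeout=60))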

@Abacn (Contributor, Author) commented Oct 19, 2023

@AnandInguva thanks, taking a look

@Abacn (Contributor, Author) commented Oct 20, 2023

The latest run is back to a "flaky" rather than "permared" state: https://github.com/apache/beam/actions/runs/6590602266/job/17907553487

BigtableIOWriteTest::test_bigtable_write is failing in 3 out of 4 Python versions due to pipeline timeout (>1h), which means the pipeline is stuck.
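
For orientation, the test exercises Beam's Bigtable sink; a minimal sketch of that kind of write pipeline (the IDs are placeholders, and this is not the test's exact code) looks like:

    # Minimal sketch of a Bigtable write pipeline of the kind the test exercises.
    # Project/instance/table IDs are placeholders; this is not the test's code.
    import apache_beam as beam
    from apache_beam.io.gcp.bigtableio import WriteToBigTable
    from google.cloud.bigtable import row as bt_row

    def to_direct_row(i):
        r = bt_row.DirectRow(row_key=b"key-%d" % i)
        r.set_cell("cf1", b"col", b"value-%d" % i)
        return r

    with beam.Pipeline() as p:
        _ = (
            p
            | beam.Create(range(100))
            | beam.Map(to_direct_row)
            | WriteToBigTable(
                project_id="my-project",    # placeholder
                instance_id="my-instance",  # placeholder
                table_id="my-table"))       # placeholder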

@AnandInguva (Contributor) commented:

Do you know why they are flaky? I can take a look early Monday and work on any solution needed. Do we keep this issue open until they are fully green?

@Abacn (Contributor, Author) commented Oct 20, 2023

There is a known Bigtable client issue, #28715, that is likely being hit here (internal tracker 302688125).

@Abacn (Contributor, Author) commented Oct 30, 2023

Until the client issue is resolved, let me see whether, as a workaround, we can fail the bundle when it gets stuck so that the runner retries it.
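
A rough sketch of that workaround pattern, assuming a per-element deadline enforced inside the DoFn (the helper names and the 10-minute limit are illustrative assumptions, not the actual fix):

    # Illustrative sketch of the "fail the bundle so the runner retries" idea:
    # run the potentially stuck write on a helper thread with a deadline and
    # raise if it exceeds it. Names and the timeout value are assumptions.
    import concurrent.futures

    import apache_beam as beam

    class WriteWithDeadlineDoFn(beam.DoFn):
        def __init__(self, write_fn, timeout_s=600):
            self._write_fn = write_fn
            self._timeout_s = timeout_s

        def setup(self):
            self._executor = concurrent.futures.ThreadPoolExecutor(max_workers=1)

        def process(self, element):
            future = self._executor.submit(self._write_fn, element)
            # A TimeoutError raised here fails the bundle, so the runner retries
            # it instead of letting a stuck client call hang the pipeline.
            future.result(timeout=self._timeout_s)
            yield element

        def teardown(self):
            self._executor.shutdown(wait=False)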

@tvalentyn (Contributor) commented:

Looks like it's resolved: https://github.com/apache/beam/actions/runs/6862673357

github-actions bot added this to the 2.53.0 Release milestone on Nov 14, 2023
damccorm added the done & done label (Issue has been reviewed after it was closed for verification, followups, etc.) on Nov 20, 2023