
The PostCommit Python job is flaky #30513

Open

github-actions bot opened this issue Mar 5, 2024 · 39 comments · Fixed by #32171 or #32382

Comments

@github-actions
Contributor

github-actions bot commented Mar 5, 2024

The PostCommit Python workflow is failing over 50% of the time
Please visit https://github.com/apache/beam/actions/workflows/beam_PostCommit_Python.yml?query=is%3Afailure+branch%3Amaster to see the logs.

@shunping
Contributor

It first failed on https://github.com/apache/beam/actions/runs/8210266873.

The failed task is :sdks:python:test-suites:portable:py38:portableWordCountSparkRunnerBatch.

Traceback:

INFO:apache_beam.utils.subprocess_server:Starting service with ('java' '-jar' '/runner/_work/beam/beam/runners/spark/3/job-server/build/libs/beam-runners-spark-3-job-server-2.56.0-SNAPSHOT.jar' '--spark-master-url' 'local[4]' '--artifacts-dir' '/tmp/beam-temp8q8022zi/artifactsg6e8usou' '--job-port' '56313' '--artifact-port' '0' '--expansion-port' '0')
INFO:apache_beam.utils.subprocess_server:Error: A JNI error has occurred, please check your installation and try again
INFO:apache_beam.utils.subprocess_server:Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/beam/vendor/grpc/v1p60p1/io/grpc/BindableService
INFO:apache_beam.utils.subprocess_server:	at java.lang.ClassLoader.defineClass1(Native Method)
INFO:apache_beam.utils.subprocess_server:	at java.lang.ClassLoader.defineClass(ClassLoader.java:757)
INFO:apache_beam.utils.subprocess_server:	at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
INFO:apache_beam.utils.subprocess_server:	at java.net.URLClassLoader.defineClass(URLClassLoader.java:473)
INFO:apache_beam.utils.subprocess_server:	at java.net.URLClassLoader.access$100(URLClassLoader.java:74)
INFO:apache_beam.utils.subprocess_server:	at java.net.URLClassLoader$1.run(URLClassLoader.java:369)
INFO:apache_beam.utils.subprocess_server:	at java.net.URLClassLoader$1.run(URLClassLoader.java:363)
INFO:apache_beam.utils.subprocess_server:	at java.security.AccessController.doPrivileged(Native Method)
INFO:apache_beam.utils.subprocess_server:	at java.net.URLClassLoader.findClass(URLClassLoader.java:362)
INFO:apache_beam.utils.subprocess_server:	at java.lang.ClassLoader.loadClass(ClassLoader.java:419)
INFO:apache_beam.utils.subprocess_server:	at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:352)
INFO:apache_beam.utils.subprocess_server:	at java.lang.ClassLoader.loadClass(ClassLoader.java:352)
INFO:apache_beam.utils.subprocess_server:	at java.lang.Class.getDeclaredMethods0(Native Method)
INFO:apache_beam.utils.subprocess_server:	at java.lang.Class.privateGetDeclaredMethods(Class.java:2701)
INFO:apache_beam.utils.subprocess_server:	at java.lang.Class.privateGetMethodRecursive(Class.java:3048)
INFO:apache_beam.utils.subprocess_server:	at java.lang.Class.getMethod0(Class.java:3018)
INFO:apache_beam.utils.subprocess_server:	at java.lang.Class.getMethod(Class.java:1784)
INFO:apache_beam.utils.subprocess_server:	at sun.launcher.LauncherHelper.validateMainClass(LauncherHelper.java:670)
INFO:apache_beam.utils.subprocess_server:	at sun.launcher.LauncherHelper.checkAndLoadMain(LauncherHelper.java:652)
INFO:apache_beam.utils.subprocess_server:Caused by: java.lang.ClassNotFoundException: org.apache.beam.vendor.grpc.v1p60p1.io.grpc.BindableService
INFO:apache_beam.utils.subprocess_server:	at java.net.URLClassLoader.findClass(URLClassLoader.java:387)
INFO:apache_beam.utils.subprocess_server:	at java.lang.ClassLoader.loadClass(ClassLoader.java:419)
INFO:apache_beam.utils.subprocess_server:	at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:352)
INFO:apache_beam.utils.subprocess_server:	at java.lang.ClassLoader.loadClass(ClassLoader.java:352)
INFO:apache_beam.utils.subprocess_server:	... 19 more
ERROR:apache_beam.utils.subprocess_server:Started job service with ('java', '-jar', '/runner/_work/beam/beam/runners/spark/3/job-server/build/libs/beam-runners-spark-3-job-server-2.56.0-SNAPSHOT.jar', '--spark-master-url', 'local[4]', '--artifacts-dir', '/tmp/beam-temp8q8022zi/artifactsg6e8usou', '--job-port', '56313', '--artifact-port', '0', '--expansion-port', '0')
ERROR:apache_beam.utils.subprocess_server:Error bringing up service
Traceback (most recent call last):
  File "/runner/_work/beam/beam/sdks/python/apache_beam/utils/subprocess_server.py", line 175, in start
    raise RuntimeError(
RuntimeError: Service failed to start up with error 1
Traceback (most recent call last):
  File "/opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/runner/_work/beam/beam/sdks/python/apache_beam/examples/wordcount.py", line 111, in <module>
    run()
  File "/runner/_work/beam/beam/sdks/python/apache_beam/examples/wordcount.py", line 106, in run
    output | 'Write' >> WriteToText(known_args.output)
  File "/runner/_work/beam/beam/sdks/python/apache_beam/pipeline.py", line 612, in __exit__
    self.result = self.run()
  File "/runner/_work/beam/beam/sdks/python/apache_beam/pipeline.py", line 586, in run
    return self.runner.run_pipeline(self, self._options)
  File "/runner/_work/beam/beam/sdks/python/apache_beam/runners/runner.py", line 192, in run_pipeline
    return self.run_portable_pipeline(
  File "/runner/_work/beam/beam/sdks/python/apache_beam/runners/portability/portable_runner.py", line 381, in run_portable_pipeline
    job_service_handle = self.create_job_service(options)
  File "/runner/_work/beam/beam/sdks/python/apache_beam/runners/portability/portable_runner.py", line 296, in create_job_service
    return self.create_job_service_handle(server.start(), options)
  File "/runner/_work/beam/beam/sdks/python/apache_beam/runners/portability/job_server.py", line 81, in start
    self._endpoint = self._job_server.start()
  File "/runner/_work/beam/beam/sdks/python/apache_beam/runners/portability/job_server.py", line 110, in start
    return self._server.start()
  File "/runner/_work/beam/beam/sdks/python/apache_beam/utils/subprocess_server.py", line 175, in start
    raise RuntimeError(
RuntimeError: Service failed to start up with error 1
> Task :sdks:python:test-suites:portable:py38:portableWordCountSparkRunnerBatch FAILED
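
The NoClassDefFoundError above suggests the job-server shadow jar is missing the vendored gRPC classes. As a hedged sketch (the jar path is copied from the log and assumed to exist relative to the Beam checkout; this is only a debugging aid, not how the test suite verifies it), one way to check whether the class is actually packaged:

import zipfile

# Jar path taken from the log above, relative to the Beam checkout.
jar = "runners/spark/3/job-server/build/libs/beam-runners-spark-3-job-server-2.56.0-SNAPSHOT.jar"
# The class named in the NoClassDefFoundError, written as a jar entry path.
wanted = "org/apache/beam/vendor/grpc/v1p60p1/io/grpc/BindableService.class"

with zipfile.ZipFile(jar) as zf:
    # False here would point at a broken or stale shadow jar.
    print(wanted in zf.namelist())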

@shunping
Contributor

Adding the author of the commit on which the post-commit job first failed:
@damccorm

@damccorm
Contributor

I think we can pretty comfortably rule out that change; it was to the YAML SDK, which is unrelated to portableWordCountSparkRunnerBatch. Note that this workflow runs on a schedule, not per commit, though none of the commits in that scheduled window look particularly harmful.

@shunping
Contributor

I see. It was red for the last two weeks and flaky before that too.

@kennknowles
Member

Permared right now

@damccorm
Contributor

Only sorta - each component job is actually not permared - e.g. there are 2 successes here, https://github.com/apache/beam/actions/runs/8873798546

The whole workflow is permared just because our flake percentage is so high

@kennknowles
Member

Yea, let's work out how to get top-level signal.

@Abacn
Contributor

Abacn commented Apr 29, 2024

The lowest and highest Python versions (3.8, 3.11) run more tests than (3.9, 3.10); it could be those extra tests or tasks that are permared.

@kennknowles
Member

Could make sense to find a way to get separate top-level signal for Python versions, assuming we can use software engineering to share everything necessary so they don't get out of sync.

@Abacn
Contributor

Abacn commented Apr 29, 2024

Yeah, we used to have this for Jenkins where each Python PostCommit had its own task

@liferoad
Collaborator

liferoad commented May 27, 2024

The Vertex AI package version issue (we do not import this directly, so it should be fine):


../../build/gradleenv/-1734967050/lib/python3.9/site-packages/vertexai/preview/developer/__init__.py:33
  (same warning repeated from several test workers)
/runner/_work/beam/beam/build/gradleenv/-1734967050/lib/python3.9/site-packages/vertexai/preview/developer/__init__.py:33: DeprecationWarning:
After May 30, 2024, importing any code below will result in an error.
Please verify that you are explicitly pinning to a version of `google-cloud-aiplatform`
(e.g., google-cloud-aiplatform==[1.32.0, 1.49.0]) if you need to continue using this
library.

from vertexai.preview import (
    init,
    remote,
    VertexModel,
    register,
    from_pretrained,
    developer,
    hyperparameter_tuning,
    tabular_models,
)
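
Since the warning only asks consumers of vertexai to pin google-cloud-aiplatform, a quick way to see which version the test virtualenv actually resolved (a minimal sketch; the package name comes from the warning itself, and nothing here is Beam-specific):

import importlib.metadata

# Report the resolved google-cloud-aiplatform version in the current environment,
# to compare against the range mentioned in the deprecation notice (1.32.0 - 1.49.0).
print(importlib.metadata.version("google-cloud-aiplatform"))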


@liferoad
Collaborator

liferoad commented May 28, 2024

A new flaky test in py39 and this is related to #29617:

https://ge.apache.org/s/hb7syztoolfhu/console-log?page=17


=================================== FAILURES ===================================
_______________ BigQueryQueryToTableIT.test_big_query_legacy_sql _______________
[gw3] linux -- Python 3.9.19 /runner/_work/beam/beam/build/gradleenv/1398941893/bin/python3.9

self = <apache_beam.io.gcp.big_query_query_to_table_it_test.BigQueryQueryToTableIT testMethod=test_big_query_legacy_sql>

    @pytest.mark.it_postcommit
    def test_big_query_legacy_sql(self):
      verify_query = DIALECT_OUTPUT_VERIFY_QUERY % self.output_table
      expected_checksum = test_utils.compute_hash(DIALECT_OUTPUT_EXPECTED)
      pipeline_verifiers = [
          PipelineStateMatcher(),
          BigqueryMatcher(
              project=self.project,
              query=verify_query,
              checksum=expected_checksum)
      ]

      extra_opts = {
          'query': LEGACY_QUERY,
          'output': self.output_table,
          'output_schema': DIALECT_OUTPUT_SCHEMA,
          'use_standard_sql': False,
          'wait_until_finish_duration': WAIT_UNTIL_FINISH_DURATION_MS,
          'on_success_matcher': all_of(*pipeline_verifiers),
      }
      options = self.test_pipeline.get_full_options_as_args(**extra_opts)
>     big_query_query_to_table_pipeline.run_bq_pipeline(options)

apache_beam/io/gcp/big_query_query_to_table_it_test.py:178:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
apache_beam/io/gcp/big_query_query_to_table_pipeline.py:103: in run_bq_pipeline
    result = p.run()
apache_beam/testing/test_pipeline.py:115: in run
    result = super().run(
apache_beam/pipeline.py:560: in run
    return Pipeline.from_runner_api(
apache_beam/pipeline.py:587: in run
    return self.runner.run_pipeline(self, self._options)
apache_beam/runners/direct/test_direct_runner.py:42: in run_pipeline
    self.result = super().run_pipeline(pipeline, options)
apache_beam/runners/direct/direct_runner.py:117: in run_pipeline
    from apache_beam.runners.portability.fn_api_runner import fn_runner
apache_beam/runners/portability/fn_api_runner/__init__.py:18: in <module>
    from apache_beam.runners.portability.fn_api_runner.fn_runner import FnApiRunner
apache_beam/runners/portability/fn_api_runner/fn_runner.py:68: in <module>
    from apache_beam.runners.portability.fn_api_runner import execution
apache_beam/runners/portability/fn_api_runner/execution.py:62: in <module>
    from apache_beam.runners.portability.fn_api_runner import translations
apache_beam/runners/portability/fn_api_runner/translations.py:55: in <module>
    from apache_beam.runners.worker import bundle_processor
apache_beam/runners/worker/bundle_processor.py:69: in <module>
    from apache_beam.runners.worker import operations
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

>   ???
E   KeyError: '__pyx_vtable__'

apache_beam/runners/worker/operations.py:1: KeyError
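
A KeyError on '__pyx_vtable__' while importing the worker modules usually points at a stale or mismatched Cython build of the SDK. As a hedged sketch (a diagnostic only, not the fix), one can check whether operations was loaded as a compiled extension or as pure Python:

import apache_beam.runners.worker.operations as operations

# A .so/.pyd path indicates the Cython-compiled module was picked up;
# a .py path means the pure-Python fallback is in use.
print(operations.__file__)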


@liferoad
Collaborator

Last three runs are green now.


Close this for now.

@github-actions github-actions bot added this to the 2.57.0 Release milestone May 29, 2024
@shunping
Contributor

Great. Thanks @liferoad

@github-actions github-actions bot reopened this May 30, 2024
Contributor Author

Reopening since the workflow is still flaky

@github-actions github-actions bot reopened this Aug 30, 2024
Contributor Author

Reopening since the workflow is still flaky

@liferoad
Collaborator

2024-08-30T07:28:39.6571287Z if setup_options.setup_file is not None:
2024-08-30T07:28:39.6571763Z if not os.path.isfile(setup_options.setup_file):
2024-08-30T07:28:39.6572227Z > raise RuntimeError(
2024-08-30T07:28:39.6572923Z 'The file %s cannot be found. It was specified in the '
2024-08-30T07:28:39.6573578Z '--setup_file command line option.' % setup_options.setup_file)
2024-08-30T07:28:39.6574970Z E RuntimeError: The file /runner/_work/beam/beam/sdks/python/apache_beam/examples/complete/juliaset/src/setup.py cannot be found. It was specified in the --setup_file command line option.

https://productionresultssa6.blob.core.windows.net/actions-results/9f18d66f-dabf-46e8-8b29-ae50d075f3dd/workflow-job-run-912db29d-d57b-5850-6efb-b125ca814b95/logs/job/job-logs.txt?rsct=text%2Fplain&se=2024-08-30T14%3A06%3A43Z&sig=aqESnfP68oo0sF7TUtpq%2BNFgdgfCbq8Ey3q%2BFMLZtvI%3D&ske=2024-08-31T00%3A21%3A54Z&skoid=ca7593d4-ee42-46cd-af88-8b886a2f84eb&sks=b&skt=2024-08-30T12%3A21%3A54Z&sktid=398a6654-997b-47e9-b12b-9515b896b4de&skv=2024-05-04&sp=r&spr=https&sr=b&st=2024-08-30T13%3A56%3A38Z&sv=2024-05-04
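
The check that fails here is just os.path.isfile on the --setup_file path. A minimal sketch of reproducing the condition outside the test (the path is copied from the error above; how portableLocalRunnerJuliaSetWithSetupPy actually builds its invocation is not shown here):

import os
from apache_beam.options.pipeline_options import PipelineOptions, SetupOptions

# Path copied verbatim from the RuntimeError above.
options = PipelineOptions([
    "--setup_file=/runner/_work/beam/beam/sdks/python/apache_beam/examples/"
    "complete/juliaset/src/setup.py",
])
setup_file = options.view_as(SetupOptions).setup_file
# The SDK raises "The file %s cannot be found..." exactly when this is False.
print(os.path.isfile(setup_file))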

@tvalentyn
Contributor

Currently failing test:

gradlew :sdks:python:test-suites:portable:py312:portableLocalRunnerJuliaSetWithSetupPy

@damccorm
Contributor

damccorm commented Nov 1, 2024

This is red again - https://github.com/apache/beam/actions/workflows/beam_PostCommit_Python.yml?query=branch%3Amaster

It looks like there are currently 2 issues:

  1. Python 3.9 job is failing, I think probably because of the mypy changes. example failure
  2. The TensorRT tests are failing. Originally, they were failing because of a mismatch between container/local python versions, but now they seem to be running into CUDA issues with the new container. example failure and corresponding failing Dataflow job

@damccorm damccorm reopened this Nov 1, 2024
@damccorm
Contributor

damccorm commented Nov 1, 2024

@jrmccluskey would you mind taking a look at these?

@damccorm damccorm assigned jrmccluskey and unassigned liferoad Nov 1, 2024
@jrmccluskey
Contributor

Failure in the 3.9 postcommit is apache_beam/examples/fastavro_it_test.py::FastavroIT::test_avro_it, will dive deeper into that shortly

@jrmccluskey
Contributor

The problem in the TensorRT container is that we seem to have two different versions of CUDA installed, one at version 11.8 and the other at 12.1 (we want everything at 12.1)
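
A hedged sketch of how one could confirm the mismatch inside the container (nvcc reports the installed toolkit version and nvidia-smi the driver-side CUDA version; whether both binaries are on PATH in this image is an assumption):

import subprocess

# Print toolkit and driver CUDA versions; an 11.8 vs 12.1 split here
# would match the mismatch described above.
for cmd in (["nvcc", "--version"], ["nvidia-smi"]):
    try:
        print(subprocess.run(cmd, capture_output=True, text=True).stdout)
    except FileNotFoundError:
        print(f"{cmd[0]} not on PATH in this container")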

@damccorm
Contributor

damccorm commented Nov 4, 2024

Looks like after sickbaying TensorRT tests, there are still failures. https://ge.apache.org/s/27igat7sfmcsu/console-log/task/:sdks:python:test-suites:portable:py310:portableWordCountSparkRunnerBatch?anchor=60&page=1 is an example, it looks like we're failing because we're missing a class in the spark runner.

@Abacn would you mind taking a look? It's unclear why this is happening now, but I'm guessing it may be related to #32976 (and maybe some caching kept it from showing up?)

@Abacn
Contributor

Abacn commented Nov 4, 2024

> Looks like after sickbaying TensorRT tests, there are still failures. https://ge.apache.org/s/27igat7sfmcsu/console-log/task/:sdks:python:test-suites:portable:py310:portableWordCountSparkRunnerBatch?anchor=60&page=1 is an example, it looks like we're failing because we're missing a class in the spark runner.
>
> @Abacn would you mind taking a look? It's unclear why this is happening now, but I'm guessing it may be related to #32976 (and maybe some caching kept it from showing up?)

It's a bad Gradle cache. I cannot reproduce this locally on the master branch, and I also inspected the expansion jar.

For some reason, the Gradle cache for shadowJar has been breaking more frequently recently.

@shunping
Contributor

shunping commented Nov 11, 2024

It started to fail again last week (Friday), beginning with the distroless Python SDK PR: 81f35ab (@damondouglas)

#21 [distroless 5/6] COPY --from=base /usr/lib/python3.9 /usr/lib/python3.9
#21 ERROR: failed to calculate checksum of ref 21e0551f-9179-41a9-b6c7-d487e40b7288::4b5lek0fokkw0omzyb94t5h7y: "/usr/lib/python3.9": not found

@shunping
Contributor

There is no /usr/lib/python3.9 in the image python:3.9-bookworm. I can only see the python3 and python3.11 folders there, and I think we may need to copy the python3 one.

$ docker run -it python:3.9-bookworm bash
root@b730cccba5a8:/# ls -d /usr/lib/python*
/usr/lib/python3  /usr/lib/python3.11

root@b730cccba5a8:/# ls -d /usr/local/lib/python*
/usr/local/lib/python3.11  /usr/local/lib/python3.9

@damondouglas , could you confirm that?

@damccorm
Contributor

@shunping I think Damon is on vacation. If there is a quick fix, please go ahead and apply it; otherwise, could you please revert, and we can try again when Damon is back / after the 2.61.0 release.

cc/ @Abacn

@shunping
Contributor

sg, will see if the fix I have in mind can work.

@shunping
Contributor

Ok, I took another look at this.
The test started to fail at 11/06 6:32PM (https://github.com/apache/beam/actions/runs/11713650994); the last successful run was at 11/06 12:33PM (https://github.com/apache/beam/actions/runs/11708854671). There are two commits in this time interval:

  • Distroless Python SDK: 81f35ab, which causes the previously mentioned error during docker image building (:sdks:python:container:py39:docker)
  • Kafka: eeebae1, which seems to be the reason for the failure (:sdks:python:test-suites:portable:py39:postCommitPy39IT).

The Kafka error message is shown below:

FAILED apache_beam/io/external/xlang_kafkaio_it_test.py::CrossLanguageKafkaIOTest::test_local_kafkaio_populated_key - RuntimeError: Pipeline BeamApp-runner-1111115329-514dd26a_03822608-80d0-4037-bc13-11d632204f46 failed in state FAILED: java.lang.RuntimeException: Error received from SDK harness for instruction 3: org.apache.beam.sdk.util.UserCodeException: java.io.IOException: KafkaWriter : failed to send 1 records (since last report)
	at org.apache.beam.sdk.util.UserCodeException.wrap(UserCodeException.java:39)
	at org.apache.beam.sdk.io.kafka.KafkaWriter$DoFnInvoker.invokeProcessElement(Unknown Source)
	at org.apache.beam.fn.harness.FnApiDoFnRunner.processElementForParDo(FnApiDoFnRunner.java:810)
	at org.apache.beam.fn.harness.data.PCollectionConsumerRegistry$MetricTrackingFnDataReceiver.accept(PCollectionConsumerRegistry.java:348)
	at org.apache.beam.fn.harness.data.PCollectionConsumerRegistry$MetricTrackingFnDataReceiver.accept(PCollectionConsumerRegistry.java:275)
	at org.apache.beam.fn.harness.FnApiDoFnRunner.outputTo(FnApiDoFnRunner.java:1837)
	at org.apache.beam.fn.harness.FnApiDoFnRunner.access$3100(FnApiDoFnRunner.java:145)
	at org.apache.beam.fn.harness.FnApiDoFnRunner$NonWindowObservingProcessBundleContext.output(FnApiDoFnRunner.java:2695)
	at org.apache.beam.sdk.transforms.MapElements$2.processElement(MapElements.java:151)
	at org.apache.beam.sdk.transforms.MapElements$2$DoFnInvoker.invokeProcessElement(Unknown Source)
	at org.apache.beam.fn.harness.FnApiDoFnRunner.processElementForParDo(FnApiDoFnRunner.java:810)
	at org.apache.beam.fn.harness.data.PCollectionConsumerRegistry$MetricTrackingFnDataReceiver.accept(PCollectionConsumerRegistry.java:348)
	at org.apache.beam.fn.harness.data.PCollectionConsumerRegistry$MetricTrackingFnDataReceiver.accept(PCollectionConsumerRegistry.java:275)
	at org.apache.beam.fn.harness.BeamFnDataReadRunner.forwardElementToConsumer(BeamFnDataReadRunner.java:213)
	at org.apache.beam.sdk.fn.data.BeamFnDataInboundObserver.multiplexElements(BeamFnDataInboundObserver.java:172)
	at org.apache.beam.sdk.fn.data.BeamFnDataInboundObserver.awaitCompletion(BeamFnDataInboundObserver.java:136)
	at org.apache.beam.fn.harness.control.ProcessBundleHandler.processBundle(ProcessBundleHandler.java:550)
	at org.apache.beam.fn.harness.control.BeamFnControlClient.delegateOnInstructionRequestType(BeamFnControlClient.java:150)
	at org.apache.beam.fn.harness.control.BeamFnControlClient$InboundObserver.lambda$onNext$0(BeamFnControlClient.java:115)
	at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
	at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
	at org.apache.beam.sdk.util.UnboundedScheduledExecutorService$ScheduledFutureTask.run(UnboundedScheduledExecutorService.java:163)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
	at java.base/java.lang.Thread.run(Thread.java:829)
Caused by: java.io.IOException: KafkaWriter : failed to send 1 records (since last report)
	at org.apache.beam.sdk.io.kafka.KafkaWriter.checkForFailures(KafkaWriter.java:183)
	at org.apache.beam.sdk.io.kafka.KafkaWriter.processElement(KafkaWriter.java:66)
Caused by: org.apache.kafka.common.errors.TimeoutException: Topic xlang_kafkaio_test_populated_key_e9df3a07-037f-45a1-afde-7cea599f9570 not present in metadata after 60000 ms.
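
The TimeoutException says the topic never showed up in the broker metadata. As a hedged sketch only (it assumes the kafka-python client is installed on the machine running the cross-language test, and the bootstrap server address is a placeholder; the IT itself gets its broker address from a test option), one could list the broker's topics to see whether auto-creation is working:

from kafka import KafkaConsumer

# bootstrap_servers is an assumption; substitute the broker the IT actually uses.
consumer = KafkaConsumer(bootstrap_servers="localhost:9092")
# The failing xlang_kafkaio_test_populated_key_* topic would need to appear here.
print(sorted(consumer.topics()))
consumer.close()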

@Abacn, could you check this and see if we need to roll it back?

@Abacn
Contributor

Abacn commented Nov 11, 2024

Thanks for taking care of it. I am +1 for rollback. The first distroless PR was expected to be a no-op for the 2.61.0 release. Good to know it broke something before the release cut.
