Watch stream missing job completion events #2238

headyj · 2024-05-23T13:29:46Z

What happened (please include outputs or screenshots):

Sometimes the watch stream seems to miss job completion events. This is not easy to reproduce, as two consecutive runs of the same code can give different results.

Here is the code, which watches a job's status and prints the pod logs:

import logging
import sys

from kubernetes import watch

# batchV1 (BatchV1Api), coreV1 (CoreV1Api), namespace and jobName are defined earlier in the script.
w = watch.Watch()
for event in w.stream(func=batchV1.list_namespaced_job, namespace=namespace, timeout_seconds=0):
  # Only react to events for the job being tracked.
  if event['object'].metadata.name == jobName:
    logging.info(event)
    if event['type'] == "ADDED":
      logging.info("job %s created, waiting for pod to be running...", jobName)
    if event["object"].status.ready:
      pods = coreV1.list_namespaced_pod(namespace=namespace, label_selector="job-name={}".format(jobName))
      logging.info("pod %s is ready", pods.items[0].metadata.name)
      # Blocks here until the pod's log stream closes.
      for line in coreV1.read_namespaced_pod_log(name=pods.items[0].metadata.name, namespace=namespace, follow=True, _preload_content=False).stream():
        print(line.decode(), end='')
    if event["object"].status.succeeded:
      logging.info("Finished pod stream.")
      w.stop()
    if not event["object"].status.active and event["object"].status.failed:
      w.stop()
      logging.error("Job Failed")
      sys.exit(1)

Sometimes the script never exits, even though the watched job has completed. The script itself runs in the same Kubernetes cluster, but in a different namespace. I tried several values for timeout_seconds, but it doesn't help. The last event received is the one where the job becomes active:

[INFO] {'type': 'ADDED', 'object': {'api_version': 'batch/v1', [...] 'job-name': 'my-job-1716468085', [...] 'status': {'active': None, [...], 'ready': None, 'start_time': None [...]
[INFO] job my-job-1716468085 created, waiting for pod to be running...
[INFO] {'type': 'MODIFIED', 'object': {'api_version': 'batch/v1', [...] 'job-name': 'my-job-1716468085', [...] 'status': {'active': 1, [...], 'ready': 0, 'start_time': datetime.datetime(2024, 5, 23, 12, 41, 25, tzinfo=tzlocal()), [...]
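For comparison, here is a minimal sketch of a watch loop that does not depend on a single long-lived stream. It is not a confirmed fix for this issue. It assumes that a job's terminal state can be read from status.conditions (type "Complete" or "Failed" with status "True"), and that starting a new watch without a resource version replays the job's current state as an ADDED event. The names job_terminal_state, wait_for_job and watch_timeout are hypothetical and do not come from the script above.

```python
from types import SimpleNamespace


def job_terminal_state(job):
    """Return "Complete" or "Failed" if the job has finished, else None."""
    conditions = getattr(job.status, "conditions", None) or []
    for cond in conditions:
        # Condition status is the string "True", not a boolean.
        if cond.type in ("Complete", "Failed") and cond.status == "True":
            return cond.type
    return None


def wait_for_job(batch_api, namespace, job_name, watch_timeout=60):
    """Watch a single job until it is terminal, restarting the watch on expiry.

    batch_api is assumed to be a kubernetes.client.BatchV1Api instance.
    """
    # Imported lazily so job_terminal_state can be used without the client installed.
    from kubernetes import watch

    selector = "metadata.name={}".format(job_name)
    while True:
        w = watch.Watch()
        for event in w.stream(
            func=batch_api.list_namespaced_job,
            namespace=namespace,
            field_selector=selector,
            timeout_seconds=watch_timeout,
        ):
            state = job_terminal_state(event["object"])
            if state is not None:
                w.stop()
                return state
        # The stream ended without a terminal state: open a new watch, which
        # is assumed to report the job's current status again.


# Stand-in objects that mimic the shape of a V1Job, for an offline check.
done = SimpleNamespace(status=SimpleNamespace(
    conditions=[SimpleNamespace(type="Complete", status="True")]))
running = SimpleNamespace(status=SimpleNamespace(conditions=None))
print(job_terminal_state(done))     # Complete
print(job_terminal_state(running))  # None
```

The bounded watch_timeout means that an update lost on one connection would be picked up when the next watch starts, at the cost of re-reading the job once per interval. Following the pod logs would have to happen outside this loop so that the watch keeps being consumed.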

The job status is correctly updated on the Kubernetes side, as seen in k9s:

Events:
  Type    Reason            Age   From            Message
  ----    ------            ----  ----            -------
  Normal  SuccessfulCreate  35m   job-controller  Created pod: my-job-1716468085-9dq8d
  Normal  Completed         32m   job-controller  Job completed

What you expected to happen:

The job completion event should be caught and delivered by the watch stream.

How to reproduce it (as minimally and precisely as possible):

Just run the above code in the python 3.12-slim Docker image. As said above, the problem seems to be sporadic. I haven't been able to reproduce it any other way yet, but I will update this ticket if I do.
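Since the problem is sporadic, a compact per-event log line may make it easier to see which update is the last one to arrive. The sketch below reduces each watch event to a handful of fields. The helper name summarize_event is hypothetical, and the stand-in event only mimics the shape of the objects the client returns.

```python
from types import SimpleNamespace


def summarize_event(event):
    """Reduce a job watch event to the fields useful for spotting a missed update."""
    job = event["object"]
    status = job.status
    return {
        "type": event["type"],
        "name": job.metadata.name,
        "resource_version": job.metadata.resource_version,
        "active": status.active,
        "ready": status.ready,
        "succeeded": status.succeeded,
        "failed": status.failed,
    }


# Stand-in event with the same attribute layout as a real one (values are made up).
sample_event = {
    "type": "MODIFIED",
    "object": SimpleNamespace(
        metadata=SimpleNamespace(name="example-job", resource_version="12345"),
        status=SimpleNamespace(active=1, ready=0, succeeded=None, failed=None),
    ),
}
summary = summarize_event(sample_event)
print(summary["type"], summary["resource_version"], summary["active"])
```

Logging summarize_event(event) instead of the full event keeps the resource version and counters of every update on one short line, which can then be compared with what the API server reports for the job.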

Anything else we need to know?:

Environment:

Kubernetes version (kubectl version): v1.29 (EKS)
OS (e.g., MacOS 10.13.6): N/A
Python version (python --version): Python 3.12 (official 3.12-slim Docker image: https://hub.docker.com/_/python)
Python client version (pip list | grep kubernetes): 29.0.0


yliaog · 2024-06-05T20:25:27Z

Thanks for reporting the issue, please update the ticket when you can reproduce it reliably.

headyj added the kind/bug Categorizes issue or PR as related to a bug. label May 23, 2024
