
Unknown error in KubernetesJobWatcher. Failing #33066

Closed
2 tasks done
karakanb opened this issue Aug 3, 2023 · 7 comments
@karakanb
Contributor

karakanb commented Aug 3, 2023

Apache Airflow version

2.6.3

What happened

I regularly see these logs in my scheduler logs every 10 minutes:

[2023-08-03T11:54:27.257+0000] {kubernetes_executor.py:114} ERROR - Unknown error in KubernetesJobWatcher. Failing
Traceback (most recent call last):
  File "/home/airflow/.local/lib/python3.10/site-packages/urllib3/response.py", line 761, in _update_chunk_length
    self.chunk_left = int(line, 16)
ValueError: invalid literal for int() with base 16: b''
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "/home/airflow/.local/lib/python3.10/site-packages/urllib3/response.py", line 444, in _error_catcher
    yield
  File "/home/airflow/.local/lib/python3.10/site-packages/urllib3/response.py", line 828, in read_chunked
    self._update_chunk_length()
  File "/home/airflow/.local/lib/python3.10/site-packages/urllib3/response.py", line 765, in _update_chunk_length
    raise InvalidChunkLength(self, line)
urllib3.exceptions.InvalidChunkLength: InvalidChunkLength(got length b'', 0 bytes read)
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "/home/airflow/.local/lib/python3.10/site-packages/airflow/executors/kubernetes_executor.py", line 105, in run
    self.resource_version = self._run(
  File "/home/airflow/.local/lib/python3.10/site-packages/airflow/executors/kubernetes_executor.py", line 161, in _run
    for event in self._pod_events(kube_client=kube_client, query_kwargs=kwargs):
  File "/home/airflow/.local/lib/python3.10/site-packages/kubernetes/watch/watch.py", line 165, in stream
    for line in iter_resp_lines(resp):
  File "/home/airflow/.local/lib/python3.10/site-packages/kubernetes/watch/watch.py", line 56, in iter_resp_lines
    for seg in resp.stream(amt=None, decode_content=False):
  File "/home/airflow/.local/lib/python3.10/site-packages/urllib3/response.py", line 624, in stream
    for line in self.read_chunked(amt, decode_content=decode_content):
  File "/home/airflow/.local/lib/python3.10/site-packages/urllib3/response.py", line 816, in read_chunked
    with self._error_catcher():
  File "/usr/local/lib/python3.10/contextlib.py", line 153, in __exit__
    self.gen.throw(typ, value, traceback)
  File "/home/airflow/.local/lib/python3.10/site-packages/urllib3/response.py", line 461, in _error_catcher
    raise ProtocolError("Connection broken: %r" % e, e)
urllib3.exceptions.ProtocolError: ("Connection broken: InvalidChunkLength(got length b'', 0 bytes read)", InvalidChunkLength(got length b'', 0 bytes read))
Process KubernetesJobWatcher-3:
Traceback (most recent call last):
  File "/home/airflow/.local/lib/python3.10/site-packages/urllib3/response.py", line 761, in _update_chunk_length
    self.chunk_left = int(line, 16)
ValueError: invalid literal for int() with base 16: b''
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "/home/airflow/.local/lib/python3.10/site-packages/urllib3/response.py", line 444, in _error_catcher
    yield
  File "/home/airflow/.local/lib/python3.10/site-packages/urllib3/response.py", line 828, in read_chunked
    self._update_chunk_length()
  File "/home/airflow/.local/lib/python3.10/site-packages/urllib3/response.py", line 765, in _update_chunk_length
    raise InvalidChunkLength(self, line)
urllib3.exceptions.InvalidChunkLength: InvalidChunkLength(got length b'', 0 bytes read)
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/home/airflow/.local/lib/python3.10/site-packages/airflow/executors/kubernetes_executor.py", line 105, in run
    self.resource_version = self._run(
  File "/home/airflow/.local/lib/python3.10/site-packages/airflow/executors/kubernetes_executor.py", line 161, in _run
    for event in self._pod_events(kube_client=kube_client, query_kwargs=kwargs):
  File "/home/airflow/.local/lib/python3.10/site-packages/kubernetes/watch/watch.py", line 165, in stream
    for line in iter_resp_lines(resp):
  File "/home/airflow/.local/lib/python3.10/site-packages/kubernetes/watch/watch.py", line 56, in iter_resp_lines
    for seg in resp.stream(amt=None, decode_content=False):
  File "/home/airflow/.local/lib/python3.10/site-packages/urllib3/response.py", line 624, in stream
    for line in self.read_chunked(amt, decode_content=decode_content):
  File "/home/airflow/.local/lib/python3.10/site-packages/urllib3/response.py", line 816, in read_chunked
    with self._error_catcher():
  File "/usr/local/lib/python3.10/contextlib.py", line 153, in __exit__
    self.gen.throw(typ, value, traceback)
  File "/home/airflow/.local/lib/python3.10/site-packages/urllib3/response.py", line 461, in _error_catcher
    raise ProtocolError("Connection broken: %r" % e, e)
urllib3.exceptions.ProtocolError: ("Connection broken: InvalidChunkLength(got length b'', 0 bytes read)", InvalidChunkLength(got length b'', 0 bytes read))
[2023-08-03T11:54:27.869+0000] {kubernetes_executor.py:335} ERROR - Error while health checking kube watcher process for namespace airflow. Process died for unknown reasons

I'm not sure about the implications of this, but these logs show up every time I need to investigate something, which makes debugging harder. In the best case this is just noise that clutters the logs; in the worst case it causes some issue that I haven't been able to identify yet.
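
The root-cause line in the traceback is urllib3 failing to parse a chunk-size line from the HTTP chunked stream that the kubernetes watch API uses. A minimal illustration of the mechanics (a hypothetical snippet, not Airflow code):

# Watch events arrive over HTTP chunked transfer encoding; every chunk is
# prefixed by its length as a hex string. If the connection is closed
# mid-stream, urllib3 reads an empty length line, and parsing it as hex
# fails exactly as in the traceback above.
int(b"", 16)  # ValueError: invalid literal for int() with base 16: b''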

What you think should happen instead

There should be no such log; this looks like unexpected behavior.

How to reproduce

Deploy the official helm chart v1.9.0 using the following values file:

extraVolumeMounts: &gitsync_volume_mounts
  - name: shard1-data
    mountPath: /gitsync-client-repos

extraVolumes: &gitsync_volumes
  - name: shard1-data
    persistentVolumeClaim:
      claimName: shard1-data

# User and group of airflow user
uid: 50000
gid: 0

# Detailed default security context for airflow deployments
securityContexts:
  pod: {}
  containers: {}

# Airflow home directory
# Used for mount paths
airflowHome: /opt/airflow

# Airflow version (Used to make some decisions based on Airflow Version being deployed)
airflowVersion: "2.6.3"

# Images
images:
  airflow:
    repository: registry.gitlab.com/org/repo
    tag: "2.6.3"
    pullPolicy: IfNotPresent

  pod_template:
    # Note that `images.pod_template.repository` and `images.pod_template.tag` parameters
    # can be overridden in `config.kubernetes` section. So for these parameters to have effect
    # `config.kubernetes.worker_container_repository` and `config.kubernetes.worker_container_tag`
    # must not be set.
    repository: ~
    tag: ~
    pullPolicy: IfNotPresent
  flower:
    repository: ~
    tag: ~
    pullPolicy: IfNotPresent
  statsd:
    repository: quay.io/prometheus/statsd-exporter
    tag: v0.22.8
    pullPolicy: IfNotPresent
  redis:
    repository: redis
    tag: 7-bullseye
    pullPolicy: IfNotPresent
  gitSync:
    repository: registry.k8s.io/git-sync/git-sync
    tag: v3.6.3
    pullPolicy: IfNotPresent

# Ingress configuration
ingress:
  # Configs for the Ingress of the web Service
  web:
    # Enable web ingress resource
    enabled: true
    annotations:
      nginx.ingress.kubernetes.io/affinity: cookie
    hosts:
      - name: "airflow.mycompany.com"

    ingressClassName: "nginx"

  flower:
    enabled: false

executor: "CeleryKubernetesExecutor"
allowPodLaunching: true

env:
  - name: AIRFLOW__CORE__SECURE_MODE
    value: "True"
  - name: AIRFLOW__CORE__PARALLELISM
    value: "25"
  - name: AIRFLOW__CORE__MAX_ACTIVE_TASKS_PER_DAG
    value: "12"
  - name: AIRFLOW__CORE__MAX_ACTIVE_RUNS_PER_DAG
    value: "1"
  - name: AIRFLOW__CORE__DAGBAG_IMPORT_TIMEOUT
    value: "60.0"
  - name: AIRFLOW__CELERY_BROKER_TRANSPORT_OPTIONS__VISIBILITY_TIMEOUT
    value: "64800"
  - name: AIRFLOW__CELERY__WORKER_CONCURRENCY
    value: "8"
  - name: AIRFLOW__API__AUTH_BACKENDS
    value: "airflow.api.auth.backend.basic_auth"
  - name: AIRFLOW__SCHEDULER__TASK_QUEUED_TIMEOUT
    value: "1200.0"

# Enables selected built-in secrets that are set via environment variables by default.
# Those secrets are provided by the Helm Chart secrets by default but in some cases you
# might want to provide some of those variables with _CMD or _SECRET variable, and you should
# in this case disable setting of those variables by setting the relevant configuration to false.
enableBuiltInSecretEnvVars:
  AIRFLOW__CORE__FERNET_KEY: true
  # For Airflow <2.3, backward compatibility; moved to [database] in 2.3
  AIRFLOW__CORE__SQL_ALCHEMY_CONN: true
  AIRFLOW__DATABASE__SQL_ALCHEMY_CONN: true
  AIRFLOW_CONN_AIRFLOW_DB: true
  AIRFLOW__WEBSERVER__SECRET_KEY: true
  AIRFLOW__CELERY__CELERY_RESULT_BACKEND: true
  AIRFLOW__CELERY__RESULT_BACKEND: true
  AIRFLOW__CELERY__BROKER_URL: true
  AIRFLOW__ELASTICSEARCH__HOST: true
  AIRFLOW__ELASTICSEARCH__ELASTICSEARCH_HOST: true

# Airflow database & redis config
data:
  metadataSecretName: airflow-metadata-db-connection
  brokerUrlSecretName: airflow-celery-redis

fernetKey: my-fernet-key
webserverSecretKey: my-webserver-key

# Airflow Worker Config
workers:
  # Number of airflow celery workers in StatefulSet
  replicas: 6
  persistence:
    # Enable persistent volumes
    enabled: false
    # Volume size for worker StatefulSet
    size: 50Gi
    # If using a custom storageClass, pass name ref to all statefulSets here
    storageClassName: nfs

  resources:
    limits:
      memory: 3000Mi
    requests:
      cpu: "500m"
      memory: 1800Mi

  extraVolumeMounts: *gitsync_volume_mounts
  extraVolumes: *gitsync_volumes

  logGroomerSidecar:
    # Whether to deploy the Airflow worker log groomer sidecar.
    enabled: false

  env: []


# Airflow scheduler settings
scheduler:
  replicas: 1
  podDisruptionBudget:
    enabled: true

  resources:
    limits:
      memory: "3Gi"
    requests:
      cpu: 500m
      memory: "1200Mi"

  extraVolumes: []
  extraVolumeMounts: []

  logGroomerSidecar:
    enabled: true
    retentionDays: 90
    resources:
      limits:
        cpu: 100m
        memory: 128Mi
      requests:
        cpu: 100m
        memory: 128Mi

# Airflow webserver settings
webserver:
  # Number of webservers
  replicas: 2
  podDisruptionBudget:
    enabled: true
    # config:
    #   minAvailable: 1

  networkPolicy:
    ingress:
      # Peers for webserver NetworkPolicy ingress
      from: []
      # Ports for webserver NetworkPolicy ingress (if `from` is set)
      ports:
        - port: "{{ .Values.ports.airflowUI }}"

  resources:
    requests:
      cpu: "1"
      memory: "1500Mi"
    limits:
      memory: "2200Mi"

  # Create initial user.
  defaultUser:
    enabled: false

  # Launch additional containers into webserver.
  extraContainers: []
  # Add additional init containers into webserver.
  extraInitContainers: []

# Airflow Triggerer Config
triggerer:
  enabled: true
  # Number of airflow triggerers in the deployment
  replicas: 1

  persistence:
    enabled: false

  extraVolumeMounts: *gitsync_volume_mounts
  extraVolumes: *gitsync_volumes

  resources:
    limits:
      memory: 2000Mi
    requests:
      cpu: 250m
      memory: 1200Mi

  logGroomerSidecar:
    enabled: false

# Airflow Dag Processor Config
dagProcessor:
  enabled: true
  replicas: 1

  resources:
    limits:
      memory: 1800Mi
    requests:
      cpu: 1
      memory: 1500Mi

  # Mount additional volumes into dag processor.
  extraVolumeMounts: *gitsync_volume_mounts
  extraVolumes: *gitsync_volumes

# StatsD settings
statsd:
  enabled: false

# Configuration for the redis provisioned by the chart
redis:
  enabled: false

registry:
  secretName: image-pull-creds

# Define any ResourceQuotas for namespace
quotas: {}

# Define default/max/min values for pods and containers in namespace
limits: []

# This runs as a CronJob to cleanup old pods.
cleanup:
  enabled: false
  # Run every 15 minutes
  schedule: "*/15 * * * *"
  # Command to use when running the cleanup cronjob (templated).
  command: ~
  # Args to use when running the cleanup cronjob (templated).
  args:
    [
      "bash",
      "-c",
      "exec airflow kubernetes cleanup-pods --namespace={{ .Release.Namespace }}",
    ]

  # jobAnnotations are annotations on the cleanup CronJob
  jobAnnotations: {}

  # Select certain nodes for airflow cleanup pods.
  nodeSelector: {}
  affinity: {}
  tolerations: []
  topologySpreadConstraints: []

  podAnnotations: {}

  # Labels specific to cleanup objects and pods
  labels: {}

  resources: {}
  #  limits:
  #   cpu: 100m
  #   memory: 128Mi
  #  requests:
  #   cpu: 100m
  #   memory: 128Mi

  # Create ServiceAccount
  serviceAccount:
    # Specifies whether a ServiceAccount should be created
    create: true
    # The name of the ServiceAccount to use.
    # If not set and create is true, a name is generated using the release name
    name: ~

    # Annotations to add to cleanup cronjob kubernetes service account.
    annotations: {}

  # When not set, the values defined in the global securityContext will be used
  securityContext: {}
  #  runAsUser: 50000
  #  runAsGroup: 0
  env: []

  # Specify history limit
  # When set, overwrite the default k8s number of successful and failed CronJob executions that are saved.
  failedJobsHistoryLimit: ~
  successfulJobsHistoryLimit: ~

# Configuration for postgresql subchart
# Not recommended for production
postgresql:
  enabled: false

# Config settings to go into the mounted airflow.cfg
#
# Please note that these values are passed through the `tpl` function, so are
# all subject to being rendered as go templates. If you need to include a
# literal `{{` in a value, it must be expressed like this:
#
#    a: '{{ "{{ not a template }}" }}'
#
# Do not set config containing secrets via plain text values, use Env Var or k8s secret object
# yamllint disable rule:line-length
config:
  core:
    dags_folder: '{{ include "airflow_dags" . }}'
    # This is ignored when used with the official Docker image
    load_examples: "False"
    executor: "{{ .Values.executor }}"
    # For Airflow 1.10, backward compatibility; moved to [logging] in 2.0
    colored_console_log: "False"
    remote_logging: '{{- ternary "True" "False" .Values.elasticsearch.enabled }}'
  logging:
    remote_logging: '{{- ternary "True" "False" .Values.elasticsearch.enabled }}'
    colored_console_log: "False"
  webserver:
    enable_proxy_fix: "True"
    warn_deployment_exposure: "False"
    # For Airflow 1.10
    rbac: "True"
  celery:
    flower_url_prefix: "{{ .Values.ingress.flower.path }}"
    worker_concurrency: 8
  scheduler:
    standalone_dag_processor: '{{ ternary "True" "False" .Values.dagProcessor.enabled }}'
    # statsd params included for Airflow 1.10 backward compatibility; moved to [metrics] in 2.0
    statsd_on: '{{ ternary "True" "False" .Values.statsd.enabled }}'
    statsd_port: 9125
    statsd_prefix: airflow
    statsd_host: '{{ printf "%s-statsd" .Release.Name }}'
    # `run_duration` included for Airflow 1.10 backward compatibility; removed in 2.0.
    run_duration: 41460
  celery_kubernetes_executor:
    kubernetes_queue: "kubernetes"
  # The `kubernetes` section is deprecated in Airflow >= 2.5.0 due to an airflow.cfg schema change.
  # The `kubernetes` section can be removed once the helm chart no longer supports Airflow < 2.5.0.
  kubernetes:
    namespace: "{{ .Release.Namespace }}"
    # The following `airflow_` entries are for Airflow 1, and can be removed when it is no longer supported.
    airflow_configmap: '{{ include "airflow_config" . }}'
    airflow_local_settings_configmap: '{{ include "airflow_config" . }}'
    pod_template_file: '{{ include "airflow_pod_template_file" . }}/pod_template_file.yaml'
    worker_container_repository: "{{ .Values.images.airflow.repository | default .Values.defaultAirflowRepository }}"
    worker_container_tag: "{{ .Values.images.airflow.tag | default .Values.defaultAirflowTag }}"
    multi_namespace_mode: '{{ ternary "True" "False" .Values.multiNamespaceMode }}'
  # The `kubernetes_executor` section duplicates the `kubernetes` section in Airflow >= 2.5.0 due to an airflow.cfg schema change.
  kubernetes_executor:
    namespace: "{{ .Release.Namespace }}"
    pod_template_file: '{{ include "airflow_pod_template_file" . }}/pod_template_file.yaml'
    worker_container_repository: "{{ .Values.images.airflow.repository | default .Values.defaultAirflowRepository }}"
    worker_container_tag: "{{ .Values.images.airflow.tag | default .Values.defaultAirflowTag }}"
    multi_namespace_mode: '{{ ternary "True" "False" .Values.multiNamespaceMode }}'
# yamllint enable rule:line-length

# Whether Airflow can launch workers and/or pods in multiple namespaces
# If true, it creates ClusterRole/ClusterRolebinding (with access to entire cluster)
multiNamespaceMode: false

# `podTemplate` is a templated string containing the contents of `pod_template_file.yaml` used for
# KubernetesExecutor workers. The default `podTemplate` will use normal `workers` configuration parameters
# (e.g. `workers.resources`). As such, you normally won't need to override this directly, however,
# you can still provide a completely custom `pod_template_file.yaml` if desired.
# If not set, a default one is created using `files/pod-template-file.kubernetes-helm-yaml`.
podTemplate: ~
# The following example is NOT functional, but meant to be illustrative of how you can provide a custom
# `pod_template_file`. You're better off starting with the default in
# `files/pod-template-file.kubernetes-helm-yaml` and modifying from there.
# We will set `priorityClassName` in this example:
# podTemplate: |
#   apiVersion: v1
#   kind: Pod
#   metadata:
#     name: placeholder-name
#     labels:
#       tier: airflow
#       component: worker
#       release: {{ .Release.Name }}
#   spec:
#     priorityClassName: high-priority
#     containers:
#       - name: base
#         ...

# Git sync
dags:
  persistence:
    # Annotations for dags PVC
    annotations: {}
    # Enable persistent volume for storing dags
    enabled: false
    # Volume size for dags
    size: 1Gi
    # If using a custom storageClass, pass name here
    storageClassName: nfs
    # access mode of the persistent volume
    accessMode: ReadWriteMany
    ## the name of an existing PVC to use
    existingClaim:
    ## optional subpath for dag volume mount
    subPath: ~
  gitSync:
    enabled: true

    # git repo clone url
    # ssh example: git@github.com:apache/airflow.git
    # https example: https://github.com/apache/airflow.git
    repo: git@github.com:org/dags.git
    branch: main
    rev: HEAD
    depth: 1
    # the number of consecutive failures allowed before aborting
    maxFailures: 1
    # subpath within the repo where dags are located
    # should be "" if dags are at repo root
    subPath: ""

    sshKeySecret: airflow-gitsync-dags-clone
    knownHosts: |-
      github.com ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABgQCj7ndNxQowgcQnjshcLrqPEiiphnt+VTTvDP6mHBL9j1aNUkY4Ue1gvwnGLVlOhGeYrnZaMgRK6+PKCUXaDbC7qtbW8gIkhL7aGCsOr/C56SJMy/BCZfxd1nWzAOxSDPgVsmerOBYfNqltV9/hWCqBywINIR+5dIg6JTJ72pcEpEjcYgXkE2YEFXV1JHnsKgbLWNlhScqb2UmyRkQyytRLtL+38TGxkxCflmO+5Z8CSSNY7GidjMIZ7Q4zMjA2n1nGrlTDkzwDCsw+wqFPGQA179cnfGWOWRVruj16z6XyvxvjJwbz0wQZ75XK5tKSb7FNyeIEs4TT4jk+S4dhPeAUC5y+bDYirYgM4GC7uEnztnZyaVWQ7B381AK4Qdrwt51ZqExKbQpTUNn+EjqoTwvqNj4kqx5QUCI0ThS/YkOxJCXmPUWZbhjpCg56i+2aB6CmK2JGhn57K5mj0MNdBXA4/WnwH6XoPWJzK5Nyu2zB3nAZp+S5hpQs+p1vN1/wsjk=
      gitlab.com ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQCsj2bNKTBSpIYDEGk9KxsGh3mySTRgMtXL583qmBpzeQ+jqCMRgBqB98u3z++J1sKlXHWfM9dyhSevkMwSbhoR8XIq/U0tCNyokEi/ueaBMCvbcTHhO7FcwzY92WK4Yt0aGROY5qX2UKSeOvuP4D6TPqKF1onrSzH9bx9XUf2lEdWT/ia1NEKjunUqu1xOB/StKDHMoX4/OKyIzuS0q/T1zOATthvasJFoPrAjkohTyaDUz2LN5JoH839hViyEG82yB+MjcFV5MU3N1l1QL3cVUCh93xSaua1N85qivl+siMkPGbO5xR/En4iEY6K2XPASUEMaieWVNTRCtJ4S8H+9
      bitbucket.org ssh-rsa AAAAB3NzaC1yc2EAAAABIwAAAQEAubiN81eDcafrgMeLzaFPsw2kNvEcqTKl/VqLat/MaB33pZy0y3rJZtnqwR2qOOvbwKZYKiEO1O6VqNEBxKvJJelCq0dTXWT5pbO2gDXC6h6QDXCaHo6pOHGPUy+YBaGQRGuSusMEASYiWunYN0vCAI8QaXnWMXNMdFP3jHAJH0eDsoiGnLPBlBp4TNm6rYI74nMzgz3B9IikW4WVK+dc8KZJZWYjAuORU3jc1c/NPskD2ASinf8v3xnfXeukU0sJ5N6m5E8VLjObPEO+mN2t/FZTMZLiFqPWc/ALSqnMnnhwrNi2rbfg/rd/IpL8Le3pSBne8+seeFVBoGqzHM9yXw==

    # interval between git sync attempts in seconds
    # high values are more likely to cause DAGs to become out of sync between different components
    # low values cause more traffic to the remote git repository
    wait: 5
    containerName: git-sync
    uid: 65533

    extraVolumeMounts: []
    env:
      - name: GIT_SYNC_SUBMODULES
        value: "off"

    resources:
      limits:
        memory: 180Mi
      requests:
        cpu: 100m
        memory: 128Mi

logs:
  persistence:
    enabled: true
    size: 50Gi
    storageClassName: nfs

Operating System

Debian GNU/Linux 11 (bullseye)

Versions of Apache Airflow Providers

apache-airflow-providers-amazon==8.1.0
apache-airflow-providers-celery==3.2.1
apache-airflow-providers-cncf-kubernetes==7.4.0
apache-airflow-providers-common-sql==1.6.1
apache-airflow-providers-discord==3.2.0
apache-airflow-providers-docker==3.7.1
apache-airflow-providers-elasticsearch==4.5.1
apache-airflow-providers-ftp==3.4.2
apache-airflow-providers-google==10.5.0
apache-airflow-providers-grpc==3.2.1
apache-airflow-providers-hashicorp==3.4.1
apache-airflow-providers-http==4.4.2
apache-airflow-providers-imap==3.2.2
apache-airflow-providers-microsoft-azure==6.1.2
apache-airflow-providers-mysql==5.1.1
apache-airflow-providers-odbc==4.0.0
apache-airflow-providers-postgres==5.5.1
apache-airflow-providers-redis==3.2.1
apache-airflow-providers-sendgrid==3.2.1
apache-airflow-providers-sftp==4.3.1
apache-airflow-providers-slack==7.3.2
apache-airflow-providers-snowflake==4.4.0
apache-airflow-providers-sqlite==3.4.2
apache-airflow-providers-ssh==3.7.1
apache-airflow-providers-tableau==4.2.0

Deployment

Official Apache Airflow Helm Chart

Deployment details

Kubernetes v1.27.2

Anything else

Literally every 10 minutes:

[screenshot of the recurring error logs]

Are you willing to submit PR?

  • Yes I am willing to submit a PR!

Code of Conduct

  • I agree to follow this project's Code of Conduct

@karakanb added the area:core, kind:bug, and needs-triage labels on Aug 3, 2023
@potiuk
Member

potiuk commented Aug 6, 2023

Unfortunately the exception information is swallowed. Is it possible for you to find this line:

   self.log.exception("Unknown error in KubernetesJobWatcher. Failing")

in kubernetes_executor_utils and replace it with:

   self.log.exception("Unknown error in KubernetesJobWatcher. Failing", exc_info=True)

That should give us more clues.
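
For context, the run() loop that the traceback points at (kubernetes_executor.py, line 105) looks roughly like this; a sketch, not the exact source:

# Rough sketch of KubernetesJobWatcher.run(); names follow the traceback,
# details may differ from the actual Airflow source.
def run(self) -> None:
    kube_client = get_kube_client()
    while True:
        try:
            self.resource_version = self._run(
                kube_client, self.resource_version, self.scheduler_job_id, self.kube_config
            )
        except ReadTimeoutError:
            time.sleep(1)  # transient API timeouts are retried quietly
        except Exception:
            # Note: Logger.exception() already logs with exc_info=True, so
            # adding an explicit exc_info=True is effectively a no-op, which
            # would explain unchanged output after the suggested edit.
            self.log.exception("Unknown error in KubernetesJobWatcher. Failing")
            raise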

@potiuk added the pending-response label and removed the needs-triage label on Aug 6, 2023
@karakanb
Contributor Author

karakanb commented Aug 7, 2023

I've done that, but it doesn't seem like much changed:

$ cat /home/airflow/.local/lib/python3.10/site-packages/airflow/providers/cncf/kubernetes/executors/kubernetes_executor_utils.py | grep Unknown
                self.log.exception("Unknown error in KubernetesJobWatcher. Failing", exc_info=True)

here's what it says with that change:

[2023-08-07T10:13:19.098+0000] {kubernetes_executor.py:114} ERROR - Unknown error in KubernetesJobWatcher. Failing
Traceback (most recent call last):
  File "/home/airflow/.local/lib/python3.10/site-packages/urllib3/response.py", line 761, in _update_chunk_length
    self.chunk_left = int(line, 16)
ValueError: invalid literal for int() with base 16: b''

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/airflow/.local/lib/python3.10/site-packages/urllib3/response.py", line 444, in _error_catcher
    yield
  File "/home/airflow/.local/lib/python3.10/site-packages/urllib3/response.py", line 828, in read_chunked
    self._update_chunk_length()
  File "/home/airflow/.local/lib/python3.10/site-packages/urllib3/response.py", line 765, in _update_chunk_length
    raise InvalidChunkLength(self, line)
urllib3.exceptions.InvalidChunkLength: InvalidChunkLength(got length b'', 0 bytes read)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/airflow/.local/lib/python3.10/site-packages/airflow/executors/kubernetes_executor.py", line 105, in run
    self.resource_version = self._run(
  File "/home/airflow/.local/lib/python3.10/site-packages/airflow/executors/kubernetes_executor.py", line 161, in _run
    for event in self._pod_events(kube_client=kube_client, query_kwargs=kwargs):
  File "/home/airflow/.local/lib/python3.10/site-packages/kubernetes/watch/watch.py", line 165, in stream
    for line in iter_resp_lines(resp):
  File "/home/airflow/.local/lib/python3.10/site-packages/kubernetes/watch/watch.py", line 56, in iter_resp_lines
    for seg in resp.stream(amt=None, decode_content=False):
  File "/home/airflow/.local/lib/python3.10/site-packages/urllib3/response.py", line 624, in stream
    for line in self.read_chunked(amt, decode_content=decode_content):
  File "/home/airflow/.local/lib/python3.10/site-packages/urllib3/response.py", line 816, in read_chunked
    with self._error_catcher():
  File "/usr/local/lib/python3.10/contextlib.py", line 153, in __exit__
    self.gen.throw(typ, value, traceback)
  File "/home/airflow/.local/lib/python3.10/site-packages/urllib3/response.py", line 461, in _error_catcher
    raise ProtocolError("Connection broken: %r" % e, e)
urllib3.exceptions.ProtocolError: ("Connection broken: InvalidChunkLength(got length b'', 0 bytes read)", InvalidChunkLength(got length b'', 0 bytes read))
Process KubernetesJobWatcher-3:
Traceback (most recent call last):
  File "/home/airflow/.local/lib/python3.10/site-packages/urllib3/response.py", line 761, in _update_chunk_length
    self.chunk_left = int(line, 16)
ValueError: invalid literal for int() with base 16: b''

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/airflow/.local/lib/python3.10/site-packages/urllib3/response.py", line 444, in _error_catcher
    yield
  File "/home/airflow/.local/lib/python3.10/site-packages/urllib3/response.py", line 828, in read_chunked
    self._update_chunk_length()
  File "/home/airflow/.local/lib/python3.10/site-packages/urllib3/response.py", line 765, in _update_chunk_length
    raise InvalidChunkLength(self, line)
urllib3.exceptions.InvalidChunkLength: InvalidChunkLength(got length b'', 0 bytes read)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/home/airflow/.local/lib/python3.10/site-packages/airflow/executors/kubernetes_executor.py", line 105, in run
    self.resource_version = self._run(
  File "/home/airflow/.local/lib/python3.10/site-packages/airflow/executors/kubernetes_executor.py", line 161, in _run
    for event in self._pod_events(kube_client=kube_client, query_kwargs=kwargs):
  File "/home/airflow/.local/lib/python3.10/site-packages/kubernetes/watch/watch.py", line 165, in stream
    for line in iter_resp_lines(resp):
  File "/home/airflow/.local/lib/python3.10/site-packages/kubernetes/watch/watch.py", line 56, in iter_resp_lines
    for seg in resp.stream(amt=None, decode_content=False):
  File "/home/airflow/.local/lib/python3.10/site-packages/urllib3/response.py", line 624, in stream
    for line in self.read_chunked(amt, decode_content=decode_content):
  File "/home/airflow/.local/lib/python3.10/site-packages/urllib3/response.py", line 816, in read_chunked
    with self._error_catcher():
  File "/usr/local/lib/python3.10/contextlib.py", line 153, in __exit__
    self.gen.throw(typ, value, traceback)
  File "/home/airflow/.local/lib/python3.10/site-packages/urllib3/response.py", line 461, in _error_catcher
    raise ProtocolError("Connection broken: %r" % e, e)
urllib3.exceptions.ProtocolError: ("Connection broken: InvalidChunkLength(got length b'', 0 bytes read)", InvalidChunkLength(got length b'', 0 bytes read))
[2023-08-07T10:13:19.432+0000] {kubernetes_executor.py:335} ERROR - Error while health checking kube watcher process for namespace airflow. Process died for unknown reasons

It looks like the same thing to me; maybe I made the change in the wrong place?

@potiuk
Member

potiuk commented Aug 14, 2023

I think this issue was reported a long time ago and happened in past versions of k8s; you can try to find similar issues. kubernetes-client/python#972 is one of them, but there are a number of related ones. From a quick search it does not seem that issue had any reasonable resolution, but my best guess and suggestion to you @karakanb would be to upgrade to the latest k8s version. 1.27 is at .4 currently; not sure if that will help, but there is a chance it will.
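
For anyone working around this outside Airflow, the common pattern with the kubernetes Python client is to treat a mid-stream ProtocolError as a signal to re-establish the watch from the last seen resourceVersion. A hedged sketch (the namespace and parameters here are illustrative, not taken from this issue):

import urllib3
from kubernetes import client, config, watch

config.load_incluster_config()  # or config.load_kube_config() outside a cluster
v1 = client.CoreV1Api()
w = watch.Watch()
resource_version = "0"
while True:
    try:
        for event in w.stream(
            v1.list_namespaced_pod,
            namespace="airflow",          # assumption: the namespace being watched
            resource_version=resource_version,
            timeout_seconds=300,          # bound each watch; reconnect afterwards
        ):
            resource_version = event["object"].metadata.resource_version
            print(event["type"], event["object"].metadata.name)
    except urllib3.exceptions.ProtocolError:
        # The API server or an intermediate proxy dropped the chunked
        # connection mid-stream; reconnect instead of dying.
        continue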

@karakanb
Contributor Author

I am already on 1.27.2 and the issue is still there, unfortunately. I guess this has nothing to do with Airflow then; I'll close the issue.

@julienlau

Same issue here with kube 1.26.11 and airflow 2.7.0.

@barun-mazumdar

barun-mazumdar commented May 13, 2024

Same issue here as well, with kube 1.28.5 and airflow 2.8.2. Is there any configuration variable I can change?

@sharon-clue

I'm on kube v1.29.4 and airflow 2.8.2. Did anyone find a solution?
