Prevent KubernetesJobWatcher getting stuck on resource too old #23521

ecerulm · 2022-05-06T10:11:11Z

Currently in Airflow 2.3.0 the scheduler will get into an infinite loop if the k8s API ever responds with a 410 (too old resource version / resource version too old).

This currently happens in Airflow deployments in separate EKS 1.21 clusters, where every hour or so the scheduler gets stuck printing the error below in a tight loop and requires a manual restart.

Below is a log excerpt from a scheduler in such a state (a tight loop of too old resource version errors). As you can see, the "and now my watch begins starting at resource_version: 380528372" message repeats at 09:58:19,950, then one second later at 09:58:21,012, then again half a second later at 09:58:21,541, always asking k8s to start a watch from the exact same resourceVersion, which will always return too old resource version.

[2022-05-07 09:58:19,941] {kubernetes_executor.py:288} ERROR - Error while health checking kube watcher process. Process died for unknown reasons
[2022-05-07 09:58:19,950] {kubernetes_executor.py:126} INFO - Event: and now my watch begins starting at resource_version: 380528372
[2022-05-07 09:58:19,965] {kubernetes_executor.py:111} ERROR - Unknown error in KubernetesJobWatcher. Failing
Traceback (most recent call last):
  File "/home/airflow/.local/lib/python3.8/site-packages/airflow/executors/kubernetes_executor.py", line 102, in run
    self.resource_version = self._run(
  File "/home/airflow/.local/lib/python3.8/site-packages/airflow/executors/kubernetes_executor.py", line 145, in _run
    for event in list_worker_pods():
  File "/home/airflow/.local/lib/python3.8/site-packages/kubernetes/watch/watch.py", line 182, in stream
    raise client.rest.ApiException(
kubernetes.client.exceptions.ApiException: (410)
Reason: Expired: too old resource version: 380528372 (382477719)
Process KubernetesJobWatcher-164543:
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/home/airflow/.local/lib/python3.8/site-packages/airflow/executors/kubernetes_executor.py", line 102, in run
    self.resource_version = self._run(
  File "/home/airflow/.local/lib/python3.8/site-packages/airflow/executors/kubernetes_executor.py", line 145, in _run
    for event in list_worker_pods():
  File "/home/airflow/.local/lib/python3.8/site-packages/kubernetes/watch/watch.py", line 182, in stream
    raise client.rest.ApiException(
kubernetes.client.exceptions.ApiException: (410)
Reason: Expired: too old resource version: 380528372 (382477719)
[2022-05-07 09:58:21,002] {kubernetes_executor.py:288} ERROR - Error while health checking kube watcher process. Process died for unknown reasons
[2022-05-07 09:58:21,012] {kubernetes_executor.py:126} INFO - Event: and now my watch begins starting at resource_version: 380528372
[2022-05-07 09:58:21,025] {kubernetes_executor.py:111} ERROR - Unknown error in KubernetesJobWatcher. Failing
Traceback (most recent call last):
  File "/home/airflow/.local/lib/python3.8/site-packages/airflow/executors/kubernetes_executor.py", line 102, in run
    self.resource_version = self._run(
  File "/home/airflow/.local/lib/python3.8/site-packages/airflow/executors/kubernetes_executor.py", line 145, in _run
    for event in list_worker_pods():
  File "/home/airflow/.local/lib/python3.8/site-packages/kubernetes/watch/watch.py", line 182, in stream
    raise client.rest.ApiException(
kubernetes.client.exceptions.ApiException: (410)
Reason: Expired: too old resource version: 380528372 (382477719)
Process KubernetesJobWatcher-164544:
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/home/airflow/.local/lib/python3.8/site-packages/airflow/executors/kubernetes_executor.py", line 102, in run
    self.resource_version = self._run(
  File "/home/airflow/.local/lib/python3.8/site-packages/airflow/executors/kubernetes_executor.py", line 145, in _run
    for event in list_worker_pods():
  File "/home/airflow/.local/lib/python3.8/site-packages/kubernetes/watch/watch.py", line 182, in stream
    raise client.rest.ApiException(
kubernetes.client.exceptions.ApiException: (410)
Reason: Expired: too old resource version: 380528372 (382477719)
[2022-05-07 09:58:21,531] {kubernetes_executor.py:288} ERROR - Error while health checking kube watcher process. Process died for unknown reasons
[2022-05-07 09:58:21,541] {kubernetes_executor.py:126} INFO - Event: and now my watch begins starting at resource_version: 380528372
[2022-05-07 09:58:21,556] {kubernetes_executor.py:111} ERROR - Unknown error in KubernetesJobWatcher. Failing
Traceback (most recent call last):
  File "/home/airflow/.local/lib/python3.8/site-packages/airflow/executors/kubernetes_executor.py", line 102, in run
    self.resource_version = self._run(
  File "/home/airflow/.local/lib/python3.8/site-packages/airflow/executors/kubernetes_executor.py", line 145, in _run
    for event in list_worker_pods():
  File "/home/airflow/.local/lib/python3.8/site-packages/kubernetes/watch/watch.py", line 182, in stream
    raise client.rest.ApiException(
kubernetes.client.exceptions.ApiException: (410)
Reason: Expired: too old resource version: 380528372 (382477719)
Process KubernetesJobWatcher-164545:
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/home/airflow/.local/lib/python3.8/site-packages/airflow/executors/kubernetes_executor.py", line 102, in run
    self.resource_version = self._run(
  File "/home/airflow/.local/lib/python3.8/site-packages/airflow/executors/kubernetes_executor.py", line 145, in _run
    for event in list_worker_pods():
  File "/home/airflow/.local/lib/python3.8/site-packages/kubernetes/watch/watch.py", line 182, in stream
    raise client.rest.ApiException(
kubernetes.client.exceptions.ApiException: (410)
Reason: Expired: too old resource version: 380528372 (382477719)

The situation is as follows:

1. KubernetesJobWatcher starts a get/list/watch cycle at resource_version=0 using the kubernetes-client/python library (kubernetes.watch.Watch().stream()).
2. The client will eventually get disconnected.
3. KubernetesJobWatcher will try to start a new get/list/watch cycle from what it thinks is the latest resource version, 380528372.
   - KubernetesJobWatcher tries hard to keep track of the last resource_version seen, but there are inherent limitations in the k8s API (for example, the initial get/list does not guarantee ordering).
   - Even if it tracked the resource version perfectly, a watch from a specific resource version is not guaranteed to work, as stated in the Kubernetes documentation under "Efficient detection of changes":

   > A given Kubernetes server will only preserve a historical record of changes for a limited time. Clusters using etcd 3 preserve changes in the last 5 minutes by default. When the requested watch operations fail because the historical version of that resource is not available, clients must handle the case by recognizing the status code 410 Gone, clearing their local cache, performing a new get or list operation, and starting the watch from the resourceVersion that was returned. [ecerulm: using the resourceVersion that was returned in the get/list operation, not the one that KubernetesJobWatcher saved]

4. This new attempt to watch from resourceVersion 380528372 will fail with a 410 resource too old (in my scenario this happens with high probability after the scheduler has been up for 1 hour). Again, this is an expected, non-exceptional case: there is no guarantee that you can restart the watch from resource version 380528372.
5. When the watch operation fails immediately with kubernetes.client.exceptions.ApiException: (410) Reason: Expired: too old resource version: 380528372 (382477719), KubernetesJobWatcher will die.
6. A new KubernetesJobWatcher will then start and again try to watch from exactly the same resource version, 380528372, which will fail again for the same reason.
7. From this point on the scheduler is stuck with no progress: it continues in a tight infinite loop of starting KubernetesJobWatcher, letting it attempt a watch that will always fail, noticing that KubernetesJobWatcher died again, starting it again, and so on.
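
The restart loop described above can be sketched in plain Python. This is a toy model, not Airflow code: `Gone`, `start_watch`, and `restart_loop` are made-up names, no real Kubernetes client is involved, and treating the second number in the error message (382477719) as the oldest version the server still has is an assumption made for illustration. The loop is capped at `max_restarts` so the example terminates.

```python
class Gone(Exception):
    """Stand-in for an ApiException with status 410 (hypothetical, for illustration)."""


def start_watch(resource_version, oldest_available):
    """Pretend to open a watch; fail if the requested version has expired."""
    if resource_version < oldest_available:
        raise Gone(f"too old resource version: {resource_version} ({oldest_available})")
    return resource_version


def restart_loop(saved_version, oldest_available, max_restarts):
    """Restart the watcher after each failure WITHOUT touching saved_version."""
    attempts = []
    for _ in range(max_restarts):
        attempts.append(saved_version)
        try:
            start_watch(saved_version, oldest_available)
            break  # the watch started, so the loop ends
        except Gone:
            # The watcher "dies". Nothing updates saved_version, so the
            # next attempt asks for exactly the same expired version.
            continue
    return attempts


attempts = restart_loop(380528372, 382477719, max_restarts=3)
print(attempts)
```

Every attempt in `attempts` carries the same resource version, which is the pattern visible in the log excerpt.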

This PR solves the situation (already tested for 10 hours in one of my airflow deployments in EKS) by always starting a fresh watch (resourceVersion=0) after KubernetesJobWatcher dies.
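
A minimal sketch of the idea, assuming a simplified process-wide `ResourceVersion` holder and a hypothetical `health_check` helper (the `FakeWatcher` class and the function signature are illustrative stand-ins, not the actual diff): when the watcher is found dead, the saved version is reset to "0" before a new watcher is created.

```python
class ResourceVersion:
    """Simplified process-wide holder for the last seen resource version."""

    _instance = None
    resource_version = "0"

    def __new__(cls):
        # Always hand back the same object so every caller shares one value.
        if cls._instance is None:
            cls._instance = super().__new__(cls)
        return cls._instance


class FakeWatcher:
    """Hypothetical stand-in for the watcher process."""

    def __init__(self, resource_version, alive=True):
        self.resource_version = resource_version
        self._alive = alive

    def is_alive(self):
        return self._alive


def health_check(watcher, make_watcher):
    """Keep a live watcher; replace a dead one with a fresh watch from "0"."""
    if watcher.is_alive():
        return watcher
    # Forget the saved version: it may be the reason the watcher died.
    ResourceVersion().resource_version = "0"
    return make_watcher(ResourceVersion().resource_version)


ResourceVersion().resource_version = "380528372"
dead = FakeWatcher("380528372", alive=False)
replacement = health_check(dead, FakeWatcher)
print(replacement.resource_version)
```

The design choice here is to give up on resuming rather than to guess a newer version: a fresh watch always starts, whereas any saved version may already have expired.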

closes: #21087
related: #15500
related: #17629
related: #22407
related: #12644

ephraimbuddy · 2022-05-06T12:22:16Z

You may want to look at this comment #15500 (comment)

If the watch fails because "resource too old" the KubernetesJobWatcher should not retry with the same resource version as that will end up in loop where there is no progress.

ncapeta · 2022-05-07T18:05:14Z

Any possibility to get a release with this ?

potiuk · 2022-05-07T18:17:39Z

> Any possibility to get a release with this ?

I am about to prepare next release of providers very soon. If it gets merged before that it will go to that release - if not, then it will have to wait for the next one.

eladkal · 2022-05-07T18:24:13Z

> I am about to prepare next release of providers very soon. If it gets merged before that it will go to that release - if not, then it will have to wait for the next one.

The files modified by this PR are of Airflow core so it doesn't matter for provider release :)
I'll mark it for 2.3.1

potiuk · 2022-05-07T18:49:06Z

> The files modified by this PR are of Airflow core so it doesn't matter for provider release :)

Ah. Right. The Provider vs Core Kubernetes strikes back :)

ncapeta · 2022-05-07T19:24:27Z

Tks, this is making in some instances 2.3 not usable at all as the schedulers after 1h need to be deleted

potiuk · 2022-05-07T19:36:56Z

> Tks, this is making in some instances 2.3 not usable at all as the schedulers after 1h need to be deleted

Yeah. 2.3.1 with some teething problems of 2.3.0 solved is coming soon.

ncapeta · 2022-05-11T03:20:11Z

any potential release date ?

potiuk · 2022-05-11T06:30:10Z

> any potential release date ?

Whatever we tell you we might not hold to because we might not be able to find/fix all important issues by then. And if you base any "business" decisions of yours on what we tell, then you will have complaints. So better not to give you false hopes for OSS project.

Just watch how many issues are left here (some might get added though still):

https://github.com/apache/airflow/milestone/54

However, being an OSS project has an advantage: you can just take the change and apply it yourself in your environment while waiting. It's a few-line change that you can easily apply to an installed Airflow: just find the installed sources and update them in a way that persists in your environment. This way you can handle your "business" and you cannot say "I am stuck because they have not released yet".
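
As a rough sketch of "find the installed sources": the traceback in the PR description shows the file at airflow/executors/kubernetes_executor.py, so asking Python where that module is installed should point at the file to patch. `find_module_file` is a hypothetical helper written for this example; it returns None when the module is not importable in the current environment.

```python
import importlib.util


def find_module_file(module_name):
    """Return the source file path of an importable module, or None if it is missing."""
    try:
        spec = importlib.util.find_spec(module_name)
    except (ImportError, ValueError):
        # Raised when a parent package (for example "airflow") is not installed.
        return None
    return spec.origin if spec is not None else None


if __name__ == "__main__":
    # Module path taken from the traceback in the PR description.
    print(find_module_file("airflow.executors.kubernetes_executor"))
```

Run it with the same Python interpreter the scheduler uses, otherwise it may report a different installation or none at all.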

…e#23521)

* Prevent KubernetesJobWatcher getting stuck on resource too old

  If the watch fails because "resource too old" the KubernetesJobWatcher should not retry with the same resource version as that will end up in loop where there is no progress.

* Reset ResourceVersion().resource_version to 0

(cherry picked from commit dee05b2)

ashb · 2022-05-16T11:02:46Z

Does this change mean that we potentially miss some events?

i.e. if the version n is too old, and we reset it to 0, and the next event is m, then any pod state changes between n and m won't ever be seen?

ecerulm · 2022-05-16T14:05:26Z

Yes, you potentially miss old events, just like before. The only difference is that the scheduler does not get into an infinite loop because of it.

If Airflow gets a 410 from the Kubernetes API, that is already a pretty good indicator that some event has been lost, but there is nothing Airflow can do about it at that point, right?

The question remains whether Kubernetes offers the right API with the right guarantees to implement this kind of scheduling reliably, or whether another approach would be better. This PR was only about preventing the scheduler from entering an infinite loop in which progress was impossible.
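
To make the trade-off concrete, here is a toy model of the recovery described in the Kubernetes documentation quoted in the PR description: on a 410, list again and watch from the version the list returned. It is plain Python with no real Kubernetes client, and `FakePodApi`, `Gone`, and `watch_with_recovery` are made-up names. Under these assumptions the list recovers the current state of each pod, but the intermediate transitions between the saved version and the fresh one are never seen.

```python
class Gone(Exception):
    """Stand-in for an ApiException with status 410 (hypothetical)."""


class FakePodApi:
    """Toy API server that keeps only the last `history_size` events."""

    def __init__(self, history_size):
        self.version = 0
        self.pods = {}    # pod name -> current phase
        self.events = []  # (version, name, phase), truncated history
        self.history_size = history_size

    def set_phase(self, name, phase):
        self.version += 1
        self.pods[name] = phase
        self.events.append((self.version, name, phase))
        self.events = self.events[-self.history_size:]

    def list_pods(self):
        """Return the current state plus the version it corresponds to."""
        return dict(self.pods), self.version

    def watch(self, resource_version):
        """Return events newer than resource_version, or raise Gone if some expired."""
        oldest = self.events[0][0] if self.events else self.version + 1
        if resource_version + 1 < oldest:
            raise Gone(f"too old resource version: {resource_version} ({oldest})")
        return [e for e in self.events if e[0] > resource_version]


def watch_with_recovery(api, saved_version):
    """Try the saved version first; on Gone, list again and watch from there."""
    try:
        return None, api.watch(saved_version)
    except Gone:
        snapshot, fresh_version = api.list_pods()
        return snapshot, api.watch(fresh_version)


api = FakePodApi(history_size=2)
api.set_phase("pod-a", "Pending")    # version 1, the last one the watcher saw
api.set_phase("pod-a", "Running")    # version 2, will expire from history
api.set_phase("pod-a", "Succeeded")  # version 3
api.set_phase("pod-b", "Pending")    # version 4
snapshot, events = watch_with_recovery(api, saved_version=1)
print(snapshot, events)
```

In this run the "Running" transition of pod-a is lost for good, while the snapshot still reports its final phase.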

* Clean up in-line f-string concatenation (#23591) * Apply specific ID collation to root_dag_id too (#23536) In certain databases there is a need to set the collation for ID fields like dag_id or task_id to something different than the database default. This is because in MySQL with utf8mb4 the index size becomes too big for the MySQL limits. In past pull requests this was handled [#7570](https://github.com/apache/airflow/pull/7570), [#17729](https://github.com/apache/airflow/pull/17729), but the root_dag_id field on the dag model was missed. Since this field is used to join with the dag_id in various other models ([and self-referentially](https://github.com/apache/airflow/blob/451c7cbc42a83a180c4362693508ed33dd1d1dab/airflow/models/dag.py#L2766)), it also needs to have the same collation as other ID fields. This can be seen by running `airflow db reset` before and after applying this change while also specifying `sql_engine_collation_for_ids` in the configuration. Other related PRs [#19408](https://github.com/apache/airflow/pull/19408) * Add doc and sample dag for EC2 (#23547) * Helm chart 1.6.0rc1 (#23548) * Add sample dag and doc for S3ListOperator (#23449) * Add sample dag and doc for S3ListOperator * Fix doc * 19943 Grid view status filters (#23392) * Move tree filtering inside react and add some filters * Move filters from context to utils * Fix tests for useTreeData * Fix last tests. * Add tests for useFilters * Refact to use existing SimpleStatus component * Additional fix after rebase. * Update following bbovenzi code review * Update following code review * Fix tests. * Fix page flickering issues from react-query * Fix side panel and small changes. * Use default_dag_run_display_number in the filter options * Handle timezone * Fix flaky test Co-authored-by: Brent Bovenzi <brent.bovenzi@gmail.com> * Improve caching for multi-platform images. 
(#23562) This is another attempt to improve caching performance for multi-platform images as the previous ones were undermined by a bug in buildx multiplatform cache-to implementattion that caused the image cache to be overwritten between platforms, when multiple images were build. The bug is created for the buildx behaviour at https://github.com/docker/buildx/issues/1044 and until it is fixed we have to prpare separate caches for each platform and push them to separate tags. That adds a bit overhead on the building step, but for now it is the simplest way we can workaround the bug if we do not want to manually manipulate manifests and images. * Use inclusive words in apache airflow project (#23090) * Add exception to catch single line private keys (#23043) * Add sample dag and doc for S3ListPrefixesOperator (#23448) * Add sample dag and doc for S3ListPrefixesOperator * Fix static checks * Update min requirements for rich to 12.4.1 (#23604) * Add exportContext.offload flag to CLOUD_SQL_EXPORT_VALIDATION. (#23614) * Make Breeze help generation indepdent from having breeze installed (#23612) Generation of Breeze help requires breeze to be installed. However if you have locally installed breeze with different dependencies and did not run self-upgrade, the results of generation of the images might be different (for example when different rich version is used). 
This change works in the way that: * you do not have to have breeze installed at all to make it work * it always upgrades to latest breeze when it is not installed * but this only happens when you actually modified some breeze code * Add Quicksight create ingestion Hook and Operator (#21863) * Add Quicksight create ingestion Hook and Operator Co-authored-by: eladkal <45845474+eladkal@users.noreply.github.com> * Add slim images to docker-stack docs index (#23601) * Fixed Kubernetes Operator large xcom content Defect (#23490) * [FEATURE] google provider - split GkeStartPodOperator execute (#23518) * Implement send_callback method for CeleryKubernetesExecutor and LocalKubernetesExecutor (#23617) * Fix: Exception when parsing log #20966 (#23301) * UnicodeDecodeError: 'utf-8' codec can't decode byte 0xXX in position X: invalid start byte File "/opt/work/python395/lib/python3.9/site-packages/airflow/hooks/subprocess.py", line 89, in run_command line = raw_line.decode(output_encoding).rstrip() # raw_line == b'\x00\x00\x00\x11\xa9\x01\n' UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa9 in position 4: invalid start byte * Update subprocess.py * Update subprocess.py * Fix: Exception when parsing log #20966 * Fix: Exception when parsing log #20966 Another alternative is: try-catch it. e.g. ``` line = '' for raw_line in iter(self.sub_process.stdout.readline, b''): try: line = raw_line.decode(output_encoding).rstrip() except UnicodeDecodeError as err: print(err, output_encoding, raw_line) self.log.info("%s", line) ``` * Create test_subprocess.sh * Update test_subprocess.py * Added shell directive and license to test_subprocess.sh * Distinguish between raw and decoded lines as suggested by @uranusjr * simplify test Co-authored-by: muhua <microhuang@live.com> * Make provider doc preparation a bit more fun :) (#23629) Previously you had to manually add versions when changelog was modified. 
But why not to get a bit more fun and get the versions bumped automatically based on your assesment when reviewing the provideers rather than after looking at the generated changelog. * Prevent KubernetesJobWatcher getting stuck on resource too old (#23521) * Prevent KubernetesJobWatcher getting stuck on resource too old If the watch fails because "resource too old" the KubernetesJobWatcher should not retry with the same resource version as that will end up in loop where there is no progress. * Reset ResourceVersion().resource_version to 0 * [FEATURE] update K8S-KIND to 0.13.0 (#23636) * [FEATURE] add K8S 1.24 support (#23637) * Fix typo issue (#23633) * Fix assuming "Feature" answer on CI when generating docs (#23640) We have now different answers posisble when generating docs, and for testing we assume we answered randomly during the generation of documentation. * Simplify flash message for _airflow_moved tables (#23635) Co-authored-by: Jed Cunningham <66968678+jedcunningham@users.noreply.github.com> * Add index for event column in log table (#23625) * Don't run pre-migration checks for downgrade (#23634) These checks are only make sense for upgrades. Generally they exist to resolve referential integrity issues etc before adding constraints. In the downgrade context, we generally only remove constraints, so it's a non-issue. * Added postgres 14 to support versions(including breeze) (#23506) * Added postgres 14 to support versions(including breeze) * Add `RedshiftDeleteClusterOperator` support (#23563) Add support for `RedshiftDeleteClusterOperator`. This will help to clean resources using airflow operators when needed. 
In the current implementation, By default, I'm waiting until the cluster is completely removed to return immediately without waiting set `wait_for_completion` param to False - Add operator class - Add basic unit test - Add an example task - Add relevant documentation * Added kubernetes version (1.24) in README.md(for Main version(dev)), … (#23649) * Added kubernetes version (1.24) in README.md(for Main version(dev)), accidentally removed in merge cnflict. * Update README.md Co-authored-by: Jarek Potiuk <jarek@potiuk.com> * Fixed test and remove pytest.mark.xfail for test_exc_tb (#23650) * Fix k8s pod.execute randomly stuck indefinitely by logs consumption (#23497) (#23618) * [FEATURE] google provider - BigQueryInsertJobOperator log query (#23648) * Rename cluster_policy to task_policy (#23468) * Rename cluster_policy to task_policy * rename task_policy as example_task_policy. * Revert "Fix k8s pod.execute randomly stuck indefinitely by logs consumption (#23497) (#23618)" (#23656) This reverts commit ee342b85b97649e2e29fcf83f439279b68f1b4d4. * Prepare provider documentation 2022.05.11 (#23631) Co-authored-by: eladkal <45845474+eladkal@users.noreply.github.com> Co-authored-by: eladkal <45845474+eladkal@users.noreply.github.com> * AIP45 Remove dag parsing in airflow run local (#21877) * remove `--` in `./breeze build-docs` command (#23671) * Synchronize support for Postgres and K8S in docs (#23673) We just added support for Postgres 14 and K8S 1.24 and since we did not have any changes to support either in main we are bringing the support to 2.3 line as well. This documentation syncs all remaining places where it should be updated. * Migrate Dataproc to new system tests design (#22777) * Add wildcard possibility to `package-filter` parametere (#23672) the glob parameters (for example `apache-airflow-providers-*`) did not work because only fixed list of parameters was allowed. 
This PR converts the package-filter parameter to stop verifying the value passed - so autocomplete continues to work but you should still be able to use glob. It also removes few places where the parameters were used with `--` separator. * Replace "absolute()" with "resolve()" in pathlib objects (#23675) TIL that absolute() is an undocumented in Pathlib and that we should use resolve() instead. So this is it. * Upgrade `pip` to latest released 22.1.0 version (#23665) We are finally able to get rid of the annoying false-positive warnings and we have finally a chance on having warning-free installation during docker builds. * Shorten max pre-commit hook name length (#23677) When names are too long, pre-commit output looks very ugly and takes up 2x lines. Here I reduce max length just a little bit further so that pre-commit output renders properly on a macbook pro 16" with terminal window splitting screen horizontally. * remove stale serialized dags (#22917) * Move around overflow, position and padding (#23044) * Fix expand/collapse all buttons (#23590) * communicate via customevents * Handle open group logic in wrapper * fix tests * Make grid action buttons sticky * Add default toggle fn * fix splitting task id by '.' * fix missing dagrun ids * Update doc and sample dag for Quicksight (#23653) * Use func.count to count rows (#23657) * Add git_source to DatabricksSubmitRunOperator (#23620) The existing `DatabricksSubmitRunOperator` is extended with the support for the `git_source` parameter which allows users to run notebook tasks from files committed to git repositories. If specified, any notebook task that is part of the payload will clone the repository and check out the commit, tag, or the tip of the specified branch. This is an alternative to dev repos ([docs](https://docs.databricks.com/repos/index.html)) where the checkout/update would have to be triggered manually. 
Public documentation for the feature available here: https://docs.databricks.com/dev-tools/api/latest/jobs.html (NB: as noted in the docs, the feature is currently in public preview). * Disable Flower by default from docker-compose (#23685) * Fix property name in breeze Shell Params (#23696) The rename from #23562 missed few shell_parms usage where it also should be replaced. * Clarify that bundle extras should not be used for PyPi installs (#23697) The bundle extras we have are only used for development and they should not be used to install airflow from PyPI. This update to documentation clarifies it. Closes: #23692 * Add environment check and build image check for more Breeze commands (#23687) Several commands of Breeze depends on docker, docker compose being available as well as breeze image. They will work fine if you "just" built the image but they might benefit from the image being rebuilt (to make sure all latest dependencies are installed in the image). The common checks done in "shell" command for that are now extracted to common utils and run as first thing in those commands that need it. * Add UI tests for /utils and /components (#23456) * Add UI tests for /utils and /components * add test for Table * Address PR feedback * Fix window prompt var * Fix TaskName test from rebase * fix lint errors * Add slim image to docs/docker-stack/README.md (#23710) * Use profiles to disable flower in docker-compose (#23709) * Ensure execution_timeout as timedelta (#23655) * Handle invalid date parsing in webserver views. (#23161) * Handle invalid date from query parameters in views. * Add tests. * Use common parsing helper. * Add type hint. * Remove unwanted error check. * Fix extra_links endpoint. * Add fields to CLOUD_SQL_EXPORT_VALIDATION. 
(#23724) * Add doc and sample dag for GCSToS3Operator (#23730) * Fix grid details header text overlap (#23728) Move top margin to each breadcrumb component to make sure that there is no overlap when the header wraps with long names. * Add version to migration prefix (#23564) We don't really need the alembic revision id in the filename. having version instead is much more useful. having both of them takes up too much space. * Add typing for airflow/configuration.py (#23716) * Add typing for airflow/configuration.py The configuraiton.py did not have typing information and it made it rather difficult to reason about it-especially that it went a few changes in the past that made it rather complex to understand. This PR adds typing information all over the configuration file * Remove titles from link buttons (#23736) * Disable flower in chart by default (#23737) * Add AWS project structure tests (re: AIP-47) (#23630) * Speech To Text assets & system tests migration (AIP-47) (#23643) Co-authored-by: Wojciech Januszek <januszek@google.com> * Add 'reschedule' to the serialized fields for the BaseSensorOperator (#23674) fix #23411 * Updated MongoDB logo (#23746) As per https://www.mongodb.com/brand-resources * Fix broken main branch (#23751) main branch is broken since https://github.com/apache/airflow/pull/23630 needed rebase before merge as https://github.com/apache/airflow/pull/23730 added the missing example dag * Allow more parameters to be piped through via execute_in_subprocess (#23286) * Increase timeout for Helm Chart executor upgrade tests (#23759) * Fix task log is not captured (#23684) when StandardTaskRunner runs tasks with exec Issue: https://github.com/apache/airflow/issues/23540 * Helm chart 1.6.0rc2 (#23754) * Fix doc description of [core] parallelism config setting (#23768) * Change `Github` to `GitHub` (#23764) * Add tagging image as latest for CI image wait (#23775) The "wait for image" step lacked --tag-as-latest which made the subsequent 
"fix-ownership" step run sometimes far longer than needed - because it rebuilt the image for fix-ownership case. Also the "fix-ownership" command has been changed to just pull the image if one is missing locally rather than build. This command might be run in an environment where the image is missing or any other image was build (for example in jobs where an image was build for different Python version) in this case the command will simply use whatever Python version is available (it does not matter), or in case no image is available, it will pull the image as the last resort. * Fix auto upstream dep when expanding non-templated field (#23771) If you tried to expand via xcom into a non-templated field without explicitly setting the upstream task dependency, the scheduler would crash because the upstream task dependency wasn't being set automatically. It was being set only for templated fields, but now we do it for both. * clearer method name in scheduler_job.py (#23702) * Fallback to parse dag_file when no dag in the db (#23738) * cleanup usage of `get_connections()`` from test suite (#23757) The function is deprecated and raises warnings https://github.com/apache/airflow/pull/10192 Replacing the usage with `get_connection()` * Maintain grid view selection on filtering upstream (#23779) * Maintain grid selection on filter upstream The grid view selection was being cleared when clicking "Filter Upstream". The selection should persist. Also, added a left margin to the "Reset root" button * fix linting * Fix ``SqliteHook`` compatibility with SQLAlchemy engine (#23790) Same as https://github.com/apache/airflow/pull/19508 but for Sqlite as described in https://docs.sqlalchemy.org/en/14/dialects/sqlite.html#connect-strings to be able to create a Sqlalchemy engine from the URI itself. Without this, it currently fails with the following error due to how we create URI in Connections. 
An absolute path is denoted by starting with a slash, means you need four slashes: ``` url = sqlite://%2Ftmp%2Fsqlite.db def create_connect_args(self, url): if url.username or url.password or url.host or url.port: > raise exc.ArgumentError( "Invalid SQLite URL: %s\n" "Valid SQLite URL forms are:\n" " sqlite:///:memory: (or, sqlite://)\n" " sqlite:///relative/path/to/file.db\n" " sqlite:////absolute/path/to/file.db" % (url,) ) E sqlalchemy.exc.ArgumentError: Invalid SQLite URL: sqlite://%2Ftmp%2Fsqlite.db E Valid SQLite URL forms are: E sqlite:///:memory: (or, sqlite://) E sqlite:///relative/path/to/file.db E sqlite:////absolute/path/to/file.db ``` * Fix python version used for cache preparaation (#23785) Cache preparation on CI used default (Python 3.7) version of the image. It had an influence on time of "full build needed" only and for users who wanted to build breeze image for Python version different than default Python 3.7. It had no big influence on "main" builds" because in main we are build images with "upgrade-to-newer-dependencies" which takes longer anyway. * Add `dttm` searchable field in audit log (#23794) * Further speed up fixing ownership in CI (#23782) After #23775 I noticed that there is yet another small improvement area in the CI buld speed. Currently build-ci-image builds and push only "commit-tagged" images, but "fix-ownership" requires the "latest" image to run. This PR adds --tag-as-latest option also to build-image and build-prod-image commands - similarly as for the pull-image and pull-prod-image. This will retag the "commit" images as latest in the build-ci-images step and allow to save 1m on pulling the latest image before fix-ownership (bringing it back to 1s overhead) * Modify db clean to also catch the ProgrammingError exception (#23699) * Update the DMS Sample DAG and Docs (#23681) * postgres_operator_howto_guide.rst (#23789) Saying "**the** PostgreSQL database" confused me. 
I thought it was implying that a user could/should connect to the airflow metadata db
* Support host_name on Datadog provider (#23784)
  This is required to use other Datadog tenants like app.datadoghq.eu
* Cloud SQL assets & system tests migration (AIP-47) (#23583)
* Unbreak main after missing classes were added (#23819)
* Fix python version command (#23818)
* update CloudSqlInstanceImportOperator to CloudSQLImportInstanceOperator (#23800)
* Reformat the whole AWS documentation (#23810)
* Fix error when SnowflakeHook take empty list in `sql` param (#23767)
* Grid data: do not load all mapped instances (#23813)
  * only get necessary task instances
  * add comment
  * encode_ti -> get_task_summary
* Fix regression in ignoring symlinks (#23535)
* [Issue#22846] allow option to encode or not encode UUID when uploading from Cassandra to GCS (#23766)
* Fix provider import error matching (#23825)
* Fix secrets rendered in UI when task is not executed. (#22754)
* Fix retrieval of deprecated non-config values (#23723)
  It turned out that deprecation of config values did not work as intended. While deprecation worked fine when the value was specified in the configuration, it did not work when `run_as_user` was used. In those cases the "as_dict" option was used to generate a temporary configuration, and this temporary configuration contained the default value for the new configuration value - for example, the generated temporary configuration contained:
  ```
  [database]
  sql_alchemy_conn=sqlite:///{AIRFLOW_HOME}/airflow.db
  ```
  even if the deprecated `core/sql_alchemy_conn` was set (and no new `database/sql_alchemy_conn` was set at the same time). This effectively left old installations that had not converted to the new "database" configuration not working for run_as_user, because the tasks run with "run_as_user" used the wrong, empty sqlite database instead of the one configured for Airflow.
  Also, while adding tests, it turned out that the mechanism was also not working as intended before - in case `_CMD` or `_SECRET` were used as environment variables rather than configuration. In those cases both _CMD and _SECRET should be evaluated during as_dict() evaluation, because the "run_as_user" might not have enough permissions to run the command or retrieve the secret. The _cmd and _secret variables were only evaluated during as_dict() when they were in the config file (note that this only happens when include_cmd, include_env, include_secret are set to True). The changes implemented in this PR fix both problems:
  * the _CMD and _SECRET env vars are evaluated during as_dict when the respective include_* is set
  * the defaults are only set for the values that have deprecations in case the deprecations have no values set in any of these ways:
    * in config file
    * in env variable
    * in _cmd (via config file or env variable)
    * in _secret (via config file or env variable)

  Fixes: #23679
* Automatically reschedule stalled queued tasks in CeleryExecutor (v2) (#23690)
  Celery can lose tasks on worker shutdown, causing airflow to just wait on them indefinitely (may be related to celery/celery#7266). This PR expands the "stalled tasks" functionality which is already in place for adopted tasks, and adds the ability to apply it to all tasks such that these lost/hung tasks can be automatically recovered and queued up again.
* Document fix for broken elasticsearch logs with 2.3.0+ upgrade (#23821)
  In certain upgrade paths, Airflow isn't given an opportunity to track the old `log_id_template`, so document the fix for folks who run into trouble.
* Add tool to automatically update status of AIP-47 issues. (#23745)
* Self upgrade when refreshing images (#23686)
  When you have two branches, you should self-upgrade breeze to make sure you use the version that is tied with your branch.
  Usually we have two active branches - main and the last released line, so switching between them is not unlikely for maintainers.
* Exclude missing tasks from the gantt view (#23627)
  * Exclude missing tasks from the gantt view
    Stops the gantt view from crashing if a task no longer exists in a DAG but there are TaskInstances for that task.
  * Fix tests
* Don't use the root logger in KPO _suppress function (#23835)
* Update Production Guide for Helm Chart docs (#23836)
  Explain that db initialization is not necessary if using the helm chart.
* Helm chart 1.6.0 is released; bump chart version to 1.7.0-dev (#23840)
* Add missing "airflow-constraints-reference" parameter (#23844)
  The build commands were missing the "airflow-constraints-reference" parameter and it always defaulted to constraints-main
* Better fix for constraint-reference (#23845)
  The previous fix (#23844) broke main on package verification as the package verification used the same parameter that was set to empty. This change removes some remnants from the "bash" version where we had to check if the variable was empty, and also makes the "constraint" parameters accept default values from the current branch so that they are used for build commands as well.
* Mask sensitive values for not-yet-running TIs (#23807)
  Alternative approach to #22754. Resolves #22738.
* Add limit for JPype1 (#23847)
  The JPype1 limit has to be introduced because otherwise the 1.4.0 JPype1 breaks our ARM builds. The 1.4.0 did not release the sdist version of the package. This made our cache refresh job fail as the 1.4.0 version cannot be installed on the ARM image. The issue is captured in https://github.com/jpype-project/jpype/issues/1069
* Add "no-issue-needed" rule directly in CONTRIBUTING.rst (#23802)
  The rule was not really explained directly where you'd expect it; it was hidden deeply in the "triage" process, which many contributors would not even get to.
  This PR adds an appropriate explanation and also explains that discussions are the preferred way to discuss things in Airflow rather than issues.
* Handler parameter from `JdbcOperator` to `JdbcHook.run` (#23817)
* Doc: Add column names for DB Migration Reference (#23853)
  Before the automation: https://airflow.apache.org/docs/apache-airflow/2.2.5/migrations-ref.html
  Currently (with missing column names): https://airflow.apache.org/docs/apache-airflow/2.3.0/migrations-ref.html
* Fix exception trying to display moved table warnings (#23837)
  If you still have an old dangling table from the 2.2 migration this would fail. Make it more resilient and cope with both styles of moved table name
* Update sample dag and doc for RDS (#23651)
* Fix DataprocJobBaseOperator not being compatible with dotted names (#23439). (#23791)
  * job_name parameter is now sanitized, replacing dots by underscores.
* Upgrade `pip` to 22.1.1 version (just released) (#23854)
* Add better feedback to Breeze users about expected action timing (#23827)
  There are a few actions in Breeze that might take more or less time when invoked. This is mostly when you need to upgrade Breeze or update to the latest version of the image because some dependencies were added or the image was modified. While we have significantly improved the waiting time involved (and caching problems have been fixed to make it as fast as possible), there are still a few situations where you need good connectivity and a little time to run the upgrade. That is often not something you would like to lose your time on when you need to do things fast. Usually Breeze does not force the user to perform such long actions - it allows the user to continue without doing them (either by timeout or by letting the user answer "no" to the question asked).
  Previously Breeze did not inform the user about the expected time of running such an operation, but with this change it tells what the expected delay is - thus allowing the user to make an informed decision on whether they want to run the upgrade or not.
* Fix UnboundLocalError when sql is empty list in DbApiHook (#23816)
* Fix UnboundLocalError when sql is empty list in DatabricksSqlHook (#23815)
* Add number of node params only for single-node cluster in RedshiftCreateClusterOperator (#23839)
* Sql to gcs with exclude columns (#23695)
* Add support for associating custom tags to job runs submitted via EmrContainerOperator (#23769)
  Co-authored-by: Sandeep Kadyan <sandeep.kadyan@publicissapient.com>
* Add Deferrable Databricks operators (#19736)
* Fix Amazon EKS example DAG raises warning during Imports (#23849)
  Co-authored-by: eladkal <45845474+eladkal@users.noreply.github.com>
* Fix databricks tests (#23856)
* Add __wrapped__ property to _TaskDecorator (#23830)
  Co-authored-by: Sanjay Pillai <sanjaypillai11 [at] gmail.com>
* Highlight task states by hovering on legend row (#23678)
  * Rework the legend row and add the hover effect.
  * Move horevedTaskState to state and fix merge conflicts.
  * Add tests.
  * Order of item in the LegendRow, add no_status support
* Clean up f-strings in logging calls (#23597)
* update K8S-KIND to 0.14.0 (#23859)
* Replaced all days_ago functions with datetime functions (#23237)
  Co-authored-by: Dev232001 <thedevhooda@gmail.com>
* Add clear DagRun endpoint. (#23451)
* Ignore the DeprecationWarning in test_days_ago (#23875)
  Co-authored-by: alexkru <alexkru@wix.com>
* Speed up Breeze experience on Mac OS (#23866)
  This change should significantly speed up the Breeze experience (and especially iterating over a change in Breeze) for MacOS users - regardless of whether you are using x86 or arm architecture. The problem with docker on MacOS is the particularly slow filesystem used to map sources from Host to Docker VM.
  It is particularly bad when there are multiple small files involved. The improvement comes from two areas:
  * removing duplicate pycache cleaning
  * moving MyPy cache to docker volume

  When entering breeze we are - just in case - cleaning .pyc and __pycache__ files potentially generated outside of the docker container - this is particularly useful if you use a local IDE and you do not have bytecode generation disabled (we have it disabled in Breeze). Generating python bytecode might lead to various problems when you are switching branches and Python versions, so for Breeze development, where the files change often anyway, disabling it and removing the files when they are found is important. This happens when entering breeze and it might take a second or two depending on whether you have any generated locally. It could happen that the __init script was called twice (depending on which script was called) - therefore the time could be double what was actually needed. Also if you ever generated provider packages, the time could be much longer, because node_modules generated in provider sources were not excluded from searching (and on MacOS it takes a LOT of time). This also led to duplicated time on exit, as the initialization code installed traps that were also run twice. The traps however were rather fast so had no negative influence on performance. The change adds a guard so that initialization is only ever executed once. The second part of the change is moving the mypy cache to a docker volume rather than using it from the local source folder (the default when complete sources are mounted). We were already using selective mounts to make sure MacOS filesystem slowness affects us in a minimal way - but with this change, the cache will be stored in a docker volume that does not suffer from the same problems as mounting volumes from the host.
  The Docker volume is preserved until the `docker stop` command is run - which means that iterating over a change should be WAY faster now - the observed speed-up was around 5x for the MyPy pre-commit.
* Add default task retry delay config (#23861)
* Move MappedOperator tests to mirror code location (#23884)
  At some point during the development of AIP-42 we moved the code for MappedOperator out of baseoperator.py to mappedoperator.py, but we didn't move the tests at the same time
* Enable clicking on DAG owner in autocomplete dropdown (#23804)
  PR#18991 introduced directly navigating to a DAG when selecting one from the typeahead search results. Unfortunately, the search results also include DAG owner names, and selecting one of those navigates to a DAG with that name, which almost certainly doesn't exist. This extends the autocompletion endpoint to return the type of result, and adjusts the typeahead selection to use this to know which way to navigate.
* Document LocalKubernetesExecutor support in chart (#23876)
* Avoid extra questions in `breeze build image` command. (#23898)
  Fixes: #23867
* Update INTHEWILD.md (#23892)
* Split contributor's quick start into separate guides. (#23762)
  The foldable parts were not good. They made links not work, and they were not very discoverable. Fixes: #23174
* Avoid printing exception when exiting tests command (#23897)
  Fixes: #23868
* Move string arg evals to `execute()` in `EksCreateClusterOperator` (#23877)
  Currently there are string-value evaluations of `compute`, `nodegroup_role_arn`, and `fargate_pod_execution_role_arn` args in the constructor of `EksCreateClusterOperator`. These args are all listed as template fields so it's entirely possible that the value(s) passed in to the operator is a Jinja expression or an `XComArg`.
  Either of these value types could cause a false-negative `ValueError` (in the case of unsupported `compute` values) or a `false-positive` (in the cases of explicit checks for the *arn values) since the values themselves have not been rendered. This PR moves the evaluations of these args to the `execute()` scope.
* Update .readthedocs.yml (#23903)
  String instead of Int see https://docs.readthedocs.io/en/stable/config-file/v2.html
* Make --file command in static-checks autocomplete file name (#23896)
  The --verbose and --dry-run commands caused the --files command to fail, and the flag was "artificial" - it was equivalent to a bool flag; the actual files were taken from arguments. This PR fixes it by turning the arguments into multiple ``--file`` commands - each with its own completion for local files.
* Chart: Update default airflow version to `2.3.1` (#23913)
* Fix Breeze documentation typo (#23919)
* Update environments documentation links (#23920)
* `2.3.1` has been released (#23912)
* Make CI and PROD image builds consistent (#23841)
  Simple refactoring to make the jobs more consistent.
* Alphabetizes two tables (#23923)
  The rest of the page has consistently alphabetized tables. This commit fixes three `extras` that were not alphabetized.
* Use "remote" pod when patching KPO pod as "checked" (#23676)
  When patching as "checked", we have to use the current version of the pod otherwise we may get an error when trying to patch it, e.g.:
  ```
  Operation cannot be fulfilled on pods \"test-kubernetes-pod-db9eedb7885c40099dd40cd4edc62415\": the object has been modified; please apply your changes to the latest version and try again"
  ```
  This error would not cause a failure of the task, since errors in `cleanup` are suppressed. However, it would fail to patch. I believe one scenario when the pod may be updated is when retrieving xcom, since the sidecar is terminated after extracting the value.
Concerning some changes in the tests re the "already_checked" label, it was added to a few "expected pods" recently, when we changed it to patch even in the case of a successful pod. Since we are changing the "patch" code to patch with the latest read on the pod that we have (i.e. using the `remote_pod` variable), and no longer the pod object stored on `k.pod`, the label no longer shows up in those tests (that's because in k.pod isn't actually a read of the remote pod, but just happens to get mutated in the patch function before it is used to actually patch the pod). Further, since the `remote_pod` is a local variable, we can't observe it in tests. So we have to read the pod using k8s api. _But_, our "find pod" function excludes "already checked" pods! So we have to make this configurable. So, now we have a proper integration test for the "already_checked" behavior (there was already a unit test). * Clarify manual merging of PR in release doc (#23928) It was not clear to me what this really means * Fix broken main (#23940) main breaks with `Traceback: /usr/local/lib/python3.7/importlib/__init__.py:127: in import_module return _bootstrap._gcd_import(name[level:], package, level) tests/providers/amazon/aws/hooks/test_cloud_formation.py:31: in <module> class TestCloudFormationHook(unittest.TestCase): tests/providers/amazon/aws/hooks/test_cloud_formation.py:67: in TestCloudFormationHook @mock_cloudformation /usr/local/lib/python3.7/site-packages/moto/__init__.py:30: in f module = importlib.import_module(module_name, "moto") /usr/local/lib/python3.7/importlib/__init__.py:127: in import_module return _bootstrap._gcd_import(name[level:], package, level) /usr/local/lib/python3.7/site-packages/moto/cloudformation/__init__.py:1: in <module> from .models import cloudformation_backends /usr/local/lib/python3.7/site-packages/moto/cloudformation/models.py:18: in <module> from .parsing import ResourceMap, OutputMap 
/usr/local/lib/python3.7/site-packages/moto/cloudformation/parsing.py:17: in <module> from moto.apigateway import models # noqa # pylint: disable=all /usr/local/lib/python3.7/site-packages/moto/apigateway/__init__.py:1: in <module> from .models import apigateway_backends /usr/local/lib/python3.7/site-packages/moto/apigateway/models.py:9: in <module> from openapi_spec_validator import validate_spec E ModuleNotFoundError: No module named 'openapi_spec_validator' `
  The fix is already in place in moto https://github.com/spulec/moto/pull/5165 but version 3.1.11 wasn't released yet
* Update INSTALL_PROVIDERS_FROM_SOURCES instructions. (#23938)
* Add typing to Azure Cosmos Client Hook (#23941)
  New release of Azure Cosmos library has added typing information and it broke main builds with mypy verification.
* Remove redundant register exit signals in `dag-processor` command (#23886)
* Disable rebase workflow (#23943)
  The change of the release workflow in #23928 removed the reason why we should have the rebase workflow possible. We only needed to rebase when we merged the test branch into the stable branch, and since we are doing it manually, there is no more reason to have it in the GitHub UI.
* Prevent UI from crashing if grid task instances are null (#23939)
  * UI fix for null task instances
  * improve tests without global vars
  * fix test data
* Grid fix details button truncated and small UI tweaks (#23934)
  * Show details button and wrap on LegendRow.
  * Update following brent review
  * Fix display on small width
  * Rotate icon for a 'ReadLess' effect
* Fix and speed up grid view (#23947)
  This fetches all TIs for a given task across dag runs, leading to significantly faster response times. It also fixes a bug where Nones were being passed to the UI when a new task was added to a DAG with existing runs.
* Removes duplicate code block (#23952)
  There are two code blocks with identical text in the helm-chart docs. This commit removes one of them.
* Update dep for databricks #23917 (#23927) * Use '--subdir' argument value for standalong dag processor. (#23864) * Revert "Add limit for JPype1 (#23847)" (#23953) This turned out to be mistake in manual submission. Fixed on JPype1 side. This reverts commit 3699be49b24ef5a0a8d8de81a149af2c5a7dc206. * Faster grid view (#23951) * Disallow calling expand with no arguments (#23463) * [FEATURE] KPO use K8S hook (#22086) * Add cascade to `dag_tag` to `dag` foreignkey (#23444) Bulk delete does not work if the cascade behaviour of a foreignkey is set on python side(relationship configuration). To allow bulk delete of dags we need to setup cascade deletion in the DB. The warning on query.delete at https://docs.sqlalchemy.org/en/14/orm/session_basics.html#selecting-a-synchronization-strategy stated that: The operations do not offer in-Python cascading of relationships - it is assumed that ON UPDATE CASCADE and/or ON DELETE CASCADE is configured for any foreign key references which require it, otherwise the database may emit an integrity violation if foreign key references are being enforced. Another alternative is avoiding bulk delete of dags but I prefer we support bulk deletes. This will break offline sql generation for mssql(already broken before now :) ). Also, since there's only one foreign key in `dag_tag` table, I assume that the foreign key would be named `dag_tag_ibfk_1` in `mysql`. This avoided having to query the db for the name. The foreignkey is explicitly named now, would be easy for future upgrades * DagFileProcessorManager: Start a new process group only if current process not a session leader (#23872) * Introduce `flake8-implicit-str-concat` plugin to static checks (#23873) * Fix UnboundLocalError when sql is empty list in ExasolHook (#23812) * Fix inverted section levels in best-practices.rst (#23968) This PR fixes inverted levels in the sections added to the "Best Practices" document in #21879. 
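For the `dag_tag` cascade entry (#23444) above, here is a minimal sketch of why the cascade has to live in the database for bulk deletes to work. It uses the standard-library `sqlite3` module as a stand-in rather than Airflow's real models or migration; the two-column schema, the sample rows, and the constraint name `dag_tag_dag_id_fkey` are illustrative assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite only enforces foreign keys (and their cascades) when this is enabled.
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript(
    """
    CREATE TABLE dag (dag_id TEXT PRIMARY KEY);
    CREATE TABLE dag_tag (
        name TEXT NOT NULL,
        dag_id TEXT NOT NULL,
        PRIMARY KEY (name, dag_id),
        -- Hypothetical explicit constraint name with a database-side cascade.
        CONSTRAINT dag_tag_dag_id_fkey
            FOREIGN KEY (dag_id) REFERENCES dag (dag_id) ON DELETE CASCADE
    );
    INSERT INTO dag VALUES ('a'), ('b'), ('c');
    INSERT INTO dag_tag VALUES ('etl', 'a'), ('etl', 'b'), ('ml', 'c');
    """
)

# Bulk delete: one statement, with no per-row cascading done in Python.
conn.execute("DELETE FROM dag WHERE dag_id IN ('a', 'b')")

# The database removed the dependent tag rows itself.
remaining = conn.execute("SELECT name, dag_id FROM dag_tag").fetchall()
print(remaining)  # [('ml', 'c')]
```

Without `ON DELETE CASCADE` on the constraint, the same `DELETE` would fail with an integrity error while foreign keys are enforced, which is the situation the quoted SQLAlchemy warning describes.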
* Add support to specify language name in PapermillOperator (#23916) * Add support to specify language name in PapermillOperator * Replace getattr() with simple attribute access * [23945] Icons in grid view for different dag types (#23970) * Helm logo no longer a link (#23977) * Fix links in documentation (#23975) * fix links * added right link to breeze * Add TaskInstance State 'REMOVED' to finished states and success states (#23797) Now that we support dynamic task mapping, we should have the 'REMOVED' state of task instances as a finished state because for dynamic tasks with a removed task instance, the dagrun would be stuck in running state if 'REMOVED' state is not in finished states. * Remove `xcom_push` from `DockerOperator` (#23981) * Fix missing shorthand for docker buildx rm -f (#23984) Latest version of buildx removed -f as shorthand for --force flag. * use explicit --mount with types of mounts rather than --volume flags (#23982) The --volume flag is an old style of specifying mounts used by docker, the newer and more explicit version is --mount where you have to specify type, source, destination in the form of key/value pairs. This is more explicit and avoids some guesswork when volumes are mounted (for example seems that on WSL2 volume name might be guessed as path wrongly). The change explicitly specifies which of the mounts are bind mounts and which are volume mounts. Another nice side effect of this change is that when source is missing, docker will not automatically create directories with the missing name but it will fail. This is nicer because before it led to creating directories when they were missing (for example .bash_aliases and similar). This allows us to avoid some cleanups to account for those files being created - instead we simply skip those mounts if the file/folder does not exist. 
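As a rough illustration of the `--mount` change (#23982) above, the sketch below builds explicit key/value mount arguments and skips bind mounts whose source is missing. The helper name `mount_flags`, the container paths, and the volume name are hypothetical and are not taken from Breeze; only the `type=...,source=...,target=...` form of Docker's `--mount` flag is assumed.

```python
import os
import tempfile


def mount_flags(mounts):
    """Build `docker run` arguments using explicit --mount key/value pairs.

    `mounts` is a list of (kind, source, target) tuples, where kind is
    "bind" (a host path) or "volume" (a named Docker volume).
    """
    flags = []
    for kind, source, target in mounts:
        # With --mount, Docker fails on a missing bind source instead of
        # creating it, so missing host paths are skipped up front.
        if kind == "bind" and not os.path.exists(source):
            continue
        flags += ["--mount", f"type={kind},source={source},target={target}"]
    return flags


with tempfile.TemporaryDirectory() as sources:
    flags = mount_flags(
        [
            ("bind", sources, "/opt/airflow"),
            # This host file does not exist, so the mount is skipped.
            ("bind", os.path.join(sources, ".bash_aliases"), "/root/.bash_aliases"),
            ("volume", "mypy-cache-volume", "/opt/airflow/.mypy_cache"),
        ]
    )
    print(" ".join(flags))
```

Stating the type of each mount removes the guesswork that `--volume` needs when deciding whether a value is a host path or a volume name.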
* Force colors in yarn test output in CI (#23986) * Fix breeze failures when there is no buildx installed on Mac (#23988) If you have no buildx plugin installed on Mac (for example when you use colima instead of Docker Desktop) the breeze check was failing - but buildx in fact is not needed to run typical breeze commands, and breeze already has support for it - it was just wrongly handled. * Replace generation of docker volumes to be done from python (#23985) The pre-commit to generate docker volumes in docker compose file is now written in Python and it also uses the newer "volume:" syntax to define the volumes mounted in the docker-compose. * Replace `use_task_execution_date` with `use_task_logical_date` (#23983) * Replace `use_task_execution_date` with `use_task_logical_date` We have some operators/sensors that use `*_execution_date` as the class parameters. This PR deprecate the usage of these parameters and replace it with `logical_date`. There is no change in functionality, under the hood the functionality already uses `logical_date` this is just about the parameters name as exposed to the users. * Remove pinning for xmltodict (#23992) We have now moto 3.1.9+ in constraints so we should remove the limit. Fixes: #23576 * Remove fixing cncf.kubernetes provider when generating constraints (#23994) When we yanked cncf.kubernetes provider, we pinned 3.1.2 temporarily for provider generation. This removes the pinning as we are already at 4.0.2 version * Add better diagnostics capabilities for pre-commits run via CI image (#23980) The pre-commits that require CI image run docker command under the hood that is highly optimized for performance (only mounts files that are necessary to be mounted) - in order to improve performance on Mac OS and make sure that artifacts are not left in the source code of Airflow. 
However that makes the commands slightly more difficult to debug because they dynamically generate the docker command used, including the volumes that should be mounted when the docker command is run. This PR adds better diagnostics to the pre-commit scripts, allowing VERBOSE="true" and DRY_RUN="true" variables that can help with diagnosing problems such as running the scripts on WSL2. It also fixes a few documentation bugs that were missed after changing the names of the image-related static checks, and thanks to separating the common code into a utility function it allows setting the SKIP_IMAGE_PRE_COMMITS variable to true, which will skip running all pre-commit checks that require the breeze image to be available locally.
* Disable fail-fast on pushing images to docker cache (#24005)
  There is an issue with pushing cache to the docker registry that is connected to a containerd bug but started to appear more frequently recently (as evidenced for example by https://github.community/t/buildx-failed-with-error-cannot-reuse-body-request-must-be-retried/253178 ). The issue is still open in containerd: https://github.com/containerd/containerd/issues/5978. Until it is fixed, we disable fail-fast on pushing cache so that even if it happens, we just have to re-run the single python version that actually failed. Currently there is a much lower chance of success because all 4 builds have to succeed.
* Add automated retries on retryable condition for building images in CI (#24006)
  There is flakiness in pushing cache images to ghcr.io, therefore we want to add automated retries when the images fail intermittently. The root cause of the problem is tracked in containerd: https://github.com/containerd/containerd/issues/5978
* Ensure @contextmanager decorates generator func (#23103)
* Revert "Add automated retries on retryable condition for building images in CI (#24006)" (#24016)
  This reverts commit 7cf0e43b70eb1c57a90ee7e2ff14b03487ffb018.
* Cleanup `BranchDayOfWeekOperator` example dag (#24007)
  * Cleanup BranchDayOfWeekOperator example dag
    There is no need for `dag=dag` when using context manager.
* Added missing project_id to the wait_for_job (#24020)
* Only run separate per-platform build when preparing build cache (#24023)
  Apparently pushing multi-platform images when building cache on CI has had some problems recently, connected with ghcr.io being more vulnerable to the race condition described in this issue: https://github.com/containerd/containerd/issues/5978
  Apparently when layers for two different platforms are pushed at about the same time to ghcr.io, the error "cannot reuse body, request must be retried" is generated. However we actually do not even need to build the multiplatform latest images because as of recently we have a separate cache for each platform, and the ghcr.io/:latest images are not used any more, not even for docker builds. We always build images rather than pull and we use --from-cache for that - specific per platform. The only image pulling we do is when we pull the :COMMIT_HASH images in CI - but those are single-platform images (amd64) and even if we add tests for arm, they will have a different tag. Hopefully we can still build release images without causing the race condition too frequently - this is more likely because when we build images for cache we use machines with different performance characteristics and the same layers are pushed at different times from different platforms.
* Preparing buildx cache is allowed without --push-image flag (#24028)
  The previous version of buildx cache preparation implied the --push-image flag, but now this is completely separated (we do not push the image, we just prepare cache), so when multi-platform buildx preparation is run we should also allow the cache to run without the --push-image flag.
* Add partition related methods to GlueCatalogHook: (#23857)
  * "get_partition" to retrieve a Partition
  * "create_partition" to create a Partition
* Adds foldable CI group for command output (#24026)
* Add foldable groups in CI outputs in commands that need it (#24035)
  This is a follow-up after #24026, which added the capability of selectively deciding for each breeze command whether the output of the command should be a "foldable" group. All CI output has been reviewed, and the commands which "need" it were identified. This also fixes a problem introduced there - that the command itself was not a "foldable" group.
* Increase size of ARM build instance (#24036)
  Our ARM cache builds started to hang recently at the yarn prod step. The most likely reason is the limited resources we had for the ARM instance to run the docker build - it was a rather small instance with 2GB RAM and it is likely not nearly enough to cope with recent changes related to Grid View, where we likely need much more memory during the yarn build step. This change increases the instance memory to 8 GB (c6g.xlarge). Also this instance type gives 70% cost saving and has a very low probability of being evicted (it's not in high demand in the Ohio Region of AWS). Also the AMI used is refreshed with the latest software (docker)
* Remove unused [github_enterprise] from ref docs (#24033)
* Add enum validation for [webserver]analytics_tool (#24032)
* Support impersonation service account parameter for Dataflow runner (#23961)
* Fix closing connection dbapi.get_pandas_df (#23452)
* Light Refactor and Clean-up AWS Provider (#23907)
* Removing magic numbers from exceptions (#23997)
  * Removing magic numbers from exceptions
  * Running pre-commit
* Upgrade to pip 22.1.2 (#24043)
  Pip has been upgraded to version 22.1.2 12 minutes ago. Time to catch up.
* Shaves-off about 3 minutes from usage of ARM instances on CI (#24052) Preparing airflow packages and provider packages does not need to be done on ARM and actually the ARM instance is idle while they are prepared during cache building. This change moves preparation of the packages to before the ARM instance is started which saves about 3 minutes of ARM instance time. * SSL Bucket, Light Logic Refactor and Docstring Update for Alibaba Provider (#23891) * Use KubernetesHook to create api client in KubernetesPodOperator (#20578) Add support for k8s hook in KPO; use it always (even when no conn id); continue to consider the core k8s settings that KPO already takes into account but emit deprecation warning about them. KPO historically takes into account a few settings from core airflow cfg (e.g. verify ssl, tcp keepalive, context, config file, and in_cluster). So to use the hook to generate the client, somehow the hook has to take these settings into account. But we don't want the hook to consider these settings in general. So we read them in KPO and if necessary patch the hook and warn. * Re-add --force-build flag (#24061) After #24052 we also need to add --force-build flag as for Python 3.7 rebuilding CI cache would have been silently ignored as no image building would be needed * Fix grid view for mapped tasks (#24059) * Fix StatD timing metric units (#21106) Co-authored-by: Tzu-ping Chung <uranusjr@gmail.com> Co-authored-by: Tzu-ping Chung <tp@astronomer.io> * Drop Python 3.6 compatibility objects/modules (#24048) * Remove hack from BigQuery DTS hook (#23887) * Spanner assets & system tests migration (AIP-47) (#23957) * Run the `check_migration` loop at least once (#24068) This is broken since 2.3.0. that's if a user specifies a migration_timeout of 0 then no migration is run at all. * Bump eventsource from 1.0.7 to 1.1.1 in /airflow/ui (#24062) Bumps [eventsource](https://github.com/EventSource/eventsource) from 1.0.7 to 1.1.1. 
  - [Release notes](https://github.com/EventSource/eventsource/releases)
  - [Changelog](https://github.com/EventSource/eventsource/blob/master/HISTORY.md)
  - [Commits](https://github.com/EventSource/eventsource/compare/v1.0.7...v1.1.1)
  ---
  updated-dependencies:
  - dependency-name: eventsource
    dependency-type: indirect
  ...
  Signed-off-by: dependabot[bot] <support@github.com>
  Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
* Remove certifi limitations from eager upgrade limits (#23995)
  The certifi limitation was introduced to keep snowflake happy while performing eager upgrade because it added limits on certifi. However, it seems this is no longer a limitation in the latest versions of the snowflake python connector, so we can safely remove it from here. The only remaining limit is dill, but this one still holds.
* fix style of example block (#24078)
* Handle occasional deadlocks in trigger with retries (#24071)
  Fixes: #23639
* Adds Pura Scents, edits The Dyrt (#24086)
* Migrate Yandex example DAGs to new design AIP-47 (#24082)
  closes: #22470
* set color to operators in cloud_sql.py (#24000)
* Migrate HTTP example DAGs to new design AIP-47 (#23991)
  closes: #22448 , #22431
* Make expand() error vague so it's not misleading (#24018)
* Use github for postgres chart index (#24089)
  Bitnami's CloudFront CDN is seemingly having issues, so point at github direct instead until it is resolved.
* Fix the link to google workplace (#24080)
* Bring MappedOperator members in sync with BaseOperator (#24034)
* Add note about Docker volume remount issues in WSL 2 (#24094)
* Convert Athena Sample DAG to System Test (#24058)
* Self-update pre-commit to latest versions (#24106)
* Temporarily fix bitnami index problem (#24112)
  We started to experience "Internal Error" when installing the Helm chart, and apparently bitnami "solved" the problem by removing from their index software older than 6 months(!). This makes our CI fail, but it is much worse.
  This renders all our charts useless for people to install. This is terribly wrong, and I raised this in the issue here: https://github.com/bitnami/charts/issues/10539#issuecomment-1144869092
* Fix small typos in static code checks doc (#24113)
  - Trivial typo fix in the command to run static checks on the last commit
  - Update "run all tests" to "run all checks" where applicable for consistency
* Really workaround bitnami chart problem (#24115)
  The original fix in #24112 did not work due to:
  * not updated lock
  * EOL characters at the end of multiline long URL

  This PR fixes it.
* Reduce grid view API calls (#24083)
  * Reduce API calls from /grid
    - Separate /grid_data from /grid
    - Remove need for formatData
    - Increase default query stale time to prevent extra fetches
    - Fix useTask query keys
  * consolidate grid data functions
  * fix www tests test grid_data instead of /grid
* Removing magic status code numbers from api_connecxion (#24050)
* Do not support MSSQL less than v2017 in code (#24095)
  Our experimental support for MSSQL starts from v2017 (in README.md), but we still support 2000 & 2005 in code. This PR removes this support, allowing us to use mssql.DATETIME2 in all MSSQL DB.
* Rename Permissions to Permission Pairs. (#24065)
* Note that yarn dev needs webserver in debug mode (#24119)
  * Note that yarn dev needs webserver -d
  * Update CONTRIBUTING.rst
    Co-authored-by: Jed Cunningham <66968678+jedcunningham@users.noreply.github.com>
  * Use -D
  * Revert "Use -D"
    This reverts commit 94d63adcf36aac13f5d94c2d4cd651907d833794.
  Co-authored-by: Jed Cunningham <66968678+jedcunningham@users.noreply.github.com>
* fixing SSHHook bug when using allow_host_key_change param (#24116)
* Adds mssql volumes to "all" backends selection (#24123)
  The "stop" command of Breeze uses "all" backend to remove all volumes - but mssql has special approach where the volumes defined depend on the filesystem used and we need to add the specific docker-compose files to list of files used when we use stop command.
* Breeze must create `hooks\` and `dags\` directories for bind mounts (#24122)
  Now that breeze uses --mount instead of --volume (the former of which does not create missing mount dirs like the latter does, see docs here: https://docs.docker.com/storage/bind-mounts/#differences-between--v-and---mount-behavior) we need to create these directories explicitly.
* AIP-47 | Migrate Trino example DAGs to new design (#24118)

Co-authored-by: Josh Fell <48934154+josh-fell@users.noreply.github.com> Co-authored-by: Michael Peteuil <michael.peteuil@gmail.com> Co-authored-by: Vincent <97131062+vincbeck@users.noreply.github.com> Co-authored-by: Jed Cunningham <66968678+jedcunningham@users.noreply.github.com> Co-authored-by: pierrejeambrun <pierrejbrun@gmail.com> Co-authored-by: Brent Bovenzi <brent.bovenzi@gmail.com> Co-authored-by: Jarek Potiuk <jarek@potiuk.com> Co-authored-by: Edith Puclla <58795858+edithturn@users.noreply.github.com> Co-authored-by: nsAstro <102520074+nsAstro@users.noreply.github.com> Co-authored-by: ishiis <ishii.shunichi@gmail.com> Co-authored-by: Harpreet Singh <singhharpreet.chadha@gmail.com> Co-authored-by: eladkal <45845474+eladkal@users.noreply.github.com> Co-authored-by: rahulgoyal2987 <rahulgoyal338@gmail.com> Co-authored-by: raphaelauv <raphaelauv@users.noreply.github.com> Co-authored-by: mhenc <mhenc@google.com> Co-authored-by: Jakub Novák <kubus.novak@gmail.com> Co-authored-by: muhua <microhuang@live.com> Co-authored-by: Ruben Laguna <ruben.laguna@gmail.com> Co-authored-by: humit 
<jhjang1005@naver.com> Co-authored-by: Daniel Standish <15932138+dstandish@users.noreply.github.com> Co-authored-by: Gabriel Machado <gabriel.ms1@hotmail.com> Co-authored-by: Kanthi <subkanthi@gmail.com> Co-authored-by: pankajastro <98807258+pankajastro@users.noreply.github.com> Co-authored-by: Sebastian Chamena <43488475+schattian@users.noreply.github.com> Co-authored-by: Ping Zhang <pingzh@umich.edu> Co-authored-by: ishiis <shunichi.ishii@smarthr.co.jp> Co-authored-by: Bartłomiej Hirsz <bartek.hirsz@gmail.com> Co-authored-by: akolar-db <72745279+akolar-db@users.noreply.github.com> Co-authored-by: Kamil Breguła <mik-laj@users.noreply.github.com> Co-authored-by: Karthikeyan Singaravelan <tir.karthi@gmail.com> Co-authored-by: Niko <onikolas@amazon.com> Co-authored-by: Wojciech Januszek <wjanuszek@sigma.ug.edu.pl> Co-authored-by: Wojciech Januszek <januszek@google.com> Co-authored-by: David Caron <dcaron05@gmail.com> Co-authored-by: Ross Lawley <ross.lawley@gmail.com> Co-authored-by: Charles Machalow <csm10495@gmail.com> Co-authored-by: Chris Redekop <32752154+repl-chris@users.noreply.github.com> Co-authored-by: John Bampton <jbampton@users.noreply.github.com> Co-authored-by: Ryan Hatter <25823361+RNHTTR@users.noreply.github.com> Co-authored-by: Kaxil Naik <kaxilnaik@gmail.com> Co-authored-by: Jian Yuan Lee <jianyuan@gmail.com> Co-authored-by: D. 
Ferruzzi <ferruzzi@amazon.com> Co-authored-by: Gonzalo Peci <pecigonzalo@users.noreply.github.com> Co-authored-by: Dmytro Kazanzhy <dkazanzhy@gmail.com> Co-authored-by: Ian Buss <ianbuss@users.noreply.github.com> Co-authored-by: Xiao Fu <xiao.xfu24@gmail.com> Co-authored-by: Joel Ossher <73489824+joelossher@users.noreply.github.com> Co-authored-by: Mike Kravtsov <61209278+mkravtsov-fetchrewards@users.noreply.github.com> Co-authored-by: Ash Berlin-Taylor <ash@apache.org> Co-authored-by: Guilherme Martins Crocetti <24530683+gmcrocetti@users.noreply.github.com> Co-authored-by: 서재권(Data Platform) <90180644+jaegwonseo@users.noreply.github.com> Co-authored-by: Sandeep <sandeep.kadyan@gmail.com> Co-authored-by: Sandeep Kadyan <sandeep.kadyan@publicissapient.com> Co-authored-by: Eugene Karimov <13220923+eskarimov@users.noreply.github.com> Co-authored-by: Vedant Bhamare <55763604+Dark-Knight11@users.noreply.github.com> Co-authored-by: sanjayp <sanjaypillai11@gmail.com> Co-authored-by: Tzu-ping Chung <tp@astronomer.io> Co-authored-by: Dev232001 <thedevhooda@gmail.com> Co-authored-by: Alex Kruchkov <36231027+alexkruc@users.noreply.github.com> Co-authored-by: alexkru <alexkru@wix.com> Co-authored-by: Sumit Maheshwari <msumit@users.noreply.github.com> Co-authored-by: Mark Norman Francis <norm@201created.com> Co-authored-by: Vincent Koc <koconder@users.noreply.github.com> Co-authored-by: Ephraim Anierobi <splendidzigy24@gmail.com> Co-authored-by: Igor Tavares <igorborgest@gmail.com> Co-authored-by: Marty Jackson <mfjackson2008@gmail.com> Co-authored-by: Andrey Anshin <Andrey.Anshin@taragol.is> Co-authored-by: Kengo Seki <sekikn@apache.org> Co-authored-by: John Green <nhojjohn@users.noreply.github.com> Co-authored-by: David Skoda <dskoda1@binghamton.edu> Co-authored-by: Łukasz Wyszomirski <wyszomirski@google.com> Co-authored-by: Hubert Pietroń <94397721+hubert-pietron@users.noreply.github.com> Co-authored-by: Bernardo Couto <35502483+bernardocouto@users.noreply.github.com> 
Co-authored-by: viktorvia <86823020+viktorvia@users.noreply.github.com> Co-authored-by: Tzu-ping Chung <uranusjr@gmail.com> Co-authored-by: henriqueribeiro <henriqueribeiro@users.noreply.github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: Chenglong Yan <alanx.yan@gmail.com> Co-authored-by: François de Metz <francois@2metz.fr> Co-authored-by: Paul Williams <pdw@udel.edu> Co-authored-by: James Timmins <james@astronomer.io> Co-authored-by: chethanuk-plutoflume <chethanuk@outlook.com>

ecerulm requested review from dstandish and jedcunningham as code owners May 6, 2022 10:11

boring-cyborg bot added the provider:cncf-kubernetes (Kubernetes provider related issues) and area:Scheduler (including HA (high availability) scheduler) labels May 6, 2022

ecerulm mentioned this pull request May 6, 2022

Fix "Reason: Expired: too old resource version: 379140622 (380367990)" #23504

Closed

ecerulm marked this pull request as draft May 6, 2022 12:18

ecerulm force-pushed the resource_too_old_2 branch from ae99c95 to 359864d Compare May 6, 2022 12:46

ecerulm mentioned this pull request May 6, 2022

Handle kubernetes watcher stream disconnection #15500

Closed

ecerulm added 2 commits May 7, 2022 08:23

Prevent KubernetesJobWatcher getting stuck on resource too old

98194d8

If the watch fails with a "resource too old" error, the KubernetesJobWatcher should not retry with the same resource version, as that ends up in a loop that makes no progress.

Reset ResourceVersion().resource_version to 0

ccf6cc2
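A minimal sketch of the retry behaviour these two commits describe, not the actual Airflow patch. It simulates the API server so it runs without a cluster: `GoneError`, `fake_watch`, `run_watcher`, and `OLDEST_RETAINED` are hypothetical names introduced only for illustration. The point it shows is that retrying with the same stale resource version can never succeed, while resetting to "0" on a 410 lets the watch resume.

```python
class GoneError(Exception):
    """Hypothetical stand-in for an API error that carries an HTTP status."""

    def __init__(self, status):
        super().__init__(f"HTTP {status}")
        self.status = status


# Simulated server state: versions below this have been forgotten.
OLDEST_RETAINED = 100


def fake_watch(resource_version):
    """Simulated watch stream: yields (version, event) pairs or raises 410."""
    rv = int(resource_version)
    if rv != 0 and rv < OLDEST_RETAINED:
        # The server no longer knows this version: "too old resource version".
        raise GoneError(410)
    start = max(rv, OLDEST_RETAINED)
    for version in range(start + 1, start + 4):
        yield str(version), f"event-{version}"


def run_watcher(initial_version, max_attempts=5):
    """Watch with retries, resetting the resource version to "0" on a 410."""
    resource_version = initial_version
    seen = []
    for attempt in range(max_attempts):
        try:
            for version, event in fake_watch(resource_version):
                seen.append(event)
                resource_version = version  # remember progress
            return resource_version, seen, attempt + 1
        except GoneError as err:
            if err.status != 410:
                raise
            # Retrying with the same version would fail forever, so start over.
            resource_version = "0"
    raise RuntimeError("watch never recovered")


final_version, events, attempts = run_watcher("42")
print(final_version, attempts, events)
```

Starting from the stale version "42", the first attempt fails with the simulated 410 and the second attempt, started from "0", receives events again. Without the reset in the `except` branch, every attempt would raise the same error.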

ecerulm force-pushed the resource_too_old_2 branch from f7762ec to ccf6cc2 Compare May 7, 2022 07:12

ecerulm marked this pull request as ready for review May 7, 2022 10:11

eladkal added this to the Airflow 2.3.1 milestone May 7, 2022

This was referenced May 9, 2022

WIP - Enable kubernetes watch bookmarks. Handle 410 error in kubernetes_executor #23578

Closed

KubernetesJobWatcher failing on HTTP 410 errors, jobs stuck in scheduled state #21087

Closed

potiuk approved these changes May 11, 2022

potiuk merged commit dee05b2 into apache:main May 11, 2022

ecerulm deleted the resource_too_old_2 branch May 11, 2022 07:48

ephraimbuddy added the type:bug-fix (Changelog: Bug Fixes) label May 17, 2022

ephraimbuddy mentioned this pull request May 21, 2022

Status of testing of Apache Airflow 2.3.1rc1 #23852

Closed

61 tasks

rsevilla87 mentioned this pull request May 25, 2022

Move all airflow references to 2.3.1 version cloud-bulldozer/airflow-kubernetes#196

Merged

eladkal mentioned this pull request Nov 17, 2022

Network instabilities are able to freeze KubernetesJobWatcher #12644

Closed

taeyoung94 mentioned this pull request Mar 3, 2023

[Question] how to update the airflow version in Self-managed Apache Airflow deployment for EKS. awslabs/data-on-eks#120

Closed

1 task

ecerulm commented May 6, 2022 (edited)

ephraimbuddy commented May 6, 2022

ncapeta commented May 7, 2022

potiuk commented May 7, 2022 (edited)

eladkal commented May 7, 2022

potiuk commented May 7, 2022

ncapeta commented May 7, 2022

potiuk commented May 7, 2022

ncapeta commented May 11, 2022

potiuk commented May 11, 2022

ashb commented May 16, 2022

ecerulm commented May 16, 2022
