Change AirflowTaskTimeout to inherit BaseException #35653
I do not think #35474 is fixed by this. We STILL need to handle the case where low-level C code does not handle SIGALRM (cc: @Taragolis). This PR only fixes the case where someone deliberately catches all exceptions and ignores them, which I believe is a different issue from the one #35474 describes.
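A minimal, Airflow-free sketch of the "deliberately caught all exceptions" case. The two classes are hypothetical stand-ins for `AirflowTaskTimeout` before and after this change, and the hypothetical `careless_task` mimics user code with a blanket `except Exception`. Only the `BaseException`-derived timeout escapes that handler.

```python
class TimeoutAsException(Exception):
    """Hypothetical stand-in for AirflowTaskTimeout deriving from Exception (before the change)."""


class TimeoutAsBaseException(BaseException):
    """Hypothetical stand-in for AirflowTaskTimeout deriving from BaseException (after the change)."""


def careless_task(timeout_cls: type) -> str:
    """Hypothetical task body that swallows every Exception it sees."""
    try:
        # Pretend the timeout is raised while the task body is running.
        raise timeout_cls("task timed out")
    except Exception:
        # Blanket handler written by a DAG author: silently ignores the error.
        return "swallowed"


def run(timeout_cls: type) -> str:
    """Hypothetical caller that expects to see the timeout."""
    try:
        return careless_task(timeout_cls)
    except TimeoutAsBaseException:
        return "propagated"


print(run(TimeoutAsException))      # swallowed
print(run(TimeoutAsBaseException))  # propagated
```

This mirrors how `KeyboardInterrupt` and `SystemExit` already work: code that genuinely must react to a timeout has to catch it explicitly or rely on `finally`.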
Hmm, SIGALRM wouldn't be injected in the middle of C-extension execution, but it would stay pending at least until the extension returns to Python execution, right? So it won't be totally ignored, just very much delayed? The same goes for subprocesses, I believe.
Technically speaking, it stays pending only until the C code checks whether a signal has arrived and handles it. A properly written long-running loop in C code executed from Python should frequently check whether a signal has arrived (Python lets the C code know that a signal was received, but it's up to the C code to notice and handle it). And yes, it will be very much delayed, which in the case of a timeout means it effectively never happens :). And yes, I fully agree those two should be separate issues and solved separately, so removing it from the commit message makes sense.
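A rough sketch of how a SIGALRM-based timeout reaches Python code, assuming a Unix platform and that it runs in the main thread (CPython only executes Python-level signal handlers there). `TaskTimeoutSketch` and `run_with_timeout` are hypothetical names, not Airflow's implementation. `time.sleep` is interrupted promptly because CPython checks for pending signals when the sleep is interrupted; a C extension busy in its own loop only lets the handler run once it calls `PyErr_CheckSignals()` or returns to the interpreter.

```python
import signal
import time


class TaskTimeoutSketch(BaseException):
    """Hypothetical timeout exception; inherits BaseException as this PR proposes."""


def _on_alarm(signum, frame):
    # CPython runs this in the main thread once the interpreter regains control
    # (between bytecodes, or when a blocking call such as time.sleep is interrupted).
    raise TaskTimeoutSketch(f"received signal {signum}")


def run_with_timeout(seconds: float, max_iterations: int = 100) -> str:
    """Run a sleep loop that swallows Exception, under a SIGALRM deadline (Unix only)."""
    previous = signal.signal(signal.SIGALRM, _on_alarm)
    signal.setitimer(signal.ITIMER_REAL, seconds)  # deliver SIGALRM after `seconds`
    try:
        for _ in range(max_iterations):
            try:
                time.sleep(0.05)
            except Exception:
                pass  # "user" code ignoring errors; does not catch the timeout
        return "finished"
    except TaskTimeoutSketch:
        return "timed out"
    finally:
        signal.setitimer(signal.ITIMER_REAL, 0)  # disarm the timer
        signal.signal(signal.SIGALRM, previous)  # restore the old handler
```

For example, `run_with_timeout(0.1)` stops after roughly 0.1 s with "timed out", while `run_with_timeout(10.0, max_iterations=2)` completes its two sleeps and returns "finished".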
And in general, we need to check the rest of the Airflow codebase (core) for places where a timeout would now slip past a handler like this:

```python
try:
    ...
except AirflowException:
    "Surprise! Surprise!"
```
For example, airflow/airflow/task/task_runner/standard_task_runner.py, lines 104 to 133 in 1e24a3c.
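One way to start that audit is a small `ast`-based scan for `except` clauses naming `Exception` or `AirflowException`, which a `BaseException`-derived timeout would now bypass. `find_suspect_handlers` and `SUSPECT_NAMES` are hypothetical, not an existing Airflow tool; the scan matches by name only, so aliased imports still need manual review, and each hit is a candidate to inspect rather than a confirmed bug.

```python
import ast

# Hypothetical list: handlers catching these names no longer see a BaseException-derived timeout.
SUSPECT_NAMES = {"Exception", "AirflowException"}


def find_suspect_handlers(source: str) -> list:
    """Return sorted (line, name) pairs for except clauses that catch a suspect name."""
    hits = []
    for node in ast.walk(ast.parse(source)):
        # Bare "except:" catches everything, including BaseException, so skip it.
        if not isinstance(node, ast.ExceptHandler) or node.type is None:
            continue
        # "except (A, B):" is stored as a Tuple; normalise to a list of expressions.
        exprs = node.type.elts if isinstance(node.type, ast.Tuple) else [node.type]
        for expr in exprs:
            # Accept both "Exception" (ast.Name) and "exceptions.AirflowException" (ast.Attribute).
            if isinstance(expr, ast.Name):
                name = expr.id
            elif isinstance(expr, ast.Attribute):
                name = expr.attr
            else:
                continue
            if name in SUSPECT_NAMES:
                hits.append((node.lineno, name))
    return sorted(hits)


SAMPLE = """\
try:
    run_task()
except AirflowException:
    handle_failure()
try:
    cleanup()
except (ValueError, exceptions.Exception):
    pass
except:
    raise
"""

print(find_suspect_handlers(SAMPLE))  # [(3, 'AirflowException'), (7, 'Exception')]
```

In practice you would feed it each file under the core package (for example via `pathlib.Path.rglob("*.py")`) and review the reported lines by hand.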
Good point. I would need a lot more help from someone with deeper knowledge of Airflow internals to understand and review the full scope of this. I found a few such issues in taskinstance.py and pushed fixes in the latest patchset. The question there is whether we should loosen the constraint on those types to just accept
Good point, yes.
This pull request has been automatically marked as stale because it has not had recent activity. It will be closed in 5 days if no further activity occurs. Thank you for your contributions. |
I read through the discussion and noticed this PR has gone a bit stale. Before merging, though, I would kindly request a rebase and the addition of a newsfragment (see https://github.com/apache/airflow/blob/main/CONTRIBUTING.rst#step-4-prepare-pr --> newsfragment). I would propose including this in a feature release rather than a patch release, meaning probably 2.9.0.
The latest CI failures look unrelated to this change; can someone please help rerun them? In the meantime, I'm really struggling to run the whole test suite locally, both with and without Breeze :(
main is broken for that one, and it's being fixed in #37559.
Running the whole suite takes a lot of time, but you're usually not expected to do so: normally you run just the related tests, and CI is there to catch other issues; in most cases you will just see the tests that failed. Generally, it should work with Breeze (once you rebuild the image). What kind of struggles do you have? It's not really possible to help with unknown struggles.
@hterik, before I do a full test, do you have any recommendation on how to test this? Did you use a specific example DAG and timeout?
I tested this PR and really like it. My manual tests with LocalExecutor, SequentialExecutor, CeleryExecutor and KubernetesExecutor (in KinD) worked as expected, using a manual timeout + sleep based on an example DAG.
I'd propose (as it is a "feature" change) to make this part of 2.9.0, not a patch to any 2.8.x line.
Thanks @jscheffl, can you please retrigger the failing CI actions? I've now run all the unit tests locally, and the tests that fail in CI pass locally.
No. You need to rebase for that, and you can do it yourself.
Code that normally catches Exception should not implicitly ignore interrupts from AirflowTaskTimeout. Fixes apache#35644 apache#35474
I clicked the "Rebase" button, but in the future it's always faster to do it yourself rather than wait for someone, @hterik.
(In this case it would have saved you 2 hours.)
So, GREEN! Now just a second reviewer is needed here; from my side this looks good.
Nice
Code that normally catches Exception should not implicitly ignore interrupts from AirflowTaskTimeout.
Fixes: #35644