Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[AIRFLOW-2921][AIRFLOW-2922] Fix two potential bugs in CeleryExecutor() #3773

Merged
merged 1 commit into from
Aug 22, 2018

Conversation

XD-DENG
Copy link
Member

@XD-DENG XD-DENG commented Aug 20, 2018

Make sure you have checked all steps below.

Jira

Description

  • Here are some details about my PR, including screenshots of any UI changes:

Potential bug-1

If a task state becomes either SUCCESS or FAILURE or REVOKED, it will be removed from self.tasks and self.last_state. However, because line 108 is not indented properly, this task will be added back to self.last_state again.

This will not lead to any error. But may still be worth changing.

Solution: indent line 108.

Potential bug-2

Celery Task states normally change in ways like “PENDING -> STARTED -> SUCCESS/FAILURE” (http://docs.celeryproject.org/en/latest/reference/celery.states.html).

In lines 107 and 108, it’s task.state rather than state, i.e. it will reflect the latest real-time state of the Celery task.

Let’s imagine: task state becomes STARTED first (initial state is PENDING), then the if-elif-else block will be triggered (line 93). It’s possible the Celery task state becomes SUCCESS when whichever line between 94-105 is running. At line 108, the latest state of the task will be changed to SUCCESS rather than STARTED in self.last_state because it’s referring to the latest state task.state rather than variable state.

Then this task will be dead-locked because the if-elif-else block will never be triggered for it again. It has no chance to be ended properly with methods self.success() or self.fail(). Although the chance that this happens is quite low (the time window is very short), the chance is not zero.

Solution: change task.state to state in lines 107 and 108.

Tests

  • My PR adds the following unit tests OR does not need testing for this extremely good reason:

Commits

  • My commits all reference Jira issues in their subject lines, and I have squashed multiple commits if they address the same issue. In addition, my commits follow the guidelines from "How to write a good git commit message":
    1. Subject is separated from body by a blank line
    2. Subject is limited to 50 characters (not including Jira issue reference)
    3. Subject does not end with a period
    4. Subject uses the imperative mood ("add", not "adding")
    5. Body wraps at 72 characters
    6. Body explains "what" and "why", not "how"

Documentation

  • In case of new functionality, my PR adds documentation that describes how to use it.
    • When adding new operators/hooks/sensors, the autoclass documentation generation needs to be added.

Code Quality

  • Passes git diff upstream/master -u -- "*.py" | flake8 --diff

@codecov-io
Copy link

codecov-io commented Aug 20, 2018

Codecov Report

Merging #3773 into master will decrease coverage by <.01%.
The diff coverage is 0%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #3773      +/-   ##
==========================================
- Coverage   77.65%   77.64%   -0.01%     
==========================================
  Files         204      204              
  Lines       15841    15841              
==========================================
- Hits        12301    12300       -1     
- Misses       3540     3541       +1
Impacted Files Coverage Δ
airflow/executors/celery_executor.py 83.6% <0%> (-1.64%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 7142ae0...e62f9b1. Read the comment docs.

@ashb
Copy link
Member

ashb commented Aug 20, 2018

Will need tests to cover this please.

@XD-DENG
Copy link
Member Author

XD-DENG commented Aug 20, 2018

Hi @ashb , sure I will. May I confirm whether the logic/reasoning is making sense to you?

Bug-1:
if a task state becomes either SUCCESS or FAILURE or REVOKED,
it will be removed from self.tasks() and self.last_state().
However, because line 108 is not indented properly,
this task will be added back to self.last_state() again.

Bug-2:
When the state is updated, it's referring to the latest
state `task.state` rather than variable `state`.
This may result in dead-lock if the state changed from
`STARTED` to `SUCCESS` after the if-elif-else block
started.

Test case is updated for fix to bug-1.
@XD-DENG XD-DENG changed the title [AIRFLOW-2921] Fix trivial bug in CeleryExecutor() [AIRFLOW-2921][AIRFLOW-2922] Fix two bugs in CeleryExecutor() Aug 20, 2018
@XD-DENG XD-DENG changed the title [AIRFLOW-2921][AIRFLOW-2922] Fix two bugs in CeleryExecutor() [AIRFLOW-2921][AIRFLOW-2922] Fix two potential bugs in CeleryExecutor() Aug 20, 2018
@XD-DENG
Copy link
Member Author

XD-DENG commented Aug 20, 2018

Hi @ashb,

  1. Test case has been updated for the potential bug-1, based on the existing test case.

  2. To demonstrate the potential bug-1, I have created another branch in which I updated the test case but did NOT fix the code. This bug is quite obvious. 'Reverse Test' Commit Link, 'Reverse Test' Result Link

  3. Regarding the potential bug-2, please refer to my detailed description in the main PR message. Currently I don't see a feasible way to reproduce the potential error/test it yet.

@XD-DENG
Copy link
Member Author

XD-DENG commented Aug 21, 2018

Hi @yrqls21 , would you like to give your review on this for fun? ;-)

Especially the potential bug-2 listed above. It's either interesting finding or my dumb misunderstanding.

@@ -54,6 +54,8 @@ def test_celery_integration(self):
self.assertNotIn('success', executor.tasks)
self.assertNotIn('fail', executor.tasks)

self.assertNotIn('success', executor.last_state)
self.assertNotIn('fail', executor.last_state)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test case change is for potential bug-1 only.

self.log.info("Unexpected state: %s", task.state)
self.last_state[key] = task.state
self.log.info("Unexpected state: %s", state)
self.last_state[key] = state
Copy link
Member Author

@XD-DENG XD-DENG Aug 21, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • The indentation change is to fix potential bug-1.

  • Changing from task.state to state in both lines 107 and 108 are to fix potential bug-2

@KevinYang21
Copy link
Member

Great catch, we have actually a patch in Airbnb that covered the 2nd bug( involves more changes and we're baking it in our env).

I agree on the fix for the 1st bug and its test. Also agree that it is very hard to have test to guard the 2nd bug.

Tyvm for the fixes, LGTM. 🎉

@XD-DENG
Copy link
Member Author

XD-DENG commented Aug 22, 2018

Thanks @yrqls21 ! Luckily Big O is not involved this time ;-)

Copy link
Contributor

@Fokko Fokko left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @XD-DENG

@Fokko Fokko merged commit 3a02345 into apache:master Aug 22, 2018
@XD-DENG
Copy link
Member Author

XD-DENG commented Aug 22, 2018

Thanks @Fokko

@XD-DENG XD-DENG deleted the patch-2 branch August 22, 2018 09:20
@ashb
Copy link
Member

ashb commented Aug 22, 2018

Does this mean that CeleryExecutor is totally broken in 1.10.0, and we need to get a 1.10.1 out quickly?

@XD-DENG
Copy link
Member Author

XD-DENG commented Aug 22, 2018

Hi @ashb , to my understanding:

  • bug 1 will not result in any explicit error. It was just missing some clean-ups.
  • The possibility that bug 2 resulting in any error is extremely low.

Personally I don’t think it’s that urgent.

@KevinYang21
Copy link
Member

@ashb I don't think they are so severe that we need to call celery executor broken. The 1st bug practically won't create problem and the 2nd bug creates a rare edge case that one task will get stuck in a deadlock, and can be corrected by doing something like clear the task instance

@ashb
Copy link
Member

ashb commented Aug 22, 2018

K. Just getting a sense of urgency. We might have a 1.10.1 anyway, and if we do I'll pull this in to it.

@XD-DENG
Copy link
Member Author

XD-DENG commented Aug 22, 2018

Thanks both @yrqls21 @ashb

ashb pushed a commit to ashb/airflow that referenced this pull request Sep 3, 2018
Bug-1:
if a task state becomes either SUCCESS or FAILURE or REVOKED,
it will be removed from self.tasks() and self.last_state().
However, because line 108 is not indented properly,
this task will be added back to self.last_state() again.

Bug-2:
When the state is updated, it's referring to the latest
state `task.state` rather than variable `state`.
This may result in dead-lock if the state changed from
`STARTED` to `SUCCESS` after the if-elif-else block
started.

Test case is updated for fix to bug-1.
galak75 pushed a commit to VilledeMontreal/incubator-airflow that referenced this pull request Nov 23, 2018
Bug-1:
if a task state becomes either SUCCESS or FAILURE or REVOKED,
it will be removed from self.tasks() and self.last_state().
However, because line 108 is not indented properly,
this task will be added back to self.last_state() again.

Bug-2:
When the state is updated, it's referring to the latest
state `task.state` rather than variable `state`.
This may result in dead-lock if the state changed from
`STARTED` to `SUCCESS` after the if-elif-else block
started.

Test case is updated for fix to bug-1.
aliceabe pushed a commit to aliceabe/incubator-airflow that referenced this pull request Jan 3, 2019
Bug-1:
if a task state becomes either SUCCESS or FAILURE or REVOKED,
it will be removed from self.tasks() and self.last_state().
However, because line 108 is not indented properly,
this task will be added back to self.last_state() again.

Bug-2:
When the state is updated, it's referring to the latest
state `task.state` rather than variable `state`.
This may result in dead-lock if the state changed from
`STARTED` to `SUCCESS` after the if-elif-else block
started.

Test case is updated for fix to bug-1.
cfei18 pushed a commit to cfei18/incubator-airflow that referenced this pull request Jan 23, 2019
Bug-1:
if a task state becomes either SUCCESS or FAILURE or REVOKED,
it will be removed from self.tasks() and self.last_state().
However, because line 108 is not indented properly,
this task will be added back to self.last_state() again.

Bug-2:
When the state is updated, it's referring to the latest
state `task.state` rather than variable `state`.
This may result in dead-lock if the state changed from
`STARTED` to `SUCCESS` after the if-elif-else block
started.

Test case is updated for fix to bug-1.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants