Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[AIRFLOW-2027] Only trigger sleep in scheduler after all files have parsed #2986

Conversation

aoen
Copy link
Contributor

@aoen aoen commented Jan 29, 2018

JIRA

Description

  • Here are some details about my PR, including screenshots of any UI changes:

The scheduler loop sleeps for 1 second every loop unnecessarily. Remove this sleep to slightly speed up scheduling, and instead do it once all files have been parsed. It can add up since it runs to every scheduler loop which runs # of dags to parse/scheduler parallelism times. We have seen it increase scheduling times by 10% in our production environment.

Also remove the unnecessary increased file processing interval in tests which slows them down.

Tests

  • My PR adds the following unit tests OR does not need testing for this extremely good reason:
    Can't test this very well since it is a sleep interval, might mock out the sleep and test the calculated of sleep_length though. It has been running on the Airbnb Airflow cluster for many weeks now.

Commits

  • My commits all reference JIRA issues in their subject lines, and I have squashed multiple commits if they address the same issue. In addition, my commits follow the guidelines from "How to write a good git commit message":

    1. Subject is separated from body by a blank line
    2. Subject is limited to 50 characters
    3. Subject does not end with a period
    4. Subject uses the imperative mood ("add", not "adding")
    5. Body wraps at 72 characters
    6. Body explains "what" and "why", not "how"
  • Passes git diff upstream/master -u -- "*.py" | flake8 --diff

@saguziel

@aoen aoen force-pushed the ddavydov--open_source_disable_unecessary_sleep_in_scheduler_loop branch 3 times, most recently from 1d27d2c to f8edf4e Compare January 29, 2018 23:00
@Fokko
Copy link
Contributor

Fokko commented Jan 31, 2018

Please rebase against master to trigger a new build. The CI should be fixed.

@bolkedebruin
Copy link
Contributor

@aoen can you trigger the build again please?

@bolkedebruin
Copy link
Contributor

ping @aoen

@aoen aoen force-pushed the ddavydov--open_source_disable_unecessary_sleep_in_scheduler_loop branch from f8edf4e to 2685227 Compare March 6, 2018 20:57
@aoen
Copy link
Contributor Author

aoen commented Mar 6, 2018

Sorry for the delay, I've been working on another PR to parallelize Celery syncing on the Airflow that should have a higher impact on scheduler performance.

Re-pushed.

@aoen aoen force-pushed the ddavydov--open_source_disable_unecessary_sleep_in_scheduler_loop branch from 2685227 to 0ea7a85 Compare March 6, 2018 22:50
@Fokko
Copy link
Contributor

Fokko commented Mar 7, 2018

Can you rebase onto master? We had some issues with the apache-beam Python3 compatibility.

@aoen aoen force-pushed the ddavydov--open_source_disable_unecessary_sleep_in_scheduler_loop branch from 0ea7a85 to 24411e9 Compare March 7, 2018 19:10
@codecov-io
Copy link

codecov-io commented Mar 7, 2018

Codecov Report

Merging #2986 into master will increase coverage by <.01%.
The diff coverage is 94.44%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #2986      +/-   ##
==========================================
+ Coverage   73.07%   73.08%   +<.01%     
==========================================
  Files         180      180              
  Lines       12578    12586       +8     
==========================================
+ Hits         9191     9198       +7     
- Misses       3387     3388       +1
Impacted Files Coverage Δ
airflow/jobs.py 80.07% <100%> (-0.02%) ⬇️
airflow/utils/dag_processing.py 88.88% <90.9%> (+0.44%) ⬆️
airflow/models.py 87.28% <0%> (-0.05%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 0f9f460...24411e9. Read the comment docs.

@aoen
Copy link
Contributor Author

aoen commented Mar 7, 2018

@Fokko done, and thanks for looking into the apache-beam issues!

@asfgit asfgit closed this in 3c4f1fd Apr 9, 2018
@aoen
Copy link
Contributor Author

aoen commented Aug 13, 2018

There seems to be a couple of problems that cause the sleep to not trigger and Scheduler heartbeating/logs to be spammed:

  1. If all of the files are being processed in the queue, there is no sleep (can be fixed by sleeping for min_sleep even if there are no files)
  2. I have heard reports that some files can return a parsing time that is monotonically increasing for some reason (e.g. file actually parses in 1s each loop, but the reported duration seems to use the very time the file was parsed as the start time instead of the last time), I haven't confirmed this but it sounds problematic.

To unblock the release I'm reverting this PR for now. It should be re-added with tests/mocking.

@bolkedebruin @yrqls21

@bolkedebruin
Copy link
Contributor

@aoen unblock release? I don't follow?

@aoen
Copy link
Contributor Author

aoen commented Aug 13, 2018

This bug can cause severe issues if the particular edge cases are hit, I guess no one reported it in the current RC but it's there.

@bolkedebruin
Copy link
Contributor

K... but I assume "unblocking" means 1.10.1 not 1.10.0 as the vote has already passed?

@aoen
Copy link
Contributor Author

aoen commented Aug 13, 2018

Ah I missed that, would be for 1.10.1 then, yeah.

@KevinYang21
Copy link
Member

KevinYang21 commented Aug 13, 2018

@aoen
Agree on sleep for a minimal time interval(max(0, 1s-duration)), it should not be a huge problem to prolong the scheduler loop time to at least one second every loop( have that in my branch for a while and looks good until now).

I'm not so sure about monotonically increasing, what I can see is that if we have parse files faster than we set file path, then if file was removed/have import error and not updating the last_finish_time in the 2nd parsing round, we'll still use the last_finish_time from the 1st round and get some non-desirable duration value.

@cjgu
Copy link
Contributor

cjgu commented Aug 14, 2018

I hit this in production during testing of 1.10 but managed to avoid it by tweaking the sleep configurations.

It was triggered by having only min_file_process_interval = 60 set.

Solved by setting

min_file_process_interval = 0
min_file_parsing_loop_time = 60

aliceabe pushed a commit to aliceabe/incubator-airflow that referenced this pull request Jan 3, 2019
…arsed

Closes apache#2986 from aoen/ddavydov--open_source_disab
le_unecessary_sleep_in_scheduler_loop
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants