Replace numpy example with practical exercise demonstrating top-level code #35097
Conversation
I like the idea of mentioning and being explicit about expensive APIs, but I think there is value in mentioning the imports too, because not many people are aware how big an impact such imports can have. I'd suggest adding the "expensive" operation as well, but also keeping a "slow import" example in - just replacing numpy with pandas. Some experiments: it takes some 0.4 - 1 s to import pandas on my MacOS:
And around ~ 0.3 s in my docker container:
Importing Pandas results in opening around 750 files:
The same exercise for numpy shows that it is much faster in the container (~0.1 s) and opens far fewer files:
Opened files:
A fragment of the strace for pandas, showing that it imports a lot of code:
|
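For reference, a minimal sketch for reproducing these timings yourself (absolute numbers will vary by machine, and by whether the OS has the module files cached - run it in a fresh interpreter for a meaningful cold-import number):

import importlib
import time

# time the first import of each module; re-running in the same process
# would just hit sys.modules and report ~0s
for module_name in ("numpy", "pandas"):
    start = time.perf_counter()
    importlib.import_module(module_name)
    elapsed = time.perf_counter() - start
    print(f"import {module_name}: {elapsed:.2f}s")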
One thing worth mentioning is that even 0.3 s is pretty impactful. All our DAGs are parsed by the DAG file processor in separately forked processes - so if such a 'pandas' import happens at the top level of all DAG files, by default that 0.3 s is an overhead for every DAG, every 30 s (or whatever the min-parsing interval is). We already deal with some of that via https://airflow.apache.org/docs/apache-airflow/stable/configurations-ref.html#parsing-pre-import-modules (but only for airflow imports, not for other expensive imports). I also think it's worth keeping an import example for another case: I have seen quite a few examples where "organisation-level" imports were doing a lot of things, hidden from the DAG developer (for example, I saw configuration being pulled via an expensive API call there). People often don't realize that merely importing a module can trigger expensive work (a sketch of that case follows below). Also, showing that it can be mitigated by a "local import" is a good "teaching" resource. |
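To illustrate that "hidden organisation-level import" case, here is a sketch of a hypothetical shared helper module (the module name, URL, and variable names are all made up for illustration) that fetches configuration at import time, so every DAG file importing it pays the API-call cost on every parse:

# my_org/utils.py - hypothetical organisation-level helper module
import requests

# DON'T DO THAT: this call runs at import time, i.e. on every DAG parse
# in the DAG file processor, not just when a task actually needs the config
ORG_CONFIG = requests.get("https://config.example.com/airflow", timeout=10).json()


def get_setting(key):
    return ORG_CONFIG[key]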
I did add this blurb about imports:
Even the pandas example feels like a relatively small optimization compared to the kinds of problems that more significant top-level code can cause, and IMO including that example could do more harm than good for folks who are newer to Airflow. For example, this PR was inspired by this stackoverflow question. |
Yeah, but it does matter, and I think we should explain it to educate the users. Just like the user who asked the question about PEP8 conflicting with our advice, they might not realize they are heavily impacting the performance of the scheduler/DAG file processor. Over the last few years @uranusjr spent enormous effort moving a number of our imports around to make airflow import faster, sometimes shaving off hundreds of milliseconds, precisely because it actually matters. I also added an answer to the same question in our Slack (it seems the author has been asking in multiple places).
So while PEP8 was good advice years ago, the Python community seems to have recognised the problem since then, and deferring imports of expensive modules is now the recommended practice (and explicitly showing how to do it makes sense IMHO). The advice from the Ruff authors points in the same direction. So why don't we simply clarify it, for example this way, if we would like to give really precise advice to our users (I took it from some of our example DAGs):

# It's OK to import modules that are cheap to load at the top level of a DAG file
import random

import pendulum

from airflow.decorators import task

# Expensive imports should be avoided as top-level imports, because the DAG file
# is parsed frequently - even if that does not follow PEP8 advice (PEP8 did not
# foresee that certain imports would be very expensive).
# DON'T DO THAT - import them locally instead (see below)
#
# import pandas
# import torch
# import tensorflow
#
...


@task()
def do_stuff_with_pandas_and_torch():
    import pandas
    import torch

    # do some operations using pandas and torch
    ...


@task()
def do_stuff_with_tensorflow():
    import tensorflow

    # do some operations using tensorflow
    ...

Would you find this confusing? I think it would rather help our users decide how to write their DAGs. |
FWIW I’ve been wondering if it’s worthwhile to implement some magic in the DAG processor to automatically move imports inside task functions, so people can write DAG files “normally” but still get the function-level import benefits. |
That would be rather super-magical if we managed to pull it off, IMHO. I think the most we should do is detect and warn about such expensive imports (which, BTW, I think is a good idea) - but manipulating the sources or bytecode of DAG files written by the user is very dangerous. Not only will it change line numbers for debugging, but there are a number of edge cases - for example, a user might have a really good reason to import even an expensive module at module (top) level. There are also all the "transitive" cases - a DAG imports utility code that in turn imports tensorflow. That is equally expensive (the utility import is). Should we move the whole utility import inside a task? Which task? Maybe the utility module also initializes some code that is needed for all tasks (like setting variables needed to authenticate inside the organisation), etc. We really do not want to get involved in those. But we could potentially warn the user if, after the DAG is parsed, we see an import that we consider "expensive". That would be very simple to implement and a nice feature, I think. We could even consider measuring (by some smarts or monkeypatching of Python stdlib code, I think) the time it takes to do imports and automatically flag imports that take (say) > 0.2 s. Why not? |
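For what it's worth, here is a rough sketch of what that "measure and flag" idea could look like (the 0.2 s threshold is taken from the comment above; monkeypatching builtins.__import__ is just one possible approach, not a proposal for the actual implementation):

import builtins
import time
import warnings

_real_import = builtins.__import__
THRESHOLD_S = 0.2  # assumption: threshold suggested in the discussion above


def _timed_import(name, *args, **kwargs):
    # delegate to the real import machinery, but measure how long it takes
    start = time.perf_counter()
    module = _real_import(name, *args, **kwargs)
    elapsed = time.perf_counter() - start
    if elapsed > THRESHOLD_S:
        warnings.warn(f"import of {name!r} took {elapsed:.2f}s - consider a local import")
    return module


builtins.__import__ = _timed_import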
This pull request has been automatically marked as stale because it has not had recent activity. It will be closed in 5 days if no further activity occurs. Thank you for your contributions. |
Docs building issues :) |
Woohoo! |
I'm not sure the numpy example is still accurate. I tested importing numpy locally and it was more or less instantaneous. I think the example can cause confusion by implying that no imports should be done at the top level.
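One easy way to double-check this on any machine is CPython's built-in import profiler (the -X importtime flag, available since Python 3.7), which prints a per-module timing breakdown to stderr; a sketch, assuming numpy is installed:

import subprocess
import sys

# run "import numpy" in a fresh interpreter so nothing is cached in sys.modules;
# the cumulative time for numpy itself appears on the last line of the output
subprocess.run([sys.executable, "-X", "importtime", "-c", "import numpy"])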