-
Notifications
You must be signed in to change notification settings - Fork 1.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
TCH
setting to autofix with conversion to string
#5559
Comments
Yeah, I agree this would be useful. I'm not yet certain how hard it is to get right, but I agree that it would be useful... flake8-type-checking has rules for this, which we could look at for reference: https://pypi.org/project/flake8-type-checking/.
Can you give an example of this, just to illustrate the problem? |
I updated OP with an example. |
Thanks. I think it will be difficult to reliably determine whether an annotation is required at runtime 😓 |
Haven't thought it all the way through, but can't you just see if the only uses of a symbol are inside non-local type annotations, while also applying the existing settings used to special case e.g. Pydantic? Maybe determining scoping is harder than I think. |
Can you explain what you mean in terms of this example? (I.e., why def make_bar():
class Bar:
...
def bar(x: Bar):
...
return bar
bar = make_bar()
print(get_type_hints(bar)) |
I'm not sure why Python fails in exactly this scenario. I guess the defining module for the object holding the annotations is known, so Python is able to look up the referent of string annotations if they exist in the defining module's |
But, does "whether we can quote it" depend on whether and when |
I would say just be conservative and always assume that In practice referencing local scope symbols is going to be pretty rare in 99% of Python code anyway. |
Ah I see -- so we would avoid quoting annotations that rely on symbols defined outside of the module scope. |
It's also a little less critical for the TCH rules in particular since we're only dealing with imported symbols. But I suppose if the import itself is within a function scope... |
…g import autosort) (#14675) ## Summary & Motivation Internal companion PR: dagster-io/internal#5847 - Turn on TID251 (banned imports) and configure it to ban `from __future__ import annotations`. - Turn on TCH suite of rules, which allows auto-sort to automatically place typing-only imports in a `TYPE_CHECKING` block. This PR was a journey. The goal was to get automatic sorting of imports only used in type annotations into `TYPE_CHECKING` blocks. Ruff offers the `TCH` set of rules for this, for which autofix was recently added. Unfortunately, it turned out to be more complicated than just turning on these rules, because the way things currently work, ruff will only autosort the imports if the annotations are already all strings, which most of the time means `from __future__ import annotations` is imported. The initial plan was actually opposite the current state of this PR-- the idea was to _inject_ `from __future__ import annotations` (using `I002`) into all Python modules for maximum `TCH` power. However, for reasons explained below, `from __future__ import annotations` (a) does not work well with Dagster in particular; (b) is not a good thing to add everywhere because it's not part of the future of Python. Therefore, I think the best thing to do for simplicity and future-proofing of the codebase is to ban use of `from __future__ import annotations`. It was imported in ~15 of our 2500 source files-- this PR removes them, adds needed quotes to affected type annotations, and turns on `TID251` to flag any future use of `from __future__ import annotations`. Without `from __future__ import annotations`, typing-only import auto-sort (`TCH`) is relatively limited. However, I have submitted a [ruff feature request](astral-sh/ruff#5559) to get a new ruff setting that will make it much more powerful for codebases not using `from __future__ import annotations`. --- The issues with string annotations, `TYPE_CHECKING`, and `from __future__ import annotations` are complex and merit a thorough explanation. Suppose I have the following module: ``` from dagster._core.definitions.events import AssetKey def do_something_to_asset(key: AssetKey): ... ``` `AssetKey` above is only used in a type annotation. However, if we move it to a `TYPE_CHECKING` block with no other changes, then our code will crash. That's because, by default, Python evaluates type annotations (in function sigs and module-scope variables, but _not_ local var annotations) as standard expressions at runtime. If the `AssetKey` import is in a `TYPE_CHECKING` block, then it won't execute at runtime, and when the `AssetKey` annotation is encountered, Python will be unable to evaluate the symbol `AssetKey` and will error. Therefore we can only move the `AssetKey` import to a typing block under two conditions (1) we turn it into a string annotation; (2) the evaluated expression never needs to be accessed. Let's consider both of these separately: ## 1. Turning annotation into a string There are two ways to do this: 1. Quote the `AssetKey` annotation. 2. Add `from __future__ import annotations` to the file. This will cause _all_ annotations in the file to be interpreted it as strings. It is equivalent to quoting all annotations in the file. Option (2) is smoother and eliminates the extra visual complexity of having some arguments quoted and others not. It also improves import time. (Improvement for Dagster is 10-20%; I performed a rough experiment timing `import dagster` on this branch vs master) This was one of the motivations for PEP 562, which introduced `from __future__ import annotations` and _was_ scheduled to become the standard way annotations are evaluated in Python 3.11-- but that plan has been changed, see below. Turning _all_ annotations into strings does not play nicely with most libs that do runtime processing of annotations. This includes Pydantic and... Dagster. The reason is that these libs need access to the result of the evaluated expression, not just a string: ``` from dagster import op from pandas import DataFrame @op def foo(bar: "DataFrame"): # This is currently an error ... ``` This problem can be somewhat mitigated by using `typing.get_type_hints` on functions that might have string annotations-- `get_type_hints` evaluates the annotations and returns them, providing the necessary runtime values. The problem is that it doesn't always work-- if string annotations refer to symbols in local scope, `get_type_hints` can't resolve them: ``` from typing import get_type_hints class Foo: ... def foo(x: "Foo"): ... # works to dereference "Foo" since referent `Foo` is in module-level scope get_type_hints(foo) # => {'x': <class '__main__.Foo'>} def make_bar(): class Bar: ... def bar(x: "Bar"): ... return bar bar = make_bar() # does not work to dereference "Bar" since referent `Bar` vanished with local scope get_type_hints(bar) # => NameError: name 'Bar' is not defined. ``` So string annotations, either explicit or via `from __future__ import annotations`, are flatly incompatible with any situation where (1) annotations are being created in a local scope; (2) an annotation references a symbol only available in that local scope; (3) the annotation must be operated on at runtime (and therefore evaluated with `get_type_hints`). For the above reasons (and because of the popularity of Pydantic), the Python Steering Committee nixed plans to adopt string annotations in Python 3.11. The new plan is described in [PEP 649](https://peps.python.org/pep-0649/) and will instead package all annotations in a lazily evaluated function. This both enables forward references _and_ ensures that the evaluated value of an annotation can always be accessed. Unfortunately, PEP 649 won't become part of Python until 3.12, and it's unclear to me whether it will be backportable via another `from __future__` import (but I suspect no). Since we need to support Python <3.12 for many years, that means we probably won't have access to this solution. ## 2. Evaluated expression never needs to be accessed Even if we had access to PEP 649 the problems discussed in the preceding section were solved, the ability to move "annotation-only" imports to a `TYPE_CHECKING` block is still limited by whether the evaluated value _ever_ needs to be accessed. If it does, you can't move an import to a `TYPE_CHECKING` block. For us, this mostly means that any type used in an annotation for a Dagster decorator should never be moved to a `TYPE_CHECKING` block: ``` # This is only used in an annotation but it will break dagster if it is moved to a `TYPE_CHECKING` class from foo import SomeDagsterConfigClass @asset def an_asset(config: SomeDagsterConfigClass): ... ``` Fortunately ruff allows configuring `TCH` rules to detect special decorators like `@asset`, so I don't think this limitation should block us. ## How I Tested These Changes ruff
I have something that I've been testing against Dagster here: #6001. The outputs look correct to me -- one example: --- a/examples/assets_smoke_test/assets_smoke_test/pure_python_assets.py
+++ b/examples/assets_smoke_test/assets_smoke_test/pure_python_assets.py
@@ -1,5 +1,9 @@
+from typing import TYPE_CHECKING
+
from dagster import SourceAsset, TableSchema, asset
-from pandas import DataFrame
+
+if TYPE_CHECKING:
+ from pandas import DataFrame
raw_country_populations = SourceAsset(
"raw_country_populations",
@@ -19,7 +23,7 @@
@asset
-def country_populations(raw_country_populations) -> DataFrame:
+def country_populations(raw_country_populations) -> "DataFrame":
country_populations = raw_country_populations.copy()
country_populations["change"] = (
country_populations["change"]
@@ -32,13 +36,13 @@ def country_populations(raw_country_populations) -> DataFrame:
@asset
-def continent_stats(country_populations: DataFrame) -> DataFrame:
+def continent_stats(country_populations: "DataFrame") -> "DataFrame":
result = country_populations.groupby("continent").agg({"pop2019": "sum", "change": "mean"})
return result
@asset
-def country_stats(country_populations: DataFrame, continent_stats: DataFrame) -> DataFrame:
+def country_stats(country_populations: "DataFrame", continent_stats: "DataFrame") -> "DataFrame":
result = country_populations.join(continent_stats, on="continent", lsuffix="_continent")
result["continent_pop_fraction"] = result["pop2019"] / result["pop2019_continent"]
return result But not sure how best to check whether the rule is having any unintended effects on runtime behavior. |
If there's an easy way for me to locally run a preview version of Ruff, I'm happy to run your enhanced rule on my private codebase and verify the runtime behavior. Obviously would be easier if someone else has a large open-source codebase to test on, though. |
Thanks! I can kick off a release build dry-run which will generate wheel files that you can download and |
If you'd like, you can click here, download the "Wheels" at the bottom, unzip, and |
I did a global find-and-replace to convert EDIT: The above find-and-replace loses some of the benefits of this rule. Replacing with
|
Oh yes, that case makes sense, we need handle it. Thanks so much for testing it out! |
The above outputs look good to me, as long as there's a setting to avoid applying the rule to functions with certain decorators (and IIRC this already exists).
Potential unintended runtime effects are unavoidable since people could always be accessing |
I'm blocked on supporting unions (e.g., when quoting |
Totally understandable. For now, is there value in shipping it with an exclusion to not add quotes to any types involving unions? It still seemed like a major improvement over |
…luated references (#6001) ## Summary This allows us to fix usages like: ```python from pandas import DataFrame def baz() -> DataFrame: ... ``` By quoting the `DataFrame` in `-> DataFrame`. Without quotes, moving `from pandas import DataFrame` into an `if TYPE_CHECKING:` block will fail at runtime, since Python tries to evaluate the annotation to add it to the function's `__annotations__`. Unfortunately, this does require us to split our "annotation kind" flags into three categories, rather than two: - `typing-only`: The annotation is only evaluated at type-checking-time. - `runtime-evaluated`: Python will evaluate the annotation at runtime (like above) -- but we're willing to quote it. - `runtime-required`: Python will evaluate the annotation at runtime (like above), and some library (like Pydantic) needs it to be available at runtime, so we _can't_ quote it. This functionality is gated behind a setting (`flake8-type-checking.quote-annotations`). Closes #5559.
Currently the
TCH
(flake8-type-checking
) rules require annotations to be stored as strings (i.e. not evaluated). Annotations are stored a strings if:from __future__ import annotations
(which causes evaluation of all annotations in the module to be skipped)One approach to getting
TCH
autofix working is to injectfrom __future__ import annotations
into all your source modules (usingI002
). This is not ideal because string annotations unavoidably break code that meets these conditions:Here is an example of code meeting these conditions:
We have plenty of code meeting these conditions in Dagster (mostly in tests), so my experiment in injecting
from __future__ import annotations
to enableTCH
autofix has failed. Other projects that use runtime type annotation introspection are likely to encounter the same issues.Therefore, a significant enhancement to
TCH
autofix would be to detect where annotations could be stored as strings, and perform a targeted conversion for you using quoting. Example:After applying
TCH
autofix:This would make
TCH
autofix a lot more useful for us and probably many other projects.The text was updated successfully, but these errors were encountered: