Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

TCH setting to autofix with conversion to string #5559

Closed
smackesey opened this issue Jul 6, 2023 · 20 comments · Fixed by #6001
Closed

TCH setting to autofix with conversion to string #5559

smackesey opened this issue Jul 6, 2023 · 20 comments · Fixed by #6001
Labels
core Related to core functionality

Comments

@smackesey
Copy link

smackesey commented Jul 6, 2023

Currently the TCH (flake8-type-checking) rules require annotations to be stored as strings (i.e. not evaluated). Annotations are stored a strings if:

  • The annotations are string literals (i.e. are quoted)
  • The annotations are in a module with from __future__ import annotations (which causes evaluation of all annotations in the module to be skipped)
  • The annotations are of local variables (the Python interpreter does not evaluate local variable annotations)

One approach to getting TCH autofix working is to inject from __future__ import annotations into all your source modules (using I002). This is not ideal because string annotations unavoidably break code that meets these conditions:

  1. annotations are being created in a local scope
  2. an annotation references a symbol only available in that local scope
  3. the annotation must be operated on at runtime

Here is an example of code meeting these conditions:

def make_bar():
    class Bar:
        ...

    def bar(x: "Bar"):
        ...
    return bar

bar = make_bar()

# `bar` has a string annotation-- if we want to operate on it, we need to evaluate it. This requires
# calling `get_type_hints`, but this fails because "Bar" can't be dereferenced since referent `Bar`
# vanished with local scope
get_type_hints(bar)  # => NameError: name 'Bar' is not defined.

We have plenty of code meeting these conditions in Dagster (mostly in tests), so my experiment in injecting from __future__ import annotations to enable TCH autofix has failed. Other projects that use runtime type annotation introspection are likely to encounter the same issues.

Therefore, a significant enhancement to TCH autofix would be to detect where annotations could be stored as strings, and perform a targeted conversion for you using quoting. Example:

from threading import Lock

def foo(lock: Lock):
    ...

After applying TCH autofix:

from typing import TYPE_CHECKING

if TYPE_CHECKING:
    from threading import Lock

def foo(lock: "Lock"):
    ...

This would make TCH autofix a lot more useful for us and probably many other projects.

@charliermarsh
Copy link
Member

Yeah, I agree this would be useful. I'm not yet certain how hard it is to get right, but I agree that it would be useful... flake8-type-checking has rules for this, which we could look at for reference: https://pypi.org/project/flake8-type-checking/.

annotations are being created in a local scope.

Can you give an example of this, just to illustrate the problem?

@charliermarsh charliermarsh added the core Related to core functionality label Jul 6, 2023
@smackesey
Copy link
Author

Can you give an example of this, just to illustrate the problem?

I updated OP with an example.

@charliermarsh
Copy link
Member

Thanks. I think it will be difficult to reliably determine whether an annotation is required at runtime 😓

@smackesey
Copy link
Author

I think it will be difficult to reliably determine whether an annotation is required at runtime

Haven't thought it all the way through, but can't you just see if the only uses of a symbol are inside non-local type annotations, while also applying the existing settings used to special case e.g. Pydantic? Maybe determining scoping is harder than I think.

@charliermarsh
Copy link
Member

Can you explain what you mean in terms of this example? (I.e., why x: Bar can't be quoted.)

def make_bar():
    class Bar:
        ...

    def bar(x: Bar):
        ...

    return bar

bar = make_bar()

print(get_type_hints(bar))

@smackesey
Copy link
Author

smackesey commented Jul 14, 2023

Can you explain what you mean in terms of this example? (I.e., why x: Bar can't be quoted.)

  • With x: Bar (unquoted case), bar.__annotations__["x"] stores a direct reference to Bar.
  • When x: "Bar" (quoted case), bar.__annotations__["x"] stores the string "Bar".
  • When get_type_hints(bar) is applied, Python tries to resolve each annotation
    • If it's a type, just return it.
    • If it's as string, dereference it.
  • When Python tries to dereference the string "Bar", it comes up empty and throws a NameError.

I'm not sure why Python fails in exactly this scenario. I guess the defining module for the object holding the annotations is known, so Python is able to look up the referent of string annotations if they exist in the defining module's __dict__.

@charliermarsh
Copy link
Member

But, does "whether we can quote it" depend on whether and when get_type_hints is called? Or is there a more robust and general methodology we can follow?

@smackesey
Copy link
Author

smackesey commented Jul 14, 2023

But, does "whether we can quote it" depend on whether and when get_type_hints is called? Or is there a more robust and general methodology we can follow?

I would say just be conservative and always assume that get_type_hints might be called-- so if it references a local scope symbol, don't quote it.

In practice referencing local scope symbols is going to be pretty rare in 99% of Python code anyway.

@charliermarsh
Copy link
Member

Ah I see -- so we would avoid quoting annotations that rely on symbols defined outside of the module scope.

@charliermarsh
Copy link
Member

It's also a little less critical for the TCH rules in particular since we're only dealing with imported symbols. But I suppose if the import itself is within a function scope...

smackesey added a commit to dagster-io/dagster that referenced this issue Jul 21, 2023
…g import autosort) (#14675)

## Summary & Motivation

Internal companion PR: dagster-io/internal#5847

- Turn on TID251 (banned imports) and configure it to ban `from
__future__ import annotations`.
- Turn on TCH suite of rules, which allows auto-sort to automatically
place typing-only imports in a `TYPE_CHECKING` block.

This PR was a journey. The goal was to get automatic sorting of imports
only used in type annotations into `TYPE_CHECKING` blocks. Ruff offers
the `TCH` set of rules for this, for which autofix was recently added.
Unfortunately, it turned out to be more complicated than just turning on
these rules, because the way things currently work, ruff will only
autosort the imports if the annotations are already all strings, which
most of the time means `from __future__ import annotations` is imported.

The initial plan was actually opposite the current state of this PR--
the idea was to _inject_ `from __future__ import annotations` (using
`I002`) into all Python modules for maximum `TCH` power. However, for
reasons explained below, `from __future__ import annotations` (a) does
not work well with Dagster in particular; (b) is not a good thing to add
everywhere because it's not part of the future of Python.

Therefore, I think the best thing to do for simplicity and
future-proofing of the codebase is to ban use of `from __future__ import
annotations`. It was imported in ~15 of our 2500 source files-- this PR
removes them, adds needed quotes to affected type annotations, and turns
on `TID251` to flag any future use of `from __future__ import
annotations`.

Without `from __future__ import annotations`, typing-only import
auto-sort (`TCH`) is relatively limited. However, I have submitted a
[ruff feature request](astral-sh/ruff#5559) to
get a new ruff setting that will make it much more powerful for
codebases not using `from __future__ import annotations`.

---

The issues with string annotations, `TYPE_CHECKING`, and `from
__future__ import annotations` are complex and merit a thorough
explanation.

Suppose I have the following module:

```
from dagster._core.definitions.events import AssetKey

def do_something_to_asset(key: AssetKey):
    ...
```

`AssetKey` above is only used in a type annotation. However, if we move
it to a `TYPE_CHECKING` block with no other changes, then our code will
crash. That's because, by default, Python evaluates type annotations (in
function sigs and module-scope variables, but _not_ local var
annotations) as standard expressions at runtime. If the `AssetKey`
import is in a `TYPE_CHECKING` block, then it won't execute at runtime,
and when the `AssetKey` annotation is encountered, Python will be unable
to evaluate the symbol `AssetKey` and will error. Therefore we can only
move the `AssetKey` import to a typing block under two conditions (1) we
turn it into a string annotation; (2) the evaluated expression never
needs to be accessed. Let's consider both of these separately:

## 1. Turning annotation into a string

There are two ways to do this:

1. Quote the `AssetKey` annotation.
2. Add `from __future__ import annotations` to the file. This will cause
_all_ annotations in the file to be interpreted it as strings. It is
equivalent to quoting all annotations in the file.

Option (2) is smoother and eliminates the extra visual complexity of
having some arguments quoted and others not. It also improves import
time. (Improvement for Dagster is 10-20%; I performed a rough experiment
timing `import dagster` on this branch vs master) This was one of the
motivations for PEP 562, which introduced `from __future__ import
annotations` and _was_ scheduled to become the standard way annotations
are evaluated in Python 3.11-- but that plan has been changed, see
below.

Turning _all_ annotations into strings does not play nicely with most
libs that do runtime processing of annotations. This includes Pydantic
and... Dagster. The reason is that these libs need access to the result
of the evaluated expression, not just a string:

```
from dagster import op
from pandas import DataFrame

@op
def foo(bar: "DataFrame"):  # This is currently an error
    ...
```

This problem can be somewhat mitigated by using `typing.get_type_hints`
on functions that might have string annotations-- `get_type_hints`
evaluates the annotations and returns them, providing the necessary
runtime values. The problem is that it doesn't always work-- if string
annotations refer to symbols in local scope, `get_type_hints` can't
resolve them:

```
from typing import get_type_hints

class Foo:
    ...

def foo(x: "Foo"):
     ...

# works to dereference "Foo" since referent `Foo` is in module-level scope
get_type_hints(foo)  # => {'x': <class '__main__.Foo'>}

def make_bar():
    class Bar:
        ...

    def bar(x: "Bar"):
        ...
    return bar

bar = make_bar()

# does not work to dereference "Bar" since referent `Bar` vanished with local scope
get_type_hints(bar)  # => NameError: name 'Bar' is not defined.
```

So string annotations, either explicit or via `from __future__ import
annotations`, are flatly incompatible with any situation where (1)
annotations are being created in a local scope; (2) an annotation
references a symbol only available in that local scope; (3) the
annotation must be operated on at runtime (and therefore evaluated with
`get_type_hints`).

For the above reasons (and because of the popularity of Pydantic), the
Python Steering Committee nixed plans to adopt string annotations in
Python 3.11. The new plan is described in [PEP
649](https://peps.python.org/pep-0649/) and will instead package all
annotations in a lazily evaluated function. This both enables forward
references _and_ ensures that the evaluated value of an annotation can
always be accessed.

Unfortunately, PEP 649 won't become part of Python until 3.12, and it's
unclear to me whether it will be backportable via another `from
__future__` import (but I suspect no). Since we need to support Python
<3.12 for many years, that means we probably won't have access to this
solution.

## 2. Evaluated expression never needs to be accessed

Even if we had access to PEP 649 the problems discussed in the preceding
section were solved, the ability to move "annotation-only" imports to a
`TYPE_CHECKING` block is still limited by whether the evaluated value
_ever_ needs to be accessed. If it does, you can't move an import to a
`TYPE_CHECKING` block.

For us, this mostly means that any type used in an annotation for a
Dagster decorator should never be moved to a `TYPE_CHECKING` block:

```
# This is only used in an annotation but it will break dagster if it is moved to a `TYPE_CHECKING` class
from foo import SomeDagsterConfigClass

@asset
def an_asset(config: SomeDagsterConfigClass):
    ...
```

Fortunately ruff allows configuring `TCH` rules to detect special
decorators like `@asset`, so I don't think this limitation should block
us.

## How I Tested These Changes

ruff
@charliermarsh
Copy link
Member

I have something that I've been testing against Dagster here: #6001. The outputs look correct to me -- one example:

--- a/examples/assets_smoke_test/assets_smoke_test/pure_python_assets.py
+++ b/examples/assets_smoke_test/assets_smoke_test/pure_python_assets.py
@@ -1,5 +1,9 @@
+from typing import TYPE_CHECKING
+
 from dagster import SourceAsset, TableSchema, asset
-from pandas import DataFrame
+
+if TYPE_CHECKING:
+    from pandas import DataFrame

 raw_country_populations = SourceAsset(
     "raw_country_populations",
@@ -19,7 +23,7 @@


 @asset
-def country_populations(raw_country_populations) -> DataFrame:
+def country_populations(raw_country_populations) -> "DataFrame":
     country_populations = raw_country_populations.copy()
     country_populations["change"] = (
         country_populations["change"]
@@ -32,13 +36,13 @@ def country_populations(raw_country_populations) -> DataFrame:


 @asset
-def continent_stats(country_populations: DataFrame) -> DataFrame:
+def continent_stats(country_populations: "DataFrame") -> "DataFrame":
     result = country_populations.groupby("continent").agg({"pop2019": "sum", "change": "mean"})
     return result


 @asset
-def country_stats(country_populations: DataFrame, continent_stats: DataFrame) -> DataFrame:
+def country_stats(country_populations: "DataFrame", continent_stats: "DataFrame") -> "DataFrame":
     result = country_populations.join(continent_stats, on="continent", lsuffix="_continent")
     result["continent_pop_fraction"] = result["pop2019"] / result["pop2019_continent"]
     return result

But not sure how best to check whether the rule is having any unintended effects on runtime behavior.

@tylerlaprade
Copy link
Contributor

If there's an easy way for me to locally run a preview version of Ruff, I'm happy to run your enhanced rule on my private codebase and verify the runtime behavior. Obviously would be easier if someone else has a large open-source codebase to test on, though.

@charliermarsh
Copy link
Member

Thanks! I can kick off a release build dry-run which will generate wheel files that you can download and pip install. I'll post here with the link.

@charliermarsh
Copy link
Member

If you'd like, you can click here, download the "Wheels" at the bottom, unzip, and pip install the wheel that matches your platform.

@tylerlaprade
Copy link
Contributor

This looks much more comprehensive than when I made from __future__ import annotations required (which barely modified anything)! I really like the looks of it. However, a runtime error has been introduced that doesn't get detected by my typechecker:

image image image

@tylerlaprade
Copy link
Contributor

tylerlaprade commented Jul 23, 2023

I did a global find-and-replace to convert "(\w+)" \| None to $1 | None. This fixed all my runtime issues (at least until I re-ran Ruff).

EDIT: The above find-and-replace loses some of the benefits of this rule. Replacing with "$1 | None" is much better for two reasons:

  1. It retains the performance benefits of this rule.
  2. It does not get undone when re-running Ruff.

@charliermarsh
Copy link
Member

Oh yes, that case makes sense, we need handle it. Thanks so much for testing it out!

@smackesey
Copy link
Author

The above outputs look good to me, as long as there's a setting to avoid applying the rule to functions with certain decorators (and IIRC this already exists).

But not sure how best to check whether the rule is having any unintended effects on runtime behavior.

Potential unintended runtime effects are unavoidable since people could always be accessing .__annotations__ and performing arbitrary logic with it. But as long as those annotations aren't coming from local scope, they can be easily worked around by changing direct .__annotations__ access to use typing.get_type_hints.

@charliermarsh
Copy link
Member

I'm blocked on supporting unions (e.g., when quoting DataFrame in DataFrame | int, you need to quote the entire union, like "DataFrame | int"). There are just a bunch of edge cases and we don't have access to the full AST when constructing the fix (so we need to do manual lexing). This might require some new infrastructure to be fully robust.

@tylerlaprade
Copy link
Contributor

Totally understandable. For now, is there value in shipping it with an exclusion to not add quotes to any types involving unions? It still seemed like a major improvement over from __future__ import annotations when I tested the preview.

charliermarsh added a commit that referenced this issue Dec 13, 2023
…luated references (#6001)

## Summary

This allows us to fix usages like:

```python
from pandas import DataFrame

def baz() -> DataFrame:
    ...
```

By quoting the `DataFrame` in `-> DataFrame`. Without quotes, moving
`from pandas import DataFrame` into an `if TYPE_CHECKING:` block will
fail at runtime, since Python tries to evaluate the annotation to add it
to the function's `__annotations__`.

Unfortunately, this does require us to split our "annotation kind" flags
into three categories, rather than two:

- `typing-only`: The annotation is only evaluated at type-checking-time.
- `runtime-evaluated`: Python will evaluate the annotation at runtime
(like above) -- but we're willing to quote it.
- `runtime-required`: Python will evaluate the annotation at runtime
(like above), and some library (like Pydantic) needs it to be available
at runtime, so we _can't_ quote it.

This functionality is gated behind a setting
(`flake8-type-checking.quote-annotations`).

Closes #5559.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
core Related to core functionality
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants