[BEAM-7746] Add python type hints #9056
Conversation
R: @aaltay
Force-pushed from 2f1e9d4 to d3c509a
CC: @udim
Thanks for looking at this. Some brief comments:
- **typing module import:** Given that we're deprecating the old type hints, I think we should take a sweep through the codebase and replace all non-qualified typehints with their typing equivalents (when possible) or qualified ones (where not). This would allow us to use the common convention of importing from the typing module directly.
- **type checking pipeline construction:** I think the goal of JIRA-7060 is specifically to make type annotations useful to pipeline developers. Making tools like mypy work on the beam codebase, and on user pipelines in general, is another (worthy) goal.
- **str vs unicode:** +1 to your approach. There may be cases where we want an explicit bytes type as well (which would be an alias of str in python 2).
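As a sketch of that last point (the alias name `BytesT` is hypothetical, chosen here for illustration; Beam may spell it differently):

```python
import sys

# Under Python 2, bytes is already an alias of str, so a single shared
# alias gives annotations a consistent meaning on both versions.
if sys.version_info[0] < 3:
    BytesT = str  # hypothetical alias name, for illustration only
else:
    BytesT = bytes

assert isinstance(b"raw", BytesT)
```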
```diff
@@ -44,7 +52,8 @@ def __init__(self):
         self.__class__, 'word_len_dist')
     self.empty_line_counter = Metrics.counter(self.__class__, 'empty_lines')

-  def process(self, element):
+  def process(self, element, *args, **kwargs):
```
Why are the *args and **kwargs parameters added here?
```python
reveal_type(lines)  # PCollection[unicode*]

def make_ones(x):
  # type: (T) -> Tuple[T, int]
```
Can this not be inferred? Similarly below.
Can you explain what you mean?
Let me try to answer what I think you're asking about.
First, even for a simple function like `make_ones`, mypy does not guess at the arguments and return type.
Next, as a user annotating pipelines I have a few choices (all of them are valid):

1. annotate functions, and let mypy infer the types of the `PTransforms` they are passed to
2. annotate `PTransforms`
3. annotate both, as a way to guard against unintentional incompatibilities

Option 2 would look like this:

```python
def make_ones(x):
    return (x, 1)

makemap = beam.Map(make_ones)  # type: beam.Map[T, Tuple[T, int]]
```
To be clear, in option 2, mypy is not using the type of `beam.Map` to infer the type of `make_ones`; it's simply assuming it's correct, since in our thought experiment it's unannotated. In other words, the following would not generate an error because `make_ones` is untyped:

```python
def make_ones(x):
    return (x + 1, 1)  # x can only be a numeric type!

makemap = beam.Map(make_ones)  # type: beam.Map[str, Tuple[str, int]]
```
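For contrast, option 1 (annotating the function itself with a type comment) can be shown as a minimal, self-contained sketch that runs without Beam:

```python
from typing import Tuple, TypeVar

T = TypeVar('T')

def make_ones(x):
    # type: (T) -> Tuple[T, int]
    # With the comment present, mypy checks both the body and the call
    # sites; without it, the function is untyped and mismatches pass.
    return (x, 1)

print(make_ones('a'))  # ('a', 1)
```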
This module is very messy right now, so I apologize for any confusion that it's generated.
That's really unfortunate... I'd have expected more of mypy :(. In that case our own inference (which does handle this) is all the more important.
```diff
-  def __init__(self, topic=None, subscription=None, id_label=None,
-               with_attributes=False, timestamp_attribute=None):
+  def __init__(self,
+               topic=None,  # type: typing.Optional[str]
```
If we keep these, I wonder if it'd be easier to read if the comments were all aligned.
I'll leave that up to the Beam Team, but we're still going to have the line length limit to contend with.
sdks/python/apache_beam/io/iobase.py (Outdated)
```diff
@@ -972,14 +979,15 @@ def expand(self, pcoll):
           'or be a PTransform. Received : %r' % self.sink)

-class WriteImpl(ptransform.PTransform):
+class WriteImpl(ptransform.PTransform[InT, None]):
```
Or should this be a PTransform[pvalue.PCollection[InT], None]?
IMHO, that would be redundant. It won't change the behavior of the type checks, and it would add verbosity to a lot of class definitions, since there are numerous `PTransforms` throughout the code.
oh, also, it will require an import of the `apache_beam.pvalue` module into all of those modules as well. you won't be able to protect it inside a `typing.TYPE_CHECKING` block since it's used in the class definition.
But some transforms take and return other things than PCollections. Maybe if T is shorthand for PCollection[T] but that seems ambiguous.
I made a new ticket for the latter, and changed the subject of this PR: https://issues.apache.org/jira/browse/BEAM-7746
@robertwb Please do not merge before I get a chance to understand exactly how this might affect Beam's type checking and hinting.
Specifically I'm worried about the use of typing.Generic and how that plays with IOTypeHints.
There's also the issue of using type comments instead of PEP 484 annotations that Python 3 parses and makes available at runtime.
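That runtime difference can be shown in a minimal sketch (illustrative function names, not Beam code): PEP 484 annotations survive to runtime, while type comments are invisible to the interpreter.

```python
def py3_style(x: int) -> str:
    return str(x)

def comment_style(x):
    # type: (int) -> str
    return str(x)

# PEP 484 annotations are available at runtime...
print(py3_style.__annotations__)      # {'x': <class 'int'>, 'return': <class 'str'>}
# ...type comments are not.
print(comment_style.__annotations__)  # {}
```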
mypy.ini (Outdated)
```diff
@@ -0,0 +1,19 @@
+[mypy]
+python_version = 2.7
```
The tox test environment you added uses 3.7. Is 2.7 correct here?
mypy needs python 3 to run, but this flag is for the version of the code that it's checking.
when the code is ready we'll want to run this in 3.x mode.
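In other words, a config along these lines (a sketch; the actual mypy.ini in this PR has more settings):

```ini
[mypy]
# mypy itself must run under Python 3, but this setting tells it to
# type-check the code as Python 2.7 code.
python_version = 2.7
```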
sdks/python/apache_beam/io/iobase.py (Outdated)
```python
  """Implements the writing of custom sinks."""

  def __init__(self, sink):
    super(WriteImpl, self).__init__()
    self.sink = sink

  def expand(self, pcoll):
    # type: (pvalue.PCollection[InT]) -> pvalue.PCollection[None]
```
IIUC, mypy will compare this annotation to the output of the plugin you wrote?
the plugin only deals with the `__or__` and `__ror__` methods.
But would it ensure that the `__[r]or__` method agrees with the generic type of the PTransform, which agrees with this method?
yes, I believe it could be used to achieve that.
```diff
@@ -193,7 +200,7 @@ def __repr__(self):
         self.input_types, self.output_types)

-class WithTypeHints(object):
+class WithTypeHints(Generic[InT, OutT]):
```
I'm afraid that adding a second typing system here will create confusion. If I write a PTransform, do I specify an input type to Generic[] or do I use a @with_input_types decorator? Which one is enforced?
True. I think the somewhat unsatisfactory answer is that for the time being you need both: one for runtime checking and the other for static checking, until such a time as they can become the same. I think trying to do that all in one PR is going to be too much.
off the top of my head, the order this should probably be done is:

1. support runtime type hints using the `typing` module instead of `typehints`: https://issues.apache.org/jira/browse/BEAM-7713
2. add static type hints to the beam code and begin enforcing it using mypy: this PR (https://issues.apache.org/jira/browse/BEAM-7746) and possibly https://issues.apache.org/jira/browse/BEAM-7712
3. support static validation of user pipelines (mypy plugin, etc)
4. support runtime validations based on `typing` annotations

There are a lot of "ifs" surrounding step 4. We may need to get to python3-only first to avoid the pitfalls of type comments. We may find that step 3 makes it less important.
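For reference, what the Generic half of this dual system provides statically can be sketched with just the typing module (the class names mirror the PR, but the snippet is illustrative, not Beam's implementation):

```python
from typing import Generic, TypeVar

InT = TypeVar('InT')
OutT = TypeVar('OutT')

class WithTypeHints(Generic[InT, OutT]):
    """Subclasses bind InT/OutT so static checkers can track them."""

class IntToStr(WithTypeHints[int, str]):
    """A hypothetical transform declared as int -> str."""

assert issubclass(IntToStr, WithTypeHints)
```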
(1) is already mostly there, for the user-facing API at least. How much of (2) is required for (3) and how invasive it is seems to still be an open question, which this PR is giving good insight to. (Long term, it'd also be nice to have (2) for development as well.) Given that mypy can't even do `beam.Map(lambda x: (x, 1))`, I think we'll have to rely a lot on (4).

All will be much nicer once we can have python-3 style annotations in the code itself. We can't do that for our own codebase until Python 2 is LTS only, but user code could start using it much sooner.
> How much of (2) is required for (3) and how invasive it is seems to still be an open question, which this PR is giving good insight to.

agreed.

> Given that mypy can't even do `beam.Map(lambda x: (x, 1))` I think we'll have to rely a lot on (4).
I think people who want to use static typing for their pipelines will have to stop using lambdas. So there will definitely be a split between those who are willing to make that adjustment and those who want to continue relying on the current interface for dynamic typing.
> All will be much nicer once we can have python-3 style annotations in the code itself. We can't do that for our own codebase until Python 2 is LTS only, but user code could start using it much sooner.
When we get to that stage, we can use this tool to convert comments to annotations: https://github.com/ilevkivskyi/com2ann
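For a sense of what that conversion does (illustrative only; com2ann's exact output may differ):

```python
# Before: Python 2 compatible type comment
def add(x, y):
    # type: (int, int) -> int
    return x + y

# After: equivalent PEP 484 annotations (Python 3 only)
def add_annotated(x: int, y: int) -> int:
    return x + y

print(add(2, 3), add_annotated(2, 3))  # 5 5
```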
sdks/python/setup.py (Outdated)
```python
  """Abstract method that is required to be overwritten"""

  def get_project_path(self, include_dists=[]):
    # with_2to3 = six.PY3 and getattr(self.distribution, 'use_2to3', False)
```
leftover comments?
yeah, this command class is based off of several in the distutils source, and it was unclear to me if we'd need to support this other case, so it's there as a reminder to figure that out.
To be crystal clear, this PR is very much a WIP. Just wanted to get eyes on it before I get too far ahead of myself.

I'm really curious to know what the general consensus is on this PR, implementation details aside. Do you all like the idea of adding type annotations? If I can get some subset of the current package passing, would you be willing to merge something like this in? What do you see as the major blockers?
> typing module import: Given that we're deprecating the old type hints, I think we should take a sweep through the codebase and replace all non-qualified typehints with their typing equivalents (when possible) or qualified ones (where not). This would allow us to use the common convention of importing from the typing module directly.

Sorry, I don't completely follow this. IIUC, we can't replace `apache_beam.typehints` with `typing` until beam knows how to deal with `typing` internally, right?
Beam knows how to convert from typing to its internal representation (some fixes at #9109).
> @robertwb Please do not merge before I get a chance to understand exactly how this might affect Beam's type checking and hinting. Specifically I'm worried about the use of typing.Generic and how that plays with IOTypeHints. There's also the issue of using type comments instead of PEP 484 annotations that Python 3 parses and makes available at runtime.
Yes, for sure.
Awesome. Getting that merged in will make this PR a lot easier.

It's in.

It's out :( I see it got reverted: 8afee95. Did it end up causing a regression?

There is a JIRA link in the revert PR: https://issues.apache.org/jira/browse/BEAM-7798 -- It was causing test failures.

Back in: #9125
I left a note on that PR about ...

Separate question: are the following two examples functionally equivalent in beam?

```python
coll = pipe | core.Map(lambda _, x: x.do_stuff(), self.thing)
coll = pipe | core.Map(lambda _, x=self.thing: x.do_stuff())
```

I would hope that they are, but I could see there being some difference in how the transforms get serialized. The reason I ask is that it's going to be tricky to deal with the shifting signatures of methods like ...
Here's my attempt at recreating what Beam does:
Passing self.thing as a default argument causes it to get pickled with the lambda and makes it a distinct object after unpickling.
yes, any object will be a distinct object after unpickling:
But the "failure" of your test is just an implementation detail of ... Luckily, it is possible for distinctly different objects to behave the same (otherwise beam wouldn't work at all):
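A minimal, self-contained illustration of both halves of that point, using plain pickle on a plain class (Beam serializes lambdas with a more capable pickler, so the `Thing` class and values here are purely hypothetical):

```python
import pickle

class Thing(object):
    def __init__(self, value):
        self.value = value
    def do_stuff(self):
        return self.value * 2

thing = Thing(21)
copy = pickle.loads(pickle.dumps(thing))

print(copy is thing)    # False: unpickling yields a distinct object...
print(copy.do_stuff())  # 42: ...that nevertheless behaves the same
```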
Anyway, IIUC, the function and the args passed to ...
I'm getting very close to getting this to pass, but I'm waiting on ...
Force-pushed from 9f334c5 to e3a5471
#9179 awaiting your review for core.py

Also, yes, the two examples are the same, unless some.thing is a deferred side input (e.g. `pvalue.AsIterable(pcoll)`), in which case only the first is valid (but an argument could be made that we should accept the second as well).
Force-pushed from 5ff6206 to 80932de
I just pushed a bunch of changes. I saw some updates to tickets related to this on Jira, so I thought I'd give a little update. This is a pretty big undertaking, and there are a lot of things in motion.
Force-pushed from 577bbd1 to b649f8c
Force-pushed from 70357d5 to b1a0478
Alright everyone, I'm happy with where this PR is now, and I think it's time to start figuring out how to get it merged.

This PR is admittedly huge, but the daunting scope is mitigated by the fact that the intent of the PR is to avoid making any behavioral changes. I recommend reviewing this PR one commit at a time, as I've tried to organize it into bite-sized chunks, and I've segregated changes with possible unintentional runtime side-effects and made note of them for extra scrutiny.

An additional consideration with respect to the size of this PR is that there is a critical mass required before mypy can be enabled in CI, so if we were to break up this PR, we wouldn't be able to enable the mypy lint until all of the changes are finally merged (the alternative is hundreds ...).

I'm really excited about this PR and I think you all are going to find the code base much easier to navigate and grok once it's in (I already provide this branch to others on the team who are trying to learn beam), not to mention the deep structural analysis that you get during testing (it's helped me to identify a number of real and potential bugs already, which you'll see in my commits).

One final consideration: there will be a need for training and mentoring on static type analysis. I have some internal docs on the subject which I'm happy to share, and you can always drag me into a PR to help resolve issues.
Thanks for doing this @chadrik! I'm very excited about this change :) One quick question: when we switch to python 3, is there a tool that will convert all the type comments to pep 484 type hints?
Thanks for the huge amount of work that's gone into this PR, and I'm looking forward to the benefits it'll bring to the codebase.

In terms of getting this reviewed, I understand that mypy can't be enabled [1] until it's all in, but I think it's preferable to get individual commits merged incrementally than to review commits incrementally and then try to get them all in at once in a single PR. In my spot checking it seems these commits are relatively orthogonal--is that basically the case? Perhaps a reviewer could create PRs cherry-picking commits that they are able to review in one sitting and we could get them in that way? Any other ideas?

[1] Maybe we could enable it, but not make it a blocker, in which case we could track progress and at least manually track progress/slippage until we enforce it.
Yes: https://github.com/ilevkivskyi/com2ann. I think we will need to manually fix up a bunch of style issues after running that, because when the comments become annotations they will exceed the line limit.

Side note: I recently experimented with enabling pylint's auto-fix features (like spotlessApply), but it won't work unless we're willing to change Beam's rather "unique" style. IIRC, the main deal-breaker is Beam's mixture of 2-space and 4-space tabs depending on context (it's been a headache for me to set up properly in PyCharm as well. I have something that's close, but not perfect).

EDIT: I just saw the script includes an option to automatically wrap signatures, ...
I like the idea of adding the lint job but not failing until we've gotten everything in. There are some commits that depend on others, but the relationship is not always obvious, even to me.

How about this plan: we first merge in a PR that contains just type comments, so theoretically no chance of runtime changes. This will be the ignore comments, the base typing pass, and the mypy lint job with failure disabled (basically this PR through to c25792c). At this point I think it would be safe for reviewers to cherry-pick the more topical commits that remain, in any order they choose.
Sounds good. That's still quite the big commit, but should be straightforward to review. (I'm not seeing that commit in the list though.)
We also need to get buy-in on the list, as this will be yet another "lint" that contributors will have to make happy (generally a good thing, though it might require some education).
First PR is here: #9915
Type annotations gained syntax support in 2 passes: function annotations in 3.5 and variable annotations in 3.6. So yes, once we get to python 3.6 there will no longer be a need for type comments. Luckily, there is a tool for automatically converting comments to annotations.
On Wed, Oct 30, 2019 at 9:41 AM, Alex Amato commented on this pull request, in sdks/python/apache_beam/typehints/decorators.py (on the `WithTypeHints(Generic[InT, OutT])` change):

> Python3.6+ has some built in support for static typing, if I understand correctly. (link: learn-how-to-use-static-type-checking-in-python-3-6-in-10-minutes, Medium)
>
> If we get beam onto python3 only, would it make more sense to use that instead of this comment style type hints? I was curious if there was an advantage to using the comment hints, other than the python2 support. Don't get me wrong, I think this change will be quite helpful and should proceed. But it could be a good stepping stone to the later python3 static typing, as it will be much easier to implement given your initial type hints.
Partial review. I focused mainly on the test infrastructure changes
```diff
@@ -1768,7 +1768,7 @@ class BeamModulePlugin implements Plugin<Project> {
       project.exec { commandLine virtualenvCmd }
       project.exec {
         executable 'sh'
-        args '-c', ". ${project.ext.envdir}/bin/activate && pip install --retries 10 --upgrade tox==3.11.1 grpcio-tools==1.3.5"
+        args '-c', ". ${project.ext.envdir}/bin/activate && pip install --retries 10 --upgrade tox==3.11.1 -r ${project.rootDir}/sdks/python/build-requirements.txt"
```
I'd rather limit the packages installed in the base virtualenv. `setup.py` should contain the list of packages needed (under `REQUIRED_TEST_PACKAGES` in this case). Note that it already lists `future` as a dependency. IIUC mypy is run via `setup.py`, so listing the dependencies there makes sense.
I touched on this in your other comment, but the big picture here is that python lacks a good solution for specifying build-time requirements: i.e. deps required to run `setup.py`.

Entire ecosystems have been built up around solving this:

- https://poetry.eustace.io/
- https://snarky.ca/clarifying-pep-518/
- No way to specify build-time dependencies of setup.py (pypa/pip#2381)
One note from the last link:

> Pip 10 and later support build dependencies, but they have to be specified in a pyproject.toml file.

To use that we'd have to change `python setup.py sdist` to use `pip`, which makes me nervous. Certainly more than I'd want to change in this PR.
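For reference, the PEP 518 route would look something like this (a sketch only; the package list is taken from the build-requirements discussion here, not a tested config):

```toml
[build-system]
# deps that pip installs into an isolated env before running setup.py
requires = ["setuptools", "wheel", "grpcio-tools", "mypy-protobuf", "future"]
build-backend = "setuptools.build_meta"
```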
@robertwb may have some other input. I noticed from some comments there was an earlier attempt to use the `setup_requires` field of `setup.py`, and then it was dismantled. IIRC, `setup_requires` has a lot of crippling issues.

In my experience a `build-requirements.txt` is by far the simplest solution.
```diff
@@ -143,6 +143,7 @@ disable =
   unnecessary-lambda,
   unnecessary-pass,
   unneeded-not,
+  unsubscriptable-object,
```
Where does this trigger? Perhaps a newer version of pylint could help? (pylint-dev/pylint#2416)
```python
class _ListBuffer(List[bytes]):
  """Used to support partitioning of a list."""
  def partition(self, n):
    # type: (int) -> List[List[bytes]]
    return [self[k::n] for k in range(n)]  # <------ ERROR HERE
```

```python
class SdfProcessSizedElements(DoOperation):
  ...
  def progress_metrics(self):
    # type: () -> beam_fn_api_pb2.Metrics.PTransform
    with self.lock:
      metrics = super(SdfProcessSizedElements, self).progress_metrics()
      current_element_progress = self.current_element_progress()
      if current_element_progress:
        assert self.input_info is not None
        metrics.active_elements.measured.input_element_counts[
            self.input_info[1]] = 1  # <------ ERROR HERE
        metrics.active_elements.fraction_remaining = (
            current_element_progress.fraction_remaining)
    return metrics
```
Generally speaking, there are a handful of checks that overlap between mypy and pylint where pylint does a worse job. In the second example above, `self.input_info` is an `Optional[Tuple]` that's initialized to `None`, so pylint thinks it's unsubscriptable, but it's initialized to a tuple elsewhere, and there's a line just above the error that says `assert self.input_info is not None`, so it obviously is subscriptable in this context. These are all subtleties that mypy understands but pylint does not, so I think we'll see a trend of moving responsibility for these structural tests to mypy.

> Perhaps a newer version of pylint could help?

I'm already using the latest and greatest (one of the commits here ticks the version up from 2.4.2 to 2.4.3).
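The assert-based narrowing that mypy performs can be shown in a small, self-contained sketch (hypothetical names, not the Beam code above):

```python
from typing import Optional, Tuple

def second_field(info):
    # type: (Optional[Tuple[str, int]]) -> int
    # mypy narrows Optional[...] to Tuple[str, int] after this assert;
    # pylint's unsubscriptable-object check can't follow the narrowing.
    assert info is not None
    return info[1]

print(second_field(('ok', 7)))  # 7
```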
```
# TODO(BEAM-5414): latest grpcio-tools incompatible with latest protobuf 3.6.1.
grpcio-tools>=1.3.5,<=1.14.2
mypy-protobuf==1.12
future==0.16.0
```
The addition of this file needs a discussion.

- This file needs a description. Which dependencies should be included here?
- Additionally, which dependencies should be listed by setup.py, the gradle setupVirtualenv task, gen_protos.py, etc.?
- Ideally dependencies should only be listed once, so I am for this kind of change.
- Why is `future` listed here? It's already in `setup.py`. I saw that some tox envs needed it, but it seems that if tox installs the apache-beam tarball in the virtualenv it creates, then that should include future as well.
- Why is `mypy-protobuf` here? Only the `mypy` command from setup.py seems to need it, so it could be listed under `REQUIRED_TEST_PACKAGES` there.
- I do agree that the grpcio-tools dep should be listed in one place, so maybe this file could serve that purpose.
> This file needs a description. Which dependencies should be included here?

Good point. These are the requirements for building an sdist of beam. They happen to all be dependencies of `gen_protos.py`, but conceptually this file could contain other build deps like cython. I chose to omit that since it's a bit more complicated: it's only used on linux and only for certain tests.

> Why is future listed here? It's already in setup.py.

`gen_protos.py` uses it to futurize the results of `protoc`. The packages in `install_requires` are not available at the time that `gen_protos` runs.

> Why is mypy-protobuf here? Only the mypy command from setup.py seems to need it, so it could be listed under REQUIRED_TEST_PACKAGES there.

Also required for `gen_protos`.
> Why is mypy-protobuf here?

To clarify, `mypy-protobuf` is required at the time that `gen_protos` runs (it's used by `protoc`), but it's not technically required by anything other than the `py37-mypy` job. However, we'll want to include the .pyi files in the sdist so that end users can get their benefits in PyCharm, mypy, etc. Which reminds me that I need to update setup.py to include pyi files in `package_data`!
```diff
@@ -49,6 +49,164 @@
 ]

+def generate_urn_files(log, out_dir):
+  """
+  Create a pyi stub for common_urns
```
Please update the docstring.
IIUC, this does not generate a pyi stub, but several _urns.py files that are imported by common_urns.
good catch!
```diff
@@ -160,16 +160,17 @@ commands =
 [testenv:py27-lint]
 # Checks for py2 syntax errors
 deps =
+  -r build-requirements.txt
```
Is this still necessary?
This is required if we want `tox` to work outside of gradle, and I really really do. Using gradle to run tox tasks adds a lot of overhead and I kind of hate it. gradle creates a complete virtualenv just to build an sdist, then it creates another virtualenv to run tox (and IIRC it does this every time). If I run tox directly, it creates a single virtualenv for both sdist and the tox env, and it reuses that virtualenv for subsequent runs of tox. Gradle also just seems slow to start up. Also, I can't pass individual tests to run when using gradle. Let's just say that as a python developer, the gradle experience is the thing I like least about the Beam project.

Yes, by adding the gen_proto requirements to `build-requirements.txt` and installing that in the `setupVirtualenv` task I'm making that overhead worse, but I certainly didn't start that problem: the code is currently installing `tox` every time that the `sdist` task is run, which is pointless. In other words, `setupVirtualenv` is currently installing the union of all possible task requirements, and I'm extending that pattern.

I could add a feature to `setupVirtualenv` to provide a requirements file, which would let us provide exactly the deps required for each task:

- `sdist`: grpcio-tools, mypy-protobuf, future
- `tox`: tox (tox will install the deps above if needed)

That would end up a net win, even with the deps added in this PR, because tox has a lot of deps. Should I do that here or in a followup PR?

I've been meaning to bring up some of these issues on the list. I assume the reason for the two step test process (sdist + tox, each with its own virtualenv) is to ensure that tox is using an sdist built in the same way every time, using python2, just as it would be for distribution?
`tox` runs outside of gradle. (At least for me). What is the issue there?

> I assume the reason for the two step test process (sdist + tox, each with its own virtualenv) is to ensure that tox is using an sdist built in the same way every time, using python2, just as it would be for distribution?

The sdist step is used for other things (building artifacts for release). That is the reason for it to exist on its own.

For tox, I do not believe we spent time on configuring it to reuse an existing virtualenv. If that is an option, we can try that.
> tox runs outside of gradle. (At least for me). What is the issue there?

When tox is run on Jenkins, it is invoked via gradle. The tox task is defined in `BeamModulePlugin.groovy` like this:
```groovy
project.ext.toxTask = { name, tox_env ->
  project.tasks.create(name) {
    dependsOn 'setupVirtualenv'
    dependsOn ':sdks:python:sdist'

    doLast {
      // Python source directory is also tox execution workspace, We want
      // to isolate them per tox suite to avoid conflict when running
      // multiple tox suites in parallel.
      project.copy { from project.pythonSdkDeps; into copiedSrcRoot }

      def copiedPyRoot = "${copiedSrcRoot}/sdks/python"
      def distTarBall = "${pythonRootDir}/build/apache-beam.tar.gz"
      project.exec {
        executable 'sh'
        args '-c', ". ${project.ext.envdir}/bin/activate && cd ${copiedPyRoot} && scripts/run_tox.sh $tox_env $distTarBall"
      }
    }
    inputs.files project.pythonSdkDeps
    outputs.files project.fileTree(dir: "${pythonRootDir}/target/.tox/${tox_env}/log/")
  }
}
```
and it's used here:
https://github.com/apache/beam/blob/master/sdks/python/test-suites/tox/py2/build.gradle
> sdist step is used for other things (for building artifacts for release). That is the reason for it to exist on its own.

I'm not saying that the `sdist` task shouldn't exist, I'm saying it adds overhead to use the `sdist` task to generate the tarball for the tox task, instead of just letting tox generate its own sdist tarball, which it is designed to do and is the default behavior. (Also, there's the copying of source files, which adds more overhead, and it's unclear why this exists since tox is designed to create isolated virtualenvs per task environment.)

In the definition of `toxTask` above, you see that it depends on `setupVirtualenv` and also on `sdist`, and `sdist` itself depends on `setupVirtualenv`. That's why we end up running `setupVirtualenv` twice for each tox task run, and then tox creates yet another virtualenv to actually run the task. That's 3 virtualenvs per tox job! I've used Travis CI to set up a number of tox-based CI jobs, and the typical workflow is to ensure that tox is installed on the worker image; then there's only one virtualenv created per job, the one created by tox.

Anyway, long story short, the `build-requirements.txt` line is required for the tox tasks to work.
I agree that the gradle config is not the most efficient. The base requirements for running Python tests on gradle are pip, virtualenv, and 4 versions of the Python interpreter. Tox is installed by `setupVirtualenv`. `:sdks:python:sdist` is run only once (IIUC) for all toxTasks, since running more than one sdist concurrently is racy and makes our tests flaky.
CC: @markflyhigh who set this up.
Thinking about this now, I have a hunch that the tarball in use doesn't ever include Cythonized sources (since they are created by sdist when the cython package is present).
Source files are copied due to isolation requirements. The Cythonization process adds files which are automatically picked up if present. Perhaps a better way to do this is to exclude C files from sdist for the non-cython envs, but this is the way we do it currently.
So we agree there's room for improvement. The question for the moment is, is my current solution ok?
Hmm... Especially once we move to Python 3, writing the return type part of

Enabling
Hello reviewers, once I have those answers, your feedback will be integrated into the smaller "part 1" PR here: #9915. Thanks!
This pull request has been marked as stale due to 60 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that's incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the dev@beam.apache.org list. Thank you for your contributions.

Keeping this as the master PR for typing. #10367 is currently under review.

This pull request has been closed due to lack of activity. If you think that is incorrect, or the pull request requires review, you can revive the PR at any time.
EDIT: jump to the latest conversation here: #9056 (comment)
This is an early look at type annotations for the beam source. It is not attempting to address the issue of porting beam's runtime typing from its own internal `typehints` module to the official `typing` module, so perhaps I should open a new ticket just for type hinting the source.
EDIT: done. https://issues.apache.org/jira/browse/BEAM-7746
There's a lot left to do, including getting this hooked up to tests, but I wanted to get it out there so that A) I can get feedback before I get too far along, and B) in case anyone wants to join forces with me.
Some points for discussion / debate:
type annotation style
In python 2.7, we're limited to type comments, and it can be quite difficult to keep them under the 80 character limit, since they're not very flexible. I'm following the lead of typing-heavy projects like mypy and using the following style for functions with more than one or two arguments:
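For illustration, the multi-argument comment style referenced here looks roughly like this (the function and its types are hypothetical, not from the beam codebase):

```python
from typing import Iterable, Optional, Tuple


def split_words(line,             # type: str
                lowercase=False,  # type: bool
                max_words=None,   # type: Optional[int]
                ):
    # type: (...) -> Iterable[Tuple[str, int]]
    """Yield (word, length) pairs from a line of text."""
    words = line.lower().split() if lowercase else line.split()
    for word in words[:max_words]:
        yield (word, len(word))
```

Each argument carries its own `# type:` comment, and the signature-level comment uses `(...)` so only the return type has to fit on one line.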
typing module import
Some projects establish a convention to import only the typing objects required for the current module, e.g.:
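For example, the two conventions look something like this (the functions are illustrative only):

```python
# Convention A: import only the names you need from typing.
from typing import Iterable, Union


def head(items):
    # type: (Union[Iterable[str], str]) -> str
    return next(iter(items))


# Convention B: import the module and qualify every use, which
# avoids clashing with beam's own `typehints` names.
import typing


def head_qualified(items):
    # type: (typing.Union[typing.Iterable[str], str]) -> str
    return next(iter(items))
```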
Since beam has its own internal `typehints` module, I'm importing the `typing` module in the same fashion to avoid conflicts/confusion. However, this approach has the downside of making some type comments a lot longer and less legible, e.g. `typing.Union[typing.Iterable[str], str]`, which may make it impossible to stay below 80 characters in some cases.

type checking pipeline construction
The type hints are useful first-and-foremost for developers of beam, but they should also prove useful for consumers of the beam API, i.e. developers of beam pipelines. Though it's a secondary concern at the moment, I'm trying to ensure that the use case of pipeline creation is covered, so that mypy will generate proper errors for invalid combinations of transforms prior to submission, but due to a bug/weakness this is harder than expected. I'm making progress writing a mypy plugin to patch up the problem, but it's not done yet. Using the results of this process for runtime typing (e.g. to lookup coders) is out of scope for this PR.
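As a rough illustration of the goal (using a toy stand-in, not beam's actual API), generic annotations let mypy track element types through a chain of transforms and flag invalid combinations at construction time:

```python
from typing import Callable, Generic, List, TypeVar

T = TypeVar('T')
U = TypeVar('U')


class PColl(Generic[T]):
    """Toy stand-in for a PCollection, for illustration only."""

    def __init__(self, elements):
        # type: (List[T]) -> None
        self.elements = elements

    def apply(self, fn):
        # type: (Callable[[T], U]) -> PColl[U]
        return PColl([fn(e) for e in self.elements])


def word_len(word):
    # type: (str) -> int
    return len(word)


lines = PColl(['abc', 'de'])     # inferred as PColl[str]
lengths = lines.apply(word_len)  # inferred as PColl[int]
# lengths.apply(word_len)  # mypy error: word_len expects str, got int
```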
str vs unicode
Currently I'm glossing over this problem. The correct way to annotate strings for both python 2 and 3 is to use `typing.Text` and `typing.AnyStr`, but it's a huge pain in the ass, in particular because it's often unclear where unicode is valid or invalid (I've also run into inaccuracies in the python typeshed wrt strings). In my own projects, I just annotate everything as `str` and handle the few cases where mypy complains about `unicode` by using `typing.cast(str, whatever)`, since the problem will get a lot simpler when we switch to python3, and `bytes` becomes the exception. I'm open to suggestions here.

I'll probably think of more later, but I'm out of time and want to get this out the door!
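A sketch of the trade-off (function names are made up): `typing.Text` accepts any text string (`str` or `unicode` on python 2), while `typing.AnyStr` ties the argument and return types together:

```python
from typing import AnyStr, Text


def normalize(line):
    # type: (Text) -> Text
    """Accepts str or unicode on python 2, str on python 3."""
    return line.strip().lower()


def strip_prefix(data, prefix):
    # type: (AnyStr, AnyStr) -> AnyStr
    """AnyStr preserves the string type: bytes in, bytes out;
    text in, text out."""
    if data.startswith(prefix):
        return data[len(prefix):]
    return data
```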