-
Notifications
You must be signed in to change notification settings - Fork 4.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[BEAM-7746] Add python type hints (part 2) #10367
Conversation
Some of these were already reviewed in the overarching PR (#9056), but I did not include them in part 1 because I was trying to focus on type comments in part 1. I tried to give some context for each change in the commit description. It may be worth considering not squashing history when merging this. |
d25cf73
to
d7514b2
Compare
Thanks Chad. I decided to squash the previous PR because I saw some commits containing 'fixup' type comments. Sorry if that was not the intention (guess I should have checked..). Thanks for explicitly requesting the 'merge' option this time : ) |
No worries. I should have said something earlier. I got lazy at the very
end.
…On Thu, Dec 12, 2019 at 11:29 AM Pablo ***@***.***> wrote:
Thanks Chad. I decided to squash the previous PR because I saw some
commits containing 'fixup' type comments. Sorry if that was not the
intention (guess I should have checked..). Thanks for explicitly requesting
the 'merge' option this time : )
—
You are receiving this because you authored the thread.
Reply to this email directly, view it on GitHub
<#10367?email_source=notifications&email_token=AAAPOEY22QERH3HK7OJFD2TQYKGINA5CNFSM4J2CDVPKYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOEGXYDIA#issuecomment-565150112>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/AAAPOE6L4KKUP2UGEXRCNKTQYKGINANCNFSM4J2CDVPA>
.
|
Hi, now that the holidays are over, I'd like to bump this to the tops of people's review queue! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
partial review, about half way done
@udim thanks for the review! very good questions. |
btw, I made some edits to my answers to clarify them, so you should continue the review via github rather than email. |
d7514b2
to
f15424d
Compare
@@ -808,7 +810,7 @@ class AppliedPTransform(object): | |||
|
|||
def __init__(self, | |||
parent, | |||
transform, # type: ptransform.PTransform | |||
transform, # type: Optional[ptransform.PTransform] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Where is this optional?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
in Pipeline.__init__
:
# Stack of transforms generated by nested apply() calls. The stack will
# contain a root node as an enclosing (parent) node for top transforms.
self.transforms_stack = [AppliedPTransform(None, None, '', None)]
Best way to deal with this may be a special RootAppliedTransform
subclass.
It can also possibly be None
in AppliedPTransform.from_runner_api()
:
transform = ptransform.PTransform.from_runner_api(proto.spec, context)
result = AppliedPTransform(
parent=None,
transform=transform,
full_label=proto.unique_name,
inputs=main_inputs)
This is because PTransform.from_runner_api()
returns Optional[PTransform]
@classmethod
def from_runner_api(cls,
proto, # type: Optional[beam_runner_api_pb2.FunctionSpec]
context # type: PipelineContext
):
# type: (...) -> Optional[PTransform]
if proto is None or not proto.urn:
return None
parameter_type, constructor = cls._known_urns[proto.urn]
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
On the second issue, I should point that mypy knows that proto.spec
is not None
when we call PTransform. from_runner_api(proto.spec, context)
(because proto.spec
is always non-None), and we could almost use that knowledge to solve this problem with @overload
s of PTransform.from_runner_api()
, like this:
@overload
@classmethod
def from_runner_api(cls,
proto, # type: None
context # type: PipelineContext
):
# type: (...) -> None
pass
@overload
@classmethod
def from_runner_api(cls,
proto, # type: beam_runner_api_pb2.FunctionSpec
context # type: PipelineContext
):
# type: (...) -> PTransform
pass
@classmethod
def from_runner_api(cls, proto, context):
if proto is None or not proto.urn:
return None
Unfortunately, typing can't track whether the value of proto.urn
is an empty string, which means that the above overload strategy doesn't actually work. Is there any chance that this could be changed to if proto is None or proto.urn is None
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think it'd ever be None, given that an unset proto field is the default value of that field.
@@ -437,13 +439,15 @@ def invoke_start_bundle(self): | |||
# type: () -> None | |||
"""Invokes the DoFn.start_bundle() method. | |||
""" | |||
assert self.output_processor is not None |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These asserts should be fast, but have you verified this doesn't impact performance (given this is called for every element of every transform). Or is there another way to declare this for critical parts of the code. (In particular, isinstance can be slow. It's not common convention to disable asserts in Python.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have not done performance testing on this. Is there a way that we can invoke the perf suite on Jenkins from github?
The underlying issue in this case is that it's possible to instantiate a DoFnInvoker
without an output_processor
and that's considered ok (by us) as long as you don't call any of the methods that use the output_processor
. If you do, then it would raise an exception. The asserts obviously also raise an exception, but it serves as a way to communicate to mypy that you're aware of it, and it can adjust its type analysis within that scope.
Solutions
Easy: Judiciously add type: ignore
comments. I say "judiciously" because ignoring an error does not change the type analysis, so similar errors can pop up nearby in the same scope. In this particular case the methods are brief, so a type: ignore
comment would suffice.
Not Easy: Rework the code so there's a subclass of DoFnInvoker
that always has a non-optional output_processor
, and only this class possesses these "safe if you're careful" methods, which would, under that design, always be safe.
The easy solution is fine for this particular case (we're aware there could be an error, but we accept it), but it's not a general solution to this problem. Choosing the right solution for each case takes some consideration.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we can get rid of output_processor
altogether in this code... (It's a bit ugly ever since it was introduced, this typing confirms that.) May not be the case everywhere though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, how about I change these to type: ignore
and you create a ticket to remove output_processor
? I would make the ticket but I don't think I have enough context to explain it properly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sgtm
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is done. I ended up leaving one assert in invoke_process
because it was conditional, and by using one assert I avoided 3 ignores that cluttered up the code. Let me know what you think.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, I've finished going through all the files. Overall it looks good, just a couple more comments.
@@ -265,7 +265,7 @@ def finish(self): | |||
|
|||
class _StateBackedIterable(object): | |||
def __init__(self, | |||
state_handler, | |||
state_handler, # type: sdk_worker.CachingStateHandler |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is the 'Caching' part necessary here (even if it always is right now)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
CachingStateHandler
does not inherit from StateHandler
nor does it implement its abstract methods
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh... :(
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, maybe worth creating another issue for this. Could be a nice entry-level task.
@@ -1130,19 +1132,26 @@ def process(self, windowed_value): | |||
|
|||
@BeamTransformFactory.register_urn( | |||
DATA_INPUT_URN, beam_fn_api_pb2.RemoteGrpcPort) | |||
def create(factory, transform_id, transform_proto, grpc_port, consumers): | |||
def create_source_runner( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These registered constructors (necessarily) all have the same signature. Is there a way to declare that in a common place? (The return type is always Operation, what type is not ever introspected.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the best way to reduce the noise here would be make the registration more object oriented.
Quick sketch:
class OpCreator(Generic[OperatorT]):
def __init__(
self,
factory, # type: BeamTransformFactory
transform_id, # type: str
transform_proto, # type: beam_runner_api_pb2.PTransform
consumers # type: Dict[str, List[operations.Operation]]
):
self.factory = factory
self.transform_id = transform_id
self.transform_proto = transform_proto
self.consumers = consumers
def create(self, parameter):
# type: (Any) -> OperatorT
raise NotImplementedError
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's an idea. It would still add a level of indirection and boilerplate...
@overload | ||
def register_urn(cls, | ||
urn, # type: str | ||
parameter_type, # type: None |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Idea: could we unify these and update all callers that currently pass None to pass bytes?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
that would be nice from a typing simplicity standpoint. not sure about the implications of that though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you can make the changes ignore OutputProcessor and update the code to reflect that pipeline=None is transient, this looks good to go into me. Thanks.
f15424d
to
9d81cce
Compare
Run all tests |
Everything looks good except a little bit of lint: https://builds.apache.org/job/beam_PreCommit_PythonLint_Commit/1894/ |
Minor adjustments to runtime code required to silence certain errors. Two common patterns: - explicitly return None from functions that also return non-None - assert that optional attributes are non-None before using them, if there are no other conditionals present to ensure this.
…e statically analyzed
…e (Tuple[str, ...])
…uild process This fixes numerous errors generated throughout the code because mypy cannot track the dynamic setattr binding that was done by common_urns. The change also necessitated updating a few doctrings to prevent this error: more than one target found for cross-reference u'DisplayData': apache_beam.transforms.display.DisplayData apache_beam.portability.api.beam_runner_api_pb2_urns.DisplayData
There are several places in the code where it is assumed that these are part of the abstract StateSpec.
…w.StateSampler.reset() statesampler_slow.StateSampler does not have _states_by_name attribute. Only its fast counterpart does.
This gives us a type that we can use to ensure all handlers meet the same protocol
Note that this means that tests that were previously being masked by other tests with the same name will now be run. There is a fix included for one such test.
9d81cce
to
f7f8792
Compare
Run all tests |
I'm a bit confused: I don't see the Jenkins jobs listed here any more. Why would that be? |
Run all tests |
(As per the discussion on the dev list, Apache Infra made a change to block all tests from being run by non-comitters.) |
Ah, I missed that conversation but that seems like a reasonable way to
conserve resources. I see the other PR that @udim made to resolve this for
me, thanks
…On Tue, Jan 14, 2020 at 11:24 AM Robert Bradshaw ***@***.***> wrote:
(As per the discussion on the dev list, Apache Infra made a change to
block all tests from being run by non-comitters.)
—
You are receiving this because you authored the thread.
Reply to this email directly, view it on GitHub
<#10367?email_source=notifications&email_token=AAAPOE6QB7WOSJWKR7MMHJTQ5YGQDA5CNFSM4J2CDVPKYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOEI5Z2DY#issuecomment-574332175>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/AAAPOE4Q6CGX5T4OI2UMTJ3Q5YGQDANCNFSM4J2CDVPA>
.
|
If they're orthogonal enough, let's make 6. |
This is part 2 of #9056.
Unlike part 1 (#9915) this goes beyond simple type comments. It introduces changes that could affect runtime behavior, though I was careful to avoid doing so, unless it's noted as a bug.
Thank you for your contribution! Follow this checklist to help us incorporate your contribution quickly and easily:
R: @username
).[BEAM-XXX] Fixes bug in ApproximateQuantiles
, where you replaceBEAM-XXX
with the appropriate JIRA issue, if applicable. This will automatically link the pull request to the issue.See the Contributor Guide for more tips on how to make review process smoother.
Post-Commit Tests Status (on master branch)
Pre-Commit Tests Status (on master branch)
See .test-infra/jenkins/README for trigger phrase, status and link of all Jenkins jobs.