diff --git a/website/www/site/content/en/blog/python-improved-annotations.md b/website/www/site/content/en/blog/python-improved-annotations.md
new file mode 100644
index 0000000000000..775c5009264ca
--- /dev/null
+++ b/website/www/site/content/en/blog/python-improved-annotations.md
@@ -0,0 +1,110 @@
---
layout: post
title: "Improved Annotation Support for the Python SDK"
date: 2020-08-21 00:00:01 -0800
categories:
  - blog
  - python
  - typing
authors:
  - saavan
---

The importance of static type checking in a dynamically
typed language like Python is hard to overstate. Type hints
allow developers to leverage a strong typing system to:
 - write better code,
 - self-document ambiguous programming logic, and
 - inform intelligent code completion in IDEs like PyCharm.

This is why we're excited to announce upcoming improvements to
the `typehints` module of Beam's Python SDK, including support
for typed PCollections and Python 3 style annotations on PTransforms.

# Improved Annotations
Today, you have two options for declaring type hints on PTransforms:
class decorators and inline functions.

For instance, a PTransform with decorated type hints might look like this:
```
@beam.typehints.with_input_types(int)
@beam.typehints.with_output_types(str)
class IntToStr(beam.PTransform):
  def expand(self, pcoll):
    return pcoll | beam.Map(lambda num: str(num))

strings = numbers | IntToStr()
```

Using inline functions instead, the same transform looks like this:
```
class IntToStr(beam.PTransform):
  def expand(self, pcoll):
    return pcoll | beam.Map(lambda num: str(num))

strings = numbers | IntToStr().with_input_types(int).with_output_types(str)
```

Both methods have problems. Class decorators are syntax-heavy,
requiring two additional lines of code, whereas inline functions provide type hints
that aren't reusable across other instances of the same transform.
Additionally, both
methods are incompatible with static type checkers like MyPy.

With Python 3 annotations, however, we can sidestep these problems to provide a
clean and reusable type hint experience. Our previous transform now looks like this:
```
class IntToStr(beam.PTransform):
  def expand(self, pcoll: PCollection[int]) -> PCollection[str]:
    return pcoll | beam.Map(lambda num: str(num))

strings = numbers | IntToStr()
```

These type hints hook into Beam's internal typing system to
play a role in both pipeline type checking and runtime type checking.

So how does this work?

## Typed PCollections
You guessed it! The PCollection class inherits from `typing.Generic`, allowing it to be
parameterized with either zero types (denoted `PCollection`) or one type (denoted `PCollection[T]`).
- A PCollection with zero types is implicitly converted to `PCollection[Any]`.
- A PCollection with one type can have any nested type (e.g. `Union[int, str]`).

Internally, Beam's typing system makes these annotations compatible with other
type hints by removing the outer PCollection container.

## PBegin, PDone, None
Finally, besides PCollection, the other valid annotations on the `expand(...)` method of a
PTransform are `PBegin`, `PDone`, and `None`. These are generally used for PTransforms
that begin or end with an I/O operation.

For instance, when saving data, your transform's output type should be `None`.

```
class SaveResults(beam.PTransform):
  def expand(self, pcoll: PCollection[str]) -> None:
    return pcoll | beam.io.WriteToBigQuery(...)
```

# Next Steps
What are you waiting for? Start using annotations on your transforms!

For more background on type hints in Python, see:
[Ensuring Python Type Safety](https://beam.apache.org/documentation/sdks/python-type-safety/).

Finally, please
[let us know](https://beam.apache.org/community/contact-us/)
if you encounter any issues.
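The container-stripping behavior described in the Typed PCollections section can be illustrated with a small, stdlib-only sketch. Note that `FakePCollection` and `strip_container` are hypothetical names invented for this example — this is not Beam's actual implementation, only a demonstration of how the type parameter of a `typing.Generic` subclass can be recovered and unwrapped:

```python
from typing import Any, Generic, TypeVar, get_args, get_origin, get_type_hints

T = TypeVar("T")


class FakePCollection(Generic[T]):
    """Stand-in for a PCollection-like generic container (illustrative only)."""


def strip_container(hint):
    """Remove the outer FakePCollection[...] wrapper from a type hint,
    loosely mimicking how a typing system could unwrap the container."""
    if get_origin(hint) is FakePCollection:
        args = get_args(hint)
        return args[0] if args else Any
    if hint is FakePCollection:
        # A bare, unparameterized container behaves like FakePCollection[Any].
        return Any
    return hint


def expand(pcoll: FakePCollection[int]) -> FakePCollection[str]:
    ...


hints = get_type_hints(expand)
input_type = strip_container(hints["pcoll"])
output_type = strip_container(hints["return"])
print(input_type, output_type)  # prints: <class 'int'> <class 'str'>
```

The key observation is that once the container is stripped, the remaining hints (`int`, `str`, `Union[...]`, and so on) are ordinary type hints that can be compared with any other typing machinery.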
diff --git a/website/www/site/content/en/blog/python-performance-runtime-type-checking.md b/website/www/site/content/en/blog/python-performance-runtime-type-checking.md
new file mode 100644
index 0000000000000..d9124909e3b3f
--- /dev/null
+++ b/website/www/site/content/en/blog/python-performance-runtime-type-checking.md
@@ -0,0 +1,154 @@
---
layout: post
title: "Performance-Driven Runtime Type Checking for the Python SDK"
date: 2020-08-21 00:00:01 -0800
categories:
  - blog
  - python
  - typing
authors:
  - saavan
---

In this blog post, we're announcing the upcoming release of a new, opt-in
runtime type checking system for Beam's Python SDK that's optimized for performance
in both development and production environments.

But let's take a step back: why do we even care about runtime type checking
in the first place? Let's look at an example.

```
class MultiplyNumberByTwo(beam.DoFn):
  def process(self, element: int):
    yield element * 2

p = Pipeline()
p | beam.Create(['1', '2']) | beam.ParDo(MultiplyNumberByTwo())
```

In this code, we passed a list of strings to a DoFn that's clearly intended for use with
integers. Luckily, this code throws an error during pipeline construction because
the inferred output type of `beam.Create(['1', '2'])` is `str`, which is incompatible with
the declared input type of `MultiplyNumberByTwo.process`, which is `int`.

However, what if we turned pipeline type checking off using the `no_pipeline_type_check`
flag? Or, more realistically, what if the input PCollection to `MultiplyNumberByTwo` arrived
from a database, meaning that its element type can only be known at runtime?

In either case, no error would be thrown during pipeline construction.
And even at runtime, this code works. Each string would be multiplied by 2,
yielding a result of `['11', '22']`, but that's certainly not the outcome we want.

So how do you debug this breed of "hidden" errors?
More broadly speaking, how do you debug
any typing or serialization error in Beam?

The answer is to use runtime type checking.

# Runtime Type Checking (RTC)
This feature works by checking that actual input and output values satisfy the declared
type constraints during pipeline execution. If you ran the code from before with
`runtime_type_check` on, you would receive the following error message:

```
Type hint violation for 'ParDo(MultiplyNumberByTwo)': requires <class 'int'> but got <class 'str'> for element
```

This is an actionable error message: it tells you that either your code has a bug
or that your declared type hints are incorrect. Sounds simple enough, so what's the catch?

_It is soooo slowwwwww._ See for yourself.

| Element Count | Normal Pipeline | Runtime Type Checking Pipeline |
| ------------- | --------------- | ------------------------------ |
| 1             | 5.3 sec         | 5.6 sec                        |
| 2,001         | 9.4 sec         | 57.2 sec                       |
| 10,001        | 24.5 sec        | 259.8 sec                      |
| 18,001        | 38.7 sec        | 450.5 sec                      |

In this micro-benchmark, the pipeline with runtime type checking was over 10x slower,
with the gap only widening as the input PCollection grew in size.

So, is there any production-friendly alternative?

# Performance Runtime Type Check
There is! We developed a new flag called `performance_runtime_type_check` that
minimizes the feature's runtime footprint using a combination of:
- efficient Cython code,
- smart sampling techniques, and
- optimized mega type hints.

So what do the new numbers look like?

| Element Count | Normal    | RTC        | Performance RTC |
| ------------- | --------- | ---------- | --------------- |
| 1             | 5.3 sec   | 5.6 sec    | 5.4 sec         |
| 2,001         | 9.4 sec   | 57.2 sec   | 11.2 sec        |
| 10,001        | 24.5 sec  | 259.8 sec  | 25.5 sec        |
| 18,001        | 38.7 sec  | 450.5 sec  | 39.4 sec        |

On average, the new Performance RTC is 4.4% slower than a normal pipeline, whereas the old RTC
is over 900% slower!
Additionally, as the size of the input PCollection increases, the fixed cost
of setting up the Performance RTC system is amortized across more elements, decreasing its
relative impact on the overall pipeline. With 18,001 elements, the difference is less than 1 second.

## How does it work?
There are three key factors responsible for this upgrade in performance.

1. Instead of type checking all values, we only type check a subset of values, known as
a sample in statistics. Initially, we sample a substantial number of elements, but as our
confidence increases that the element type won't change over time, we reduce our
sampling rate (down to a fixed minimum).

2. Whereas the old RTC system used heavy wrappers to perform the type check, the new RTC system
moves the type check to a Cython-optimized, non-decorated portion of the codebase. For reference,
Cython is a programming language that gives C-like performance to Python code.

3. Finally, we use a single mega type hint to type check only the output values of transforms
instead of type checking the input and output values separately. This mega type hint is composed of
the original transform's output type constraints along with all consumer transforms' input type
constraints. Using this mega type hint allows us to reduce overhead while simultaneously allowing
us to throw _more actionable errors_. For instance, consider the following error (which was
generated by the old RTC system):
```
Runtime type violation detected within ParDo(DownstreamDoFn): Type-hint for argument: 'element' violated. Expected an instance of <class 'str'>, instead found 9, an instance of <class 'int'>.
```

This error tells us that the `DownstreamDoFn` received an `int` when it was expecting a `str`, but it doesn't tell us
who created that `int` in the first place. Which offending upstream transform is responsible for
this `int`? Presumably, _that_ transform's output type hints were too expansive (e.g.
`Any`) or otherwise nonexistent, because
no error was thrown during the runtime type check of its output.

The problem here boils down to a lack of context. If we knew who our consumers were when type
checking our output, we could simultaneously type check our output value against our own output type
constraints and every consumer's input type constraints, and know whether there is _any_ possibility
of a mismatch. This is exactly what the mega type hint does, and it allows us to throw errors
at the point of declaration rather than the point of exception, saving you valuable time
while providing higher-quality error messages.

So what would the same error look like using Performance RTC? It's the exact same string but with one additional line:
```
[while running 'ParDo(UpstreamDoFn)']
```

And that's much more actionable for an investigation :)

# Next Steps
Go play with the new `performance_runtime_type_check` feature!

It's in an experimental state, so please
[let us know](https://beam.apache.org/community/contact-us/)
if you encounter any issues.

diff --git a/website/www/site/data/authors.yml b/website/www/site/data/authors.yml
index a5c966d51158f..0a9740b589098 100644
--- a/website/www/site/data/authors.yml
+++ b/website/www/site/data/authors.yml
@@ -160,4 +160,8 @@ pedro:
 rionmonster:
   name: Rion Williams
   email: rionmonster@gmail.com
-  twitter: rionmonster
\ No newline at end of file
+  twitter: rionmonster
+saavannanavati:
+  name: Saavan Nanavati
+  email: saavan.nanavati@utexas.edu
+  twitter:
\ No newline at end of file
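As a footnote to the runtime type-checking post above, the decaying sampling idea from point 1 of "How does it work?" can be sketched in plain Python. Everything here (`SampledTypeChecker`, the decay factor, the minimum rate) is hypothetical and invented for illustration — it is not Beam's implementation, just a minimal demonstration of trading check coverage for throughput:

```python
import random


class SampledTypeChecker:
    """Illustrative sketch of sampling-based runtime type checking with a
    decaying sampling rate. Hypothetical code, not Beam's implementation."""

    def __init__(self, expected_type, min_rate=0.05, decay=0.9):
        self.expected_type = expected_type
        self.min_rate = min_rate  # never sample below this rate
        self.decay = decay        # multiplicative decay per successful check
        self.rate = 1.0           # start by checking every element
        self.checked = 0

    def maybe_check(self, value):
        # Skip the check for most elements once confidence is high.
        if random.random() >= self.rate:
            return
        self.checked += 1
        if not isinstance(value, self.expected_type):
            raise TypeError(
                f"expected {self.expected_type.__name__}, "
                f"got {value!r} (an instance of {type(value).__name__})"
            )
        # Each successful check raises our confidence that the element type
        # is stable, so decay the sampling rate toward the minimum.
        self.rate = max(self.min_rate, self.rate * self.decay)


checker = SampledTypeChecker(int)
for element in range(1000):
    checker.maybe_check(element)
# Far fewer than 1000 of the elements were actually type checked,
# which is the entire point: the per-element overhead shrinks over time.
```

A mistyped element that happens to be sampled still fails loudly: `SampledTypeChecker(str).maybe_check(9)` raises a `TypeError` on the very first call, since the initial rate of 1.0 guarantees the first element is checked.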