apache · udim · Aug 25, 2020 · Aug 21, 2020 · Aug 21, 2020 · Aug 21, 2020
diff --git a/website/www/site/content/en/blog/python-improved-annotations.md b/website/www/site/content/en/blog/python-improved-annotations.md
@@ -0,0 +1,110 @@
+---
+layout: post
+title:  "Improved Annotation Support for the Python SDK"
+date:   2020-08-21 00:00:01 -0800
+categories:
+  - blog
+  - python
+  - typing
+authors:
+  - saavan
+---
+<!--
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+-->
+
+The importance of static type checking in a dynamically
+typed language like Python is not up for debate. Type hints
+allow developers to leverage a strong typing system to:
+ - write better code,
+ - self-document ambiguous programming logic, and
+ - inform intelligent code completion in IDEs like PyCharm.
+
+This is why we're excited to announce upcoming improvements to
+the `typehints` module of Beam's Python SDK, including support
+for typed PCollections and Python 3 style annotations on PTransforms.
+
+# Improved Annotations
+Today, you have the option to declare type hints on PTransforms using either
+class decorators or inline functions.
+
+For instance, a PTransform with decorated type hints might look like this:
+```
+@beam.typehints.with_input_types(int)
+@beam.typehints.with_output_types(str)
+class IntToStr(beam.PTransform):
+    def expand(self, pcoll):
+        return pcoll | beam.Map(lambda num: str(num))
+
+strings = numbers | beam.ParDo(IntToStr())
+```
+
+Using inline functions instead, the same transform would look like this:
+```
+class IntToStr(beam.PTransform):
+    def expand(self, pcoll):
+        return pcoll | beam.Map(lambda num: str(num))
+
+strings = numbers | beam.ParDo(IntToStr()).with_input_types(int).with_output_types(str)
+```
+
+Both methods have problems. Class decorators are syntax-heavy,
+requiring two additional lines of code, whereas inline functions provide type hints
+that aren't reusable across other instances of the same transform. Additionally, both
+methods are incompatible with static type checkers like MyPy.
+
+With Python 3 annotations however, we can subvert these problems to provide a
+clean and reusable type hint experience. Our previous transform now looks like this:
+```
+class IntToStr(beam.PTransform):
+    def expand(self, pcoll: PCollection[int]) -> PCollection[str]:
+        return pcoll | beam.Map(lambda num: str(num))
+
+strings = numbers | beam.ParDo(IntToStr())
+```
+
+These type hints will actively hook into the internal Beam typing system to
+play a role in pipeline type checking, and runtime type checking.
+
+So how does this work?
+
+## Typed PCollections
+You guessed it! The PCollection class inherits from `typing.Generic`, allowing it to be
+parameterized with either zero types (denoted `PCollection`) or one type (denoted `PCollection[T]`).
+- A PCollection with zero types is implicitly converted to `PCollection[Any]`.
+- A PCollection with one type can have any nested type (e.g. `Union[int, str]`).
+
+Internally, Beam's typing system makes these annotations compatible with other
+type hints by removing the outer PCollection container.
+
+## PBegin, PDone, None
+Finally, besides PCollection, a valid annotation on the `expand(...)` method of a PTransform is
+`PBegin` or `None`. These are generally used for PTransforms that begin or end with an I/O operation.
+
+For instance, when saving data, your transform's output type should be `None`.
+
+```
+class SaveResults(beam.PTransform):
+    def expand(self, pcoll: PCollection[str]) -> None:
+        return pcoll | beam.io.WriteToBigQuery(...)
+```
+
+# Next Steps
+What are you waiting for.. start using annotations on your transforms!
+
+For more background on type hints in Python, see:
+[Ensuring Python Type Safety](https://beam.apache.org/documentation/sdks/python-type-safety/).
+
+Finally, please
+[let us know](https://beam.apache.org/community/contact-us/)
+if you encounter any issues.
diff --git a/website/www/site/content/en/blog/python-performance-runtime-type-checking.md b/website/www/site/content/en/blog/python-performance-runtime-type-checking.md
@@ -0,0 +1,154 @@
+---
+layout: post
+title:  "Performance-Driven Runtime Type Checking for the Python SDK"
+date:   2020-08-21 00:00:01 -0800
+categories:
+  - blog
+  - python
+  - typing
+authors:
+  - saavan
+---
+<!--
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+-->
+
+In this blog post, we're announcing the upcoming release of a new, opt-in
+runtime type checking system for Beam's Python SDK that's optimized for performance
+in both development and production environments.
+
+But let's take a step back - why do we even care about runtime type checking
+in the first place? Let's look at an example.
+
+```
+class MultiplyNumberByTwo(beam.DoFn):
+    def process(self, element: int):
+        return element * 2
+
+p = Pipeline()
+p | beam.Create(['1', '2'] | beam.ParDo(MultiplyNumberByTwo())
+```
+
+In this code, we passed a list of strings to a DoFn that's clearly intended for use with
+integers. Luckily, this code will throw an error during pipeline construction because
+the inferred output type of `beam.Create(['1', '2'])` is `str` which is incompatible with
+the declared input type of `MultiplyNumberByTwo.process` which is `int`.
+
+However, what if we turned pipeline type checking off using the `no_pipeline_type_check`
+flag? Or more realistically, what if the input PCollection to `MultiplyNumberByTwo` arrived
+from a database, meaning that the output data type can only be known at runtime?
+
+In either case, no error would be thrown during pipeline construction.
+And even at runtime, this code works. Each string would be multiplied by 2,
+yielding a result of `['11', '22']`, but that's certainly not the outcome we want.
+
+So how do you debug this breed of "hidden" errors? More broadly speaking, how do you debug
+any typing or serialization error in Beam?
+
+The answer is to use runtime type checking.
+
+# Runtime Type Checking (RTC)
+This feature works by checking that actual input and output values satisfy the declared
+type constraints during pipeline execution. If you ran the code from before with
+`runtime_type_check` on, you would receive the following error message:
+
+```
+Type hint violation for 'ParDo(MultiplyByTwo)': requires <class 'int'> but got <class 'str'> for element
+```
+
+This is an actionable error message - it tells you that either your code has a bug
+or that your declared type hints are incorrect. Sounds simple enough, so what's the catch?
+
+_It is soooo slowwwwww._ See for yourself.
+
+
+| Element Size | Normal Pipeline | Runtime Type Checking Pipeline
+| ------------ | --------------- | ------------------------------
+| 1            | 5.3 sec         | 5.6 sec
+| 2,001        | 9.4 sec         | 57.2 sec
+| 10,001       | 24.5 sec        | 259.8 sec
+| 18,001       | 38.7 sec        | 450.5 sec
+
+In this micro-benchmark, the pipeline with runtime type checking was over 10x slower,
+with the gap only increasing as our input PCollection increased in size.
+
+So, is there any production-friendly alternative?
+
+# Performance Runtime Type Check
+There is! We developed a new flag called `performance_runtime_type_check` that
+minimizes its footprint on the pipeline's time complexity using a combination of
+- efficient Cython code,
+- smart sampling techniques, and
+- optimized mega type-hints.
+
+So what do the new numbers look like?
+
+| Element Size | Normal    | RTC        | Performance RTC
+| -----------  | --------- | ---------- | ---------------
+| 1            | 5.3 sec   | 5.6 sec    | 5.4 sec
+| 2,001        | 9.4 sec   | 57.2 sec   | 11.2 sec
+| 10,001       | 24.5 sec  | 259.8 sec  | 25.5 sec
+| 18,001       | 38.7 sec  | 450.5 sec  | 39.4 sec
+
+On average, the new Performance RTC is 4.4% slower than a normal pipeline whereas the old RTC
+is over 900% slower! Additionally, as the size of the input PCollection increases, the fixed cost
+of setting up the Performance RTC system is spread across each element, decreasing the relative
+impact on the overall pipeline. With 18,001 elements, the difference is less than 1 second.
+
+## How does it work?
+There are three key factors responsible for this upgrade in performance.
+
+1. Instead of type checking all values, we only type check a subset of values, known as
+a sample in statistics. Initially, we sample a substantial number of elements, but as our
+confidence that the element type won't change over time increases, we reduce our
+sampling rate (up to a fixed minimum).
+
+2. Whereas the old RTC system used heavy wrappers to perform the type check, the new RTC system
+moves the type check to a Cython-optimized, non-decorated portion of the codebase. For reference,
+Cython is a programming language that gives C-like performance to Python code.
+
+3. Finally, we use a single mega type hint to type-check only the output values of transforms
+instead of type-checking both the input and output values separately. This mega typehint is composed of
+the original transform's output type constraints along with all consumer transforms' input type
+constraints. Using this mega type hint allows us to reduce overhead while simultaneously allowing
+us to throw _more actionable errors_. For instance, consider the following error (which was
+generated from the old RTC system):
+```
+Runtime type violation detected within ParDo(DownstreamDoFn): Type-hint for argument: 'element' violated. Expected an instance of <class ‘str’>, instead found 9, an instance of <class ‘int’>.
+```
+
+This error tells us that the `DownstreamDoFn` received an `int` when it was expecting a `str`, but doesn't tell us
+who created that `int` in the first place. Who is the offending upstream transform that's responsible for
+this `int`? Presumably, _that_ transform's output type hints were too expansive (e.g. `Any`) or otherwise non-existent because
+no error was thrown during the runtime type check of its output.
+
+The problem here boils down to a lack of context. If we knew who our consumers were when type
+checking our output, we could simultaneously type check our output value against our output type
+constraints and every consumers' input type constraints to know whether there is _any_ possibility
+for a mismatch. This is exactly what the mega type hint does, and it allows us to throw errors
+at the point of declaration rather than the point of exception, saving you valuable time
+while providing higher quality error messages.
+
+So what would the same error look like using Performance RTC? It's the exact same string but with one additional line:
+```
+[while running 'ParDo(UpstreamDoFn)']
+```
+
+And that's much more actionable for an investigation :)
+
+# Next Steps
+Go play with the new `performance_runtime_type_check` feature!
+
+It's in an experimental state so please
+[let us know](https://beam.apache.org/community/contact-us/)
+if you encounter any issues.
diff --git a/website/www/site/data/authors.yml b/website/www/site/data/authors.yml
@@ -160,4 +160,8 @@ pedro:
 rionmonster:
   name: Rion Williams
   email: rionmonster@gmail.com
-  twitter: rionmonster
+  twitter: rionmonster
+saavannanavati:
+  name: Saavan Nanavati
+  email: saavan.nanavati@utexas.edu
+  twitter: