diff --git a/website/www/site/assets/scss/_page-nav.sass b/website/www/site/assets/scss/_page-nav.sass
index 1228ad4f5e09b..542a4222b20a8 100644
--- a/website/www/site/assets/scss/_page-nav.sass
+++ b/website/www/site/assets/scss/_page-nav.sass
@@ -58,3 +58,5 @@
     margin-bottom: 0
     ul
       padding-left: 20
+    ul
+      display: none
diff --git a/website/www/site/config.toml b/website/www/site/config.toml
index 728bb0f6be91c..bc03fde8e5928 100644
--- a/website/www/site/config.toml
+++ b/website/www/site/config.toml
@@ -33,6 +33,10 @@
 unsafe= true
 [markup.highlight]
 noClasses = false
+[markup]
+  [markup.tableOfContents]
+    endLevel = 4
+
 ## Configuration for BlackFriday markdown parser: https://github.com/russross/blackfriday
 [blackfriday]
 plainIDAnchors = true
diff --git a/website/www/site/content/en/blog/adding-data-sources-to-sql.md b/website/www/site/content/en/blog/adding-data-sources-to-sql.md
index 996066fec3ebc..0fea4db8bc9a3 100644
--- a/website/www/site/content/en/blog/adding-data-sources-to-sql.md
+++ b/website/www/site/content/en/blog/adding-data-sources-to-sql.md
@@ -80,9 +80,7 @@ The `TableProvider` classes are under

 Our table provider looks like this:

-{{% classwrapper class="language-java" %}}
-
-```java
+{{< highlight java >}}
 @AutoService(TableProvider.class)
 public class GenerateSequenceTableProvider extends InMemoryMetaTableProvider {

@@ -96,9 +94,7 @@ public class GenerateSequenceTableProvider extends InMemoryMetaTableProvider {
     return new GenerateSequenceTable(table);
   }
 }
-```
-
-{{% /classwrapper %}}
+{{< /highlight >}}

 All it does is give a type to the table - and it implements the
 `buildBeamSqlTable` method, which simply returns a `BeamSqlTable` defined by
@@ -111,9 +107,7 @@ allow users to define the number of elements to be emitted per second. We will
 define a simple table that emits sequential integers in a streaming fashion.
This looks like so: -{{% classwrapper class="language-java" %}} - -```java +{{< highlight java >}} class GenerateSequenceTable extends BaseBeamTable implements Serializable { public static final Schema TABLE_SCHEMA = Schema.of(Field.of("sequence", FieldType.INT64), Field.of("event_time", FieldType.DATETIME)); @@ -147,9 +141,7 @@ class GenerateSequenceTable extends BaseBeamTable implements Serializable { throw new UnsupportedOperationException("buildIOWriter unsupported!"); } } -``` - -{{% /classwrapper %}} +{{< /highlight >}} ## The real fun diff --git a/website/www/site/content/en/blog/beam-kotlin.md b/website/www/site/content/en/blog/beam-kotlin.md index 2b576cf62a352..03ab15bae367b 100644 --- a/website/www/site/content/en/blog/beam-kotlin.md +++ b/website/www/site/content/en/blog/beam-kotlin.md @@ -41,100 +41,68 @@ Here are few brief snippets of code that show how the Kotlin Samples compare to ### Java -{{% classwrapper class="language-java" %}} - -```java +{{< highlight java >}} String filename = String.format( "%s-%s-of-%s%s", filenamePrefixForWindow(intervalWindow), shardNumber, numShards, outputFileHints.suggestedFilenameSuffix); -``` - -{{% /classwrapper %}} +{{< /highlight >}} ### Kotlin -{{% classwrapper class="language-java" %}} - -```java +{{< highlight java >}} // String templating val filename = "$filenamePrefixForWindow(intervalWindow)-$shardNumber-of-$numShards${outputFileHints.suggestedFilenameSuffix)" -``` - -{{% /classwrapper %}} +{{< /highlight >}} ### Java -{{% classwrapper class="language-java" %}} - -```java +{{< highlight java >}} public static class FormatAsTextFn extends SimpleFunction, String> { @Override public String apply(KV input) { return input.getKey() + ": " + input.getValue(); } } -``` - -{{% /classwrapper %}} +{{< /highlight >}} ## Kotlin -{{% classwrapper class="language-java" %}} - -```java +{{< highlight java >}} public class FormatAsTextFn : SimpleFunction, String>() { override fun apply(input: KV) = "${input.key} : ${input.value}" //Single line functions } -``` - -{{% /classwrapper %}} +{{< /highlight >}} ### Java -{{% classwrapper class="language-java" %}} - -```java +{{< highlight java >}} if(tableRow != null){ formatAndInsert(tableRow); } -``` - -{{% /classwrapper %}} +{{< /highlight >}} ### Kotlin -{{% classwrapper class="language-java" %}} - -```java +{{< highlight java >}} tableRow?.let{ formatAndInsert(it) // No need for null checks } -``` - -{{% /classwrapper %}} +{{< /highlight >}} ### Java -{{% classwrapper class="language-java" %}} - -```java +{{< highlight java >}} String tableName = "testTable"; -``` - -{{% /classwrapper %}} +{{< /highlight >}} ### Kotlin -{{% classwrapper class="language-java" %}} - -```java +{{< highlight java >}} val tableName = "testTable" // Type inferencing -``` - -{{% /classwrapper %}} +{{< /highlight >}} ## Contributors Welcomed! diff --git a/website/www/site/content/en/blog/looping-timers.md b/website/www/site/content/en/blog/looping-timers.md index b4022a471681e..c53f9eb5197b1 100644 --- a/website/www/site/content/en/blog/looping-timers.md +++ b/website/www/site/content/en/blog/looping-timers.md @@ -172,9 +172,7 @@ So how do timers help? Well let's have a look at a new transform: Edit: Looping Timer State changed from Boolean to Long to allow for min value check. 
-{{% classwrapper class="language-java" %}} - -```java +{{< highlight java >}} public static class LoopingStatefulTimer extends DoFn, KV> { Instant stopTimerTime; @@ -238,9 +236,7 @@ public static class LoopingStatefulTimer extends DoFn, KV}} There are two data values that the state API needs to keep: @@ -279,9 +275,7 @@ In the @OnTimer block, the following occurs: And that's it, let's add our transform back into the pipeline: -{{% classwrapper class="language-java" %}} - -```java +{{< highlight java >}} // Apply a fixed window of duration 1 min and Sum the results p.apply(Create.timestamped(time_1, time_2, time_3)).apply( Window.>into(FixedWindows.of(Duration.standardMinutes(1)))) @@ -300,9 +294,7 @@ And that's it, let's add our transform back into the pipeline: } })); -``` - -{{% /classwrapper %}} +{{< /highlight >}} 1. In the first part of the pipeline we create FixedWindows and reduce the value per key down to a single Sum. diff --git a/website/www/site/content/en/blog/splittable-do-fn.md b/website/www/site/content/en/blog/splittable-do-fn.md index 339a63a5de317..952896f7dd15c 100644 --- a/website/www/site/content/en/blog/splittable-do-fn.md +++ b/website/www/site/content/en/blog/splittable-do-fn.md @@ -345,9 +345,7 @@ smaller restrictions, and a few others. The "Hello World" of SDF is a counter, which takes pairs *(x, N)* as input and produces pairs *(x, 0), (x, 1), …, (x, N-1)* as output. -{{% classwrapper class="language-java" %}} - -```java +{{< highlight java >}} class CountFn extends DoFn, KV> { @ProcessElement public void process(ProcessContext c, OffsetRangeTracker tracker) { @@ -365,13 +363,9 @@ class CountFn extends DoFn, KV> { PCollection> input = …; PCollection> output = input.apply( ParDo.of(new CountFn()); -``` - -{{% /classwrapper %}} +{{< /highlight >}} -{{% classwrapper class="language-py" %}} - -```py +{{< highlight py >}} class CountFn(DoFn): def process(element, tracker=DoFn.RestrictionTrackerParam) for i in xrange(*tracker.current_restriction()): @@ -381,9 +375,7 @@ class CountFn(DoFn): def get_initial_restriction(element): return (0, element[1]) -``` - -{{% /classwrapper %}} +{{< /highlight >}} This short `DoFn` subsumes the functionality of [CountingSource](https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/io/CountingSource.java), @@ -405,9 +397,7 @@ A slightly more complex example is the `ReadFn` considered above, which reads data from Avro files and illustrates the idea of *blocks*: we provide pseudocode to illustrate the approach. -{{% classwrapper class="language-java" %}} - -```java +{{< highlight java >}} class ReadFn extends DoFn { @ProcessElement void process(ProcessContext c, OffsetRangeTracker tracker) { @@ -433,13 +423,9 @@ class ReadFn extends DoFn { return new OffsetRange(0, new File(filename).getSize()); } } -``` - -{{% /classwrapper %}} +{{< /highlight >}} -{{% classwrapper class="language-py" %}} - -```py +{{< highlight py >}} class AvroReader(DoFn): def process(filename, tracker=DoFn.RestrictionTrackerParam) with fileio.ChannelFactory.open(filename) as file: @@ -459,9 +445,7 @@ class AvroReader(DoFn): def get_initial_restriction(self, filename): return (0, fileio.ChannelFactory.size_in_bytes(filename)) -``` - -{{% /classwrapper %}} +{{< /highlight >}} This hypothetical `DoFn` reads records from a single Avro file. 
Notably missing is the code for expanding a filepattern: it no longer needs to be part of this diff --git a/website/www/site/content/en/blog/stateful-processing.md b/website/www/site/content/en/blog/stateful-processing.md index c567435714ecf..1c1153dc12d0d 100644 --- a/website/www/site/content/en/blog/stateful-processing.md +++ b/website/www/site/content/en/blog/stateful-processing.md @@ -356,9 +356,7 @@ If you try to express the building of your model as a `CombineFn`, you may have trouble with `mergeAccumulators`. Assuming you could express that, it might look something like this: -{{% classwrapper class="language-java" %}} - -```java +{{< highlight java >}} class ModelFromEventsFn extends CombineFn { @Override public abstract Model createAccumulator() { @@ -379,13 +377,9 @@ class ModelFromEventsFn extends CombineFn { public abstract Model extractOutput(Model accumulator) { return accumulator; } } -``` - -{{% /classwrapper %}} - -{{% classwrapper class="language-py" %}} +{{< /highlight >}} -```py +{{< highlight py >}} class ModelFromEventsFn(apache_beam.core.CombineFn): def create_accumulator(self): @@ -400,9 +394,7 @@ class ModelFromEventsFn(apache_beam.core.CombineFn): def extract_output(self, model): return model -``` - -{{% /classwrapper %}} +{{< /highlight >}} Now you have a way to compute the model of a particular user for a window as `Combine.perKey(new ModelFromEventsFn())`. How would you apply this model to @@ -412,9 +404,7 @@ elements of a `PCollection` is to read it as a side input to a `ParDo` transform. So you could side input the model and check the stream of events against it, outputting the prediction, like so: -{{% classwrapper class="language-java" %}} - -```java +{{< highlight java >}} PCollection> events = ... final PCollectionView> userModels = events @@ -435,13 +425,9 @@ PCollection> predictions = events … c.output(KV.of(userId, model.prediction(event))) … } })); -``` - -{{% /classwrapper %}} - -{{% classwrapper class="language-py" %}} +{{< /highlight >}} -```py +{{< highlight py >}} # Events is a collection of (user, event) pairs. events = (p | ReadFromEventSource() | beam.WindowInto(....)) @@ -460,9 +446,7 @@ def event_prediction(user_event, models): # Predictions is a collection of (user, prediction) pairs. predictions = events | beam.Map(event_prediction, user_models) -``` - -{{% classwrapper %}} +{{< /highlight >}} In this pipeline, there is just one model emitted by the `Combine.perKey(...)` per user, per window, which is then prepared for side input by the `View.asMap()` @@ -480,9 +464,7 @@ generic Beam feature for managing completeness versus latency tradeoffs. So here is the same pipeline with an added trigger that outputs a new model one second after input arrives: -{{% classwrapper class="language-java" %}} - -```java +{{< highlight java >}} PCollection> events = ... PCollectionView> userModels = events @@ -493,13 +475,9 @@ PCollectionView> userModels = events .apply(Combine.perKey(new ModelFromEventsFn())) .apply(View.asMap()); -``` - -{{% /classwrapper %}} - -{{% classwrapper class="language-py" %}} +{{< /highlight >}} -```py +{{< highlight py >}} events = ... 
user_models = beam.pvalue.AsDict( @@ -509,9 +487,7 @@ user_models = beam.pvalue.AsDict( trigger.AfterCount(1), trigger.AfterProcessingTime(1))) | beam.CombinePerKey(ModelFromEventsFn())) -``` - -{{% /classwrapper %}} +{{< /highlight >}} This is often a pretty nice tradeoff between latency and cost: If a huge flood of events comes in a second, then you will only emit one new model, so you @@ -533,9 +509,7 @@ Stateful processing lets you address both the latency problem of side inputs and the cost problem of excessive uninteresting output. Here is the code, using only features I have already introduced: -{{% classwrapper class="language-java" %}} - -```java +{{< highlight java >}} new DoFn, KV>() { @StateId("model") @@ -566,13 +540,9 @@ new DoFn, KV>() { } } }; -``` - -{{% /classwrapper %}} - -{{% classwrapper class="language-py" %}} +{{< /highlight >}} -```py +{{< highlight py >}} class ModelStatefulFn(beam.DoFn): PREVIOUS_PREDICTION = BagStateSpec('previous_pred_state', PredictionCoder()) @@ -598,9 +568,7 @@ class ModelStatefulFn(beam.DoFn): previous_pred_state.clear() previous_pred_state.add(new_prediction) yield (user, new_prediction) -``` - -{{% /classwrapper %}} +{{< /highlight >}} Let's walk through it, diff --git a/website/www/site/content/en/blog/test-stream.md b/website/www/site/content/en/blog/test-stream.md index 5377b5b3b0b64..04c1eeee758f8 100644 --- a/website/www/site/content/en/blog/test-stream.md +++ b/website/www/site/content/en/blog/test-stream.md @@ -124,9 +124,7 @@ For example, if we create a TestStream where all the data arrives before the watermark and provide the result PCollection as input to the CalculateTeamScores PTransform: -{{% classwrapper class="language-java" %}} - -```java +{{< highlight java >}} TestStream infos = TestStream.create(AvroCoder.of(GameActionInfo.class)) .addElements(new GameActionInfo("sky", "blue", 12, new Instant(0L)),                 new GameActionInfo("navy", "blue", 3, new Instant(0L)), @@ -138,25 +136,19 @@ TestStream infos = TestStream.create(AvroCoder.of(GameActionInfo PCollection> teamScores = p.apply(createEvents) .apply(new CalculateTeamScores(TEAM_WINDOW_DURATION, ALLOWED_LATENESS)); -``` - -{{% /classwrapper %}} +{{< /highlight >}} we can then assert that the result PCollection contains elements that arrived: Elements all arrive before the watermark, and are produced in the on-time pane -{{% classwrapper class="language-java" %}} - -```java +{{< highlight java >}} // Only one value is emitted for the blue team PAssert.that(teamScores) .inWindow(window) .containsInAnyOrder(KV.of("blue", 18)); p.run(); -``` - -{{% /classwrapper %}} +{{< /highlight >}} ### Some elements are late, but arrive before the end of the window @@ -166,9 +158,7 @@ of the window (shown below to the left of the red watermark), which demonstrates the system to be on time, as it arrives before the watermark passes the end of the window -{{% classwrapper class="language-java" %}} - -```java +{{< highlight java >}} TestStream infos = TestStream.create(AvroCoder.of(GameActionInfo.class)) .addElements(new GameActionInfo("sky", "blue", 3, new Instant(0L)),         new GameActionInfo("navy", "blue", 3, new Instant(0L).plus(Duration.standardMinutes(3)))) @@ -180,23 +170,17 @@ TestStream infos = TestStream.create(AvroCoder.of(GameActionInfo PCollection> teamScores = p.apply(createEvents) .apply(new CalculateTeamScores(TEAM_WINDOW_DURATION, ALLOWED_LATENESS)); -``` - -{{% /classwrapper %}} +{{< /highlight >}} An element arrives late, but before the watermark passes the 
end of the window, and is produced in the on-time pane -{{% classwrapper class="language-java" %}} - -```java +{{< highlight java >}} // Only one value is emitted for the blue team PAssert.that(teamScores) .inWindow(window) .containsInAnyOrder(KV.of("blue", 18)); p.run(); -``` - -{{% /classwrapper %}} +{{< /highlight >}} ### Elements are late, and arrive after the end of the window @@ -204,9 +188,7 @@ By advancing the watermark farther in time before adding the late data, we can demonstrate the triggering behavior that causes the system to emit an on-time pane, and then after the late data arrives, a pane that refines the result. -{{% classwrapper class="language-java" %}} - -```java +{{< highlight java >}} TestStream infos = TestStream.create(AvroCoder.of(GameActionInfo.class)) .addElements(new GameActionInfo("sky", "blue", 3, new Instant(0L)),           new GameActionInfo("navy", "blue", 3, new Instant(0L).plus(Duration.standardMinutes(3)))) @@ -218,15 +200,11 @@ TestStream infos = TestStream.create(AvroCoder.of(GameActionInfo PCollection> teamScores = p.apply(createEvents) .apply(new CalculateTeamScores(TEAM_WINDOW_DURATION, ALLOWED_LATENESS)); -``` - -{{% /classwrapper %}} +{{< /highlight >}} Elements all arrive before the watermark, and are produced in the on-time pane -{{% classwrapper class="language-java" %}} - -```java +{{< highlight java >}} // An on-time pane is emitted with the events that arrived before the window closed PAssert.that(teamScores) .inOnTimePane(window) @@ -236,9 +214,7 @@ PAssert.that(teamScores) .inFinalPane(window) .containsInAnyOrder(KV.of("blue", 18)); p.run(); -``` - -{{% /classwrapper %}} +{{< /highlight >}} ### Elements are late, and after the end of the window plus the allowed lateness @@ -246,9 +222,7 @@ If we push the watermark even further into the future, beyond the maximum configured allowed lateness, we can demonstrate that the late element is dropped by the system. -{{% classwrapper class="language-java" %}} - -```java +{{< highlight java >}} TestStream infos = TestStream.create(AvroCoder.of(GameActionInfo.class)) .addElements(new GameActionInfo("sky", "blue", 3, Duration.ZERO),          new GameActionInfo("navy", "blue", 3, Duration.standardMinutes(3))) @@ -265,24 +239,18 @@ TestStream infos = TestStream.create(AvroCoder.of(GameActionInfo PCollection> teamScores = p.apply(createEvents) .apply(new CalculateTeamScores(TEAM_WINDOW_DURATION, ALLOWED_LATENESS)); -``` - -{{% /classwrapper %}} +{{< /highlight >}} Elements all arrive before the watermark, and are produced in the on-time pane -{{% classwrapper class="language-java" %}} - -```java +{{< highlight java >}} // An on-time pane is emitted with the events that arrived before the window closed PAssert.that(teamScores) .inWindow(window) .containsInAnyOrder(KV.of("blue", 6)); p.run(); -``` - -{{% /classwrapper %}} +{{< /highlight >}} ### Elements arrive before the end of the window, and some processing time passes Using additional methods, we can demonstrate the behavior of speculative @@ -290,9 +258,7 @@ triggers by advancing the processing time of the TestStream. 
If we add elements to an input PCollection, occasionally advancing the processing time clock, and apply `CalculateUserScores` -{{% classwrapper class="language-java" %}} - -```java +{{< highlight java >}} TestStream.create(AvroCoder.of(GameActionInfo.class))    .addElements(new GameActionInfo("scarlet", "red", 3, new Instant(0L)),                new GameActionInfo("scarlet", "red", 2, new Instant(0L).plus(Duration.standardMinutes(1)))) @@ -304,15 +270,11 @@ TestStream.create(AvroCoder.of(GameActionInfo.class)) PCollection> userScores =    p.apply(infos).apply(new CalculateUserScores(ALLOWED_LATENESS)); -``` - -{{% /classwrapper %}} +{{< /highlight >}} Elements all arrive before the watermark, and are produced in the on-time pane -{{% classwrapper class="language-java" %}} - -```java +{{< highlight java >}} PAssert.that(userScores) .inEarlyGlobalWindowPanes() .containsInAnyOrder(KV.of("scarlet", 5), @@ -320,9 +282,7 @@ PAssert.that(userScores) KV.of("oxblood", 2)); p.run(); -``` - -{{% /classwrapper %}} +{{< /highlight >}} ## TestStream - Under the Hood diff --git a/website/www/site/content/en/blog/timely-processing.md b/website/www/site/content/en/blog/timely-processing.md index be17cc62b1647..ffd4454264ff4 100644 --- a/website/www/site/content/en/blog/timely-processing.md +++ b/website/www/site/content/en/blog/timely-processing.md @@ -184,9 +184,7 @@ Let's set up the state we need to track batches of elements. As each element comes in, we will write the element to a buffer while tracking the number of elements we have buffered. Here are the state cells in code: -{{% classwrapper class="language-java" %}} - -```java +{{< highlight java >}} new DoFn() { @StateId("buffer") @@ -197,13 +195,9 @@ new DoFn() { … TBD … } -``` - -{{% /classwrapper %}} +{{< /highlight >}} -{{% classwrapper class="language-py" %}} - -```py +{{< highlight py >}} class StatefulBufferingFn(beam.DoFn): BUFFER_STATE = BagStateSpec('buffer', EventCoder()) @@ -211,9 +205,7 @@ class StatefulBufferingFn(beam.DoFn): COUNT_STATE = CombiningValueStateSpec('count', VarIntCoder(), combiners.SumCombineFn()) -``` - -{{% /classwrapper %}} +{{< /highlight >}} Walking through the code, we have: @@ -225,9 +217,7 @@ method. We will choose a limit on the size of the buffer, `MAX_BUFFER_SIZE`. If our buffer reaches this size, we will perform a single RPC to enrich all the events, and output. -{{% classwrapper class="language-java" %}} - -```java +{{< highlight java >}} new DoFn() { private static final int MAX_BUFFER_SIZE = 500; @@ -260,13 +250,9 @@ new DoFn() { … TBD … } -``` - -{{% /classwrapper %}} +{{< /highlight >}} -{{% classwrapper class="language-py" %}} - -```py +{{< highlight py >}} class StatefulBufferingFn(beam.DoFn): MAX_BUFFER_SIZE = 500; @@ -291,9 +277,7 @@ class StatefulBufferingFn(beam.DoFn): yield event count_state.clear() buffer_state.clear() -``` - -{{% /classwrapper %}} +{{< /highlight >}} Here is an illustration to accompany the code: @@ -335,9 +319,7 @@ completeness for a `PCollection` - such as when a window expires. For our example, let us add an event time timer so that when the window expires, any events remaining in the buffer are processed. 
-{{% classwrapper class="language-java" %}} - -```java +{{< highlight java >}} new DoFn() { … @@ -369,13 +351,9 @@ new DoFn() { } } } -``` - -{{% /classwrapper %}} +{{< /highlight >}} -{{% classwrapper class="language-py" %}} - -```py +{{< highlight py >}} class StatefulBufferingFn(beam.DoFn): … @@ -402,9 +380,7 @@ class StatefulBufferingFn(beam.DoFn): buffer_state.clear() count_state.clear() -``` - -{{% /classwrapper %}} +{{< /highlight >}} Let's unpack the pieces of this snippet: @@ -459,9 +435,7 @@ timer has not been set, then we set it for the current moment plus `MAX_BUFFER_DURATION`. After the allotted processing time has passed, a callback will fire and enrich and emit any buffered elements. -{{% classwrapper class="language-java" %}} - -```java +{{< highlight java >}} new DoFn() { … @@ -503,13 +477,9 @@ new DoFn() { … same expiry as above … } -``` - -{{% /classwrapper %}} +{{< /highlight >}} -{{% classwrapper class="language-py" %}} - -```py +{{< highlight py >}} class StatefulBufferingFn(beam.DoFn): … @@ -542,9 +512,7 @@ class StatefulBufferingFn(beam.DoFn): buffer_state.clear() count_state.clear() -``` - -{{% /classwrapper %}} +{{< /highlight >}} Here is an illustration of the final code: diff --git a/website/www/site/content/en/documentation/_index.md b/website/www/site/content/en/documentation/_index.md index de0906807367d..8524f89b56c20 100644 --- a/website/www/site/content/en/documentation/_index.md +++ b/website/www/site/content/en/documentation/_index.md @@ -1,9 +1,6 @@ --- -layout: section title: "Learn about Beam" -permalink: /documentation/ -section_menu: section-menu/documentation.html -redirect_from: +aliases: - /learn/ - /docs/learn/ --- @@ -29,23 +26,23 @@ This section provides in-depth conceptual information and reference material for Learn about the Beam Programming Model and the concepts common to all Beam SDKs and Runners. -* Read the [Programming Guide]({{ site.baseurl }}/documentation/programming-guide/), which introduces all the key Beam concepts. -* Learn about Beam's [execution model]({{ site.baseurl }}/documentation/runtime/model) to better understand how pipelines execute. -* Visit [Learning Resources]({{ site.baseurl }}/documentation/resources/learning-resources) for some of our favorite articles and talks about Beam. +* Read the [Programming Guide](/documentation/programming-guide/), which introduces all the key Beam concepts. +* Learn about Beam's [execution model](/documentation/runtime/model) to better understand how pipelines execute. +* Visit [Learning Resources](/documentation/resources/learning-resources) for some of our favorite articles and talks about Beam. ## Pipeline Fundamentals -* [Design Your Pipeline]({{ site.baseurl }}/documentation/pipelines/design-your-pipeline/) by planning your pipeline’s structure, choosing transforms to apply to your data, and determining your input and output methods. -* [Create Your Pipeline]({{ site.baseurl }}/documentation/pipelines/create-your-pipeline/) using the classes in the Beam SDKs. -* [Test Your Pipeline]({{ site.baseurl }}/documentation/pipelines/test-your-pipeline/) to minimize debugging a pipeline’s remote execution. +* [Design Your Pipeline](/documentation/pipelines/design-your-pipeline/) by planning your pipeline’s structure, choosing transforms to apply to your data, and determining your input and output methods. +* [Create Your Pipeline](/documentation/pipelines/create-your-pipeline/) using the classes in the Beam SDKs. 
+* [Test Your Pipeline](/documentation/pipelines/test-your-pipeline/) to minimize debugging a pipeline’s remote execution. ## SDKs Find status and reference information on all of the available Beam SDKs. -* [Java SDK]({{ site.baseurl }}/documentation/sdks/java/) -* [Python SDK]({{ site.baseurl }}/documentation/sdks/python/) -* [Go SDK]({{ site.baseurl }}/documentation/sdks/go/) +* [Java SDK](/documentation/sdks/java/) +* [Python SDK](/documentation/sdks/python/) +* [Go SDK](/documentation/sdks/go/) ## Runners @@ -53,18 +50,18 @@ A Beam Runner runs a Beam pipeline on a specific (often distributed) data proces ### Available Runners -* [DirectRunner]({{ site.baseurl }}/documentation/runners/direct/): Runs locally on your machine -- great for developing, testing, and debugging. -* [ApexRunner]({{ site.baseurl }}/documentation/runners/apex/): Runs on [Apache Apex](https://apex.apache.org). -* [FlinkRunner]({{ site.baseurl }}/documentation/runners/flink/): Runs on [Apache Flink](https://flink.apache.org). -* [SparkRunner]({{ site.baseurl }}/documentation/runners/spark/): Runs on [Apache Spark](https://spark.apache.org). -* [DataflowRunner]({{ site.baseurl }}/documentation/runners/dataflow/): Runs on [Google Cloud Dataflow](https://cloud.google.com/dataflow), a fully managed service within [Google Cloud Platform](https://cloud.google.com/). -* [GearpumpRunner]({{ site.baseurl }}/documentation/runners/gearpump/): Runs on [Apache Gearpump (incubating)](https://gearpump.apache.org). -* [SamzaRunner]({{ site.baseurl }}/documentation/runners/samza/): Runs on [Apache Samza](https://samza.apache.org). -* [NemoRunner]({{ site.baseurl }}/documentation/runners/nemo/): Runs on [Apache Nemo](https://nemo.apache.org). -* [JetRunner]({{ site.baseurl }}/documentation/runners/jet/): Runs on [Hazelcast Jet](https://jet.hazelcast.org/). +* [DirectRunner](/documentation/runners/direct/): Runs locally on your machine -- great for developing, testing, and debugging. +* [ApexRunner](/documentation/runners/apex/): Runs on [Apache Apex](http://apex.apache.org). +* [FlinkRunner](/documentation/runners/flink/): Runs on [Apache Flink](http://flink.apache.org). +* [SparkRunner](/documentation/runners/spark/): Runs on [Apache Spark](http://spark.apache.org). +* [DataflowRunner](/documentation/runners/dataflow/): Runs on [Google Cloud Dataflow](https://cloud.google.com/dataflow), a fully managed service within [Google Cloud Platform](https://cloud.google.com/). +* [GearpumpRunner](/documentation/runners/gearpump/): Runs on [Apache Gearpump (incubating)](http://gearpump.apache.org). +* [SamzaRunner](/documentation/runners/samza/): Runs on [Apache Samza](http://samza.apache.org). +* [NemoRunner](/documentation/runners/nemo/): Runs on [Apache Nemo](http://nemo.apache.org). +* [JetRunner](/documentation/runners/jet/): Runs on [Hazelcast Jet](https://jet.hazelcast.org/). ### Choosing a Runner -Beam is designed to enable pipelines to be portable across different runners. However, given every runner has different capabilities, they also have different abilities to implement the core concepts in the Beam model. The [Capability Matrix]({{ site.baseurl }}/documentation/runners/capability-matrix/) provides a detailed comparison of runner functionality. +Beam is designed to enable pipelines to be portable across different runners. However, given every runner has different capabilities, they also have different abilities to implement the core concepts in the Beam model. 
The [Capability Matrix](/documentation/runners/capability-matrix/) provides a detailed comparison of runner functionality. -Once you have chosen which runner to use, see that runner's page for more information about any initial runner-specific setup as well as any required or optional `PipelineOptions` for configuring its execution. You may also want to refer back to the Quickstart for [Java]({{ site.baseurl }}/get-started/quickstart-java), [Python]({{ site.baseurl }}/get-started/quickstart-py) or [Go]({{ site.baseurl }}/get-started/quickstart-go) for instructions on executing the sample WordCount pipeline. +Once you have chosen which runner to use, see that runner's page for more information about any initial runner-specific setup as well as any required or optional `PipelineOptions` for configuring its execution. You may also want to refer back to the Quickstart for [Java](/get-started/quickstart-java), [Python](/get-started/quickstart-py) or [Go](/get-started/quickstart-go) for instructions on executing the sample WordCount pipeline. diff --git a/website/www/site/content/en/documentation/io/built-in.md b/website/www/site/content/en/documentation/io/built-in.md index fb81b0f76a3ef..271b2bfc09a4c 100644 --- a/website/www/site/content/en/documentation/io/built-in.md +++ b/website/www/site/content/en/documentation/io/built-in.md @@ -1,8 +1,5 @@ --- -layout: section title: "Built-in I/O Transforms" -section_menu: section-menu/documentation.html -permalink: /documentation/io/built-in/ --- -![This is a sequence diagram that shows the lifecycle of the Source]( - {{ "/images/source-sequence-diagram.svg" | prepend: site.baseurl }}) +![This is a sequence diagram that shows the lifecycle of the Source](/images/source-sequence-diagram.svg) ### Using ParDo and GroupByKey @@ -173,8 +169,8 @@ For **file-based sinks**, you can use the `FileBasedSink` abstraction that is provided by both the Java and Python SDKs. See our language specific implementation guides for more details: -* [Developing I/O connectors for Java]({{ site.baseurl }}/documentation/io/developing-io-java/) -* [Developing I/O connectors for Python]({{ site.baseurl }}/documentation/io/developing-io-python/) +* [Developing I/O connectors for Java](/documentation/io/developing-io-java/) +* [Developing I/O connectors for Python](/documentation/io/developing-io-python/) diff --git a/website/www/site/content/en/documentation/io/developing-io-python.md b/website/www/site/content/en/documentation/io/developing-io-python.md index fdb2a76690539..47c8595598c6a 100644 --- a/website/www/site/content/en/documentation/io/developing-io-python.md +++ b/website/www/site/content/en/documentation/io/developing-io-python.md @@ -1,9 +1,6 @@ --- -layout: section title: "Apache Beam: Developing I/O connectors for Python" -section_menu: section-menu/documentation.html -permalink: /documentation/io/developing-io-python/ -redirect_from: +aliases: - /documentation/io/authoring-python/ - /documentation/sdks/python-custom-io/ --- @@ -26,13 +23,13 @@ To connect to a data store that isn’t supported by Beam’s existing I/O connectors, you must create a custom I/O connector that usually consist of a source and a sink. All Beam sources and sinks are composite transforms; however, the implementation of your custom I/O depends on your use case. 
Before you -start, read the [new I/O connector overview]({{ site.baseurl }}/documentation/io/developing-io-overview/) +start, read the [new I/O connector overview](/documentation/io/developing-io-overview/) for an overview of developing a new I/O connector, the available implementation options, and how to choose the right option for your use case. -This guide covers using the [Source and FileBasedSink interfaces](https://beam.apache.org/releases/pydoc/{{ site.release_latest }}/apache_beam.io.iobase.html) +This guide covers using the [Source and FileBasedSink interfaces](https://beam.apache.org/releases/pydoc/{{< param release_latest >}}/apache_beam.io.iobase.html) for Python. The Java SDK offers the same functionality, but uses a slightly -different API. See [Developing I/O connectors for Java]({{ site.baseurl }}/documentation/io/developing-io-java/) +different API. See [Developing I/O connectors for Java](/documentation/io/developing-io-java/) for information specific to the Java SDK. ## Basic code requirements {#basic-code-reqs} @@ -62,7 +59,7 @@ multiple worker instances in parallel. As such, the code you provide for methods available in the [source_test_utils module](https://github.com/apache/beam/blob/master/sdks/python/apache_beam/io/source_test_utils.py) to develop tests for your source. -In addition, see the [PTransform style guide]({{ site.baseurl }}/contribute/ptransform-style-guide/) +In addition, see the [PTransform style guide](/contribute/ptransform-style-guide/) for Beam's transform style guidance. ## Implementing the Source interface @@ -83,7 +80,7 @@ Supply the logic for your new source by creating the following classes: a wrapper. You can find these classes in the -[apache_beam.io.iobase module](https://beam.apache.org/releases/pydoc/{{ site.release_latest }}/apache_beam.io.iobase.html). +[apache_beam.io.iobase module](https://beam.apache.org/releases/pydoc/{{< param release_latest >}}/apache_beam.io.iobase.html). ### Implementing the BoundedSource subclass @@ -185,13 +182,13 @@ See [AvroSource](https://github.com/apache/beam/blob/master/sdks/python/apache_b The following example, `CountingSource`, demonstrates an implementation of `BoundedSource` and uses the SDK-provided `RangeTracker` called `OffsetRangeTracker`. -``` -{% github_sample /apache/beam/blob/master/sdks/python/apache_beam/examples/snippets/snippets.py tag:model_custom_source_new_source %}``` + To read data from the source in your pipeline, use the `Read` transform: -``` -{% github_sample /apache/beam/blob/master/sdks/python/apache_beam/examples/snippets/snippets.py tag:model_custom_source_use_new_source %}``` + **Note:** When you create a source that end-users are going to use, we recommended that you do not expose the code for the source itself as @@ -202,10 +199,10 @@ exposing your sources, and walks through how to create a wrapper. ## Using the FileBasedSink abstraction -If your data source uses files, you can implement the [FileBasedSink](https://beam.apache.org/releases/pydoc/{{ site.release_latest }}/apache_beam.io.filebasedsink.html) +If your data source uses files, you can implement the [FileBasedSink](https://beam.apache.org/releases/pydoc/{{< param release_latest >}}/apache_beam.io.filebasedsink.html) abstraction to create a file-based sink. For other sinks, use `ParDo`, `GroupByKey`, and other transforms offered by the Beam SDK for Python. 
See the -[developing I/O connectors overview]({{ site.baseurl }}/documentation/io/developing-io-overview/) +[developing I/O connectors overview](/documentation/io/developing-io-overview/) for more details. When using the `FileBasedSink` interface, you must provide the format-specific @@ -254,7 +251,7 @@ users would need to add the reshard themselves (using the `GroupByKey` transform). To solve this, we recommended that you expose the source as a composite `PTransform` that performs both the read operation and the reshard. -See Beam’s [PTransform style guide]({{ site.baseurl }}/contribute/ptransform-style-guide/#exposing-a-ptransform-vs-something-else) +See Beam’s [PTransform style guide](/contribute/ptransform-style-guide/#exposing-a-ptransform-vs-something-else) for additional information about wrapping with a `PTransform`. The following examples change the source and sink from the above sections so @@ -262,20 +259,20 @@ that they are not exposed to end-users. For the source, rename `CountingSource` to `_CountingSource`. Then, create the wrapper `PTransform`, called `ReadFromCountingSource`: -``` -{% github_sample /apache/beam/blob/master/sdks/python/apache_beam/examples/snippets/snippets.py tag:model_custom_source_new_ptransform %}``` + Finally, read from the source: -``` -{% github_sample /apache/beam/blob/master/sdks/python/apache_beam/examples/snippets/snippets.py tag:model_custom_source_use_ptransform %}``` + For the sink, rename `SimpleKVSink` to `_SimpleKVSink`. Then, create the wrapper `PTransform`, called `WriteToKVSink`: -``` -{% github_sample /apache/beam/blob/master/sdks/python/apache_beam/examples/snippets/snippets.py tag:model_custom_sink_new_ptransform %}``` + Finally, write to the sink: -``` -{% github_sample /apache/beam/blob/master/sdks/python/apache_beam/examples/snippets/snippets.py tag:model_custom_sink_use_ptransform %}``` + diff --git a/website/www/site/content/en/documentation/io/testing.md b/website/www/site/content/en/documentation/io/testing.md index 4e8b9b78c49b1..4bd46aaf9ccd6 100644 --- a/website/www/site/content/en/documentation/io/testing.md +++ b/website/www/site/content/en/documentation/io/testing.md @@ -1,8 +1,5 @@ --- -layout: section title: "Testing I/O Transforms" -section_menu: section-menu/documentation.html -permalink: /documentation/io/testing/ --- # Custom window patterns -The samples on this page demonstrate common custom window patterns. You can create custom windows with [`WindowFn` functions]({{ site.baseurl }}/documentation/programming-guide/#provided-windowing-functions). For more information, see the [programming guide section on windowing]({{ site.baseurl }}/documentation/programming-guide/#windowing). +The samples on this page demonstrate common custom window patterns. You can create custom windows with [`WindowFn` functions](/documentation/programming-guide/#provided-windowing-functions). For more information, see the [programming guide section on windowing](/documentation/programming-guide/#windowing). **Note**: Custom merging windows isn't supported in Python (with fnapi). @@ -29,10 +26,10 @@ You can modify the [`assignWindows`](https://beam.apache.org/releases/javadoc/cu Access the `assignWindows` function through `WindowFn.AssignContext.element()`. 
The original, fixed-duration `assignWindows` function is: -```java +{{< highlight java >}} {% github_sample /apache/beam/blob/master/examples/java/src/main/java/org/apache/beam/examples/snippets/Snippets.java tag:CustomSessionWindow1 %} -``` +{{< /highlight >}} ### Creating data-driven gaps To create data-driven gaps, add the following snippets to the `assignWindows` function: @@ -41,34 +38,34 @@ To create data-driven gaps, add the following snippets to the `assignWindows` fu For example, the following function assigns each element to a window between the timestamp and `gapDuration`: -```java +{{< highlight java >}} {% github_sample /apache/beam/blob/master/examples/java/src/main/java/org/apache/beam/examples/snippets/Snippets.java tag:CustomSessionWindow3 %} -``` +{{< /highlight >}} Then, set the `gapDuration` field in a windowing function: -```java +{{< highlight java >}} {% github_sample /apache/beam/blob/master/examples/java/src/main/java/org/apache/beam/examples/snippets/Snippets.java tag:CustomSessionWindow2 %} -``` +{{< /highlight >}} ### Windowing messages into sessions After creating data-driven gaps, you can window incoming data into the new, custom sessions. First, set the session length to the gap duration: -```java +{{< highlight java >}} {% github_sample /apache/beam/blob/master/examples/java/src/main/java/org/apache/beam/examples/snippets/Snippets.java tag:CustomSessionWindow4 %} -``` +{{< /highlight >}} Lastly, window data into sessions in your pipeline: -```java +{{< highlight java >}} {% github_sample /apache/beam/blob/master/examples/java/src/main/java/org/apache/beam/examples/snippets/Snippets.java tag:CustomSessionWindow6 %} -``` +{{< /highlight >}} ### Example data and windows The following test data tallies two users' scores with and without the `gap` attribute: @@ -86,7 +83,7 @@ The following test data tallies two users' scores with and without the `gap` att The diagram below visualizes the test data: -![Two sets of data and the standard and dynamic sessions with which the data is windowed.]( {{ "/images/standard-vs-dynamic-sessions.png" | prepend: site.baseurl }}) +![Two sets of data and the standard and dynamic sessions with which the data is windowed.](/images/standard-vs-dynamic-sessions.png) #### Standard sessions diff --git a/website/www/site/content/en/documentation/patterns/file-processing.md b/website/www/site/content/en/documentation/patterns/file-processing.md index 592a58b198d45..585b325249a9c 100644 --- a/website/www/site/content/en/documentation/patterns/file-processing.md +++ b/website/www/site/content/en/documentation/patterns/file-processing.md @@ -1,8 +1,5 @@ --- -layout: section title: "File processing patterns" -section_menu: section-menu/documentation.html -permalink: /documentation/patterns/file-processing/ --- + # Create Your Pipeline -* TOC -{:toc} +{{< toc >}} Your Beam program expresses a data processing pipeline, from start to finish. This section explains the mechanics of using the classes in the Beam SDKs to build a pipeline. To construct a pipeline using the classes in the Beam SDKs, your program will need to perform the following general steps: @@ -36,15 +33,15 @@ A Beam program often starts by creating a `Pipeline` object. In the Beam SDKs, each pipeline is represented by an explicit object of type `Pipeline`. Each `Pipeline` object is an independent entity that encapsulates both the data the pipeline operates over and the transforms that get applied to that data. 
-To create a pipeline, declare a `Pipeline` object, and pass it some [configuration options]({{ site.baseurl }}/documentation/programming-guide#configuring-pipeline-options). +To create a pipeline, declare a `Pipeline` object, and pass it some [configuration options](/documentation/programming-guide#configuring-pipeline-options). -```java +{{< highlight java >}} // Start by defining the options for the pipeline. PipelineOptions options = PipelineOptionsFactory.create(); // Then create the pipeline. Pipeline p = Pipeline.create(options); -``` +{{< /highlight >}} ## Reading Data Into Your Pipeline @@ -54,24 +51,24 @@ There are two kinds of root transforms in the Beam SDKs: `Read` and `Create`. `R The following example code shows how to `apply` a `TextIO.Read` root transform to read data from a text file. The transform is applied to a `Pipeline` object `p`, and returns a pipeline data set in the form of a `PCollection`: -```java +{{< highlight java >}} PCollection lines = p.apply( "ReadLines", TextIO.read().from("gs://some/inputData.txt")); -``` +{{< /highlight >}} ## Applying Transforms to Process Pipeline Data -You can manipulate your data using the various [transforms]({{ site.baseurl }}/documentation/programming-guide/#transforms) provided in the Beam SDKs. To do this, you **apply** the transforms to your pipeline's `PCollection` by calling the `apply` method on each `PCollection` that you want to process and passing the desired transform object as an argument. +You can manipulate your data using the various [transforms](/documentation/programming-guide/#transforms) provided in the Beam SDKs. To do this, you **apply** the trannsforms to your pipeline's `PCollection` by calling the `apply` method on each `PCollection` that you want to process and passing the desired transform object as an argument. The following code shows how to `apply` a transform to a `PCollection` of strings. The transform is a user-defined custom transform that reverses the contents of each string and outputs a new `PCollection` containing the reversed strings. The input is a `PCollection` called `words`; the code passes an instance of a `PTransform` object called `ReverseWords` to `apply`, and saves the return value as the `PCollection` called `reversedWords`. -```java +{{< highlight java >}} PCollection words = ...; PCollection reversedWords = words.apply(new ReverseWords()); -``` +{{< /highlight >}} ## Writing or Outputting Your Final Pipeline Data @@ -79,27 +76,27 @@ Once your pipeline has applied all of its transforms, you'll usually need to out The following example code shows how to `apply` a `TextIO.Write` transform to write a `PCollection` of `String` to a text file: -```java +{{< highlight java >}} PCollection filteredWords = ...; filteredWords.apply("WriteMyFile", TextIO.write().to("gs://some/outputData.txt")); -``` +{{< /highlight >}} ## Running Your Pipeline Once you have constructed your pipeline, use the `run` method to execute the pipeline. Pipelines are executed asynchronously: the program you create sends a specification for your pipeline to a **pipeline runner**, which then constructs and runs the actual series of pipeline operations. -```java +{{< highlight java >}} p.run(); -``` +{{< /highlight >}} The `run` method is asynchronous. 
If you'd like a blocking execution instead, run your pipeline appending the `waitUntilFinish` method: -```java +{{< highlight java >}} p.run().waitUntilFinish(); -``` +{{< /highlight >}} ## What's next -* [Programming Guide]({{ site.baseurl }}/documentation/programming-guide) - Learn the details of creating your pipeline, configuring pipeline options, and applying transforms. -* [Test your pipeline]({{ site.baseurl }}/documentation/pipelines/test-your-pipeline). +* [Programming Guide](/documentation/programming-guide) - Learn the details of creating your pipeline, configuring pipeline options, and applying transforms. +* [Test your pipeline](/documentation/pipelines/test-your-pipeline). diff --git a/website/www/site/content/en/documentation/pipelines/design-your-pipeline.md b/website/www/site/content/en/documentation/pipelines/design-your-pipeline.md index 5c4c28efb618e..6c65efdf1852f 100644 --- a/website/www/site/content/en/documentation/pipelines/design-your-pipeline.md +++ b/website/www/site/content/en/documentation/pipelines/design-your-pipeline.md @@ -1,8 +1,5 @@ --- -layout: section title: "Design Your Pipeline" -section_menu: section-menu/documentation.html -permalink: /documentation/pipelines/design-your-pipeline/ --- # Design Your Pipeline -* TOC -{:toc} +{{< toc >}} This page helps you design your Apache Beam pipeline. It includes information about how to determine your pipeline's structure, how to choose which transforms to apply to your data, and how to determine your input and output methods. -Before reading this section, it is recommended that you become familiar with the information in the [Beam programming guide]({{ site.baseurl }}/documentation/programming-guide). +Before reading this section, it is recommended that you become familiar with the information in the [Beam programming guide](/documentation/programming-guide). ## What to consider when designing your pipeline @@ -32,7 +28,7 @@ When designing your Beam pipeline, consider a few basic questions: * **Where is your input data stored?** How many sets of input data do you have? This will determine what kinds of `Read` transforms you'll need to apply at the start of your pipeline. * **What does your data look like?** It might be plaintext, formatted log files, or rows in a database table. Some Beam transforms work exclusively on `PCollection`s of key/value pairs; you'll need to determine if and how your data is keyed and how to best represent that in your pipeline's `PCollection`(s). -* **What do you want to do with your data?** The core transforms in the Beam SDKs are general purpose. Knowing how you need to change or manipulate your data will determine how you build core transforms like [ParDo]({{ site.baseurl }}/documentation/programming-guide/#pardo), or when you use pre-written transforms included with the Beam SDKs. +* **What do you want to do with your data?** The core transforms in the Beam SDKs are general purpose. Knowing how you need to change or manipulate your data will determine how you build core transforms like [ParDo](/documentation/programming-guide/#pardo), or when you use pre-written transforms included with the Beam SDKs. * **What does your output data look like, and where should it go?** This will determine what kinds of `Write` transforms you'll need to apply at the end of your pipeline. ## A basic pipeline @@ -40,8 +36,7 @@ When designing your Beam pipeline, consider a few basic questions: The simplest pipelines represent a linear flow of operations, as shown in figure 1. 
![A linear pipeline starts with one input collection, sequentially applies - three transforms, and ends with one output collection.]( - {{ "/images/design-your-pipeline-linear.svg" | prepend: site.baseurl }}) + three transforms, and ends with one output collection.](/images/design-your-pipeline-linear.svg) *Figure 1: A linear pipeline.* @@ -58,15 +53,14 @@ You can use the same `PCollection` as input for multiple transforms without cons The pipeline in figure 2 is a branching pipeline. The pipeline reads its input (first names represented as strings) from a database table and creates a `PCollection` of table rows. Then, the pipeline applies multiple transforms to the **same** `PCollection`. Transform A extracts all the names in that `PCollection` that start with the letter 'A', and Transform B extracts all the names in that `PCollection` that start with the letter 'B'. Both transforms A and B have the same input `PCollection`. ![The pipeline applies two transforms to a single input collection. Each - transform produces an output collection.]( - {{ "/images/design-your-pipeline-multiple-pcollections.svg" | prepend: site.baseurl }}) + transform produces an output collection.](/images/design-your-pipeline-multiple-pcollections.svg) *Figure 2: A branching pipeline. Two transforms are applied to a single PCollection of database table rows.* The following example code applies two transforms to a single input collection. -```java +{{< highlight java >}} PCollection dbRowCollection = ...; PCollection aCollection = dbRowCollection.apply("aTrans", ParDo.of(new DoFn(){ @@ -86,16 +80,15 @@ PCollection bCollection = dbRowCollection.apply("bTrans", ParDo.of(new D } } })); -``` +{{< /highlight >}} ### A single transform that produces multiple outputs -Another way to branch a pipeline is to have a **single** transform output to multiple `PCollection`s by using [tagged outputs]({{ site.baseurl }}/documentation/programming-guide/#additional-outputs). Transforms that produce more than one output process each element of the input once, and output to zero or more `PCollection`s. +Another way to branch a pipeline is to have a **single** transform output to multiple `PCollection`s by using [tagged outputs](/documentation/programming-guide/#additional-outputs). Transforms that produce more than one output process each element of the input once, and output to zero or more `PCollection`s. Figure 3 illustrates the same example described above, but with one transform that produces multiple outputs. Names that start with 'A' are added to the main output `PCollection`, and names that start with 'B' are added to an additional output `PCollection`. -![The pipeline applies one transform that produces multiple output collections.]( - {{ "/images/design-your-pipeline-additional-outputs.svg" | prepend: site.baseurl }}) +![The pipeline applies one transform that produces multiple output collections.](/images/design-your-pipeline-additional-outputs.svg) *Figure 3: A pipeline with a transform that outputs multiple PCollections.* @@ -121,7 +114,7 @@ where each element in the input `PCollection` is processed once. The following example code applies one transform that processes each element once and outputs two collections. -```java +{{< highlight java >}} // Define two TupleTags, one for each output. final TupleTag startsWithATag = new TupleTag(){}; final TupleTag startsWithBTag = new TupleTag(){}; @@ -151,7 +144,7 @@ mixedCollection.get(startsWithATag).apply(...); // Get subset of the output with tag startsWithBTag. 
mixedCollection.get(startsWithBTag).apply(...); -``` +{{< /highlight >}} You can use either mechanism to produce multiple output `PCollection`s. However, using additional outputs makes more sense if the transform's computation per element is time-consuming. @@ -170,14 +163,13 @@ single `PCollection` that now contains all names that begin with either 'A' or 'B'. Here, it makes sense to use `Flatten` because the `PCollection`s being merged both contain the same type. -![The pipeline merges two collections into one collection with the Flatten transform.]( - {{ "/images/design-your-pipeline-flatten.svg" | prepend: site.baseurl }}) +![The pipeline merges two collections into one collection with the Flatten transform.](/images/design-your-pipeline-flatten.svg) *Figure 4: A pipeline that merges two collections into one collection with the Flatten transform.* The following example code applies `Flatten` to merge two collections. -```java +{{< highlight java >}} //merge the two PCollections with Flatten PCollectionList collectionList = PCollectionList.of(aCollection).and(bCollection); PCollection mergedCollectionWithFlatten = collectionList @@ -185,20 +177,19 @@ PCollection mergedCollectionWithFlatten = collectionList // continue with the new merged PCollection mergedCollectionWithFlatten.apply(...); -``` +{{< /highlight >}} ## Multiple sources Your pipeline can read its input from one or more sources. If your pipeline reads from multiple sources and the data from those sources is related, it can be useful to join the inputs together. In the example illustrated in figure 5 below, the pipeline reads names and addresses from a database table, and names and order numbers from a Kafka topic. The pipeline then uses `CoGroupByKey` to join this information, where the key is the name; the resulting `PCollection` contains all the combinations of names, addresses, and orders. -![The pipeline joins two input collections into one collection with the Join transform.]( - {{ "/images/design-your-pipeline-join.svg" | prepend: site.baseurl }}) +![The pipeline joins two input collections into one collection with the Join transform.](/images/design-your-pipeline-join.svg) *Figure 5: A pipeline that does a relational join of two input collections.* The following example code applies `Join` to join two input collections. -```java +{{< highlight java >}} PCollection> userAddress = pipeline.apply(JdbcIO.>read()...); PCollection> userOrder = pipeline.apply(KafkaIO.read()...); @@ -213,9 +204,9 @@ PCollection> joinedCollection = .apply(CoGroupByKey.create()); joinedCollection.apply(...); -``` +{{< /highlight >}} ## What's next -* [Create your own pipeline]({{ site.baseurl }}/documentation/pipelines/create-your-pipeline). -* [Test your pipeline]({{ site.baseurl }}/documentation/pipelines/test-your-pipeline). +* [Create your own pipeline](/documentation/pipelines/create-your-pipeline). +* [Test your pipeline](/documentation/pipelines/test-your-pipeline). 
diff --git a/website/www/site/content/en/documentation/pipelines/test-your-pipeline.md b/website/www/site/content/en/documentation/pipelines/test-your-pipeline.md index 460416f81dfac..d119b96c31bce 100644 --- a/website/www/site/content/en/documentation/pipelines/test-your-pipeline.md +++ b/website/www/site/content/en/documentation/pipelines/test-your-pipeline.md @@ -1,8 +1,5 @@ --- -layout: section title: "Test Your Pipeline" -section_menu: section-menu/documentation.html -permalink: /documentation/pipelines/test-your-pipeline/ --- # Test Your Pipeline -* TOC -{:toc} +{{< toc >}} Testing your pipeline is a particularly important step in developing an effective data processing solution. The indirect nature of the Beam model, in which your user code constructs a pipeline graph to be executed remotely, can make debugging-failed runs a non-trivial task. Often it is faster and simpler to perform local unit testing on your pipeline code than to debug a pipeline's remote execution. Before running your pipeline on the runner of your choice, unit testing your pipeline code locally is often the best way to identify and fix bugs in your pipeline code. Unit testing your pipeline locally also allows you to use your familiar/favorite local debugging tools. -You can use [DirectRunner]({{ site.baseurl }}/documentation/runners/direct), a local runner helpful for testing and local development. +You can use [DirectRunner](/documentation/runners/direct), a local runner helpful for testing and local development. After you test your pipeline using the `DirectRunner`, you can use the runner of your choice to test on a small scale. For example, use the Flink runner with a local or remote Flink cluster. - - - - - The Beam SDKs provide a number of ways to unit test your pipeline code, from the lowest to the highest levels. From the lowest to the highest level, these are: -* You can test the individual function objects, such as [DoFn]({{ site.baseurl }}/documentation/programming-guide/#pardo)s, inside your pipeline's core transforms. -* You can test an entire [Composite Transform]({{ site.baseurl }}/documentation/programming-guide/#composite-transforms) as a unit. +* You can test the individual function objects, such as [DoFn](/documentation/programming-guide/#pardo)s, inside your pipeline's core transforms. +* You can test an entire [Composite Transform](/documentation/programming-guide/#composite-transforms) as a unit. * You can perform an end-to-end test for an entire pipeline. To support unit testing, the Beam SDK for Java provides a number of test classes in the [testing package](https://github.com/apache/beam/tree/master/sdks/java/core/src/test/java/org/apache/beam/sdk). You can use these tests as references and guides. @@ -49,10 +40,10 @@ The code in your pipeline's `DoFn` functions runs often, and often across multip The Beam SDK for Java provides a convenient way to test an individual `DoFn` called [DoFnTester](https://github.com/apache/beam/blob/master/sdks/java/core/src/test/java/org/apache/beam/sdk/transforms/DoFnTesterTest.java), which is included in the SDK `Transforms` package. -`DoFnTester`uses the [JUnit](https://junit.org) framework. To use `DoFnTester`, you'll need to do the following: +`DoFnTester`uses the [JUnit](http://junit.org) framework. To use `DoFnTester`, you'll need to do the following: 1. Create a `DoFnTester`. You'll need to pass an instance of the `DoFn` you want to test to the static factory method for `DoFnTester`. -2. 
Create one or more main test inputs of the appropriate type for your `DoFn`. If your `DoFn` takes side inputs and/or produces [multiple outputs]({{ site.baseurl }}/documentation/programming-guide#additional-outputs), you should also create the side inputs and the output tags. +2. Create one or more main test inputs of the appropriate type for your `DoFn`. If your `DoFn` takes side inputs and/or produces [multiple outputs](/documentation/programming-guide#additional-outputs), you should also create the side inputs and the output tags. 3. Call `DoFnTester.processBundle` to process the main inputs. 4. Use JUnit's `Assert.assertThat` method to ensure the test outputs returned from `processBundle` match your expected values. @@ -60,30 +51,30 @@ The Beam SDK for Java provides a convenient way to test an individual `DoFn` cal To create a `DoFnTester`, first create an instance of the `DoFn` you want to test. You then use that instance when you create a `DoFnTester` using the `.of()` static factory method: -```java +{{< highlight java >}} static class MyDoFn extends DoFn { ... } MyDoFn myDoFn = ...; DoFnTester fnTester = DoFnTester.of(myDoFn); -``` +{{< /highlight >}} ### Creating Test Inputs You'll need to create one or more test inputs for `DoFnTester` to send to your `DoFn`. To create test inputs, simply create one or more input variables of the same input type that your `DoFn` accepts. In the case above: -```java +{{< highlight java >}} static class MyDoFn extends DoFn { ... } MyDoFn myDoFn = ...; DoFnTester fnTester = DoFnTester.of(myDoFn); String testInput = "test1"; -``` +{{< /highlight >}} #### Side Inputs If your `DoFn` accepts side inputs, you can create those side inputs by using the method `DoFnTester.setSideInputs`. -```java +{{< highlight java >}} static class MyDoFn extends DoFn { ... } MyDoFn myDoFn = ...; DoFnTester fnTester = DoFnTester.of(myDoFn); @@ -91,9 +82,9 @@ DoFnTester fnTester = DoFnTester.of(myDoFn); PCollectionView> sideInput = ...; Iterable value = ...; fnTester.setSideInputInGlobalWindow(sideInput, value); -``` +{{< /highlight >}} -See the `ParDo` documentation on [side inputs]({{ site.baseurl }}/documentation/programming-guide/#side-inputs) for more information. +See the `ParDo` documentation on [side inputs](/documentation/programming-guide/#side-inputs) for more information. #### Additional Outputs @@ -106,7 +97,7 @@ Suppose your `DoFn` produces outputs of type `String` and `Integer`. You create `TupleTag` objects for each, and bundle them into a `TupleTagList`, then set it for the `DoFnTester` as follows: -```java +{{< highlight java >}} static class MyDoFn extends DoFn { ... } MyDoFn myDoFn = ...; DoFnTester fnTester = DoFnTester.of(myDoFn); @@ -116,9 +107,9 @@ TupleTag tag2 = ...; TupleTagList tags = TupleTagList.of(tag1).and(tag2); fnTester.setOutputTags(tags); -``` +{{< /highlight >}} -See the `ParDo` documentation on [additional outputs]({{ site.baseurl }}/documentation/programming-guide/#additional-outputs) for more information. +See the `ParDo` documentation on [additional outputs](/documentation/programming-guide/#additional-outputs) for more information. ### Processing Test Inputs and Checking Results @@ -126,18 +117,18 @@ To process the inputs (and thus run the test on your `DoFn`), you call the metho `DoFnTester.processBundle` returns a `List` of outputs—that is, objects of the same type as the `DoFn`'s specified output type. For a `DoFn`, `processBundle` returns a `List`: -```java +{{< highlight java >}} static class MyDoFn extends DoFn { ... 
} MyDoFn myDoFn = ...; DoFnTester fnTester = DoFnTester.of(myDoFn); String testInput = "test1"; List testOutputs = fnTester.processBundle(testInput); -``` +{{< /highlight >}} To check the results of `processBundle`, you use JUnit's `Assert.assertThat` method to test if the `List` of outputs contains the values you expect: -```java +{{< highlight java >}} String testInput = "test1"; List testOutputs = fnTester.processBundle(testInput); @@ -145,7 +136,7 @@ Assert.assertThat(testOutputs, Matchers.hasItems(...)); // Process a larger batch in a single step. Assert.assertThat(fnTester.processBundle("input1", "input2", "input3"), Matchers.hasItems(...)); -``` +{{< /highlight >}} ## Testing Composite Transforms @@ -163,22 +154,22 @@ To test a composite transform you've created, you can use the following pattern: You create a `TestPipeline` as follows: -```java +{{< highlight java >}} Pipeline p = TestPipeline.create(); -``` +{{< /highlight >}} -> **Note:** Read about testing unbounded pipelines in Beam in [this blog post]({{ site.baseurl }}/blog/2016/10/20/test-stream.html). +> **Note:** Read about testing unbounded pipelines in Beam in [this blog post](/blog/2016/10/20/test-stream.html). ### Using the Create Transform -You can use the `Create` transform to create a `PCollection` out of a standard in-memory collection class, such as Java `List`. See [Creating a PCollection]({{ site.baseurl }}/documentation/programming-guide/#creating-a-pcollection) for more information. +You can use the `Create` transform to create a `PCollection` out of a standard in-memory collection class, such as Java `List`. See [Creating a PCollection](/documentation/programming-guide/#creating-a-pcollection) for more information. ### PAssert [PAssert](https://beam.apache.org/releases/javadoc/{{ site.release_latest }}/index.html?org/apache/beam/sdk/testing/PAssert.html) is a class included in the Beam Java SDK that is an assertion on the contents of a `PCollection`. You can use `PAssert`to verify that a `PCollection` contains a specific set of expected elements. For a given `PCollection`, you can use `PAssert` to verify the contents as follows: -```java +{{< highlight java >}} PCollection output = ...; // Check whether a PCollection contains some elements in any order. @@ -187,18 +178,18 @@ PAssert.that(output) "elem1", "elem3", "elem2"); -``` +{{< /highlight >}} Any code that uses `PAssert` must link in `JUnit` and `Hamcrest`. If you're using Maven, you can link in `Hamcrest` by adding the following dependency to your project's `pom.xml` file: -```java +{{< highlight java >}} org.hamcrest hamcrest-all 1.3 test -``` +{{< /highlight >}} For more information on how these classes work, see the [org.apache.beam.sdk.testing](https://beam.apache.org/releases/javadoc/{{ site.release_latest }}/index.html?org/apache/beam/sdk/testing/package-summary.html) package documentation. @@ -206,7 +197,7 @@ For more information on how these classes work, see the [org.apache.beam.sdk.tes The following code shows a complete test for a composite transform. The test applies the `Count` transform to an input `PCollection` of `String` elements. The test uses the `Create` transform to create the input `PCollection` from a Java `List`. -```java +{{< highlight java >}} public class CountTest { // Our static input data, which will make up the initial PCollection. @@ -240,7 +231,7 @@ public void testCount() { // Run the pipeline. 
p.run(); } -``` +{{< /highlight >}} ## Testing a Pipeline End-to-End @@ -255,11 +246,11 @@ You can use the test classes in the Beam SDKs (such as `TestPipeline` and `PAsse ### Testing the WordCount Pipeline -The following example code shows how one might test the [WordCount example pipeline]({{ site.baseurl }}/get-started/wordcount-example/). `WordCount` usually reads lines from a text file for input data; instead, the test creates a Java `List` containing some text lines and uses a `Create` transform to create an initial `PCollection`. +The following example code shows how one might test the [WordCount example pipeline](/get-started/wordcount-example/). `WordCount` usually reads lines from a text file for input data; instead, the test creates a Java `List` containing some text lines and uses a `Create` transform to create an initial `PCollection`. `WordCount`'s final transform (from the composite transform `CountWords`) produces a `PCollection` of formatted word counts suitable for printing. Rather than write that `PCollection` to an output text file, our test pipeline uses `PAssert` to verify that the elements of the `PCollection` match those of a static `String` array containing our expected output data. -```java +{{< highlight java >}} public class WordCountTest { // Our static input data, which will comprise the initial PCollection. @@ -291,4 +282,4 @@ public class WordCountTest { p.run(); } } -``` +{{< /highlight >}} diff --git a/website/www/site/content/en/documentation/programming-guide/_index.md b/website/www/site/content/en/documentation/programming-guide/_index.md index 9034eaa843701..3ec52b277e9c0 100644 --- a/website/www/site/content/en/documentation/programming-guide/_index.md +++ b/website/www/site/content/en/documentation/programming-guide/_index.md @@ -1,9 +1,6 @@ --- -layout: section title: "Beam Programming Guide" -section_menu: section-menu/documentation.html -permalink: /documentation/programming-guide/ -redirect_from: +aliases: - /learn/programming-guide/ - /docs/learn/programming-guide/ --- @@ -31,16 +28,11 @@ programmatically building your Beam pipeline. As the programming guide is filled out, the text will include code samples in multiple languages to help illustrate how to implement Beam concepts in your pipelines. - +{{< language-switcher java py >}} -{:.language-py} +{{< paragraph class="language-py" >}} The Python SDK supports Python 2.7, 3.5, 3.6, and 3.7. New Python SDK releases will stop supporting Python 2.7 in 2020 ([BEAM-8371](https://issues.apache.org/jira/browse/BEAM-8371)). For best results, use Beam with Python 3. +{{< /paragraph >}} ## 1. Overview {#overview} @@ -87,7 +79,7 @@ A typical Beam driver program works as follows: * Create an initial `PCollection` for pipeline data, either using the IOs to read data from an external storage system, or using a `Create` transform to build a `PCollection` from in-memory data. -* **Apply** `PTransform`s to each `PCollection`. Transforms can change, filter, +* **Apply** `PTransforms` to each `PCollection`. Transforms can change, filter, group, analyze, or otherwise process the elements in a `PCollection`. A transform creates a new output `PCollection` *without modifying the input collection*. A typical pipeline applies subsequent transforms to each new @@ -109,10 +101,10 @@ asynchronous "job" (or equivalent) on that back-end. The `Pipeline` abstraction encapsulates all the data and steps in your data processing task. 
Your Beam driver program typically starts by constructing a -[Pipeline](https://beam.apache.org/releases/javadoc/{{ site.release_latest }}/index.html?org/apache/beam/sdk/Pipeline.html) +[Pipeline](https://beam.apache.org/releases/javadoc/{{< param release_latest >}}/index.html?org/apache/beam/sdk/Pipeline.html) [Pipeline](https://github.com/apache/beam/blob/master/sdks/python/apache_beam/pipeline.py) object, and then using that object as the basis for creating the pipeline's data -sets as `PCollection`s and its operations as `PTransform`s. +sets as `PCollection`s and its operations as `Transform`s. To use Beam, your driver program must first create an instance of the Beam SDK class `Pipeline` (typically in the `main()` function). When you create your @@ -121,21 +113,25 @@ your pipeline's configuration options programatically, but it's often easier to set the options ahead of time (or read them from the command line) and pass them to the `Pipeline` object when you create the object. -```java +{{< highlight java >}} // Start by defining the options for the pipeline. PipelineOptions options = PipelineOptionsFactory.create(); // Then create the pipeline. Pipeline p = Pipeline.create(options); -``` -```py +{{< /highlight >}} + +{{< highlight py >}} + +{{< /highlight >}} + +{{< highlight go >}} // In order to start creating the pipeline for execution, a Pipeline object and a Scope object are needed. p, s := beam.NewPipelineWithRoot() -``` +{{< /highlight >}} ### 2.1. Configuring pipeline options {#configuring-pipeline-options} @@ -158,18 +154,22 @@ you can use to set fields in `PipelineOptions` using command-line arguments. To read options from the command-line, construct your `PipelineOptions` object as demonstrated in the following example code: -```java +{{< highlight java >}} PipelineOptions options = PipelineOptionsFactory.fromArgs(args).withValidation().create(); -``` -```py +{{< /highlight >}} + +{{< highlight py >}} + +{{< /highlight >}} + +{{< highlight go >}} // If beamx or Go flags are used, flags must be parsed first. flag.Parse() -``` +{{< /highlight >}} This interprets command-line arguments that follow the format: @@ -183,7 +183,7 @@ This interprets command-line arguments that follow the format: Building your `PipelineOptions` this way lets you specify any of the options as a command-line argument. -> **Note:** The [WordCount example pipeline]({{ site.baseurl }}/get-started/wordcount-example) +> **Note:** The [WordCount example pipeline](/get-started/wordcount-example) > demonstrates how to set pipeline options at runtime by using command-line > options. @@ -194,7 +194,7 @@ You can add your own custom options in addition to the standard setter methods for each option, as in the following example for adding `input` and `output` custom options: -```java +{{< highlight java >}} public interface MyOptions extends PipelineOptions { String getInput(); void setInput(String input); @@ -202,24 +202,28 @@ public interface MyOptions extends PipelineOptions { String getOutput(); void setOutput(String output); } -``` -```py +{{< /highlight >}} + +{{< highlight py >}} + +{{< /highlight >}} + +{{< highlight go >}} var ( input = flag.String("input", "", "") output = flag.String("output", "", "") ) -``` +{{< /highlight >}} You can also specify a description, which appears when a user passes `--help` as a command-line argument, and a default value. 
You set the description and default value using annotations, as follows: -```java +{{< highlight java >}} public interface MyOptions extends PipelineOptions { @Description("Input for the pipeline") @Default.String("gs://my-bucket/input") @@ -227,47 +231,52 @@ public interface MyOptions extends PipelineOptions { void setInput(String input); @Description("Output for the pipeline") - @Default.String("gs://my-bucket/output") + @Default.String("gs://my-bucket/input") String getOutput(); void setOutput(String output); } -``` -```py +{{< /highlight >}} + +{{< highlight py >}} + +{{< /highlight >}} + +{{< highlight go >}} var ( input = flag.String("input", "gs://my-bucket/input", "Input for the pipeline") output = flag.String("output", "gs://my-bucket/output", "Output for the pipeline") ) -``` - +{{< /highlight >}} -{:.language-java} +{{< paragraph class="language-java" >}} It's recommended that you register your interface with `PipelineOptionsFactory` and then pass the interface when creating the `PipelineOptions` object. When you register your interface with `PipelineOptionsFactory`, the `--help` can find your custom options interface and add it to the output of the `--help` command. `PipelineOptionsFactory` will also validate that your custom options are compatible with all other registered options. +{{< /paragraph >}} -{:.language-java} +{{< paragraph class="language-java" >}} The following example code shows how to register your custom options interface with `PipelineOptionsFactory`: +{{< /paragraph >}} -```java +{{< highlight java >}} PipelineOptionsFactory.register(MyOptions.class); MyOptions options = PipelineOptionsFactory.fromArgs(args) .withValidation() .as(MyOptions.class); -``` +{{< /highlight >}} Now your pipeline can accept `--input=value` and `--output=value` as command-line arguments. ## 3. PCollections {#pcollections} -The [PCollection](https://beam.apache.org/releases/javadoc/{{ site.release_latest }}/index.html?org/apache/beam/sdk/values/PCollection.html) +The [PCollection](https://beam.apache.org/releases/javadoc/{{< param release_latest >}}/index.html?org/apache/beam/sdk/values/PCollection.html) `PCollection` abstraction represents a potentially distributed, multi-element data set. You can think of a `PCollection` as "pipeline" data; Beam transforms use `PCollection` objects as @@ -305,7 +314,7 @@ would apply `TextIO.Read` `io.TextFileSource` to your `Pipeline` to create a `PCollection`: -```java +{{< highlight java >}} public static void main(String[] args) { // Create the pipeline. PipelineOptions options = @@ -316,39 +325,46 @@ public static void main(String[] args) { PCollection lines = p.apply( "ReadMyFile", TextIO.read().from("gs://some/inputData.txt")); } -``` -```py +{{< /highlight >}} + +{{< highlight py >}} + +{{< /highlight >}} + +{{< highlight go >}} lines := textio.Read(s, "gs://some/inputData.txt") -``` +{{< /highlight >}} See the [section on I/O](#pipeline-io) to learn more about how to read from the various data sources supported by the Beam SDK. #### 3.1.2. Creating a PCollection from in-memory data {#creating-pcollection-in-memory} -{:.language-java} +{{< paragraph class="language-java" >}} To create a `PCollection` from an in-memory Java `Collection`, you use the Beam-provided `Create` transform. Much like a data adapter's `Read`, you apply `Create` directly to your `Pipeline` object itself. +{{< /paragraph >}} -{:.language-java} +{{< paragraph class="language-java" >}} As parameters, `Create` accepts the Java `Collection` and a `Coder` object. 
The `Coder` specifies how the elements in the `Collection` should be [encoded](#element-type). +{{< /paragraph >}} -{:.language-py} +{{< paragraph class="language-py" >}} To create a `PCollection` from an in-memory `list`, you use the Beam-provided `Create` transform. Apply this transform directly to your `Pipeline` object itself. +{{< /paragraph >}} The following example code shows how to create a `PCollection` from an in-memory `List``list`: -```java +{{< highlight java >}} public static void main(String[] args) { // Create a Java Collection, in this case a List of Strings. final List LINES = Arrays.asList( @@ -365,11 +381,14 @@ public static void main(String[] args) { // Apply Create, passing the list and the coder, to create the PCollection. p.apply(Create.of(LINES)).setCoder(StringUtf8Coder.of()); } -``` -```py +{{< /highlight >}} + +{{< highlight py >}} + +{{< /highlight >}} ### 3.2. PCollection characteristics {#pcollection-characteristics} @@ -387,25 +406,19 @@ around to distributed workers). The Beam SDKs provide a data encoding mechanism that includes built-in encoding for commonly-used types as well as support for specifying custom encodings as needed. -#### 3.2.2. Element schema {#element-schema} - -In many cases, the element type in a `PCollection` has a structure that can introspected. -Examples are JSON, Protocol Buffer, Avro, and database records. Schemas provide a way to -express types as a set of named fields, allowing for more-expressive aggregations. - -#### 3.2.3. Immutability {#immutability} +#### 3.2.2. Immutability {#immutability} A `PCollection` is immutable. Once created, you cannot add, remove, or change individual elements. A Beam Transform might process each element of a `PCollection` and generate new pipeline data (as a new `PCollection`), *but it does not consume or modify the original input collection*. -#### 3.2.4. Random access {#random-access} +#### 3.2.3. Random access {#random-access} A `PCollection` does not support random access to individual elements. Instead, Beam Transforms consider every element in a `PCollection` individually. -#### 3.2.5. Size and boundedness {#size-and-boundedness} +#### 3.2.4. Size and boundedness {#size-and-boundedness} A `PCollection` is a large, immutable "bag" of elements. There is no upper limit on how many elements a `PCollection` can contain; any given `PCollection` might @@ -436,7 +449,7 @@ on a per-window basis — as the data set is generated, they process each `PCollection` as a succession of these finite windows. -#### 3.2.6. Element timestamps {#element-timestamps} +#### 3.2.5. Element timestamps {#element-timestamps} Each element in a `PCollection` has an associated intrinsic **timestamp**. The timestamp for each element is initially assigned by the [Source](#pipeline-io) @@ -490,12 +503,13 @@ slight difference: You apply the transform to the input `PCollection`, passing the transform itself as an argument, and the operation returns the output `PCollection`. This takes the general form: -```java +{{< highlight java >}} [Output PCollection] = [Input PCollection].apply([Transform]) -``` -```py +{{< /highlight >}} + +{{< highlight py >}} [Output PCollection] = [Input PCollection] | [Transform] -``` +{{< /highlight >}} Because Beam uses a generic `apply` method for `PCollection`, you can both chain transforms sequentially and also apply transforms that contain other transforms @@ -505,22 +519,22 @@ SDKs). How you apply your pipeline's transforms determines the structure of your pipeline. 
The best way to think of your pipeline is as a directed acyclic graph, where `PTransform` nodes are subroutines that accept `PCollection` nodes as inputs and emit `PCollection` nodes as outputs. For example, you can chain together transforms to create a pipeline that successively modifies input data: -```java +{{< highlight java >}} [Final Output PCollection] = [Initial Input PCollection].apply([First Transform]) .apply([Second Transform]) .apply([Third Transform]) -``` -```py +{{< /highlight >}} + +{{< highlight py >}} [Final Output PCollection] = ([Initial Input PCollection] | [First Transform] | [Second Transform] | [Third Transform]) -``` +{{< /highlight >}} The graph of this pipeline looks like the following: ![This linear pipeline starts with one input collection, sequentially applies - three transforms, and ends with one output collection.]( - {{ "/images/design-your-pipeline-linear.svg" | prepend: site.baseurl }}) + three transforms, and ends with one output collection.](/images/design-your-pipeline-linear.svg) *Figure 1: A linear pipeline with three sequential transforms.* @@ -529,22 +543,22 @@ collection--remember that a `PCollection` is immutable by definition. This means that you can apply multiple transforms to the same input `PCollection` to create a branching pipeline, like so: -```java +{{< highlight java >}} [PCollection of database table rows] = [Database Table Reader].apply([Read Transform]) [PCollection of 'A' names] = [PCollection of database table rows].apply([Transform A]) [PCollection of 'B' names] = [PCollection of database table rows].apply([Transform B]) -``` -```py +{{< /highlight >}} + +{{< highlight py >}} [PCollection of database table rows] = [Database Table Reader] | [Read Transform] [PCollection of 'A' names] = [PCollection of database table rows] | [Transform A] [PCollection of 'B' names] = [PCollection of database table rows] | [Transform B] -``` +{{< /highlight >}} The graph of this branching pipeline looks like the following: ![This pipeline applies two transforms to a single input collection. Each - transform produces an output collection.]( - {{ "/images/design-your-pipeline-multiple-pcollections.svg" | prepend: site.baseurl }}) + transform produces an output collection.](/images/design-your-pipeline-multiple-pcollections.svg) *Figure 2: A branching pipeline. Two transforms are applied to a single PCollection of database table rows.* @@ -610,7 +624,7 @@ Like all Beam transforms, you apply `ParDo` by calling the `apply` method on the input `PCollection` and passing `ParDo` as an argument, as shown in the following example code: -```java +{{< highlight java >}} // The input PCollection of Strings. PCollection words = ...; @@ -622,8 +636,10 @@ PCollection wordLengths = words.apply( ParDo .of(new ComputeWordLengthFn())); // The DoFn to perform on each element, which // we define above. -``` -```py +{{< /highlight >}} + +{{< highlight py >}} + +{{< /highlight >}} + +{{< highlight go >}} // words is the input PCollection of strings var words beam.PCollection = ... @@ -641,7 +659,7 @@ func computeWordLengthFn(word string) int { } wordLengths := beam.ParDo(s, computeWordLengthFn, words) -``` +{{< /highlight >}} In the example, our input `PCollection` contains `String` values. We apply a `ParDo` transform that specifies a function (`ComputeWordLengthFn`) to compute @@ -659,19 +677,20 @@ define your pipeline's exact data processing tasks. 
> for writing user code for Beam transforms](#requirements-for-writing-user-code-for-beam-transforms) > and ensure that your code follows them. -{:.language-java} +{{< paragraph class="language-java" >}} A `DoFn` processes one element at a time from the input `PCollection`. When you create a subclass of `DoFn`, you'll need to provide type parameters that match the types of the input and output elements. If your `DoFn` processes incoming `String` elements and produces `Integer` elements for the output collection (like our previous example, `ComputeWordLengthFn`), your class declaration would look like this: +{{< /paragraph >}} -```java +{{< highlight java >}} static class ComputeWordLengthFn extends DoFn { ... } -``` +{{< /highlight >}} -{:.language-java} +{{< paragraph class="language-java" >}} Inside your `DoFn` subclass, you'll write a method annotated with `@ProcessElement` where you provide the actual processing logic. You don't need to manually extract the elements from the input collection; the Beam SDKs handle @@ -682,16 +701,18 @@ provides a method for emitting elements. The parameter types must match the inpu and output types of your `DoFn` or the framework will raise an error. Note: @Element and OutputReceiver were introduced in Beam 2.5.0; if using an earlier release of Beam, a ProcessContext parameter should be used instead. +{{< /paragraph >}} -{:.language-py} +{{< paragraph class="language-py" >}} Inside your `DoFn` subclass, you'll write a method `process` where you provide the actual processing logic. You don't need to manually extract the elements from the input collection; the Beam SDKs handle that for you. Your `process` method should accept an object of type `element`. This is the input element and output is emitted by using `yield` or `return` statement inside `process` method. +{{< /paragraph >}} -```java +{{< highlight java >}} static class ComputeWordLengthFn extends DoFn { @ProcessElement public void processElement(@Element String word, OutputReceiver out) { @@ -699,16 +720,20 @@ static class ComputeWordLengthFn extends DoFn { out.output(word.length()); } } -``` -```py +{{< /highlight >}} + +{{< highlight py >}} + +{{< /highlight >}} -{:.language-java} +{{< paragraph class="language-java" >}} > **Note:** If the elements in your input `PCollection` are key/value pairs, you > can access the key or value by using `element.getKey()` or > `element.getValue()`, respectively. +{{< /paragraph >}} A given `DoFn` instance generally gets invoked one or more times to process some arbitrary bundle of elements. However, Beam doesn't guarantee an exact number of @@ -722,12 +747,13 @@ requirements to ensure that Beam and the processing back-end can safely serialize and cache the values in your pipeline. Your method should meet the following requirements: -{:.language-java} +{{< paragraph class="language-java" >}} * You should not in any way modify an element returned by the `@Element` annotation or `ProcessContext.sideInput()` (the incoming elements from the input collection). * Once you output a value using `OutputReceiver.output()` you should not modify that value in any way. +{{< /paragraph >}} ##### 4.2.1.3. Lightweight DoFns and other abstractions {#lightweight-dofns} @@ -741,7 +767,7 @@ Here's the previous example, `ParDo` with `ComputeLengthWordsFn`, with the an anonymous inner class instance a lambda function: -```java +{{< highlight java >}} // The input PCollection. 
PCollection words = ...; @@ -755,23 +781,27 @@ PCollection wordLengths = words.apply( out.output(word.length()); } })); -``` -```py +{{< /highlight >}} + +{{< highlight py >}} + +{{< /highlight >}} + +{{< highlight go >}} // words is the input PCollection of strings var words beam.PCollection = ... lengths := beam.ParDo(s, func (word string) int { return len(word) }, words) -``` +{{< /highlight >}} If your `ParDo` performs a one-to-one mapping of input elements to output elements--that is, for each input element, it applies a function that produces @@ -783,7 +813,7 @@ Java 8 lambda function for additional brevity. Here's the previous example using `MapElements` `Map`: -```java +{{< highlight java >}} // The input PCollection. PCollection words = ...; @@ -792,19 +822,23 @@ PCollection words = ...; PCollection wordLengths = words.apply( MapElements.into(TypeDescriptors.integers()) .via((String word) -> word.length())); -``` -```py +{{< /highlight >}} + +{{< highlight py >}} + +{{< /highlight >}} -{:.language-java} +{{< paragraph class="language-java" >}} > **Note:** You can use Java 8 lambda functions with several other Beam > transforms, including `Filter`, `FlatMapElements`, and `Partition`. +{{< /paragraph >}} ##### 4.2.1.4. DoFn lifecycle {#dofn} Here is a sequence diagram that shows the lifecycle of the DoFn during @@ -814,8 +848,7 @@ Here is a sequence diagram that shows the lifecycle of the DoFn during instance reuse. They also give instanciation use cases. -![This is a sequence diagram that shows the lifecycle of the DoFn]( - {{ "/images/dofn-sequence-diagram.svg" | prepend: site.baseurl }}) +![This is a sequence diagram that shows the lifecycle of the DoFn](/images/dofn-sequence-diagram.svg) #### 4.2.2. GroupByKey {#groupbykey} @@ -904,7 +937,7 @@ IllegalStateException error at pipeline construction time. `CoGroupByKey` performs a relational join of two or more key/value `PCollection`s that have the same key type. -[Design Your Pipeline]({{ site.baseurl }}/documentation/pipelines/design-your-pipeline/#multiple-sources) +[Design Your Pipeline](/documentation/pipelines/design-your-pipeline/#multiple-sources) shows an example pipeline that uses a join. Consider using `CoGroupByKey` if you have multiple data sets that provide @@ -954,46 +987,66 @@ The first set of data contains names and email addresses. The second set of data contains names and phone numbers. -```java +{{< highlight java >}} + +{{< /highlight >}} + +{{< highlight py >}} + +{{< /highlight >}} After `CoGroupByKey`, the resulting data contains all data associated with each unique key from any of the input collections. -```java +{{< highlight java >}} + +{{< /highlight >}} + +{{< highlight py >}} + +{{< /highlight >}} The following code example joins the two `PCollection`s with `CoGroupByKey`, followed by a `ParDo` to consume the result. Then, the code uses tags to look up and format data from each collection. -```java +{{< highlight java >}} + +{{< /highlight >}} + +{{< highlight py >}} + +{{< /highlight >}} The formatted data looks like this: -```java +{{< highlight java >}} + +{{< /highlight >}} + +{{< highlight py >}} + +{{< /highlight >}} #### 4.2.4. 
Combine {#combine} -[`Combine`](https://beam.apache.org/releases/javadoc/{{ site.release_latest }}/index.html?org/apache/beam/sdk/transforms/Combine.html) +[`Combine`](https://beam.apache.org/releases/javadoc/{{< param release_latest >}}/index.html?org/apache/beam/sdk/transforms/Combine.html) [`Combine`](https://github.com/apache/beam/blob/master/sdks/python/apache_beam/transforms/core.py) is a Beam transform for combining collections of elements or values in your data. `Combine` has variants that work on entire `PCollection`s, and some that @@ -1018,7 +1071,7 @@ input/output type. The following example code shows a simple combine function. -```java +{{< highlight java >}} // Sum a collection of Integer values. The function SumInts implements the interface SerializableFunction. public static class SumInts implements SerializableFunction, Integer> { @Override @@ -1030,11 +1083,13 @@ public static class SumInts implements SerializableFunction, I return sum; } } -``` +{{< /highlight >}} -```py +{{< highlight py >}} + +{{< /highlight >}} ##### 4.2.4.2. Advanced combinations using CombineFn {#advanced-combines} @@ -1070,7 +1125,7 @@ corresponding methods: The following example code shows how to define a `CombineFn` that computes a mean average: -```java +{{< highlight java >}} public class AverageFn extends CombineFn { public static class Accum { int sum = 0; @@ -1102,11 +1157,14 @@ public class AverageFn extends CombineFn { return ((double) accum.sum) / accum.count; } } -``` -```py +{{< /highlight >}} + +{{< highlight py >}} + +{{< /highlight >}} ##### 4.2.4.3. Combining a PCollection into a single value {#combining-pcollection} @@ -1116,20 +1174,23 @@ containing one element. The following example code shows how to apply the Beam provided sum combine function to produce a single sum value for a `PCollection` of integers. -```java +{{< highlight java >}} // Sum.SumIntegerFn() combines the elements in the input PCollection. The resulting PCollection, called sum, // contains one value: the sum of all the elements in the input PCollection. PCollection pc = ...; PCollection sum = pc.apply( Combine.globally(new Sum.SumIntegerFn())); -``` -```py +{{< /highlight >}} + +{{< highlight py >}} + +{{< /highlight >}} ##### 4.2.4.4. Combine and global windowing {#combine-global-windowing} @@ -1144,15 +1205,16 @@ To have `Combine` instead return an empty `PCollection` if the input is empty, specify `.withoutDefaults` when you apply your `Combine` transform, as in the following code example: -```java +{{< highlight java >}} PCollection pc = ...; PCollection sum = pc.apply( Combine.globally(new Sum.SumIntegerFn()).withoutDefaults()); -``` -```py +{{< /highlight >}} + +{{< highlight py >}} pc = ... sum = pc | beam.CombineGlobally(sum).without_defaults() -``` +{{< /highlight >}} ##### 4.2.4.5. Combine and non-global windowing {#combine-non-global-windowing} @@ -1192,7 +1254,7 @@ create a single, merged value to be paired with each key. This pattern of a Beam's Combine PerKey transform. The combine function you supply to Combine PerKey must be an associative reduction function or a subclass of `CombineFn`. -```java +{{< highlight java >}} // PCollection is grouped by key and the Double values associated with each key are combined into a Double. 
PCollection> salesRecords = ...; PCollection> totalSalesPerPerson = @@ -1205,18 +1267,19 @@ PCollection> playerAccuracy = ...; PCollection> avgAccuracyPerPlayer = playerAccuracy.apply(Combine.perKey( new MeanInts()))); -``` -```py +{{< /highlight >}} + +{{< highlight py >}} # PCollection is grouped by key and the numeric values associated with each key # are averaged into a float. player_accuracies = ... {% github_sample /apache/beam/blob/master/sdks/python/apache_beam/examples/snippets/snippets_test.py tag:combine_per_key %} -``` +{{< /highlight >}} #### 4.2.5. Flatten {#flatten} -[`Flatten`](https://beam.apache.org/releases/javadoc/{{ site.release_latest }}/index.html?org/apache/beam/sdk/transforms/Flatten.html) +[`Flatten`](https://beam.apache.org/releases/javadoc/{{< param release_latest >}}/index.html?org/apache/beam/sdk/transforms/Flatten.html) [`Flatten`](https://github.com/apache/beam/blob/master/sdks/python/apache_beam/transforms/core.py) is a Beam transform for `PCollection` objects that store the same data type. `Flatten` merges multiple `PCollection` objects into a single logical @@ -1225,7 +1288,7 @@ is a Beam transform for `PCollection` objects that store the same data type. The following example shows how to apply a `Flatten` transform to merge multiple `PCollection` objects. -```java +{{< highlight java >}} // Flatten takes a PCollectionList of PCollection objects of a given type. // Returns a single PCollection that contains all of the elements in the PCollection objects in that list. PCollection pc1 = ...; @@ -1234,15 +1297,18 @@ PCollection pc3 = ...; PCollectionList collections = PCollectionList.of(pc1).and(pc2).and(pc3); PCollection merged = collections.apply(Flatten.pCollections()); -``` +{{< /highlight >}} -```py + +{{< highlight py >}} + +{{< /highlight >}} ##### 4.2.5.1. Data encoding in merged collections {#data-encoding-merged-collections} @@ -1265,7 +1331,7 @@ pipeline is constructed. #### 4.2.6. Partition {#partition} -[`Partition`](https://beam.apache.org/releases/javadoc/{{ site.release_latest }}/index.html?org/apache/beam/sdk/transforms/Partition.html) +[`Partition`](https://beam.apache.org/releases/javadoc/{{< param release_latest >}}/index.html?org/apache/beam/sdk/transforms/Partition.html) [`Partition`](https://github.com/apache/beam/blob/master/sdks/python/apache_beam/transforms/core.py) is a Beam transform for `PCollection` objects that store the same data type. `Partition` splits a single `PCollection` into a fixed number of smaller @@ -1283,7 +1349,7 @@ for instance). The following example divides a `PCollection` into percentile groups. -```java +{{< highlight java >}} // Provide an int value with the desired number of result partitions, and a PartitionFn that represents the // partitioning function. In this example, we define the PartitionFn in-line. Returns a PCollectionList // containing each of the resulting partitions as individual PCollection objects. @@ -1298,8 +1364,10 @@ PCollectionList studentsByPercentile = // You can extract each partition from the PCollectionList using the get method, as follows: PCollection fortiethPercentile = studentsByPercentile.get(4); -``` -```py +{{< /highlight >}} + +{{< highlight py >}} + +{{< /highlight >}} ### 4.3. Requirements for writing user code for Beam transforms {#requirements-for-writing-user-code-for-beam-transforms} @@ -1394,7 +1463,7 @@ determined by the input data, or depend on a different branch of your pipeline. #### 4.4.1. 
Passing side inputs to ParDo {#side-inputs-pardo} -```java +{{< highlight java >}} // Pass side inputs to your ParDo transform by invoking .withSideInputs. // Inside your DoFn, access the side input by using the method DoFn.ProcessContext.sideInput. @@ -1423,8 +1492,10 @@ determined by the input data, or depend on a different branch of your pipeline. } }).withSideInputs(maxWordLengthCutOffView) ); -``` -```py +{{< /highlight >}} + +{{< highlight py >}} + +{{< /highlight >}} #### 4.4.2. Side inputs and windowing {#side-inputs-windowing} @@ -1481,7 +1553,7 @@ together. #### 4.5.1. Tags for multiple outputs {#output-tags} -```java +{{< highlight java >}} // To emit elements to multiple output PCollections, create a TupleTag object to identify each collection // that your ParDo produces. For example, if your ParDo produces three output PCollections (the main output // and two additional outputs), you must create three TupleTags. The following example code shows how to @@ -1527,9 +1599,10 @@ together. // Specify the tags for the two additional outputs as a TupleTagList. TupleTagList.of(wordLengthsAboveCutOffTag) .and(markedWordsTag))); -``` +{{< /highlight >}} -```py +{{< highlight py >}} + +{{< /highlight >}} #### 4.5.2. Emitting to multiple outputs in your DoFn {#multiple-outputs-dofn} -```java +{{< highlight java >}} // Inside your ParDo's DoFn, you can emit an element to a specific output PCollection by providing a // MultiOutputReceiver to your process method, and passing in the appropriate TupleTag to obtain an OutputReceiver. // After your ParDo, extract the resulting output PCollections from the returned PCollectionTuple. @@ -1567,9 +1641,10 @@ together. out.get(markedWordsTag).output(word); } }})); -``` +{{< /highlight >}} -```py +{{< highlight py >}} + +{{< /highlight >}} #### 4.5.3. Accessing additional parameters in your DoFn {#other-dofn-parameters} -{:.language-java} +{{< paragraph class="language-java" >}} In addition to the element and the `OutputReceiver`, Beam will populate other parameters to your DoFn's `@ProcessElement` method. Any combination of these parameters can be added to your process method in any order. +{{< /paragraph >}} -{:.language-py} +{{< paragraph class="language-py" >}} In addition to the element, Beam will populate other parameters to your DoFn's `process` method. Any combination of these parameters can be added to your process method in any order. +{{< /paragraph >}} -{:.language-java} +{{< paragraph class="language-java" >}} **Timestamp:** To access the timestamp of an input element, add a parameter annotated with `@Timestamp` of type `Instant`. For example: +{{< /paragraph >}} -{:.language-py} +{{< paragraph class="language-py" >}} **Timestamp:** To access the timestamp of an input element, add a keyword parameter default to `DoFn.TimestampParam`. For example: +{{< /paragraph >}} -```java +{{< highlight java >}} .of(new DoFn() { public void processElement(@Element String word, @Timestamp Instant timestamp) { }}) -``` +{{< /highlight >}} -```py +{{< highlight py >}} import apache_beam as beam class ProcessRecord(beam.DoFn): @@ -1616,29 +1696,31 @@ class ProcessRecord(beam.DoFn): # access timestamp of element. pass -``` +{{< /highlight >}} -{:.language-java} +{{< paragraph class="language-java" >}} **Window:** To access the window an input element falls into, add a parameter of the type of the window used for the input `PCollection`. 
If the parameter is a window type (a subclass of `BoundedWindow`) that does not match the input `PCollection`, then an error will be raised. If an element falls in multiple windows (for example, this will happen when using `SlidingWindows`), then the `@ProcessElement` method will be invoked multiple time for the element, once for each window. For example, when fixed windows are being used, the window is of type `IntervalWindow`. +{{< /paragraph >}} -{:.language-py} +{{< paragraph class="language-py" >}} **Window:** To access the window an input element falls into, add a keyword parameter default to `DoFn.WindowParam`. If an element falls in multiple windows (for example, this will happen when using `SlidingWindows`), then the -`process` method will be invoked multiple time for the element, once for each window. +`process` method will be invoked multiple time for the element, once for each window. +{{< /paragraph >}} -```java +{{< highlight java >}} .of(new DoFn() { public void processElement(@Element String word, IntervalWindow window) { }}) -``` +{{< /highlight >}} -```py +{{< highlight py >}} import apache_beam as beam class ProcessRecord(beam.DoFn): @@ -1647,26 +1729,28 @@ class ProcessRecord(beam.DoFn): # access window e.g window.end.micros pass -``` +{{< /highlight >}} -{:.language-java} +{{< paragraph class="language-java" >}} **PaneInfo:** When triggers are used, Beam provides a `PaneInfo` object that contains information about the current firing. Using `PaneInfo` you can determine whether this is an early or a late firing, and how many times this window has already fired for this key. +{{< /paragraph >}} -{:.language-py} +{{< paragraph class="language-py" >}} **PaneInfo:** When triggers are used, Beam provides a `DoFn.PaneInfoParam` object that contains information about the current firing. Using `DoFn.PaneInfoParam` you can determine whether this is an early or a late firing, and how many times this window has already fired for this key. This feature implementation in python sdk is not fully completed, see more at [BEAM-3759](https://issues.apache.org/jira/browse/BEAM-3759). +{{< /paragraph >}} -```java +{{< highlight java >}} .of(new DoFn() { public void processElement(@Element String word, PaneInfo paneInfo) { }}) -``` +{{< /highlight >}} -```py +{{< highlight py >}} import apache_beam as beam class ProcessRecord(beam.DoFn): @@ -1675,31 +1759,36 @@ class ProcessRecord(beam.DoFn): # access pane info e.g pane_info.is_first, pane_info.is_last, pane_info.timing pass -``` +{{< /highlight >}} -{:.language-java} +{{< paragraph class="language-java" >}} **PipelineOptions:** -The `PipelineOptions` for the current pipeline can always be accessed in a process method by adding it as a parameter: -```java +The `PipelineOptions` for the current pipeline can always be accessed in a process method by adding it +as a parameter: +{{< /paragraph >}} + +{{< highlight java >}} .of(new DoFn() { public void processElement(@Element String word, PipelineOptions options) { }}) -``` +{{< /highlight >}} -{:.language-java} +{{< paragraph class="language-java" >}} `@OnTimer` methods can also access many of these parameters. Timestamp, Window, key, `PipelineOptions`, `OutputReceiver`, and `MultiOutputReceiver` parameters can all be accessed in an `@OnTimer` method. In addition, an `@OnTimer` method can take a parameter of type `TimeDomain` which tells whether the timer is based on event time or processing time. 
Timers are explained in more detail in the -[Timely (and Stateful) Processing with Apache Beam]({{ site.baseurl }}/blog/2017/08/28/timely-processing.html) blog post. +[Timely (and Stateful) Processing with Apache Beam](/blog/2017/08/28/timely-processing.html) blog post. +{{< /paragraph >}} -{:.language-py} +{{< paragraph class="language-py" >}} **Timer and State:** In addition to aforementioned parameters, user defined Timer and State parameters can be used in a Stateful DoFn. Timers and States are explained in more detail in the -[Timely (and Stateful) Processing with Apache Beam]({{ site.baseurl }}/blog/2017/08/28/timely-processing.html) blog post. +[Timely (and Stateful) Processing with Apache Beam](/blog/2017/08/28/timely-processing.html) blog post. +{{< /paragraph >}} -```py +{{< highlight py >}} class StatefulDoFn(beam.DoFn): """An example stateful DoFn with state and timer""" @@ -1755,7 +1844,8 @@ class StatefulDoFn(beam.DoFn): # Some business logic return True -``` +{{< /highlight >}} + ### 4.6. Composite transforms {#composite-transforms} Transforms can have a nested structure, where a complex transform performs @@ -1766,12 +1856,12 @@ transform can make your code more modular and easier to understand. The Beam SDK comes packed with many useful composite transforms. See the API reference pages for a list of transforms: - * [Pre-written Beam transforms for Java](https://beam.apache.org/releases/javadoc/{{ site.release_latest }}/index.html?org/apache/beam/sdk/transforms/package-summary.html) - * [Pre-written Beam transforms for Python](https://beam.apache.org/releases/pydoc/{{ site.release_latest }}/apache_beam.transforms.html) + * [Pre-written Beam transforms for Java](https://beam.apache.org/releases/javadoc/{{< param release_latest >}}/index.html?org/apache/beam/sdk/transforms/package-summary.html) + * [Pre-written Beam transforms for Python](https://beam.apache.org/releases/pydoc/{{< param release_latest >}}/apache_beam.transforms.html) #### 4.6.1. An example composite transform {#composite-transform-example} -The `CountWords` transform in the [WordCount example program]({{ site.baseurl }}/get-started/wordcount-example/) +The `CountWords` transform in the [WordCount example program](/get-started/wordcount-example/) is an example of a composite transform. `CountWords` is a `PTransform` subclass that consists of multiple nested transforms. @@ -1792,7 +1882,7 @@ Your composite transform's parameters and return value must match the initial input type and final return type for the entire transform, even if the transform's intermediate data changes type multiple times. -```java +{{< highlight java >}} public static class CountWords extends PTransform, PCollection>> { @Override @@ -1809,11 +1899,13 @@ transform's intermediate data changes type multiple times. return wordCounts; } } -``` +{{< /highlight >}} -```py +{{< highlight py >}} + +{{< /highlight >}} #### 4.6.2. Creating a composite transform {#composite-transform-creation} @@ -1822,25 +1914,28 @@ class and override the `expand` method to specify the actual processing logic. You can then use this transform just as you would a built-in transform from the Beam SDK. -{:.language-java} +{{< paragraph class="language-java" >}} For the `PTransform` class type parameters, you pass the `PCollection` types that your transform takes as input, and produces as output. To take multiple `PCollection`s as input, or produce multiple `PCollection`s as output, use one of the multi-collection types for the relevant type parameter. 
+{{< /paragraph >}} The following code sample shows how to declare a `PTransform` that accepts a `PCollection` of `String`s for input, and outputs a `PCollection` of `Integer`s: -```java +{{< highlight java >}} static class ComputeWordLengths extends PTransform, PCollection> { ... } -``` +{{< /highlight >}} -```py +{{< highlight py >}} + +{{< /highlight >}} Within your `PTransform` subclass, you'll need to override the `expand` method. The `expand` method is where you add the processing logic for the `PTransform`. @@ -1851,7 +1946,7 @@ value. The following code sample shows how to override `expand` for the `ComputeWordLengths` class declared in the previous example: -```java +{{< highlight java >}} static class ComputeWordLengths extends PTransform, PCollection> { @Override @@ -1860,11 +1955,13 @@ The following code sample shows how to override `expand` for the // transform logic goes here ... } -``` +{{< /highlight >}} -```py +{{< highlight py >}} + +{{< /highlight >}} As long as you override the `expand` method in your `PTransform` subclass to accept the appropriate input `PCollection`(s) and return the corresponding @@ -1879,7 +1976,7 @@ transforms to be nested within the structure of your pipeline. #### 4.6.3. PTransform Style Guide {#ptransform-style-guide} -The [PTransform Style Guide]({{ site.baseurl }}/contribute/ptransform-style-guide/) +The [PTransform Style Guide](/contribute/ptransform-style-guide/) contains additional information not included here, such as style guidelines, logging and testing guidance, and language-specific considerations. The guide is a useful starting point when you want to write new composite PTransforms. @@ -1890,7 +1987,7 @@ When you create a pipeline, you often need to read data from some external source, such as a file or a database. Likewise, you may want your pipeline to output its result data to an external storage system. Beam provides read and write transforms for a [number of common data storage -types]({{ site.baseurl }}/documentation/io/built-in/). If you want your pipeline +types](/documentation/io/built-in/). If you want your pipeline to read from or write to a data storage format that isn't supported by the built-in transforms, you can [implement your own read and write transforms]({{site.baseurl }}/documentation/io/developing-io-overview/). @@ -1902,13 +1999,13 @@ representation of the data for use by your pipeline. You can use a read transform at any point while constructing your pipeline to create a new `PCollection`, though it will be most common at the start of your pipeline. -```java +{{< highlight java >}} PCollection lines = p.apply(TextIO.read().from("gs://some/inputData.txt")); -``` +{{< /highlight >}} -```py +{{< highlight py >}} lines = pipeline | beam.io.ReadFromText('gs://some/inputData.txt') -``` +{{< /highlight >}} ### 5.2. Writing output data {#pipeline-io-writing-data} @@ -1917,13 +2014,13 @@ You will most often use write transforms at the end of your pipeline to output your pipeline's final results. However, you can use a write transform to output a `PCollection`'s data at any point in your pipeline. -```java +{{< highlight java >}} output.apply(TextIO.write().to("gs://some/outputData")); -``` +{{< /highlight >}} -```py +{{< highlight py >}} output | beam.io.WriteToText('gs://some/outputData') -``` +{{< /highlight >}} ### 5.3. File-based input and output data {#file-based-data} @@ -1935,15 +2032,17 @@ filesystem-specific consistency models. 
The following TextIO example uses a glob operator (\*) to read all matching input files that have prefix "input-" and the suffix ".csv" in the given location: -```java +{{< highlight java >}} p.apply("ReadFromText", TextIO.read().from("protocol://my_bucket/path/to/input-*.csv")); -``` +{{< /highlight >}} -```py +{{< highlight py >}} + +{{< /highlight >}} To read data from disparate sources into a single `PCollection`, read each one independently and then use the [Flatten](#flatten) transform to create a single @@ -1960,1208 +2059,116 @@ The following write transform example writes multiple output files to a location. Each file has the prefix "numbers", a numeric tag, and the suffix ".csv". -```java +{{< highlight java >}} records.apply("WriteToText", TextIO.write().to("protocol://my_bucket/path/to/numbers") .withSuffix(".csv")); -``` +{{< /highlight >}} -```py +{{< highlight py >}} + +{{< /highlight >}} ### 5.4. Beam-provided I/O transforms {#provided-io-transforms} See the [Beam-provided I/O Transforms]({{site.baseurl }}/documentation/io/built-in/) page for a list of the currently available I/O transforms. -## 6. Schemas {#schemas} -Often, the types of the records being processed have an obvious structure. Common Beam sources produce -JSON, Avro, Protocol Buffer, or database row objects; all of these types have well defined structures, -structures that can often be determined by examining the type. Even within a SDK pipeline, Simple Java POJOs -(or equivalent structures in other languages) are often used as intermediate types, and these also have a - clear structure that can be inferred by inspecting the class. By understanding the structure of a pipeline’s - records, we can provide much more concise APIs for data processing. - -### 6.1. What is a schema {#what-is-a-schema} -Most structured records share some common characteristics: -* They can be subdivided into separate named fields. Fields usually have string names, but sometimes - as in the case of indexed - tuples - have numerical indices instead. -* There is a confined list of primitive types that a field can have. These often match primitive types in most programming - languages: int, long, string, etc. -* Often a field type can be marked as optional (sometimes referred to as nullable) or required. - -Oten records have a nested structure. A nested structure occurs when a field itself has subfields so the -type of the field itself has a schema. Fields that are array or map types is also a common feature of these structured -records. +## 6. Data encoding and type safety {#data-encoding-and-type-safety} -For example, consider the following schema, representing actions in a fictitious e-commerce company: - -**Purchase** - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
-| Field Name | Field Type |
-|------------|------------|
-| userId | STRING |
-| itemId | INT64 |
-| shippingAddress | ROW(ShippingAddress) |
-| cost | INT64 |
-| transactions | ARRAY[ROW(Transaction)] |
+When Beam runners execute your pipeline, they often need to materialize the +intermediate data in your `PCollection`s, which requires converting elements to +and from byte strings. The Beam SDKs use objects called `Coder`s to describe how +the elements of a given `PCollection` may be encoded and decoded. -**ShippingAddress** - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
-| Field Name | Field Type |
-|------------|------------|
-| streetAddress | STRING |
-| city | STRING |
-| state | nullable STRING |
-| country | STRING |
-| postCode | STRING |
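As a minimal illustration of the point made in the added paragraph above the ShippingAddress table — that runners materialize intermediate `PCollection` data by encoding elements to byte strings and decoding them back — a built-in coder such as `StringUtf8Coder` can be exercised directly. This round trip is my own sketch under that assumption, not code from the guide.

{{< highlight java >}}
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;

import org.apache.beam.sdk.coders.StringUtf8Coder;

public class CoderRoundTrip {
  public static void main(String[] args) throws IOException {
    StringUtf8Coder coder = StringUtf8Coder.of();

    // Encode an element to bytes, as a runner does when it materializes a PCollection.
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    coder.encode("hello beam", out);
    byte[] bytes = out.toByteArray();

    // Decode the bytes back into the original element.
    String decoded = coder.decode(new ByteArrayInputStream(bytes));
    System.out.println(decoded); // prints "hello beam"
  }
}
{{< /highlight >}}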
+> Note that coders are unrelated to parsing or formatting data when interacting +> with external data sources or sinks. Such parsing or formatting should +> typically be done explicitly, using transforms such as `ParDo` or +> `MapElements`. -**Transaction** - - - - - - - - - - - - - - - - - -
-| Field Name | Field Type |
-|------------|------------|
-| bank | STRING |
-| purchaseAmount | DOUBLE |
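The removed schema section describes the Purchase, ShippingAddress, and Transaction records only as tables. For readers who want to see the same structure as code, here is a hedged sketch of the three schemas written with Beam's `Schema.builder()` API; the class and constant names are mine, and the original section instead derives these schemas from `@DefaultSchema`-annotated Java classes, as shown further below.

{{< highlight java >}}
import org.apache.beam.sdk.schemas.Schema;
import org.apache.beam.sdk.schemas.Schema.FieldType;

public class PurchaseSchemas {
  // Transaction: bank (STRING), purchaseAmount (DOUBLE).
  static final Schema TRANSACTION =
      Schema.builder()
          .addStringField("bank")
          .addDoubleField("purchaseAmount")
          .build();

  // ShippingAddress: all STRING fields, with "state" marked nullable.
  static final Schema SHIPPING_ADDRESS =
      Schema.builder()
          .addStringField("streetAddress")
          .addStringField("city")
          .addNullableField("state", FieldType.STRING)
          .addStringField("country")
          .addStringField("postCode")
          .build();

  // Purchase: a nested ShippingAddress row and an array of Transaction rows.
  static final Schema PURCHASE =
      Schema.builder()
          .addStringField("userId")
          .addInt64Field("itemId")
          .addRowField("shippingAddress", SHIPPING_ADDRESS)
          .addInt64Field("cost")
          .addArrayField("transactions", FieldType.row(TRANSACTION))
          .build();
}
{{< /highlight >}}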
+{{< paragraph class="language-java" >}} +In the Beam SDK for Java, the type `Coder` provides the methods required for +encoding and decoding data. The SDK for Java provides a number of Coder +subclasses that work with a variety of standard Java types, such as Integer, +Long, Double, StringUtf8 and more. You can find all of the available Coder +subclasses in the [Coder package](https://github.com/apache/beam/tree/master/sdks/java/core/src/main/java/org/apache/beam/sdk/coders). +{{< /paragraph >}} -Purchase event records are represented by the above purchase schema. Each purchase event contains a shipping address, which -is a nested row containing its own schema. Each purchase also contains an array of credit-card transactions -(a list, because a purchase might be split across multiple credit cards); each item in the transaction list is a row -with its own schema. +{{< paragraph class="language-py" >}} +In the Beam SDK for Python, the type `Coder` provides the methods required for +encoding and decoding data. The SDK for Python provides a number of Coder +subclasses that work with a variety of standard Python types, such as primitive +types, Tuple, Iterable, StringUtf8 and more. You can find all of the available +Coder subclasses in the +[apache_beam.coders](https://github.com/apache/beam/tree/master/sdks/python/apache_beam/coders) +package. +{{< /paragraph >}} -This provides an abstract description of the types involved, one that is abstracted away from any specific programming -language. +> Note that coders do not necessarily have a 1:1 relationship with types. For +> example, the Integer type can have multiple valid coders, and input and output +> data can use different Integer coders. A transform might have Integer-typed +> input data that uses BigEndianIntegerCoder, and Integer-typed output data that +> uses VarIntCoder. -Schemas provide us a type-system for Beam records that is independent of any specific programming-language type. There -might be multiple Java classes that all have the same schema (for example a Protocol-Buffer class or a POJO class), -and Beam will allow us to seamlessly convert between these types. Schemas also provide a simple way to reason about -types across different programming-language APIs. +### 6.1. Specifying coders {#specifying-coders} -A `PCollection` with a schema does not need to have a `Coder` specified, as Beam knows how to encode and decode -Schema rows; Beam uses a special coder to encode schema types. +The Beam SDKs require a coder for every `PCollection` in your pipeline. In most +cases, the Beam SDK is able to automatically infer a `Coder` for a `PCollection` +based on its element type or the transform that produces it, however, in some +cases the pipeline author will need to specify a `Coder` explicitly, or develop +a `Coder` for their custom type. -### 6.2. Schemas for programming language types {#schemas-for-pl-types} -While schemas themselves are language independent, they are designed to embed naturally into the programming languages -of the Beam SDK being used. This allows Beam users to continue using native types while reaping the advantage of -having Beam understand their element schemas. - - {:.language-java} - In Java you could use the following set of classes to represent the purchase schema. Beam will automatically - infer the correct schema based on the members of the class. - -```java -@DefaultSchema(JavaBeanSchema.class) -public class Purchase { - public String getUserId(); // Returns the id of the user who made the purchase. 
- public long getItemId(); // Returns the identifier of the item that was purchased. - public ShippingAddress getShippingAddress(); // Returns the shipping address, a nested type. - public long getCostCents(); // Returns the cost of the item. - public List getTransactions(); // Returns the transactions that paid for this purchase (returns a list, since the purchase might be spread out over multiple credit cards). - - @SchemaCreate - public Purchase(String userId, long itemId, ShippingAddress shippingAddress, long costCents, - List transactions) { - ... - } -} +{{< paragraph class="language-java" >}} +You can explicitly set the coder for an existing `PCollection` by using the +method `PCollection.setCoder`. Note that you cannot call `setCoder` on a +`PCollection` that has been finalized (e.g. by calling `.apply` on it). +{{< /paragraph >}} -@DefaultSchema(JavaBeanSchema.class) -public class ShippingAddress { - public String getStreetAddress(); - public String getCity(); - @Nullable public String getState(); - public String getCountry(); - public String getPostCode(); - - @SchemaCreate - public ShippingAddress(String streetAddress, String city, @Nullable String state, String country, - String postCode) { - ... - } -} +{{< paragraph class="language-java" >}} +You can get the coder for an existing `PCollection` by using the method +`getCoder`. This method will fail with an `IllegalStateException` if a coder has +not been set and cannot be inferred for the given `PCollection`. +{{< /paragraph >}} -@DefaultSchema(JavaBeanSchema.class) -public class Transaction { - public String getBank(); - public double getPurchaseAmount(); - - @SchemaCreate - public Transaction(String bank, double purchaseAmount) { - ... - } -} -``` +Beam SDKs use a variety of mechanisms when attempting to automatically infer the +`Coder` for a `PCollection`. -Using JavaBean classes as above is one way to map a schema to Java classes. However multiple Java classes might have -the same schema, in which case the different Java types can often be used interchangeably. Beam will add implicit -conversions betweens types that have matching schemas. For example, the above -`Transaction` class has the same schema as the following class: +{{< paragraph class="language-java" >}} +Each pipeline object has a `CoderRegistry`. The `CoderRegistry` represents a +mapping of Java types to the default coders that the pipeline should use for +`PCollection`s of each type. +{{< /paragraph >}} -```java -@DefaultSchema(JavaFieldSchema.class) -public class TransactionPojo { - public String bank; - public double purchaseAmount; -} -``` +{{< paragraph class="language-py" >}} +The Beam SDK for Python has a `CoderRegistry` that represents a mapping of +Python types to the default coder that should be used for `PCollection`s of each +type. +{{< /paragraph >}} -So if we had two `PCollection`s as follows +{{< paragraph class="language-java" >}} +By default, the Beam SDK for Java automatically infers the `Coder` for the +elements of a `PCollection` produced by a `PTransform` using the type parameter +from the transform's function object, such as `DoFn`. In the case of `ParDo`, +for example, a `DoFn` function object accepts an input element +of type `Integer` and produces an output element of type `String`. In such a +case, the SDK for Java will automatically infer the default `Coder` for the +output `PCollection` (in the default pipeline `CoderRegistry`, this is +`StringUtf8Coder`). 
+{{< /paragraph >}} -```java -PCollection transactionBeans = readTransactionsAsJavaBean(); -PCollection transactionPojos = readTransactionsAsPojo(); -``` - -Then these two `PCollection`s would have the same schema, even though their Java types would be different. This means -for example the following two code snippets are valid: - -```java -transactionBeans.apply(ParDo.of(new DoFn<...>() { - @ProcessElement public void process(@Element TransactionPojo pojo) { - ... - } -})); -``` - -and -```java -transactionPojos.apply(ParDo.of(new DoFn<...>() { - @ProcessElement public void process(@Element Transaction row) { - } -})); -``` - -Even though the in both cases the `@Element` parameter differs from the the `PCollection`'s Java type, since the -schemas are the same Beam will automatically make the conversion. The built-in `Convert` transform can also be used -to translate between Java types of equivalent schemas, as detailed below. - -### 6.3. Schema definition {#schema-definition} -The schema for a `PCollection` defines elements of that `PCollection` as an ordered list of named fields. Each field -has a name, a type, and possibly a set of user options. The type of a field can be primitive or composite. The following -are the primitive types currently supported by Beam: - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
| Type     | Description                                             |
|----------|---------------------------------------------------------|
| BYTE     | An 8-bit signed value                                   |
| INT16    | A 16-bit signed value                                   |
| INT32    | A 32-bit signed value                                   |
| INT64    | A 64-bit signed value                                   |
| DECIMAL  | An arbitrary-precision decimal type                     |
| FLOAT    | A 32-bit IEEE 754 floating point number                 |
| DOUBLE   | A 64-bit IEEE 754 floating point number                 |
| STRING   | A string                                                |
| DATETIME | A timestamp represented as milliseconds since the epoch |
| BOOLEAN  | A boolean value                                         |
| BYTES    | A raw byte array                                        |
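For illustration only, here is a minimal sketch (not taken from the guide) of declaring a schema that uses several of these primitive types directly with `Schema.builder()`; the field names are invented for the example, and `addNullableField` marks an optional STRING field:

```java
import org.apache.beam.sdk.schemas.Schema;
import org.apache.beam.sdk.schemas.Schema.FieldType;

// Hypothetical schema built from the primitive types listed above.
Schema transactionSchema =
    Schema.builder()
        .addStringField("bank")                      // STRING
        .addDoubleField("purchaseAmount")            // DOUBLE
        .addInt64Field("transactionId")              // INT64
        .addBooleanField("flagged")                  // BOOLEAN
        .addDateTimeField("eventTime")               // DATETIME
        .addNullableField("note", FieldType.STRING)  // nullable STRING
        .build();
```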
- -A field can also reference a nested schema. In this case, the field will have type ROW, and the nested schema will -be an attribute of this field type. - -Three collection types are supported as field types: ARRAY, ITERABLE and MAP: -* **ARRAY** This represents a repeated value type, where the repeated elements can have any supported type. Arrays of -nested rows are supported, as are arrays of arrays. -* **ITERABLE** This is very similar to the array type, it represents a repeated value, but one in which the full list of -items is not known until iterated over. This is intended for the case where an iterable might be larger than the -available memory, and backed by external storage (for example, this can happen with the iterable returned by a -`GroupByKey`). The repeated elements can have any supported type. -* **MAP** This represents an associative map from keys to values. All schema types are supported for both keys and values. - Values that contain map types cannot be used as keys in any grouping operation. - -### 6.4. Logical types {#logical-types} -Users can extend the schema type system to add custom logical types that can be used as a field. A logical type is -identified by a unique identifier and an argument. A logical type also specifies an underlying schema type to be used -for storage, along with conversions to and from that type. As an example, a logical union can always be represented as -a row with nullable fields, where the user ensures that only one of those fields is ever set at a time. However this can -be tedious and complex to manage. The OneOf logical type provides a value class that makes it easier to manage the type -as a union, while still using a row with nullable fields as its underlying storage. Each logical type also has a -unique identifier, so they can be interpreted by other languages as well. More examples of logical types are listed -below. - -#### 6.4.1. Defining a logical type {#defining-a-logical-type} -To define a logical type you must specify a Schema type to be used to represent the underlying type as well as a unique -identifier for that type. A logical type imposes additional semantics on top a schema type. For example, a logical -type to represent nanosecond timestamps is represented as a schema containing an INT64 and an INT32 field. This schema -alone does not say anything about how to interpret this type, however the logical type tells you that this represents -a nanosecond timestamp, with the INT64 field representing seconds and the INT32 field representing nanoseconds. - -Logical types are also specified by an argument, which allows creating a class of related types. For example, a -limited-precision decimal type would have an integer argument indicating how many digits of precision are represented. -The argument is represented by a schema type, so can itself be a complex type. - - {:.language-java} -In Java, a logical type is specified as a subclass of the `LogicalType` class. A custom Java class can be specified to -represent the logical type and conversion functions must be supplied to convert back and forth between this Java class -and the underlying Schema type representation. For example, the logical type representing nanosecond timestamp might -be implemented as follows - -```java -// A Logical type using java.time.Instant to represent the logical type. -public class TimestampNanos implements LogicalType { - // The underlying schema used to represent rows. 
- private final Schema SCHEMA = Schema.builder().addInt64Field("seconds").addInt32Field("nanos").build(); - @Override public String getIdentifier() { return "timestampNanos"; } - @Override public FieldType getBaseType() { return schema; } - - // Convert the representation type to the underlying Row type. Called by Beam when necessary. - @Override public Row toBaseType(Instant instant) { - return Row.withSchema(schema).addValues(instant.getEpochSecond(), instant.getNano()).build(); - } - - // Convert the underlying Row type to and Instant. Called by Beam when necessary. - @Override public Instant toInputType(Row base) { - return Instant.of(row.getInt64("seconds"), row.getInt32("nanos")); - } - - ... -} -``` - -#### 6.4.2. Useful logical types {#built-in-logical-types} -##### **EnumerationType** -This logical type allows creating an enumeration type consisting of a set of named constants. - -```java -Schema schema = Schema.builder() - … - .addLogicalTypeField(“color”, EnumerationType.create(“RED”, “GREEN”, “BLUE”)) - .build(); -``` - -The value of this field is stored in the row as an INT32 type, however the logical type defines a value type that lets -you access the enumeration either as a string or a value. For example: - -```java -EnumerationType.Value enumValue = enumType.valueOf(“RED”); -enumValue.getValue(); // Returns 0, the integer value of the constant. -enumValue.toString(); // Returns “RED”, the string value of the constant -``` - -Given a row object with an enumeration field, you can also extract the field as the enumeration value. - -```java -EnumerationType.Value enumValue = row.getLogicalTypeValue(“color”, EnumerationType.Value.class); -``` - -Automatic schema inference from Java POJOs and JavaBeans automatically converts Java enums to EnumerationType logical -types. - -##### **OneOfType** -OneOfType allows creating a disjoint union type over a set of schema fields. For example: - -```java -Schema schema = Schema.builder() - … - .addLogicalTypeField(“oneOfField”, - OneOfType.create(Field.of(“intField”, FieldType.INT32), - Field.of(“stringField”, FieldType.STRING), - Field.of(“bytesField”, FieldType.BYTES))) - .build(); -``` - -The value of this field is stored in the row as another Row type, where all the fields are marked as nullable. The -logical type however defines a Value object that contains an enumeration value indicating which field was set and allows - getting just that field: - -```java -// Returns an enumeration indicating all possible case values for the enum. -// For the above example, this will be -// EnumerationType.create(“intField”, “stringField”, “bytesField”); -EnumerationType oneOfEnum = onOfType.getCaseEnumType(); - -// Creates an instance of the union with the string field set. -OneOfType.Value oneOfValue = oneOfType.createValue(“stringField”, “foobar”); - -// Handle the oneof -switch (oneOfValue.getCaseEnumType().toString()) { - case “intField”: - return processInt(oneOfValue.getValue(Integer.class)); - case “stringField”: - return processString(oneOfValue.getValue(String.class)); - case “bytesField”: - return processBytes(oneOfValue.getValue(bytes[].class)); -} -``` - -In the above example we used the field names in the switch statement for clarity, however the enum integer values could - also be used. - -### 6.5. Creating Schemas {#creating-schemas} - -In order to take advantage of schemas, your `PCollection`s must have a schema attached to it. Often, the source -itself will attach a schema to the PCollection. 
For example, when using `AvroIO` to read Avro files, the source can -automatically infer a Beam schema from the Avro schema and attach that to the Beam `PCollection`. However not all sources -produce schemas. In addition, often Beam pipelines have intermediate stages and types, and those also can benefit from -the expressiveness of schemas. - -#### 6.5.1. Inferring schemas {#inferring-schemas} -{:.language-java} -Beam is able to infer schemas from a variety of common Java types. The `@DefaultSchema` annotation can be used to tell -Beam to infer schemas from a specific type. The annotation takes a `SchemaProvider` as an argument, and `SchemaProvider` -classes are already built in for common Java types. The `SchemaRegistry` can also be invoked programmatically for cases -where it is not practical to annotate the Java type itself. - -##### **Java POJOs** -A POJO (Plain Old Java Object) is a Java object that is not bound by any restriction other than the Java Language -Specification. A POJO can contain member variables that are primitives, that are other POJOs, or are collections maps or -arrays thereof. POJOs do not have to extend prespecified classes or extend any specific interfaces. - -If a POJO class is annotated with `@DefaultSchema(JavaFieldSchema.class)`, Beam will automatically infer a schema for -this class. Nested classes are supported as are classes with `List`, array, and `Map` fields. - -For example, annotating the following class tells Beam to infer a schema from this POJO class and apply it to any -`PCollection`. - -```java -@DefaultSchema(JavaFieldSchema.class) -public class TransactionPojo { - public final String bank; - public final double purchaseAmount; - @SchemaCreate - public TransactionPojo(String bank, double purchaseAmount) { - this.bank = bank. - this.purchaseAmount = purchaseAmount; - } -} -// Beam will automatically infer the correct schema for this PCollection. No coder is needed as a result. -PCollection pojos = readPojos(); -```` - -The `@SchemaCreate` annotation tells Beam that this constructor can be used to create instances of TransactionPojo, -assuming that constructor parameters have the same names as the field names. `@SchemaCreate` can also be used to annotate -static factory methods on the class, allowing the constructor to remain private. If there is no `@SchemaCreate` - annotation then all the fields must be non-final and the class must have a zero-argument constructor. - -There are a couple of other useful annotations that affect how Beam infers schemas. By default the schema field names -inferred will match that of the class field names. However `@SchemaFieldName` can be used to specify a different name to -be used for the schema field. `@SchemaIgnore` can be used to mark specific class fields as excluded from the inferred -schema. For example, it’s common to have ephemeral fields in a class that should not be included in a schema -(e.g. caching the hash value to prevent expensive recomputation of the hash), and `@SchemaIgnore` can be used to -exclude these fields. Note that ignored fields will not be included in the encoding of these records. - -In some cases it is not convenient to annotate the POJO class, for example if the POJO is in a different package that is -not owned by the Beam pipeline author. 
In these cases the schema inference can be triggered programmatically in -pipeline’s main function as follows: - -```java - pipeline.getSchemaRegistry().registerPOJO(TransactionPOJO.class); -``` - -##### **Java Beans** -Java Beans are a de-facto standard for creating reusable property classes in Java. While the full -standard has many characteristics, the key ones are that all properties are accessed via getter and setter classes, and -the name format for these getters and setters is standardized. A Java Bean class can be annotated with -`@DefaultSchema(JavaBeanSchema.class)` and Beam will automatically infer a schema for this class. For example: - -```java -@DefaultSchema(JavaBeanSchema.class) -public class TransactionBean { - public TransactionBean() { … } - public String getBank() { … } - public void setBank(String bank) { … } - public double getPurchaseAmount() { … } - public void setPurchaseAmount(double purchaseAmount) { … } -} -// Beam will automatically infer the correct schema for this PCollection. No coder is needed as a result. -PCollection beans = readBeans(); -``` - -The `@SchemaCreate` annotation can be used to specify a constructor or a static factory method, in which case the -setters and zero-argument constructor can be omitted. - -```java -@DefaultSchema(JavaBeanSchema.class) -public class TransactionBean { - @SchemaCreate - Public TransactionBean(String bank, double purchaseAmount) { … } - public String getBank() { … } - public double getPurchaseAmount() { … } -} -``` - -`@SchemaFieldName` and `@SchemaIgnore` can be used to alter the schema inferred, just like with POJO classes. - -##### **AutoValue** -Java value classes are notoriously difficult to generate correctly. There is a lot of boilerplate you must create in -order to properly implement a value class. AutoValue is a popular library for easily generating such classes by i -mplementing a simple abstract base class. - -Beam can infer a schema from an AutoValue class. For example: - -```java -@DefaultSchema(AutoValueSchema.class) -@AutoValue -public abstract class TransactionValue { - public abstract String getBank(); - public abstract double getPurchaseAmount(); -} -``` - -This is all that’s needed to generate a simple AutoValue class, and the above `@DefaultSchema` annotation tells Beam to -infer a schema from it. This also allows AutoValue elements to be used inside of `PCollection`s. - -`@SchemaFieldName` and `@SchemaIgnore` can be used to alter the schema inferred. - -### 6.6. Using Schema Transforms {#using-schemas} -A schema on a `PCollection` enables a rich variety of relational transforms. The fact that each record is composed of -named fields allows for simple and readable aggregations that reference fields by name, similar to the aggregations in -a SQL expression. - -#### 6.6.1. Field selection syntax -The advantage of schemas is that they allow referencing of element fields by name. Beam provides a selection syntax for -referencing fields, including nested and repeated fields. This syntax is used by all of the schema transforms when -referencing the fields they operate on. The syntax can also be used inside of a DoFn to specify which schema fields to -process. - -Addressing fields by name still retains type safety as Beam will check that schemas match at the time the pipeline graph -is constructed. If a field is specified that does not exist in the schema, the pipeline will fail to launch. 
In addition, -if a field is specified with a type that does not match the type of that field in the schema, the pipeline will fail to -launch. - -The following characters are not allowed in field names: . * [ ] { } - -##### **Top-level fields** -In order to select a field at the top level of a schema, the name of the field is specified. For example, to select just -the user ids from a `PCollection` of purchases one would write (using the `Select` transform) - -```java -purchases.apply(Select.fieldNames(“userId”)); -``` - -##### **Nested fields** -Individual nested fields can be specified using the dot operator. For example, to select just the postal code from the - shipping address one would write - -```java -purchases.apply(Select.fieldNames(“shippingAddress.postCode”)); -``` - -##### **Wildcards** -The * operator can be specified at any nesting level to represent all fields at that level. For example, to select all -shipping-address fields one would write - -```java -purchases.apply(Select.fieldNames(“shippingAddress.*”)); -``` - -##### **Arrays** -An array field, where the array element type is a row, can also have subfields of the element type addressed. When -selected, the result is an array of the selected subfield type. For example - -```java -purchases.apply(Select.fieldNames(“transactions[].bank”)); -``` - -Will result in a row containing an array field with element-type string, containing the list of banks for each -transaction. - -While the use of [] brackets in the selector is recommended, to make it clear that array elements are being selected, -they can be omitted for brevity. In the future, array slicing will be supported, allowing selection of portions of the -array. - -##### **Maps** -A map field, where the value type is a row, can also have subfields of the value type addressed. When selected, the -result is a map where the keys are the same as in the original map but the value is the specified type. Similar to -arrays, the use of {} curly brackets in the selector is recommended, to make it clear that map value elements are being -selected, they can be omitted for brevity. In the future, map key selectors will be supported, allowing selection of -specific keys from the map. For example, given the following schema: - -**PurchasesByType** - - - - - - - - - - - - - -
| Field Name | Field Type                 |
|------------|----------------------------|
| purchases  | MAP{STRING, ROW{Purchase}} |
- -The following - -```java -purchasesByType.apply(Select.fieldNames(“purchases{}.userId”)); -``` - -Will result in a row containing an map field with key-type string and value-type string. The selected map will contain -all of the keys from the original map, and the values will be the userId contained in the purchasee reecord. - -While the use of {} brackets in the selector is recommended, to make it clear that map value elements are being selected, -they can be omitted for brevity. In the future, map slicing will be supported, allowing selection of specific keys from -the map. - -#### 6.6.2. Schema transforms -Beam provides a collection of transforms that operate natively on schemas. These transforms are very expressive, -allowing selections and aggregations in terms of named schema fields. Following are some examples of useful -schema transforms. - -##### **Selecting input** -Often a computation is only interested in a subset of the fields in an input `PCollection`. The `Select` transform allows -one to easily project out only the fields of interest. The resulting `PCollection` has a schema containing each selected -field as a top-level field. Both top-level and nested fields can be selected. For example, in the Purchase schema, one -could select only the userId and streetAddress fields as follows - -```java -purchases.apply(Select.fieldNames(“userId”, shippingAddress.streetAddress”)); -``` - -The resulting `PCollection` will have the following schema - - - - - - - - - - - - - - - - - - -
| Field Name    | Field Type |
|---------------|------------|
| userId        | STRING     |
| streetAddress | STRING     |
- -The same is true for wildcard selections. The following - -```java -purchases.apply(Select.fieldNames(“userId”, shippingAddress.*”)); -``` - -Will result in the following schema - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
| Field Name    | Field Type      |
|---------------|-----------------|
| userId        | STRING          |
| streetAddress | STRING          |
| city          | STRING          |
| state         | nullable STRING |
| country       | STRING          |
| postCode      | STRING          |
- -When selecting fields nested inside of an array, the same rule applies that each selected field appears separately as a -top-level field in the resulting row. This means that if multiple fields are selected from the same nested row, each -selected field will appear as its own array field. For example - -```java -purchases.apply(Select.fieldNames( “transactions.bank”, transactions.purchaseAmount”)); -``` - -Will result in the following schema - - - - - - - - - - - - - - - - - -
| Field Name     | Field Type    |
|----------------|---------------|
| bank           | ARRAY[STRING] |
| purchaseAmount | ARRAY[DOUBLE] |
- -Wildcard selections are equivalent to separately selecting each field. - -Selecting fields nested inside of maps have the same semantics as arrays. If you select multiple fields from a map -, then each selected field will be expanded to its own map at the top level. This means that the set of map keys will - be copied, once for each selected field. - -Sometimes different nested rows will have fields with the same name. Selecting multiple of these fields would result in -a name conflict, as all selected fields are put in the same row schema. When this situation arises, the -`Select.withFieldNameAs` builder method can be used to provide an alternate name for the selected field. - -Another use of the Select transform is to flatten a nested schema into a single flat schema. For example - -```java -purchases.apply(Select.flattenedSchema()); -``` - -Will result in the following schema - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
| Field Name                    | Field Type      |
|-------------------------------|-----------------|
| userId                        | STRING          |
| itemId                        | STRING          |
| shippingAddress_streetAddress | STRING          |
| shippingAddress_city          | STRING          |
| shippingAddress_state         | nullable STRING |
| shippingAddress_country       | STRING          |
| shippingAddress_postCode      | STRING          |
| costCents                     | INT64           |
| transactions_bank             | ARRAY[STRING]   |
| transactions_purchaseAmount   | ARRAY[DOUBLE]   |
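As noted earlier, selecting two fields that share a name (for example the same subfield from two different nested rows) would put conflicting names into the result schema. A hedged sketch of resolving that with the `Select.withFieldNameAs` builder mentioned above — the second nested field (`refunds`) and the exact builder signature are assumptions made for illustration:

```java
// Hypothetical: "transactions" and "refunds" are both arrays of rows that
// contain a "bank" field, so one selected field is renamed to avoid a clash.
purchases.apply(
    Select.fieldNames("transactions[].bank", "refunds[].bank")
        .withFieldNameAs("refunds[].bank", "refundBank"));
```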
- -##### **Grouping aggregations** -The `Group` transform allows simply grouping data by any number of fields in the input schema, applying aggregations to -those groupings, and storing the result of those aggregations in a new schema field. The output of the `Group` transform -has a schema with one field corresponding to each aggregation performed. - -The simplest usage of `Group` specifies no aggregations, in which case all inputs matching the provided set of fields -are grouped together into an `ITERABLE` field. For example - -```java -purchases.apply(Group.byFieldNames(“userId”, shippingAddress.streetAddress”)); -``` - -The output schema of this is: - - - - - - - - - - - - - - - - - - -
| Field Name | Field Type                               |
|------------|------------------------------------------|
| key        | ROW{userId:STRING, streetAddress:STRING} |
| values     | ITERABLE[ROW[Purchase]]                  |
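Given the key/values schema above, the grouped output can be consumed like any other schema `PCollection`. A minimal sketch — the `grouped` variable and the default field names `key` and `values` are assumed from the example above:

```java
PCollection<String> perUserCounts =
    grouped.apply(ParDo.of(new DoFn<Row, String>() {
      @ProcessElement
      public void process(@Element Row element, OutputReceiver<String> out) {
        // "key" is a nested row holding the grouping fields.
        Row key = element.getRow("key");
        // "values" holds every input row that matched this key.
        Iterable<Row> purchases = element.getIterable("values");
        long count = 0;
        for (Row purchase : purchases) {
          count++;
        }
        out.output(key.getString("userId") + " made " + count + " purchases");
      }
    }));
```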
- -The key field contains the grouping key and the values field contains a list of all the values that matched that key. - -The names of the key and values fields in the output schema can be controlled using this withKeyField and withValueField -builders, as follows: - -```java -purchases.apply(Group.byFieldNames(“userId”, shippingAddress.streetAddress”) - .withKeyField(“userAndStreet”) - .withValueField(“matchingPurchases”)); -``` - -It is quite common to apply one or more aggregations to the grouped result. Each aggregation can specify one or more fields -to aggregate, an aggregation function, and the name of the resulting field in the output schema. For example, the -following application computes three aggregations grouped by userId, with all aggregations represented in a single -output schema: - -```java -purchases.apply(Group.byFieldNames(“userId”) - .aggregateField(“itemId”, Count.combineFn(), “numPurchases”) - .aggregateField(“costCents”, Sum.ofLongs(), “totalSpendCents”) - .aggregateField(“costCents”, Top.largestLongsFn(10), “topPurchases”)); -``` - -The result of this aggregation will have the following schema: - - - - - - - - - - - - - - - - - -
| Field Name | Field Type                                                                    |
|------------|-------------------------------------------------------------------------------|
| key        | ROW{userId:STRING}                                                            |
| value      | ROW{numPurchases: INT64, totalSpendCents: INT64, topPurchases: ARRAY[INT64]}  |
- -Often `Selected.flattenedSchema` will be use to flatten the result into a non-nested, flat schema. - -##### **Joins** -Beam supports equijoins on schema `PCollections` - namely joins where the join condition depends on the equality of a -subset of fields. For example, the following examples uses the Purchases schema to join transactions with the reviews -that are likely associated with that transaction (both the user and product match that in the transaction). This is a -"natural join" - one in which the same field names are used on both the left-hand and right-hand sides of the join - -and is specified with the `using` keyword: - -```java -PCollection transactions = readTransactions(); -PCollection reviews = readReviews(); -PCollection joined = transactions.apply( - Join.innerJoin(reviews).using(“userId”, “productId”)); -``` - -The resulting schema is the following: - - - - - - - - - - - - - - - - - -
| Field Name | Field Type       |
|------------|------------------|
| lhs        | ROW{Transaction} |
| rhs        | ROW{Review}      |
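To make the lhs/rhs layout above concrete, a short sketch that unpacks each joined row — it assumes the `joined` collection from the snippet above and that both schemas carry the `userId` and `productId` fields used in the join:

```java
joined.apply(ParDo.of(new DoFn<Row, String>() {
  @ProcessElement
  public void process(@Element Row row, OutputReceiver<String> out) {
    // The join result nests the original records under "lhs" and "rhs".
    Row transaction = row.getRow("lhs");
    Row review = row.getRow("rhs");
    out.output(
        "user " + transaction.getString("userId")
            + " reviewed product " + review.getString("productId"));
  }
}));
```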
- -Each resulting row contains one Review and one Review that matched the join condition. - -If the fields to match in the two schemas have different names, then the on function can be used. For example, if the -Review schema named those fields differently than the Transaction schema, then we could write the following: - -```java -PCollection joined = transactions.apply( - Join.innerJoin(reviews).on( - FieldsEqual - .left(“userId”, “productId”) - .right(“reviewUserId”, “reviewProductId”))); -``` - -In addition to inner joins, the Join transform supports full outer joins, left outer joins, and right outer joins. - -##### **Complex joins** -While most joins tend to be binary joins - joining two inputs together - sometimes you have more than two input -streams that all need to be joined on a common key. The `CoGroup` transform allows joining multiple `PCollections` -together based on equality of schema fields. Each `PCollection` can be marked as required or optional in the final -join record, providing a generalization of outer joins to joins with greater than two input `PCollection`s. The output -can optionally be expanded - providing individual joined records, as in the `Join` transform. The output can also be -processed in unexpanded format - providing the join key along with Iterables of all records from each input that matched -that key. - -##### **Filtering events** -The `Filter` transform can be configured with a set of predicates, each one based one specified fields. Only records for -which all predicates return true will pass the filter. For example the following - -```java -purchases.apply(Filter - .whereFieldName(“costCents”, c -> c > 100 * 20) - .whereFieldName(“shippingAddress.country”, c -> c.equals(“de”)); -``` - -Will produce all purchases made from Germany with a purchase price of greater than twenty cents. - - -##### **Adding fields to a schema** -The AddFields transform can be used to extend a schema with new fields. Input rows will be extended to the new schema by -inserting null values for the new fields, though alternate default values can be specified; if the default null value -is used then the new field type will be marked as nullable. Nested subfields can be added using the field selection -syntax, including nested fields inside arrays or map values. - -For example, the following application - -```java -purchases.apply(AddFields.create() - .field(“timeOfDaySeconds”, FieldType.INT32) - .field(“shippingAddress.deliveryNotes”, FieldType.STRING) - .field(“transactions.isFlagged, FieldType.BOOLEAN, false)); -``` - -Results in a `PCollection` with an expanded schema. All of the rows and fields of the input, but also with the specified -fields added to the schema. All resulting rows will have null values filled in for the **timeOfDaySeconds** and the -**shippingAddress.deliveryNotes** fields, and a false value filled in for the **transactions.isFlagged** field. - -##### **Removing fields from a schema** -`DropFields` allows specific fields to be dropped from a schema. Input rows will have their schemas truncated, and any -values for dropped fields will be removed from the output. Nested fields can also be dropped using the field selection -syntax. - -For example, the following snippet - -```java -purchases.apply(DropFields.fields(“userId”, “shippingAddress.streetAddress”)); -``` - -Results in a copy of the input with those two fields and their corresponding values removed. - -##### **Renaming schema fields** -`RenameFields` allows specific fields in a schema to be renamed. 
The field values in input rows are left unchanged, only -the schema is modified. This transform is often used to prepare records for output to a schema-aware sink, such as an -RDBMS, to make sure that the `PCollection` schema field names match that of the output. It can also be used to rename -fields generated by other transforms to make them more usable (similar to SELECT AS in SQL). Nested fields can also be -renamed using the field-selection syntax. - -For example, the following snippet - -```java -purchases.apply(RenameFields.create() - .rename(“userId”, “userIdentifier”) - .rename(“shippingAddress.streetAddress”, “shippingAddress.street”)); -``` - -Results in the same set of unmodified input elements, however the schema on the PCollection has been changed to rename -**userId** to **userIdentifier** and **shippingAddress.streetAddress** to **shippingAddress.street**. - -##### **Converting between types** -As mentioned, Beam can automatically convert between different Java types, as long as those types have equivalent -schemas. One way to do this is by using the `Convert` transform, as follows. - -```java -PCollection purchaseBeans = readPurchasesAsBeans(); -PCollection pojoPurchases = - purchaseBeans.apply(Convert.to(PurchasePojo.class)); -``` - -Beam will validate that the inferred schema for `PurchasePojo` matches that of the input `PCollection`, and will -then cast to a `PCollection`. - -Since the `Row` class can support any schema, any `PCollection` with schema can be cast to a `PCollection` of rows, as -follows. - -```java -PCollection purchaseRows = purchaseBeans.apply(Convert.toRows()); -``` - -If the source type is a single-field schema, Convert will also convert to the type of the field if asked, effectively -unboxing the row. For example, give a schema with a single INT64 field, the following will convert it to a -`PCollection` - -```java -PCollection longs = rows.apply(Convert.to(TypeDescriptors.longs())); -``` - -In all cases, type checking is done at pipeline graph construction, and if the types do not match the schema then the -pipeline will fail to launch. - -#### 6.6.3. Schemas in ParDo -A `PCollection` with a schema can apply a `ParDo`, just like any other `PCollection`. However the Beam runner is aware - of schemas when applying a `ParDo`, which enables additional functionality. - -##### **Input conversion** -Since Beam knows the schema of the source `PCollection`, it can automatically convert the elements to any Java type for -which a matching schema is known. For example, using the above-mentioned Transaction schema, say we have the following -`PCollection`: - -```java -PCollection purchases = readPurchases(); -``` - -If there were no schema, then the applied `DoFn` would have to accept an element of type `TransactionPojo`. However -since there is a schema, you could apply the following DoFn: - -```java -purchases.appy(ParDo.of(new DoFn() { - @ProcessElement public void process(@Element PurchaseBean purchase) { - ... - } -})); -``` - -Even though the `@Element` parameter does not match the Java type of the `PCollection`, since it has a matching schema -Beam will automatically convert elements. If the schema does not match, Beam will detect this at graph-construction time -and will fail the job with a type error. - -Since every schema can be represented by a Row type, Row can also be used here: - -```java -purchases.appy(ParDo.of(new DoFn() { - @ProcessElement public void process(@Element Row purchase) { - ... 
- } -})); -``` - -##### **Input selection** -Since the input has a schema, you can also automatically select specific fields to process in the DoFn. - -Given the above purchases `PCollection`, say you want to process just the userId and the itemId fields. You can do these -using the above-described selection expressions, as follows: - -```java -purchases.appy(ParDo.of(new DoFn() { - @ProcessElement public void process( - @FieldAccess(“userId”) String userId, @FieldAccess(“itemId”) long itemId) { - ... - } -})); -``` - -You can also select nested fields, as follows. - -```java -purchases.appy(ParDo.of(new DoFn() { - @ProcessElement public void process( - @FieldAccess(“shippingAddress.street”) String street) { - ... - } -})); -``` - -For more information, see the section on field-selection expressions. When selecting subschemas, Beam will -automatically convert to any matching schema type, just like when reading the entire row. - - -## 7. Data encoding and type safety {#data-encoding-and-type-safety} - -When Beam runners execute your pipeline, they often need to materialize the -intermediate data in your `PCollection`s, which requires converting elements to -and from byte strings. The Beam SDKs use objects called `Coder`s to describe how -the elements of a given `PCollection` may be encoded and decoded. - -> Note that coders are unrelated to parsing or formatting data when interacting -> with external data sources or sinks. Such parsing or formatting should -> typically be done explicitly, using transforms such as `ParDo` or -> `MapElements`. - -{:.language-java} -In the Beam SDK for Java, the type `Coder` provides the methods required for -encoding and decoding data. The SDK for Java provides a number of Coder -subclasses that work with a variety of standard Java types, such as Integer, -Long, Double, StringUtf8 and more. You can find all of the available Coder -subclasses in the [Coder package](https://github.com/apache/beam/tree/master/sdks/java/core/src/main/java/org/apache/beam/sdk/coders). - -{:.language-py} -In the Beam SDK for Python, the type `Coder` provides the methods required for -encoding and decoding data. The SDK for Python provides a number of Coder -subclasses that work with a variety of standard Python types, such as primitive -types, Tuple, Iterable, StringUtf8 and more. You can find all of the available -Coder subclasses in the -[apache_beam.coders](https://github.com/apache/beam/tree/master/sdks/python/apache_beam/coders) -package. - -> Note that coders do not necessarily have a 1:1 relationship with types. For -> example, the Integer type can have multiple valid coders, and input and output -> data can use different Integer coders. A transform might have Integer-typed -> input data that uses BigEndianIntegerCoder, and Integer-typed output data that -> uses VarIntCoder. - -### 7.1. Specifying coders {#specifying-coders} - -The Beam SDKs require a coder for every `PCollection` in your pipeline. In most -cases, the Beam SDK is able to automatically infer a `Coder` for a `PCollection` -based on its element type or the transform that produces it, however, in some -cases the pipeline author will need to specify a `Coder` explicitly, or develop -a `Coder` for their custom type. - -{:.language-java} -You can explicitly set the coder for an existing `PCollection` by using the -method `PCollection.setCoder`. Note that you cannot call `setCoder` on a -`PCollection` that has been finalized (e.g. by calling `.apply` on it). 
- -{:.language-java} -You can get the coder for an existing `PCollection` by using the method -`getCoder`. This method will fail with an `IllegalStateException` if a coder has -not been set and cannot be inferred for the given `PCollection`. - -Beam SDKs use a variety of mechanisms when attempting to automatically infer the -`Coder` for a `PCollection`. - -{:.language-java} -Each pipeline object has a `CoderRegistry`. The `CoderRegistry` represents a -mapping of Java types to the default coders that the pipeline should use for -`PCollection`s of each type. - -{:.language-py} -The Beam SDK for Python has a `CoderRegistry` that represents a mapping of -Python types to the default coder that should be used for `PCollection`s of each -type. - -{:.language-java} -By default, the Beam SDK for Java automatically infers the `Coder` for the -elements of a `PCollection` produced by a `PTransform` using the type parameter -from the transform's function object, such as `DoFn`. In the case of `ParDo`, -for example, a `DoFn` function object accepts an input element -of type `Integer` and produces an output element of type `String`. In such a -case, the SDK for Java will automatically infer the default `Coder` for the -output `PCollection` (in the default pipeline `CoderRegistry`, this is -`StringUtf8Coder`). - -{:.language-py} -By default, the Beam SDK for Python automatically infers the `Coder` for the -elements of an output `PCollection` using the typehints from the transform's -function object, such as `DoFn`. In the case of `ParDo`, for example a `DoFn` -with the typehints `@beam.typehints.with_input_types(int)` and -`@beam.typehints.with_output_types(str)` accepts an input element of type int -and produces an output element of type str. In such a case, the Beam SDK for -Python will automatically infer the default `Coder` for the output `PCollection` -(in the default pipeline `CoderRegistry`, this is `BytesCoder`). +{{< paragraph class="language-py" >}} +By default, the Beam SDK for Python automatically infers the `Coder` for the +elements of an output `PCollection` using the typehints from the transform's +function object, such as `DoFn`. In the case of `ParDo`, for example a `DoFn` +with the typehints `@beam.typehints.with_input_types(int)` and +`@beam.typehints.with_output_types(str)` accepts an input element of type int +and produces an output element of type str. In such a case, the Beam SDK for +Python will automatically infer the default `Coder` for the output `PCollection` +(in the default pipeline `CoderRegistry`, this is `BytesCoder`). +{{< /paragraph >}} > NOTE: If you create your `PCollection` from in-memory data by using the > `Create` transform, you cannot rely on coder inference and default coders. @@ -3169,11 +2176,12 @@ Python will automatically infer the default `Coder` for the output `PCollection` > may not be able to infer a coder if the argument list contains a value whose > exact run-time class doesn't have a default coder registered. -{:.language-java} +{{< paragraph class="language-java" >}} When using `Create`, the simplest way to ensure that you have the correct coder is by invoking `withCoder` when you apply the `Create` transform. +{{< /paragraph >}} -### 7.2. Default coders and the CoderRegistry {#default-coders-and-the-coderregistry} +### 6.2. Default coders and the CoderRegistry {#default-coders-and-the-coderregistry} Each Pipeline object has a `CoderRegistry` object, which maps language types to the default coder the pipeline should use for those types. 
You can use the @@ -3186,7 +2194,7 @@ types for any pipeline you create using the Beam SDK for JavaPython. The following table shows the standard mapping: -{:.language-java} +{{< paragraph class="language-java" >}} @@ -3249,8 +2257,9 @@ The following table shows the standard mapping:
+{{< /paragraph >}} -{:.language-py} +{{< paragraph class="language-py" >}} @@ -3281,23 +2290,26 @@ The following table shows the standard mapping:
+{{< /paragraph >}} -#### 7.2.1. Looking up a default coder {#default-coder-lookup} +#### 6.2.1. Looking up a default coder {#default-coder-lookup} -{:.language-java} +{{< paragraph class="language-java" >}} You can use the method `CoderRegistry.getCoder` to determine the default Coder for a Java type. You can access the `CoderRegistry` for a given pipeline by using the method `Pipeline.getCoderRegistry`. This allows you to determine (or set) the default Coder for a Java type on a per-pipeline basis: i.e. "for this pipeline, verify that Integer values are encoded using `BigEndianIntegerCoder`." +{{< /paragraph >}} -{:.language-py} +{{< paragraph class="language-py" >}} You can use the method `CoderRegistry.get_coder` to determine the default Coder for a Python type. You can use `coders.registry` to access the `CoderRegistry`. This allows you to determine (or set) the default Coder for a Python type. +{{< /paragraph >}} -#### 7.2.2. Setting the default coder for a type {#setting-default-coder} +#### 6.2.2. Setting the default coder for a type {#setting-default-coder} To set the default Coder for a JavaPython @@ -3315,39 +2327,41 @@ The following example code demonstrates how to set a default Coder, in this case Integerint values for a pipeline. -```java +{{< highlight java >}} PipelineOptions options = PipelineOptionsFactory.create(); Pipeline p = Pipeline.create(options); CoderRegistry cr = p.getCoderRegistry(); cr.registerCoder(Integer.class, BigEndianIntegerCoder.class); -``` +{{< /highlight >}} -```py +{{< highlight py >}} apache_beam.coders.registry.register_coder(int, BigEndianIntegerCoder) -``` +{{< /highlight >}} -#### 7.2.3. Annotating a custom data type with a default coder {#annotating-custom-type-default-coder} +#### 6.2.3. Annotating a custom data type with a default coder {#annotating-custom-type-default-coder} -{:.language-java} +{{< paragraph class="language-java" >}} If your pipeline program defines a custom data type, you can use the `@DefaultCoder` annotation to specify the coder to use with that type. For example, let's say you have a custom data type for which you want to use `SerializableCoder`. You can use the `@DefaultCoder` annotation as follows: +{{< /paragraph >}} -```java +{{< highlight java >}} @DefaultCoder(AvroCoder.class) public class MyCustomDataType { ... } -``` +{{< /highlight >}} -{:.language-java} +{{< paragraph class="language-java" >}} If you've created a custom coder to match your data type, and you want to use the `@DefaultCoder` annotation, your coder class must implement a static `Coder.of(Class)` factory method. +{{< /paragraph >}} -```java +{{< highlight java >}} public class MyCustomCoder implements Coder { public static Coder of(Class clazz) {...} ... @@ -3357,14 +2371,15 @@ public class MyCustomCoder implements Coder { public class MyCustomDataType { ... } -``` +{{< /highlight >}} -{:.language-py} +{{< paragraph class="language-py" >}} The Beam SDK for Python does not support annotating data types with a default coder. If you would like to set a default coder, use the method described in the previous section, *Setting the default coder for a type*. +{{< /paragraph >}} -## 8. Windowing {#windowing} +## 7. Windowing {#windowing} Windowing subdivides a `PCollection` according to the timestamps of its individual elements. Transforms that aggregate multiple elements, such as @@ -3378,7 +2393,7 @@ windowing strategy for your `PCollection`. Triggers allow you to deal with late-arriving data or to provide early results. 
See the [triggers](#triggers) section for more information. -### 8.1. Windowing basics {#windowing-basics} +### 7.1. Windowing basics {#windowing-basics} Some Beam transforms, such as `GroupByKey` and `Combine`, group multiple elements by a common key. Ordinarily, that grouping operation groups all of the @@ -3411,7 +2426,7 @@ your unbounded `PCollection` and subsequently use a grouping transform such as `GroupByKey` or `Combine`, your pipeline will generate an error upon construction and your job will fail. -#### 8.1.1. Windowing constraints {#windowing-constraints} +#### 7.1.1. Windowing constraints {#windowing-constraints} After you set the windowing function for a `PCollection`, the elements' windows are used the next time you apply a grouping transform to that `PCollection`. @@ -3421,7 +2436,7 @@ windows are not considered until `GroupByKey` or `Combine` aggregates across a window and key. This can have different effects on your pipeline. Consider the example pipeline in the figure below: -![Diagram of pipeline applying windowing]({{ "/images/windowing-pipeline-unbounded.svg" | prepend: site.baseurl }} "Pipeline applying windowing") +![Diagram of pipeline applying windowing](/images/windowing-pipeline-unbounded.svg) **Figure 3:** Pipeline applying windowing @@ -3434,7 +2449,7 @@ windows are not actually used until they're needed for the `GroupByKey`. Subsequent transforms, however, are applied to the result of the `GroupByKey` -- data is grouped by both key and window. -#### 8.1.2. Windowing with bounded PCollections {#windowing-bounded-collections} +#### 7.1.2. Windowing with bounded PCollections {#windowing-bounded-collections} You can use windowing with fixed-size data sets in **bounded** `PCollection`s. However, note that windowing considers only the implicit timestamps attached to @@ -3445,13 +2460,13 @@ all the elements are by default part of a single, global window. To use windowing with fixed data sets, you can assign your own timestamps to each element. To assign timestamps to elements, use a `ParDo` transform with a `DoFn` that outputs each element with a new timestamp (for example, the -[WithTimestamps](https://beam.apache.org/releases/javadoc/{{ site.release_latest }}/index.html?org/apache/beam/sdk/transforms/WithTimestamps.html) +[WithTimestamps](https://beam.apache.org/releases/javadoc/{{< param release_latest >}}/index.html?org/apache/beam/sdk/transforms/WithTimestamps.html) transform in the Beam SDK for Java). To illustrate how windowing with a bounded `PCollection` can affect how your pipeline processes data, consider the following pipeline: -![Diagram of GroupByKey and ParDo without windowing, on a bounded collection]({{ "/images/unwindowed-pipeline-bounded.svg" | prepend: site.baseurl }} "GroupByKey and ParDo without windowing, on a bounded collection") +![Diagram of GroupByKey and ParDo without windowing, on a bounded collection](/images/unwindowed-pipeline-bounded.svg) **Figure 4:** `GroupByKey` and `ParDo` without windowing, on a bounded collection. @@ -3466,7 +2481,7 @@ all elements in your `PCollection` are assigned to a single global window. 
Now, consider the same pipeline, but using a windowing function: -![Diagram of GroupByKey and ParDo with windowing, on a bounded collection]({{ "/images/windowing-pipeline-bounded.svg" | prepend: site.baseurl }} "GroupByKey and ParDo with windowing, on a bounded collection") +![Diagram of GroupByKey and ParDo with windowing, on a bounded collection](/images/windowing-pipeline-bounded.svg) **Figure 5:** `GroupByKey` and `ParDo` with windowing, on a bounded collection. @@ -3477,7 +2492,7 @@ for that `PCollection`. The `GroupByKey` transform groups the elements of the subsequent `ParDo` transform gets applied multiple times per key, once for each window. -### 8.2. Provided windowing functions {#provided-windowing-functions} +### 7.2. Provided windowing functions {#provided-windowing-functions} You can define different kinds of windows to divide the elements of your `PCollection`. Beam provides several windowing functions, including: @@ -3496,7 +2511,7 @@ overlapping windows wherein a single element can be assigned to multiple windows. -#### 8.2.1. Fixed time windows {#fixed-time-windows} +#### 7.2.1. Fixed time windows {#fixed-time-windows} The simplest form of windowing is using **fixed time windows**: given a timestamped `PCollection` which might be continuously updating, each window @@ -3510,11 +2525,11 @@ of the elements in your unbounded `PCollection` with timestamp values from with timestamp values from 0:00:30 up to (but not including) 0:01:00 belong to the second window, and so on. -![Diagram of fixed time windows, 30s in duration]({{ "/images/fixed-time-windows.png" | prepend: site.baseurl }} "Fixed time windows, 30s in duration") +![Diagram of fixed time windows, 30s in duration](/images/fixed-time-windows.png) **Figure 6:** Fixed time windows, 30s in duration. -#### 8.2.2. Sliding time windows {#sliding-time-windows} +#### 7.2.2. Sliding time windows {#sliding-time-windows} A **sliding time window** also represents time intervals in the data stream; however, sliding time windows can overlap. For example, each window might @@ -3529,12 +2544,12 @@ averages of data; using sliding time windows, you can compute a running average of the past 60 seconds' worth of data, updated every 30 seconds, in our example. -![Diagram of sliding time windows, with 1 minute window duration and 30s window period]({{ "/images/sliding-time-windows.png" | prepend: site.baseurl }} "Sliding time windows, with 1 minute window duration and 30s window period") +![Diagram of sliding time windows, with 1 minute window duration and 30s window period](/images/sliding-time-windows.png) **Figure 7:** Sliding time windows, with 1 minute window duration and 30s window period. -#### 8.2.3. Session windows {#session-windows} +#### 7.2.3. Session windows {#session-windows} A **session window** function defines windows that contain elements that are within a certain gap duration of another element. Session windowing applies on a @@ -3544,12 +2559,12 @@ have long periods of idle time interspersed with high concentrations of clicks. If data arrives after the minimum specified gap duration time, this initiates the start of a new window. -![Diagram of session windows with a minimum gap duration]({{ "/images/session-windows.png" | prepend: site.baseurl }} "Session windows, with a minimum gap duration") +![Diagram of session windows with a minimum gap duration](/images/session-windows.png) **Figure 8:** Session windows, with a minimum gap duration. Note how each data key has different windows, according to its data distribution. 
-#### 8.2.4. The single global window {#single-global-window} +#### 7.2.4. The single global window {#single-global-window} By default, all data in a `PCollection` is assigned to the single global window, and late data is discarded. If your data set is of a fixed size, you can use the @@ -3563,7 +2578,7 @@ processing, which is not possible with continuously updating data. To perform aggregations on an unbounded `PCollection` that uses global windowing, you should specify a non-default trigger for that `PCollection`. -### 8.3. Setting your PCollection's windowing function {#setting-your-pcollections-windowing-function} +### 7.3. Setting your PCollection's windowing function {#setting-your-pcollections-windowing-function} You can set the windowing function for a `PCollection` by applying the `Window` transform. When you apply the `Window` transform, you must provide a `WindowFn`. @@ -3576,73 +2591,85 @@ and emitted, and helps refine how the windowing function performs with respect to late data and computing early results. See the [triggers](#triggers) section for more information. -#### 8.3.1. Fixed-time windows {#using-fixed-time-windows} +#### 7.3.1. Fixed-time windows {#using-fixed-time-windows} The following example code shows how to apply `Window` to divide a `PCollection` into fixed windows, each 60 seconds in length: -```java +{{< highlight java >}} PCollection items = ...; PCollection fixedWindowedItems = items.apply( Window.into(FixedWindows.of(Duration.standardSeconds(60)))); -``` -```py +{{< /highlight >}} + +{{< highlight py >}} + +{{< /highlight >}} -#### 8.3.2. Sliding time windows {#using-sliding-time-windows} +#### 7.3.2. Sliding time windows {#using-sliding-time-windows} The following example code shows how to apply `Window` to divide a `PCollection` into sliding time windows. Each window is 30 seconds in length, and a new window begins every five seconds: -```java +{{< highlight java >}} PCollection items = ...; PCollection slidingWindowedItems = items.apply( Window.into(SlidingWindows.of(Duration.standardSeconds(30)).every(Duration.standardSeconds(5)))); -``` -```py +{{< /highlight >}} + +{{< highlight py >}} + +{{< /highlight >}} -#### 8.3.3. Session windows {#using-session-windows} +#### 7.3.3. Session windows {#using-session-windows} The following example code shows how to apply `Window` to divide a `PCollection` into session windows, where each session must be separated by a time gap of at least 10 minutes (600 seconds): -```java +{{< highlight java >}} PCollection items = ...; PCollection sessionWindowedItems = items.apply( Window.into(Sessions.withGapDuration(Duration.standardSeconds(600)))); -``` -```py +{{< /highlight >}} + +{{< highlight py >}} + +{{< /highlight >}} Note that the sessions are per-key — each key in the collection will have its own session groupings depending on the data distribution. -#### 8.3.4. Single global window {#using-single-global-window} +#### 7.3.4. Single global window {#using-single-global-window} If your `PCollection` is bounded (the size is fixed), you can assign all the elements to a single global window. The following example code shows how to set a single global window for a `PCollection`: -```java +{{< highlight java >}} PCollection items = ...; PCollection batchItems = items.apply( Window.into(new GlobalWindows())); -``` -```py +{{< /highlight >}} + +{{< highlight py >}} + +{{< /highlight >}} -### 8.4. Watermarks and late data {#watermarks-and-late-data} +### 7.4. 
Watermarks and late data {#watermarks-and-late-data} In any data processing system, there is a certain amount of lag between the time a data event occurs (the "event time", determined by the timestamp on the data @@ -3683,7 +2710,7 @@ a `PCollection`. You can use triggers to decide when each individual window aggregates and reports its results, including how the window emits late elements. -#### 8.4.1. Managing late data {#managing-late-data} +#### 7.4.1. Managing late data {#managing-late-data} You can allow late data by invoking the `.withAllowedLateness` operation when @@ -3691,14 +2718,15 @@ you set your `PCollection`'s windowing strategy. The following code example demonstrates a windowing strategy that will allow late data up to two days after the end of a window. -```java +{{< highlight java >}} PCollection items = ...; PCollection fixedWindowedItems = items.apply( Window.into(FixedWindows.of(Duration.standardMinutes(1))) .withAllowedLateness(Duration.standardDays(2))); -``` +{{< /highlight >}} + -```py +{{< highlight py >}} pc = [Initial PCollection] pc | beam.WindowInto( FixedWindows(60), @@ -3706,14 +2734,15 @@ the end of a window. accumulation_mode=accumulation_mode, timestamp_combiner=timestamp_combiner, allowed_lateness=Duration(seconds=2*24*60*60)) # 2 days -``` +{{< /highlight >}} + When you set `.withAllowedLateness` on a `PCollection`, that allowed lateness propagates forward to any subsequent `PCollection` derived from the first `PCollection` you applied allowed lateness to. If you want to change the allowed lateness later in your pipeline, you must do so explictly by applying `Window.configure().withAllowedLateness()`. -### 8.5. Adding timestamps to a PCollection's elements {#adding-timestamps-to-a-pcollections-elements} +### 7.5. Adding timestamps to a PCollection's elements {#adding-timestamps-to-a-pcollections-elements} An unbounded source provides a timestamp for each element. Depending on your unbounded source, you may need to configure how the timestamp is extracted from @@ -3733,7 +2762,7 @@ records in from a file, the file source doesn't assign timestamps automatically. You can parse the timestamp field from each record and use a `ParDo` transform with a `DoFn` to attach the timestamps to each element in your `PCollection`. -```java +{{< highlight java >}} PCollection unstampedLogs = ...; PCollection stampedLogs = unstampedLogs.apply(ParDo.of(new DoFn() { @@ -3745,13 +2774,16 @@ with a `DoFn` to attach the timestamps to each element in your `PCollection`. out.outputWithTimestamp(element, logTimeStamp); } })); -``` -```py +{{< /highlight >}} + +{{< highlight py >}} + +{{< /highlight >}} -## 9. Triggers {#triggers} +## 8. Triggers {#triggers} When collecting and grouping data into windows, Beam uses **triggers** to determine when to emit the aggregated results of each window (referred to as a @@ -3807,7 +2839,7 @@ you want your pipeline to provide periodic updates on an unbounded data set — for example, a running average of all data provided to the present time, updated every N seconds or every N elements. -### 9.1. Event time triggers {#event-time-triggers} +### 8.1. Event time triggers {#event-time-triggers} The `AfterWatermark` trigger operates on *event time*. The `AfterWatermark` trigger emits the contents of a window after the @@ -3824,7 +2856,7 @@ before or after the end of the window. The following example shows a billing scenario, and uses both early and late firings: -```java +{{< highlight java >}} // Create a bill at the end of the month. 
AfterWatermark.pastEndOfWindow() // During the month, get near real-time estimates. @@ -3834,12 +2866,15 @@ firings: .plusDuration(Duration.standardMinutes(1)) // Fire on any late data so the bill can be corrected. .withLateFirings(AfterPane.elementCountAtLeast(1)) -``` -```py +{{< /highlight >}} + +{{< highlight py >}} + +{{< /highlight >}} -#### 9.1.1. Default trigger {#default-trigger} +#### 8.1.1. Default trigger {#default-trigger} The default trigger for a `PCollection` is based on event time, and emits the results of the window when the Beam's watermark passes the end of the window, @@ -3851,7 +2886,7 @@ discarded. This is because the default windowing configuration has an allowed lateness value of 0. See the Handling Late Data section for information about modifying this behavior. -### 9.2. Processing time triggers {#processing-time-triggers} +### 8.2. Processing time triggers {#processing-time-triggers} The `AfterProcessingTime` trigger operates on *processing time*. For example, the `AfterProcessingTime.pastFirstElementInPane()` @@ -3864,7 +2899,7 @@ The `AfterProcessingTime` trigger is useful for triggering early results from a window, particularly a window with a large time frame such as a single global window. -### 9.3. Data-driven triggers {#data-driven-triggers} +### 8.3. Data-driven triggers {#data-driven-triggers} Beam provides one data-driven trigger, `AfterPane.elementCountAtLeast()` @@ -3881,39 +2916,44 @@ consider using [composite triggers](#composite-triggers) to combine multiple conditions. This allows you to specify multiple firing conditions such as “fire either when I receive 50 elements, or every 1 second”. -### 9.4. Setting a trigger {#setting-a-trigger} +### 8.4. Setting a trigger {#setting-a-trigger} When you set a windowing function for a `PCollection` by using the `Window``WindowInto` transform, you can also specify a trigger. -{:.language-java} +{{< paragraph class="language-java" >}} You set the trigger(s) for a `PCollection` by invoking the method `.triggering()` on the result of your `Window.into()` transform. This code sample sets a time-based trigger for a `PCollection`, which emits results one minute after the first element in that window has been processed. The last line in the code sample, `.discardingFiredPanes()`, sets the window's **accumulation mode**. +{{< /paragraph >}} -{:.language-py} +{{< paragraph class="language-py" >}} You set the trigger(s) for a `PCollection` by setting the `trigger` parameter when you use the `WindowInto` transform. This code sample sets a time-based trigger for a `PCollection`, which emits results one minute after the first element in that window has been processed. The `accumulation_mode` parameter sets the window's **accumulation mode**. +{{< /paragraph >}} -```java +{{< highlight java >}} PCollection pc = ...; pc.apply(Window.into(FixedWindows.of(1, TimeUnit.MINUTES)) .triggering(AfterProcessingTime.pastFirstElementInPane() .plusDelayOf(Duration.standardMinutes(1))) .discardingFiredPanes()); -``` -```py +{{< /highlight >}} + +{{< highlight py >}} + +{{< /highlight >}} -#### 9.4.1. Window accumulation modes {#window-accumulation-modes} +#### 8.4.1. Window accumulation modes {#window-accumulation-modes} When you specify a trigger, you must also set the the window's **accumulation mode**. When a trigger fires, it emits the current contents of the window as a @@ -3921,16 +2961,18 @@ pane. 
Since a trigger can fire multiple times, the accumulation mode determines whether the system *accumulates* the window panes as the trigger fires, or *discards* them. -{:.language-java} +{{< paragraph class="language-java" >}} To set a window to accumulate the panes that are produced when the trigger fires, invoke`.accumulatingFiredPanes()` when you set the trigger. To set a window to discard fired panes, invoke `.discardingFiredPanes()`. +{{< /paragraph >}} -{:.language-py} +{{< paragraph class="language-py" >}} To set a window to accumulate the panes that are produced when the trigger fires, set the `accumulation_mode` parameter to `ACCUMULATING` when you set the trigger. To set a window to discard fired panes, set `accumulation_mode` to `DISCARDING`. +{{< /paragraph >}} Let's look an example that uses a `PCollection` with fixed-time windowing and a data-based trigger. This is something you might do if, for example, each window @@ -3946,9 +2988,9 @@ The following diagram shows data events for key X as they arrive in the PCollection and are assigned to windows. To keep the diagram a bit simpler, we'll assume that the events all arrive in the pipeline in order. -![Diagram of data events for acculumating mode example]({{ "/images/trigger-accumulation.png" | prepend: site.baseurl }} "Data events for accumulating mode example") +![Diagram of data events for acculumating mode example](/images/trigger-accumulation.png) -##### 9.4.1.1. Accumulating mode {#accumulating-mode} +##### 8.4.1.1. Accumulating mode {#accumulating-mode} If our trigger is set to accumulating mode, the trigger emits the following values each time it fires. Keep in mind that the trigger fires every time three @@ -3961,7 +3003,7 @@ elements arrive: ``` -##### 9.4.1.2. Discarding mode {#discarding-mode} +##### 8.4.1.2. Discarding mode {#discarding-mode} If our trigger is set to discarding mode, the trigger emits the following values on each firing: @@ -3972,7 +3014,7 @@ on each firing: Third trigger firing: [9, 13, 10] ``` -#### 9.4.2. Handling late data {#handling-late-data} +#### 8.4.2. Handling late data {#handling-late-data} If you want your pipeline to process data that arrives after the watermark @@ -3984,14 +3026,15 @@ results immediately whenever late data arrives. You set the allowed lateness by using `.withAllowedLateness()` when you set your windowing function: -```java +{{< highlight java >}} PCollection pc = ...; pc.apply(Window.into(FixedWindows.of(1, TimeUnit.MINUTES)) .triggering(AfterProcessingTime.pastFirstElementInPane() .plusDelayOf(Duration.standardMinutes(1))) .withAllowedLateness(Duration.standardMinutes(30)); -``` -```py +{{< /highlight >}} + +{{< highlight py >}} pc = [Initial PCollection] pc | beam.WindowInto( FixedWindows(60), @@ -3999,7 +3042,7 @@ windowing function: allowed_lateness=1800) # 30 minutes | ... -``` +{{< /highlight >}} This allowed lateness propagates to all `PCollection`s derived as a result of applying transforms to the original `PCollection`. If you want to change the @@ -4007,13 +3050,13 @@ allowed lateness later in your pipeline, you can apply `Window.configure().withAllowedLateness()` again, explicitly. -### 9.5. Composite triggers {#composite-triggers} +### 8.5. Composite triggers {#composite-triggers} You can combine multiple triggers to form **composite triggers**, and can specify a trigger to emit results repeatedly, at most once, or under other custom conditions. -#### 9.5.1. Composite trigger types {#composite-trigger-types} +#### 8.5.1. 
Composite trigger types {#composite-trigger-types} Beam includes the following composite triggers: @@ -4037,7 +3080,7 @@ Beam includes the following composite triggers: * `orFinally` can serve as a final condition to cause any trigger to fire one final time and never fire again. -#### 9.5.2. Composition with AfterWatermark {#composite-afterwatermark} +#### 8.5.2. Composition with AfterWatermark {#composite-afterwatermark} Some of the most useful composite triggers fire a single time when Beam estimates that all the data has arrived (i.e. when the watermark passes the end @@ -4057,11 +3100,12 @@ example trigger code fires on the following conditions: * Any time late data arrives, after a ten-minute delay -{:.language-java} +{{< paragraph class="language-java" >}} * After two days, we assume no more data of interest will arrive, and the trigger stops executing +{{< /paragraph >}} -```java +{{< highlight java >}} .apply(Window .configure() .triggering(AfterWatermark @@ -4070,27 +3114,33 @@ example trigger code fires on the following conditions: .pastFirstElementInPane() .plusDelayOf(Duration.standardMinutes(10)))) .withAllowedLateness(Duration.standardDays(2))); -``` -```py +{{< /highlight >}} + +{{< highlight py >}} + +{{< /highlight >}} -#### 9.5.3. Other composite triggers {#other-composite-triggers} +#### 8.5.3. Other composite triggers {#other-composite-triggers} You can also build other sorts of composite triggers. The following example code shows a simple composite trigger that fires whenever the pane has at least 100 elements, or after a minute. -```java +{{< highlight java >}} Repeatedly.forever(AfterFirst.of( AfterPane.elementCountAtLeast(100), AfterProcessingTime.pastFirstElementInPane().plusDelayOf(Duration.standardMinutes(1)))) -``` -```py +{{< /highlight >}} + +{{< highlight py >}} + +{{< /highlight >}} -## 10. Metrics {#metrics} +## 9. Metrics {#metrics} In the Beam model, metrics provide some insight into the current state of a user pipeline, potentially while the pipeline is running. There could be different reasons for that, for instance: * Check the number of errors encountered while running a specific step in the pipeline; @@ -4098,7 +3148,7 @@ potentially while the pipeline is running. There could be different reasons for * Retrieve an accurate count of the number of elements that have been processed; * ...and so on. -### 10.1 The main concepts of Beam metrics +### 9.1 The main concepts of Beam metrics * **Named**. Each metric has a name which consists of a namespace and an actual name. The namespace can be used to differentiate between multiple metrics with the same name and also allows querying for all metrics within a specific namespace. @@ -4118,13 +3168,13 @@ transform reported, as well as aggregating the metric across the entire pipeline > **Note:** It is runner-dependent whether metrics are accessible during pipeline execution or only after jobs have completed. -### 10.2 Types of metrics {#types-of-metrics} +### 9.2 Types of metrics {#types-of-metrics} There are three types of metrics that are supported for the moment: `Counter`, `Distribution` and `Gauge`. **Counter**: A metric that reports a single long value and can be incremented or decremented. -```java +{{< highlight java >}} Counter counter = Metrics.counter( "namespace", "counter1"); @ProcessElement @@ -4133,11 +3183,11 @@ public void processElement(ProcessContext context) { counter.inc(); ... 
} -``` +{{< /highlight >}} **Distribution**: A metric that reports information about the distribution of reported values. -```java +{{< highlight java >}} Distribution distribution = Metrics.distribution( "namespace", "distribution1"); @ProcessElement @@ -4147,12 +3197,12 @@ public void processElement(ProcessContext context) { distribution.update(element); ... } -``` +{{< /highlight >}} **Gauge**: A metric that reports the latest value out of reported values. Since metrics are collected from many workers the value may not be the absolute last, but one of the latest values. -```java +{{< highlight java >}} Gauge gauge = Metrics.gauge( "namespace", "gauge1"); @ProcessElement @@ -4162,14 +3212,14 @@ public void processElement(ProcessContext context) { gauge.set(element); ... } -``` +{{< /highlight >}} -### 10.3 Querying metrics {#querying-metrics} +### 9.3 Querying metrics {#querying-metrics} `PipelineResult` has a method `metrics()` which returns a `MetricResults` object that allows accessing metrics. The main method available in `MetricResults` allows querying for all metrics matching a given filter. -```java +{{< highlight java >}} public interface PipelineResult { MetricResults metrics(); } @@ -4190,12 +3240,12 @@ public interface MetricResult { T getCommitted(); T getAttempted(); } -``` +{{< /highlight >}} -### 10.4 Using metrics in pipeline {#using-metrics} +### 9.4 Using metrics in pipeline {#using-metrics} Below, there is a simple example of how to use a `Counter` metric in a user pipeline. -```java +{{< highlight java >}} // creating a pipeline with custom metrics DoFn pipeline .apply(...) @@ -4228,552 +3278,4 @@ public class MyMetricsDoFn extends DoFn { context.output(context.element()); } } -``` -### 9.5 Export metrics {#export-metrics} -Beam metrics can be exported to external sinks. If a metrics sink is set up in the configuration, the runner will push metrics to it at a default 5s period. -The configuration is held in the [MetricsOptions](https://beam.apache.org/releases/javadoc/2.19.0/org/apache/beam/sdk/metrics/MetricsOptions.html) class. -It contains push period configuration and also sink specific options such as type and URL. As for now only the REST HTTP and the Graphite sinks are supported and only -Flink and Spark runners support metrics export. - -Also Beam metrics are exported to inner Spark and Flink dashboards to be consulted in their respective UI. - - - -## 10. State and Timers {#state-and-timers} -Beam's windowing and triggering facilities provide a powerful abstraction for grouping and aggregating unbounded input -data based on timestamps. However there are aggregation use cases for which developers may require a higher degree of -control than provided by windows and triggers. Beam provides an API for manually managing per-key state, allowing for -fine-grained control over aggregations. - -Beam's state API models state per key. To use the state API, you start out with a keyed `PCollection`, which in Java -is modeled as a `PCollection>`. A `ParDo` processing this `PCollection` can now declare state variables. Inside -the `ParDo` these state variables can be used to write or update state for the current key or to read previous state -written for that key. State is always fully scoped only to the current processing key. - -Windowing can still be used together with stateful processing. All state for a key is scoped to the current window. 
This -means that the first time a key is seen for a given window any state reads will return empty, and that a runner can -garbage collect state when a window is completed. It's also often useful to use Beam's windowed aggegations prior to -the stateful operator. For example, using a combiner to preaggregate data, and then storing aggregated data inside of -state. Merging windows are not currently supported when using state and timers. - -Sometimes stateful processing is used to implement state-machine style processing inside a `DoFn`. When doing this, -care must be taken to remember that the elements in input PCollection have no guaranteed order and to ensure that the -program logic is resilient to this. Unit tests written using the DirectRunner will shuffle the order of element -processing, and are recommended to test for correctness. - -In Java DoFn declares states to be accessed by creating final `StateSpec` member variables representing each state. Each -state must be named using the `StateId` annotation; this name is unique to a ParDo in the graph and has no relation -to other nodes in the graph. A `DoFn` can declare multiple state variables. - -### 10.1 Types of state {#types-of-state} -Beam provides several types of state: - -#### ValueState -A ValueState is a scalar state value. For each key in the input, a ValueState will store a typed value that can be -read and modified inside the DoFn's `@ProcessElement` or `@OnTimer` methods. If the type of the ValueState has a coder -registered, then Beam will automatically infer the coder for the state value. Otherwise, a coder can be explicitly -specified when creating the ValueState. For example, the following ParDo creates a single state variable that -accumulates the number of elements seen. - -```java -PCollection> perUser = readPerUser(); -perUser.apply(ParDo.of(new DoFn, OutputT>() { - @StateId("state") private final StateSpec> numElements = StateSpecs.value(); - - @ProcessElement public void process(@StateId("state") ValueState state) { - // Read the number element seen so far for this user key. - // state.read() returns null if it was never set. The below code allows us to have a default value of 0. - int currentValue = MoreObjects.firstNonNull(state.read(), 0); - // Update the state. - state.write(currentValue + 1); - } -})); -``` - -Beam also allows explicitly specifying a coder for `ValueState` values. For example: - -```java -PCollection> perUser = readPerUser(); -perUser.apply(ParDo.of(new DoFn, OutputT>() { - @StateId("state") private final StateSpec> numElements = StateSpecs.value(new MyTypeCoder()); - ... -})); -``` - -#### CombiningState -`CombiningState` allows you to create a state object that is updated using a Beam combiner. For example, the previous -`ValueState` example could be rewritten to use `CombiningState` -```java -PCollection> perUser = readPerUser(); -perUser.apply(ParDo.of(new DoFn, OutputT>() { - @StateId("state") private final StateSpec> numElements = - StateSpecs.combining(Sum.ofIntegers()); - - @ProcessElement public void process(@StateId("state") ValueState state) { - state.add(1); - } -})); -``` - -#### BagState -A common use case for state is to accumulate multiple elements. `BagState` allows for accumulating an unordered set -ofelements. This allows for addition of elements to the collection without requiring the reading of the entire -collection first, which is an efficiency gain. In addition, runners that support paged reads can allow individual -bags larger than available memory. 
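The Java example below shows this bag-state pattern. For readers following along in the Python SDK, a rough equivalent sketch is shown first; the class name, the integer-valued `(key, value)` element shape, and the "should fetch" condition are assumptions for illustration, not part of the original guide.

```py
import apache_beam as beam
from apache_beam.coders import VarIntCoder
from apache_beam.transforms.userstate import BagStateSpec

class BufferValuesDoFn(beam.DoFn):
  # A bag of int values, accumulated per key (and per window).
  BUFFER = BagStateSpec('buffer', VarIntCoder())

  def process(self, element, buffer=beam.DoFn.StateParam(BUFFER)):
    key, value = element
    # Adding to the bag does not require reading its current contents first.
    buffer.add(value)
    if value < 0:  # Stand-in for an application-specific "should fetch" check.
      yield key, list(buffer.read())
      buffer.clear()
```

As in Java, the state is scoped to the key of the element currently being processed, so the `DoFn` must be applied to a keyed `PCollection`.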
- -```java -PCollection> perUser = readPerUser(); -perUser.apply(ParDo.of(new DoFn, OutputT>() { - @StateId("state") private final StateSpec> numElements = StateSpecs.bag(); - - @ProcessElement public void process( - @Element KV element, - @StateId("state") BagState state) { - // Add the current element to the bag for this key. - state.add(element.getValue()); - if (shouldFetch()) { - // Occasionally we fetch and process the values. - Iterable values = state.read(); - processValues(values); - state.clear(); // Clear the state for this key. - } - } -})); -``` -### 10.2 Deferred state reads {#deferred-state-reads} -When a `DoFn` contains multiple state specifications, reading each one in order can be slow. Calling the `read()` function -on a state can cause the runner to perform a blocking read. Performing multiple blocking reads in sequence adds latency -to element processing. If you know that a state will always be read, you can annotate it as @AlwaysFetched, and then the -runner can prefetch all of the states necessary. For example: - -```java -PCollection> perUser = readPerUser(); -perUser.apply(ParDo.of(new DoFn, OutputT>() { - @StateId("state1") private final StateSpec> state1 = StateSpecs.value(); - @StateId("state2") private final StateSpec> state2 = StateSpecs.value(); - @StateId("state3") private final StateSpec> state3 = StateSpecs.bag(); - - @ProcessElement public void process( - @AlwaysFetched @StateId("state1") ValueState state1, - @AlwaysFetched @StateId("state2") ValueState state2, - @AlwaysFetched @StateId("state3") BagState state3) { - state1.read(); - state2.read(); - state3.read(); - } -})); -``` - -If however there are code paths in which the states are not fetched, then annotating with @AlwaysFetched will add -unnecessary fetching for those paths. In this case, the readLater method allows the runner to know that the state will -be read in the future, allowing multiple state reads to be batched together. - -```java -PCollection> perUser = readPerUser(); -perUser.apply(ParDo.of(new DoFn, OutputT>() { - @StateId("state1") private final StateSpec> state1 = StateSpecs.value(); - @StateId("state2") private final StateSpec> state2 = StateSpecs.value(); - @StateId("state3") private final StateSpec> state3 = StateSpecs.bag(); - - @ProcessElement public void process( - @StateId("state1") ValueState state1, - @StateId("state2") ValueState state2, - @StateId("state3") BagState state3) { - if (/* should read state */) { - state1.readLater(); - state2.readLater(); - state3.readLater(); - } - - // The runner can now batch all three states into a single read, reducing latency. - processState1(state1.read()); - processState2(state2.read()); - processState3(state3.read()); - } -})); -``` - -### 10.3 Timers {#timers} -Beam provides a per-key timer callback API. This allows for delayed processing of data stored using the state API. -Timers can be set to callback at either an event-time or a processing-time timestamp. Every timer is identified with a -TimerId. A given timer for a key can only be set for a single timestamp. Calling set on a timer overwrites the previous -firing time for that key's timer. - -#### 10.3.1 Event-time timers {#event-time-timers} -Event-time timers fire when the input watermark for the DoFn passes the time at which the timer is set, meaning that -the runner believes that there are no more elements to be processed with timestamps before the timer timestamp. This -allows for event-time aggregations. 
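The Java example that follows sets an event-time timer from `@ProcessElement`. For comparison, a hedged Python SDK sketch of the same idea is shown here; the class name, callback name, and keyed element shape are assumptions made for illustration.

```py
import apache_beam as beam
from apache_beam.transforms.timeutil import TimeDomain
from apache_beam.transforms.userstate import TimerSpec, on_timer

class EventTimeTimerDoFn(beam.DoFn):
  # WATERMARK is the Python SDK's name for the event-time domain.
  EXPIRY_TIMER = TimerSpec('expiry', TimeDomain.WATERMARK)

  def process(self,
              element,
              timestamp=beam.DoFn.TimestampParam,
              expiry_timer=beam.DoFn.TimerParam(EXPIRY_TIMER)):
    # Ask to be called back once the watermark passes this element's timestamp.
    expiry_timer.set(timestamp)

  @on_timer(EXPIRY_TIMER)
  def expiry(self):
    # Runs when the watermark has passed the time the timer was set for.
    yield 'timer fired'
```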
- -```java -PCollection> perUser = readPerUser(); -perUser.apply(ParDo.of(new DoFn, OutputT>() { - @StateId("state") private final StateSpec> state = StateSpecs.value(); - @TimerId("timer") private final TimerSpec timer = TimerSpecs.timer(TimeDomain.EVENT_TIME); - - @ProcessElement public void process( - @Element KV element, - @Timestamp Instant elementTs, - @StateId("state") ValueState state, - @TimerId("timer") Timer timer) { - ... - // Set an event-time timer to the element timestamp. - timer.set(elementTs); - } - - @OnTimer("timer") public void onTimer() { - //Process timer. - } -})); - -``` -#### 10.3.2 Processing-time timers {#processing-time-timers} -Processing-time timers fire when the real wall-clock time passes. This is often used to create larger batches of data -before processing. It can also be used to schedule events that should occur at a specific time. Just like with -event-time timers, processing-time timers are per key - each key has a separate copy of the timer. - -While processing-time timers can be set to an absolute timestamp, it is very common to set them to an offset relative -to the current time. The `Timer.offset` and `Timer.setRelative` methods can be used to accomplish this. - -```java -PCollection> perUser = readPerUser(); -perUser.apply(ParDo.of(new DoFn, OutputT>() { - @TimerId("timer") private final TimerSpec timer = TimerSpecs.timer(TimeDomain.PROCESSING_TIME); - - @ProcessElement public void process(@TimerId("timer") Timer timer) { - ... - // Set a timer to go off 30 seconds in the future. - timer.offset(Duration.standardSeconds(30)).setRelative(); - } - - @OnTimer("timer") public void onTimer() { - //Process timer. - } -})); - -``` - -#### 10.3.3 Dynamic timer tags {#dynamic-timer-tags} -Beam also supports dynamically setting a timer tag using `TimerMap`. This allows for setting multiple different timers -in a `DoFn` and allowing for the timer tags to be dynamically chosen - e.g. based on data in the input elements. A -timer with a specific tag can only be set to a single timestamp, so setting the timer again has the effect of -overwriting the previous expiration time for the timer with that tag. Each `TimerMap` is identified with a timer family -id, and timers in different timer families are independent. - -```java -PCollection> perUser = readPerUser(); -perUser.apply(ParDo.of(new DoFn, OutputT>() { - @TimerFamily("actionTimers") private final TimerSpec timer = - TimerSpecs.timerMap(TimeDomain.EVENT_TIME); - - @ProcessElement public void process( - @Element KV element, - @Timestamp Instant elementTs, - @TimerFamily("actionTimers") TimerMap timers) { - timers.set(element.getValue().getActionType(), elementTs); - } - - @OnTimerFamily("actionTimers") public void onTimer(@TimerId String timerId) { - LOG.info("Timer fired with id " + timerId); - } -})); - -``` - -#### 10.3.4 Timer output timestamps {#timer-output-timestamps} -By default, event-time timers will hold the output watermark of the `ParDo` to the timestamp of the timer. This means -that if a timer is set to 12pm, any windowed aggregations or event-time timers later in the pipeline graph that finish -after 12pm will not expire. The timestamp of the timer is also the default output timestamp for the timer callback. This -means that any elements output from the onTimer method will have a timestamp equal to the timestamp of the timer firing. -For processing-time timers, the default output timestamp and watermark hold is the value of the input watermark at the -time the timer was set. 
- -In some cases, a DoFn needs to output timestamps earlier than the timer expiration time, and therefore also needs to -hold its output watermark to those timestamps. For example, consider the following pipeline that temporarily batches -records into state, and sets a timer to drain the state. This code may appear correct, but will not work properly. - -```java -PCollection> perUser = readPerUser(); -perUser.apply(ParDo.of(new DoFn, OutputT>() { - @StateId("elementBag") private final StateSpec> elementBag = StateSpecs.bag(); - @StateId("timerSet") private final StateSpec> timerSet = StateSpecs.value(); - @TimerId("outputState") private final TimerSpec timer = TimerSpecs.timer(TimeDomain.PROCESSING_TIME); - - @ProcessElement public void process( - @Element KV element, - @StateId("elementBag") BagState elementBag, - @StateId("timerSet") ValueState timerSet, - @TimerId("outputState") Timer timer) { - // Add the current element to the bag for this key. - elementBag.add(element.getValue()); - if (!MoreObjects.firstNonNull(timerSet.read(), false)) { - // If the timer is not current set, then set it to go off in a minute. - timer.offset(Duration.standardMinutes(1)).setRelative(); - timerSet.write(true); - } - } - - @OnTimer("outputState") public void onTimer( - @StateId("elementBag") BagState elementBag, - @StateId("timerSet") ValueState timerSet, - OutputReceiver output) { - for (ValueT bufferedElement : elementBag.read()) { - // Output each element. - output.outputWithTimestamp(bufferedElement, bufferedElement.timestamp()); - } - elementBag.clear(); - // Note that the timer has now fired. - timerSet.clear(); - } -})); -``` -The problem with this code is that the ParDo is buffering elements, however nothing is preventing the watermark -from advancing past the timestamp of those elements, so all those elements might be dropped as late data. In order -to prevent this from happening, an output timestamp needs to be set on the timer to prevent the watermark from advancing -past the timestamp of the minimum element. The following code demonstrates this. - -```java -PCollection> perUser = readPerUser(); -perUser.apply(ParDo.of(new DoFn, OutputT>() { - // The bag of elements accumulated. - @StateId("elementBag") private final StateSpec> elementBag = StateSpecs.bag(); - // The timestamp of the timer set. - @StateId("timerTimestamp") private final StateSpec> timerTimestamp = StateSpecs.value(); - // The minimum timestamp stored in the bag. - @StateId("minTimestampInBag") private final StateSpec> - minTimestampInBag = StateSpecs.combining(Min.ofLongs()); - - @TimerId("outputState") private final TimerSpec timer = TimerSpecs.timer(TimeDomain.PROCESSING_TIME); - - @ProcessElement public void process( - @Element KV element, - @StateId("elementBag") BagState elementBag, - @AlwaysFetched @StateId("timerTimestamp") ValueState timerTimestamp, - @AlwaysFetched @StateId("minTimestampInBag") CombiningState minTimestamp, - @TimerId("outputState") Timer timer) { - // Add the current element to the bag for this key. - elementBag.add(element.getValue()); - // Keep track of the minimum element timestamp currently stored in the bag. - minTimestamp.add(element.getValue().timestamp()); - - // If the timer is already set, then reset it at the same time but with an updated output timestamp (otherwise - // we would keep resetting the timer to the future). If there is no timer set, then set one to expire in a minute. - Long timerTimestampMs = timerTimestamp.read(); - Instant timerToSet = (timerTimestamp.isEmpty().read()) - ? 
Instant.now().plus(Duration.standardMinutes(1)) : new Instant(timerTimestampMs); - // Setting the outputTimestamp to the minimum timestamp in the bag holds the watermark to that timestamp until the - // timer fires. This allows outputting all the elements with their timestamp. - timer.withOutputTimestamp(minTimestamp.read()).set(timerToSet). - timerTimestamp.write(timerToSet.getMillis()); - } - - @OnTimer("outputState") public void onTimer( - @StateId("elementBag") BagState elementBag, - @StateId("timerTimestamp") ValueState timerTimestamp, - OutputReceiver output) { - for (ValueT bufferedElement : elementBag.read()) { - // Output each element. - output.outputWithTimestamp(bufferedElement, bufferedElement.timestamp()); - } - // Note that the timer has now fired. - timerTimestamp.clear(); - } -})); -``` -### 10.4 Garbage collecting state {#garbage-collecting-state} -Per-key state needs to be garbage collected, or eventually the increasing size of state may negatively impact -performance. There are two common strategies for garbage collecting state. - -##### 10.4.1 **Using windows for garbage collection** {#using-windows-for-garbage-collection} -All state and timers for a key is scoped to the window it is in. This means that depending on the timestamp of the -input element the ParDo will see different values for the state depending on the window that element falls into. In -addition, once the input watermark passes the end of the window, the runner should garbage collect all state for that -window. (note: if allowed lateness is set to a positive value for the window, the runner must wait for the watemark to -pass the end of the window plus the allowed lateness before garbage collecting state). This can be used as a -garbage-collection strategy. - -For example, given the following: - -```java -PCollection> perUser = readPerUser(); -perUser.apply(Window.into(CalendarWindows.days(1) - .withTimeZone(DateTimeZone.forID("America/Los_Angeles")))); - .apply(ParDo.of(new DoFn, OutputT>() { - @StateId("state") private final StateSpec> state = StateSpecs.value(); - ... - @ProcessElement public void process(@Timestamp Instant ts, @StateId("state") ValueState state) { - // The state is scoped to a calendar day window. That means that if the input timestamp ts is after - // midnight PST, then a new copy of the state will be seen for the next day. - } - })); -``` - -This `ParDo` stores state per day. Once the pipeline is done processing data for a given day, all the state for that -day is garbage collected. - -##### 10.4.1 **Using timers For garbage collection** {#using-timers-for-garbage-collection} -In some cases, it is difficult to find a windowing strategy that models the desired garbage-collection strategy. For -example, a common desire is to garbage collect state for a key once no activity has been seen on the key for some time. -This can be done by updating a timer that garbage collects state. For example - -```java -PCollection> perUser = readPerUser(); -perUser.apply(ParDo.of(new DoFn, OutputT>() { - // The state for the key. - @StateId("state") private final StateSpec> state = StateSpecs.value(); - - // The maximum element timestamp seen so far. 
- @StateId("maxTimestampSeen") private final StateSpec> - maxTimestamp = StateSpecs.combining(Max.ofLongs()); - - @TimerId("gcTimer") private final TimerSpec gcTimer = TimerSpecs.timer(TimeDomain.EVENT_TIME); - - @ProcessElement public void process( - @Element KV element, - @Timestamp Instant ts, - @StateId("state") ValueState state, - @StateId("maxTimestampSeen") CombiningState maxTimestamp, - @TimerId("gcTimer") gcTimer) { - updateState(state, element); - maxTimestamp.add(ts.getMillis()); - - // Set the timer to be one hour after the maximum timestamp seen. This will keep overwriting the same timer, so - // as long as there is activity on this key the state will stay active. Once the key goes inactive for one hour's - // worth of event time (as measured by the watermark), then the gc timer will fire. - Instant expirationTime = new Instant(maxTimestamp.read()).plus(Duration.standardHours(1)); - timer.set(expirationTime); - } - - @OnTimer("gcTimer") public void onTimer( - @StateId("state") ValueState state, - @StateId("maxTimestampSeen") CombiningState maxTimestamp) { - // Clear all state for the key. - state.clear(); - maxTimestamp.clear(); - } - } -```` - -### 10.5 State and timers examples {#state-timers-examples} -Following are some example uses of state and timers - -#### 10.5.1. Joining clicks and views {#joining-clicks-and-views} -In this example, the pipeline is processing data from an e-commerce site's home page. There are two input streams: -a stream of views, representing suggested product links displayed to the user on the home page, and a stream of -clicks, representing actual user clicks on these links. The goal of the pipeline is to join click events with view -events, outputting a new joined event that contains information from both events. Each link has a unique identifier -that is present in both the view event and the join event. - -Many view events will never be followed up with clicks. This pipeline will wait one hour for a click, after which it -will give up on this join. While every click event should have a view event, some small number of view events may be -lost and never make it to the Beam pipeline; the pipeline will similarly wait one hour after seeing a click event, and -give up if the view event does not arrive in that time. Input events are not ordered - it is possible to see the click -event before the view event. The one hour join timeout should be based on event time, not on processing time. - -```java -// Read the event stream and key it by the link id. -PCollection> eventsPerLinkId = - readEvents() - .apply(WithKeys.of(Event::getLinkId).withKeyType(TypeDescriptors.strings())); - -perUser.apply(ParDo.of(new DoFn, JoinedEvent>() { - // Store the view event. - @StateId("view") private final StateSpec> viewState = StateSpecs.value(); - // Store the click event. - @StateId("click") private final StateSpec> clickState = StateSpecs.value(); - - // The maximum element timestamp seen so far. - @StateId("maxTimestampSeen") private final StateSpec> - maxTimestamp = StateSpecs.combining(Max.ofLongs()); - - // Timer that fires when an hour goes by with an incomplete join. 
- @TimerId("gcTimer") private final TimerSpec gcTimer = TimerSpecs.timer(TimeDomain.EVENT_TIME); - - @ProcessElement public void process( - @Element KV element, - @Timestamp Instant ts, - @AlwaysFetched @StateId("view") ValueState viewState, - @AlwaysFetched @StateId("click") ValueState clickState, - @AlwaysFetched @StateId("maxTimestampSeen") CombiningState maxTimestampState, - @TimerId("gcTimer") gcTimer, - OutputReceiver output) { - // Store the event into the correct state variable. - Event event = element.getValue(); - ValueState valueState = event.getType().equals(VIEW) ? viewState : clickState; - valueState.write(event); - - Event view = viewState.read(); - Event click = clickState.read(); - (if view != null && click != null) { - // We've seen both a view and a click. Output a joined event and clear state. - output.output(JoinedEvent.of(view, click)); - clearState(viewState, clickState, maxTimestampState); - } else { - // We've only seen on half of the join. - // Set the timer to be one hour after the maximum timestamp seen. This will keep overwriting the same timer, so - // as long as there is activity on this key the state will stay active. Once the key goes inactive for one hour's - // worth of event time (as measured by the watermark), then the gc timer will fire. - maxTimestampState.add(ts.getMillis()); - Instant expirationTime = new Instant(maxTimestampState.read()).plus(Duration.standardHours(1)); - gcTimer.set(expirationTime); - } - } - - @OnTimer("gcTimer") public void onTimer( - @StateId("view") ValueState viewState, - @StateId("click") ValueState clickState, - @StateId("maxTimestampSeen") CombiningState maxTimestampState) { - // An hour has gone by with an incomplete join. Give up and clear the state. - clearState(viewState, clickState, maxTimestampState); - } - - private void clearState( - @StateId("view") ValueState viewState, - @StateId("click") ValueState clickState, - @StateId("maxTimestampSeen") CombiningState maxTimestampState) { - viewState.clear(); - clickState.clear(); - maxTimestampState.clear(); - } - })); -```` - -#### 10.5.2 Batching RPCs {#batching-rpcs} - -In this example, input elements are being forwarded to an external RPC service. The RPC accepts batch requests - -multiple events for the same user can be batched in a single RPC call. Since this RPC service also imposes rate limits, -we want to batch ten seconds worth of events together in order to reduce the number of calls. - -```java -PCollection> perUser = readPerUser(); -perUser.apply(ParDo.of(new DoFn, OutputT>() { - // Store the elements buffered so far. - @StateId("state") private final StateSpec> elements = StateSpecs.bag(); - // Keep track of whether a timer is currently set or not. - @StateId("isTimerSet") private final StateSpec> isTimerSet = StateSpecs.value(); - // The processing-time timer user to publish the RPC. - @TimerId("outputState") private final TimerSpec timer = TimerSpecs.timer(TimeDomain.PROCESSING_TIME); - - @ProcessElement public void process( - @Element KV element, - @StateId("state") BagState elementsState, - @StateId("isTimerSet") ValueState isTimerSetState, - @TimerId("outputState") Timer timer) { - // Add the current element to the bag for this key. - state.add(element.getValue()); - if (!MoreObjects.firstNonNull(isTimerSetState.read(), false)) { - // If there is no timer currently set, then set one to go off in 10 seconds. 
- timer.offset(Duration.standardSeconds(10)).setRelative(); - isTimerSetState.write(true); - } - } - - @OnTimer("outputState") public void onTimer( - @StateId("state") BagState elementsState, - @StateId("isTimerSet") ValueState isTimerSetState) { - // Send an RPC containing the batched elements and clear state. - sendRPC(elementsState.read()); - elementsState.clear(); - isTimerSetState.clear(); - } -})); -``` \ No newline at end of file +{{< /highlight >}} diff --git a/website/www/site/content/en/documentation/resources/learning-resources.md b/website/www/site/content/en/documentation/resources/learning-resources.md index 0289932b1e4ed..0e563fdd211ce 100644 --- a/website/www/site/content/en/documentation/resources/learning-resources.md +++ b/website/www/site/content/en/documentation/resources/learning-resources.md @@ -1,8 +1,5 @@ --- -layout: section title: "Learning Resources" -section_menu: section-menu/documentation.html -permalink: /documentation/resources/learning-resources/ --- -# MapElements +# FlatMapElements + href="https://beam.apache.org/releases/javadoc/current/index.html?org/apache/beam/sdk/transforms/FlatMapElements.html"> Javadoc Javadoc
-
-Applies a simple 1-to-1 mapping function over each element in the collection. +

-## Examples -**Example 1**: providing the mapping function using a `SimpleFunction` - -```java -PCollection lines = Create.of("Hello World", "Beam is fun"); -PCollection lineLengths = lines.apply(MapElements.via( - new SimpleFunction() { - @Override - public Integer apply(String line) { - return line.length(); - } - }); -``` +Applies a simple 1-to-many mapping function over each element in the +collection. The many elements are flattened into the resulting collection. -**Example 2**: providing the mapping function using a `SerializableFunction`, -which allows the use of Java 8 lambdas. Due to type erasure, you need -to provide a hint indicating the desired return type. - -```java -PCollection lines = Create.of("Hello World", "Beam is fun"); -PCollection lineLengths = lines.apply(MapElements - .into(TypeDescriptors.integers()) - .via((String line) -> line.length())); -``` +## Examples +See [BEAM-7702](https://issues.apache.org/jira/browse/BEAM-7702) for updates. ## Related transforms -* [FlatMapElements]({{ site.baseurl }}/documentation/transforms/java/elementwise/flatmapelements) behaves the same as `Map`, but for - each input it may produce zero or more outputs. -* [Filter]({{ site.baseurl }}/documentation/transforms/java/elementwise/filter) is useful if the function is just +* [Filter](/documentation/transforms/java/elementwise/filter) is useful if the function is just deciding whether to output an element or not. -* [ParDo]({{ site.baseurl }}/documentation/transforms/java/elementwise/pardo) is the most general element-wise mapping - operation, and includes other abilities such as multiple output collections and side-inputs. \ No newline at end of file +* [ParDo](/documentation/transforms/java/elementwise/pardo) is the most general element-wise mapping + operation, and includes other abilities such as multiple output collections and side-inputs. \ No newline at end of file diff --git a/website/www/site/content/en/documentation/transforms/java/elementwise/keys.md b/website/www/site/content/en/documentation/transforms/java/elementwise/keys.md index b0d0738628dde..4e29731e23156 100644 --- a/website/www/site/content/en/documentation/transforms/java/elementwise/keys.md +++ b/website/www/site/content/en/documentation/transforms/java/elementwise/keys.md @@ -1,8 +1,5 @@ --- -layout: section title: "Keys" -permalink: /documentation/transforms/java/elementwise/keys/ -section_menu: section-menu/documentation.html --- -# FlatMapElements +# MapElements + href="https://beam.apache.org/releases/javadoc/current/index.html?org/apache/beam/sdk/transforms/MapElements.html"> Javadoc Javadoc
-
-Applies a simple 1-to-many mapping function over each element in the -collection. The many elements are flattened into the resulting collection. +

+ +Applies a simple 1-to-1 mapping function over each element in the collection. ## Examples -See [BEAM-7702](https://issues.apache.org/jira/browse/BEAM-7702) for updates. +**Example 1**: providing the mapping function using a `SimpleFunction` + +{{< highlight java >}} +PCollection lines = Create.of("Hello World", "Beam is fun"); +PCollection lineLengths = lines.apply(MapElements.via( + new SimpleFunction() { + @Override + public Integer apply(String line) { + return line.length(); + } + }); +{{< /highlight >}} + +**Example 2**: providing the mapping function using a `SerializableFunction`, +which allows the use of Java 8 lambdas. Due to type erasure, you need +to provide a hint indicating the desired return type. + +{{< highlight java >}} +PCollection lines = Create.of("Hello World", "Beam is fun"); +PCollection lineLengths = lines.apply(MapElements + .into(TypeDescriptors.integers()) + .via((String line) -> line.length())); +{{< /highlight >}} ## Related transforms -* [Filter]({{ site.baseurl }}/documentation/transforms/java/elementwise/filter) is useful if the function is just +* [FlatMapElements](/documentation/transforms/java/elementwise/flatmapelements) behaves the same as `Map`, but for + each input it may produce zero or more outputs. +* [Filter](/documentation/transforms/java/elementwise/filter) is useful if the function is just deciding whether to output an element or not. -* [ParDo]({{ site.baseurl }}/documentation/transforms/java/elementwise/pardo) is the most general element-wise mapping - operation, and includes other abilities such as multiple output collections and side-inputs. \ No newline at end of file +* [ParDo](/documentation/transforms/java/elementwise/pardo) is the most general element-wise mapping + operation, and includes other abilities such as multiple output collections and side-inputs. \ No newline at end of file diff --git a/website/www/site/content/en/documentation/transforms/java/elementwise/pardo.md b/website/www/site/content/en/documentation/transforms/java/elementwise/pardo.md index 3a7b979d39011..f6805efec029e 100644 --- a/website/www/site/content/en/documentation/transforms/java/elementwise/pardo.md +++ b/website/www/site/content/en/documentation/transforms/java/elementwise/pardo.md @@ -1,8 +1,5 @@ --- -layout: section title: "ParDo" -permalink: /documentation/transforms/java/elementwise/pardo/ -section_menu: section-menu/documentation.html --- + +# Java transform catalog overview + +## Element-wise + + + + + + + + + + + + + + + + +
TransformDescription
FilterGiven a predicate, filter out all elements that don't satisfy the predicate.
FlatMapElementsApplies a function that returns a collection to every element in the input and + outputs all resulting elements.
KeysExtracts the key from each element in a collection of key-value pairs.
KvSwapSwaps the key and value of each element in a collection of key-value pairs.
MapElementsApplies a function to every element in the input and outputs the result.
ParDoThe most-general mechanism for applying a user-defined DoFn to every element + in the input collection.
PartitionRoutes each input element to a specific output collection based on some partition + function.
RegexFilters input string elements based on a regex. May also transform them based on the matching groups.
ReifyTransforms for converting between explicit and implicit form of various Beam values.
ToStringTransforms every element in an input collection to a string.
WithKeysProduces a collection containing each element from the input collection converted to a key-value pair, with a key selected by applying a function to the input element.
WithTimestampsApplies a function to determine a timestamp for each element in the output collection, + and updates the implicit timestamp associated with each input. Note that it is only safe to adjust timestamps forwards.
ValuesExtracts the value from each element in a collection of key-value pairs.
+ + + +## Aggregation + + + + + + + + + + + + + + + + + + + +
TransformDescription
ApproximateQuantilesUses an approximation algorithm to estimate the data distribution within each aggregation using a specified number of quantiles.
ApproximateUniqueUses an approximation algorithm to estimate the number of unique elements within each aggregation.
CoGroupByKeyTakes several keyed collections of elements and produces a collection where each element consists of a key and all values associated with that key.
CombineTransforms to combine elements according to a provided CombineFn.
CombineWithContextAn extended version of Combine which allows accessing side-inputs and other context.
CountCounts the number of elements within each aggregation.
DistinctProduces a collection containing distinct elements from the input collection.
GroupByKeyTakes a keyed collection of elements and produces a collection where each element + consists of a key and all values associated with that key.
GroupIntoBatchesBatches values associated with keys into Iterable batches of some size. Each batch contains elements associated with a specific key.
HllCountEstimates the number of distinct elements and creates re-aggregatable sketches using the HyperLogLog++ algorithm.
LatestSelects the latest element within each aggregation according to the implicit timestamp.
MaxOutputs the maximum element within each aggregation.
MeanComputes the average within each aggregation.
MinOutputs the minimum element within each aggregation.
SampleRandomly select some number of elements from each aggregation.
SumCompute the sum of elements in each aggregation.
TopCompute the largest element(s) in each aggregation.
+ + +## Other + + + + + + + +
TransformDescription
CreateCreates a collection from an in-memory list.
FlattenGiven multiple input collections, produces a single output collection containing + all elements from all of the input collections.
PAssertA transform to assert the contents of a PCollection used as part of testing a pipeline either locally or with a runner.
ViewOperations for turning a collection into a view that may be used as a side-input to a ParDo.
WindowLogically divides up or groups the elements of a collection into finite + windows according to a provided WindowFn.
\ No newline at end of file diff --git a/website/www/site/content/en/documentation/transforms/python/aggregation/approximatequantiles.md b/website/www/site/content/en/documentation/transforms/python/aggregation/approximatequantiles.md index 4fb577ea0bd94..d3dd2b78fa3fc 100644 --- a/website/www/site/content/en/documentation/transforms/python/aggregation/approximatequantiles.md +++ b/website/www/site/content/en/documentation/transforms/python/aggregation/approximatequantiles.md @@ -1,8 +1,5 @@ --- -layout: section title: "ApproximateQuantiles" -permalink: /documentation/transforms/python/aggregation/approximatequantiles/ -section_menu: section-menu/documentation.html --- +{{< /highlight >}} -{:.notebook-skip} +{{< paragraph class="notebook-skip" >}} Output `PCollection` after `Filter`: +{{< /paragraph >}} -{:.notebook-skip} -``` -{% github_sample /apache/beam/blob/master/sdks/python/apache_beam/examples/snippets/transforms/elementwise/filter_test.py tag:perennials %}``` +{{< highlight class="notebook-skip" >}} + +{{< /highlight >}} -{% include buttons-code-snippet.md +{{< buttons-code-snippet py="sdks/python/apache_beam/examples/snippets/transforms/elementwise/filter.py" - notebook="examples/notebooks/documentation/transforms/python/elementwise/filter" -%} + notebook="examples/notebooks/documentation/transforms/python/elementwise/filter" >}} ### Example 2: Filtering with a lambda function We can also use lambda functions to simplify **Example 1**. -```py -{% github_sample /apache/beam/blob/master/sdks/python/apache_beam/examples/snippets/transforms/elementwise/filter.py tag:filter_lambda %}``` +{{< highlight py >}} + +{{< /highlight >}} -{:.notebook-skip} +{{< paragraph class="notebook-skip" >}} Output `PCollection` after `Filter`: +{{< /paragraph >}} -{:.notebook-skip} -``` -{% github_sample /apache/beam/blob/master/sdks/python/apache_beam/examples/snippets/transforms/elementwise/filter_test.py tag:perennials %}``` +{{< highlight class="notebook-skip" >}} + +{{< /highlight >}} -{% include buttons-code-snippet.md +{{< buttons-code-snippet py="sdks/python/apache_beam/examples/snippets/transforms/elementwise/filter.py" - notebook="examples/notebooks/documentation/transforms/python/elementwise/filter" -%} + notebook="examples/notebooks/documentation/transforms/python/elementwise/filter" >}} ### Example 3: Filtering with multiple arguments @@ -82,20 +87,25 @@ They are passed as additional positional arguments or keyword arguments to the f In this example, `has_duration` takes `plant` and `duration` as arguments. 
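The referenced `filter.py` sample is not reproduced on this page, so here is a minimal, self-contained sketch of the same idea; the element structure (dicts with `'name'` and `'duration'` keys) and the `has_duration` body are assumptions for illustration rather than the actual snippet.

```py
import apache_beam as beam

def has_duration(plant, duration):
  # Keep only elements whose 'duration' field matches the extra argument.
  return plant['duration'] == duration

with beam.Pipeline() as pipeline:
  perennials = (
      pipeline
      | 'Gardening plants' >> beam.Create([
          {'name': 'Strawberry', 'duration': 'perennial'},
          {'name': 'Carrot', 'duration': 'biennial'},
          {'name': 'Eggplant', 'duration': 'perennial'},
      ])
      # The extra positional argument is forwarded to has_duration.
      | 'Filter perennials' >> beam.Filter(has_duration, 'perennial')
      | beam.Map(print))
```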
-```py -{% github_sample /apache/beam/blob/master/sdks/python/apache_beam/examples/snippets/transforms/elementwise/filter.py tag:filter_multiple_arguments %}``` +{{< highlight py >}} + +{{< /highlight >}} -{:.notebook-skip} +{{< paragraph class="notebook-skip" >}} Output `PCollection` after `Filter`: +{{< /paragraph >}} -{:.notebook-skip} -``` -{% github_sample /apache/beam/blob/master/sdks/python/apache_beam/examples/snippets/transforms/elementwise/filter_test.py tag:perennials %}``` +{{< highlight class="notebook-skip" >}} + +{{< /highlight >}} -{% include buttons-code-snippet.md +{{< buttons-code-snippet py="sdks/python/apache_beam/examples/snippets/transforms/elementwise/filter.py" - notebook="examples/notebooks/documentation/transforms/python/elementwise/filter" -%} + notebook="examples/notebooks/documentation/transforms/python/elementwise/filter" >}} ### Example 4: Filtering with side inputs as singletons @@ -105,20 +115,25 @@ passing the `PCollection` as a *singleton* accesses that value. In this example, we pass a `PCollection` the value `'perennial'` as a singleton. We then use that value to filter out perennials. -```py -{% github_sample /apache/beam/blob/master/sdks/python/apache_beam/examples/snippets/transforms/elementwise/filter.py tag:filter_side_inputs_singleton %}``` +{{< highlight py >}} + +{{< /highlight >}} -{:.notebook-skip} +{{< paragraph class="notebook-skip" >}} Output `PCollection` after `Filter`: +{{< /paragraph >}} -{:.notebook-skip} -``` -{% github_sample /apache/beam/blob/master/sdks/python/apache_beam/examples/snippets/transforms/elementwise/filter_test.py tag:perennials %}``` +{{< highlight class="notebook-skip" >}} + +{{< /highlight >}} -{% include buttons-code-snippet.md +{{< buttons-code-snippet py="sdks/python/apache_beam/examples/snippets/transforms/elementwise/filter.py" - notebook="examples/notebooks/documentation/transforms/python/elementwise/filter" -%} + notebook="examples/notebooks/documentation/transforms/python/elementwise/filter" >}} ### Example 5: Filtering with side inputs as iterators @@ -126,20 +141,25 @@ If the `PCollection` has multiple values, pass the `PCollection` as an *iterator This accesses elements lazily as they are needed, so it is possible to iterate over large `PCollection`s that won't fit into memory. -```py -{% github_sample /apache/beam/blob/master/sdks/python/apache_beam/examples/snippets/transforms/elementwise/filter.py tag:filter_side_inputs_iter %}``` +{{< highlight py >}} + +{{< /highlight >}} -{:.notebook-skip} +{{< paragraph class="notebook-skip" >}} Output `PCollection` after `Filter`: +{{< /paragraph >}} -{:.notebook-skip} -``` -{% github_sample /apache/beam/blob/master/sdks/python/apache_beam/examples/snippets/transforms/elementwise/filter_test.py tag:valid_plants %}``` +{{< highlight class="notebook-skip" >}} + +{{< /highlight >}} -{% include buttons-code-snippet.md +{{< buttons-code-snippet py="sdks/python/apache_beam/examples/snippets/transforms/elementwise/filter.py" - notebook="examples/notebooks/documentation/transforms/python/elementwise/filter" -%} + notebook="examples/notebooks/documentation/transforms/python/elementwise/filter" >}} > **Note**: You can pass the `PCollection` as a *list* with `beam.pvalue.AsList(pcollection)`, > but this requires that all the elements fit into memory. @@ -151,26 +171,31 @@ Each element must be a `(key, value)` pair. Note that all the elements of the `PCollection` must fit into memory for this. 
If the `PCollection` won't fit into memory, use `beam.pvalue.AsIter(pcollection)` instead. -```py -{% github_sample /apache/beam/blob/master/sdks/python/apache_beam/examples/snippets/transforms/elementwise/filter.py tag:filter_side_inputs_dict %}``` +{{< highlight py >}} + +{{< /highlight >}} -{:.notebook-skip} +{{< paragraph class="notebook-skip" >}} Output `PCollection` after `Filter`: +{{< /paragraph >}} -{:.notebook-skip} -``` -{% github_sample /apache/beam/blob/master/sdks/python/apache_beam/examples/snippets/transforms/elementwise/filter_test.py tag:perennials %}``` +{{< highlight class="notebook-skip" >}} + +{{< /highlight >}} -{% include buttons-code-snippet.md +{{< buttons-code-snippet py="sdks/python/apache_beam/examples/snippets/transforms/elementwise/filter.py" - notebook="examples/notebooks/documentation/transforms/python/elementwise/filter" -%} + notebook="examples/notebooks/documentation/transforms/python/elementwise/filter" >}} ## Related transforms -* [FlatMap]({{ site.baseurl }}/documentation/transforms/python/elementwise/flatmap) behaves the same as `Map`, but for +* [FlatMap](/documentation/transforms/python/elementwise/flatmap) behaves the same as `Map`, but for each input it might produce zero or more outputs. -* [ParDo]({{ site.baseurl }}/documentation/transforms/python/elementwise/pardo) is the most general elementwise mapping +* [ParDo](/documentation/transforms/python/elementwise/pardo) is the most general elementwise mapping operation, and includes other abilities such as multiple output collections and side-inputs. -{% include button-pydoc.md path="apache_beam.transforms.core" class="Filter" %} +{{< button-pydoc path="apache_beam.transforms.core" class="Filter" >}} diff --git a/website/www/site/content/en/documentation/transforms/python/elementwise/flatmap.md b/website/www/site/content/en/documentation/transforms/python/elementwise/flatmap.md index fc5f2fc00dc10..02eeb4a1a098e 100644 --- a/website/www/site/content/en/documentation/transforms/python/elementwise/flatmap.md +++ b/website/www/site/content/en/documentation/transforms/python/elementwise/flatmap.md @@ -1,8 +1,5 @@ --- -layout: section title: "FlatMap" -permalink: /documentation/transforms/python/elementwise/flatmap/ -section_menu: section-menu/documentation.html --- +{{< /highlight >}} -{:.notebook-skip} +{{< paragraph class="notebook-skip" >}} Output `PCollection` after `FlatMap`: +{{< /paragraph >}} -{:.notebook-skip} -``` -{% github_sample /apache/beam/blob/master/sdks/python/apache_beam/examples/snippets/transforms/elementwise/flatmap_test.py tag:plants %}``` +{{< highlight class="notebook-skip" >}} + +{{< /highlight >}} -{% include buttons-code-snippet.md +{{< buttons-code-snippet py="sdks/python/apache_beam/examples/snippets/transforms/elementwise/flatmap.py" - notebook="examples/notebooks/documentation/transforms/python/elementwise/flatmap" -%} + notebook="examples/notebooks/documentation/transforms/python/elementwise/flatmap" >}} ### Example 2: FlatMap with a function We define a function `split_words` which splits an input `str` element using the delimiter `','` and outputs a `list` of `str`s. 
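As a rough stand-in for the referenced `flatmap.py` snippet, a minimal sketch of this pattern might look like the following; the comma-separated input strings are made up for illustration.

```py
import apache_beam as beam

def split_words(text):
  # Each returned list is flattened into individual output elements.
  return text.split(',')

with beam.Pipeline() as pipeline:
  plants = (
      pipeline
      | 'Gardening plants' >> beam.Create([
          'Strawberry,Carrot,Eggplant',
          'Tomato,Potato',
      ])
      | 'Split words' >> beam.FlatMap(split_words)
      | beam.Map(print))
```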
-```py -{% github_sample /apache/beam/blob/master/sdks/python/apache_beam/examples/snippets/transforms/elementwise/flatmap.py tag:flatmap_function %}``` +{{< highlight py >}} + +{{< /highlight >}} -{:.notebook-skip} +{{< paragraph class="notebook-skip" >}} Output `PCollection` after `FlatMap`: +{{< /paragraph >}} -{:.notebook-skip} -``` -{% github_sample /apache/beam/blob/master/sdks/python/apache_beam/examples/snippets/transforms/elementwise/flatmap_test.py tag:plants %}``` +{{< highlight class="notebook-skip" >}} + +{{< /highlight >}} -{% include buttons-code-snippet.md + +{{< buttons-code-snippet py="sdks/python/apache_beam/examples/snippets/transforms/elementwise/flatmap.py" - notebook="examples/notebooks/documentation/transforms/python/elementwise/flatmap" -%} + notebook="examples/notebooks/documentation/transforms/python/elementwise/flatmap" >}} ### Example 3: FlatMap with a lambda function @@ -82,20 +88,25 @@ For this example, we want to flatten a `PCollection` of lists of `str`s into a ` Each input element is already an `iterable`, where each element is what we want in the resulting `PCollection`. We use a lambda function that returns the same input element it received. -```py -{% github_sample /apache/beam/blob/master/sdks/python/apache_beam/examples/snippets/transforms/elementwise/flatmap.py tag:flatmap_lambda %}``` +{{< highlight py >}} + +{{< /highlight >}} -{:.notebook-skip} +{{< paragraph class="notebook-skip" >}} Output `PCollection` after `FlatMap`: +{{< /paragraph >}} -{:.notebook-skip} -``` -{% github_sample /apache/beam/blob/master/sdks/python/apache_beam/examples/snippets/transforms/elementwise/flatmap_test.py tag:plants %}``` +{{< highlight class="notebook-skip" >}} + +{{< /highlight >}} -{% include buttons-code-snippet.md +{{< buttons-code-snippet py="sdks/python/apache_beam/examples/snippets/transforms/elementwise/flatmap.py" - notebook="examples/notebooks/documentation/transforms/python/elementwise/flatmap" -%} + notebook="examples/notebooks/documentation/transforms/python/elementwise/flatmap" >}} ### Example 4: FlatMap with a generator @@ -103,40 +114,50 @@ For this example, we want to flatten a `PCollection` of lists of `str`s into a ` We use a generator to iterate over the input list and yield each of the elements. Each yielded result in the generator is an element in the resulting `PCollection`. -```py -{% github_sample /apache/beam/blob/master/sdks/python/apache_beam/examples/snippets/transforms/elementwise/flatmap.py tag:flatmap_generator %}``` +{{< highlight py >}} + +{{< /highlight >}} -{:.notebook-skip} +{{< paragraph class="notebook-skip" >}} Output `PCollection` after `FlatMap`: +{{< /paragraph >}} -{:.notebook-skip} -``` -{% github_sample /apache/beam/blob/master/sdks/python/apache_beam/examples/snippets/transforms/elementwise/flatmap_test.py tag:plants %}``` +{{< highlight class="notebook-skip" >}} + +{{< /highlight >}} -{% include buttons-code-snippet.md +{{< buttons-code-snippet py="sdks/python/apache_beam/examples/snippets/transforms/elementwise/flatmap.py" - notebook="examples/notebooks/documentation/transforms/python/elementwise/flatmap" -%} + notebook="examples/notebooks/documentation/transforms/python/elementwise/flatmap" >}} ### Example 5: FlatMapTuple for key-value pairs If your `PCollection` consists of `(key, value)` pairs, you can use `FlatMapTuple` to unpack them into different function arguments. 
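As a rough sketch of the `FlatMapTuple` usage described above (the `format_plant` helper and the sample pairs are assumptions, not the embedded snippet):

```py
import apache_beam as beam

def format_plant(icon, plant):
  # FlatMapTuple unpacks each (key, value) pair into separate arguments.
  yield '{}{}'.format(icon, plant)

with beam.Pipeline() as pipeline:
  (pipeline
   | 'Gardening plants' >> beam.Create([  # assumed sample data
       ('🍓', 'Strawberry'),
       ('🥕', 'Carrot'),
       ('🍆', 'Eggplant'),
   ])
   | 'Format' >> beam.FlatMapTuple(format_plant)
   | beam.Map(print))
```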
-```py -{% github_sample /apache/beam/blob/master/sdks/python/apache_beam/examples/snippets/transforms/elementwise/flatmap.py tag:flatmap_tuple %}``` +{{< highlight py >}} + +{{< /highlight >}} -{:.notebook-skip} +{{< paragraph class="notebook-skip" >}} Output `PCollection` after `FlatMapTuple`: +{{< /paragraph >}} -{:.notebook-skip} -``` -{% github_sample /apache/beam/blob/master/sdks/python/apache_beam/examples/snippets/transforms/elementwise/flatmap_test.py tag:plants %}``` +{{< highlight class="notebook-skip" >}} + +{{< /highlight >}} -{% include buttons-code-snippet.md +{{< buttons-code-snippet py="sdks/python/apache_beam/examples/snippets/transforms/elementwise/flatmap.py" - notebook="examples/notebooks/documentation/transforms/python/elementwise/flatmap" -%} + notebook="examples/notebooks/documentation/transforms/python/elementwise/flatmap" >}} ### Example 6: FlatMap with multiple arguments @@ -145,20 +166,25 @@ They are passed as additional positional arguments or keyword arguments to the f In this example, `split_words` takes `text` and `delimiter` as arguments. -```py -{% github_sample /apache/beam/blob/master/sdks/python/apache_beam/examples/snippets/transforms/elementwise/flatmap.py tag:flatmap_multiple_arguments %}``` +{{< highlight py >}} + +{{< /highlight >}} -{:.notebook-skip} +{{< paragraph class="notebook-skip" >}} Output `PCollection` after `FlatMap`: +{{< /paragraph >}} -{:.notebook-skip} -``` -{% github_sample /apache/beam/blob/master/sdks/python/apache_beam/examples/snippets/transforms/elementwise/flatmap_test.py tag:plants %}``` +{{< highlight class="notebook-skip" >}} + +{{< /highlight >}} -{% include buttons-code-snippet.md +{{< buttons-code-snippet py="sdks/python/apache_beam/examples/snippets/transforms/elementwise/flatmap.py" - notebook="examples/notebooks/documentation/transforms/python/elementwise/flatmap" -%} + notebook="examples/notebooks/documentation/transforms/python/elementwise/flatmap" >}} ### Example 7: FlatMap with side inputs as singletons @@ -168,20 +194,25 @@ passing the `PCollection` as a *singleton* accesses that value. In this example, we pass a `PCollection` the value `','` as a singleton. We then use that value as the delimiter for the `str.split` method. -```py -{% github_sample /apache/beam/blob/master/sdks/python/apache_beam/examples/snippets/transforms/elementwise/flatmap.py tag:flatmap_side_inputs_singleton %}``` +{{< highlight py >}} + +{{< /highlight >}} -{:.notebook-skip} +{{< paragraph class="notebook-skip" >}} Output `PCollection` after `FlatMap`: +{{< /paragraph >}} -{:.notebook-skip} -``` -{% github_sample /apache/beam/blob/master/sdks/python/apache_beam/examples/snippets/transforms/elementwise/flatmap_test.py tag:plants %}``` +{{< highlight class="notebook-skip" >}} + +{{< /highlight >}} -{% include buttons-code-snippet.md +{{< buttons-code-snippet py="sdks/python/apache_beam/examples/snippets/transforms/elementwise/flatmap.py" - notebook="examples/notebooks/documentation/transforms/python/elementwise/flatmap" -%} + notebook="examples/notebooks/documentation/transforms/python/elementwise/flatmap" >}} ### Example 8: FlatMap with side inputs as iterators @@ -189,20 +220,25 @@ If the `PCollection` has multiple values, pass the `PCollection` as an *iterator This accesses elements lazily as they are needed, so it is possible to iterate over large `PCollection`s that won't fit into memory. 
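A minimal sketch of passing a side input as an iterator with `beam.pvalue.AsIter`; the `valid_plants` helper and the sample data are assumptions rather than the referenced snippet:

```py
import apache_beam as beam

def valid_plants(plant, valid_durations):
  # valid_durations is an iterable side input, accessed lazily at runtime.
  if plant['duration'] in valid_durations:
    yield plant

with beam.Pipeline() as pipeline:
  valid_durations = pipeline | 'Valid durations' >> beam.Create(
      ['annual', 'biennial', 'perennial'])  # assumed sample data

  (pipeline
   | 'Gardening plants' >> beam.Create([
       {'name': 'Strawberry', 'duration': 'perennial'},
       {'name': 'Potato', 'duration': 'PERENNIAL'},  # dropped: not in side input
   ])
   | 'Keep valid plants' >> beam.FlatMap(
       valid_plants, valid_durations=beam.pvalue.AsIter(valid_durations))
   | beam.Map(print))
```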
-```py -{% github_sample /apache/beam/blob/master/sdks/python/apache_beam/examples/snippets/transforms/elementwise/flatmap.py tag:flatmap_side_inputs_iter %}``` +{{< highlight py >}} + +{{< /highlight >}} -{:.notebook-skip} +{{< paragraph class="notebook-skip" >}} Output `PCollection` after `FlatMap`: +{{< /paragraph >}} -{:.notebook-skip} -``` -{% github_sample /apache/beam/blob/master/sdks/python/apache_beam/examples/snippets/transforms/elementwise/flatmap_test.py tag:valid_plants %}``` +{{< highlight class="notebook-skip" >}} + +{{< /highlight >}} -{% include buttons-code-snippet.md +{{< buttons-code-snippet py="sdks/python/apache_beam/examples/snippets/transforms/elementwise/flatmap.py" - notebook="examples/notebooks/documentation/transforms/python/elementwise/flatmap" -%} + notebook="examples/notebooks/documentation/transforms/python/elementwise/flatmap" >}} > **Note**: You can pass the `PCollection` as a *list* with `beam.pvalue.AsList(pcollection)`, > but this requires that all the elements fit into memory. @@ -214,27 +250,32 @@ Each element must be a `(key, value)` pair. Note that all the elements of the `PCollection` must fit into memory for this. If the `PCollection` won't fit into memory, use `beam.pvalue.AsIter(pcollection)` instead. -```py -{% github_sample /apache/beam/blob/master/sdks/python/apache_beam/examples/snippets/transforms/elementwise/flatmap.py tag:flatmap_side_inputs_dict %}``` +{{< highlight py >}} + +{{< /highlight >}} -{:.notebook-skip} +{{< paragraph class="notebook-skip" >}} Output `PCollection` after `FlatMap`: +{{< /paragraph >}} -{:.notebook-skip} -``` -{% github_sample /apache/beam/blob/master/sdks/python/apache_beam/examples/snippets/transforms/elementwise/flatmap_test.py tag:valid_plants %}``` +{{< highlight class="notebook-skip" >}} + +{{< /highlight >}} -{% include buttons-code-snippet.md +{{< buttons-code-snippet py="sdks/python/apache_beam/examples/snippets/transforms/elementwise/flatmap.py" - notebook="examples/notebooks/documentation/transforms/python/elementwise/flatmap" -%} + notebook="examples/notebooks/documentation/transforms/python/elementwise/flatmap" >}} ## Related transforms -* [Filter]({{ site.baseurl }}/documentation/transforms/python/elementwise/filter) is useful if the function is just +* [Filter](/documentation/transforms/python/elementwise/filter) is useful if the function is just deciding whether to output an element or not. -* [ParDo]({{ site.baseurl }}/documentation/transforms/python/elementwise/pardo) is the most general elementwise mapping +* [ParDo](/documentation/transforms/python/elementwise/pardo) is the most general elementwise mapping operation, and includes other abilities such as multiple output collections and side-inputs. -* [Map]({{ site.baseurl }}/documentation/transforms/python/elementwise/map) behaves the same, but produces exactly one output for each input. +* [Map](/documentation/transforms/python/elementwise/map) behaves the same, but produces exactly one output for each input. 
-{% include button-pydoc.md path="apache_beam.transforms.core" class="FlatMap" %} +{{< button-pydoc path="apache_beam.transforms.core" class="FlatMap" >}} diff --git a/website/www/site/content/en/documentation/transforms/python/elementwise/keys.md b/website/www/site/content/en/documentation/transforms/python/elementwise/keys.md index 4473ee9973a49..de408a3ffebaf 100644 --- a/website/www/site/content/en/documentation/transforms/python/elementwise/keys.md +++ b/website/www/site/content/en/documentation/transforms/python/elementwise/keys.md @@ -1,8 +1,5 @@ --- -layout: section title: "Keys" -permalink: /documentation/transforms/python/elementwise/keys/ -section_menu: section-menu/documentation.html --- +{{< /highlight >}} -{:.notebook-skip} +{{< paragraph class="notebook-skip" >}} Output `PCollection` after `Keys`: +{{< /paragraph >}} -{:.notebook-skip} -``` -{% github_sample /apache/beam/blob/master/sdks/python/apache_beam/examples/snippets/transforms/elementwise/keys_test.py tag:icons %}``` +{{< highlight class="notebook-skip" >}} + +{{< /highlight >}} -{% include buttons-code-snippet.md +{{< buttons-code-snippet py="sdks/python/apache_beam/examples/snippets/transforms/elementwise/keys.py" - notebook="examples/notebooks/documentation/transforms/python/elementwise/keys" -%} + notebook="examples/notebooks/documentation/transforms/python/elementwise/keys" >}} ## Related transforms -* [KvSwap]({{ site.baseurl }}/documentation/transforms/python/elementwise/kvswap) swaps the key and value of each element. -* [Values]({{ site.baseurl }}/documentation/transforms/python/elementwise/values) for extracting the value of each element. +* [KvSwap](/documentation/transforms/python/elementwise/kvswap) swaps the key and value of each element. +* [Values](/documentation/transforms/python/elementwise/values) for extracting the value of each element. -{% include button-pydoc.md path="apache_beam.transforms.util" class="Keys" %} +{{< button-pydoc path="apache_beam.transforms.util" class="Keys" >}} diff --git a/website/www/site/content/en/documentation/transforms/python/elementwise/kvswap.md b/website/www/site/content/en/documentation/transforms/python/elementwise/kvswap.md index 226810049514d..de865b79dca50 100644 --- a/website/www/site/content/en/documentation/transforms/python/elementwise/kvswap.md +++ b/website/www/site/content/en/documentation/transforms/python/elementwise/kvswap.md @@ -1,8 +1,5 @@ --- -layout: section -title: "KvSwap" -permalink: /documentation/transforms/python/elementwise/kvswap/ -section_menu: section-menu/documentation.html +title: "Partition" --- -# Kvswap +# Partition - +{{< localstorage language language-py >}} -{% include button-pydoc.md path="apache_beam.transforms.util" class="KvSwap" %} +{{< button-pydoc path="apache_beam.transforms.core" class="Partition" >}} -Takes a collection of key-value pairs and returns a collection of key-value pairs -which has each key and value swapped. +Separates elements in a collection into multiple output +collections. The partitioning function contains the logic that determines how +to separate the elements of the input collection into each resulting +partition output collection. + +The number of partitions must be determined at graph construction time. +You cannot determine the number of partitions in mid-pipeline + +See more information in the [Beam Programming Guide](/documentation/programming-guide/#partition). ## Examples -In the following example, we create a pipeline with a `PCollection` of key-value pairs. 
-Then, we apply `KvSwap` to swap the keys and values. +In the following examples, we create a pipeline with a `PCollection` of produce with their icon, name, and duration. +Then, we apply `Partition` in multiple ways to split the `PCollection` into multiple `PCollections`. + +`Partition` accepts a function that receives the number of partitions, +and returns the index of the desired partition for the element. +The number of partitions passed must be a positive integer, +and it must return an integer in the range `0` to `num_partitions-1`. + +### Example 1: Partition with a function + +In the following example, we have a known list of durations. +We partition the `PCollection` into one `PCollection` for every duration type. + +{{< highlight py >}} + +{{< /highlight >}} + +{{< paragraph class="notebook-skip" >}} +Output `PCollection`s: +{{< /paragraph >}} + +{{< highlight class="notebook-skip" >}} + +{{< /highlight >}} + +{{< buttons-code-snippet + py="sdks/python/apache_beam/examples/snippets/transforms/elementwise/partition.py" + notebook="examples/notebooks/documentation/transforms/python/elementwise/partition" >}} + +### Example 2: Partition with a lambda function + +We can also use lambda functions to simplify **Example 1**. + +{{< highlight py >}} + +{{< /highlight >}} + +{{< paragraph class="notebook-skip" >}} +Output `PCollection`s: +{{< /paragraph >}} + +{{< highlight class="notebook-skip" >}} + +{{< /highlight >}} + +{{< buttons-code-snippet + py="sdks/python/apache_beam/examples/snippets/transforms/elementwise/partition.py" + notebook="examples/notebooks/documentation/transforms/python/elementwise/partition" >}} + +### Example 3: Partition with multiple arguments + +You can pass functions with multiple arguments to `Partition`. +They are passed as additional positional arguments or keyword arguments to the function. + +In machine learning, it is a common task to split data into +[training and a testing datasets](https://en.wikipedia.org/wiki/Training,_validation,_and_test_sets). +Typically, 80% of the data is used for training a model and 20% is used for testing. -```py -{% github_sample /apache/beam/blob/master/sdks/python/apache_beam/examples/snippets/transforms/elementwise/kvswap.py tag:kvswap %}``` +In this example, we split a `PCollection` dataset into training and testing datasets. +We define `split_dataset`, which takes the `plant` element, `num_partitions`, +and an additional argument `ratio`. +The `ratio` is a list of numbers which represents the ratio of how many items will go into each partition. +`num_partitions` is used by `Partitions` as a positional argument, +while `plant` and `ratio` are passed to `split_dataset`. -{:.notebook-skip} -Output `PCollection` after `KvSwap`: +If we want an 80%/20% split, we can specify a ratio of `[8, 2]`, which means that for every 10 elements, +8 go into the first partition and 2 go into the second. +In order to determine which partition to send each element, we have different buckets. +For our case `[8, 2]` has **10** buckets, +where the first 8 buckets represent the first partition and the last 2 buckets represent the second partition. -{:.notebook-skip} -``` -{% github_sample /apache/beam/blob/master/sdks/python/apache_beam/examples/snippets/transforms/elementwise/kvswap_test.py tag:plants %}``` +First, we check that the ratio list's length corresponds to the `num_partitions` we pass. +We then get a bucket index for each element, in the range from 0 to 9 (`num_buckets-1`). 
+We could do `hash(element) % len(ratio)`, but instead we sum all the ASCII characters of the +JSON representation to make it deterministic. +Finally, we loop through all the elements in the ratio and have a running total to +identify the partition index to which that bucket corresponds. + +This `split_dataset` function is generic enough to support any number of partitions by any ratio. +You might want to adapt the bucket assignment to use a more appropriate or randomized hash for your dataset. + +{{< highlight py >}} + +{{< /highlight >}} + +{{< paragraph class="notebook-skip" >}} +Output `PCollection`s: +{{< /paragraph >}} + +{{< highlight class="notebook-skip" >}} + +{{< /highlight >}} -{% include buttons-code-snippet.md - py="sdks/python/apache_beam/examples/snippets/transforms/elementwise/kvswap.py" - notebook="examples/notebooks/documentation/transforms/python/elementwise/kvswap" -%} +{{< buttons-code-snippet + py="sdks/python/apache_beam/examples/snippets/transforms/elementwise/partition.py" + notebook="examples/notebooks/documentation/transforms/python/elementwise/partition" >}} ## Related transforms -* [Keys]({{ site.baseurl }}/documentation/transforms/python/elementwise/keys) for extracting the key of each component. -* [Values]({{ site.baseurl }}/documentation/transforms/python/elementwise/values) for extracting the value of each element. +* [Filter](/documentation/transforms/python/elementwise/filter) is useful if the function is just + deciding whether to output an element or not. +* [ParDo](/documentation/transforms/python/elementwise/pardo) is the most general elementwise mapping + operation, and includes other abilities such as multiple output collections and side-inputs. +* [CoGroupByKey](/documentation/transforms/python/aggregation/cogroupbykey) +performs a per-key equijoin. -{% include button-pydoc.md path="apache_beam.transforms.util" class="KvSwap" %} +{{< button-pydoc path="apache_beam.transforms.core" class="Partition" >}} diff --git a/website/www/site/content/en/documentation/transforms/python/elementwise/map.md b/website/www/site/content/en/documentation/transforms/python/elementwise/map.md index 2ad0e3356237c..de865b79dca50 100644 --- a/website/www/site/content/en/documentation/transforms/python/elementwise/map.md +++ b/website/www/site/content/en/documentation/transforms/python/elementwise/map.md @@ -1,8 +1,5 @@ --- -layout: section -title: "Map" -permalink: /documentation/transforms/python/elementwise/map/ -section_menu: section-menu/documentation.html +title: "Partition" --- -# Map +# Partition - +{{< localstorage language language-py >}} -{% include button-pydoc.md path="apache_beam.transforms.core" class="Map" %} +{{< button-pydoc path="apache_beam.transforms.core" class="Partition" >}} -Applies a simple 1-to-1 mapping function over each element in the collection. +Separates elements in a collection into multiple output +collections. The partitioning function contains the logic that determines how +to separate the elements of the input collection into each resulting +partition output collection. -## Examples - -In the following examples, we create a pipeline with a `PCollection` of produce with their icon, name, and duration. -Then, we apply `Map` in multiple ways to transform every element in the `PCollection`. - -`Map` accepts a function that returns a single element for every input element in the `PCollection`. - -### Example 1: Map with a predefined function - -We use the function `str.strip` which takes a single `str` element and outputs a `str`. 
-It strips the input element's whitespaces, including newlines and tabs. +The number of partitions must be determined at graph construction time. +You cannot determine the number of partitions in mid-pipeline -```py -{% github_sample /apache/beam/blob/master/sdks/python/apache_beam/examples/snippets/transforms/elementwise/map.py tag:map_simple %}``` +See more information in the [Beam Programming Guide](/documentation/programming-guide/#partition). -{:.notebook-skip} -Output `PCollection` after `Map`: +## Examples -{:.notebook-skip} -``` -{% github_sample /apache/beam/blob/master/sdks/python/apache_beam/examples/snippets/transforms/elementwise/map_test.py tag:plants %}``` +In the following examples, we create a pipeline with a `PCollection` of produce with their icon, name, and duration. +Then, we apply `Partition` in multiple ways to split the `PCollection` into multiple `PCollections`. -{% include buttons-code-snippet.md - py="sdks/python/apache_beam/examples/snippets/transforms/elementwise/map.py" - notebook="examples/notebooks/documentation/transforms/python/elementwise/map" -%} +`Partition` accepts a function that receives the number of partitions, +and returns the index of the desired partition for the element. +The number of partitions passed must be a positive integer, +and it must return an integer in the range `0` to `num_partitions-1`. -### Example 2: Map with a function +### Example 1: Partition with a function -We define a function `strip_header_and_newline` which strips any `'#'`, `' '`, and `'\n'` characters from each element. +In the following example, we have a known list of durations. +We partition the `PCollection` into one `PCollection` for every duration type. -```py -{% github_sample /apache/beam/blob/master/sdks/python/apache_beam/examples/snippets/transforms/elementwise/map.py tag:map_function %}``` +{{< highlight py >}} + +{{< /highlight >}} -{:.notebook-skip} -Output `PCollection` after `Map`: +{{< paragraph class="notebook-skip" >}} +Output `PCollection`s: +{{< /paragraph >}} -{:.notebook-skip} -``` -{% github_sample /apache/beam/blob/master/sdks/python/apache_beam/examples/snippets/transforms/elementwise/map_test.py tag:plants %}``` +{{< highlight class="notebook-skip" >}} + +{{< /highlight >}} -{% include buttons-code-snippet.md - py="sdks/python/apache_beam/examples/snippets/transforms/elementwise/map.py" - notebook="examples/notebooks/documentation/transforms/python/elementwise/map" -%} +{{< buttons-code-snippet + py="sdks/python/apache_beam/examples/snippets/transforms/elementwise/partition.py" + notebook="examples/notebooks/documentation/transforms/python/elementwise/partition" >}} -### Example 3: Map with a lambda function +### Example 2: Partition with a lambda function -We can also use lambda functions to simplify **Example 2**. +We can also use lambda functions to simplify **Example 1**. 
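For orientation while reading this hunk, a sketch of the duration-based partition that Example 1 describes and that the lambda in Example 2 simplifies; the `by_duration` name and the produce dictionaries are assumptions, not the embedded snippet:

```py
import apache_beam as beam

durations = ['annual', 'biennial', 'perennial']

def by_duration(plant, num_partitions):
  # Return an index in the range [0, num_partitions) for each element.
  return durations.index(plant['duration'])

with beam.Pipeline() as pipeline:
  annuals, biennials, perennials = (
      pipeline
      | 'Gardening plants' >> beam.Create([  # assumed sample data
          {'icon': '🍓', 'name': 'Strawberry', 'duration': 'perennial'},
          {'icon': '🥕', 'name': 'Carrot', 'duration': 'biennial'},
          {'icon': '🍅', 'name': 'Tomato', 'duration': 'annual'},
      ])
      | 'Partition by duration' >> beam.Partition(by_duration, len(durations)))

  perennials | 'Print perennials' >> beam.Map(print)
```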
-```py -{% github_sample /apache/beam/blob/master/sdks/python/apache_beam/examples/snippets/transforms/elementwise/map.py tag:map_lambda %}``` +{{< highlight py >}} + +{{< /highlight >}} -{:.notebook-skip} -Output `PCollection` after `Map`: +{{< paragraph class="notebook-skip" >}} +Output `PCollection`s: +{{< /paragraph >}} -{:.notebook-skip} -``` -{% github_sample /apache/beam/blob/master/sdks/python/apache_beam/examples/snippets/transforms/elementwise/map_test.py tag:plants %}``` +{{< highlight class="notebook-skip" >}} + +{{< /highlight >}} -{% include buttons-code-snippet.md - py="sdks/python/apache_beam/examples/snippets/transforms/elementwise/map.py" - notebook="examples/notebooks/documentation/transforms/python/elementwise/map" -%} +{{< buttons-code-snippet + py="sdks/python/apache_beam/examples/snippets/transforms/elementwise/partition.py" + notebook="examples/notebooks/documentation/transforms/python/elementwise/partition" >}} -### Example 4: Map with multiple arguments +### Example 3: Partition with multiple arguments -You can pass functions with multiple arguments to `Map`. +You can pass functions with multiple arguments to `Partition`. They are passed as additional positional arguments or keyword arguments to the function. -In this example, `strip` takes `text` and `chars` as arguments. - -```py -{% github_sample /apache/beam/blob/master/sdks/python/apache_beam/examples/snippets/transforms/elementwise/map.py tag:map_multiple_arguments %}``` - -{:.notebook-skip} -Output `PCollection` after `Map`: - -{:.notebook-skip} -``` -{% github_sample /apache/beam/blob/master/sdks/python/apache_beam/examples/snippets/transforms/elementwise/map_test.py tag:plants %}``` - -{% include buttons-code-snippet.md - py="sdks/python/apache_beam/examples/snippets/transforms/elementwise/map.py" - notebook="examples/notebooks/documentation/transforms/python/elementwise/map" -%} - -### Example 5: MapTuple for key-value pairs - -If your `PCollection` consists of `(key, value)` pairs, -you can use `MapTuple` to unpack them into different function arguments. - -```py -{% github_sample /apache/beam/blob/master/sdks/python/apache_beam/examples/snippets/transforms/elementwise/map.py tag:map_tuple %}``` - -{:.notebook-skip} -Output `PCollection` after `MapTuple`: - -{:.notebook-skip} -``` -{% github_sample /apache/beam/blob/master/sdks/python/apache_beam/examples/snippets/transforms/elementwise/map_test.py tag:plants %}``` - -{% include buttons-code-snippet.md - py="sdks/python/apache_beam/examples/snippets/transforms/elementwise/map.py" - notebook="examples/notebooks/documentation/transforms/python/elementwise/map" -%} - -### Example 6: Map with side inputs as singletons - -If the `PCollection` has a single value, such as the average from another computation, -passing the `PCollection` as a *singleton* accesses that value. - -In this example, we pass a `PCollection` the value `'# \n'` as a singleton. -We then use that value as the characters for the `str.strip` method. 
- -```py -{% github_sample /apache/beam/blob/master/sdks/python/apache_beam/examples/snippets/transforms/elementwise/map.py tag:map_side_inputs_singleton %}``` - -{:.notebook-skip} -Output `PCollection` after `Map`: - -{:.notebook-skip} -``` -{% github_sample /apache/beam/blob/master/sdks/python/apache_beam/examples/snippets/transforms/elementwise/map_test.py tag:plants %}``` - -{% include buttons-code-snippet.md - py="sdks/python/apache_beam/examples/snippets/transforms/elementwise/map.py" - notebook="examples/notebooks/documentation/transforms/python/elementwise/map" -%} - -### Example 7: Map with side inputs as iterators - -If the `PCollection` has multiple values, pass the `PCollection` as an *iterator*. -This accesses elements lazily as they are needed, -so it is possible to iterate over large `PCollection`s that won't fit into memory. - -```py -{% github_sample /apache/beam/blob/master/sdks/python/apache_beam/examples/snippets/transforms/elementwise/map.py tag:map_side_inputs_iter %}``` - -{:.notebook-skip} -Output `PCollection` after `Map`: - -{:.notebook-skip} -``` -{% github_sample /apache/beam/blob/master/sdks/python/apache_beam/examples/snippets/transforms/elementwise/map_test.py tag:plants %}``` - -{% include buttons-code-snippet.md - py="sdks/python/apache_beam/examples/snippets/transforms/elementwise/map.py" - notebook="examples/notebooks/documentation/transforms/python/elementwise/map" -%} - -> **Note**: You can pass the `PCollection` as a *list* with `beam.pvalue.AsList(pcollection)`, -> but this requires that all the elements fit into memory. - -### Example 8: Map with side inputs as dictionaries - -If a `PCollection` is small enough to fit into memory, then that `PCollection` can be passed as a *dictionary*. -Each element must be a `(key, value)` pair. -Note that all the elements of the `PCollection` must fit into memory for this. -If the `PCollection` won't fit into memory, use `beam.pvalue.AsIter(pcollection)` instead. - -```py -{% github_sample /apache/beam/blob/master/sdks/python/apache_beam/examples/snippets/transforms/elementwise/map.py tag:map_side_inputs_dict %}``` +In machine learning, it is a common task to split data into +[training and a testing datasets](https://en.wikipedia.org/wiki/Training,_validation,_and_test_sets). +Typically, 80% of the data is used for training a model and 20% is used for testing. + +In this example, we split a `PCollection` dataset into training and testing datasets. +We define `split_dataset`, which takes the `plant` element, `num_partitions`, +and an additional argument `ratio`. +The `ratio` is a list of numbers which represents the ratio of how many items will go into each partition. +`num_partitions` is used by `Partitions` as a positional argument, +while `plant` and `ratio` are passed to `split_dataset`. + +If we want an 80%/20% split, we can specify a ratio of `[8, 2]`, which means that for every 10 elements, +8 go into the first partition and 2 go into the second. +In order to determine which partition to send each element, we have different buckets. +For our case `[8, 2]` has **10** buckets, +where the first 8 buckets represent the first partition and the last 2 buckets represent the second partition. + +First, we check that the ratio list's length corresponds to the `num_partitions` we pass. +We then get a bucket index for each element, in the range from 0 to 9 (`num_buckets-1`). +We could do `hash(element) % len(ratio)`, but instead we sum all the ASCII characters of the +JSON representation to make it deterministic. 
+Finally, we loop through all the elements in the ratio and have a running total to +identify the partition index to which that bucket corresponds. + +This `split_dataset` function is generic enough to support any number of partitions by any ratio. +You might want to adapt the bucket assignment to use a more appropriate or randomized hash for your dataset. + +{{< highlight py >}} + +{{< /highlight >}} -{:.notebook-skip} -Output `PCollection` after `Map`: +{{< paragraph class="notebook-skip" >}} +Output `PCollection`s: +{{< /paragraph >}} -{:.notebook-skip} -``` -{% github_sample /apache/beam/blob/master/sdks/python/apache_beam/examples/snippets/transforms/elementwise/map_test.py tag:plant_details %}``` +{{< highlight class="notebook-skip" >}} + +{{< /highlight >}} -{% include buttons-code-snippet.md - py="sdks/python/apache_beam/examples/snippets/transforms/elementwise/map.py" - notebook="examples/notebooks/documentation/transforms/python/elementwise/map" -%} +{{< buttons-code-snippet + py="sdks/python/apache_beam/examples/snippets/transforms/elementwise/partition.py" + notebook="examples/notebooks/documentation/transforms/python/elementwise/partition" >}} ## Related transforms -* [FlatMap]({{ site.baseurl }}/documentation/transforms/python/elementwise/flatmap) behaves the same as `Map`, but for - each input it may produce zero or more outputs. -* [Filter]({{ site.baseurl }}/documentation/transforms/python/elementwise/filter) is useful if the function is just +* [Filter](/documentation/transforms/python/elementwise/filter) is useful if the function is just deciding whether to output an element or not. -* [ParDo]({{ site.baseurl }}/documentation/transforms/python/elementwise/pardo) is the most general elementwise mapping +* [ParDo](/documentation/transforms/python/elementwise/pardo) is the most general elementwise mapping operation, and includes other abilities such as multiple output collections and side-inputs. +* [CoGroupByKey](/documentation/transforms/python/aggregation/cogroupbykey) +performs a per-key equijoin. -{% include button-pydoc.md path="apache_beam.transforms.core" class="Map" %} +{{< button-pydoc path="apache_beam.transforms.core" class="Partition" >}} diff --git a/website/www/site/content/en/documentation/transforms/python/elementwise/pardo.md b/website/www/site/content/en/documentation/transforms/python/elementwise/pardo.md index 40315a877e383..de865b79dca50 100644 --- a/website/www/site/content/en/documentation/transforms/python/elementwise/pardo.md +++ b/website/www/site/content/en/documentation/transforms/python/elementwise/pardo.md @@ -1,8 +1,5 @@ --- -layout: section -title: "ParDo" -permalink: /documentation/transforms/python/elementwise/pardo/ -section_menu: section-menu/documentation.html +title: "Partition" --- -# ParDo +# Partition - +{{< localstorage language language-py >}} -{% include button-pydoc.md path="apache_beam.transforms.core" class="ParDo" %} +{{< button-pydoc path="apache_beam.transforms.core" class="Partition" >}} -A transform for generic parallel processing. -A `ParDo` transform considers each element in the input `PCollection`, -performs some processing function (your user code) on that element, -and emits zero or more elements to an output `PCollection`. +Separates elements in a collection into multiple output +collections. The partitioning function contains the logic that determines how +to separate the elements of the input collection into each resulting +partition output collection. 
-See more information in the -[Beam Programming Guide]({{ site.baseurl }}/documentation/programming-guide/#pardo). +The number of partitions must be determined at graph construction time. +You cannot determine the number of partitions in mid-pipeline + +See more information in the [Beam Programming Guide](/documentation/programming-guide/#partition). ## Examples -In the following examples, we explore how to create custom `DoFn`s and access -the timestamp and windowing information. +In the following examples, we create a pipeline with a `PCollection` of produce with their icon, name, and duration. +Then, we apply `Partition` in multiple ways to split the `PCollection` into multiple `PCollections`. -### Example 1: ParDo with a simple DoFn +`Partition` accepts a function that receives the number of partitions, +and returns the index of the desired partition for the element. +The number of partitions passed must be a positive integer, +and it must return an integer in the range `0` to `num_partitions-1`. -The following example defines a simple `DoFn` class called `SplitWords` -which stores the `delimiter` as an object field. -The `process` method is called once per element, -and it can yield zero or more output elements. +### Example 1: Partition with a function -```py -{% github_sample /apache/beam/blob/master/sdks/python/apache_beam/examples/snippets/transforms/elementwise/pardo.py tag:pardo_dofn %}``` +In the following example, we have a known list of durations. +We partition the `PCollection` into one `PCollection` for every duration type. -{:.notebook-skip} -Output `PCollection` after `ParDo`: +{{< highlight py >}} + +{{< /highlight >}} -{:.notebook-skip} -``` -{% github_sample /apache/beam/blob/master/sdks/python/apache_beam/examples/snippets/transforms/elementwise/pardo_test.py tag:plants %}``` +{{< paragraph class="notebook-skip" >}} +Output `PCollection`s: +{{< /paragraph >}} -{% include buttons-code-snippet.md - py="sdks/python/apache_beam/examples/snippets/transforms/elementwise/pardo.py" - notebook="examples/notebooks/documentation/transforms/python/elementwise/pardo" -%} +{{< highlight class="notebook-skip" >}} + +{{< /highlight >}} -### Example 2: ParDo with timestamp and window information +{{< buttons-code-snippet + py="sdks/python/apache_beam/examples/snippets/transforms/elementwise/partition.py" + notebook="examples/notebooks/documentation/transforms/python/elementwise/partition" >}} -In this example, we add new parameters to the `process` method to bind parameter values at runtime. +### Example 2: Partition with a lambda function -* [`beam.DoFn.TimestampParam`](https://beam.apache.org/releases/pydoc/current/apache_beam.transforms.core.html#apache_beam.transforms.core.DoFn.TimestampParam) - binds the timestamp information as an - [`apache_beam.utils.timestamp.Timestamp`](https://beam.apache.org/releases/pydoc/current/apache_beam.utils.timestamp.html#apache_beam.utils.timestamp.Timestamp) - object. -* [`beam.DoFn.WindowParam`](https://beam.apache.org/releases/pydoc/current/apache_beam.transforms.core.html#apache_beam.transforms.core.DoFn.WindowParam) - binds the window information as the appropriate - [`apache_beam.transforms.window.*Window`](https://beam.apache.org/releases/pydoc/current/apache_beam.transforms.window.html) - object. +We can also use lambda functions to simplify **Example 1**. 
-```py -{% github_sample /apache/beam/blob/master/sdks/python/apache_beam/examples/snippets/transforms/elementwise/pardo.py tag:pardo_dofn_params %}``` +{{< highlight py >}} + +{{< /highlight >}} -{:.notebook-skip} -`stdout` output: +{{< paragraph class="notebook-skip" >}} +Output `PCollection`s: +{{< /paragraph >}} -{:.notebook-skip} -``` -{% github_sample /apache/beam/blob/master/sdks/python/apache_beam/examples/snippets/transforms/elementwise/pardo_test.py tag:dofn_params %}``` +{{< highlight class="notebook-skip" >}} + +{{< /highlight >}} -{% include buttons-code-snippet.md - py="sdks/python/apache_beam/examples/snippets/transforms/elementwise/pardo.py" - notebook="examples/notebooks/documentation/transforms/python/elementwise/pardo" -%} +{{< buttons-code-snippet + py="sdks/python/apache_beam/examples/snippets/transforms/elementwise/partition.py" + notebook="examples/notebooks/documentation/transforms/python/elementwise/partition" >}} -### Example 3: ParDo with DoFn methods +### Example 3: Partition with multiple arguments -A [`DoFn`](https://beam.apache.org/releases/pydoc/current/apache_beam.transforms.core.html#apache_beam.transforms.core.DoFn) -can be customized with a number of methods that can help create more complex behaviors. -You can customize what a worker does when it starts and shuts down with `setup` and `teardown`. -You can also customize what to do when a -[*bundle of elements*](https://beam.apache.org/documentation/runtime/model/#bundling-and-persistence) -starts and finishes with `start_bundle` and `finish_bundle`. +You can pass functions with multiple arguments to `Partition`. +They are passed as additional positional arguments or keyword arguments to the function. -* [`DoFn.setup()`](https://beam.apache.org/releases/pydoc/current/apache_beam.transforms.core.html#apache_beam.transforms.core.DoFn.setup): - Called *once per `DoFn` instance* when the `DoFn` instance is initialized. - `setup` need not to be cached, so it could be called more than once per worker. - This is a good place to connect to database instances, open network connections or other resources. +In machine learning, it is a common task to split data into +[training and a testing datasets](https://en.wikipedia.org/wiki/Training,_validation,_and_test_sets). +Typically, 80% of the data is used for training a model and 20% is used for testing. -* [`DoFn.start_bundle()`](https://beam.apache.org/releases/pydoc/current/apache_beam.transforms.core.html#apache_beam.transforms.core.DoFn.start_bundle): - Called *once per bundle of elements* before calling `process` on the first element of the bundle. - This is a good place to start keeping track of the bundle elements. +In this example, we split a `PCollection` dataset into training and testing datasets. +We define `split_dataset`, which takes the `plant` element, `num_partitions`, +and an additional argument `ratio`. +The `ratio` is a list of numbers which represents the ratio of how many items will go into each partition. +`num_partitions` is used by `Partitions` as a positional argument, +while `plant` and `ratio` are passed to `split_dataset`. -* [**`DoFn.process(element, *args, **kwargs)`**](https://beam.apache.org/releases/pydoc/current/apache_beam.transforms.core.html#apache_beam.transforms.core.DoFn.process): - Called *once per element*, can *yield zero or more elements*. 
- Additional `*args` or `**kwargs` can be passed through - [`beam.ParDo()`](https://beam.apache.org/releases/pydoc/current/apache_beam.transforms.core.html#apache_beam.transforms.core.ParDo). - **[required]** +If we want an 80%/20% split, we can specify a ratio of `[8, 2]`, which means that for every 10 elements, +8 go into the first partition and 2 go into the second. +In order to determine which partition to send each element, we have different buckets. +For our case `[8, 2]` has **10** buckets, +where the first 8 buckets represent the first partition and the last 2 buckets represent the second partition. -* [`DoFn.finish_bundle()`](https://beam.apache.org/releases/pydoc/current/apache_beam.transforms.core.html#apache_beam.transforms.core.DoFn.finish_bundle): - Called *once per bundle of elements* after calling `process` after the last element of the bundle, - can *yield zero or more elements*. This is a good place to do batch calls on a bundle of elements, - such as running a database query. +First, we check that the ratio list's length corresponds to the `num_partitions` we pass. +We then get a bucket index for each element, in the range from 0 to 9 (`num_buckets-1`). +We could do `hash(element) % len(ratio)`, but instead we sum all the ASCII characters of the +JSON representation to make it deterministic. +Finally, we loop through all the elements in the ratio and have a running total to +identify the partition index to which that bucket corresponds. - For example, you can initialize a batch in `start_bundle`, - add elements to the batch in `process` instead of yielding them, - then running a batch query on those elements on `finish_bundle`, and yielding all the results. +This `split_dataset` function is generic enough to support any number of partitions by any ratio. +You might want to adapt the bucket assignment to use a more appropriate or randomized hash for your dataset. - Note that yielded elements from `finish_bundle` must be of the type - [`apache_beam.utils.windowed_value.WindowedValue`](https://github.com/apache/beam/blob/master/sdks/python/apache_beam/utils/windowed_value.py). - You need to provide a timestamp as a unix timestamp, which you can get from the last processed element. - You also need to provide a window, which you can get from the last processed element like in the example below. +{{< highlight py >}} + +{{< /highlight >}} -* [`DoFn.teardown()`](https://beam.apache.org/releases/pydoc/current/apache_beam.transforms.core.html#apache_beam.transforms.core.DoFn.teardown): - Called *once (as a best effort) per `DoFn` instance* when the `DoFn` instance is shutting down. - This is a good place to close database instances, close network connections or other resources. +{{< paragraph class="notebook-skip" >}} +Output `PCollection`s: +{{< /paragraph >}} - Note that `teardown` is called as a *best effort* and is *not guaranteed*. - For example, if the worker crashes, `teardown` might not be called. 
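The `DoFn` lifecycle described above lends itself to a batching pattern; here is a minimal sketch under stated assumptions (the resource handle and the batch flush are placeholders, not a specific client library):

```py
import apache_beam as beam

class BatchingDoFn(beam.DoFn):
  def setup(self):
    # Called once per DoFn instance: acquire reusable resources here,
    # e.g. a database client or network connection (placeholder object).
    self.client = object()

  def start_bundle(self):
    # Called once per bundle, before the first element: start a fresh batch.
    self.batch = []

  def process(self, element):
    # Called once per element: collect instead of yielding, to batch work.
    self.batch.append(element)

  def finish_bundle(self):
    # Called once per bundle, after the last element: flush the whole batch,
    # e.g. as a single query against the client acquired in setup().
    print('flushing %d elements' % len(self.batch))
    self.batch = []

  def teardown(self):
    # Called (best effort) when the instance shuts down: release resources.
    self.client = None

with beam.Pipeline() as pipeline:
  (pipeline
   | beam.Create(['🍓Strawberry', '🥕Carrot', '🍆Eggplant'])  # assumed data
   | beam.ParDo(BatchingDoFn()))
```

To keep the sketch short, `finish_bundle` only prints; yielding results from it would additionally require wrapping them in `WindowedValue`, as noted above.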
+{{< highlight class="notebook-skip" >}} + +{{< /highlight >}} -```py -{% github_sample /apache/beam/blob/master/sdks/python/apache_beam/examples/snippets/transforms/elementwise/pardo.py tag:pardo_dofn_methods %}``` - -{:.notebook-skip} -`stdout` output: - -{:.notebook-skip} -``` -{% github_sample /apache/beam/blob/master/sdks/python/apache_beam/examples/snippets/transforms/elementwise/pardo_test.py tag:results %}``` - -{% include buttons-code-snippet.md - py="sdks/python/apache_beam/examples/snippets/transforms/elementwise/pardo.py" - notebook="examples/notebooks/documentation/transforms/python/elementwise/pardo" -%} - -> *Known issues:* -> -> * [[BEAM-7885]](https://issues.apache.org/jira/browse/BEAM-7885) -> `DoFn.setup()` doesn't run for streaming jobs running in the `DirectRunner`. -> * [[BEAM-7340]](https://issues.apache.org/jira/browse/BEAM-7340) -> `DoFn.teardown()` metrics are lost. +{{< buttons-code-snippet + py="sdks/python/apache_beam/examples/snippets/transforms/elementwise/partition.py" + notebook="examples/notebooks/documentation/transforms/python/elementwise/partition" >}} ## Related transforms -* [Map]({{ site.baseurl }}/documentation/transforms/python/elementwise/map) behaves the same, but produces exactly one output for each input. -* [FlatMap]({{ site.baseurl }}/documentation/transforms/python/elementwise/flatmap) behaves the same as `Map`, - but for each input it may produce zero or more outputs. -* [Filter]({{ site.baseurl }}/documentation/transforms/python/elementwise/filter) is useful if the function is just +* [Filter](/documentation/transforms/python/elementwise/filter) is useful if the function is just deciding whether to output an element or not. +* [ParDo](/documentation/transforms/python/elementwise/pardo) is the most general elementwise mapping + operation, and includes other abilities such as multiple output collections and side-inputs. +* [CoGroupByKey](/documentation/transforms/python/aggregation/cogroupbykey) +performs a per-key equijoin. -{% include button-pydoc.md path="apache_beam.transforms.core" class="ParDo" %} +{{< button-pydoc path="apache_beam.transforms.core" class="Partition" >}} diff --git a/website/www/site/content/en/documentation/transforms/python/elementwise/partition.md b/website/www/site/content/en/documentation/transforms/python/elementwise/partition.md index 00949c346ff6e..de865b79dca50 100644 --- a/website/www/site/content/en/documentation/transforms/python/elementwise/partition.md +++ b/website/www/site/content/en/documentation/transforms/python/elementwise/partition.md @@ -1,8 +1,5 @@ --- -layout: section title: "Partition" -permalink: /documentation/transforms/python/elementwise/partition/ -section_menu: section-menu/documentation.html --- +{{< /highlight >}} -{:.notebook-skip} +{{< paragraph class="notebook-skip" >}} Output `PCollection`s: +{{< /paragraph >}} -{:.notebook-skip} -``` -{% github_sample /apache/beam/blob/master/sdks/python/apache_beam/examples/snippets/transforms/elementwise/partition_test.py tag:partitions %}``` +{{< highlight class="notebook-skip" >}} + +{{< /highlight >}} -{% include buttons-code-snippet.md +{{< buttons-code-snippet py="sdks/python/apache_beam/examples/snippets/transforms/elementwise/partition.py" - notebook="examples/notebooks/documentation/transforms/python/elementwise/partition" -%} + notebook="examples/notebooks/documentation/transforms/python/elementwise/partition" >}} ### Example 2: Partition with a lambda function We can also use lambda functions to simplify **Example 1**. 
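A sketch of the lambda form of the partition from Example 1 (the `durations` list and the produce dictionaries are assumed sample data, not the embedded snippet):

```py
import apache_beam as beam

durations = ['annual', 'biennial', 'perennial']

with beam.Pipeline() as pipeline:
  annuals, biennials, perennials = (
      pipeline
      | 'Gardening plants' >> beam.Create([  # assumed sample data
          {'icon': '🍓', 'name': 'Strawberry', 'duration': 'perennial'},
          {'icon': '🥕', 'name': 'Carrot', 'duration': 'biennial'},
          {'icon': '🍅', 'name': 'Tomato', 'duration': 'annual'},
      ])
      # The lambda plays the same role as the named function in Example 1.
      | 'Partition by duration' >> beam.Partition(
          lambda plant, num_partitions: durations.index(plant['duration']),
          len(durations)))

  annuals | 'Print annuals' >> beam.Map(print)
```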
-```py -{% github_sample /apache/beam/blob/master/sdks/python/apache_beam/examples/snippets/transforms/elementwise/partition.py tag:partition_lambda %}``` +{{< highlight py >}} + +{{< /highlight >}} -{:.notebook-skip} +{{< paragraph class="notebook-skip" >}} Output `PCollection`s: +{{< /paragraph >}} -{:.notebook-skip} -``` -{% github_sample /apache/beam/blob/master/sdks/python/apache_beam/examples/snippets/transforms/elementwise/partition_test.py tag:partitions %}``` +{{< highlight class="notebook-skip" >}} + +{{< /highlight >}} -{% include buttons-code-snippet.md +{{< buttons-code-snippet py="sdks/python/apache_beam/examples/snippets/transforms/elementwise/partition.py" - notebook="examples/notebooks/documentation/transforms/python/elementwise/partition" -%} + notebook="examples/notebooks/documentation/transforms/python/elementwise/partition" >}} ### Example 3: Partition with multiple arguments @@ -117,28 +122,33 @@ identify the partition index to which that bucket corresponds. This `split_dataset` function is generic enough to support any number of partitions by any ratio. You might want to adapt the bucket assignment to use a more appropriate or randomized hash for your dataset. -```py -{% github_sample /apache/beam/blob/master/sdks/python/apache_beam/examples/snippets/transforms/elementwise/partition.py tag:partition_multiple_arguments %}``` +{{< highlight py >}} + +{{< /highlight >}} -{:.notebook-skip} +{{< paragraph class="notebook-skip" >}} Output `PCollection`s: +{{< /paragraph >}} -{:.notebook-skip} -``` -{% github_sample /apache/beam/blob/master/sdks/python/apache_beam/examples/snippets/transforms/elementwise/partition_test.py tag:train_test %}``` +{{< highlight class="notebook-skip" >}} + +{{< /highlight >}} -{% include buttons-code-snippet.md +{{< buttons-code-snippet py="sdks/python/apache_beam/examples/snippets/transforms/elementwise/partition.py" - notebook="examples/notebooks/documentation/transforms/python/elementwise/partition" -%} + notebook="examples/notebooks/documentation/transforms/python/elementwise/partition" >}} ## Related transforms -* [Filter]({{ site.baseurl }}/documentation/transforms/python/elementwise/filter) is useful if the function is just +* [Filter](/documentation/transforms/python/elementwise/filter) is useful if the function is just deciding whether to output an element or not. -* [ParDo]({{ site.baseurl }}/documentation/transforms/python/elementwise/pardo) is the most general elementwise mapping +* [ParDo](/documentation/transforms/python/elementwise/pardo) is the most general elementwise mapping operation, and includes other abilities such as multiple output collections and side-inputs. -* [CoGroupByKey]({{ site.baseurl }}/documentation/transforms/python/aggregation/cogroupbykey) +* [CoGroupByKey](/documentation/transforms/python/aggregation/cogroupbykey) performs a per-key equijoin. 
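For reference, a sketch of the `split_dataset` function reconstructed from the description in Example 3 above; it may differ in detail from the snippet the page actually embeds, and the sample elements are assumptions:

```py
import json

import apache_beam as beam

def split_dataset(plant, num_partitions, ratio):
  # The ratio list must provide one entry per partition.
  assert num_partitions == len(ratio)
  # Deterministic bucket: sum of the character codes of the JSON form,
  # modulo the total number of buckets (e.g. 10 buckets for ratio [8, 2]).
  bucket = sum(ord(ch) for ch in json.dumps(plant)) % sum(ratio)
  # Walk the ratio with a running total to map the bucket to a partition index.
  total = 0
  for i, part in enumerate(ratio):
    total += part
    if bucket < total:
      return i
  return len(ratio) - 1

with beam.Pipeline() as pipeline:
  train_dataset, test_dataset = (
      pipeline
      | beam.Create([  # assumed sample data
          {'name': 'Strawberry', 'duration': 'perennial'},
          {'name': 'Carrot', 'duration': 'biennial'},
          {'name': 'Tomato', 'duration': 'annual'},
      ])
      | 'Train/test split' >> beam.Partition(split_dataset, 2, ratio=[8, 2]))
```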
-{% include button-pydoc.md path="apache_beam.transforms.core" class="Partition" %} +{{< button-pydoc path="apache_beam.transforms.core" class="Partition" >}} diff --git a/website/www/site/content/en/documentation/transforms/python/elementwise/regex.md b/website/www/site/content/en/documentation/transforms/python/elementwise/regex.md index e8c1df9ce7fea..e2f228f5ecfc5 100644 --- a/website/www/site/content/en/documentation/transforms/python/elementwise/regex.md +++ b/website/www/site/content/en/documentation/transforms/python/elementwise/regex.md @@ -1,8 +1,5 @@ --- -layout: section title: "Regex" -permalink: /documentation/transforms/python/elementwise/regex/ -section_menu: section-menu/documentation.html --- +{{< /highlight >}} -{:.notebook-skip} +{{< paragraph class="notebook-skip" >}} Output `PCollection` after `Regex.matches`: +{{< /paragraph >}} -{:.notebook-skip} -``` -{% github_sample /apache/beam/blob/master/sdks/python/apache_beam/examples/snippets/transforms/elementwise/regex_test.py tag:plants_matches %}``` +{{< highlight class="notebook-skip" >}} + +{{< /highlight >}} -{% include buttons-code-snippet.md +{{< buttons-code-snippet py="sdks/python/apache_beam/examples/snippets/transforms/elementwise/regex.py" - notebook="examples/notebooks/documentation/transforms/python/elementwise/regex" -%} + notebook="examples/notebooks/documentation/transforms/python/elementwise/regex" >}} ### Example 2: Regex match with all groups @@ -94,20 +94,25 @@ To match until the end of the string, add `'$'` at the end of the regular expres To start matching at any point instead of the beginning of the string, use [`Regex.find_all(regex, group=Regex.ALL, outputEmpty=False)`](#example-5-regex-find-all). -```py -{% github_sample /apache/beam/blob/master/sdks/python/apache_beam/examples/snippets/transforms/elementwise/regex.py tag:regex_all_matches %}``` +{{< highlight py >}} + +{{< /highlight >}} -{:.notebook-skip} +{{< paragraph class="notebook-skip" >}} Output `PCollection` after `Regex.all_matches`: +{{< /paragraph >}} -{:.notebook-skip} -``` -{% github_sample /apache/beam/blob/master/sdks/python/apache_beam/examples/snippets/transforms/elementwise/regex_test.py tag:plants_all_matches %}``` +{{< highlight class="notebook-skip" >}} + +{{< /highlight >}} -{% include buttons-code-snippet.md +{{< buttons-code-snippet py="sdks/python/apache_beam/examples/snippets/transforms/elementwise/regex.py" - notebook="examples/notebooks/documentation/transforms/python/elementwise/regex" -%} + notebook="examples/notebooks/documentation/transforms/python/elementwise/regex" >}} ### Example 3: Regex match into key-value pairs @@ -123,20 +128,25 @@ To match until the end of the string, add `'$'` at the end of the regular expres To start matching at any point instead of the beginning of the string, use [`Regex.find_kv(regex, keyGroup)`](#example-6-regex-find-as-key-value-pairs). 
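A minimal sketch of `Regex.matches_kv` using positional group numbers; the pattern and sample strings are assumptions chosen to illustrate the key and value groups:

```py
import apache_beam as beam

with beam.Pipeline() as pipeline:
  (pipeline
   | beam.Create([  # assumed sample data
       '🍓, Strawberry, perennial',
       '🥕, Carrot, biennial',
       '🍆, Eggplant, perennial',
   ])
   # Group 1 (the icon) becomes the key and group 2 (the name) the value;
   # matching starts at the beginning of each element.
   | beam.Regex.matches_kv(r'([^\s,]+), *(\w+), *(\w+)', 1, 2)
   | beam.Map(print))
```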
-```py -{% github_sample /apache/beam/blob/master/sdks/python/apache_beam/examples/snippets/transforms/elementwise/regex.py tag:regex_matches_kv %}``` +{{< highlight py >}} + +{{< /highlight >}} -{:.notebook-skip} +{{< paragraph class="notebook-skip" >}} Output `PCollection` after `Regex.matches_kv`: +{{< /paragraph >}} -{:.notebook-skip} -``` -{% github_sample /apache/beam/blob/master/sdks/python/apache_beam/examples/snippets/transforms/elementwise/regex_test.py tag:plants_matches_kv %}``` +{{< highlight class="notebook-skip" >}} + +{{< /highlight >}} -{% include buttons-code-snippet.md +{{< buttons-code-snippet py="sdks/python/apache_beam/examples/snippets/transforms/elementwise/regex.py" - notebook="examples/notebooks/documentation/transforms/python/elementwise/regex" -%} + notebook="examples/notebooks/documentation/transforms/python/elementwise/regex" >}} ### Example 4: Regex find @@ -152,20 +162,25 @@ To match until the end of the string, add `'$'` at the end of the regular expres If you need to match from the start only, consider using [`Regex.matches(regex)`](#example-1-regex-match). -```py -{% github_sample /apache/beam/blob/master/sdks/python/apache_beam/examples/snippets/transforms/elementwise/regex.py tag:regex_find %}``` +{{< highlight py >}} + +{{< /highlight >}} -{:.notebook-skip} +{{< paragraph class="notebook-skip" >}} Output `PCollection` after `Regex.find`: +{{< /paragraph >}} -{:.notebook-skip} -``` -{% github_sample /apache/beam/blob/master/sdks/python/apache_beam/examples/snippets/transforms/elementwise/regex_test.py tag:plants_matches %}``` +{{< highlight class="notebook-skip" >}} + +{{< /highlight >}} -{% include buttons-code-snippet.md +{{< buttons-code-snippet py="sdks/python/apache_beam/examples/snippets/transforms/elementwise/regex.py" - notebook="examples/notebooks/documentation/transforms/python/elementwise/regex" -%} + notebook="examples/notebooks/documentation/transforms/python/elementwise/regex" >}} ### Example 5: Regex find all @@ -181,20 +196,25 @@ To match until the end of the string, add `'$'` at the end of the regular expres If you need to match all groups from the start only, consider using [`Regex.all_matches(regex)`](#example-2-regex-match-with-all-groups). -```py -{% github_sample /apache/beam/blob/master/sdks/python/apache_beam/examples/snippets/transforms/elementwise/regex.py tag:regex_find_all %}``` +{{< highlight py >}} + +{{< /highlight >}} -{:.notebook-skip} +{{< paragraph class="notebook-skip" >}} Output `PCollection` after `Regex.find_all`: +{{< /paragraph >}} -{:.notebook-skip} -``` -{% github_sample /apache/beam/blob/master/sdks/python/apache_beam/examples/snippets/transforms/elementwise/regex_test.py tag:plants_find_all %}``` +{{< highlight class="notebook-skip" >}} + +{{< /highlight >}} -{% include buttons-code-snippet.md +{{< buttons-code-snippet py="sdks/python/apache_beam/examples/snippets/transforms/elementwise/regex.py" - notebook="examples/notebooks/documentation/transforms/python/elementwise/regex" -%} + notebook="examples/notebooks/documentation/transforms/python/elementwise/regex" >}} ### Example 6: Regex find as key-value pairs @@ -211,20 +231,25 @@ To match until the end of the string, add `'$'` at the end of the regular expres If you need to match as key-value pairs from the start only, consider using [`Regex.matches_kv(regex)`](#example-3-regex-match-into-key-value-pairs). 
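And a short sketch of `Regex.find_kv`, which, unlike `Regex.matches_kv`, may start matching anywhere in the element (the pattern and input are assumptions):

```py
import apache_beam as beam

with beam.Pipeline() as pipeline:
  (pipeline
   | beam.Create(['# 🍓, Strawberry, perennial'])  # assumed sample data
   # The leading '# ' is skipped because find_kv searches within the element.
   | beam.Regex.find_kv(r'([^\s,]+), *(\w+)', 1, 2)
   | beam.Map(print))
```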
-```py -{% github_sample /apache/beam/blob/master/sdks/python/apache_beam/examples/snippets/transforms/elementwise/regex.py tag:regex_find_kv %}``` +{{< highlight py >}} + +{{< /highlight >}} -{:.notebook-skip} +{{< paragraph class="notebook-skip" >}} Output `PCollection` after `Regex.find_kv`: +{{< /paragraph >}} -{:.notebook-skip} -``` -{% github_sample /apache/beam/blob/master/sdks/python/apache_beam/examples/snippets/transforms/elementwise/regex_test.py tag:plants_find_kv %}``` +{{< highlight class="notebook-skip" >}} + +{{< /highlight >}} -{% include buttons-code-snippet.md +{{< buttons-code-snippet py="sdks/python/apache_beam/examples/snippets/transforms/elementwise/regex.py" - notebook="examples/notebooks/documentation/transforms/python/elementwise/regex" -%} + notebook="examples/notebooks/documentation/transforms/python/elementwise/regex" >}} ### Example 7: Regex replace all @@ -233,20 +258,25 @@ You can also use [backreferences](https://docs.python.org/3/library/re.html?highlight=backreference#re.sub) on the `replacement`. -```py -{% github_sample /apache/beam/blob/master/sdks/python/apache_beam/examples/snippets/transforms/elementwise/regex.py tag:regex_replace_all %}``` +{{< highlight py >}} + +{{< /highlight >}} -{:.notebook-skip} +{{< paragraph class="notebook-skip" >}} Output `PCollection` after `Regex.replace_all`: +{{< /paragraph >}} -{:.notebook-skip} -``` -{% github_sample /apache/beam/blob/master/sdks/python/apache_beam/examples/snippets/transforms/elementwise/regex_test.py tag:plants_replace_all %}``` +{{< highlight class="notebook-skip" >}} + +{{< /highlight >}} -{% include buttons-code-snippet.md +{{< buttons-code-snippet py="sdks/python/apache_beam/examples/snippets/transforms/elementwise/regex.py" - notebook="examples/notebooks/documentation/transforms/python/elementwise/regex" -%} + notebook="examples/notebooks/documentation/transforms/python/elementwise/regex" >}} ### Example 8: Regex replace first @@ -255,45 +285,55 @@ You can also use [backreferences](https://docs.python.org/3/library/re.html?highlight=backreference#re.sub) on the `replacement`. -```py -{% github_sample /apache/beam/blob/master/sdks/python/apache_beam/examples/snippets/transforms/elementwise/regex.py tag:regex_replace_first %}``` +{{< highlight py >}} + +{{< /highlight >}} -{:.notebook-skip} +{{< paragraph class="notebook-skip" >}} Output `PCollection` after `Regex.replace_first`: +{{< /paragraph >}} -{:.notebook-skip} -``` -{% github_sample /apache/beam/blob/master/sdks/python/apache_beam/examples/snippets/transforms/elementwise/regex_test.py tag:plants_replace_first %}``` +{{< highlight class="notebook-skip" >}} + +{{< /highlight >}} -{% include buttons-code-snippet.md +{{< buttons-code-snippet py="sdks/python/apache_beam/examples/snippets/transforms/elementwise/regex.py" - notebook="examples/notebooks/documentation/transforms/python/elementwise/regex" -%} + notebook="examples/notebooks/documentation/transforms/python/elementwise/regex" >}} ### Example 9: Regex split `Regex.split` returns the list of strings that were delimited by the specified regular expression. The argument `outputEmpty` is set to `False` by default, but can be set to `True` to keep empty items in the output list. 
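A short sketch of `Regex.split` on a comma surrounded by optional whitespace (the input string is an assumption):

```py
import apache_beam as beam

with beam.Pipeline() as pipeline:
  (pipeline
   | beam.Create(['🍓 , Strawberry , perennial'])  # assumed sample data
   # Each element is split into a list of strings; pass outputEmpty=True
   # to keep empty items in that list.
   | beam.Regex.split(r'\s*,\s*')
   | beam.Map(print))
```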
-```py -{% github_sample /apache/beam/blob/master/sdks/python/apache_beam/examples/snippets/transforms/elementwise/regex.py tag:regex_split %}``` +{{< highlight py >}} + +{{< /highlight >}} -{:.notebook-skip} +{{< paragraph class="notebook-skip" >}} Output `PCollection` after `Regex.split`: +{{< /paragraph >}} -{:.notebook-skip} -``` -{% github_sample /apache/beam/blob/master/sdks/python/apache_beam/examples/snippets/transforms/elementwise/regex_test.py tag:plants_split %}``` +{{< highlight class="notebook-skip" >}} + +{{< /highlight >}} -{% include buttons-code-snippet.md +{{< buttons-code-snippet py="sdks/python/apache_beam/examples/snippets/transforms/elementwise/regex.py" - notebook="examples/notebooks/documentation/transforms/python/elementwise/regex" -%} + notebook="examples/notebooks/documentation/transforms/python/elementwise/regex" >}} ## Related transforms -* [FlatMap]({{ site.baseurl }}/documentation/transforms/python/elementwise/flatmap) behaves the same as `Map`, but for +* [FlatMap](/documentation/transforms/python/elementwise/flatmap) behaves the same as `Map`, but for each input it may produce zero or more outputs. -* [Map]({{ site.baseurl }}/documentation/transforms/python/elementwise/map) applies a simple 1-to-1 mapping function over each element in the collection +* [Map](/documentation/transforms/python/elementwise/map) applies a simple 1-to-1 mapping function over each element in the collection -{% include button-pydoc.md path="apache_beam.transforms.util" class="Regex" %} +{{< button-pydoc path="apache_beam.transforms.util" class="Regex" >}} diff --git a/website/www/site/content/en/documentation/transforms/python/elementwise/reify.md b/website/www/site/content/en/documentation/transforms/python/elementwise/reify.md index 2c7e84294ecf2..b28b9b8bbea23 100644 --- a/website/www/site/content/en/documentation/transforms/python/elementwise/reify.md +++ b/website/www/site/content/en/documentation/transforms/python/elementwise/reify.md @@ -1,8 +1,5 @@ --- -layout: section title: "Reify" -permalink: /documentation/transforms/python/elementwise/reify/ -section_menu: section-menu/documentation.html --- +{{< /highlight >}} -{:.notebook-skip} +{{< paragraph class="notebook-skip" >}} Output `PCollection` after `ToString`: +{{< /paragraph >}} -{:.notebook-skip} -``` -{% github_sample /apache/beam/blob/master/sdks/python/apache_beam/examples/snippets/transforms/elementwise/tostring_test.py tag:plants %}``` +{{< highlight class="notebook-skip" >}} + +{{< /highlight >}} -{% include buttons-code-snippet.md +{{< buttons-code-snippet py="sdks/python/apache_beam/examples/snippets/transforms/elementwise/tostring.py" - notebook="examples/notebooks/documentation/transforms/python/elementwise/tostring" -%} + notebook="examples/notebooks/documentation/transforms/python/elementwise/tostring" >}} ### Example 2: Elements to string The following example converts a dictionary into a string. The string output will be equivalent to `str(element)`. 
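As with the other placeholders, the actual `ToString` snippet lives in the repository. A standalone sketch of the element-to-string conversion described above (the dictionaries are invented for illustration) might look like:

```py
import apache_beam as beam

with beam.Pipeline() as pipeline:
    (
        pipeline
        | 'Create plants' >> beam.Create([
            {'name': 'Strawberry', 'duration': 'perennial'},
            {'name': 'Carrot', 'duration': 'biennial'},
        ])
        # Each element is converted with str(element), so each output line
        # looks like the repr of the corresponding dictionary.
        | 'To string' >> beam.ToString.Element()
        | 'Print' >> beam.Map(print)
    )
```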
-```py -{% github_sample /apache/beam/blob/master/sdks/python/apache_beam/examples/snippets/transforms/elementwise/tostring.py tag:tostring_element %}``` +{{< highlight py >}} + +{{< /highlight >}} -{:.notebook-skip} +{{< paragraph class="notebook-skip" >}} Output `PCollection` after `ToString`: +{{< /paragraph >}} -{:.notebook-skip} -``` -{% github_sample /apache/beam/blob/master/sdks/python/apache_beam/examples/snippets/transforms/elementwise/tostring_test.py tag:plant_lists %}``` +{{< highlight class="notebook-skip" >}} + +{{< /highlight >}} -{% include buttons-code-snippet.md +{{< buttons-code-snippet py="sdks/python/apache_beam/examples/snippets/transforms/elementwise/tostring.py" - notebook="examples/notebooks/documentation/transforms/python/elementwise/tostring" -%} + notebook="examples/notebooks/documentation/transforms/python/elementwise/tostring" >}} ### Example 3: Iterables to string @@ -82,23 +87,28 @@ into a string delimited by `','`. You can specify a different delimiter using the `delimiter` argument. The string output will be equivalent to `iterable.join(delimiter)`. -```py -{% github_sample /apache/beam/blob/master/sdks/python/apache_beam/examples/snippets/transforms/elementwise/tostring.py tag:tostring_iterables %}``` +{{< highlight py >}} + +{{< /highlight >}} -{:.notebook-skip} +{{< paragraph class="notebook-skip" >}} Output `PCollection` after `ToString`: +{{< /paragraph >}} -{:.notebook-skip} -``` -{% github_sample /apache/beam/blob/master/sdks/python/apache_beam/examples/snippets/transforms/elementwise/tostring_test.py tag:plants_csv %}``` +{{< highlight class="notebook-skip" >}} + +{{< /highlight >}} -{% include buttons-code-snippet.md +{{< buttons-code-snippet py="sdks/python/apache_beam/examples/snippets/transforms/elementwise/tostring.py" - notebook="examples/notebooks/documentation/transforms/python/elementwise/tostring" -%} + notebook="examples/notebooks/documentation/transforms/python/elementwise/tostring" >}} ## Related transforms -* [Map]({{ site.baseurl }}/documentation/transforms/python/elementwise/map) applies a simple 1-to-1 mapping function over each element in the collection +* [Map](/documentation/transforms/python/elementwise/map) applies a simple 1-to-1 mapping function over each element in the collection -{% include button-pydoc.md path="apache_beam.transforms.util" class="ToString" %} +{{< button-pydoc path="apache_beam.transforms.util" class="ToString" >}} diff --git a/website/www/site/content/en/documentation/transforms/python/elementwise/values.md b/website/www/site/content/en/documentation/transforms/python/elementwise/values.md index ae79578bb4a28..d6e6c2c9c539f 100644 --- a/website/www/site/content/en/documentation/transforms/python/elementwise/values.md +++ b/website/www/site/content/en/documentation/transforms/python/elementwise/values.md @@ -1,8 +1,5 @@ --- -layout: section title: "Values" -permalink: /documentation/transforms/python/elementwise/values/ -section_menu: section-menu/documentation.html --- +{{< /highlight >}} -{:.notebook-skip} +{{< paragraph class="notebook-skip" >}} Output `PCollection` after `Values`: +{{< /paragraph >}} -{:.notebook-skip} -``` -{% github_sample /apache/beam/blob/master/sdks/python/apache_beam/examples/snippets/transforms/elementwise/values_test.py tag:plants %}``` +{{< highlight class="notebook-skip" >}} + +{{< /highlight >}} -{% include buttons-code-snippet.md +{{< buttons-code-snippet py="sdks/python/apache_beam/examples/snippets/transforms/elementwise/values.py" - 
notebook="examples/notebooks/documentation/transforms/python/elementwise/values" -%} + notebook="examples/notebooks/documentation/transforms/python/elementwise/values" >}} ## Related transforms -* [Keys]({{ site.baseurl }}/documentation/transforms/python/elementwise/keys) for extracting the key of each component. -* [KvSwap]({{ site.baseurl }}/documentation/transforms/python/elementwise/kvswap) swaps the key and value of each element. +* [Keys](/documentation/transforms/python/elementwise/keys) for extracting the key of each component. +* [KvSwap](/documentation/transforms/python/elementwise/kvswap) swaps the key and value of each element. -{% include button-pydoc.md path="apache_beam.transforms.util" class="Values" %} +{{< button-pydoc path="apache_beam.transforms.util" class="Values" >}} diff --git a/website/www/site/content/en/documentation/transforms/python/elementwise/withkeys.md b/website/www/site/content/en/documentation/transforms/python/elementwise/withkeys.md index f9244e1364868..fd14abd0daf9d 100644 --- a/website/www/site/content/en/documentation/transforms/python/elementwise/withkeys.md +++ b/website/www/site/content/en/documentation/transforms/python/elementwise/withkeys.md @@ -1,8 +1,5 @@ --- -layout: section title: "WithKeys" -permalink: /documentation/transforms/python/elementwise/withkeys/ -section_menu: section-menu/documentation.html --- +{{< /highlight >}} -{:.notebook-skip} +{{< paragraph class="notebook-skip" >}} Output `PCollection` after getting the timestamps: +{{< /paragraph >}} -{:.notebook-skip} -``` -{% github_sample /apache/beam/blob/master/sdks/python/apache_beam/examples/snippets/transforms/elementwise/withtimestamps_test.py tag:plant_timestamps %}``` +{{< highlight class="notebook-skip" >}} + +{{< /highlight >}} -{% include buttons-code-snippet.md +{{< buttons-code-snippet py="sdks/python/apache_beam/examples/snippets/transforms/elementwise/withtimestamps.py" - notebook="examples/notebooks/documentation/transforms/python/elementwise/withtimestamps" -%} + notebook="examples/notebooks/documentation/transforms/python/elementwise/withtimestamps" >}} To convert from a [`time.struct_time`](https://docs.python.org/3/library/time.html#time.struct_time) @@ -60,16 +60,18 @@ to `unix_time` you can use For more information on time formatting options, see [`time.strftime`](https://docs.python.org/3/library/time.html#time.strftime). -``` -{% github_sample /apache/beam/blob/master/sdks/python/apache_beam/examples/snippets/transforms/elementwise/withtimestamps.py tag:time_tuple2unix_time %}``` + To convert from a [`datetime.datetime`](https://docs.python.org/3/library/datetime.html#datetime.datetime) to `unix_time` you can use convert it to a `time.struct_time` first with [`datetime.timetuple`](https://docs.python.org/3/library/datetime.html#datetime.datetime.timetuple). -``` -{% github_sample /apache/beam/blob/master/sdks/python/apache_beam/examples/snippets/transforms/elementwise/withtimestamps.py tag:datetime2unix_time %}``` + ### Example 2: Timestamp by logical clock @@ -77,20 +79,25 @@ If each element has a chronological number, these numbers can be used as a [logical clock](https://en.wikipedia.org/wiki/Logical_clock). These numbers have to be converted to a *"seconds"* equivalent, which can be especially important depending on your windowing and late data rules. 
-```py -{% github_sample /apache/beam/blob/master/sdks/python/apache_beam/examples/snippets/transforms/elementwise/withtimestamps.py tag:withtimestamps_logical_clock %}``` +{{< highlight py >}} + +{{< /highlight >}} -{:.notebook-skip} +{{< paragraph class="notebook-skip" >}} Output `PCollection` after getting the timestamps: +{{< /paragraph >}} -{:.notebook-skip} -``` -{% github_sample /apache/beam/blob/master/sdks/python/apache_beam/examples/snippets/transforms/elementwise/withtimestamps_test.py tag:plant_events %}``` +{{< highlight class="notebook-skip" >}} + +{{< /highlight >}} -{% include buttons-code-snippet.md +{{< buttons-code-snippet py="sdks/python/apache_beam/examples/snippets/transforms/elementwise/withtimestamps.py" - notebook="examples/notebooks/documentation/transforms/python/elementwise/withtimestamps" -%} + notebook="examples/notebooks/documentation/transforms/python/elementwise/withtimestamps" >}} ### Example 3: Timestamp by processing time @@ -100,21 +107,26 @@ Workers might have time deltas, so using this method is not a reliable way to do By using processing time, there is no way of knowing if data is arriving late because the timestamp is attached when the element *enters* into the pipeline. -```py -{% github_sample /apache/beam/blob/master/sdks/python/apache_beam/examples/snippets/transforms/elementwise/withtimestamps.py tag:withtimestamps_processing_time %}``` +{{< highlight py >}} + +{{< /highlight >}} -{:.notebook-skip} +{{< paragraph class="notebook-skip" >}} Output `PCollection` after getting the timestamps: +{{< /paragraph >}} -{:.notebook-skip} -``` -{% github_sample /apache/beam/blob/master/sdks/python/apache_beam/examples/snippets/transforms/elementwise/withtimestamps_test.py tag:plant_processing_times %}``` +{{< highlight class="notebook-skip" >}} + +{{< /highlight >}} -{% include buttons-code-snippet.md +{{< buttons-code-snippet py="sdks/python/apache_beam/examples/snippets/transforms/elementwise/withtimestamps.py" - notebook="examples/notebooks/documentation/transforms/python/elementwise/withtimestamps" -%} + notebook="examples/notebooks/documentation/transforms/python/elementwise/withtimestamps" >}} ## Related transforms -* [Reify]({{ site.baseurl }}/documentation/transforms/python/elementwise/reify) converts between explicit and implicit forms of Beam values. +* [Reify](/documentation/transforms/python/elementwise/reify) converts between explicit and implicit forms of Beam values. diff --git a/website/www/site/content/en/documentation/transforms/python/other/create.md b/website/www/site/content/en/documentation/transforms/python/other/create.md index d6ce766a43105..911e4ca249797 100644 --- a/website/www/site/content/en/documentation/transforms/python/other/create.md +++ b/website/www/site/content/en/documentation/transforms/python/other/create.md @@ -1,8 +1,5 @@ --- -layout: section title: "Create" -permalink: /documentation/transforms/python/other/create/ -section_menu: section-menu/documentation.html --- - -{{% /classwrapper %}} +{{< highlight java >}} + +{{< /highlight >}} -{{% classwrapper class="language-py" %}} - - - -{{% /classwrapper %}} +%}--> +{{< /highlight >}} `ExtractAndSumScore` is written to be more general, in that you can pass in the field by which you want to group the data (in the case of our game, by unique user or unique team). This means we can re-use `ExtractAndSumScore` in other pipelines that group score data by team, for example. 
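The Java and Python implementations referenced above are pulled from the repository, but the shape of such a reusable transform is easy to sketch. The following is an illustrative Python approximation (field names and scores are invented), not the example's actual code:

```py
import apache_beam as beam

class ExtractAndSumScore(beam.PTransform):
    """Sums scores grouped by whichever field is passed in ('user' or 'team')."""

    def __init__(self, field):
        super().__init__()
        self.field = field

    def expand(self, pcoll):
        return (
            pcoll
            | beam.Map(lambda info: (info[self.field], info['score']))
            | beam.CombinePerKey(sum))

with beam.Pipeline() as pipeline:
    (
        pipeline
        | beam.Create([
            {'user': 'alice', 'team': 'red', 'score': 12},
            {'user': 'bob', 'team': 'red', 'score': 3},
            {'user': 'carol', 'team': 'blue', 'score': 8},
        ])
        | 'Sum per team' >> ExtractAndSumScore('team')
        | beam.Map(print)
    )
```

Passing `'user'` instead of `'team'` reuses the same transform for per-user totals, which is exactly the kind of reuse the paragraph above points at.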
Here's the main method of `UserScore`, showing how we apply all three steps of the pipeline:

-{{% classwrapper class="language-java" %}}
-
-
-
-{{% /classwrapper %}}
+%}-->
+{{< /highlight >}}

-{{% classwrapper class="language-py" %}}
-
-
-
-{{% /classwrapper %}}
+%}-->
+{{< /highlight >}}

### Limitations

@@ -179,13 +147,13 @@ The `HourlyTeamScore` pipeline expands on the basic batch analysis principles us

Like `UserScore`, `HourlyTeamScore` is best thought of as a job to be run periodically after all the relevant data has been gathered (such as once per day). The pipeline reads a fixed data set from a file, and writes the results back to a text file or to a Google Cloud BigQuery table.

-{{% classwrapper class="language-java" wrapper="p" %}}
+{{< paragraph class="language-java" >}}
> **Note:** See [HourlyTeamScore on GitHub](https://github.com/apache/beam/blob/master/examples/java/src/main/java/org/apache/beam/examples/complete/game/HourlyTeamScore.java) for the complete example pipeline program.
-{{% /classwrapper %}}
+{{< /paragraph >}}

-{{% classwrapper class="language-py" wrapper="p" %}}
+{{< paragraph class="language-py" >}}
> **Note:** See [HourlyTeamScore on GitHub](https://github.com/apache/beam/blob/master/sdks/python/apache_beam/examples/complete/game/hourly_team_score.py) for the complete example pipeline program.
-{{% /classwrapper %}}
+{{< /paragraph >}}

### What Does HourlyTeamScore Do?

@@ -214,29 +182,27 @@ Notice that as processing time advances, the sums are now _per window_; each win

Beam's windowing feature uses the [intrinsic timestamp information](/documentation/programming-guide/#element-timestamps) attached to each element of a `PCollection`. Because we want our pipeline to window based on _event time_, we **must first extract the timestamp** that's embedded in each data record and apply it to the corresponding element in the `PCollection` of score data. Then, the pipeline can **apply the windowing function** to divide the `PCollection` into logical windows.

-{{% classwrapper class="language-java" wrapper="p" %}}
+{{< paragraph class="language-java" >}}
`HourlyTeamScore` uses the [WithTimestamps](https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/transforms/WithTimestamps.java) and [Window](https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/transforms/windowing/Window.java) transforms to perform these operations.
+{{< /paragraph >}}

-{{% classwrapper class="language-py" wrapper="p" %}}
+{{< paragraph class="language-py" >}}
`HourlyTeamScore` uses the `FixedWindows` transform, found in [window.py](https://github.com/apache/beam/blob/master/sdks/python/apache_beam/transforms/window.py), to perform these operations.
+{{< /paragraph >}}

The following code shows this:

-{{% classwrapper class="language-java" %}}
-
-
-
-{{% /classwrapper %}}
+%}-->
+{{< /highlight >}}

-{{% classwrapper class="language-py" %}}
-
-
-
-{{% /classwrapper %}}
+%}-->
+{{< /highlight >}}

Notice that the transforms the pipeline uses to specify the windowing are distinct from the actual data processing transforms (such as `ExtractAndSumScores`). This functionality provides you some flexibility in designing your Beam pipeline, in that you can run existing transforms over datasets with different windowing characteristics.
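For readers without the repository snippets to hand, the two steps just described — attach the embedded event-time timestamp, then window into fixed one-hour windows — look roughly like this in Python (the record layout and timestamps are invented for illustration):

```py
import apache_beam as beam

with beam.Pipeline() as pipeline:
    (
        pipeline
        | beam.Create([
            {'team': 'red', 'score': 5, 'timestamp': 1400000000},
            {'team': 'blue', 'score': 7, 'timestamp': 1400003600},
        ])
        # Extract the timestamp embedded in each record and attach it to
        # the element as its event-time timestamp.
        | 'Add timestamps' >> beam.Map(
            lambda elem: beam.window.TimestampedValue(elem, elem['timestamp']))
        # Divide the collection into fixed one-hour event-time windows.
        | 'Window into hours' >> beam.WindowInto(beam.window.FixedWindows(60 * 60))
        | beam.Map(print)
    )
```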
@@ -250,41 +216,33 @@ It also lets the pipeline include relevant **late data**—data events with vali

The following code shows how `HourlyTeamScore` uses the `Filter` transform to filter events that occur either before or after the relevant analysis period:

-{{% classwrapper class="language-java" %}}
-
-
-
-{{% /classwrapper %}}
+%}-->
+{{< /highlight >}}

-{{% classwrapper class="language-py" %}}
-
-
-
-{{% /classwrapper %}}
+%}-->
+{{< /highlight >}}

#### Calculating Score Per Team, Per Window

`HourlyTeamScore` uses the same `ExtractAndSumScores` transform as the `UserScore` pipeline, but passes a different key (team, as opposed to user). Also, because the pipeline applies `ExtractAndSumScores` _after_ applying fixed-time 1-hour windowing to the input data, the data gets grouped by both team _and_ window. You can see the full sequence of transforms in `HourlyTeamScore`'s main method:

-{{% classwrapper class="language-java" %}}
-
-
-
-{{% /classwrapper %}}
+%}-->
+{{< /highlight >}}

-{{% classwrapper class="language-py" %}}
-
-
-
-{{% /classwrapper %}}
+%}-->
+{{< /highlight >}}

### Limitations

@@ -300,13 +258,13 @@ The `LeaderBoard` pipeline also demonstrates how to process game score data with

Because the `LeaderBoard` pipeline reads the game data from an unbounded source as that data is generated, you can think of the pipeline as an ongoing job running concurrently with the game process. `LeaderBoard` can thus provide low-latency insights into how users are playing the game at any given moment — useful if, for example, we want to provide a live web-based scoreboard so that users can track their progress against other users as they play.

-{{% classwrapper class="language-java" wrapper="p" %}}
+{{< paragraph class="language-java" >}}
> **Note:** See [LeaderBoard on GitHub](https://github.com/apache/beam/blob/master/examples/java/src/main/java/org/apache/beam/examples/complete/game/LeaderBoard.java) for the complete example pipeline program.
-{{% /classwrapper %}}
+{{< /paragraph >}}

-{{% classwrapper class="language-py" wrapper="p" %}}
+{{< paragraph class="language-py" >}}
> **Note:** See [LeaderBoard on GitHub](https://github.com/apache/beam/blob/master/sdks/python/apache_beam/examples/complete/game/leader_board.py) for the complete example pipeline program.
-{{% /classwrapper %}}
+{{< /paragraph >}}

### What Does LeaderBoard Do?

@@ -339,21 +297,17 @@ As processing time advances and more scores are processed, the trigger outputs t

The following code example shows how `LeaderBoard` sets the processing time trigger to output the data for user scores:

-{{% classwrapper class="language-java" %}}
-
-
-
-{{% /classwrapper %}}
+%}-->
+{{< /highlight >}}

-{{% classwrapper class="language-py" %}}
-
-
-
-{{% /classwrapper %}}
+%}-->
+{{< /highlight >}}

`LeaderBoard` sets the [window accumulation mode](/documentation/programming-guide/#window-accumulation-modes) to accumulate window panes as the trigger fires. This accumulation mode is set by invoking `.accumulatingFiredPanes` in the Java example, or by passing `accumulation_mode=trigger.AccumulationMode.ACCUMULATING` when setting the trigger in the Python example, and causes the pipeline to accumulate the previously emitted data together with any new data that's arrived since the last trigger fire. This ensures that `LeaderBoard` is a running sum for the user scores, rather than a collection of individual sums.
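A rough Python sketch of that combination — a single global window with a repeated processing-time trigger and accumulating panes — is shown below; the ten-minute delay and the toy scores are illustrative assumptions, not the LeaderBoard code itself:

```py
import apache_beam as beam
from apache_beam.transforms import trigger

with beam.Pipeline() as pipeline:
    (
        pipeline
        | beam.Create([('alice', 5), ('bob', 3), ('alice', 7)])
        # Keep everything in one global window, fire a speculative result
        # every ten minutes of processing time, and accumulate across
        # firings so each output is a running total rather than a delta.
        | beam.WindowInto(
            beam.window.GlobalWindows(),
            trigger=trigger.Repeatedly(trigger.AfterProcessingTime(10 * 60)),
            accumulation_mode=trigger.AccumulationMode.ACCUMULATING)
        | 'Sum per user' >> beam.CombinePerKey(sum)
        | beam.Map(print)
    )
```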
@@ -381,21 +335,17 @@ Data arriving above the solid watermark line is _late data_ — this is a score

The following code example shows how `LeaderBoard` applies fixed-time windowing with the appropriate triggers to have our pipeline perform the calculations we want:

-{{% classwrapper class="language-java" %}}
-
-
+%}-->
+{{< /highlight >}}

-{{% /classwrapper %}}
-
-{{% classwrapper class="language-py" %}}
-
-
-
-{{% /classwrapper %}}
+%}-->
+{{< /highlight >}}

Taken together, these processing strategies let us address the latency and completeness issues present in the `UserScore` and `HourlyTeamScore` pipelines, while still using the same basic transforms to process the data—as a matter of fact, both calculations still use the same `ExtractAndSumScore` transform that we used in both the `UserScore` and `HourlyTeamScore` pipelines.

@@ -405,13 +355,13 @@ While `LeaderBoard` demonstrates how to use basic windowing and triggers to perf

Like `LeaderBoard`, `GameStats` reads data from an unbounded source. It is best thought of as an ongoing job that provides insight into the game as users play.

-{{% classwrapper class="language-java" wrapper="p" %}}
+{{< paragraph class="language-java" >}}
> **Note:** See [GameStats on GitHub](https://github.com/apache/beam/blob/master/examples/java/src/main/java/org/apache/beam/examples/complete/game/GameStats.java) for the complete example pipeline program.
-{{% /classwrapper %}}
+{{< /paragraph >}}

-{{% classwrapper class="language-py" wrapper="p" %}}
+{{< paragraph class="language-py" >}}
> **Note:** See [GameStats on GitHub](https://github.com/apache/beam/blob/master/sdks/python/apache_beam/examples/complete/game/game_stats.py) for the complete example pipeline program.
-{{% /classwrapper %}}
+{{< /paragraph >}}

### What Does GameStats Do?

@@ -432,39 +382,31 @@ Since the average depends on the pipeline data, we need to calculate it, and the

The following code example shows the composite transform that handles abuse detection. The transform uses the `Sum.integersPerKey` transform to sum all scores per user, and then the `Mean.globally` transform to determine the average score for all users. Once that's been calculated (as a `PCollectionView` singleton), we can pass it to the filtering `ParDo` using `.withSideInputs`:

-{{% classwrapper class="language-java" %}}
-
-
-
-{{% /classwrapper %}}
-
-{{% classwrapper class="language-py" %}}
+%}-->
+{{< /highlight >}}

-
-
-{{% /classwrapper %}}
+%}-->
+{{< /highlight >}}

The abuse-detection transform generates a view of users suspected to be spambots. Later in the pipeline, we use that view to filter out any such users when we calculate the team score per hour, again by using the side input mechanism. The following code example shows where we insert the spam filter, between windowing the scores into fixed windows and extracting the team scores:

-{{% classwrapper class="language-java" %}}
-
-
+%}-->
+{{< /highlight >}}

-{{% /classwrapper %}}
-
-{{% classwrapper class="language-py" %}}
-
-
-
-{{% /classwrapper %}}
+%}-->
+{{< /highlight >}}

#### Analyzing Usage Patterns

@@ -482,39 +424,31 @@ between instances are.*

We can use the session-windowed data to determine the average length of uninterrupted play time for all of our users, as well as the total score they achieve during each session.
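The session-windowing step described above can be sketched as follows; the five-minute gap and the toy events are illustrative assumptions rather than the GameStats code:

```py
import apache_beam as beam

with beam.Pipeline() as pipeline:
    (
        pipeline
        | beam.Create([
            ('alice', {'score': 5, 'time': 0}),
            ('alice', {'score': 3, 'time': 120}),
            ('bob', {'score': 7, 'time': 4000}),
        ])
        # Attach each event's timestamp, then window per-user activity into
        # sessions that close after a five-minute gap with no new events.
        | beam.Map(lambda kv: beam.window.TimestampedValue(
            (kv[0], kv[1]['score']), kv[1]['time']))
        | beam.WindowInto(beam.window.Sessions(5 * 60))
        # Summing per key now yields one total per user, per session.
        | beam.CombinePerKey(sum)
        | beam.Map(print)
    )
```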
We can do this in the code by first applying session windows, summing the score per user and session, and then using a transform to calculate the length of each individual session: -{{% classwrapper class="language-java" %}} - - - -{{% /classwrapper %}} - -{{% classwrapper class="language-py" %}} +%}--> +{{< /highlight >}} - - -{{% /classwrapper %}} +%}--> +{{< /highlight >}} This gives us a set of user sessions, each with an attached duration. We can then calculate the _average_ session length by re-windowing the data into fixed time windows, and then calculating the average for all sessions that end in each hour: -{{% classwrapper class="language-java" %}} - - +%}--> +{{< /highlight >}} -{{% /classwrapper %}} - -{{% classwrapper class="language-py" %}} - - - -{{% /classwrapper %}} +%}--> +{{< /highlight >}} We can use the resulting information to find, for example, what times of day our users are playing the longest, or which stretches of the day are more likely to see shorter play sessions. diff --git a/website/www/site/content/en/get-started/quickstart-go/_index.md b/website/www/site/content/en/get-started/quickstart-go/_index.md index f68b16b3758e0..27282db36005d 100644 --- a/website/www/site/content/en/get-started/quickstart-go/_index.md +++ b/website/www/site/content/en/get-started/quickstart-go/_index.md @@ -21,10 +21,7 @@ This Quickstart will walk you through executing your first Beam pipeline to run If you're interested in contributing to the Apache Beam Go codebase, see the [Contribution Guide](/contribute). -- [Set up your environment](#set-up-your-environment) -- [Get the SDK and the examples](#get-the-sdk-and-the-examples) -- [Run WordCount](#run-wordcount) -- [Next Steps](#next-steps) +{{< toc >}} ## Set up your environment diff --git a/website/www/site/content/en/get-started/quickstart-java/_index.md b/website/www/site/content/en/get-started/quickstart-java/_index.md index ea4a0bddf3404..98543c20f8964 100644 --- a/website/www/site/content/en/get-started/quickstart-java/_index.md +++ b/website/www/site/content/en/get-started/quickstart-java/_index.md @@ -25,12 +25,7 @@ This Quickstart will walk you through executing your first Beam pipeline to run If you're interested in contributing to the Apache Beam Java codebase, see the [Contribution Guide](/contribute). 
-- [Set up your Development Environment](#set-up-your-development-environment) -- [Get the WordCount Code](#get-the-wordcount-code) -- [Run WordCount](#run-wordcount) -- [Inspect the results](#inspect-the-results) -- [Next Steps](#next-steps) - +{{< toc >}} ## Set up your Development Environment @@ -43,9 +38,7 @@ If you're interested in contributing to the Apache Beam Java codebase, see the [ The easiest way to get a copy of the WordCount pipeline is to use the following command to generate a simple Maven project that contains Beam's WordCount examples and builds against the most recent Beam release: -{{% classwrapper class="shell-unix" %}} - -``` +{{< highlight class="shell-unix" >}} $ mvn archetype:generate \ -DarchetypeGroupId=org.apache.beam \ -DarchetypeArtifactId=beam-sdks-java-maven-archetypes-examples \ @@ -55,13 +48,9 @@ $ mvn archetype:generate \ -Dversion="0.1" \ -Dpackage=org.apache.beam.examples \ -DinteractiveMode=false -``` - -{{% /classwrapper %}} +{{< /highlight >}} -{{% classwrapper class="shell-PowerShell" %}} - -``` +{{< highlight class="shell-PowerShell" >}} PS> mvn archetype:generate ` -D archetypeGroupId=org.apache.beam ` -D archetypeArtifactId=beam-sdks-java-maven-archetypes-examples ` @@ -71,15 +60,11 @@ PS> mvn archetype:generate ` -D version="0.1" ` -D package=org.apache.beam.examples ` -D interactiveMode=false -``` - -{{% /classwrapper %}} +{{< /highlight >}} This will create a directory `word-count-beam` that contains a simple `pom.xml` and a series of example pipelines that count words in text files. -{{% classwrapper class="shell-unix" %}} - -``` +{{< highlight class="shell-unix" >}} $ cd word-count-beam/ $ ls @@ -88,13 +73,9 @@ pom.xml src $ ls src/main/java/org/apache/beam/examples/ DebuggingWordCount.java WindowedWordCount.java common MinimalWordCount.java WordCount.java -``` +{{< /highlight >}} -{{% /classwrapper %}} - -{{% classwrapper class="shell-PowerShell" %}} - -``` +{{< highlight class="shell-PowerShell" >}} PS> cd .\word-count-beam PS> dir @@ -118,9 +99,7 @@ d----- 7/19/2018 11:00 PM subprocess -a---- 7/19/2018 11:00 PM 5945 MinimalWordCount.java -a---- 7/19/2018 11:00 PM 9490 WindowedWordCount.java -a---- 7/19/2018 11:00 PM 7662 WordCount.java -``` - -{{% /classwrapper %}} +{{< /highlight >}} For a detailed introduction to the Beam concepts used in these examples, see the [WordCount Example Walkthrough](/get-started/wordcount-example). Here, we'll just focus on executing `WordCount.java`. 
@@ -140,58 +119,36 @@ After you've chosen which runner you'd like to use: For Unix shells: -{{% classwrapper class="runner-direct" %}} - -``` +{{< highlight class="runner-direct" >}} $ mvn compile exec:java -Dexec.mainClass=org.apache.beam.examples.WordCount \ -Dexec.args="--inputFile=pom.xml --output=counts" -Pdirect-runner -``` - -{{% /classwrapper %}} +{{< /highlight >}} -{{% classwrapper class="runner-apex" %}} - -``` +{{< highlight class="runner-apex" >}} $ mvn compile exec:java -Dexec.mainClass=org.apache.beam.examples.WordCount \ -Dexec.args="--inputFile=pom.xml --output=counts --runner=ApexRunner" -Papex-runner -``` - -{{% /classwrapper %}} - +{{< /highlight >}} -{{% classwrapper class="runner-flink-local" %}} -``` +{{< highlight class="runner-flink-local" >}} $ mvn compile exec:java -Dexec.mainClass=org.apache.beam.examples.WordCount \ -Dexec.args="--runner=FlinkRunner --inputFile=pom.xml --output=counts" -Pflink-runner -``` - -{{% /classwrapper %}} - -{{% classwrapper class="runner-flink-cluster" %}} +{{< /highlight >}} -``` +{{< highlight class="runner-flink-cluster" >}} $ mvn package exec:java -Dexec.mainClass=org.apache.beam.examples.WordCount \ -Dexec.args="--runner=FlinkRunner --flinkMaster= --filesToStage=target/word-count-beam-bundled-0.1.jar \ --inputFile=/path/to/quickstart/pom.xml --output=/tmp/counts" -Pflink-runner You can monitor the running job by visiting the Flink dashboard at http://:8081 -``` +{{< /highlight >}} -{{% /classwrapper %}} - -{{% classwrapper class="runner-spark" %}} - -``` +{{< highlight class="runner-spark" >}} $ mvn compile exec:java -Dexec.mainClass=org.apache.beam.examples.WordCount \ -Dexec.args="--runner=SparkRunner --inputFile=pom.xml --output=counts" -Pspark-runner -``` - -{{% /classwrapper %}} +{{< /highlight >}} -{{% classwrapper class="runner-dataflow" %}} - -``` +{{< highlight class="runner-dataflow" >}} Make sure you complete the setup steps at /documentation/runners/dataflow/#setup $ mvn compile exec:java -Dexec.mainClass=org.apache.beam.examples.WordCount \ @@ -199,91 +156,55 @@ $ mvn compile exec:java -Dexec.mainClass=org.apache.beam.examples.WordCount \ --gcpTempLocation=gs:///tmp \ --inputFile=gs://apache-beam-samples/shakespeare/* --output=gs:///counts" \ -Pdataflow-runner -``` - -{{% /classwrapper %}} +{{< /highlight >}} -{{% classwrapper class="runner-samza-local" %}} - -``` +{{< highlight class="runner-samza-local" >}} $ mvn compile exec:java -Dexec.mainClass=org.apache.beam.examples.WordCount \ -Dexec.args="--inputFile=pom.xml --output=/tmp/counts --runner=SamzaRunner" -Psamza-runner -``` - -{{% /classwrapper %}} - -{{% classwrapper class="runner-nemo" %}} +{{< /highlight >}} -``` +{{< highlight class="runner-nemo" >}} $ mvn package -Pnemo-runner && java -cp target/word-count-beam-bundled-0.1.jar org.apache.beam.examples.WordCount \ --runner=NemoRunner --inputFile=`pwd`/pom.xml --output=counts -``` - -{{% /classwrapper %}} - -{{% classwrapper class="runner-jet" %}} +{{< /highlight >}} -``` +{{< highlight class="runner-jet" >}} $ mvn package -Pjet-runner $ java -cp target/word-count-beam-bundled-0.1.jar org.apache.beam.examples.WordCount \ --runner=JetRunner --jetLocalMode=3 --inputFile=`pwd`/pom.xml --output=counts -``` - -{{% /classwrapper %}} +{{< /highlight >}} For Windows PowerShell: -{{% classwrapper class="runner-direct" %}} - -``` +{{< highlight class="runner-direct" >}} PS> mvn compile exec:java -D exec.mainClass=org.apache.beam.examples.WordCount ` -D exec.args="--inputFile=pom.xml --output=counts" -P direct-runner 
-``` - -{{% /classwrapper %}} +{{< /highlight >}} -{{% classwrapper class="runner-apex" %}} - -``` +{{< highlight class="runner-apex" >}} PS> mvn compile exec:java -D exec.mainClass=org.apache.beam.examples.WordCount ` -D exec.args="--inputFile=pom.xml --output=counts --runner=ApexRunner" -P apex-runner -``` - -{{% /classwrapper %}} +{{< /highlight >}} -{{% classwrapper class="runner-flink-local" %}} - -``` +{{< highlight class="runner-flink-local" >}} PS> mvn compile exec:java -D exec.mainClass=org.apache.beam.examples.WordCount ` -D exec.args="--runner=FlinkRunner --inputFile=pom.xml --output=counts" -P flink-runner -``` - -{{% /classwrapper %}} - -{{% classwrapper class="runner-flink-cluster" %}} +{{< /highlight >}} -``` +{{< highlight class="runner-flink-cluster" >}} PS> mvn package exec:java -D exec.mainClass=org.apache.beam.examples.WordCount ` -D exec.args="--runner=FlinkRunner --flinkMaster= --filesToStage=.\target\word-count-beam-bundled-0.1.jar ` --inputFile=C:\path\to\quickstart\pom.xml --output=C:\tmp\counts" -P flink-runner You can monitor the running job by visiting the Flink dashboard at http://:8081 -``` - -{{% /classwrapper %}} - -{{% classwrapper class="runner-spark" %}} +{{< /highlight >}} -``` +{{< highlight class="runner-spark" >}} PS> mvn compile exec:java -D exec.mainClass=org.apache.beam.examples.WordCount ` -D exec.args="--runner=SparkRunner --inputFile=pom.xml --output=counts" -P spark-runner -``` +{{< /highlight >}} -{{% /classwrapper %}} - -{{% classwrapper class="runner-dataflow" %}} - -``` +{{< highlight class="runner-dataflow" >}} Make sure you complete the setup steps at /documentation/runners/dataflow/#setup PS> mvn compile exec:java -D exec.mainClass=org.apache.beam.examples.WordCount ` @@ -291,120 +212,68 @@ PS> mvn compile exec:java -D exec.mainClass=org.apache.beam.examples.WordCount ` --gcpTempLocation=gs:///tmp ` --inputFile=gs://apache-beam-samples/shakespeare/* --output=gs:///counts" ` -P dataflow-runner -``` - -{{% /classwrapper %}} +{{< /highlight >}} -{{% classwrapper class="runner-samza-local" %}} - -``` +{{< highlight class="runner-samza-local" >}} PS> mvn compile exec:java -D exec.mainClass=org.apache.beam.examples.WordCount ` -D exec.args="--inputFile=pom.xml --output=/tmp/counts --runner=SamzaRunner" -P samza-runner -``` - -{{% /classwrapper %}} +{{< /highlight >}} -{{% classwrapper class="runner-nemo" %}} - -``` +{{< highlight class="runner-nemo" >}} PS> mvn package -P nemo-runner -DskipTests PS> java -cp target/word-count-beam-bundled-0.1.jar org.apache.beam.examples.WordCount ` --runner=NemoRunner --inputFile=`pwd`/pom.xml --output=counts -``` - -{{% /classwrapper %}} - -{{% classwrapper class="runner-jet" %}} +{{< /highlight >}} -``` +{{< highlight class="runner-jet" >}} PS> mvn package -P jet-runner PS> java -cp target/word-count-beam-bundled-0.1.jar org.apache.beam.examples.WordCount ` --runner=JetRunner --jetLocalMode=3 --inputFile=$pwd/pom.xml --output=counts -``` - -{{% /classwrapper %}} +{{< /highlight >}} ## Inspect the results Once the pipeline has completed, you can view the output. You'll notice that there may be multiple output files prefixed by `count`. The exact number of these files is decided by the runner, giving it the flexibility to do efficient, distributed execution. 
-{{% classwrapper class="runner-direct" %}} - -``` +{{< highlight class="runner-direct" >}} $ ls counts* -``` +{{< /highlight >}} -{{% /classwrapper %}} - -{{% classwrapper class="runner-apex" %}} - -``` +{{< highlight class="runner-apex" >}} $ ls counts* -``` - -{{% /classwrapper %}} +{{< /highlight >}} -{{% classwrapper class="runner-flink-local" %}} - -``` +{{< highlight class="runner-flink-local" >}} $ ls counts* -``` - -{{% /classwrapper %}} +{{< /highlight >}} -{{% classwrapper class="runner-flink-cluster" %}} - -``` +{{< highlight class="runner-flink-cluster" >}} $ ls /tmp/counts* -``` - -{{% /classwrapper %}} - -{{% classwrapper class="runner-spark" %}} +{{< /highlight >}} -``` +{{< highlight class="runner-spark" >}} $ ls counts* -``` - -{{% /classwrapper %}} - -{{% classwrapper class="runner-dataflow" %}} +{{< /highlight >}} -``` +{{< highlight class="runner-dataflow" >}} $ gsutil ls gs:///counts* -``` +{{< /highlight >}} -{{% /classwrapper %}} - -{{% classwrapper class="runner-samza-local" %}} - -``` +{{< highlight class="runner-samza-local" >}} $ ls /tmp/counts* -``` - -{{% /classwrapper %}} +{{< /highlight >}} -{{% classwrapper class="runner-nemo" %}} - -``` +{{< highlight class="runner-nemo" >}} $ ls counts* -``` - -{{% /classwrapper %}} +{{< /highlight >}} -{{% classwrapper class="runner-jet" %}} - -``` +{{< highlight class="runner-jet" >}} $ ls counts* -``` - -{{% /classwrapper %}} +{{< /highlight >}} When you look into the contents of the file, you'll see that they contain unique words and the number of occurrences of each word. The order of elements within the file may differ because the Beam model does not generally guarantee ordering, again to allow runners to optimize for efficiency. -{{% classwrapper class="runner-direct" %}} - -``` +{{< highlight class="runner-direct" >}} $ more counts* api: 9 bundled: 1 @@ -414,13 +283,9 @@ The: 1 limitations: 1 Foundation: 1 ... -``` - -{{% /classwrapper %}} - -{{% classwrapper class="runner-apex" %}} +{{< /highlight >}} -``` +{{< highlight class="runner-apex" >}} $ cat counts* BEAM: 1 have: 1 @@ -428,13 +293,9 @@ simple: 1 skip: 4 PAssert: 1 ... -``` +{{< /highlight >}} -{{% /classwrapper %}} - -{{% classwrapper class="runner-flink-local" %}} - -``` +{{< highlight class="runner-flink-local" >}} $ more counts* The: 1 api: 9 @@ -444,13 +305,9 @@ limitations: 1 bundled: 1 Foundation: 1 ... -``` - -{{% /classwrapper %}} +{{< /highlight >}} -{{% classwrapper class="runner-flink-cluster" %}} - -``` +{{< highlight class="runner-flink-cluster" >}} $ more /tmp/counts* The: 1 api: 9 @@ -460,13 +317,9 @@ limitations: 1 bundled: 1 Foundation: 1 ... -``` - -{{% /classwrapper %}} +{{< /highlight >}} -{{% classwrapper class="runner-spark" %}} - -``` +{{< highlight class="runner-spark" >}} $ more counts* beam: 27 SF: 1 @@ -477,14 +330,10 @@ require: 1 of: 11 profile: 10 ... -``` - -{{% /classwrapper %}} - +{{< /highlight >}} -{{% classwrapper class="runner-dataflow" %}} -``` +{{< highlight class="runner-dataflow" >}} $ gsutil cat gs:///counts* feature: 15 smother'st: 1 @@ -495,13 +344,9 @@ Below: 2 deserves: 32 barrenly: 1 ... -``` - -{{% /classwrapper %}} - -{{% classwrapper class="runner-samza-local" %}} +{{< /highlight >}} -``` +{{< highlight class="runner-samza-local" >}} $ more /tmp/counts* api: 7 are: 2 @@ -511,13 +356,9 @@ end: 14 for: 14 has: 2 ... 
-``` +{{< /highlight >}} -{{% /classwrapper %}} - -{{% classwrapper class="runner-nemo" %}} - -``` +{{< highlight class="runner-nemo" >}} $ more counts* cluster: 2 handler: 1 @@ -528,13 +369,9 @@ Adds: 2 java: 7 xml: 1 ... -``` - -{{% /classwrapper %}} +{{< /highlight >}} -{{% classwrapper class="runner-jet" %}} - -``` +{{< highlight class="runner-jet" >}} $ more counts* FlinkRunner: 1 cleanupDaemonThreads: 2 @@ -547,9 +384,7 @@ governing: 1 overrides: 1 YARN: 1 ... -``` - -{{% /classwrapper %}} +{{< /highlight >}} ## Next Steps diff --git a/website/www/site/content/en/get-started/quickstart-py/_index.md b/website/www/site/content/en/get-started/quickstart-py/_index.md index 81d452f9a6a40..161af15dfab90 100644 --- a/website/www/site/content/en/get-started/quickstart-py/_index.md +++ b/website/www/site/content/en/get-started/quickstart-py/_index.md @@ -21,16 +21,7 @@ This guide shows you how to set up your Python development environment, get the If you're interested in contributing to the Apache Beam Python codebase, see the [Contribution Guide](/contribute). -- [Set up your environment](#set-up-your-environment) - - [Check your Python version](#check-your-python-version) - - [Install pip](#install-pip) - - [Install Python virtual environment](#install-python-virtual-environment) -- [Get Apache Beam](#get-apache-beam) - - [Create and activate a virtual environment](#create-and-activate-a-virtual-environment) - - [Download and install](#download-and-install) - - [Extra requirements](#extra-requirements) -- [Execute a pipeline](#execute-a-pipeline) -- [Next Steps](#next-steps) +{{< toc >}} The Python SDK supports Python 2.7, 3.5, 3.6, and 3.7. New Python SDK releases will stop supporting Python 2.7 in 2020 ([BEAM-8371](https://issues.apache.org/jira/browse/BEAM-8371)). For best results, use Beam with Python 3. @@ -55,21 +46,13 @@ pip --version If you do not have `pip` version 7.0.0 or newer, run the following command to install it. This command might require administrative privileges. -{{% classwrapper class="shell-unix" %}} - -``` +{{< highlight class="shell-unix" >}} pip install --upgrade pip -``` - -{{% /classwrapper %}} - -{{% classwrapper class="shell-PowerShell" %}} +{{< /highlight >}} -``` +{{< highlight class="shell-PowerShell" >}} PS> python -m pip install --upgrade pip -``` - -{{% /classwrapper%}} +{{< /highlight >}} ### Install Python virtual environment @@ -79,41 +62,25 @@ for initial experiments. If you do not have `virtualenv` version 13.1.0 or newer, run the following command to install it. This command might require administrative privileges. -{{% classwrapper class="shell-unix" %}} - -``` +{{< highlight class="shell-unix" >}} pip install --upgrade virtualenv -``` - -{{% /classwrapper %}} - -{{% classwrapper class="shell-PowerShell" %}} +{{< /highlight >}} -``` +{{< highlight class="shell-PowerShell" >}} PS> python -m pip install --upgrade virtualenv -``` - -{{% /classwrapper %}} +{{< /highlight >}} If you do not want to use a Python virtual environment (not recommended), ensure `setuptools` is installed on your machine. If you do not have `setuptools` version 17.1 or newer, run the following command to install it. 
-{{% classwrapper class="shell-unix" %}} - -``` +{{< highlight class="shell-unix" >}} pip install --upgrade setuptools -``` +{{< /highlight >}} -{{% /classwrapper %}} - -{{% classwrapper class="shell-PowerShell" %}} - -``` +{{< highlight class="shell-PowerShell" >}} PS> python -m pip install --upgrade setuptools -``` - -{{% /classwrapper %}} +{{< /highlight >}} ## Get Apache Beam @@ -121,21 +88,13 @@ PS> python -m pip install --upgrade setuptools A virtual environment is a directory tree containing its own Python distribution. To create a virtual environment, create a directory and run: -{{% classwrapper class="shell-unix" %}} - -``` +{{< highlight class="shell-unix" >}} virtualenv /path/to/directory -``` - -{{% /classwrapper %}} - -{{% classwrapper class="shell-PowerShell" %}} +{{< /highlight >}} -``` +{{< highlight class="shell-PowerShell" >}} PS> virtualenv C:\path\to\directory -``` - -{{% /classwrapper %}} +{{< /highlight >}} A virtual environment needs to be activated for each shell that is to use it. Activating it sets some environment variables that point to the virtual @@ -143,21 +102,13 @@ environment's directories. To activate a virtual environment in Bash, run: -{{% classwrapper class="shell-unix" %}} - -``` +{{< highlight class="shell-unix" >}} . /path/to/directory/bin/activate -``` - -{{% /classwrapper %}} +{{< /highlight >}} -{{% classwrapper class="shell-PowerShell" %}} - -``` +{{< highlight class="shell-PowerShell" >}} PS> C:\path\to\directory\Scripts\activate.ps1 -``` - -{{% /classwrapper %}} +{{< /highlight >}} That is, execute the `activate` script under the virtual environment directory you created. @@ -167,21 +118,13 @@ For instructions using other shells, see the [virtualenv documentation](https:// Install the latest Python SDK from PyPI: -{{% classwrapper class="shell-unix" %}} - -``` +{{< highlight class="shell-unix" >}} pip install apache-beam -``` +{{< /highlight >}} -{{% /classwrapper %}} - -{{% classwrapper class="shell-PowerShell" %}} - -``` +{{< highlight class="shell-PowerShell" >}} PS> python -m pip install apache-beam -``` - -{{% /classwrapper %}} +{{< /highlight >}} #### Extra requirements @@ -207,52 +150,30 @@ The Apache Beam [examples](https://github.com/apache/beam/tree/master/sdks/pytho For example, run `wordcount.py` with the following command: -{{% classwrapper class="runner-direct" %}} - -``` +{{< highlight class="runner-direct" >}} python -m apache_beam.examples.wordcount --input /path/to/inputfile --output /path/to/write/counts -``` - -{{% /classwrapper %}} - -{{% classwrapper class="runner-apex" %}} +{{< /highlight >}} -``` +{{< highlight class="runner-apex" >}} This runner is not yet available for the Python SDK. -``` - -{{% /classwrapper %}} - -{{% classwrapper class="runner-flink-local" %}} +{{< /highlight >}} -``` +{{< highlight class="runner-flink-local" >}} Currently, running wordcount.py on Flink requires a full download of the Beam source code. See https://beam.apache.org/roadmap/portability/#python-on-flink for more information. -``` - -{{% /classwrapper %}} +{{< /highlight >}} -{{% classwrapper class="runner-flink-cluster" %}} - -``` +{{< highlight class="runner-flink-cluster" >}} Currently, running wordcount.py on Flink requires a full download of the Beam source code. See https://beam.apache.org/documentation/runners/flink/ for more information. 
-``` - -{{% /classwrapper %}} - -{{% classwrapper class="runner-spark" %}} +{{< /highlight >}} -``` +{{< highlight class="runner-spark" >}} Currently, running wordcount.py on Spark requires a full download of the Beam source code. See https://beam.apache.org/roadmap/portability/#python-on-spark for more information. -``` +{{< /highlight >}} -{{% /classwrapper %}} - -{{% classwrapper class="runner-dataflow" %}} - -``` +{{< highlight class="runner-dataflow" >}} # As part of the initial setup, install Google Cloud Platform specific extra components. Make sure you # complete the setup steps at /documentation/runners/dataflow/#setup pip install apache-beam[gcp] @@ -261,17 +182,11 @@ python -m apache_beam.examples.wordcount --input gs://dataflow-samples/shakespea --runner DataflowRunner \ --project your-gcp-project \ --temp_location gs:///tmp/ -``` - -{{% /classwrapper %}} +{{< /highlight >}} -{{% classwrapper class="runner-nemo" %}} - -``` +{{< highlight class="runner-nemo" >}} This runner is not yet available for the Python SDK. -``` - -{{% /classwrapper %}} +{{< /highlight >}} After the pipeline completes, you can view the output files at your specified output path. For example, if you specify `/dir1/counts` for the `--output` diff --git a/website/www/site/content/en/get-started/try-apache-beam/_index.md b/website/www/site/content/en/get-started/try-apache-beam/_index.md index 28dca6377b805..dac27d4cf0c4e 100644 --- a/website/www/site/content/en/get-started/try-apache-beam/_index.md +++ b/website/www/site/content/en/get-started/try-apache-beam/_index.md @@ -19,22 +19,13 @@ limitations under the License. You can try Apache Beam using our interactive notebooks, which are hosted in [Colab](https://colab.research.google.com). The notebooks allow you to interactively play with the code and see how your changes affect the pipeline. You don't need to install anything or modify your computer in any way to use these notebooks. - +{{< language-switcher java py go >}} ## Interactive WordCount in Colab This interactive notebook shows you what a simple, minimal version of WordCount looks like. -{{% classwrapper class="language-java" %}} - -```java +{{< highlight java >}} package samples.quickstart; import org.apache.beam.sdk.Pipeline; @@ -70,11 +61,9 @@ public class WordCount { pipeline.run(); } } -``` - -{{% /classwrapper %}} +{{< /highlight >}} -{{< classwrapper class="language-java" wrapper="p" >}} +{{< paragraph class="language-java" >}} Run in Colab @@ -83,15 +72,13 @@ public class WordCount { href="https://github.com/{{< param branch_repo >}}/examples/notebooks/get-started/try-apache-beam-java.ipynb"> View on GitHub -{{< /classwrapper >}} +{{< /paragraph >}} -{{< classwrapper class="language-java" wrapper="p" >}} +{{< paragraph class="language-java" >}} To learn how to install and run the Apache Beam Java SDK on your own computer, follow the instructions in the Java Quickstart. 
-{{< /classwrapper >}} - -{{% classwrapper class="language-py" %}} +{{< /paragraph >}} -```py +{{< highlight py >}} import apache_beam as beam import re @@ -108,11 +95,9 @@ with beam.Pipeline() as pipeline: | 'Format results' >> beam.Map(lambda word_count: str(word_count)) | 'Write results' >> beam.io.WriteToText(outputs_prefix) ) -``` +{{< /highlight >}} -{{% /classwrapper %}} - -{{< classwrapper class="language-py" wrapper="p" >}} +{{< paragraph class="language-py" >}} Run in Colab @@ -121,15 +106,13 @@ with beam.Pipeline() as pipeline: href="https://github.com/{{< param branch_repo >}}/examples/notebooks/get-started/try-apache-beam-py.ipynb"> View on GitHub -{{< /classwrapper >}} +{{< /paragraph >}} -{{< classwrapper class="language-py" wrapper="p" >}} +{{< paragraph class="language-py" >}} To learn how to install and run the Apache Beam Python SDK on your own computer, follow the instructions in the Python Quickstart. -{{< /classwrapper >}} - -{{% classwrapper class="language-go" %}} +{{< /paragraph >}} -```go +{{< highlight go >}} package main import ( @@ -175,11 +158,9 @@ func main() { direct.Execute(context.Background(), pipeline) } -``` - -{{% /classwrapper %}} +{{< /highlight >}} -{{< classwrapper class="language-go" wrapper="p" >}} +{{< paragraph class="language-go" >}} Run in Colab @@ -188,11 +169,11 @@ func main() { href="https://github.com/{{< param branch_repo >}}/examples/notebooks/get-started/try-apache-beam-go.ipynb"> View on GitHub -{{< /classwrapper >}} +{{< /paragraph >}} -{{< classwrapper class="language-go" wrapper="p" >}} +{{< paragraph class="language-go" >}} To learn how to install and run the Apache Beam Go SDK on your own computer, follow the instructions in the Go Quickstart. -{{< /classwrapper >}} +{{< /paragraph >}} For a more detailed explanation about how WordCount works, see the [WordCount Example Walkthrough](/get-started/wordcount-example). diff --git a/website/www/site/content/en/get-started/wordcount-example/_index.md b/website/www/site/content/en/get-started/wordcount-example/_index.md index 1a2051061d642..b8e1910e4aa55 100644 --- a/website/www/site/content/en/get-started/wordcount-example/_index.md +++ b/website/www/site/content/en/get-started/wordcount-example/_index.md @@ -18,41 +18,9 @@ limitations under the License. 
# Apache Beam WordCount Examples -- [MinimalWordCount example](#minimalwordcount-example) - - [Creating the pipeline](#creating-the-pipeline) - - [Applying pipeline transforms](#applying-pipeline-transforms) - - [Running the pipeline](#running-the-pipeline) -- [WordCount example](#wordcount-example) - - [Specifying explicit DoFns](#specifying-explicit-dofns) - - [Creating composite transforms](#creating-composite-transforms) - - [Using parameterizable PipelineOptions](#using-parameterizable-pipelineoptions) -- [DebuggingWordCount example](#debuggingwordcount-example) - - [Logging](#logging) - - [Direct Runner](#direct-runner) - - [Cloud Dataflow Runner](#cloud-dataflow-runner) - - [Apache Spark Runner](#apache-spark-runner) - - [Apache Flink Runner](#apache-flink-runner) - - [Apache Apex Runner](#apache-apex-runner) - - [Apache Nemo Runner](#apache-nemo-runner) - - [Testing your pipeline with asserts](#testing-your-pipeline-with-asserts) -- [WindowedWordCount example](#windowedwordcount-example) - - [Unbounded and bounded datasets](#unbounded-and-bounded-datasets) - - [Adding timestamps to data](#adding-timestamps-to-data) - - [Windowing](#windowing) - - [Reusing PTransforms over windowed PCollections](#reusing-ptransforms-over-windowed-pcollections) -- [StreamingWordCount example](#streamingwordcount-example) - - [Reading an unbounded dataset](#reading-an-unbounded-dataset) - - [Writing unbounded results](#writing-unbounded-results) -- [Next Steps](#next-steps) - - +{{< toc >}} + +{{< language-switcher java py go >}} The WordCount examples demonstrate how to set up a processing pipeline that can read text, tokenize the text lines into individual words, and perform a @@ -79,54 +47,42 @@ MinimalWordCount demonstrates a simple pipeline that uses the Direct Runner to read from a text file, apply transforms to tokenize and count the words, and write the data to an output text file. -{{% classwrapper class="language-java language-go" wrapper="p" %}} +{{< paragraph class="language-java language-go" >}} This example hard-codes the locations for its input and output files and doesn't perform any error checking; it is intended to only show you the "bare bones" of creating a Beam pipeline. This lack of parameterization makes this particular pipeline less portable across different runners than standard Beam pipelines. In later examples, we will parameterize the pipeline's input and output sources and show other best practices. 
-{{% /classwrapper %}} - -{{% classwrapper class="language-java" %}} +{{< /paragraph >}} -```java +{{< highlight java >}} $ mvn compile exec:java -Dexec.mainClass=org.apache.beam.examples.MinimalWordCount -``` - -{{% /classwrapper %}} +{{< /highlight >}} -{{% classwrapper class="language-py" %}} - -```py +{{< highlight py >}} python -m apache_beam.examples.wordcount_minimal --input YOUR_INPUT_FILE --output counts -``` - -{{% /classwrapper %}} +{{< /highlight >}} -{{% classwrapper class="language-go" %}} - -```go +{{< highlight go >}} $ go install github.com/apache/beam/sdks/go/examples/minimal_wordcount $ minimal_wordcount -``` - -{{% /classwrapper %}} +{{< /highlight >}} -{{% classwrapper class="language-java" wrapper="p" %}} +{{< paragraph class="language-java" >}} To view the full code in Java, see **[MinimalWordCount](https://github.com/apache/beam/blob/master/examples/java/src/main/java/org/apache/beam/examples/MinimalWordCount.java).** -{{% /classwrapper %}} +{{< /paragraph >}} -{{% classwrapper class="language-py" wrapper="p" %}} +{{< paragraph class="language-py" >}} To view the full code in Python, see **[wordcount_minimal.py](https://github.com/apache/beam/blob/master/sdks/python/apache_beam/examples/wordcount_minimal.py).** -{{% /classwrapper %}} +{{< /paragraph >}} -{{% classwrapper class="language-go" wrapper="p" %}} +{{< paragraph class="language-go" >}} To view the full code in Go, see **[minimal_wordcount.go](https://github.com/apache/beam/blob/master/sdks/go/examples/minimal_wordcount/minimal_wordcount.go).** -{{% /classwrapper %}} +{{< /paragraph >}} **Key Concepts:** @@ -143,77 +99,61 @@ excerpts from the MinimalWordCount pipeline. ### Creating the pipeline -{{% classwrapper class="language-java language-py" wrapper="p" %}} +{{< paragraph class="language-java language-py" >}} In this example, the code first creates a `PipelineOptions` object. This object lets us set various options for our pipeline, such as the pipeline runner that will execute our pipeline and any runner-specific configuration required by the chosen runner. In this example we set these options programmatically, but more often, command-line arguments are used to set `PipelineOptions`. -{{% /classwrapper %}} +{{< /paragraph >}} -{{% classwrapper class="language-java language-py" wrapper="p" %}} +{{< paragraph class="language-java language-py" >}} You can specify a runner for executing your pipeline, such as the `DataflowRunner` or `SparkRunner`. If you omit specifying a runner, as in this example, your pipeline executes locally using the `DirectRunner`. In the next sections, we will specify the pipeline's runner. -{{% /classwrapper %}} - -{{% classwrapper class="language-java" %}} +{{< /paragraph >}} -```java +{{< highlight java >}} // Create a PipelineOptions object. This object lets us set various execution // options for our pipeline, such as the runner you wish to use. This example // will run with the DirectRunner by default, based on the class path configured // in its dependencies. PipelineOptions options = PipelineOptionsFactory.create(); -``` - -{{% /classwrapper %}} - -{{% classwrapper class="language-py" %}} +{{< /highlight >}} - - -{{% /classwrapper %}} +%}--> +{{< /highlight >}} -{{% classwrapper class="language-java language-py" wrapper="p" %}} +{{< paragraph class="language-java language-py" >}} The next step is to create a `Pipeline` object with the options we've just constructed. 
The Pipeline object builds up the graph of transformations to be executed, associated with that particular pipeline. -{{% /classwrapper %}} +{{< /paragraph >}} -{{% classwrapper class="language-go" wrapper="p" %}} +{{< paragraph class="language-go" >}} The first step is to create a `Pipeline` object. It builds up the graph of transformations to be executed, associated with that particular pipeline. The scope allows grouping into composite transforms. -{{% /classwrapper %}} - -{{% classwrapper class="language-java" %}} +{{< /paragraph >}} -```java +{{< highlight java >}} Pipeline p = Pipeline.create(options); -``` - -{{% /classwrapper %}} - -{{% classwrapper class="language-py" %}} +{{< /highlight >}} - - -{{% /classwrapper %}} - -{{% classwrapper class="language-go" %}} +%}--> +{{< /highlight >}} -```go +{{< highlight go >}} p := beam.NewPipeline s := p.Root() -``` - -{{% /classwrapper %}} +{{< /highlight >}} ### Applying pipeline transforms @@ -238,29 +178,19 @@ The MinimalWordCount pipeline contains five transforms: represents one line of text from the input file. This example uses input data stored in a publicly accessible Google Cloud Storage bucket ("gs://"). -{{% classwrapper class="language-java" %}} - -```java +{{< highlight java >}} p.apply(TextIO.read().from("gs://apache-beam-samples/shakespeare/*")) -``` +{{< /highlight >}} -{{% /classwrapper %}} - -{{% classwrapper class="language-py" %}} - - - -{{% /classwrapper %}} - -{{% classwrapper class="language-go" %}} +%}--> +{{< /highlight >}} -```go +{{< highlight go >}} lines := textio.Read(s, "gs://apache-beam-samples/shakespeare/*") -``` - -{{% /classwrapper %}} +{{< /highlight >}} 2. This transform splits the lines in `PCollection`, where each element is an individual word in Shakespeare's collected texts. @@ -272,36 +202,26 @@ lines := textio.Read(s, "gs://apache-beam-samples/shakespeare/*") previous `TextIO.Read` transform. The `ParDo` transform outputs a new `PCollection`, where each element represents an individual word in the text. -{{% classwrapper class="language-java" %}} - -```java +{{< highlight java >}} .apply("ExtractWords", FlatMapElements .into(TypeDescriptors.strings()) .via((String line) -> Arrays.asList(line.split("[^\\p{L}]+")))) -``` - -{{% /classwrapper %}} +{{< /highlight >}} -{{% classwrapper class="language-py" %}} - - +%}--> +{{< /highlight >}} -{{% /classwrapper %}} - -{{% classwrapper class="language-go" %}} - -```go +{{< highlight go >}} words := beam.ParDo(s, func(line string, emit func(string)) { for _, word := range wordRE.FindAllString(line, -1) { emit(word) } }, lines) -``` - -{{% /classwrapper %}} +{{< /highlight >}} 3. The SDK-provided `Count` transform is a generic transform that takes a `PCollection` of any type, and returns a `PCollection` of key/value pairs. @@ -314,29 +234,19 @@ words := beam.ParDo(s, func(line string, emit func(string)) { of key/value pairs where each key represents a unique word in the text and the associated value is the occurrence count for each. -{{% classwrapper class="language-java" %}} - -```java +{{< highlight java >}} .apply(Count.perElement()) -``` - -{{% /classwrapper %}} +{{< /highlight >}} -{{% classwrapper class="language-py" %}} - - - -{{% /classwrapper %}} +%}--> +{{< /highlight >}} -{{% classwrapper class="language-go" %}} - -```go +{{< highlight go >}} counted := stats.Count(s, words) -``` - -{{% /classwrapper %}} +{{< /highlight >}} 4. 
The next transform formats each of the key/value pairs of unique words and occurrence counts into a printable string suitable for writing to an output @@ -346,113 +256,83 @@ counted := stats.Count(s, words) simple `ParDo`. For each element in the input `PCollection`, the map transform applies a function that produces exactly one output element. -{{% classwrapper class="language-java" %}} - -```java +{{< highlight java >}} .apply("FormatResults", MapElements .into(TypeDescriptors.strings()) .via((KV wordCount) -> wordCount.getKey() + ": " + wordCount.getValue())) -``` - -{{% /classwrapper %}} - -{{% classwrapper class="language-py" %}} +{{< /highlight >}} - - -{{% /classwrapper %}} +%}--> +{{< /highlight >}} -{{% classwrapper class="language-go" %}} - -```go +{{< highlight go >}} formatted := beam.ParDo(s, func(w string, c int) string { return fmt.Sprintf("%s: %v", w, c) }, counted) -``` - -{{% /classwrapper %}} +{{< /highlight >}} 5. A text file write transform. This transform takes the final `PCollection` of formatted Strings as input and writes each element to an output text file. Each element in the input `PCollection` represents one line of text in the resulting output file. -{{% classwrapper class="language-java" %}} - -```java +{{< highlight java >}} .apply(TextIO.write().to("wordcounts")); -``` - -{{% /classwrapper %}} - -{{% classwrapper class="language-py" %}} +{{< /highlight >}} - - -{{% /classwrapper %}} - -{{% classwrapper class="language-go" %}} +%}--> +{{< /highlight >}} -```go +{{< highlight go >}} textio.Write(s, "wordcounts.txt", formatted) -``` +{{< /highlight >}} -{{% /classwrapper %}} - -{{% classwrapper class="language-java language-py" wrapper="p" %}} +{{< paragraph class="language-java language-py" >}} Note that the `Write` transform produces a trivial result value of type `PDone`, which in this case is ignored. -{{% /classwrapper %}} +{{< /paragraph >}} -{{% classwrapper class="language-go" wrapper="p" %}} +{{< paragraph class="language-go" >}} Note that the `Write` transform returns no PCollections. -{{% /classwrapper %}} +{{< /paragraph >}} ### Running the pipeline -{{% classwrapper class="language-java language-py" wrapper="p" %}} +{{< paragraph class="language-java language-py" >}} Run the pipeline by calling the `run` method, which sends your pipeline to be executed by the pipeline runner that you specified in your `PipelineOptions`. -{{% /classwrapper %}} +{{< /paragraph >}} -{{% classwrapper class="language-go" wrapper="p" %}} +{{< paragraph class="language-go" >}} Run the pipeline by passing it to a runner. -{{% /classwrapper %}} - -{{% classwrapper class="language-java" %}} +{{< /paragraph >}} -```java +{{< highlight java >}} p.run().waitUntilFinish(); -``` +{{< /highlight >}} -{{% /classwrapper %}} - -{{% classwrapper class="language-py" %}} - - - -{{% /classwrapper %}} - -{{% classwrapper class="language-go" %}} +%}--> +{{< /highlight >}} -```go +{{< highlight go >}} direct.Execute(context.Background(), p) -``` +{{< /highlight >}} -{{% /classwrapper %}} - -{{% classwrapper class="language-java language-py" wrapper="p" %}} +{{< paragraph class="language-java language-py" >}} Note that the `run` method is asynchronous. For a blocking execution, call the `waitUntilFinish` `wait_until_finish` method on the result object returned by the call to `run`. -{{% /classwrapper %}} +{{< /paragraph >}} ## WordCount example @@ -467,143 +347,85 @@ above section, [MinimalWordCount](#minimalwordcount-example). 
**To run this example in Java:** -{{% classwrapper class="runner-direct" %}} - -``` +{{< highlight class="runner-direct" >}} $ mvn compile exec:java -Dexec.mainClass=org.apache.beam.examples.WordCount \ -Dexec.args="--inputFile=pom.xml --output=counts" -Pdirect-runner -``` - -{{% /classwrapper %}} +{{< /highlight >}} -{{% classwrapper class="runner-apex" %}} - -``` +{{< highlight class="runner-apex" >}} $ mvn compile exec:java -Dexec.mainClass=org.apache.beam.examples.WordCount \ -Dexec.args="--inputFile=pom.xml --output=counts --runner=ApexRunner" -Papex-runner -``` - -{{% /classwrapper %}} +{{< /highlight >}} -{{% classwrapper class="runner-flink-local" %}} - -``` +{{< highlight class="runner-flink-local" >}} $ mvn compile exec:java -Dexec.mainClass=org.apache.beam.examples.WordCount \ -Dexec.args="--runner=FlinkRunner --inputFile=pom.xml --output=counts" -Pflink-runner -``` - -{{% /classwrapper %}} - -{{% classwrapper class="runner-flink-cluster" %}} +{{< /highlight >}} -``` +{{< highlight class="runner-flink-cluster" >}} $ mvn package exec:java -Dexec.mainClass=org.apache.beam.examples.WordCount \ -Dexec.args="--runner=FlinkRunner --flinkMaster= --filesToStage=target/word-count-beam-bundled-0.1.jar \ --inputFile=/path/to/quickstart/pom.xml --output=/tmp/counts" -Pflink-runner You can monitor the running job by visiting the Flink dashboard at http://:8081 -``` - -{{% /classwrapper %}} - -{{% classwrapper class="runner-spark" %}} +{{< /highlight >}} -``` +{{< highlight class="runner-spark" >}} $ mvn compile exec:java -Dexec.mainClass=org.apache.beam.examples.WordCount \ -Dexec.args="--runner=SparkRunner --inputFile=pom.xml --output=counts" -Pspark-runner -``` +{{< /highlight >}} -{{% /classwrapper %}} - -{{% classwrapper class="runner-dataflow" %}} - -``` +{{< highlight class="runner-dataflow" >}} $ mvn compile exec:java -Dexec.mainClass=org.apache.beam.examples.WordCount \ -Dexec.args="--runner=DataflowRunner --gcpTempLocation=gs://YOUR_GCS_BUCKET/tmp \ --inputFile=gs://apache-beam-samples/shakespeare/* --output=gs://YOUR_GCS_BUCKET/counts" \ -Pdataflow-runner -``` - -{{% /classwrapper %}} +{{< /highlight >}} -{{% classwrapper class="runner-samza-local" %}} - -``` +{{< highlight class="runner-samza-local" >}} $ mvn compile exec:java -Dexec.mainClass=org.apache.beam.examples.WordCount \ -Dexec.args="--inputFile=pom.xml --output=counts --runner=SamzaRunner" -Psamza-runner -``` - -{{% /classwrapper %}} +{{< /highlight >}} -{{% classwrapper class="runner-nemo" %}} - -``` +{{< highlight class="runner-nemo" >}} $ mvn package -Pnemo-runner && java -cp target/word-count-beam-bundled-0.1.jar org.apache.beam.examples.WordCount \ --runner=NemoRunner --inputFile=`pwd`/pom.xml --output=counts -``` - -{{% /classwrapper %}} - -{{% classwrapper class="runner-jet" %}} +{{< /highlight >}} -``` +{{< highlight class="runner-jet" >}} $ mvn package -P jet-runner && java -cp target/word-count-beam-bundled-0.1.jar org.apache.beam.examples.WordCount \ --runner=JetRunner --jetLocalMode=3 --inputFile=`pwd`/pom.xml --output=counts -``` - -{{% /classwrapper %}} +{{< /highlight >}} To view the full code in Java, see **[WordCount](https://github.com/apache/beam/blob/master/examples/java/src/main/java/org/apache/beam/examples/WordCount.java).** **To run this example in Python:** -{{% classwrapper class="runner-direct" %}} - -``` +{{< highlight class="runner-direct" >}} python -m apache_beam.examples.wordcount --input YOUR_INPUT_FILE --output counts -``` +{{< /highlight >}} -{{% /classwrapper %}} - -{{% classwrapper 
class="runner-apex" %}} - -``` +{{< highlight class="runner-apex" >}} This runner is not yet available for the Python SDK. -``` - -{{% /classwrapper %}} +{{< /highlight >}} -{{% classwrapper class="runner-flink-local" %}} - -``` +{{< highlight class="runner-flink-local" >}} Currently, running wordcount.py on Flink requires a full download of the Beam source code. See https://beam.apache.org/roadmap/portability/#python-on-flink for more information. -``` - -{{% /classwrapper %}} +{{< /highlight >}} -{{% classwrapper class="runner-flink-cluster" %}} - -``` +{{< highlight class="runner-flink-cluster" >}} Currently, running wordcount.py on Flink requires a full download of the Beam source code. See https://beam.apache.org/documentation/runners/flink/ for more information. -``` - -{{% /classwrapper %}} - -{{% classwrapper class="runner-spark" %}} +{{< /highlight >}} -``` +{{< highlight class="runner-spark" >}} Currently, running wordcount.py on Spark requires a full download of the Beam source code. See https://beam.apache.org/roadmap/portability/#python-on-spark for more information. -``` - -{{% /classwrapper %}} - -{{% classwrapper class="runner-dataflow" %}} +{{< /highlight >}} -``` +{{< highlight class="runner-dataflow" >}} # As part of the initial setup, install Google Cloud Platform specific extra components. pip install apache-beam[gcp] python -m apache_beam.examples.wordcount --input gs://dataflow-samples/shakespeare/kinglear.txt \ @@ -611,83 +433,47 @@ python -m apache_beam.examples.wordcount --input gs://dataflow-samples/shakespea --runner DataflowRunner \ --project YOUR_GCP_PROJECT \ --temp_location gs://YOUR_GCS_BUCKET/tmp/ -``` +{{< /highlight >}} -{{% /classwrapper %}} - -{{% classwrapper class="runner-samza-local" %}} - -``` +{{< highlight class="runner-samza-local" >}} This runner is not yet available for the Python SDK. -``` - -{{% /classwrapper %}} +{{< /highlight >}} -{{% classwrapper class="runner-nemo" %}} - -``` +{{< highlight class="runner-nemo" >}} This runner is not yet available for the Python SDK. -``` - -{{% /classwrapper %}} +{{< /highlight >}} -{{% classwrapper class="runner-jet" %}} - -``` +{{< highlight class="runner-jet" >}} This runner is not yet available for the Python SDK. -``` - -{{% /classwrapper %}} +{{< /highlight >}} To view the full code in Python, see **[wordcount.py](https://github.com/apache/beam/blob/master/sdks/python/apache_beam/examples/wordcount.py).** **To run this example in Go:** -{{% classwrapper class="runner-direct" %}} - -``` +{{< highlight class="runner-direct" >}} $ go install github.com/apache/beam/sdks/go/examples/wordcount $ wordcount --input --output counts -``` - -{{% /classwrapper %}} - -{{% classwrapper class="runner-apex" %}} +{{< /highlight >}} -``` +{{< highlight class="runner-apex" >}} This runner is not yet available for the Go SDK. -``` +{{< /highlight >}} -{{% /classwrapper %}} - -{{% classwrapper class="runner-flink-local" %}} - -``` +{{< highlight class="runner-flink-local" >}} This runner is not yet available for the Go SDK. -``` - -{{% /classwrapper %}} +{{< /highlight >}} -{{% classwrapper class="runner-flink-cluster" %}} - -``` +{{< highlight class="runner-flink-cluster" >}} This runner is not yet available for the Go SDK. -``` - -{{% /classwrapper %}} +{{< /highlight >}} -{{% classwrapper class="runner-spark" %}} - -``` +{{< highlight class="runner-spark" >}} This runner is not yet available for the Go SDK. 
-``` - -{{% /classwrapper %}} - -{{% classwrapper class="runner-dataflow" %}} +{{< /highlight >}} -``` +{{< highlight class="runner-dataflow" >}} $ go install github.com/apache/beam/sdks/go/examples/wordcount # As part of the initial setup, for non linux users - install package unix before run $ go get -u golang.org/x/sys/unix @@ -698,33 +484,19 @@ $ wordcount --input gs://dataflow-samples/shakespeare/kinglear.txt \ --temp_location gs:///tmp/ \ --staging_location gs:///binaries/ \ --worker_harness_container_image=apache/beam_go_sdk:latest -``` - -{{% /classwrapper %}} - -{{% classwrapper class="runner-samza-local" %}} +{{< /highlight >}} -``` +{{< highlight class="runner-samza-local" >}} This runner is not yet available for the Go SDK. -``` +{{< /highlight >}} -{{% /classwrapper %}} - -{{% classwrapper class="runner-nemo" %}} - -``` +{{< highlight class="runner-nemo" >}} This runner is not yet available for the Go SDK. -``` - -{{% /classwrapper %}} +{{< /highlight >}} -{{% classwrapper class="runner-jet" %}} - -``` +{{< highlight class="runner-jet" >}} This runner is not yet available for the Go SDK. -``` - -{{% /classwrapper %}} +{{< /highlight >}} To view the full code in Go, see **[wordcount.go](https://github.com/apache/beam/blob/master/sdks/go/examples/wordcount/wordcount.go).** @@ -740,7 +512,7 @@ pipeline code into smaller sections. ### Specifying explicit DoFns -{{% classwrapper class="language-java language-py" wrapper="p" %}} +{{< paragraph class="language-java language-py" >}} When using `ParDo` transforms, you need to specify the processing operation that gets applied to each element in the input `PCollection`. This processing operation is a subclass of the SDK class `DoFn`. You can create the `DoFn` @@ -748,20 +520,18 @@ subclasses for each `ParDo` inline, as an anonymous inner class instance, as is done in the previous example (MinimalWordCount). However, it's often a good idea to define the `DoFn` at the global level, which makes it easier to unit test and can make the `ParDo` code more readable. -{{% /classwrapper %}} +{{< /paragraph >}} -{{% classwrapper class="language-go" wrapper="p" %}} +{{< paragraph class="language-go" >}} When using `ParDo` transforms, you need to specify the processing operation that gets applied to each element in the input `PCollection`. This processing operation is either a named function or a struct with specially-named methods. You can use anomynous functions (but not closures). However, it's often a good idea to define the `DoFn` at the global level, which makes it easier to unit test and can make the `ParDo` code more readable. -{{% /classwrapper %}} +{{< /paragraph >}} -{{% classwrapper class="language-java" %}} - -```java +{{< highlight java >}} // In this example, ExtractWordsFn is a DoFn that is defined as a static class: static class ExtractWordsFn extends DoFn { @@ -772,65 +542,55 @@ static class ExtractWordsFn extends DoFn { ... } } -``` - -{{% /classwrapper %}} - -{{% classwrapper class="language-py" %}} +{{< /highlight >}} - - -{{% /classwrapper %}} +%}--> +{{< /highlight >}} -{{% classwrapper class="language-go" %}} - -```go +{{< highlight go >}} // In this example, extractFn is a DoFn that is defined as a function: func extractFn(ctx context.Context, line string, emit func(string)) { ... 
} -``` - -{{% /classwrapper %}} +{{< /highlight >}} ### Creating composite transforms -{{% classwrapper class="language-java language-py" wrapper="p" %}} +{{< paragraph class="language-java language-py" >}} If you have a processing operation that consists of multiple transforms or `ParDo` steps, you can create it as a subclass of `PTransform`. Creating a `PTransform` subclass allows you to encapsulate complex transforms, can make your pipeline's structure more clear and modular, and makes unit testing easier. -{{% /classwrapper %}} +{{< /paragraph >}} -{{% classwrapper class="language-go" wrapper="p" %}} +{{< paragraph class="language-go" >}} If you have a processing operation that consists of multiple transforms or `ParDo` steps, you can use a normal Go function to encapsulate them. You can furthermore use a named subscope to group them as a composite transform visible for monitoring. -{{% /classwrapper %}} +{{< /paragraph >}} -{{% classwrapper class="language-java language-py" wrapper="p" %}} +{{< paragraph class="language-java language-py" >}} In this example, two transforms are encapsulated as the `PTransform` subclass `CountWords`. `CountWords` contains the `ParDo` that runs `ExtractWordsFn` and the SDK-provided `Count` transform. -{{% /classwrapper %}} +{{< /paragraph >}} -{{% classwrapper class="language-go" wrapper="p" %}} +{{< paragraph class="language-go" >}} In this example, two transforms are encapsulated as a `CountWords` function. -{{% /classwrapper %}} +{{< /paragraph >}} When `CountWords` is defined, we specify its ultimate input and output; the input is the `PCollection` for the extraction operation, and the output is the `PCollection>` produced by the count operation. -{{% classwrapper class="language-java" %}} - -```java +{{< highlight java >}} public static class CountWords extends PTransform, PCollection>> { @Override @@ -855,21 +615,15 @@ public static void main(String[] args) throws IOException { .apply(new CountWords()) ... } -``` - -{{% /classwrapper %}} - -{{% classwrapper class="language-py" %}} +{{< /highlight >}} - - -{{% /classwrapper %}} - -{{% classwrapper class="language-go" %}} +%}--> +{{< /highlight >}} -```go +{{< highlight go >}} func CountWords(s beam.Scope, lines beam.PCollection) beam.PCollection { s = s.Scope("CountWords") @@ -879,9 +633,7 @@ func CountWords(s beam.Scope, lines beam.PCollection) beam.PCollection { // Count the number of times each word occurs. return stats.Count(s, col) } -``` - -{{% /classwrapper %}} +{{< /highlight >}} ### Using parameterizable PipelineOptions @@ -890,18 +642,16 @@ the more common way is to define your own configuration options via command-line argument parsing. Defining your configuration options via the command-line makes the code more easily portable across different runners. -{{% classwrapper class="language-java language-py" wrapper="p" %}} +{{< paragraph class="language-java language-py" >}} Add arguments to be processed by the command-line parser, and specify default values for them. You can then access the options values in your pipeline code. -{{% /classwrapper %}} +{{< /paragraph >}} -{{% classwrapper class="language-go" wrapper="p" %}} +{{< paragraph class="language-go" >}} You can use the standard `flag` package for this purpose. 
-{{% /classwrapper %}} - -{{% classwrapper class="language-java" %}} +{{< /paragraph >}} -```java +{{< highlight java >}} public static interface WordCountOptions extends PipelineOptions { @Description("Path of the file to read from") @Default.String("gs://dataflow-samples/shakespeare/kinglear.txt") @@ -916,21 +666,15 @@ public static void main(String[] args) { Pipeline p = Pipeline.create(options); ... } -``` +{{< /highlight >}} -{{% /classwrapper %}} - -{{% classwrapper class="language-py" %}} - - - -{{% /classwrapper %}} - -{{% classwrapper class="language-go" %}} +%}--> +{{< /highlight >}} -```go +{{< highlight go >}} var input = flag.String("input", "gs://apache-beam-samples/shakespeare/kinglear.txt", "File(s) to read.") func main() { @@ -940,9 +684,7 @@ func main() { lines := textio.Read(s, *input) ... -``` - -{{% /classwrapper %}} +{{< /highlight >}} ## DebuggingWordCount example @@ -951,140 +693,82 @@ instrumenting your pipeline code. **To run this example in Java:** -{{% classwrapper class="runner-direct" %}} - -``` +{{< highlight class="runner-direct" >}} $ mvn compile exec:java -Dexec.mainClass=org.apache.beam.examples.DebuggingWordCount \ -Dexec.args="--output=counts" -Pdirect-runner -``` - -{{% /classwrapper %}} +{{< /highlight >}} -{{% classwrapper class="runner-apex" %}} - -``` +{{< highlight class="runner-apex" >}} $ mvn compile exec:java -Dexec.mainClass=org.apache.beam.examples.DebuggingWordCount \ -Dexec.args="--output=counts --runner=ApexRunner" -Papex-runner -``` - -{{% /classwrapper %}} +{{< /highlight >}} -{{% classwrapper class="runner-flink-local" %}} - -``` +{{< highlight class="runner-flink-local" >}} $ mvn compile exec:java -Dexec.mainClass=org.apache.beam.examples.DebuggingWordCount \ -Dexec.args="--runner=FlinkRunner --output=counts" -Pflink-runner -``` - -{{% /classwrapper %}} - -{{% classwrapper class="runner-flink-cluster" %}} +{{< /highlight >}} -``` +{{< highlight class="runner-flink-cluster" >}} $ mvn package exec:java -Dexec.mainClass=org.apache.beam.examples.DebuggingWordCount \ -Dexec.args="--runner=FlinkRunner --flinkMaster= --filesToStage=target/word-count-beam-bundled-0.1.jar \ --output=/tmp/counts" -Pflink-runner You can monitor the running job by visiting the Flink dashboard at http://:8081 -``` - -{{% /classwrapper %}} - -{{% classwrapper class="runner-spark" %}} +{{< /highlight >}} -``` +{{< highlight class="runner-spark" >}} $ mvn compile exec:java -Dexec.mainClass=org.apache.beam.examples.DebuggingWordCount \ -Dexec.args="--runner=SparkRunner --output=counts" -Pspark-runner -``` +{{< /highlight >}} -{{% /classwrapper %}} - -{{% classwrapper class="runner-dataflow" %}} - -``` +{{< highlight class="runner-dataflow" >}} $ mvn compile exec:java -Dexec.mainClass=org.apache.beam.examples.DebuggingWordCount \ -Dexec.args="--runner=DataflowRunner --gcpTempLocation=gs:///tmp \ --inputFile=gs://apache-beam-samples/shakespeare/* --output=gs:///counts" \ -Pdataflow-runner -``` - -{{% /classwrapper %}} +{{< /highlight >}} -{{% classwrapper class="runner-samza-local" %}} - -``` +{{< highlight class="runner-samza-local" >}} $ mvn compile exec:java -Dexec.mainClass=org.apache.beam.examples.DebuggingWordCount \ -Dexec.args="--runner=SamzaRunner --output=counts" -Psamza-runner -``` - -{{% /classwrapper %}} +{{< /highlight >}} -{{% classwrapper class="runner-nemo" %}} - -``` +{{< highlight class="runner-nemo" >}} $ mvn package -Pnemo-runner && java -cp target/word-count-beam-bundled-0.1.jar org.apache.beam.examples.DebuggingWordCount \ --runner=NemoRunner 
--inputFile=`pwd`/pom.xml --output=counts -``` - -{{% /classwrapper %}} - -{{% classwrapper class="runner-jet" %}} +{{< /highlight >}} -``` +{{< highlight class="runner-jet" >}} $ mvn package -P jet-runner && java -cp target/word-count-beam-bundled-0.1.jar org.apache.beam.examples.DebuggingWordCount \ --runner=JetRunner --jetLocalMode=3 --output=counts -``` - -{{% /classwrapper %}} +{{< /highlight >}} To view the full code in Java, see [DebuggingWordCount](https://github.com/apache/beam/blob/master/examples/java/src/main/java/org/apache/beam/examples/DebuggingWordCount.java). **To run this example in Python:** -{{% classwrapper class="runner-direct" %}} - -``` +{{< highlight class="runner-direct" >}} python -m apache_beam.examples.wordcount_debugging --input YOUR_INPUT_FILE --output counts -``` +{{< /highlight >}} -{{% /classwrapper %}} - -{{% classwrapper class="runner-apex" %}} - -``` +{{< highlight class="runner-apex" >}} This runner is not yet available for the Python SDK. -``` - -{{% /classwrapper %}} - -{{% classwrapper class="runner-flink-local" %}} +{{< /highlight >}} -``` +{{< highlight class="runner-flink-local" >}} This runner is not yet available for the Python SDK. -``` +{{< /highlight >}} -{{% /classwrapper %}} - -{{% classwrapper class="runner-flink-cluster" %}} - -``` +{{< highlight class="runner-flink-cluster" >}} This runner is not yet available for the Python SDK. -``` - -{{% /classwrapper %}} +{{< /highlight >}} -{{% classwrapper class="runner-spark" %}} - -``` +{{< highlight class="runner-spark" >}} This runner is not yet available for the Python SDK. -``` - -{{% /classwrapper %}} +{{< /highlight >}} -{{% classwrapper class="runner-dataflow" %}} - -``` +{{< highlight class="runner-dataflow" >}} # As part of the initial setup, install Google Cloud Platform specific extra components. pip install apache-beam[gcp] python -m apache_beam.examples.wordcount_debugging --input gs://dataflow-samples/shakespeare/kinglear.txt \ @@ -1092,83 +776,47 @@ python -m apache_beam.examples.wordcount_debugging --input gs://dataflow-samples --runner DataflowRunner \ --project YOUR_GCP_PROJECT \ --temp_location gs://YOUR_GCS_BUCKET/tmp/ -``` - -{{% /classwrapper %}} - -{{% classwrapper class="runner-samza-local" %}} +{{< /highlight >}} -``` +{{< highlight class="runner-samza-local" >}} This runner is not yet available for the Python SDK. -``` - -{{% /classwrapper %}} - -{{% classwrapper class="runner-nemo" %}} +{{< /highlight >}} -``` +{{< highlight class="runner-nemo" >}} This runner is not yet available for the Python SDK. -``` +{{< /highlight >}} -{{% /classwrapper %}} - -{{% classwrapper class="runner-jet" %}} - -``` +{{< highlight class="runner-jet" >}} This runner is not yet available for the Python SDK. -``` - -{{% /classwrapper %}} +{{< /highlight >}} To view the full code in Python, see **[wordcount_debugging.py](https://github.com/apache/beam/blob/master/sdks/python/apache_beam/examples/wordcount_debugging.py).** **To run this example in Go:** -{{% classwrapper class="runner-direct" %}} - -``` +{{< highlight class="runner-direct" >}} $ go install github.com/apache/beam/sdks/go/examples/debugging_wordcount $ debugging_wordcount --input --output counts -``` - -{{% /classwrapper %}} +{{< /highlight >}} -{{% classwrapper class="runner-apex" %}} - -``` +{{< highlight class="runner-apex" >}} This runner is not yet available for the Go SDK. 
-``` - -{{% /classwrapper %}} - -{{% classwrapper class="runner-flink-local" %}} +{{< /highlight >}} -``` +{{< highlight class="runner-flink-local" >}} This runner is not yet available for the Go SDK. -``` - -{{% /classwrapper %}} - -{{% classwrapper class="runner-flink-cluster" %}} +{{< /highlight >}} -``` +{{< highlight class="runner-flink-cluster" >}} This runner is not yet available for the Go SDK. -``` +{{< /highlight >}} -{{% /classwrapper %}} - -{{% classwrapper class="runner-spark" %}} - -``` +{{< highlight class="runner-spark" >}} This runner is not yet available for the Go SDK. -``` - -{{% /classwrapper %}} +{{< /highlight >}} -{{% classwrapper class="runner-dataflow" %}} - -``` +{{< highlight class="runner-dataflow" >}} $ go install github.com/apache/beam/sdks/go/examples/debugging_wordcount # As part of the initial setup, for non linux users - install package unix before run $ go get -u golang.org/x/sys/unix @@ -1179,33 +827,19 @@ $ debugging_wordcount --input gs://dataflow-samples/shakespeare/kinglear.txt \ --temp_location gs:///tmp/ \ --staging_location gs:///binaries/ \ --worker_harness_container_image=apache-docker-beam-snapshots-docker.bintray.io/beam/go:20180515 -``` - -{{% /classwrapper %}} +{{< /highlight >}} -{{% classwrapper class="runner-samza-local" %}} - -``` +{{< highlight class="runner-samza-local" >}} This runner is not yet available for the Go SDK. -``` - -{{% /classwrapper %}} - -{{% classwrapper class="runner-nemo" %}} +{{< /highlight >}} -``` +{{< highlight class="runner-nemo" >}} This runner is not yet available for the Go SDK. -``` - -{{% /classwrapper %}} - -{{% classwrapper class="runner-jet" %}} +{{< /highlight >}} -``` +{{< highlight class="runner-jet" >}} This runner is not yet available for the Go SDK. -``` - -{{% /classwrapper %}} +{{< /highlight >}} To view the full code in Go, see **[debugging_wordcount.go](https://github.com/apache/beam/blob/master/sdks/go/examples/debugging_wordcount/debugging_wordcount.go).** @@ -1222,9 +856,7 @@ pipeline code into smaller sections. Each runner may choose to handle logs in its own way. -{{% classwrapper class="language-java" %}} - -```java +{{< highlight java >}} // This example uses .trace and .debug: public class DebuggingWordCount { @@ -1244,21 +876,15 @@ public class DebuggingWordCount { } } } -``` - -{{% /classwrapper %}} +{{< /highlight >}} -{{% classwrapper class="language-py" %}} - - +%}--> +{{< /highlight >}} -{{% /classwrapper %}} - -{{% classwrapper class="language-go" %}} - -```go +{{< highlight go >}} type filterFn struct { ... } @@ -1273,9 +899,7 @@ func (f *filterFn) ProcessElement(ctx context.Context, word string, count int, e log.Debugf(ctx, "Did not match: %v", word) } } -``` - -{{% /classwrapper %}} +{{< /highlight >}} #### Direct Runner @@ -1293,7 +917,7 @@ that Cloud Dataflow has spun up to complete your job. Logging statements in your pipeline's `DoFn` instances will appear in Stackdriver Logging as your pipeline runs. -{{% classwrapper class="language-java language-py" wrapper="p" %}} +{{< paragraph class="language-java language-py" >}} You can also control the worker log levels. Cloud Dataflow workers that execute user code are configured to log to Stackdriver Logging by default at "INFO" log level and higher. 
You can override log levels for specific logging namespaces by @@ -1302,16 +926,16 @@ For example, by specifying `--workerLogLevelOverrides={"org.apache.beam.examples when executing a pipeline using the Cloud Dataflow service, Stackdriver Logging will contain only "DEBUG" or higher level logs for the package in addition to the default "INFO" or higher level logs. -{{% /classwrapper %}} +{{< /paragraph >}} -{{% classwrapper class="language-java language-py" wrapper="p" %}} +{{< paragraph class="language-java language-py" >}} The default Cloud Dataflow worker logging configuration can be overridden by specifying `--defaultWorkerLogLevel=`. For example, by specifying `--defaultWorkerLogLevel=DEBUG` when executing a pipeline with the Cloud Dataflow service, Cloud Logging will contain all "DEBUG" or higher level logs. Note that changing the default worker log level to TRACE or DEBUG significantly increases the amount of logs output. -{{% /classwrapper %}} +{{< /paragraph >}} #### Apache Spark Runner @@ -1340,34 +964,32 @@ all be found under the directory. ### Testing your pipeline with asserts -{{% classwrapper class="language-java language-py" wrapper="p" %}} +{{< paragraph class="language-java language-py" >}} `PAssert``assert_that` is a set of convenient PTransforms in the style of Hamcrest's collection matchers that can be used when writing pipeline level tests to validate the contents of PCollections. Asserts are best used in unit tests with small datasets. -{{% /classwrapper %}} +{{< /paragraph >}} -{{% classwrapper class="language-go" wrapper="p" %}} +{{< paragraph class="language-go" >}} The `passert` package contains convenient PTransforms that can be used when writing pipeline level tests to validate the contents of PCollections. Asserts are best used in unit tests with small datasets. -{{% /classwrapper %}} +{{< /paragraph >}} -{{% classwrapper class="language-java" wrapper="p" %}} +{{< paragraph class="language-java" >}} The following example verifies that the set of filtered words matches our expected counts. The assert does not produce any output, and the pipeline only succeeds if all of the expectations are met. -{{% /classwrapper %}} +{{< /paragraph >}} -{{% classwrapper class="language-py language-go" wrapper="p" %}} +{{< paragraph class="language-py language-go" >}} The following example verifies that two collections contain the same values. The assert does not produce any output, and the pipeline only succeeds if all of the expectations are met. -{{% /classwrapper %}} +{{< /paragraph >}} -{{% classwrapper class="language-java" %}} - -```java +{{< highlight java >}} public static void main(String[] args) { ... List> expectedResults = Arrays.asList( @@ -1376,35 +998,25 @@ public static void main(String[] args) { PAssert.that(filteredWords).containsInAnyOrder(expectedResults); ... } -``` - -{{% /classwrapper %}} +{{< /highlight >}} -{{% classwrapper class="language-py" %}} - -```py +{{< highlight py >}} from apache_beam.testing.util import assert_that from apache_beam.testing.util import equal_to with TestPipeline() as p: assert_that(p | Create([1, 2, 3]), equal_to([1, 2, 3])) -``` - -{{% /classwrapper %}} - -{{% classwrapper class="language-go" %}} +{{< /highlight >}} -```go +{{< highlight go >}} ... 
passert.Equals(s, formatted, "Flourish: 3", "stomach: 1") -``` +{{< /highlight >}} -{{% /classwrapper %}} - -{{% classwrapper class="language-java" wrapper="p" %}} +{{< paragraph class="language-java" >}} See [DebuggingWordCountTest](https://github.com/apache/beam/blob/master/examples/java/src/test/java/org/apache/beam/examples/DebuggingWordCountTest.java) for an example unit test. -{{% /classwrapper %}} +{{< /paragraph >}} ## WindowedWordCount example @@ -1423,91 +1035,55 @@ pipeline code into smaller sections. **To run this example in Java:** -{{% classwrapper class="runner-direct" %}} - -``` +{{< highlight class="runner-direct" >}} $ mvn compile exec:java -Dexec.mainClass=org.apache.beam.examples.WindowedWordCount \ -Dexec.args="--inputFile=pom.xml --output=counts" -Pdirect-runner -``` +{{< /highlight >}} -{{% /classwrapper %}} - -{{% classwrapper class="runner-apex" %}} - -``` +{{< highlight class="runner-apex" >}} $ mvn compile exec:java -Dexec.mainClass=org.apache.beam.examples.WindowedWordCount \ -Dexec.args="--inputFile=pom.xml --output=counts --runner=ApexRunner" -Papex-runner -``` - -{{% /classwrapper %}} +{{< /highlight >}} -{{% classwrapper class="runner-flink-local" %}} - -``` +{{< highlight class="runner-flink-local" >}} $ mvn compile exec:java -Dexec.mainClass=org.apache.beam.examples.WindowedWordCount \ -Dexec.args="--runner=FlinkRunner --inputFile=pom.xml --output=counts" -Pflink-runner -``` - -{{% /classwrapper %}} +{{< /highlight >}} -{{% classwrapper class="runner-flink-cluster" %}} - -``` +{{< highlight class="runner-flink-cluster" >}} $ mvn package exec:java -Dexec.mainClass=org.apache.beam.examples.WindowedWordCount \ -Dexec.args="--runner=FlinkRunner --flinkMaster= --filesToStage=target/word-count-beam-bundled-0.1.jar \ --inputFile=/path/to/quickstart/pom.xml --output=/tmp/counts" -Pflink-runner You can monitor the running job by visiting the Flink dashboard at http://:8081 -``` - -{{% /classwrapper %}} - -{{% classwrapper class="runner-spark" %}} +{{< /highlight >}} -``` +{{< highlight class="runner-spark" >}} $ mvn compile exec:java -Dexec.mainClass=org.apache.beam.examples.WindowedWordCount \ -Dexec.args="--runner=SparkRunner --inputFile=pom.xml --output=counts" -Pspark-runner -``` - -{{% /classwrapper %}} - -{{% classwrapper class="runner-dataflow" %}} +{{< /highlight >}} -``` +{{< highlight class="runner-dataflow" >}} $ mvn compile exec:java -Dexec.mainClass=org.apache.beam.examples.WindowedWordCount \ -Dexec.args="--runner=DataflowRunner --gcpTempLocation=gs://YOUR_GCS_BUCKET/tmp \ --inputFile=gs://apache-beam-samples/shakespeare/* --output=gs://YOUR_GCS_BUCKET/counts" \ -Pdataflow-runner -``` +{{< /highlight >}} -{{% /classwrapper %}} - -{{% classwrapper class="runner-samza-local" %}} - -``` +{{< highlight class="runner-samza-local" >}} $ mvn compile exec:java -Dexec.mainClass=org.apache.beam.examples.WindowedWordCount \ -Dexec.args="--runner=SamzaRunner --inputFile=pom.xml --output=counts" -Psamza-runner -``` - -{{% /classwrapper %}} +{{< /highlight >}} -{{% classwrapper class="runner-nemo" %}} - -``` +{{< highlight class="runner-nemo" >}} $ mvn package -Pnemo-runner && java -cp target/word-count-beam-bundled-0.1.jar org.apache.beam.examples.WindowedWordCount \ --runner=NemoRunner --inputFile=`pwd`/pom.xml --output=counts -``` - -{{% /classwrapper %}} +{{< /highlight >}} -{{% classwrapper class="runner-jet" %}} - -``` +{{< highlight class="runner-jet" >}} $ mvn package -P jet-runner && java -cp target/word-count-beam-bundled-0.1.jar 
org.apache.beam.examples.WindowedWordCount \ --runner=JetRunner --jetLocalMode=3 --inputFile=`pwd`/pom.xml --output=counts -``` - -{{% /classwrapper %}} +{{< /highlight >}} To view the full code in Java, see **[WindowedWordCount](https://github.com/apache/beam/blob/master/examples/java/src/main/java/org/apache/beam/examples/WindowedWordCount.java).** @@ -1518,49 +1094,27 @@ This pipeline writes its results to a BigQuery table `--output_table` parameter. using the format `PROJECT:DATASET.TABLE` or `DATASET.TABLE`. -{{% classwrapper class="runner-direct" %}} - -``` +{{< highlight class="runner-direct" >}} python -m apache_beam.examples.windowed_wordcount --input YOUR_INPUT_FILE --output_table PROJECT:DATASET.TABLE -``` - -{{% /classwrapper %}} - -{{% classwrapper class="runner-apex" %}} +{{< /highlight >}} -``` +{{< highlight class="runner-apex" >}} This runner is not yet available for the Python SDK. -``` +{{< /highlight >}} -{{% /classwrapper %}} - -{{% classwrapper class="runner-flink-local" %}} - -``` +{{< highlight class="runner-flink-local" >}} This runner is not yet available for the Python SDK. -``` - -{{% /classwrapper %}} +{{< /highlight >}} -{{% classwrapper class="runner-flink-cluster" %}} - -``` +{{< highlight class="runner-flink-cluster" >}} This runner is not yet available for the Python SDK. -``` - -{{% /classwrapper %}} +{{< /highlight >}} -{{% classwrapper class="runner-spark" %}} - -``` +{{< highlight class="runner-spark" >}} This runner is not yet available for the Python SDK. -``` - -{{% /classwrapper %}} - -{{% classwrapper class="runner-dataflow" %}} +{{< /highlight >}} -``` +{{< highlight class="runner-dataflow" >}} # As part of the initial setup, install Google Cloud Platform specific extra components. pip install apache-beam[gcp] python -m apache_beam.examples.windowed_wordcount --input YOUR_INPUT_FILE \ @@ -1568,83 +1122,47 @@ python -m apache_beam.examples.windowed_wordcount --input YOUR_INPUT_FILE \ --runner DataflowRunner \ --project YOUR_GCP_PROJECT \ --temp_location gs://YOUR_GCS_BUCKET/tmp/ -``` - -{{% /classwrapper %}} - -{{% classwrapper class="runner-samza-local" %}} +{{< /highlight >}} -``` +{{< highlight class="runner-samza-local" >}} This runner is not yet available for the Python SDK. -``` +{{< /highlight >}} -{{% /classwrapper %}} - -{{% classwrapper class="runner-nemo" %}} - -``` +{{< highlight class="runner-nemo" >}} This runner is not yet available for the Python SDK. -``` - -{{% /classwrapper %}} - -{{% classwrapper class="runner-jet" %}} +{{< /highlight >}} -``` +{{< highlight class="runner-jet" >}} This runner is not yet available for the Python SDK. -``` - -{{% /classwrapper %}} +{{< /highlight >}} To view the full code in Python, see **[windowed_wordcount.py](https://github.com/apache/beam/blob/master/sdks/python/apache_beam/examples/windowed_wordcount.py).** **To run this example in Go:** -{{% classwrapper class="runner-direct" %}} - -``` +{{< highlight class="runner-direct" >}} $ go install github.com/apache/beam/sdks/go/examples/windowed_wordcount $ windowed_wordcount --input --output counts -``` - -{{% /classwrapper %}} +{{< /highlight >}} -{{% classwrapper class="runner-apex" %}} - -``` +{{< highlight class="runner-apex" >}} This runner is not yet available for the Go SDK. -``` - -{{% /classwrapper %}} +{{< /highlight >}} -{{% classwrapper class="runner-flink-local" %}} - -``` +{{< highlight class="runner-flink-local" >}} This runner is not yet available for the Go SDK. 
-``` - -{{% /classwrapper %}} - -{{% classwrapper class="runner-flink-cluster" %}} +{{< /highlight >}} -``` +{{< highlight class="runner-flink-cluster" >}} This runner is not yet available for the Go SDK. -``` - -{{% /classwrapper %}} - -{{% classwrapper class="runner-spark" %}} +{{< /highlight >}} -``` +{{< highlight class="runner-spark" >}} This runner is not yet available for the Go SDK. -``` +{{< /highlight >}} -{{% /classwrapper %}} - -{{% classwrapper class="runner-dataflow" %}} - -``` +{{< highlight class="runner-dataflow" >}} $ go install github.com/apache/beam/sdks/go/examples/windowed_wordcount # As part of the initial setup, for non linux users - install package unix before run $ go get -u golang.org/x/sys/unix @@ -1655,33 +1173,19 @@ $ windowed_wordcount --input gs://dataflow-samples/shakespeare/kinglear.txt \ --temp_location gs:///tmp/ \ --staging_location gs:///binaries/ \ --worker_harness_container_image=apache-docker-beam-snapshots-docker.bintray.io/beam/go:20180515 -``` - -{{% /classwrapper %}} +{{< /highlight >}} -{{% classwrapper class="runner-samza-local" %}} - -``` +{{< highlight class="runner-samza-local" >}} This runner is not yet available for the Go SDK. -``` - -{{% /classwrapper %}} +{{< /highlight >}} -{{% classwrapper class="runner-nemo" %}} - -``` +{{< highlight class="runner-nemo" >}} This runner is not yet available for the Go SDK. -``` - -{{% /classwrapper %}} - -{{% classwrapper class="runner-jet" %}} +{{< /highlight >}} -``` +{{< highlight class="runner-jet" >}} This runner is not yet available for the Go SDK. -``` - -{{% /classwrapper %}} +{{< /highlight >}} To view the full code in Go, see **[windowed_wordcount.go](https://github.com/apache/beam/blob/master/sdks/go/examples/windowed_wordcount/windowed_wordcount.go).** @@ -1712,9 +1216,7 @@ Recall that the input for this example is a set of Shakespeare's texts, which is a finite set of data. Therefore, this example reads bounded data from a text file: -{{% classwrapper class="language-java" %}} - -```java +{{< highlight java >}} public static void main(String[] args) throws IOException { Options options = ... Pipeline pipeline = Pipeline.create(options); @@ -1722,13 +1224,9 @@ public static void main(String[] args) throws IOException { PCollection input = pipeline .apply(TextIO.read().from(options.getInputFile())) -``` +{{< /highlight >}} -{{% /classwrapper %}} - -{{% classwrapper class="language-py" %}} - -```py +{{< highlight py >}} def main(arvg=None): parser = argparse.ArgumentParser() parser.add_argument('--input-file', @@ -1738,13 +1236,9 @@ def main(arvg=None): pipeline_options = PipelineOptions(pipeline_args) p = beam.Pipeline(options=pipeline_options) lines = p | 'read' >> ReadFromText(known_args.input_file) -``` - -{{% /classwrapper %}} +{{< /highlight >}} -{{% classwrapper class="language-go" %}} - -```go +{{< highlight go >}} func main() { ... p := beam.NewPipeline() @@ -1753,9 +1247,7 @@ func main() { lines := textio.Read(s, *input) ... } -``` - -{{% /classwrapper %}} +{{< /highlight >}} ### Adding timestamps to data @@ -1770,29 +1262,17 @@ In this example the input is bounded. For the purpose of the example, the `DoFn` method named `AddTimestampsFn` (invoked by `ParDo`) will set a timestamp for each element in the `PCollection`. 
-{{% classwrapper class="language-java" %}} - -```java +{{< highlight java >}} .apply(ParDo.of(new AddTimestampFn(minTimestamp, maxTimestamp))); -``` - -{{% /classwrapper %}} - -{{% classwrapper class="language-py" %}} +{{< /highlight >}} -```py +{{< highlight py >}} beam.Map(AddTimestampFn(timestamp_seconds)) -``` - -{{% /classwrapper %}} - -{{% classwrapper class="language-go" %}} +{{< /highlight >}} -```go +{{< highlight go >}} timestampedLines := beam.ParDo(s, &addTimestampFn{Min: mtime.Now()}, lines) -``` - -{{% /classwrapper %}} +{{< /highlight >}} Below is the code for `AddTimestampFn`, a `DoFn` invoked by `ParDo`, that sets the data element of the timestamp given the element itself. For example, if the @@ -1802,9 +1282,7 @@ works of Shakespeare, so in this case we've made up random timestamps just to illustrate the concept. Each line of the input text will get a random associated timestamp sometime in a 2-hour period. -{{% classwrapper class="language-java" %}} - -```java +{{< highlight java >}} static class AddTimestampFn extends DoFn { private final Instant minTimestamp; private final Instant maxTimestamp; @@ -1827,13 +1305,9 @@ static class AddTimestampFn extends DoFn { c.outputWithTimestamp(c.element(), new Instant(randomTimestamp)); } } -``` - -{{% /classwrapper %}} +{{< /highlight >}} -{{% classwrapper class="language-py" %}} - -```py +{{< highlight py >}} class AddTimestampFn(beam.DoFn): def __init__(self, min_timestamp, max_timestamp): @@ -1844,13 +1318,9 @@ class AddTimestampFn(beam.DoFn): return window.TimestampedValue( element, random.randint(self.min_timestamp, self.max_timestamp)) -``` - -{{% /classwrapper %}} +{{< /highlight >}} -{{% classwrapper class="language-go" %}} - -```go +{{< highlight go >}} type addTimestampFn struct { Min beam.EventTime `json:"min"` } @@ -1859,14 +1329,12 @@ func (f *addTimestampFn) ProcessElement(x beam.X) (beam.EventTime, beam.X) { timestamp := f.Min.Add(time.Duration(rand.Int63n(2 * time.Hour.Nanoseconds()))) return timestamp, x } -``` - -{{% /classwrapper %}} +{{< /highlight >}} -{{% classwrapper class="language-go" wrapper="p" %}} +{{< paragraph class="language-go" >}} Note that the use of the `beam.X` "type variable" allows the transform to be used for any type. -{{% /classwrapper %}} +{{< /paragraph >}} ### Windowing @@ -1879,61 +1347,37 @@ The WindowedWordCount example applies fixed-time windowing, wherein each window represents a fixed time interval. The fixed window size for this example defaults to 1 minute (you can change this with a command-line option). -{{% classwrapper class="language-java" %}} - -```java +{{< highlight java >}} PCollection windowedWords = input .apply(Window.into( FixedWindows.of(Duration.standardMinutes(options.getWindowSize())))); -``` - -{{% /classwrapper %}} - -{{% classwrapper class="language-py" %}} +{{< /highlight >}} -```py +{{< highlight py >}} windowed_words = input | beam.WindowInto(window.FixedWindows(60 * window_size_minutes)) -``` +{{< /highlight >}} -{{% /classwrapper %}} - -{{% classwrapper class="language-go" %}} - -```go +{{< highlight go >}} windowedLines := beam.WindowInto(s, window.NewFixedWindows(time.Minute), timestampedLines) -``` - -{{% /classwrapper %}} +{{< /highlight >}} ### Reusing PTransforms over windowed PCollections You can reuse existing PTransforms that were created for manipulating simple PCollections over windowed PCollections as well. 
-{{% classwrapper class="language-java" %}} - -```java +{{< highlight java >}} PCollection> wordCounts = windowedWords.apply(new WordCount.CountWords()); -``` - -{{% /classwrapper %}} +{{< /highlight >}} -{{% classwrapper class="language-py" %}} - -```py +{{< highlight py >}} word_counts = windowed_words | CountWords() -``` - -{{% /classwrapper %}} - -{{% classwrapper class="language-go" %}} +{{< /highlight >}} -```go +{{< highlight go >}} counted := wordcount.CountWords(s, windowedLines) -``` - -{{% /classwrapper %}} +{{< /highlight >}} ## StreamingWordCount example @@ -1955,52 +1399,30 @@ frequency count of the words seen in each 15 second window. **To run this example in Python:** -{{% classwrapper class="runner-direct" %}} - -``` +{{< highlight class="runner-direct" >}} python -m apache_beam.examples.streaming_wordcount \ --input_topic "projects/YOUR_PUBSUB_PROJECT_NAME/topics/YOUR_INPUT_TOPIC" \ --output_topic "projects/YOUR_PUBSUB_PROJECT_NAME/topics/YOUR_OUTPUT_TOPIC" \ --streaming -``` +{{< /highlight >}} -{{% /classwrapper %}} - -{{% classwrapper class="runner-apex" %}} - -``` +{{< highlight class="runner-apex" >}} This runner is not yet available for the Python SDK. -``` - -{{% /classwrapper %}} +{{< /highlight >}} -{{% classwrapper class="runner-flink-local" %}} - -``` +{{< highlight class="runner-flink-local" >}} This runner is not yet available for the Python SDK. -``` - -{{% /classwrapper %}} +{{< /highlight >}} -{{% classwrapper class="runner-flink-cluster" %}} - -``` +{{< highlight class="runner-flink-cluster" >}} This runner is not yet available for the Python SDK. -``` - -{{% /classwrapper %}} - -{{% classwrapper class="runner-spark" %}} +{{< /highlight >}} -``` +{{< highlight class="runner-spark" >}} This runner is not yet available for the Python SDK. -``` - -{{% /classwrapper %}} - -{{% classwrapper class="runner-dataflow" %}} +{{< /highlight >}} -``` +{{< highlight class="runner-dataflow" >}} # As part of the initial setup, install Google Cloud Platform specific extra components. pip install apache-beam[gcp] python -m apache_beam.examples.streaming_wordcount \ @@ -2010,33 +1432,19 @@ python -m apache_beam.examples.streaming_wordcount \ --input_topic "projects/YOUR_PUBSUB_PROJECT_NAME/topics/YOUR_INPUT_TOPIC" \ --output_topic "projects/YOUR_PUBSUB_PROJECT_NAME/topics/YOUR_OUTPUT_TOPIC" \ --streaming -``` +{{< /highlight >}} -{{% /classwrapper %}} - -{{% classwrapper class="runner-samza-local" %}} - -``` +{{< highlight class="runner-samza-local" >}} This runner is not yet available for the Python SDK. -``` - -{{% /classwrapper %}} +{{< /highlight >}} -{{% classwrapper class="runner-nemo" %}} - -``` +{{< highlight class="runner-nemo" >}} This runner is not yet available for the Python SDK. -``` - -{{% /classwrapper %}} +{{< /highlight >}} -{{% classwrapper class="runner-jet" %}} - -``` +{{< highlight class="runner-jet" >}} This runner is not yet available for the Python SDK. -``` - -{{% /classwrapper %}} +{{< /highlight >}} To view the full code in Python, see **[streaming_wordcount.py](https://github.com/apache/beam/blob/master/sdks/python/apache_beam/examples/streaming_wordcount.py).** @@ -2053,34 +1461,22 @@ This example uses an unbounded dataset as input. The code reads Pub/Sub messages from a Pub/Sub subscription or topic using [`beam.io.ReadStringsFromPubSub`](https://beam.apache.org/releases/pydoc/{{< param release_latest >}}/apache_beam.io.gcp.pubsub.html#apache_beam.io.gcp.pubsub.ReadStringsFromPubSub). 
-{{% classwrapper class="language-java" %}} - -```java +{{< highlight java >}} // This example is not currently available for the Beam SDK for Java. -``` - -{{% /classwrapper %}} - -{{% classwrapper class="language-py" %}} +{{< /highlight >}} -```py +{{< highlight py >}} # Read from Pub/Sub into a PCollection. if known_args.input_subscription: lines = p | beam.io.ReadStringsFromPubSub( subscription=known_args.input_subscription) else: lines = p | beam.io.ReadStringsFromPubSub(topic=known_args.input_topic) -``` +{{< /highlight >}} -{{% /classwrapper %}} - -{{% classwrapper class="language-go" %}} - -```go +{{< highlight go >}} // This example is not currently available for the Beam SDK for Go. -``` - -{{% /classwrapper %}} +{{< /highlight >}} ### Writing unbounded results @@ -2093,30 +1489,18 @@ This example uses an unbounded `PCollection` and streams the results to Google Pub/Sub. The code formats the results and writes them to a Pub/Sub topic using [`beam.io.WriteStringsToPubSub`](https://beam.apache.org/releases/pydoc/{{< param release_latest >}}/apache_beam.io.gcp.pubsub.html#apache_beam.io.gcp.pubsub.WriteStringsToPubSub). -{{% classwrapper class="language-java" %}} - -```java +{{< highlight java >}} // This example is not currently available for the Beam SDK for Java. -``` - -{{% /classwrapper %}} +{{< /highlight >}} -{{% classwrapper class="language-py" %}} - -```py +{{< highlight py >}} # Write to Pub/Sub output | beam.io.WriteStringsToPubSub(known_args.output_topic) -``` - -{{% /classwrapper %}} +{{< /highlight >}} -{{% classwrapper class="language-go" %}} - -```go +{{< highlight go >}} // This example is not currently available for the Beam SDK for Go. -``` - -{{% /classwrapper %}} +{{< /highlight >}} ## Next Steps diff --git a/website/www/site/layouts/_default/baseof.html b/website/www/site/layouts/_default/baseof.html index a3819f97a00bd..f8fbfb6c04b1d 100644 --- a/website/www/site/layouts/_default/baseof.html +++ b/website/www/site/layouts/_default/baseof.html @@ -19,11 +19,11 @@ {{ partial "header.html" . }}
{{ block "hero-section" . }}{{ end }} - {{ block "pillars-section" .}}{{ end }} - {{ block "graphic-section" .}}{{ end }} - {{ block "logos-section" .}}{{ end }} - {{ block "cards-section" .}}{{ end }} - {{ block "ctas-section" .}}{{ end }} + {{ block "pillars-section" . }}{{ end }} + {{ block "graphic-section" . }}{{ end }} + {{ block "logos-section" . }}{{ end }} + {{ block "cards-section" . }}{{ end }} + {{ block "ctas-section" . }}{{ end }}
{{ partial "footer.html" . }} diff --git a/website/www/site/layouts/documentation/baseof.html b/website/www/site/layouts/documentation/baseof.html new file mode 100644 index 0000000000000..977cb1e1e32b8 --- /dev/null +++ b/website/www/site/layouts/documentation/baseof.html @@ -0,0 +1,41 @@ +{{/* + Licensed under the Apache License, Version 2.0 (the "License"); + you may not use this file except in compliance with the License. + You may obtain a copy of the License at + http://www.apache.org/licenses/LICENSE-2.0 + Unless required by applicable law or agreed to in writing, software + distributed under the License is distributed on an "AS IS" BASIS, + WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + See the License for the specific language governing permissions and + limitations under the License. See accompanying LICENSE file. + */}} + + + + + {{ partial "head.html" . }} + + + {{ partial "header.html" . }} +
+
+ + +
+ + + +
+ {{ .Content }} +
+
+ {{ partial "footer.html" . }} + + + \ No newline at end of file diff --git a/website/www/site/layouts/partials/footer.html b/website/www/site/layouts/partials/footer.html index 01fb410bc7c08..cc643116d11d2 100644 --- a/website/www/site/layouts/partials/footer.html +++ b/website/www/site/layouts/partials/footer.html @@ -31,9 +31,9 @@