
[BEAM-5] Add Flink Runner #12

Closed · wants to merge 149 commits
4e56e53
Initial commit
StephanEwen Jan 28, 2015
38c0c97
Update .gitignore
StephanEwen Jan 28, 2015
3dccdb6
Initial stubs
StephanEwen Feb 11, 2015
2309ab3
change to latest dataflow sdk version
mxm Feb 12, 2015
a286e1a
Parallel Do Function
mxm Feb 15, 2015
92dd104
GroupByKeyOnly
mxm Feb 15, 2015
e8e7099
change flink version to 0.8.0
mxm Feb 16, 2015
8000257
working WordCount and TFIDF
mxm Feb 16, 2015
ce0137b
Add WordCountITCase
aljoscha Feb 18, 2015
9a03b08
Rename Flink Runner to FlinkPipelineRunner, make it obey options
aljoscha Feb 19, 2015
a066ea8
Move type related stuff to types package. Add Javadoc
aljoscha Feb 19, 2015
5ba0c63
Rename utils to wrappers, add Javadoc
aljoscha Feb 19, 2015
0b721ea
Rename package example to examples
aljoscha Feb 19, 2015
4ea0276
Move Flink Functions into package functions, normalize naming scheme
aljoscha Feb 19, 2015
3ff4c88
Add comments for missing Flink Translators, some cleanup in translators
aljoscha Feb 19, 2015
9bae046
Add warnings for features not yet implemented
aljoscha Feb 19, 2015
e5ad983
Normalize DataSet names in FlinkTransformTranslators
aljoscha Feb 19, 2015
a89afc7
Fix bug in TextIO.Read translation, parallelism not correctly set
aljoscha Feb 19, 2015
c81c1fd
Add sideOutput support and ConsoleIO.Write transform
aljoscha Feb 19, 2015
6b2c725
Add translation for Combine.perKey
aljoscha Feb 19, 2015
b10377c
Add TfIdfITCase
aljoscha Feb 19, 2015
b79d28f
Add RemoveDuplicates ITCases
aljoscha Feb 19, 2015
e0a90e3
Change Create translation to serialize to byte arrays
aljoscha Feb 19, 2015
f195639
Add JoinExamplesITCase
aljoscha Feb 19, 2015
f7ab4c4
Add non-working (for-now) WikipediaTopSessionsITCase
aljoscha Feb 19, 2015
8a00a1f
Properly serialize/deserialize PipelineOptions from/to JSON
aljoscha Feb 20, 2015
fd8b78c
Add InputFormat wrapper for Source, along with ITCase
aljoscha Feb 20, 2015
769c9ef
Add Avro support with ITCase
aljoscha Feb 20, 2015
f510147
removed unused textPath variable
mxm Feb 19, 2015
32adfb5
run example with FlinkLocalRunner
mxm Feb 18, 2015
950925c
parse tree: print transform name
mxm Feb 19, 2015
bd9c1dd
add WordCountJoin test (coGroup)
mxm Feb 19, 2015
f60aaa9
add WordCountJoin3Test (three dimensional reduce)
mxm Feb 19, 2015
548386e
optimize CoGroupByKey by using Flink's coGroup
mxm Feb 20, 2015
5192fa2
use only one switch for translating composite transforms
mxm Feb 22, 2015
2f23d22
implement CoderComperator for Flink's coGroup
mxm Feb 23, 2015
ab9a4a6
coGroup optimization: disable until pull request is resolved
mxm Feb 23, 2015
b1c701b
Change version to 1-SNAPSHOT
aljoscha Feb 23, 2015
5e20012
Update Javadoc on CoderComparator and KvCoderComparator
aljoscha Feb 23, 2015
02e1b61
Change Serializers and Comparators to reuse byte arrays
aljoscha Feb 23, 2015
6632ee3
Add warning about KeyedState in DoFn, which is not supported
aljoscha Feb 24, 2015
7426272
Rename WordCount Example, Use Mavne Shade to bundle
aljoscha Feb 23, 2015
2e8d57d
Fix CoderComparators to correctly compare key lengths
aljoscha Feb 26, 2015
12fb543
Add dependency-reduced-pom.xml to .gitignore
aljoscha Feb 26, 2015
b7e4907
Correctly forward EOFExceptions from CoderTypeSerializer
aljoscha Feb 26, 2015
d5ac624
Fix mismatch between DataInput and InputStream in DataInputViewWrapper
aljoscha Feb 26, 2015
83177ad
Add Fake Grouping Key for KV GroupByKey translation
aljoscha Feb 26, 2015
61238d3
Change Flink Grouping Reducer to not materialize values
aljoscha Feb 26, 2015
7b50236
Add Partial Reduce Translation using MapPartition
aljoscha Feb 26, 2015
3a9115e
Move to Flink 0.9-SNAPSHOT add Combine based on MapPartition
aljoscha Feb 26, 2015
2e53371
Rename CoGroupKeyedListAggregator to match Naming Scheme
aljoscha Feb 27, 2015
cf50869
enable CoGroupByKey translator which was disabled due to pending pull…
mxm Mar 4, 2015
ccbf409
integrate now available Google Dataflow classes
mxm Mar 9, 2015
a7e6626
integrate partialGroupReduce operator to properly optimize Combine.pe…
mxm Mar 9, 2015
fba7b27
fix bug that would incorrectly match elements in the coGroup
mxm Mar 9, 2015
2a6b19b
implement a rudimentary normalized key comparator
mxm Mar 10, 2015
8820c09
Optimize CoderSerializer and CoderComparator
aljoscha Mar 2, 2015
555f06f
fix buffer overflow
mxm Mar 11, 2015
e0481e0
add README file
mxm Mar 11, 2015
254cf2a
add license headers
mxm Mar 11, 2015
caa2baa
change name of Flink's new GroupCombine operator
mxm Mar 13, 2015
6af613d
properly integrate Flink's new GroupCombine operator
mxm Mar 23, 2015
d6a728d
README: remove install Flink part due to availability of maven depend…
mxm Mar 23, 2015
40f0222
fix null value issue in Create translator
mxm Mar 23, 2015
d470f96
KvCoderComperator: set the offset correctly
mxm Mar 23, 2015
11e08c0
KvCoderComperator: set the correct context for the encoding
mxm Mar 23, 2015
402d0ee
Maven: remove shade plugin
mxm Mar 23, 2015
eea0dde
improve execution of WordCount example
mxm Mar 24, 2015
dee9923
README: add additional note about cluster setup
mxm Mar 24, 2015
8a9eb97
Flink master changes: rename FlatCombineFunction to GroupCombineFunction
mxm Mar 25, 2015
6d25f9c
adapt Flink Dataflow to run with the latest DataflowJavaSDK
mxm Apr 7, 2015
ed9b10e
instead of using LATEST, pin to a version of the DataflowJavaSDK
mxm Apr 7, 2015
0d12419
Pinned version for SDK too
mbalassi Apr 9, 2015
a932970
Updated Aggregator wrappers to match new Flink Accumulator behaviour
mbalassi Apr 20, 2015
f0097cc
adapt to the latest Flink master
mxm Jun 4, 2015
1f4d2f5
[streaming] add a hint to the streaming branch in the readme
mxm Jun 4, 2015
19ddb2d
add a Travis config file to enable continuous integration
mxm Jun 4, 2015
0a9a776
travis: modify config file to adjust jdk version and test command
mxm Jun 4, 2015
79bf260
travis: remove openjdk8 since it is not supported
mxm Jun 4, 2015
b4092a7
bump Flink version to 0.9.1
mxm Oct 1, 2015
fbbba86
change version number to 0.1
mxm Oct 9, 2015
fc2a25a
remove unused serializer
mxm Oct 11, 2015
b8dcf47
update to Dataflow SDK 1.0.0
mxm Oct 5, 2015
a263080
fix FlinkRunnerResult for new PipelineResult
mxm Oct 5, 2015
72cb4f1
fix UnionCoder
mxm Oct 5, 2015
defc7c9
adapt to 1.0.0 changes (handle AppliedPTransform)
mxm Oct 8, 2015
893ffde
fix renaming of ReadSource to Read.Bounded
mxm Oct 8, 2015
d467af4
adapt to the API changes: get types from context
mxm Oct 8, 2015
abcc06c
adapt to new windowing semantics
mxm Oct 8, 2015
df3a347
update ConsoleIO
mxm Oct 8, 2015
68465f5
remove useless job
mxm Oct 14, 2015
3b4ef97
fix FlinkPipelineOptions
mxm Oct 15, 2015
b2c85e6
fix aggregators
mxm Oct 15, 2015
ad30313
fix accumulators
mxm Oct 15, 2015
4a7de62
unify transform input and output getters/setters
mxm Oct 15, 2015
cdae3d0
re-enable and fix Read.Bounded
mxm Oct 15, 2015
758da40
fix AvroITCase
mxm Oct 15, 2015
add5f3e
re-enable TfIdfITCase
mxm Oct 15, 2015
b43dc70
update UnionCoder
mxm Oct 16, 2015
9ff4bab
properly translate nested composite transforms
mxm Oct 16, 2015
4aa4081
translate composite transform GroupByKey
mxm Oct 16, 2015
d1d7407
fix WordCount IT cases
mxm Oct 16, 2015
601d1b6
fix WordCount example program
mxm Oct 16, 2015
5874990
add missing interface methods for DoFn wrapper
mxm Oct 16, 2015
2a3612d
fix TfIdf example
mxm Oct 16, 2015
8d9ca54
add SideInputITCase
mxm Oct 16, 2015
edaa34b
add MayBeEmptyITCase
mxm Oct 16, 2015
dc9df38
fix WordCount comment
mxm Oct 16, 2015
1d8462a
use correct output type
mxm Oct 16, 2015
2dbfd89
disable Wikipedia test (for now)
mxm Oct 16, 2015
c9404ea
move these examples to test and convert them to test cases eventually
mxm Oct 16, 2015
aa1a933
move JoinExamples to util
mxm Nov 4, 2015
d5881a5
created it case Flattenize
mxm Nov 4, 2015
189ae44
optimize imports
mxm Nov 4, 2015
c36ca14
create ParDoMultiOutput IT case
mxm Nov 4, 2015
2e563f0
port to Flink 0.10.0
mxm Nov 16, 2015
35f104f
Enforce Java >= 1.7, Optimize Imports, Java 7 code compatibility
smarthi Dec 23, 2015
4f13955
make Avro type extraction more explicit
mxm Dec 24, 2015
03dfb1d
[maven] correct license and formatting
mxm Jan 15, 2016
3947750
[tests] refactor ReadSourceITCase
mxm Jan 18, 2016
49d295b
[runner] add translator for Write.Bound for custom sinks
mxm Jan 18, 2016
0d68bb2
[sink] generate unique id for writer initialization
mxm Jan 19, 2016
9d9ccf5
[readme] add a section on how to submit cluster programs
mxm Jan 19, 2016
aac96d7
[readme] add hint on how to submit jar to cluster
mxm Jan 19, 2016
d4a651a
[runner] add streaming support with checkpointing
kl0u Dec 9, 2015
9bfdea2
[readme] add hint that streaming support is available
mxm Jan 20, 2016
8057fc2
[readme] update to reflect the current state
mxm Feb 11, 2016
9e98022
[cleanup] various small improvements
smarthi Feb 11, 2016
b168010
[cleanup] remove obsolete code
mxm Feb 17, 2016
0628bf6
[tests] add streaming mode to TestPipeline
mxm Feb 22, 2016
620e13b
[runner] add Create transform
mxm Feb 22, 2016
7283d7a
[tests] integrate Wikipedia session test
mxm Feb 23, 2016
46c5158
Rearranging the code and renaming certain classes.
kl0u Feb 29, 2016
0cb6deb
Adds javadocs.
kl0u Feb 29, 2016
13e0276
Fixes Void handling
kl0u Feb 29, 2016
3c3a661
Track dataflow 1.0 -> 1.5 changes
Feb 24, 2016
cb3d030
[runner] some small fixes for 1.5
mxm Mar 2, 2016
699562d
Fixes the GroupAlsoByWindowTest.
kl0u Mar 2, 2016
e0b6782
[travis] install snapshot version of SDK before running CI
mxm Mar 3, 2016
a587617
[tests] suppress unnecessary log output
mxm Mar 3, 2016
07ab111
[maven] add project for Runners
mxm Mar 1, 2016
a1bd52b
[flink] adjust root pom.xml to Beam
mxm Mar 4, 2016
0bb73db
[flink] convert tabs to 2 spaces
mxm Mar 2, 2016
6c29910
[flink] change package namespace
mxm Mar 2, 2016
25f44ff
[flink] adjust directories according to package name
mxm Mar 2, 2016
071347e
[flink] update license headers
mxm Mar 2, 2016
4de42e4
[flink] update README
mxm Mar 2, 2016
6747817
[flink] [cleanup] delete .gitignore
mxm Mar 4, 2016
460b58f
[travis] go through all Maven phases (except deploy)
mxm Mar 4, 2016
2 changes: 1 addition & 1 deletion .travis.yml
@@ -31,5 +31,5 @@ install:

 script:
   - travis_retry mvn versions:set -DnewVersion=manual_build
-  - travis_retry mvn $MAVEN_OVERRIDE install -U
+  - travis_retry mvn $MAVEN_OVERRIDE verify -U
   - travis_retry travis/test_wordcount.sh
1 change: 1 addition & 0 deletions pom.xml
@@ -91,6 +91,7 @@
   <packaging>pom</packaging>
   <modules>
     <module>sdk</module>
+    <module>runners</module>
     <module>examples</module>
     <module>maven-archetypes/starter</module>
     <module>maven-archetypes/examples</module>
202 changes: 202 additions & 0 deletions runners/flink/README.md
@@ -0,0 +1,202 @@
Flink Beam Runner (Flink-Runner)
--------------------------------

Flink-Runner is a Runner for Apache Beam which enables you to
run Beam dataflows with Flink. It integrates seamlessly with the Beam
API, allowing you to execute Apache Beam programs in streaming or batch mode.

## Streaming

### Full Beam Windowing and Triggering Semantics

The Flink Beam Runner supports *Event Time*, allowing you to analyze data with respect to its
associated timestamp. It handles out-of-order and late-arriving elements. You may leverage the full
power of the Beam windowing semantics, such as *time-based*, *sliding*, *tumbling*, or *count*
windows. You may also build *session* windows, which allow you to keep track of events associated
with each other.
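To illustrate, the windowing semantics above are expressed through the SDK's `Window` transform. The following is a minimal sketch, assuming a `PCollection<String> words` built elsewhere in your pipeline and the Dataflow SDK windowing classes; the variable names are illustrative:

```java
import org.joda.time.Duration;

import com.google.cloud.dataflow.sdk.transforms.windowing.FixedWindows;
import com.google.cloud.dataflow.sdk.transforms.windowing.Sessions;
import com.google.cloud.dataflow.sdk.transforms.windowing.SlidingWindows;
import com.google.cloud.dataflow.sdk.transforms.windowing.Window;
import com.google.cloud.dataflow.sdk.values.PCollection;

// Tumbling windows of one minute, based on each element's event-time timestamp:
PCollection<String> windowed =
    words.apply(Window.<String>into(FixedWindows.of(Duration.standardMinutes(1))));

// Sliding windows: one-minute windows, starting every ten seconds:
words.apply(Window.<String>into(
    SlidingWindows.of(Duration.standardMinutes(1)).every(Duration.standardSeconds(10))));

// Session windows: group elements separated by gaps of at least ten minutes:
words.apply(Window.<String>into(Sessions.withGapDuration(Duration.standardMinutes(10))));
```

The same pipeline code runs unchanged whether the Flink runner executes it in batch or streaming mode.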

### Fault-Tolerance

The program's state is persisted by Apache Flink. You may re-run and resume your program after a
failure or if you decide to continue the computation at a later time.

### Sources and Sinks

Build your own data ingestion or output using the source/sink interface. Re-use Flink's sources
and sinks, or use the provided support for Apache Kafka.

### Seamless integration

To execute a Beam program in streaming mode, just enable streaming in the `PipelineOptions`:

options.setStreaming(true);

That's it. If you prefer batch execution, simply disable streaming mode.

## Batch

### Batch optimization

Flink gives you out-of-core algorithms which operate on its managed memory to perform sorting,
caching, and hash table operations. We have optimized operations like CoGroup to use Flink's
optimized out-of-core implementation.

### Fault-Tolerance

We guarantee job-level fault-tolerance which gracefully restarts failed batch jobs.

### Sources and Sinks

Build your own data ingestion or output using the source/sink interface, or re-use Flink's sources
and sinks.

## Features

The Flink Beam Runner maintains as much compatibility with the Beam API as possible. We
support transformations on data such as:

- Grouping
- Windowing
- ParDo
- CoGroup
- Flatten
- Combine
- Side inputs/outputs
- Encoding
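
As a sketch of what these transformations look like in practice, the following hypothetical fragment applies a `ParDo` using the Dataflow SDK's `DoFn` API; the class name, regular expression, and the assumed `PCollection<String> lines` input are illustrative, not part of the runner itself:

```java
import com.google.cloud.dataflow.sdk.transforms.DoFn;
import com.google.cloud.dataflow.sdk.transforms.ParDo;
import com.google.cloud.dataflow.sdk.values.PCollection;

// Tokenize lines into words; the Flink runner executes this
// as a FlatMap-style operation on a Flink DataSet or DataStream.
static class ExtractWordsFn extends DoFn<String, String> {
  @Override
  public void processElement(ProcessContext c) {
    for (String word : c.element().split("[^a-zA-Z']+")) {
      if (!word.isEmpty()) {
        c.output(word);
      }
    }
  }
}

// Applied to a PCollection<String> of input lines:
PCollection<String> words = lines.apply(ParDo.of(new ExtractWordsFn()));
```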

# Getting Started

To get started using the Flink Runner, we first need to install the latest version.

## Install Flink-Runner ##

To retrieve the latest version of Flink-Runner, run the following command:

git clone https://github.com/apache/incubator-beam

Then switch to the newly created directory and run Maven to build the Beam runner:

cd incubator-beam
mvn clean install -DskipTests

Flink-Runner is now installed in your local Maven repository.

## Executing an example

Next, let's run the classic WordCount example. It is semantically identical to
the example provided with Apache Beam, only this time we choose the
`FlinkPipelineRunner` to execute the WordCount on top of Flink.

Here's an excerpt from the WordCount class file:

```java
Options options = PipelineOptionsFactory.fromArgs(args).as(Options.class);
// yes, we want to run WordCount with Flink
options.setRunner(FlinkPipelineRunner.class);

Pipeline p = Pipeline.create(options);

p.apply(TextIO.Read.named("ReadLines").from(options.getInput()))
.apply(new CountWords())
.apply(TextIO.Write.named("WriteCounts")
.to(options.getOutput())
.withNumShards(options.getNumShards()));

p.run();
```

To execute the example, let's first get some sample data:

curl http://www.gutenberg.org/cache/epub/1128/pg1128.txt > kinglear.txt

Then let's run the included WordCount locally on your machine:

mvn exec:exec -Dinput=kinglear.txt -Doutput=wordcounts.txt

Congratulations, you have run your first Apache Beam program on top of Apache Flink!


# Running Beam programs on a Flink cluster

You can run your Beam program on an Apache Flink cluster. Please start off by creating a new
Maven project.

mvn archetype:generate -DgroupId=com.mycompany.beam -DartifactId=beam-test \
-DarchetypeArtifactId=maven-archetype-quickstart -DinteractiveMode=false

The contents of the root `pom.xml` should be slightly changed afterwards (explanation below):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
<modelVersion>4.0.0</modelVersion>

<groupId>com.mycompany.beam</groupId>
<artifactId>beam-test</artifactId>
<version>1.0</version>

<dependencies>
<dependency>
<groupId>org.apache.beam</groupId>
<artifactId>flink-runner</artifactId>
<version>0.2</version>
</dependency>
</dependencies>

<build>
<plugins>
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-shade-plugin</artifactId>
<version>2.4.1</version>
<executions>
<execution>
<phase>package</phase>
<goals>
<goal>shade</goal>
</goals>
<configuration>
<transformers>
<transformer implementation="org.apache.maven.plugins.shade.resource.ManifestResourceTransformer">
<mainClass>WordCount</mainClass>
</transformer>
</transformers>
<artifactSet>
<excludes>
<exclude>org.apache.flink:*</exclude>
</excludes>
</artifactSet>
</configuration>
</execution>
</executions>
</plugin>

</plugins>

</build>

</project>
```

The following changes have been made:

1. The Flink Beam Runner was added as a dependency.

2. The Maven Shade plugin was added to build a fat jar.

A fat jar is necessary if you want to submit your Beam code to a Flink cluster. The fat jar
includes not only your program code but also the Beam code that is required at runtime. Note that
this step is necessary because the Beam Runner is not part of Flink.

You can then build the jar using `mvn clean package`. Then submit the fat jar in the `target`
folder to the Flink cluster using Flink's command-line utility like so:

./bin/flink run /path/to/fat.jar


# More

For more information, please visit the [Apache Flink website](http://flink.apache.org) or contact
the [mailing lists](http://flink.apache.org/community.html#mailing-lists).